
Part 5 - The 80/20 Rule for Scraping Scalability: Why Most Websites Are Predictable (And Why That's Good)

Can you build one scraper that works on 500 different websites? Sort of. Here's the truth: 80% of mid-market PE sites follow the same 2-3 structural patterns. The other 20% will consume 80% of your engineering time. The key isn't building a scraper that handles everything—it's knowing which sites to skip.

After scraping dozens of private equity websites, I've learned that "universal" doesn't mean "works on literally everything." It means "works on everything that's built the same way." And thankfully, most PE firms are building their sites the same way.

Why Structural Similarity Actually Matters

Here's the dirty secret of the PE web design world: most firms are using 3-4 common page templates. Whether it's a $500M fund or a $5B fund, they're following the same playbook. This isn't laziness—it's professionalism. These firms aren't trying to reinvent web design. They're trying to look trustworthy.

URL patterns are predictable:

  • /portfolio or /investments for the main listing
  • /portfolio/company-name for individual companies
  • Sometimes /companies or /our-investments if they're feeling creative
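
In code, spotting these conventions takes only a handful of regular expressions. A minimal sketch; the path list and patterns are illustrative, not exhaustive:

```python
import re
from urllib.parse import urlparse

# Illustrative path patterns based on the conventions above -- tune against your own corpus.
LISTING_RE = re.compile(r"/(portfolio|investments|companies|our-investments)/?$", re.I)
DETAIL_RE = re.compile(r"/(portfolio|investments|companies)/[\w-]+/?$", re.I)

def classify_url(url: str) -> str:
    """Rough guess at page type from the URL path alone."""
    path = urlparse(url).path
    if DETAIL_RE.search(path):
        return "detail"
    if LISTING_RE.search(path):
        return "listing"
    return "unknown"
```

Something like classify_url("https://examplefund.com/portfolio/acme-holdings") comes back as "detail" without ever fetching the page.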

Data hierarchy is consistent:

  • Company name (always front and center)
  • Industry/sector (usually a tag or label)
  • Investment date or year (when you're lucky)
  • Brief description (marketing fluff, occasionally useful)
  • Team members involved (hit or miss)

Layout conventions never change:

  • Grid cards (the overwhelming favorite)
  • List items with hover states
  • Tables (the old guard still uses these)
  • Occasionally tabs or filters by sector
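
These layouts are also detectable in code. A rough sketch, assuming BeautifulSoup and treating "many elements sharing one class" as a proxy for grid cards (the threshold of 8 is arbitrary):

```python
from collections import Counter
from bs4 import BeautifulSoup

def guess_layout(html: str) -> str:
    """Heuristic guess at the portfolio layout: table, repeated cards/list items, or unknown."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.find("table") and len(soup.select("table tr")) > 5:
        return "table"
    # Many elements sharing one class name usually means repeated grid cards or list items.
    class_counts = Counter(tuple(tag.get("class", [])) for tag in soup.select("div, li, article"))
    top = class_counts.most_common(1)
    if top and top[0][0] and top[0][1] >= 8:
        return "cards-or-list"
    return "unknown"
```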

This standardization is exactly what makes generalized scraping possible. When websites follow conventions, patterns emerge. When patterns emerge, automation works.

I'm currently hitting 50-60% success rates across mid-market PE firms without writing a single line of site-specific code. That's not magic—it's pattern recognition at scale.

Where Everything Falls Apart

Of course, there are always outliers. Here's where my scraper consistently struggles:

Custom frameworks that get weird: Some firms hire boutique agencies that build entirely custom HTML/CSS structures. No standard classes, no semantic HTML—just divs wrapped in divs wrapped in more divs with class names like wrapper-container-box-3.

JavaScript-heavy SPAs: Single Page Applications that load everything dynamically are my nemesis. If the page shows a loading spinner for 5 seconds before rendering content client-side with React or Vue, you need headless browser rendering. That costs more API credits and adds complexity.
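
One cheap check before paying for rendering is to look at how much text the raw HTML actually contains. A sketch; the markers and character threshold are assumptions, not a guarantee:

```python
from bs4 import BeautifulSoup

# Common SPA mount points and framework markers -- illustrative, not exhaustive.
SPA_MARKERS = ('id="root"', 'id="app"', 'data-reactroot', 'ng-app')

def needs_js_rendering(raw_html: str, min_text_chars: int = 500) -> bool:
    """Guess whether the raw HTML is a client-side shell that needs headless rendering."""
    text = BeautifulSoup(raw_html, "html.parser").get_text(strip=True)
    if len(text) >= min_text_chars:
        return False  # enough server-rendered content to work with
    return any(marker in raw_html for marker in SPA_MARKERS)
```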

Gated content: Some firms put portfolios behind login walls or require form submissions. Unless you want to get into credential management (you don't), these are dead ends.

The true mavericks: Every once in a while, you hit a site that's just... different. PDFs instead of pages. Portfolio companies only mentioned in news articles. Embedded iframes. These are the 1% that make you question your life choices.

The Scalability Decision Tree

START BROAD:
- Use AI extraction with generic schemas
- 60-80% of sites work immediately

LAYER IN HEURISTICS:
- Pattern matching for common structures (grids, tables, lists)
- Adaptive depth based on link density
- URL categorization (listing vs detail pages)

ADD FALLBACKS PROGRESSIVELY:
- Markdown + regex for AI failures
- Link harvesting for missed pages
- Minimal capture as last resort

KNOW WHEN TO STOP:
- After 3 failed extraction attempts → flag for manual review
- If a site needs >2 hours custom work → not worth it
- Focus on the 80% that work, skip the 20% that don't

ONLY add site-specific code for:
- High-value targets (large funds, strategic data)
- Repeatable patterns you'll see again
- Modular overrides (config files, not inline conditionals)
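
To make that last point concrete, a modular override can be nothing more than a per-domain settings map that the pipeline merges over its defaults. A sketch; the domains and keys here are hypothetical:

```python
# overrides.py -- hypothetical per-domain settings kept out of the main pipeline.
DEFAULTS = {
    "render_js": False,
    "max_depth": 2,
    "listing_paths": ["/portfolio", "/investments"],
}

DOMAIN_OVERRIDES = {
    # Only high-value, repeatedly-failing domains earn an entry here.
    "examplefund.com": {"render_js": True, "max_depth": 3},
    "anotherfund.com": {"listing_paths": ["/our-companies"]},
}

def settings_for(domain: str) -> dict:
    """Merge defaults with any domain-specific overrides."""
    return {**DEFAULTS, **DOMAIN_OVERRIDES.get(domain, {})}
```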

Scalable Techniques That Keep Working

The magic of handling structural similarity without hardcoding per-site rules comes down to a few key strategies:

AI extraction schemas tuned to PE data: Firecrawl's extraction API works because I'm asking for the same fields every time—company name, industry, investment role, ownership, year, location, website. The AI learns to recognize these patterns regardless of the underlying HTML structure.
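
The schema itself is nothing exotic. Here's a minimal sketch of that field list as a Pydantic model, which can be turned into a JSON schema for whatever extraction endpoint you're calling (assumes Pydantic v2; the field names are illustrative, not a fixed spec):

```python
from typing import Optional
from pydantic import BaseModel

class PortfolioCompany(BaseModel):
    """The same fields requested on every site, regardless of the underlying HTML."""
    company_name: str
    industry: Optional[str] = None
    investment_role: Optional[str] = None   # e.g. majority / minority
    ownership_status: Optional[str] = None  # e.g. current vs. exited
    investment_year: Optional[int] = None
    location: Optional[str] = None
    website: Optional[str] = None

# A JSON-schema view of the model can be handed to an LLM extraction API.
schema = PortfolioCompany.model_json_schema()
```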

Adaptive heuristics that detect page types:

  • If the main portfolio page lists 10+ companies, I know it's a listing page
  • If a URL contains /portfolio/[slug], it's probably a detail page
  • Link density on a page tells me if I need to crawl deeper or stop
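
Those heuristics fit in a few lines. A sketch, assuming BeautifulSoup; the thresholds (10 companies, 40 internal links) are illustrative:

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def looks_like_listing(html: str, min_items: int = 10) -> bool:
    """A page linking out to many portfolio-style child URLs is probably a listing page."""
    soup = BeautifulSoup(html, "html.parser")
    hrefs = [a.get("href", "") for a in soup.find_all("a")]
    child_links = {h for h in hrefs if "/portfolio/" in h or "/investments/" in h}
    return len(child_links) >= min_items

def should_crawl_deeper(html: str, base_url: str, max_internal_links: int = 40) -> bool:
    """High internal-link density suggests a hub page worth crawling further; low density means stop."""
    soup = BeautifulSoup(html, "html.parser")
    domain = urlparse(base_url).netloc
    internal = {
        urljoin(base_url, a["href"])
        for a in soup.find_all("a", href=True)
        if urlparse(urljoin(base_url, a["href"])).netloc == domain
    }
    return len(internal) > max_internal_links
```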

Multi-tier fallbacks that degrade gracefully:

  1. AI extraction (works 60% of the time)
  2. Markdown + regex patterns (adds another 25%)
  3. Link harvesting for missed pages (adds another 10%)
  4. Manual overrides for stubborn cases (the final 5%)
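
Wired together, the tiers are just an ordered list of extractors tried in sequence. A sketch; the tier functions named in the comment are placeholders for the strategies above, not real library calls:

```python
from typing import Any, Callable, Optional

Extractor = Callable[[Any], Optional[dict]]

def extract_with_fallbacks(page: Any, tiers: list[tuple[str, Extractor]]) -> dict:
    """Try each extraction tier in order and record which one succeeded."""
    for name, extractor in tiers:
        try:
            result = extractor(page)
        except Exception:
            result = None
        if result:
            return {"tier": name, "data": result}
    return {"tier": "failed", "data": None}

# Hypothetical wiring -- each callable implements one tier described above:
# tiers = [("ai_extraction", ai_extract), ("markdown_regex", regex_extract),
#          ("link_harvest", harvest_links), ("minimal_capture", minimal_capture)]
```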

Per-domain difficulty scoring: I track which domains consistently fail extraction. After 3 failed attempts, a domain gets flagged for manual review. This prevents wasting credits on impossible targets and helps me focus engineering effort where it actually matters.
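
The bookkeeping for that is tiny. A sketch of the failure counter; the file name and format are hypothetical:

```python
import json
from collections import defaultdict
from pathlib import Path

FAILURE_LOG = Path("domain_failures.json")  # hypothetical location
MAX_ATTEMPTS = 3

def record_failure(domain: str) -> bool:
    """Bump a domain's failure count; return True once it should be flagged for manual review."""
    counts: defaultdict[str, int] = defaultdict(int)
    if FAILURE_LOG.exists():
        counts.update(json.loads(FAILURE_LOG.read_text()))
    counts[domain] += 1
    FAILURE_LOG.write_text(json.dumps(dict(counts), indent=2))
    return counts[domain] >= MAX_ATTEMPTS
```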

The Practical Approach: Start Narrow, Expand Smart

Here's the strategy that actually works:

Start with what Firecrawl understands best. I didn't strategically pick an archetype—I just started scraping sites and noticed patterns in what worked. The sites that succeeded on the first try had clean HTML, semantic structure, and clear content hierarchy. The ones that failed were JavaScript-heavy SPAs, custom CSS frameworks, or chaotic HTML.

Double down on what's working. Once I found firms where extraction worked cleanly (grid layouts, consistent URL patterns, standard portfolio structures), I looked for more sites that looked similar. Not by checking their tech stack—just by eyeballing the page layout and URL structure.

Know when to stop. When I hit a site that requires completely different extraction logic, I don't try to force it. I add it to a "requires custom handler" list and move on. That list currently sits at about 40% of the sites I've tested—and I'm fine with that.

Key Takeaway: The 80/20 Reality

Here's the honest math: 80% of mid-market PE websites follow 2-3 structural patterns, but following a pattern isn't the same as extracting cleanly. In practice, my "universal" scraper handles about half of them without breaking a sweat.

The remaining half? Some can be handled with minor tweaks—adjusting timeout settings, enabling JavaScript rendering, or refining the extraction schema. Others need custom code or just aren't worth the effort.

When I add site-specific overrides:

  • Only after 3+ failed extraction attempts
  • Only for high-value targets (large funds, strategically important data)
  • Overrides are modular—separate config files, not inline conditionals

The goal isn't to scrape every site perfectly. The goal is to scrape the majority of sites well enough, and know when to walk away from the rest.


Next time: I'll show you the decision framework I used to kill a working project—and how to know when something you built isn't worth launching, no matter how well it functions.

Quick exercise: Calculate your 80/20 split. If you need data from 50 websites, how many would fall into the "easy" category vs "nightmare" category? Reply with your numbers—I'm curious how this ratio varies across industries.