
Part 2 - Building a Scraper That Doesn't Break: The Progressive Fallback System


Most scrapers are brittle—they work perfectly on one site and break completely on the next. Mine assumes every site is different and has a backup plan for the backup plan. Here's why that matters.

I learned this the hard way. My first scraper worked beautifully on the test site I built it for. Then I pointed it at a second private equity website and got... nothing. No data. Just empty fields and error logs.

The problem wasn't my code. It was my assumption that all websites would play nice. They don't.

Every VC firm structures their portfolio differently. Some use clean HTML tables. Others render everything in JavaScript. A few hide company data behind tabs that only load on click. And one firm I tested literally used a PDF embedded in an iframe.

If your scraper relies on a single extraction method, it will fail. Not sometimes—always. The question isn't whether it will break, but how gracefully it fails when it does.

The Three-Phase Architecture That Actually Works

After burning through hundreds of Firecrawl credits testing dozens of investor sites, I landed on a three-phase approach that handles variability without requiring custom code for each site.

Phase 1: Discovery (Find the Investment Pages)

You can't extract data you can't find. The first challenge is identifying which pages on a site actually contain portfolio information.

I use adaptive crawl depth based on link density. If the main /portfolio page has 200+ links, that's probably a listing page showing all companies—I keep the crawl shallow (2-3 levels deep). If it's sparse with only navigation links, the real data might be buried deeper (4-5 levels).
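As a rough sketch of that heuristic (the thresholds are illustrative, and `link_count` is assumed to come from whatever counts same-domain links on the page):

```python
def choose_crawl_depth(link_count: int) -> int:
    """Pick a crawl depth from link density; thresholds are illustrative."""
    if link_count >= 200:
        # A link-heavy page is probably the full portfolio listing:
        # stay shallow so the crawl doesn't balloon.
        return 2
    if link_count >= 50:
        # A moderately linked page may split companies across sub-sections.
        return 3
    # Sparse, navigation-only page: the real data is likely buried deeper.
    return 5
```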

Then I categorize URLs into two types: listing pages (like /investments or /portfolio showing multiple companies) and detail pages (like /portfolio/stripe with one company's information). This matters because I extract data differently depending on which type I'm looking at.
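A minimal sketch of that categorization, assuming the keyword patterns below (the real hint list would be tuned per site, and the example URLs are hypothetical):

```python
import re

# Illustrative URL patterns; a real list would be tuned per site.
LISTING_PATTERN = re.compile(r"/(portfolio|investments|companies)/?$", re.IGNORECASE)
DETAIL_PATTERN = re.compile(r"/(portfolio|investments|companies)/[^/]+/?$", re.IGNORECASE)

def categorize_url(url: str) -> str:
    """Classify a URL as a listing page, a detail page, or something else."""
    path = url.split("?", 1)[0]  # ignore query strings like ?page=2
    if LISTING_PATTERN.search(path):
        return "listing"   # e.g. https://fund.example/portfolio
    if DETAIL_PATTERN.search(path):
        return "detail"    # e.g. https://fund.example/portfolio/stripe
    return "other"
```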

Why pagination is tricky: Some sites use ?page=2 in the URL. Others use infinite scroll that loads content dynamically. A few just don't paginate at all and expect you to scroll through 300 company logos. My scraper looks for "next," "more," or numeric links and follows them—but I set a hard limit of 50 pages because I once accidentally crawled 3,000 pages of a firm's blog archive.
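Here's a sketch of that pagination walk. `fetch_links(url)` is a hypothetical stand-in for whatever returns (anchor text, href) pairs for a page:

```python
from urllib.parse import urljoin

MAX_PAGES = 50  # hard cap so one blog archive can't eat the crawl budget

def follow_pagination(start_url: str, fetch_links) -> list[str]:
    """Follow "next"/"more"/numeric pagination links, capped at MAX_PAGES."""
    seen, queue, pages = {start_url}, [start_url], []
    while queue and len(pages) < MAX_PAGES:
        url = queue.pop(0)
        pages.append(url)
        for text, href in fetch_links(url):
            label = text.strip().lower()
            if label in {"next", "more"} or label.isdigit():
                absolute = urljoin(url, href)  # resolve relative hrefs
                if absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
    return pages
```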

Phase 2: Extraction (Turn HTML Into Structured Data)

This is where hobby projects die. You can crawl all day, but if you can't extract clean data, you've just collected garbage.

I use Firecrawl's AI extraction with two different extraction templates depending on page type. For listing pages, I extract an array of companies with name, industry, stage, and URL. For detail pages, I go deeper: company name, CEO, industry, investment role, founding year, and description.
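The two templates look roughly like the JSON-style schemas below. The field names mirror the ones just described; the exact schema syntax Firecrawl expects isn't reproduced here, so treat this as a sketch:

```python
# Rough JSON-style schemas for the two page types (illustrative).
LISTING_SCHEMA = {
    "type": "object",
    "properties": {
        "companies": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "industry": {"type": "string"},
                    "stage": {"type": "string"},
                    "url": {"type": "string"},
                },
            },
        }
    },
}

DETAIL_SCHEMA = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "ceo": {"type": "string"},
        "industry": {"type": "string"},
        "investment_role": {"type": "string"},
        "founding_year": {"type": "integer"},
        "description": {"type": "string"},
    },
}
```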

The AI reads the page like a human would and maps it to these fields. When it works—which is about 60% of the time—it's genuinely magical. You point it at any reasonably structured page and get back clean JSON.

When it doesn't work, that's when the fun begins.

Phase 3: The Multi-Tier Fallback Pipeline

Here's the key insight that separates working scrapers from broken ones: never rely on a single extraction method.

My fallback system has four layers:

Tier 1: AI Template Extraction (60% success rate)
The primary method, using Firecrawl's LLM to parse structured data. Fast and accurate when it works, but it fails on unusual layouts or heavily JavaScript-rendered pages.

Tier 2: Markdown + Text Pattern Matching (adds 25%)
Strip the page to markdown and look for patterns. Names followed by titles like "CEO" or "Founder." Industries in headers. Investment context like "led Series B" or "participated in the round."

Is this perfect? No. Does it capture most of what the AI missed? Yes.
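For a sense of what Tier 2 looks like, here are two illustrative patterns over the markdown; the real set is longer and messier:

```python
import re

# Illustrative patterns only; the production set is longer and messier.
NAME_TITLE = re.compile(r"([A-Z][a-z]+(?: [A-Z][a-z]+)+)[,\s]+(CEO|Founder|Co-Founder)")
ROUND_CONTEXT = re.compile(r"\b(led|participated in)\s+(?:the\s+)?(Seed|Series [A-E])\b", re.IGNORECASE)

def pattern_extract(markdown: str) -> dict:
    """Pull whatever the regexes can find from a markdown-rendered page."""
    people = [f"{m.group(1)} ({m.group(2)})" for m in NAME_TITLE.finditer(markdown)]
    rounds = [f"{m.group(1)} {m.group(2)}" for m in ROUND_CONTEXT.finditer(markdown)]
    return {"people": people, "rounds": rounds}
```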

Tier 3: Link Harvesting (adds 10%)
Sometimes you can't extract the data directly, but you can find where it lives. I grab every internal link that looks promising: URLs containing company names, falling under /portfolio/ or /companies/, or containing "profile," "about," or "investment."

These harvested links get queued for a second extraction pass. I've found incredible data by following breadcrumbs.
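A minimal version of that link filter (the hint list is illustrative, and matching against known company names is omitted here):

```python
from urllib.parse import urlparse

# Illustrative hints for "promising" internal links.
PROMISING_HINTS = ("/portfolio/", "/companies/", "profile", "about", "investment")

def harvest_links(base_domain: str, links: list[str]) -> list[str]:
    """Keep internal links that look like they point at company data."""
    kept = []
    for href in links:
        netloc = urlparse(href).netloc
        if netloc and netloc != base_domain:
            continue  # external link, skip
        if any(hint in href.lower() for hint in PROMISING_HINTS):
            kept.append(href)
    return kept
```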

Tier 4: Page Title Fallback (last 5%)
The dead-last resort: just grab the page title and URL. At minimum, I know "This VC has a page about Stripe." That's better than nothing and can be manually reviewed later.

The Resilient Scraper Framework

┌─────────────────────────────────────────┐
│   NEVER RELY ON ONE EXTRACTION METHOD   │
│                                         │
│   Primary → Fallback 1 → Fallback 2    │
│           → Last Resort                 │
└─────────────────────────────────────────┘

TIER 1: AI Extraction (60% success)
└─> Uses: Firecrawl + LLM templates
└─> When it fails ↓

TIER 2: Pattern Matching (25% recovery)
└─> Uses: Markdown + text patterns
└─> When it fails ↓

TIER 3: Link Discovery (10% recovery)
└─> Uses: Harvesting internal URLs
└─> When it fails ↓

TIER 4: Minimal Capture (5% recovery)
└─> Uses: Page title + URL for review

RESULT: 70-80% of sites work without custom code

Two Practical Notes on Cost and Organization

Two quick implementation details that made a huge difference:

(1) Cache Firecrawl responses for 48 hours—this alone saved me 60% in testing costs. Investment portfolios don't change daily, so serving slightly stale data was worth the massive API savings. (A sketch of the cache follows these notes.)

(2) Organize code into separate modules (discovery, extraction, fallbacks, processing, storage) so when something breaks, you know exactly where to look. Debugging a 2,000-line file is archaeological work. Debugging focused 200-line modules is straightforward.
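To make note (1) concrete, here's a minimal sketch of a 48-hour file cache. `scrape_fn` is a hypothetical stand-in for the actual Firecrawl call:

```python
import hashlib
import json
import time
from pathlib import Path

CACHE_DIR = Path(".scrape_cache")
CACHE_TTL_SECONDS = 48 * 3600  # 48 hours

def cached_scrape(url: str, scrape_fn) -> dict:
    """Return a cached response if it's younger than 48h, else scrape and cache."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists() and time.time() - path.stat().st_mtime < CACHE_TTL_SECONDS:
        return json.loads(path.read_text())   # cache hit: no API credit spent
    result = scrape_fn(url)                   # cache miss: hit the API
    path.write_text(json.dumps(result))
    return result
```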

The Pipeline in Action

Here's how data flows through the system:

First, discover portfolio pages 2-5 levels deep based on link density. Then categorize whether each page is a listing or detail page. Try AI template-based extraction. If that succeeds, clean and save to database. If it fails, try markdown and text pattern matching. Still failing? Harvest links for a second pass. Got nothing? Save the title and URL for manual review.
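Sketched as an orchestration function, assuming `page` carries the url, title, domain, markdown, and links collected during discovery, `ai_template_extract` is a hypothetical wrapper around the Firecrawl extraction call, and `pattern_extract` / `harvest_links` are the sketches above:

```python
def extract_with_fallbacks(page: dict) -> dict:
    """Run the four tiers in order and record which one produced the result."""
    structured = ai_template_extract(page)                 # Tier 1: AI templates
    if structured:
        return {"tier": 1, "data": structured}

    patterns = pattern_extract(page.get("markdown", ""))   # Tier 2: text patterns
    if patterns["people"] or patterns["rounds"]:
        return {"tier": 2, "data": patterns}

    links = harvest_links(page["domain"], page.get("links", []))  # Tier 3: links
    if links:
        return {"tier": 3, "requeue": links}  # queued for a second extraction pass

    # Tier 4: minimal capture, flagged for manual review
    return {"tier": 4, "data": {"title": page.get("title"), "url": page["url"]}}
```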

Every stage has a safety net. That's the point.

Why This Actually Works

Most scrapers are brittle. They're built for one website's current structure. Mine assumes every site is different, so I use adaptive crawling. It assumes HTML is unreliable, so I use AI extraction. And it assumes AI isn't perfect, so I built a fallback pipeline.

The goal isn't perfect extraction—it's graceful degradation. When one method fails, another takes over. When that fails, a third kicks in. And when everything fails, I at least capture enough information to manually review later.

Does this scraper work perfectly? No. Do I manually review outputs? Yes. Have I spent too many hours debugging why certain sites return 403 errors? Absolutely.

But here's the thing: it works well enough. And in the real world, especially when you're a finance guy vibing his way through code, "well enough" beats "perfect but never ships."

Key Takeaway: Embrace Imperfection

┌────────────────────────────────────────┐
│  THE RESILIENT SCRAPER MINDSET        │
│                                        │
│  I'd rather get 70% of the data from  │
│  100 firms than 100% of the data from │
│  7 firms.                             │
│                                        │
│  Progressive fallbacks make this       │
│  possible.                             │
└────────────────────────────────────────┘

Perfect extraction is a myth. Websites are too varied, structures too inconsistent, JavaScript too unpredictable. The goal isn't to handle every edge case—it's to fail gracefully when edge cases appear.

My scraper works on 60-70% of investment sites without custom code. The remaining 30-40% either need minor tweaks or get flagged for manual review. That's enough to be useful. That's enough to ship.

Build for the expected case. Plan for the edge case. Accept that some data will always be messy.


Next time: How I built a confidence scoring system that tells me which scraped results I can trust and which need human review—because clean code doesn't guarantee accurate data.

Quick exercise: Think of a website you need to scrape regularly. Which tier of my fallback system would you need? AI extraction, pattern matching, link discovery, or minimal capture? Reply with your answer.