
Part 3 - How to Know If Your Data Is Garbage: Building Confidence Into Your Web Scraper


Without a confidence score, every extraction looks equally legitimate. That's dangerous, because you start making decisions based on data that might be 40% guesswork. Here's how I taught my scraper to know when it's guessing.


The Problem: All Scraped Data Looks the Same

Here's what nobody tells you about web scraping: it's probabilistic, not deterministic. You're not querying a clean database with guaranteed fields. You're parsing messy HTML where someone put the CEO's name in a <span class="text-gray-600"> tag because their designer liked that shade of gray.

Sometimes your AI extraction nails it. Sometimes it hallucinates a plausible-sounding company name. Sometimes you fall back to regex and get lucky. Without confidence scoring, all these outcomes look identical in your spreadsheet.

I learned this the hard way during testing when I exported a batch of 50 "successful" extractions. Half had blank investment years. Another quarter were missing industries. Turns out my scraper was returning records with just company names—20% complete data presented as 100% reliable. That's when I realized: my scraper needed to learn humility.


The Solution: A Weighted Confidence System

I built a scoring system based on which fields got successfully extracted. Not all fields matter equally—some are critical, others are nice-to-have.

Here's the breakdown:

Field             Weight   Why It Matters
Company name      20%      Without this, you have nothing
Industry          15%      Critical for sector analysis
Investment year   10%      Essential for vintage tracking
CEO               10%      Signals executive-level detail
Investment role   10%      Control vs minority stake
Ownership %       10%      Concentration analysis
Location          10%      Geographic diversification
Website           10%      Enables due diligence
Status            5%       Active vs exited tracking

Total: 100%

The math is dead simple:

confidence_score = sum(weights of successfully extracted fields)

A record with name + industry + year = 45% confidence. That's enough to start a pipeline. A record with just a name = 20% confidence. That's a lead, not an investment.
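To make the math concrete, here's a minimal TypeScript sketch of that calculation. The field names, weights, and record shape mirror the table above but are illustrative assumptions, not the scraper's actual schema.

// Weights per field, mirroring the table above (assumed field names)
const FIELD_WEIGHTS: Record<string, number> = {
  companyName: 20,
  industry: 15,
  investmentYear: 10,
  ceo: 10,
  investmentRole: 10,
  ownershipPercent: 10,
  location: 10,
  website: 10,
  status: 5,
};

// Sum the weights of every field that was actually extracted (non-empty)
function confidenceScore(record: Record<string, unknown>): number {
  return Object.entries(FIELD_WEIGHTS).reduce((score, [field, weight]) => {
    const value = record[field];
    const present = value !== undefined && value !== null && String(value).trim() !== "";
    return present ? score + weight : score;
  }, 0);
}

// Example: name + industry + year extracted → 20 + 15 + 10 = 45
confidenceScore({ companyName: "ABC Capital", industry: "Software", investmentYear: 2019 }); // 45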


Track How Data Was Extracted, Not Just What Was Extracted

Confidence scoring alone isn't enough. You need to know the extraction method, because method predicts reliability.

My scraper uses six different strategies, each with its own rough reliability range:

  1. ai-detail (80-95% reliable): AI extraction from individual company pages
  2. ai-listing (60-75% reliable): AI extraction from portfolio overview pages
  3. fallback-markdown (40-60% reliable): Regex parsing when AI times out
  4. harvested-internal (70-85% reliable): Links the crawler initially missed
  5. harvested-external (30-50% reliable): Data from company websites
  6. global-fallback-title (<30% reliable): Last resort—just the page title

Every record gets tagged with its extraction_method. When I review results, I can filter by method and spot patterns in failure modes. If fallback-markdown consistently returns low-confidence data, I know the regex needs tuning.
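A hedged sketch of what that tagging might look like: an extraction_method field on every record, plus a lookup for the expected reliability band. The type names and band values are illustrative, taken directly from the list above.

// The six strategies from the list above
type ExtractionMethod =
  | "ai-detail"
  | "ai-listing"
  | "fallback-markdown"
  | "harvested-internal"
  | "harvested-external"
  | "global-fallback-title";

// Expected reliability band per method (illustrative, matching the ranges above)
const METHOD_RELIABILITY: Record<ExtractionMethod, [number, number]> = {
  "ai-detail": [80, 95],
  "ai-listing": [60, 75],
  "fallback-markdown": [40, 60],
  "harvested-internal": [70, 85],
  "harvested-external": [30, 50],
  "global-fallback-title": [0, 30],
};

// Every record carries its method, so results can be filtered by how they were produced
interface ScrapedRecord {
  companyName: string;
  confidence: number;
  extraction_method: ExtractionMethod;
}

// Example: pull everything produced by the regex fallback for review
const needsRegexTuning = (records: ScrapedRecord[]) =>
  records.filter((r) => r.extraction_method === "fallback-markdown");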


Expose Quality Metrics to Users

After every scrape, I return a detailed quality report. No hiding the mess—users see exactly what happened:

{
  "extractionQuality": {
    "totalPages": 25,
    "successfulExtractions": 20,
    "averageConfidence": 78,
    "incompleteInvestments": 3,
    "extractionMethods": {
      "ai": 18,
      "fallbackMarkdown": 2,
      "harvested": 0,
      "aiAvgConfidence": 82,
      "fallbackAvgConfidence": 55
    }
  }
}

This tells the story:

  • 78% average confidence: Decent, but not bulletproof
  • 18 AI extractions at 82%: The heavy lifter did most of the work
  • 2 fallback extractions at 55%: These need manual review
  • 3 incomplete records: Missing critical fields

No surprises. No hidden failures. Just transparency.
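As a rough illustration, aggregating that report could look something like this. The output shape matches the JSON above; how methods roll up into the ai / fallbackMarkdown / harvested buckets, and treating "incomplete" as anything under 50%, are my assumptions.

// Minimal record shape for this sketch (same assumption as the earlier example)
type MethodName =
  | "ai-detail" | "ai-listing" | "fallback-markdown"
  | "harvested-internal" | "harvested-external" | "global-fallback-title";
interface ScrapedRecord { confidence: number; extraction_method: MethodName; }

// Roll per-record results up into the extractionQuality report shown above
function buildQualityReport(totalPages: number, records: ScrapedRecord[]) {
  const avg = (xs: number[]) =>
    xs.length ? Math.round(xs.reduce((a, b) => a + b, 0) / xs.length) : 0;

  const ai = records.filter((r) => r.extraction_method.startsWith("ai-"));
  const fallback = records.filter((r) => r.extraction_method === "fallback-markdown");
  const harvested = records.filter((r) => r.extraction_method.startsWith("harvested-"));

  return {
    extractionQuality: {
      totalPages,
      successfulExtractions: records.length,
      averageConfidence: avg(records.map((r) => r.confidence)),
      // Assumption: "incomplete" = below the 50% review threshold; the post defines it as missing critical fields
      incompleteInvestments: records.filter((r) => r.confidence < 50).length,
      extractionMethods: {
        ai: ai.length,
        fallbackMarkdown: fallback.length,
        harvested: harvested.length,
        aiAvgConfidence: avg(ai.map((r) => r.confidence)),
        fallbackAvgConfidence: avg(fallback.map((r) => r.confidence)),
      },
    },
  };
}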


Add a Validation Layer for Sanity Checks

Confidence tells you what you got. Validation tells you if it makes sense.

I built in basic checks:

  • Name deduplication: Merge "ABC Capital" and "ABC Capital LLC"
  • Industry normalization: Flag "Technology, Software, SaaS, Cloud Computing, Enterprise" as one bloated field
  • Year validation: Flag anything before 1990 or after current year
  • URL verification: Check if website field is actually a URL
  • Ownership bounds: Flag anything >100% or <0%

These don't fix bad data—they flag it for human review. I'd rather surface a questionable record than auto-correct it and introduce new errors.
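Here's a minimal sketch of those sanity checks, returning flags rather than mutated records. The year cutoff, ownership bounds, and URL check come straight from the list above; the record shape and the "more than three comma-separated industries" heuristic are assumptions.

// Return human-readable flags; never auto-correct the record itself
function validateRecord(r: {
  investmentYear?: number;
  website?: string;
  ownershipPercent?: number;
  industry?: string;
}): string[] {
  const flags: string[] = [];
  const currentYear = new Date().getFullYear();

  // Year validation: anything before 1990 or after the current year is suspicious
  if (r.investmentYear !== undefined && (r.investmentYear < 1990 || r.investmentYear > currentYear)) {
    flags.push(`suspicious investment year: ${r.investmentYear}`);
  }
  // URL verification: the website field should actually parse as a URL
  if (r.website !== undefined) {
    try {
      new URL(r.website);
    } catch {
      flags.push(`website is not a valid URL: ${r.website}`);
    }
  }
  // Ownership bounds: flag anything outside 0-100%
  if (r.ownershipPercent !== undefined && (r.ownershipPercent < 0 || r.ownershipPercent > 100)) {
    flags.push(`ownership out of bounds: ${r.ownershipPercent}%`);
  }
  // Heuristic: an industry field with many comma-separated values is probably bloated
  if (r.industry && r.industry.split(",").length > 3) {
    flags.push(`industry looks bloated: ${r.industry}`);
  }
  return flags;
}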


Framework: Confidence Scoring System

STEP 1: ASSIGN WEIGHTS
- Critical fields (name, URL): 20-25% each
- Important fields (industry, year): 10-15% each
- Nice-to-have fields (CEO, location): 5-10% each

STEP 2: TRACK EXTRACTION METHOD
- AI from detail pages → 80%+ confidence
- AI from listing pages → 60-75% confidence
- Regex fallback → 40-60% confidence
- Page title fallback → <30% confidence

STEP 3: FLAG FOR REVIEW
- Anything <50% = needs human verification
- Anything via fallback = needs spot-checking
- Anything with missing critical fields = incomplete

STEP 4: CALCULATE FINAL SCORE
confidence = sum(extracted_field_weights)


Visual Indicators: Green, Yellow, Red

In the UI, I show a simple badge system:

  • Green (80-100%): High confidence, ready to use
  • Yellow (50-79%): Medium confidence, review recommended
  • Red (<50%): Low confidence, manual verification required

Users can sort and filter by confidence, focusing energy where it matters. No need to manually review pristine 95% records when you've got 40% garbage to triage.
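The badge logic is trivial, but for completeness, a sketch of those thresholds (function name is mine):

// Map a confidence score to the badge shown in the UI
function confidenceBadge(confidence: number): "green" | "yellow" | "red" {
  if (confidence >= 80) return "green";  // high confidence, ready to use
  if (confidence >= 50) return "yellow"; // medium confidence, review recommended
  return "red";                          // low confidence, manual verification required
}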


Key Takeaway: Build a Feedback Loop

Confidence scores aren't just for users—they're for you. Every scrape logs which extraction methods worked, which domains failed, and which fields went missing. Over time, you spot patterns. If harvested-external consistently returns <50% confidence, your external link detection needs work. If a specific domain always fails AI extraction, add a site-specific override.

The system doesn't get smarter through ML magic. It gets smarter because you're paying attention.


Next time: How I turned a working scraper into a profitable product without going broke on API credits. Spoiler: unit economics are brutal, and most scraping SaaS businesses are secretly unprofitable.

Your turn: Have you ever shipped a feature that looked like it worked, but the data quality was secretly terrible? I want to hear your confidence scoring horror stories.