Part 3 - How to Know If Your Data Is Garbage: Building Confidence Into Your Web Scraper
Without a confidence score, every extraction looks equally legitimate. That's dangerous, because you start making decisions based on data that might be 40% guesswork. Here's how I taught my scraper to know when it's guessing.
The Problem: All Scraped Data Looks the Same
Here's what nobody tells you about web scraping: it's probabilistic, not deterministic. You're not querying a clean database with guaranteed fields. You're parsing messy HTML where someone put the CEO's name in a <span class="text-gray-600"> tag because their designer liked that shade of gray.
Sometimes your AI extraction nails it. Sometimes it hallucinates a plausible-sounding company name. Sometimes you fall back to regex and get lucky. Without confidence scoring, all these outcomes look identical in your spreadsheet.
I learned this the hard way during testing when I exported a batch of 50 "successful" extractions. Half had blank investment years. Another quarter were missing industries. Some records had nothing but a company name: 20% complete data presented as 100% reliable. That's when I realized my scraper needed to learn humility.
The Solution: A Weighted Confidence System
I built a scoring system based on which fields got successfully extracted. Not all fields matter equally—some are critical, others are nice-to-have.
Here's the breakdown:
| Field | Weight | Why It Matters |
|---|---|---|
| Company name | 20% | Without this, you have nothing |
| Industry | 15% | Critical for sector analysis |
| Investment year | 10% | Essential for vintage tracking |
| CEO | 10% | Signals executive-level detail |
| Investment role | 10% | Control vs minority stake |
| Ownership % | 10% | Concentration analysis |
| Location | 10% | Geographic diversification |
| Website | 10% | Enables due diligence |
| Status | 5% | Active vs exited tracking |
Total: 100%
The math is dead simple:
confidence_score = sum(weights of successfully extracted fields)
A record with name + industry + year = 45% confidence. That's enough to start a pipeline. A record with just a name = 20% confidence. That's a lead, not an investment.
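To make the arithmetic concrete, here's a minimal sketch of that scoring in TypeScript. The weights mirror the table above; the PortfolioCompany shape and field names are illustrative, not the exact types from my codebase.

```typescript
// Hypothetical shape of an extracted record; any field may be missing.
type PortfolioCompany = Partial<{
  companyName: string;
  industry: string;
  investmentYear: number;
  ceo: string;
  investmentRole: string;
  ownershipPct: number;
  location: string;
  website: string;
  status: string;
}>;

// Weights from the table above; they sum to 100.
const FIELD_WEIGHTS: Record<keyof PortfolioCompany, number> = {
  companyName: 20,
  industry: 15,
  investmentYear: 10,
  ceo: 10,
  investmentRole: 10,
  ownershipPct: 10,
  location: 10,
  website: 10,
  status: 5,
};

// Sum the weights of every field that was actually extracted.
function confidenceScore(record: PortfolioCompany): number {
  return (Object.keys(FIELD_WEIGHTS) as (keyof PortfolioCompany)[])
    .filter((field) => record[field] !== undefined && record[field] !== "")
    .reduce((score, field) => score + FIELD_WEIGHTS[field], 0);
}

// Name + industry + year => 45, matching the example above.
const example = confidenceScore({
  companyName: "ABC Capital",
  industry: "Software",
  investmentYear: 2021,
});
```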
Track How Data Was Extracted, Not Just What Was Extracted
Confidence scoring alone isn't enough. You need to know the extraction method, because method predicts reliability.
My scraper uses six different strategies, each tagged with its own reliability range:
- ai-detail (80-95% reliable): AI extraction from individual company pages
- ai-listing (60-75% reliable): AI extraction from portfolio overview pages
- fallback-markdown (40-60% reliable): Regex parsing when AI times out
- harvested-internal (70-85% reliable): Links the crawler initially missed
- harvested-external (30-50% reliable): Data from company websites
- global-fallback-title (<30% reliable): Last resort—just the page title
Every record gets tagged with its extraction_method. When I review results, I can filter by method and spot patterns in failure modes. If fallback-markdown consistently returns low-confidence data, I know the regex needs tuning.
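Here's roughly what that tagging looks like as types. The method names and reliability bands come from the list above; the ExtractedRecord shape and the underperforming helper are illustrative, one way to spot a strategy delivering less than it should.

```typescript
// The six strategies, in the order the scraper falls through them.
type ExtractionMethod =
  | "ai-detail"
  | "ai-listing"
  | "fallback-markdown"
  | "harvested-internal"
  | "harvested-external"
  | "global-fallback-title";

// Reliability bands from the list above, used for reporting and review filters.
const METHOD_RELIABILITY: Record<ExtractionMethod, [number, number]> = {
  "ai-detail": [80, 95],
  "ai-listing": [60, 75],
  "fallback-markdown": [40, 60],
  "harvested-internal": [70, 85],
  "harvested-external": [30, 50],
  "global-fallback-title": [0, 30],
};

// Every record carries the method that produced it alongside its confidence.
interface ExtractedRecord {
  data: PortfolioCompany;             // from the earlier sketch
  confidence: number;                 // weighted field score
  extractionMethod: ExtractionMethod;
}

// Records whose confidence falls below the floor their method is supposed
// to deliver are a hint that the strategy itself needs tuning.
function underperforming(records: ExtractedRecord[]): ExtractedRecord[] {
  return records.filter(
    (r) => r.confidence < METHOD_RELIABILITY[r.extractionMethod][0]
  );
}
```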
Expose Quality Metrics to Users
After every scrape, I return a detailed quality report. No hiding the mess—users see exactly what happened:
```json
{
  "extractionQuality": {
    "totalPages": 25,
    "successfulExtractions": 20,
    "averageConfidence": 78,
    "incompleteInvestments": 3,
    "extractionMethods": {
      "ai": 18,
      "fallbackMarkdown": 2,
      "harvested": 0,
      "aiAvgConfidence": 82,
      "fallbackAvgConfidence": 55
    }
  }
}
```
This tells the story:
- 78% average confidence: Decent, but not bulletproof
- 18 AI extractions at 82%: The heavy lifter did most of the work
- 2 fallback extractions at 55%: These need manual review
- 3 incomplete records: Missing critical fields
No surprises. No hidden failures. Just transparency.
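For reference, here's a simplified sketch of how a report like that can be assembled from tagged records. The keys match the JSON above; the grouping logic and the "incomplete" rule used here (missing company name) are simplifications for illustration, not the production code.

```typescript
// Aggregate tagged records into the quality report shown above.
// totalPages is passed in because not every crawled page yields a record.
function buildQualityReport(records: ExtractedRecord[], totalPages: number) {
  const avg = (xs: number[]) =>
    xs.length ? Math.round(xs.reduce((a, b) => a + b, 0) / xs.length) : 0;

  const ai = records.filter((r) => r.extractionMethod.startsWith("ai-"));
  const fallback = records.filter((r) => r.extractionMethod === "fallback-markdown");
  const harvested = records.filter((r) => r.extractionMethod.startsWith("harvested-"));

  return {
    extractionQuality: {
      totalPages,
      successfulExtractions: records.length,
      averageConfidence: avg(records.map((r) => r.confidence)),
      // "Incomplete" here means the critical field (company name) is missing.
      incompleteInvestments: records.filter((r) => !r.data.companyName).length,
      extractionMethods: {
        ai: ai.length,
        fallbackMarkdown: fallback.length,
        harvested: harvested.length,
        aiAvgConfidence: avg(ai.map((r) => r.confidence)),
        fallbackAvgConfidence: avg(fallback.map((r) => r.confidence)),
      },
    },
  };
}
```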
Add a Validation Layer for Sanity Checks
Confidence tells you what you got. Validation tells you if it makes sense.
I built in basic checks:
- Name deduplication: Merge "ABC Capital" and "ABC Capital LLC"
- Industry normalization: Flag "Technology, Software, SaaS, Cloud Computing, Enterprise" as one bloated field
- Year validation: Flag anything before 1990 or after the current year
- URL verification: Check whether the website field is actually a valid URL
- Ownership bounds: Flag anything >100% or <0%
These don't fix bad data—they flag it for human review. I'd rather surface a questionable record than auto-correct it and introduce new errors.
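A simplified sketch of those checks follows. Deduplication is omitted because it operates across records rather than on a single one, and the exact thresholds (the 1990 cutoff, the comma count for a "bloated" industry field) are the illustrative values from the bullets above.

```typescript
// Sanity checks: return human-readable flags instead of silently "fixing" data.
function validate(record: PortfolioCompany): string[] {
  const flags: string[] = [];

  // Industry normalization: too many comma-separated values in one field.
  if (record.industry && record.industry.split(",").length > 3) {
    flags.push(`Bloated industry field: "${record.industry}"`);
  }

  // Year validation: anything before 1990 or in the future is suspect.
  const year = record.investmentYear;
  if (year !== undefined && (year < 1990 || year > new Date().getFullYear())) {
    flags.push(`Implausible investment year: ${year}`);
  }

  // URL verification: the website field must at least parse as a URL.
  if (record.website) {
    try {
      new URL(record.website);
    } catch {
      flags.push(`Website is not a valid URL: "${record.website}"`);
    }
  }

  // Ownership bounds: percentages outside 0-100 are impossible.
  const pct = record.ownershipPct;
  if (pct !== undefined && (pct < 0 || pct > 100)) {
    flags.push(`Ownership out of bounds: ${pct}%`);
  }

  return flags;
}
```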
Framework: Confidence Scoring System
STEP 1: ASSIGN WEIGHTS
- Critical fields (name, URL): 20-25% each
- Important fields (industry, year): 10-15% each
- Nice-to-have fields (CEO, location): 5-10% each
STEP 2: TRACK EXTRACTION METHOD
- AI from detail pages → 80%+ confidence
- AI from listing pages → 60-75% confidence
- Regex fallback → 40-60% confidence
- Page title fallback → <30% confidence
STEP 3: FLAG FOR REVIEW
- Anything <50% = needs human verification
- Anything via fallback = needs spot-checking
- Anything with missing critical fields = incomplete
STEP 4: CALCULATE FINAL SCORE
confidence = sum(extracted_field_weights)
Visual Indicators: Green, Yellow, Red
In the UI, I show a simple badge system:
- Green (80-100%): High confidence, ready to use
- Yellow (50-79%): Medium confidence, review recommended
- Red (<50%): Low confidence, manual verification required
Users can sort and filter by confidence, focusing energy where it matters. No need to manually review pristine 95% records when you've got 40% garbage to triage.
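The badge itself is nothing more than a threshold check:

```typescript
// Map a confidence score to the badge shown in the UI.
type Badge = "green" | "yellow" | "red";

function confidenceBadge(confidence: number): Badge {
  if (confidence >= 80) return "green";  // ready to use
  if (confidence >= 50) return "yellow"; // review recommended
  return "red";                          // manual verification required
}
```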
Key Takeaway: Build a Feedback Loop
Confidence scores aren't just for users—they're for you. Every scrape logs which extraction methods worked, which domains failed, and which fields went missing. Over time, you spot patterns. If harvested-external consistently returns <50% confidence, your external link detection needs work. If a specific domain always fails AI extraction, add a site-specific override.
The system doesn't get smarter through ML magic. It gets smarter because you're paying attention.
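A simplified sketch of that aggregation, assuming each log entry records the domain it came from:

```typescript
// A logged extraction: where it came from, how it was extracted, how confident.
interface ScrapeLogEntry {
  domain: string;
  extractionMethod: ExtractionMethod; // from the earlier sketch
  confidence: number;
}

// Average confidence per (method, domain) pair; the low buckets point at what to fix.
function confidenceByMethodAndDomain(log: ScrapeLogEntry[]): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const entry of log) {
    const key = `${entry.extractionMethod} @ ${entry.domain}`;
    const bucket = sums.get(key) ?? { total: 0, count: 0 };
    bucket.total += entry.confidence;
    bucket.count += 1;
    sums.set(key, bucket);
  }
  const averages = new Map<string, number>();
  for (const [key, { total, count }] of sums) {
    averages.set(key, Math.round(total / count));
  }
  return averages;
}
```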
Next time: How I turned a working scraper into a profitable product without going broke on API credits. Spoiler: unit economics are brutal, and most scraping SaaS businesses are secretly unprofitable.
Your turn: Have you ever shipped a feature that looked like it worked, but the data quality was secretly terrible? I want to hear your confidence scoring horror stories.