Part 3 - How to Know If Your Data Is Garbage: Building Confidence Into Your Web Scraper
Without a confidence score, every extraction looks equally legitimate. That's dangerous, because you start making decisions based on data that might be 40% guesswork. Here's how I taught my scraper to know when it's guessing.
The Problem: All Scraped Data Looks the Same
Here's what nobody tells you about web scraping: it's probabilistic, not deterministic. You're not querying a clean database with guaranteed fields. You're parsing messy HTML where someone put the CEO's name in a <span class="text-gray-600"> tag because their designer liked that shade of gray.
Sometimes your AI extraction nails it. Sometimes it hallucinates a plausible-sounding company name. Sometimes you fall back to regex and get lucky. Without confidence scoring, all these outcomes look identical in your spreadsheet.
I learned this the hard way during testing when I exported a batch of 50 "successful" extractions. Half had blank investment years. Another quarter were missing industries. Some records had nothing but a company name: 20% complete data presented as 100% reliable. That's when I realized my scraper needed to learn humility.
The Solution: A Weighted Confidence System
I built a scoring system based on which fields got successfully extracted. Not all fields matter equally—some are critical, others are nice-to-have.
Here's the breakdown:
| Field | Weight | Why It Matters |
|---|---|---|
| Company name | 20% | Without this, you have nothing |
| Industry | 15% | Critical for sector analysis |
| Investment year | 10% | Essential for vintage tracking |
| CEO | 10% | Signals executive-level detail |
| Investment role | 10% | Control vs minority stake |
| Ownership % | 10% | Concentration analysis |
| Location | 10% | Geographic diversification |
| Website | 10% | Enables due diligence |
| Status | 5% | Active vs exited tracking |
Total: 100%
The math is dead simple:
confidence_score = sum(weights of successfully extracted fields)
A record with name + industry + year = 45% confidence. That's enough to start a pipeline. A record with just a name = 20% confidence. That's a lead, not an investment.
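To make the arithmetic concrete, here's a minimal sketch of that scoring in TypeScript. The weights mirror the table above; the PortfolioCompany shape and field names are illustrative, not the exact types from my codebase.

```typescript
// Hypothetical shape of an extracted record; any field may be missing.
type PortfolioCompany = Partial<{
  companyName: string;
  industry: string;
  investmentYear: number;
  ceo: string;
  investmentRole: string;
  ownershipPct: number;
  location: string;
  website: string;
  status: string;
}>;

// Weights from the table above; they sum to 100.
const FIELD_WEIGHTS: Record<keyof PortfolioCompany, number> = {
  companyName: 20,
  industry: 15,
  investmentYear: 10,
  ceo: 10,
  investmentRole: 10,
  ownershipPct: 10,
  location: 10,
  website: 10,
  status: 5,
};

// Sum the weights of every field that was actually extracted.
function confidenceScore(record: PortfolioCompany): number {
  return (Object.keys(FIELD_WEIGHTS) as (keyof PortfolioCompany)[])
    .filter((field) => record[field] !== undefined && record[field] !== "")
    .reduce((score, field) => score + FIELD_WEIGHTS[field], 0);
}

// Name + industry + year => 45, matching the example above.
const example = confidenceScore({
  companyName: "ABC Capital",
  industry: "Software",
  investmentYear: 2021,
});
```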
Track How Data Was Extracted, Not Just What Was Extracted
Confidence scoring alone isn't enough. You need to know the extraction method, because method predicts reliability.
My scraper uses six different strategies, each tagged with its own reliability range:
- ai-detail (80-95% reliable): AI extraction from individual company pages
- ai-listing (60-75% reliable): AI extraction from portfolio overview pages
- fallback-markdown (40-60% reliable): Regex parsing when AI times out
- harvested-internal (70-85% reliable): Links the crawler initially missed
- harvested-external (30-50% reliable): Data from company websites
- global-fallback-title (<30% reliable): Last resort—just the page title
Every record gets tagged with its extraction_method. When I review results, I can filter by method and spot patterns in failure modes. If fallback-markdown consistently returns low-confidence data, I know the regex needs tuning.
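Here's roughly what that tagging looks like as types. The method names and reliability bands come from the list above; the ExtractedRecord shape and the underperforming helper are illustrative, one way to spot a strategy delivering less than it should.

```typescript
// The six strategies, in the order the scraper falls through them.
type ExtractionMethod =
  | "ai-detail"
  | "ai-listing"
  | "fallback-markdown"
  | "harvested-internal"
  | "harvested-external"
  | "global-fallback-title";

// Reliability bands from the list above, used for reporting and review filters.
const METHOD_RELIABILITY: Record<ExtractionMethod, [number, number]> = {
  "ai-detail": [80, 95],
  "ai-listing": [60, 75],
  "fallback-markdown": [40, 60],
  "harvested-internal": [70, 85],
  "harvested-external": [30, 50],
  "global-fallback-title": [0, 30],
};

// Every record carries the method that produced it alongside its confidence.
interface ExtractedRecord {
  data: PortfolioCompany;             // from the earlier sketch
  confidence: number;                 // weighted field score
  extractionMethod: ExtractionMethod;
}

// Records whose confidence falls below the floor their method is supposed
// to deliver are a hint that the strategy itself needs tuning.
function underperforming(records: ExtractedRecord[]): ExtractedRecord[] {
  return records.filter(
    (r) => r.confidence < METHOD_RELIABILITY[r.extractionMethod][0]
  );
}
```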
Expose Quality Metrics to Users
After every scrape, I return a detailed quality report. No hiding the mess—users see exactly what happened:
```json
{
  "extractionQuality": {
    "totalPages": 25,
    "successfulExtractions": 20,
    "averageConfidence": 78,
    "incompleteInvestments": 3,
    "extractionMethods": {
      "ai": 18,
      "fallbackMarkdown": 2,
      "harvested": 0,
      "aiAvgConfidence": 82,
      "fallbackAvgConfidence": 55
    }
  }
}
```
This tells the story:
- 78% average confidence: Decent, but not bulletproof
- 18 AI extractions at 82%: The heavy lifter did most of the work
- 2 fallback extractions at 55%: These need manual review
- 3 incomplete records: Missing critical fields
No surprises. No hidden failures. Just transparency.
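For reference, here's a simplified sketch of how a report like that can be assembled from tagged records. The keys match the JSON above; the grouping logic and the "incomplete" rule used here (missing company name) are simplifications for illustration, not the production code.

```typescript
// Aggregate tagged records into the quality report shown above.
// totalPages is passed in because not every crawled page yields a record.
function buildQualityReport(records: ExtractedRecord[], totalPages: number) {
  const avg = (xs: number[]) =>
    xs.length ? Math.round(xs.reduce((a, b) => a + b, 0) / xs.length) : 0;

  const ai = records.filter((r) => r.extractionMethod.startsWith("ai-"));
  const fallback = records.filter((r) => r.extractionMethod === "fallback-markdown");
  const harvested = records.filter((r) => r.extractionMethod.startsWith("harvested-"));

  return {
    extractionQuality: {
      totalPages,
      successfulExtractions: records.length,
      averageConfidence: avg(records.map((r) => r.confidence)),
      // "Incomplete" here means the critical field (company name) is missing.
      incompleteInvestments: records.filter((r) => !r.data.companyName).length,
      extractionMethods: {
        ai: ai.length,
        fallbackMarkdown: fallback.length,
        harvested: harvested.length,
        aiAvgConfidence: avg(ai.map((r) => r.confidence)),
        fallbackAvgConfidence: avg(fallback.map((r) => r.confidence)),
      },
    },
  };
}
```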
Add a Validation Layer for Sanity Checks
Confidence tells you what you got. Validation tells you if it makes sense.
I built in basic checks:
- Name deduplication: Merge "ABC Capital" and "ABC Capital LLC"
- Industry normalization: Flag "Technology, Software, SaaS, Cloud Computing, Enterprise" as one bloated field
- Year validation: Flag anything before 1990 or after the current year
- URL verification: Check whether the website field is actually a valid URL
- Ownership bounds: Flag anything >100% or <0%
These don't fix bad data—they flag it for human review. I'd rather surface a questionable record than auto-correct it and introduce new errors.
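A simplified sketch of those checks follows. Deduplication is omitted because it operates across records rather than on a single one, and the exact thresholds (the 1990 cutoff, the comma count for a "bloated" industry field) are the illustrative values from the bullets above.

```typescript
// Sanity checks: return human-readable flags instead of silently "fixing" data.
function validate(record: PortfolioCompany): string[] {
  const flags: string[] = [];

  // Industry normalization: too many comma-separated values in one field.
  if (record.industry && record.industry.split(",").length > 3) {
    flags.push(`Bloated industry field: "${record.industry}"`);
  }

  // Year validation: anything before 1990 or in the future is suspect.
  const year = record.investmentYear;
  if (year !== undefined && (year < 1990 || year > new Date().getFullYear())) {
    flags.push(`Implausible investment year: ${year}`);
  }

  // URL verification: the website field must at least parse as a URL.
  if (record.website) {
    try {
      new URL(record.website);
    } catch {
      flags.push(`Website is not a valid URL: "${record.website}"`);
    }
  }

  // Ownership bounds: percentages outside 0-100 are impossible.
  const pct = record.ownershipPct;
  if (pct !== undefined && (pct < 0 || pct > 100)) {
    flags.push(`Ownership out of bounds: ${pct}%`);
  }

  return flags;
}
```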
Framework: Confidence Scoring System
STEP 1: ASSIGN WEIGHTS
- Critical fields (name, URL): 20-25% each
- Important fields (industry, year): 10-15% each
- Nice-to-have fields (CEO, location): 5-10% each
STEP 2: TRACK EXTRACTION METHOD
- AI from detail pages → 80%+ confidence
- AI from listing pages → 60-75% confidence
- Regex fallback → 40-60% confidence
- Page title fallback → <30% confidence
STEP 3: FLAG FOR REVIEW
- Anything <50% = needs human verification
- Anything via fallback = needs spot-checking
- Anything with missing critical fields = incomplete
STEP 4: CALCULATE FINAL SCORE
confidence = sum(extracted_field_weights)
Visual Indicators: Green, Yellow, Red
In the UI, I show a simple badge system:
- Green (80-100%): High confidence, ready to use
- Yellow (50-79%): Medium confidence, review recommended
- Red (<50%): Low confidence, manual verification required
Users can sort and filter by confidence, focusing energy where it matters. No need to manually review pristine 95% records when you've got 40% garbage to triage.
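The badge itself is nothing more than a threshold check:

```typescript
// Map a confidence score to the badge shown in the UI.
type Badge = "green" | "yellow" | "red";

function confidenceBadge(confidence: number): Badge {
  if (confidence >= 80) return "green";  // ready to use
  if (confidence >= 50) return "yellow"; // review recommended
  return "red";                          // manual verification required
}
```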
Key Takeaway: Build a Feedback Loop
Confidence scores aren't just for users—they're for you. Every scrape logs which extraction methods worked, which domains failed, and which fields went missing. Over time, you spot patterns. If harvested-external consistently returns <50% confidence, your external link detection needs work. If a specific domain always fails AI extraction, add a site-specific override.
The system doesn't get smarter through ML magic. It gets smarter because you're paying attention.
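A simplified sketch of that aggregation, assuming each log entry records the domain it came from:

```typescript
// A logged extraction: where it came from, how it was extracted, how confident.
interface ScrapeLogEntry {
  domain: string;
  extractionMethod: ExtractionMethod; // from the earlier sketch
  confidence: number;
}

// Average confidence per (method, domain) pair; the low buckets point at what to fix.
function confidenceByMethodAndDomain(log: ScrapeLogEntry[]): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const entry of log) {
    const key = `${entry.extractionMethod} @ ${entry.domain}`;
    const bucket = sums.get(key) ?? { total: 0, count: 0 };
    bucket.total += entry.confidence;
    bucket.count += 1;
    sums.set(key, bucket);
  }
  const averages = new Map<string, number>();
  for (const [key, { total, count }] of sums) {
    averages.set(key, Math.round(total / count));
  }
  return averages;
}
```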
Next time: How I turned a working scraper into a profitable product without going broke on API credits. Spoiler: unit economics are brutal, and most scraping SaaS businesses are secretly unprofitable.
Your turn: Have you ever shipped a feature that looked like it worked, but the data quality was secretly terrible? I want to hear your confidence scoring horror stories.