General-purpose breadth-first crawling produces bloated corpora: pagination loops, tag clouds, low-value archives, parameter permutations. A focused crawler maximizes useful knowledge per fetch by scoring and sequencing URL discovery against explicit objectives (coverage SLAs, freshness targets, noise ceilings). Treat the crawler as an adaptive, metrics-driven subsystem—NOT a throwaway ingestion script. This guide details architecture components, scoring strategies, change detection, politeness, and recovery patterns.
Goals of a Focused Crawler
Primary objectives:
| Goal | Definition | Metric |
|---|---|---|
| Signal Density | Ratio of retained main content tokens / total fetched tokens | density_ratio |
| Coverage Efficiency | % of priority pages fetched within budget window | p1_coverage_rate |
| Freshness | Avg age since last verified crawl | avg_staleness_days |
| Noise Suppression | % of fetched pages later discarded | discard_rate |
| Recrawl Efficiency | Re-fetched pages that changed / total re-fetched | change_yield |
Each crawl run should emit these to enable tuning.
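A minimal sketch of how a run might emit these as structured JSON; `CrawlRunMetrics`, `emit()`, and the example values are illustrative, not a prescribed schema.

```python
# Sketch only: field names mirror the goals table; swap emit() for your metrics backend.
from dataclasses import dataclass, asdict
import json

@dataclass
class CrawlRunMetrics:
    density_ratio: float        # retained main-content tokens / total fetched tokens
    p1_coverage_rate: float     # priority pages fetched within the budget window
    avg_staleness_days: float   # average age since last verified crawl
    discard_rate: float         # fetched pages later discarded
    change_yield: float         # changed re-fetches / total re-fetches

def emit(metrics: CrawlRunMetrics) -> None:
    # Placeholder sink: replace with StatsD, Prometheus, or a log pipeline.
    print(json.dumps(asdict(metrics)))

emit(CrawlRunMetrics(0.61, 0.93, 4.2, 0.08, 0.27))  # example values only
```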
URL Frontier Management
Structure:
- Multi-Queue Frontier: Separate priority lanes (p0 critical, p1 docs, p2 blog, p3 long-tail) implemented via min-heap keyed by (next_fetch_time, -score).
- Adaptive Throttling: Maintain moving average of latency & error rate; reduce concurrency if 4xx/5xx spikes.
- Backpressure: Cap frontier size (e.g., 100k); if overflow, evict lowest score entries or defer to next cycle.
- Checkpointing: Persist frontier state + fetch cursor every N URLs for resumption.
Avoid single monolithic queue—heterogeneous content demands differentiated service levels.
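A compact sketch of such a frontier, assuming the four lanes and the 100k cap above; the `Frontier` class name, the lane draining order, and the non-blocking `push`/`pop` API are illustrative choices.

```python
# Priority-lane frontier keyed by (next_fetch_time, -score); higher lanes drain first.
import heapq
import time

class Frontier:
    LANES = ("p0", "p1", "p2", "p3")

    def __init__(self, max_size: int = 100_000):
        self.max_size = max_size
        self.heaps = {lane: [] for lane in self.LANES}
        self.size = 0

    def push(self, lane: str, url: str, score: float,
             next_fetch_time: float | None = None) -> bool:
        if self.size >= self.max_size:
            return False  # backpressure: caller defers the URL to the next cycle
        when = next_fetch_time if next_fetch_time is not None else time.time()
        heapq.heappush(self.heaps[lane], (when, -score, url))
        self.size += 1
        return True

    def pop(self) -> str | None:
        now = time.time()
        for lane in self.LANES:                     # serve p0 before p1, etc.
            heap = self.heaps[lane]
            if heap and heap[0][0] <= now:          # only URLs whose fetch time has arrived
                _, _neg_score, url = heapq.heappop(heap)
                self.size -= 1
                return url
        return None
```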
Relevance Scoring Heuristics
Composite score = base_path_weight + anchor_signal + freshness_hint - duplication_risk - trap_risk.
Components:
- base_path_weight: Regex/prefix mapping (/docs, /guide, /api higher than /blog/tag/).
- anchor_signal: TF‑IDF weight of anchor text tokens vs domain vocabulary.
- freshness_hint: Sibling URLs containing dates (e.g., /2025/09/) → slight boost early, decays fast.
- duplication_risk: Query parameter count, presence of utm_/session tracking patterns.
- trap_risk: Calendar patterns (/2025/09/14/) or page number > threshold.
Normalize scores 0–1; clamp extremes for stability.
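A hypothetical scorer for this composite; the prefix weights, the 0.1-per-parameter duplication penalty, and the trap penalty are placeholder values to tune per domain.

```python
# Composite score = base_path_weight + anchor_signal + freshness_hint
#                   - duplication_risk - trap_risk, clamped to 0–1.
import re
from urllib.parse import urlparse, parse_qs

PATH_WEIGHTS = [("/docs", 0.9), ("/guide", 0.8), ("/api", 0.8), ("/blog/tag/", 0.1)]
CALENDAR_RE = re.compile(r"/(19|20)\d{2}/(0[1-9]|1[0-2])/([0-2][0-9]|3[01])/")

def score_url(url: str, anchor_signal: float, freshness_hint: float) -> float:
    parsed = urlparse(url)
    base = next((w for prefix, w in PATH_WEIGHTS
                 if parsed.path.startswith(prefix)), 0.3)       # default weight
    duplication_risk = 0.1 * len(parse_qs(parsed.query))        # each query param adds risk
    trap_risk = 0.5 if CALENDAR_RE.search(parsed.path) else 0.0
    raw = base + anchor_signal + freshness_hint - duplication_risk - trap_risk
    return max(0.0, min(1.0, raw))                              # clamp extremes
```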
Trap & Noise Avoidance
Detectors:
- Calendar Regex: `/(19|20)\d{2}\/(0[1-9]|1[0-2])\/([0-2][0-9]|3[01])\//` combined with low anchor diversity.
- Pagination Depth: If param `page` > 10 and average new content tokens < threshold → halt.
- Infinite Scroll APIs: Repeated JSON pattern with stable hashes; stop after N unchanged responses.
- Session/Tracking Params: Strip & canonicalize (`?utm_`, `?sessionid`, `?ref`).
Maintain a suppression log (url, reason) for audit.
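An illustrative version of these detectors; the thresholds (page > 10, a 50-token floor, 0.2 anchor diversity) and the in-memory suppression log are assumptions.

```python
# Trap checks plus a (url, reason) suppression log for audit; thresholds are examples.
import re
from urllib.parse import urlparse, parse_qs

CALENDAR_RE = re.compile(r"/(19|20)\d{2}/(0[1-9]|1[0-2])/([0-2][0-9]|3[01])/")
suppression_log: list[tuple[str, str]] = []     # (url, reason) entries for later audit

def check_traps(url: str, anchor_diversity: float, avg_new_tokens: float) -> str | None:
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    if CALENDAR_RE.search(parsed.path) and anchor_diversity < 0.2:
        return "calendar_trap"
    page = params.get("page", ["0"])[0]
    if page.isdigit() and int(page) > 10 and avg_new_tokens < 50:
        return "deep_pagination"
    return None

def maybe_suppress(url: str, anchor_diversity: float, avg_new_tokens: float) -> bool:
    reason = check_traps(url, anchor_diversity, avg_new_tokens)
    if reason:
        suppression_log.append((url, reason))
        return True
    return False
```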
Duplicate Detection
Layers:
- URL Canonicalization: Lowercase host, strip default ports, sort query params, remove tracking params.
- Normalized Key Hash: SHA256(canonical_url).
- Content Fingerprint: SimHash or MinHash on cleaned text.
- Near-Dup Threshold: Similarity >0.92 → discard; 0.85–0.92 flag for review.
Store duplicate_map to trace suppression rationale.
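A sketch of the canonicalization and exact-duplicate key layers; the tracking-prefix list is deliberately partial, and near-duplicate fingerprinting (SimHash/MinHash) is left to a dedicated library.

```python
# Canonical URL + SHA256 key for exact-duplicate detection.
import hashlib
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

TRACKING_PREFIXES = ("utm_", "sessionid", "ref")   # partial list; extend per domain

def canonicalize(url: str) -> str:
    p = urlparse(url)
    host = (p.hostname or "").lower()
    # Strip default ports, drop tracking params, sort what remains.
    netloc = host if p.port in (None, 80, 443) else f"{host}:{p.port}"
    params = [(k, v) for k, v in parse_qsl(p.query)
              if not k.lower().startswith(TRACKING_PREFIXES)]
    query = urlencode(sorted(params))
    return urlunparse((p.scheme.lower(), netloc, p.path or "/", "", query, ""))

def url_key(url: str) -> str:
    return hashlib.sha256(canonicalize(url).encode("utf-8")).hexdigest()
```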
Change Detection & Recrawl
Signals:
- HTTP Headers: ETag + Last-Modified (conditional requests reduce bandwidth).
- Sitemap lastmod: Baseline schedule + opportunistic prioritization.
- Content Hash Diff: Rehash main content; if unchanged, extend next_fetch_time.
- High-Volatility Paths: Release notes, pricing recrawled more frequently.
Dynamic interval = base_interval(path_type) * volatility_multiplier(change_rate_bucket).
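A toy version of the interval formula; the base intervals and volatility multipliers are example values only.

```python
# next interval = base_interval(path_type) * volatility_multiplier(change_rate_bucket)
BASE_INTERVAL_DAYS = {"release_notes": 1, "pricing": 2, "docs": 7, "blog": 30}
VOLATILITY_MULTIPLIER = {"high": 0.5, "medium": 1.0, "low": 2.0}

def next_interval_days(path_type: str, change_rate_bucket: str) -> float:
    base = BASE_INTERVAL_DAYS.get(path_type, 14)                # default for unknown paths
    return base * VOLATILITY_MULTIPLIER.get(change_rate_bucket, 1.0)

# e.g. next_interval_days("docs", "low") -> 14.0 days until the next re-fetch
```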
Politeness & Rate Limits
- robots.txt Parser: Cache & revalidate every 12h; enforce disallow before queue insertion.
- Adaptive Concurrency: Token bucket per host; adjust concurrency using EWMA of recent request latency.
- Crawl Delay: Honor explicit directives; fall back to a self-calculated delay if none is specified.
- Retries: Exponential backoff on 429/503 with jitter.
- User-Agent: Include contact URL / email for abuse reporting.
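A sketch of the per-host token bucket and jittered backoff from the bullets above; rates, capacity, and caps are placeholders, and the EWMA-based concurrency adjustment and robots.txt checks are assumed to happen elsewhere.

```python
# Per-host rate limiting plus full-jitter exponential backoff for 429/503 responses.
import random
import time

class HostBucket:
    def __init__(self, rate_per_sec: float = 1.0, capacity: int = 5):
        self.rate, self.capacity = rate_per_sec, capacity
        self.tokens, self.last = float(capacity), time.monotonic()

    def acquire(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False            # caller re-queues the fetch instead of blocking a worker

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    # Exponential backoff with full jitter; attempt count is capped by the retry policy.
    return random.uniform(0, min(cap, base * 2 ** attempt))
```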
Observability
Metrics & logs:
| Metric | Purpose |
|---|---|
| fetch_success_rate | Detect connectivity / blocking |
| avg_response_time_ms | Latency baseline |
| frontier_depth | Backlog health |
| duplicate_discard_rate | Noise tuning |
| change_yield | Efficiency of recrawls |
| robots_denials | Scope misconfig detection |
Trace sample contains: url, status, bytes, content_hash_prefix, discovered_links_count, suppression_reason.
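One possible shape for that trace record, emitted as structured JSON through the standard logging module; the field names follow the list above, everything else is an assumption.

```python
# Structured per-fetch trace record; not a fixed schema.
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("crawler.trace")

def trace_fetch(url: str, status: int, body: bytes, content_hash: str,
                discovered_links: int, suppression_reason: str | None = None) -> None:
    log.info(json.dumps({
        "url": url,
        "status": status,
        "bytes": len(body),
        "content_hash_prefix": content_hash[:12],
        "discovered_links_count": discovered_links,
        "suppression_reason": suppression_reason,
    }))
```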
Failure Recovery
Strategies:
- Frontier Snapshotting: Serialize queues periodically to durable storage.
- Idempotent Processing: Deterministic hashes prevent duplicate downstream extraction.
- Partial Resume: On restart, rebuild in-flight tasks; skip already-completed URLs by hash presence.
- Retry Classes: Network vs 5xx vs parse errors; cap attempts; quarantine persistent failures.
- Circuit Breakers: Pause host if consecutive failures exceed threshold.
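A minimal per-host circuit breaker matching the last bullet; the failure threshold and cool-off window are illustrative.

```python
# Pause a host after N consecutive failures; resume automatically after a cool-off.
import time

class HostCircuitBreaker:
    def __init__(self, threshold: int = 5, cooloff_sec: float = 300.0):
        self.threshold, self.cooloff = threshold, cooloff_sec
        self.failures = 0
        self.paused_until = 0.0

    def allow(self) -> bool:
        return time.monotonic() >= self.paused_until

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.paused_until = time.monotonic() + self.cooloff
            self.failures = 0
```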
Key Takeaways
- Explicit goals + metrics convert crawling from guesswork to optimization.
- Multi-queue frontier + composite scoring maximizes high-value coverage early.
- Early trap/noise suppression prevents exponential index bloat.
- Layered duplicate & change detection conserves bandwidth and embed cost.
- Observability & recovery design are core, not optional extras.