Building a Focused Website Crawler

crawler • focus • web • extraction

General-purpose breadth-first crawling produces bloated corpora: pagination loops, tag clouds, low-value archives, parameter permutations. A focused crawler maximizes useful knowledge per fetch by scoring and sequencing URL discovery against explicit objectives (coverage SLAs, freshness targets, noise ceilings). Treat the crawler as an adaptive, metrics-driven subsystem—NOT a throwaway ingestion script. This guide details architecture components, scoring strategies, change detection, politeness, and recovery patterns.

Goals of a Focused Crawler

Primary objectives:

| Goal | Definition | Metric |
| --- | --- | --- |
| Signal Density | Ratio of retained main content tokens / total fetched tokens | density_ratio |
| Coverage Efficiency | % of priority pages fetched within budget window | p1_coverage_rate |
| Freshness | Average age since last verified crawl | avg_staleness_days |
| Noise Suppression | % of fetched pages later discarded | discard_rate |
| Incremental Cost | Re-fetched pages that changed / total re-fetched | change_yield |

Each crawl run should emit these to enable tuning.
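
A minimal sketch of computing these metrics from raw counters at the end of a run; the counter names and the JSON emission are assumptions, not a prescribed interface:

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class CrawlRunMetrics:
    """Per-run quality metrics; field names mirror the goals table above."""
    density_ratio: float       # retained main-content tokens / total fetched tokens
    p1_coverage_rate: float    # priority pages fetched / priority pages planned
    avg_staleness_days: float  # mean age since last verified crawl
    discard_rate: float        # fetched pages later discarded / fetched pages
    change_yield: float        # re-fetched pages that changed / total re-fetched


def emit_metrics(retained_tokens, fetched_tokens, p1_fetched, p1_planned,
                 staleness_days_sum, verified_pages, discarded, fetched,
                 changed_refetches, total_refetches) -> str:
    """Derive the metric set from raw counters and serialize it as JSON."""
    metrics = CrawlRunMetrics(
        density_ratio=retained_tokens / max(fetched_tokens, 1),
        p1_coverage_rate=p1_fetched / max(p1_planned, 1),
        avg_staleness_days=staleness_days_sum / max(verified_pages, 1),
        discard_rate=discarded / max(fetched, 1),
        change_yield=changed_refetches / max(total_refetches, 1),
    )
    return json.dumps(asdict(metrics))
```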

URL Frontier Management

Structure:

  • Multi-Queue Frontier: Separate priority lanes (p0 critical, p1 docs, p2 blog, p3 long-tail) implemented via min-heap keyed by (next_fetch_time, -score).
  • Adaptive Throttling: Maintain a moving average of latency and error rate; reduce concurrency when 4xx/5xx rates spike.
  • Backpressure: Cap frontier size (e.g., 100k); on overflow, evict the lowest-scoring entries or defer them to the next cycle.
  • Checkpointing: Persist frontier state + fetch cursor every N URLs for resumption.

Avoid a single monolithic queue: heterogeneous content demands differentiated service levels.
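
A minimal frontier sketch under these constraints, with one min-heap per lane keyed by (next_fetch_time, -score); the lane labels and the pickle-based checkpointing are illustrative assumptions:

```python
import heapq
import pickle
import time

LANES = ("p0", "p1", "p2", "p3")   # critical, docs, blog, long-tail (assumed labels)
MAX_FRONTIER = 100_000             # backpressure cap from the text


class Frontier:
    def __init__(self):
        # One min-heap per priority lane, ordered by (next_fetch_time, -score, url).
        self.lanes = {lane: [] for lane in LANES}
        self.size = 0

    def push(self, lane: str, url: str, score: float, next_fetch_time: float = 0.0):
        if self.size >= MAX_FRONTIER:
            return  # overflow: caller may instead evict lowest-score entries or defer
        heapq.heappush(self.lanes[lane], (next_fetch_time, -score, url))
        self.size += 1

    def pop(self):
        """Return the best due URL, preferring higher-priority lanes."""
        now = time.time()
        for lane in LANES:
            heap = self.lanes[lane]
            if heap and heap[0][0] <= now:
                self.size -= 1
                next_time, neg_score, url = heapq.heappop(heap)
                return lane, url, -neg_score
        return None  # nothing due yet

    def checkpoint(self, path: str):
        """Persist frontier state for resumption (pickle as a simple stand-in)."""
        with open(path, "wb") as fh:
            pickle.dump(self.lanes, fh)
```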

Relevance Scoring Heuristics

Composite score = base_path_weight + anchor_signal + freshness_hint - duplication_risk - trap_risk.

Components:

  • base_path_weight: Regex/prefix mapping (/docs, /guide, /api higher than /blog/tag/).
  • anchor_signal: TF‑IDF weight of anchor text tokens vs domain vocabulary.
  • freshness_hint: Sibling URLs containing dates (e.g., /2025/09/) → slight boost early, decays fast.
  • duplication_risk: Query parameter count; presence of utm_/session patterns.
  • trap_risk: Calendar patterns (/2025/09/14/) or page number > threshold.

Normalize scores 0–1; clamp extremes for stability.
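
A sketch of the composite score, assuming hypothetical path weights, component weights, and risk values; anchor_signal and freshness_hint are supplied by the caller, already normalized to 0–1:

```python
import re
from urllib.parse import urlparse, parse_qs

# Illustrative path weights; tune against your own site structure.
PATH_WEIGHTS = [(r"^/docs", 1.0), (r"^/guide", 0.9), (r"^/api", 0.9),
                (r"^/blog/tag/", 0.1), (r"^/blog", 0.4)]
CALENDAR_RE = re.compile(r"/(19|20)\d{2}/(0[1-9]|1[0-2])/")


def score_url(url: str, anchor_signal: float, freshness_hint: float) -> float:
    """Composite = base_path_weight + anchor_signal + freshness_hint
                   - duplication_risk - trap_risk, clamped to [0, 1]."""
    parsed = urlparse(url)
    base = next((w for pat, w in PATH_WEIGHTS if re.match(pat, parsed.path)), 0.3)

    params = parse_qs(parsed.query)
    duplication_risk = min(0.1 * len(params), 0.5)
    if any(k.startswith("utm_") or k == "sessionid" for k in params):
        duplication_risk += 0.3

    trap_risk = 0.4 if CALENDAR_RE.search(parsed.path) else 0.0

    raw = base + 0.3 * anchor_signal + 0.1 * freshness_hint - duplication_risk - trap_risk
    return max(0.0, min(1.0, raw))  # normalize and clamp to 0-1 for stability
```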

Trap & Noise Avoidance

Detectors:

  • Calendar Regex: /(19|20)\d{2}\/(0[1-9]|1[0-2])\/([0-2][0-9]|3[01])\// with low anchor diversity.
  • Pagination Depth: If param page > 10 and average new content tokens < threshold → halt.
  • Infinite Scroll APIs: Repeated JSON pattern with stable hashes—stop after N unchanged.
  • Session/Tracking Params: Strip & canonicalize (?utm_, ?sessionid, ?ref).

Maintain a suppression log (url, reason) for audit.
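
A sketch of the calendar and pagination detectors feeding a suppression log; the page-depth cutoff and token floor are assumed values:

```python
import re
from urllib.parse import urlparse, parse_qs

CALENDAR_RE = re.compile(r"/(19|20)\d{2}/(0[1-9]|1[0-2])/([0-2][0-9]|3[01])/")
MAX_PAGE = 10                                  # pagination depth cutoff from the text
suppression_log: list[tuple[str, str]] = []    # (url, reason) pairs for audit


def check_traps(url: str, avg_new_tokens: float, token_floor: float = 50.0) -> bool:
    """Return True if the URL should be suppressed; record the reason."""
    parsed = urlparse(url)
    if CALENDAR_RE.search(parsed.path):
        suppression_log.append((url, "calendar_pattern"))
        return True
    page = parse_qs(parsed.query).get("page", ["0"])[0]
    if page.isdigit() and int(page) > MAX_PAGE and avg_new_tokens < token_floor:
        suppression_log.append((url, "deep_pagination_low_yield"))
        return True
    return False
```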

Duplicate Detection

Layers:

  1. URL Canonicalization: Lowercase host, strip default ports, sort query params, remove tracking params.
  2. Normalized Key Hash: SHA256(canonical_url).
  3. Content Fingerprint: SimHash or MinHash on cleaned text.
  4. Near-Dup Threshold: Similarity > 0.92 → discard; 0.85–0.92 → flag for review.

Store duplicate_map to trace suppression rationale.
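
A sketch of the first three layers, using a word-shingle Jaccard similarity as a lightweight stand-in for SimHash/MinHash; the tracking-parameter prefixes are assumptions:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PREFIXES = ("utm_", "ref", "sessionid")   # assumed strip list


def canonicalize(url: str) -> str:
    """Lowercase host, strip default ports and tracking params, sort remaining params."""
    parts = urlsplit(url)
    host = parts.hostname.lower() if parts.hostname else ""
    if parts.port and parts.port not in (80, 443):
        host = f"{host}:{parts.port}"
    query = sorted((k, v) for k, v in parse_qsl(parts.query)
                   if not k.startswith(TRACKING_PREFIXES))
    return urlunsplit((parts.scheme.lower(), host, parts.path, urlencode(query), ""))


def url_key(url: str) -> str:
    """Normalized key hash: SHA256 of the canonical URL."""
    return hashlib.sha256(canonicalize(url).encode()).hexdigest()


def shingle_similarity(a: str, b: str, k: int = 5) -> float:
    """Jaccard similarity over word shingles of cleaned text."""
    def shingles(text):
        words = text.split()
        return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / max(len(sa | sb), 1)
```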

Change Detection & Recrawl

Signals:

  • HTTP Headers: ETag + Last-Modified (conditional requests reduce bandwidth).
  • Sitemap lastmod: Baseline schedule + opportunistic prioritization.
  • Content Hash Diff: Rehash main content; if unchanged, extend next_fetch_time.
  • High-Volatility Paths: Release notes, pricing recrawled more frequently.

Dynamic interval = base_interval(path_type) * volatility_multiplier(change_rate_bucket).
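
A sketch of conditional fetching plus the dynamic-interval formula, assuming the third-party requests client; the base intervals and volatility multipliers are illustrative:

```python
import requests  # assumed HTTP client

# Illustrative base intervals (hours) and volatility multipliers.
BASE_INTERVAL_H = {"release_notes": 6, "pricing": 12, "docs": 72, "blog": 168}
VOLATILITY_MULT = {"high": 0.5, "medium": 1.0, "low": 2.0}


def next_interval_hours(path_type: str, change_rate_bucket: str) -> float:
    """Dynamic interval = base_interval(path_type) * volatility_multiplier(bucket)."""
    return BASE_INTERVAL_H.get(path_type, 72) * VOLATILITY_MULT.get(change_rate_bucket, 1.0)


def conditional_fetch(url: str, etag: str | None, last_modified: str | None):
    """Issue a conditional GET; a 304 means unchanged, so extend next_fetch_time."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:
        return None  # unchanged: push next_fetch_time out
    return resp.text
```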

Politeness & Rate Limits

  • robots.txt Parser: Cache & revalidate every 12h; enforce disallow before queue insertion.
  • Adaptive Concurrency: Token bucket per host; adjust concurrency using EWMA of recent request latency.
  • Crawl Delay: Honor explicit directives; fallback to self-calculated delay if absent.
  • Retries: Exponential backoff on 429/503 with jitter.
  • User-Agent: Include a contact URL or email for abuse reporting.
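
A politeness sketch along these lines, using the standard-library urllib.robotparser; the user-agent string, refill rates, and backoff constants are illustrative assumptions:

```python
import random
import time
from urllib.robotparser import RobotFileParser

# Assumed user-agent string; the contact URL is a placeholder.
USER_AGENT = "focused-crawler/1.0 (+https://example.com/crawler-contact)"


def allowed(url: str, robots_url: str) -> bool:
    """Check robots.txt before queue insertion (cache and revalidate in practice)."""
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp.can_fetch(USER_AGENT, url)


class HostTokenBucket:
    """Per-host token bucket: refill_rate tokens per second, up to capacity."""

    def __init__(self, refill_rate: float, capacity: float):
        self.refill_rate, self.capacity = refill_rate, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def acquire(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter for 429/503 retries."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```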

Observability

Metrics & logs:

| Metric | Purpose |
| --- | --- |
| fetch_success_rate | Detect connectivity / blocking |
| avg_response_time_ms | Latency baseline |
| frontier_depth | Backlog health |
| duplicate_discard_rate | Noise tuning |
| change_yield | Efficiency of recrawls |
| robots_denials | Scope misconfig detection |

Each trace sample contains: url, status, bytes, content_hash_prefix, discovered_links_count, suppression_reason.
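
A minimal sketch of emitting one structured trace record per fetch; the logger name and JSON layout are assumptions:

```python
import json
import logging

logger = logging.getLogger("crawler.trace")


def log_trace(url: str, status: int, body: bytes, content_hash: str,
              discovered_links: int, suppression_reason: str | None = None):
    """Emit one structured trace record per fetch; fields follow the list above."""
    logger.info(json.dumps({
        "url": url,
        "status": status,
        "bytes": len(body),
        "content_hash_prefix": content_hash[:12],
        "discovered_links_count": discovered_links,
        "suppression_reason": suppression_reason,
    }))
```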

Failure Recovery

Strategies:

  • Frontier Snapshotting: Serialize queues periodically to durable storage.
  • Idempotent Processing: A deterministic content hash prevents duplicate downstream extraction.
  • Partial Resume: On restart, rebuild in-flight tasks; skip completed by hash presence.
  • Retry Classes: Network vs 5xx vs parse errors; cap attempts; quarantine persistent failures.
  • Circuit Breakers: Pause host if consecutive failures exceed threshold.
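
A sketch of a per-host circuit breaker along these lines; the failure threshold and cooldown are assumed defaults:

```python
import time


class HostCircuitBreaker:
    """Pause a host after N consecutive failures; reopen after a cooldown."""

    def __init__(self, threshold: int = 5, cooldown_s: float = 300.0):
        self.threshold, self.cooldown_s = threshold, cooldown_s
        self.failures = 0
        self.paused_until = 0.0

    def record(self, success: bool):
        """Track consecutive failures; trip the breaker at the threshold."""
        self.failures = 0 if success else self.failures + 1
        if self.failures >= self.threshold:
            self.paused_until = time.time() + self.cooldown_s

    def available(self) -> bool:
        """True if the host may be fetched right now."""
        return time.time() >= self.paused_until
```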

Key Takeaways

  • Explicit goals + metrics convert crawling from guesswork to optimization.
  • Multi-queue frontier + composite scoring maximizes high-value coverage early.
  • Early trap/noise suppression prevents exponential index bloat.
  • Layered duplicate & change detection conserves bandwidth and embed cost.
  • Observability & recovery design are core, not optional extras.