Building a Focused Website Crawler

crawler • focus • web • extraction

General-purpose breadth-first crawling produces bloated corpora: pagination loops, tag clouds, low-value archives, parameter permutations. A focused crawler maximizes useful knowledge per fetch by scoring and sequencing URL discovery against explicit objectives (coverage SLAs, freshness targets, noise ceilings). Treat the crawler as an adaptive, metrics-driven subsystem—NOT a throwaway ingestion script. This guide details architecture components, scoring strategies, change detection, politeness, and recovery patterns.

Goals of a Focused Crawler

Primary objectives:

| Goal | Definition | Metric |
| --- | --- | --- |
| Signal Density | Ratio of retained main content tokens / total fetched tokens | density_ratio |
| Coverage Efficiency | % of priority pages fetched within budget window | p1_coverage_rate |
| Freshness | Average age since last verified crawl | avg_staleness_days |
| Noise Suppression | % of fetched pages later discarded | discard_rate |
| Incremental Cost | Re-fetched pages that changed / total re-fetched | change_yield |

Each crawl run should emit these to enable tuning.
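
A minimal sketch of computing these metrics from raw counters at the end of a run; the counter names and the JSON emission are assumptions, not a prescribed interface:

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class CrawlRunMetrics:
    """Per-run quality metrics; field names mirror the goals table above."""
    density_ratio: float       # retained main-content tokens / total fetched tokens
    p1_coverage_rate: float    # priority pages fetched / priority pages planned
    avg_staleness_days: float  # mean age since last verified crawl
    discard_rate: float        # fetched pages later discarded / fetched pages
    change_yield: float        # re-fetched pages that changed / total re-fetched


def emit_metrics(retained_tokens, fetched_tokens, p1_fetched, p1_planned,
                 staleness_days_sum, verified_pages, discarded, fetched,
                 changed_refetches, total_refetches) -> str:
    """Derive the metric set from raw counters and serialize it as JSON."""
    metrics = CrawlRunMetrics(
        density_ratio=retained_tokens / max(fetched_tokens, 1),
        p1_coverage_rate=p1_fetched / max(p1_planned, 1),
        avg_staleness_days=staleness_days_sum / max(verified_pages, 1),
        discard_rate=discarded / max(fetched, 1),
        change_yield=changed_refetches / max(total_refetches, 1),
    )
    return json.dumps(asdict(metrics))
```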

URL Frontier Management

Structure:

  • Multi-Queue Frontier: Separate priority lanes (p0 critical, p1 docs, p2 blog, p3 long-tail) implemented via min-heap keyed by (next_fetch_time, -score).
  • Adaptive Throttling: Maintain a moving average of latency and error rate; reduce concurrency when 4xx/5xx rates spike.
  • Backpressure: Cap frontier size (e.g., 100k); on overflow, evict the lowest-scoring entries or defer them to the next cycle.
  • Checkpointing: Persist frontier state + fetch cursor every N URLs for resumption.

Avoid a single monolithic queue: heterogeneous content demands differentiated service levels.
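
A minimal frontier sketch under these constraints, with one min-heap per lane keyed by (next_fetch_time, -score); the lane labels and the pickle-based checkpointing are illustrative assumptions:

```python
import heapq
import pickle
import time

LANES = ("p0", "p1", "p2", "p3")   # critical, docs, blog, long-tail (assumed labels)
MAX_FRONTIER = 100_000             # backpressure cap from the text


class Frontier:
    def __init__(self):
        # One min-heap per priority lane, ordered by (next_fetch_time, -score, url).
        self.lanes = {lane: [] for lane in LANES}
        self.size = 0

    def push(self, lane: str, url: str, score: float, next_fetch_time: float = 0.0):
        if self.size >= MAX_FRONTIER:
            return  # overflow: caller may instead evict lowest-score entries or defer
        heapq.heappush(self.lanes[lane], (next_fetch_time, -score, url))
        self.size += 1

    def pop(self):
        """Return the best due URL, preferring higher-priority lanes."""
        now = time.time()
        for lane in LANES:
            heap = self.lanes[lane]
            if heap and heap[0][0] <= now:
                self.size -= 1
                next_time, neg_score, url = heapq.heappop(heap)
                return lane, url, -neg_score
        return None  # nothing due yet

    def checkpoint(self, path: str):
        """Persist frontier state for resumption (pickle as a simple stand-in)."""
        with open(path, "wb") as fh:
            pickle.dump(self.lanes, fh)
```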

Relevance Scoring Heuristics

Composite score = base_path_weight + anchor_signal + freshness_hint - duplication_risk - trap_risk.

Components:

  • base_path_weight: Regex/prefix mapping (/docs, /guide, /api higher than /blog/tag/).
  • anchor_signal: TF‑IDF weight of anchor text tokens vs domain vocabulary.
  • freshness_hint: Sibling URLs containing dates (e.g., /2025/09/) → slight boost early, decays fast.
  • duplication_risk: Query parameter count; presence of utm_/session patterns.
  • trap_risk: Calendar patterns (/2025/09/14/) or page number > threshold.

Normalize scores 0–1; clamp extremes for stability.
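
A sketch of the composite score, assuming hypothetical path weights, component weights, and risk values; anchor_signal and freshness_hint are supplied by the caller, already normalized to 0–1:

```python
import re
from urllib.parse import urlparse, parse_qs

# Illustrative path weights; tune against your own site structure.
PATH_WEIGHTS = [(r"^/docs", 1.0), (r"^/guide", 0.9), (r"^/api", 0.9),
                (r"^/blog/tag/", 0.1), (r"^/blog", 0.4)]
CALENDAR_RE = re.compile(r"/(19|20)\d{2}/(0[1-9]|1[0-2])/")


def score_url(url: str, anchor_signal: float, freshness_hint: float) -> float:
    """Composite = base_path_weight + anchor_signal + freshness_hint
                   - duplication_risk - trap_risk, clamped to [0, 1]."""
    parsed = urlparse(url)
    base = next((w for pat, w in PATH_WEIGHTS if re.match(pat, parsed.path)), 0.3)

    params = parse_qs(parsed.query)
    duplication_risk = min(0.1 * len(params), 0.5)
    if any(k.startswith("utm_") or k == "sessionid" for k in params):
        duplication_risk += 0.3

    trap_risk = 0.4 if CALENDAR_RE.search(parsed.path) else 0.0

    raw = base + 0.3 * anchor_signal + 0.1 * freshness_hint - duplication_risk - trap_risk
    return max(0.0, min(1.0, raw))  # normalize and clamp to 0-1 for stability
```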

Trap & Noise Avoidance

Detectors:

  • Calendar Regex: /(19|20)\d{2}\/(0[1-9]|1[0-2])\/([0-2][0-9]|3[01])\// with low anchor diversity.
  • Pagination Depth: If param page > 10 and average new content tokens < threshold → halt.
  • Infinite Scroll APIs: Repeated JSON pattern with stable hashes—stop after N unchanged.
  • Session/Tracking Params: Strip & canonicalize (?utm_, ?sessionid, ?ref).

Maintain a suppression log (url, reason) for audit.
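
A sketch of the calendar and pagination detectors feeding a suppression log; the page-depth cutoff and token floor are assumed values:

```python
import re
from urllib.parse import urlparse, parse_qs

CALENDAR_RE = re.compile(r"/(19|20)\d{2}/(0[1-9]|1[0-2])/([0-2][0-9]|3[01])/")
MAX_PAGE = 10                                  # pagination depth cutoff from the text
suppression_log: list[tuple[str, str]] = []    # (url, reason) pairs for audit


def check_traps(url: str, avg_new_tokens: float, token_floor: float = 50.0) -> bool:
    """Return True if the URL should be suppressed; record the reason."""
    parsed = urlparse(url)
    if CALENDAR_RE.search(parsed.path):
        suppression_log.append((url, "calendar_pattern"))
        return True
    page = parse_qs(parsed.query).get("page", ["0"])[0]
    if page.isdigit() and int(page) > MAX_PAGE and avg_new_tokens < token_floor:
        suppression_log.append((url, "deep_pagination_low_yield"))
        return True
    return False
```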

Duplicate Detection

Layers:

  1. URL Canonicalization: Lowercase host, strip default ports, sort query params, remove tracking params.
  2. Normalized Key Hash: SHA256(canonical_url).
  3. Content Fingerprint: SimHash or MinHash on cleaned text.
  4. Near-Dup Threshold: Similarity > 0.92 → discard; 0.85–0.92 → flag for review.

Store duplicate_map to trace suppression rationale.
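
A sketch of the first three layers, using a word-shingle Jaccard similarity as a lightweight stand-in for SimHash/MinHash; the tracking-parameter prefixes are assumptions:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PREFIXES = ("utm_", "ref", "sessionid")   # assumed strip list


def canonicalize(url: str) -> str:
    """Lowercase host, strip default ports and tracking params, sort remaining params."""
    parts = urlsplit(url)
    host = parts.hostname.lower() if parts.hostname else ""
    if parts.port and parts.port not in (80, 443):
        host = f"{host}:{parts.port}"
    query = sorted((k, v) for k, v in parse_qsl(parts.query)
                   if not k.startswith(TRACKING_PREFIXES))
    return urlunsplit((parts.scheme.lower(), host, parts.path, urlencode(query), ""))


def url_key(url: str) -> str:
    """Normalized key hash: SHA256 of the canonical URL."""
    return hashlib.sha256(canonicalize(url).encode()).hexdigest()


def shingle_similarity(a: str, b: str, k: int = 5) -> float:
    """Jaccard similarity over word shingles of cleaned text."""
    def shingles(text):
        words = text.split()
        return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / max(len(sa | sb), 1)
```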

Change Detection & Recrawl

Signals:

  • HTTP Headers: ETag + Last-Modified (conditional requests reduce bandwidth).
  • Sitemap lastmod: Baseline schedule + opportunistic prioritization.
  • Content Hash Diff: Rehash main content; if unchanged, extend next_fetch_time.
  • High-Volatility Paths: Release notes, pricing recrawled more frequently.

Dynamic interval = base_interval(path_type) * volatility_multiplier(change_rate_bucket).
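
A sketch of conditional fetching plus the dynamic-interval formula, assuming the third-party requests client; the base intervals and volatility multipliers are illustrative:

```python
import requests  # assumed HTTP client

# Illustrative base intervals (hours) and volatility multipliers.
BASE_INTERVAL_H = {"release_notes": 6, "pricing": 12, "docs": 72, "blog": 168}
VOLATILITY_MULT = {"high": 0.5, "medium": 1.0, "low": 2.0}


def next_interval_hours(path_type: str, change_rate_bucket: str) -> float:
    """Dynamic interval = base_interval(path_type) * volatility_multiplier(bucket)."""
    return BASE_INTERVAL_H.get(path_type, 72) * VOLATILITY_MULT.get(change_rate_bucket, 1.0)


def conditional_fetch(url: str, etag: str | None, last_modified: str | None):
    """Issue a conditional GET; a 304 means unchanged, so extend next_fetch_time."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:
        return None  # unchanged: push next_fetch_time out
    return resp.text
```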

Politeness & Rate Limits

  • robots.txt Parser: Cache & revalidate every 12h; enforce disallow before queue insertion.
  • Adaptive Concurrency: Token bucket per host; adjust concurrency using EWMA of recent request latency.
  • Crawl Delay: Honor explicit directives; fallback to self-calculated delay if absent.
  • Retries: Exponential backoff on 429/503 with jitter.
  • User-Agent: Include a contact URL or email for abuse reporting.
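
A politeness sketch along these lines, using the standard-library urllib.robotparser; the user-agent string, refill rates, and backoff constants are illustrative assumptions:

```python
import random
import time
from urllib.robotparser import RobotFileParser

# Assumed user-agent string; the contact URL is a placeholder.
USER_AGENT = "focused-crawler/1.0 (+https://example.com/crawler-contact)"


def allowed(url: str, robots_url: str) -> bool:
    """Check robots.txt before queue insertion (cache and revalidate in practice)."""
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp.can_fetch(USER_AGENT, url)


class HostTokenBucket:
    """Per-host token bucket: refill_rate tokens per second, up to capacity."""

    def __init__(self, refill_rate: float, capacity: float):
        self.refill_rate, self.capacity = refill_rate, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def acquire(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter for 429/503 retries."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```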

Observability

Metrics & logs:

| Metric | Purpose |
| --- | --- |
| fetch_success_rate | Detect connectivity / blocking |
| avg_response_time_ms | Latency baseline |
| frontier_depth | Backlog health |
| duplicate_discard_rate | Noise tuning |
| change_yield | Efficiency of recrawls |
| robots_denials | Scope misconfig detection |

Each trace sample contains: url, status, bytes, content_hash_prefix, discovered_links_count, suppression_reason.
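
A minimal sketch of emitting one structured trace record per fetch; the logger name and JSON layout are assumptions:

```python
import json
import logging

logger = logging.getLogger("crawler.trace")


def log_trace(url: str, status: int, body: bytes, content_hash: str,
              discovered_links: int, suppression_reason: str | None = None):
    """Emit one structured trace record per fetch; fields follow the list above."""
    logger.info(json.dumps({
        "url": url,
        "status": status,
        "bytes": len(body),
        "content_hash_prefix": content_hash[:12],
        "discovered_links_count": discovered_links,
        "suppression_reason": suppression_reason,
    }))
```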

Failure Recovery

Strategies:

  • Frontier Snapshotting: Serialize queues periodically to durable storage.
  • Idempotent Processing: A deterministic content hash prevents duplicate downstream extraction.
  • Partial Resume: On restart, rebuild in-flight tasks; skip completed by hash presence.
  • Retry Classes: Network vs 5xx vs parse errors; cap attempts; quarantine persistent failures.
  • Circuit Breakers: Pause host if consecutive failures exceed threshold.
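
A sketch of a per-host circuit breaker along these lines; the failure threshold and cooldown are assumed defaults:

```python
import time


class HostCircuitBreaker:
    """Pause a host after N consecutive failures; reopen after a cooldown."""

    def __init__(self, threshold: int = 5, cooldown_s: float = 300.0):
        self.threshold, self.cooldown_s = threshold, cooldown_s
        self.failures = 0
        self.paused_until = 0.0

    def record(self, success: bool):
        """Track consecutive failures; trip the breaker at the threshold."""
        self.failures = 0 if success else self.failures + 1
        if self.failures >= self.threshold:
            self.paused_until = time.time() + self.cooldown_s

    def available(self) -> bool:
        """True if the host may be fetched right now."""
        return time.time() >= self.paused_until
```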

Key Takeaways

  • Explicit goals + metrics convert crawling from guesswork to optimization.
  • Multi-queue frontier + composite scoring maximizes high-value coverage early.
  • Early trap/noise suppression prevents exponential index bloat.
  • Layered duplicate & change detection conserves bandwidth and embed cost.
  • Observability & recovery design are core, not optional extras.