High‑quality AI answers begin with disciplined upstream acquisition. Poorly scoped crawling amplifies noise, inflates index cost, and degrades retrieval precision. Over‑aggressive normalization erodes nuance (e.g., table semantics, code annotations). This guide details an end‑to‑end methodology: scoping → focused crawling → dynamic rendering policy → content extraction → normalization → deduplication → freshness and quality scoring → export packaging.
Crawl Scoping
Define scope explicitly and put it under version control.
- Seed Inventory: canonical docs root, pricing, support KB, API reference, changelog, legal, blog pillars.
- Allowlist Patterns: prefixes /docs, /pricing, /help; regex for versioned APIs (e.g., /v(\d+)/reference/).
- Denylist: marketing campaign params, tag/category listings, infinite scroll collections, search result pages.
- Depth Limits: hard depth cap for the blog to avoid pagination bloat; unlimited for the docs tree.
- Budget Guard: Max pages per domain per run; track historical delta (spikes may indicate crawler traps).
Scoping metrics: allowed_pages_count, denied_pages_count, % new vs previous run.
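The sketch below shows one way such a scope definition could be enforced in code; the prefixes, regexes, limits, and the in_scope helper are illustrative assumptions, not a prescribed configuration.

```python
import re
from urllib.parse import urlparse

# Hypothetical scope config; prefixes, regexes, and limits are illustrative only.
ALLOW_PREFIXES = ("/docs", "/pricing", "/help")
ALLOW_REGEXES = [re.compile(r"^/v\d+/reference/")]
DENY_REGEXES = [
    re.compile(r"[?&]utm_"),            # marketing campaign params
    re.compile(r"^/(tag|category)/"),   # tag/category listing pages
    re.compile(r"^/search"),            # search result pages
]

def in_scope(url: str, depth: int, pages_crawled: int,
             max_depth: int = 10, max_pages: int = 5000) -> bool:
    """Apply allowlist, denylist, depth limit, and per-run budget guard."""
    parsed = urlparse(url)
    path_q = parsed.path + ("?" + parsed.query if parsed.query else "")
    if pages_crawled >= max_pages or depth > max_depth:
        return False
    if any(rx.search(path_q) for rx in DENY_REGEXES):
        return False
    return (parsed.path.startswith(ALLOW_PREFIXES)
            or any(rx.search(parsed.path) for rx in ALLOW_REGEXES))
```

Keeping this definition in version control gives each run a reviewable diff of what entered or left scope.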
Focused Crawler Design
Prioritization queue ranks next URL by composite score:
score = base_priority(path_type) + freshness_need + internal_link_rank - duplicate_risk.
- base_priority: docs > api_reference > pricing > legal > blog.
- freshness_need: higher if last_crawl_age > target_sla.
- internal_link_rank: PageRank‑style weight from in‑site linking.
- duplicate_risk: penalty if URL params present / session IDs.
Maintain a fingerprint set (normalized URL + content hash) to avoid loops. Change detection strategies: ETag/Last‑Modified, sitemap lastmod diff, periodic hash diff on high‑volatility pages (pricing, release notes).
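A minimal sketch of the composite score and the fingerprint set, with illustrative weights and hypothetical helper names (the base-priority table, priority_score, and fingerprint are assumptions):

```python
import hashlib

# Illustrative priorities reflecting docs > api_reference > pricing > legal > blog.
BASE_PRIORITY = {"docs": 100, "api_reference": 80, "pricing": 60, "legal": 40, "blog": 20}

def priority_score(path_type: str, last_crawl_age_days: float, target_sla_days: float,
                   internal_link_rank: float, has_params_or_session: bool) -> float:
    """score = base_priority + freshness_need + internal_link_rank - duplicate_risk."""
    freshness_need = max(0.0, last_crawl_age_days - target_sla_days)
    duplicate_risk = 25.0 if has_params_or_session else 0.0
    return BASE_PRIORITY.get(path_type, 10) + freshness_need + internal_link_rank - duplicate_risk

def fingerprint(normalized_url: str, content: str) -> str:
    """Normalized URL + content hash, kept in a set to detect loops and unchanged pages."""
    return hashlib.sha256((normalized_url + "\n" + content).encode("utf-8")).hexdigest()
```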
Rendering & Dynamic Content
Adopt a “render only when necessary” policy:
- Heuristics for dynamic need: low initial HTML text ratio, presence of app root div, critical selectors missing.
- Use headless browser pool with max concurrency to protect origin.
- Block non‑essential resources (ads, analytics) via request interception.
- Script Timeout: abort >6s render to prevent queue starvation.
- Snapshot final DOM, not network waterfall.
Log render_rate, avg_render_time, aborted_renders.
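One possible heuristic for the "render only when necessary" decision is sketched below; the text-ratio threshold, the app-root regex, and the needs_render helper are assumptions for illustration.

```python
import re

APP_ROOT_RE = re.compile(r'<div[^>]+id="(app|root|__next)"', re.I)  # common SPA mount points
TAG_RE = re.compile(r"<[^>]+>")

def needs_render(html: str, critical_selectors_present: bool,
                 min_text_ratio: float = 0.05) -> bool:
    """Send to the headless pool only when the static HTML looks empty or app-shell-like."""
    visible_text = TAG_RE.sub(" ", html)
    text_ratio = len(visible_text.strip()) / max(len(html), 1)
    return (text_ratio < min_text_ratio
            or bool(APP_ROOT_RE.search(html))
            or not critical_selectors_present)
```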
Content Extraction
Steps:
- Main Region Identification: <main>, article, or role=main; fall back to the highest text-density subtree (see the sketch after this list).
- Boilerplate Removal: nav, footer, aside, cookie prompts, newsletter modals.
- Heading Hierarchy Capture: Record H1..H4 sequence; repair skipped levels.
- Code & Preformatted Blocks: Preserve indentation; attach language tag.
- Table Handling: Extract header row; serialize to markdown table + store raw HTML snapshot for future transforms.
- Link Normalization: Convert relative → absolute; store outbound link count as heuristic for hub pages.
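A minimal extraction sketch, assuming BeautifulSoup is available (production pipelines often layer readability-style scoring on top); the boilerplate tag list and the text-length density proxy are simplifications.

```python
from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package

BOILERPLATE_TAGS = ["nav", "footer", "aside", "script", "style", "form"]

def extract_main_text(html: str) -> str:
    """Prefer <main>/<article>/role=main; otherwise fall back to the densest subtree."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(BOILERPLATE_TAGS):
        tag.decompose()
    region = (soup.find("main")
              or soup.find("article")
              or soup.find(attrs={"role": "main"}))
    if region is None:
        # Crude density proxy: the <div> subtree with the most visible text.
        candidates = soup.find_all("div") or [soup]
        region = max(candidates, key=lambda t: len(t.get_text(strip=True)))
    return region.get_text("\n", strip=True)
```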
Asset & Media Handling
- Images: capture alt text; skip decorative (alt="" or role=presentation). Optionally OCR diagrams flagged by missing alt + size threshold.
- Video/Audio: associate a transcript if available (WebVTT/SRT). If no transcript exists and the page type is high priority, enqueue an async transcription job.
- Large Media (>5MB): metadata only; do not inline.
- Attachment Links (PDF): queue secondary extraction (PDF → text) with separate pipeline for heavier cost.
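One way to encode the image rules above as a decision function; the OCR area threshold and the return labels are illustrative assumptions.

```python
from typing import Optional

OCR_AREA_THRESHOLD = 300_000  # px^2; illustrative size threshold for diagram OCR

def image_action(alt: Optional[str], role: Optional[str], width: int, height: int) -> str:
    """Map an image to one of the handling rules listed above."""
    if alt == "" or role == "presentation":
        return "skip_decorative"
    if alt is None and width * height > OCR_AREA_THRESHOLD:
        return "ocr_candidate"   # missing alt + large size -> optional diagram OCR
    return "capture_alt_text"
```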
Language & Locale Handling
Detect locale via URL pattern (/es/, /fr-FR/) plus hreflang tags. Group variants under a canonical slug. If a translation is missing, the fallback strategy records a coverage gap (used later in retrieval fallback). Suppress indexing of machine‑translated placeholders flagged by extremely low lexical diversity.
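A compact sketch of locale detection and canonical-slug grouping, assuming locale prefixes follow the /xx/ or /xx-YY/ pattern; the default locale and helper names are assumptions.

```python
import re

LOCALE_PATH_RE = re.compile(r"^/([a-z]{2})(?:-([A-Z]{2}))?/")  # matches /es/, /fr-FR/, ...

def detect_locale(path: str, hreflang: str = "") -> str:
    """Prefer an explicit hreflang value, then the URL prefix, then an assumed default."""
    if hreflang:
        return hreflang
    m = LOCALE_PATH_RE.match(path)
    if m:
        return m.group(1) + (f"-{m.group(2)}" if m.group(2) else "")
    return "en"  # assumed site default

def canonical_slug(path: str) -> str:
    """Group locale variants under one canonical slug by stripping the locale prefix."""
    return LOCALE_PATH_RE.sub("/", path, count=1)
```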
Normalization & Cleaning
Transformations:
- Trim excessive whitespace, normalize line breaks.
- Standardize heading ladder (if H4 appears without H3 ancestor, promote/demote as needed).
- Convert smart quotes & unicode punctuation to canonical forms while preserving code spans verbatim.
- Metadata enrichment: page_type, updated_at, locale, product_area, slug_hash.
- Compute content_quality_score (density + structural richness + low boilerplate ratio).
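A hedged sketch of the normalization pass and a toy quality score; the punctuation map, weights, and function names are illustrative and would be tuned per corpus.

```python
import re

PUNCT_MAP = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"', "\u2013": "-", "\u2014": "-"}

def normalize_text(text: str) -> str:
    """Whitespace + punctuation normalization; mask code spans before calling this to keep them verbatim."""
    text = text.replace("\r\n", "\n")
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)   # allow at most one blank line
    for src, dst in PUNCT_MAP.items():
        text = text.replace(src, dst)
    return text.strip()

def content_quality_score(text_density: float, heading_depth: int, boilerplate_ratio: float) -> float:
    """Illustrative weighting of density + structural richness - boilerplate; weights are assumptions."""
    return 0.5 * text_density + 0.2 * (min(heading_depth, 4) / 4) - 0.3 * boilerplate_ratio
```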
Deduplication
Techniques:
- Text Shingles (size 5–7) → MinHash signature; compute Jaccard similarity.
- Thresholds: similarity >0.92 is treated as a duplicate; 0.80–0.92 as a near‑duplicate (retain the highest-authority or freshest copy).
- Canonical Selection Heuristics: prefer canonical tag, shorter URL path, higher inbound link count, fresher updated_at.
- Store duplicate_map for audit & future regression checks.
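For illustration, the sketch below computes word shingles and exact Jaccard similarity against the thresholds above; at corpus scale the exact set comparison would be replaced by MinHash signatures, as noted in the list.

```python
import re

def shingles(text: str, k: int = 5) -> set:
    """Word shingles of size k (5-7 per the guidance above)."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 0))}

def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def classify(similarity: float) -> str:
    """Thresholds from the list above: >0.92 duplicate, 0.80-0.92 near-duplicate."""
    if similarity > 0.92:
        return "duplicate"
    if similarity >= 0.80:
        return "near_duplicate"
    return "unique"
```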
Freshness & Recrawl Strategy
Assign recrawl_interval using adaptive decay:
recrawl_interval = base_sla(page_type) * freshness_multiplier(change_rate_bucket).
Collect signals: last_modified header, observed change frequency, sitemap lastmod variance, release cadence (for release notes path). Maintain a priority heap; push pages earlier if linked from “/changelog” recently. Track staleness histogram (P50, P90 days since last refresh). Alert if P90 breaches target (e.g., >30 days for docs).
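A direct transcription of the adaptive-decay formula, with illustrative SLA and multiplier tables:

```python
# Illustrative SLA and multiplier tables; real values are tuned per corpus.
BASE_SLA_DAYS = {"docs": 14, "api_reference": 7, "pricing": 3, "legal": 30, "blog": 45}
FRESHNESS_MULTIPLIER = {"high_churn": 0.5, "medium_churn": 1.0, "low_churn": 2.0}

def recrawl_interval_days(page_type: str, change_rate_bucket: str) -> float:
    """recrawl_interval = base_sla(page_type) * freshness_multiplier(change_rate_bucket)."""
    return BASE_SLA_DAYS.get(page_type, 30) * FRESHNESS_MULTIPLIER.get(change_rate_bucket, 1.0)
```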
Quality Scoring
Per page metrics:
| Metric | Purpose |
|---|---|
| text_density | Detect thin / navigation pages |
| boilerplate_ratio | Ensure stripping effectiveness |
| heading_depth | Structural richness |
| duplicate_flag | Index suppression decision |
| freshness_age_days | Recrawl urgency |
| locale_coverage | Translation completeness |
Aggregate weekly to drive ingestion tuning.
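A small aggregation sketch, assuming each page record carries the metric fields from the table above; the percentile handling is simplified.

```python
from statistics import median, quantiles

def weekly_summary(pages: list) -> dict:
    """Roll per-page metrics into weekly tuning signals; field names follow the table above."""
    ages = [p["freshness_age_days"] for p in pages]
    return {
        "pages": len(pages),
        "duplicate_rate": sum(p["duplicate_flag"] for p in pages) / max(len(pages), 1),
        "mean_text_density": sum(p["text_density"] for p in pages) / max(len(pages), 1),
        "staleness_p50_days": median(ages) if ages else 0,
        "staleness_p90_days": quantiles(ages, n=10)[-1] if len(ages) >= 2 else (ages[0] if ages else 0),
    }
```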
Export Formats
Provide deterministic, reproducible outputs:
- Raw Page Snapshot (clean HTML) for forensic debugging.
- Structured JSON: { url, locale, updated_at, sections: […] }.
- Chunk Manifest: array of { chunk_id, url, heading_path, text, tokens, hash, metadata }.
- Embedding Payload: minimal fields needed for vector DB ingest (chunk_id, text, metadata, version markers).
- Quality Report: CSV/JSON with scoring metrics feeding dashboards.
Version all exports; include schema_version in manifest root.
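A minimal sketch of chunk-manifest records and export, assuming SHA-based IDs and a whitespace token proxy; the field names mirror the manifest description above, and the schema version string is an assumption.

```python
import hashlib
import json

SCHEMA_VERSION = "1.0"  # assumed schema version string

def chunk_record(url: str, heading_path: list, text: str, metadata: dict) -> dict:
    """One chunk-manifest entry; chunk_id stays stable while url + heading_path are unchanged."""
    chunk_id = hashlib.sha1((url + "|" + "/".join(heading_path)).encode("utf-8")).hexdigest()[:16]
    return {
        "chunk_id": chunk_id,
        "url": url,
        "heading_path": heading_path,
        "text": text,
        "tokens": len(text.split()),  # rough proxy; production pipelines use a real tokenizer
        "hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "metadata": metadata,
    }

def write_manifest(path: str, chunks: list) -> None:
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"schema_version": SCHEMA_VERSION, "chunks": chunks}, fh, ensure_ascii=False, indent=2)
```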
Key Takeaways
- Crawl scope discipline prevents index bloat & relevancy decay.
- Focused prioritization + adaptive recrawl sustains freshness efficiently.
- Semantic chunking + stable IDs enable safe incremental re‑embeds.
- Aggressive boilerplate stripping boosts embedding signal-to-noise.
- Structured exports + quality metrics create an auditable ingestion pipeline.