High‑quality AI answers begin with disciplined upstream acquisition. Poorly scoped crawling amplifies noise, inflates index cost, and degrades retrieval precision. Over‑aggressive normalization erodes nuance (e.g., table semantics, code annotations). This guide details an end‑to‑end methodology: scoping → focused crawling → dynamic rendering policy → content extraction → normalization → deduplication → freshness and quality scoring → export packaging.
Crawl Scoping
Define scope explicitly and put it under version control.
- Seed Inventory: canonical docs root, pricing, support KB, API reference, changelog, legal, blog pillars.
- Allowlist Patterns: prefixes /docs, /pricing, /help; regex for versioned APIs (e.g., /v(\d+)/reference/).
- Denylist: marketing campaign params, tag/category listings, infinite scroll collections, search result pages.
- Depth Limits: hard depth cap for the blog to avoid pagination bloat; unlimited for the docs tree.
- Budget Guard: Max pages per domain per run; track historical delta (spikes may indicate crawler traps).
Scoping metrics: allowed_pages_count, denied_pages_count, % new vs previous run.
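The sketch below shows one way such a scope definition could be enforced in code; the prefixes, regexes, limits, and the in_scope helper are illustrative assumptions, not a prescribed configuration.

```python
import re
from urllib.parse import urlparse

# Hypothetical scope config; prefixes, regexes, and limits are illustrative only.
ALLOW_PREFIXES = ("/docs", "/pricing", "/help")
ALLOW_REGEXES = [re.compile(r"^/v\d+/reference/")]
DENY_REGEXES = [
    re.compile(r"[?&]utm_"),            # marketing campaign params
    re.compile(r"^/(tag|category)/"),   # tag/category listing pages
    re.compile(r"^/search"),            # search result pages
]

def in_scope(url: str, depth: int, pages_crawled: int,
             max_depth: int = 10, max_pages: int = 5000) -> bool:
    """Apply allowlist, denylist, depth limit, and per-run budget guard."""
    parsed = urlparse(url)
    path_q = parsed.path + ("?" + parsed.query if parsed.query else "")
    if pages_crawled >= max_pages or depth > max_depth:
        return False
    if any(rx.search(path_q) for rx in DENY_REGEXES):
        return False
    return (parsed.path.startswith(ALLOW_PREFIXES)
            or any(rx.search(parsed.path) for rx in ALLOW_REGEXES))
```

Keeping this definition in version control gives each run a reviewable diff of what entered or left scope.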
Focused Crawler Design
Prioritization queue ranks next URL by composite score:
score = base_priority(path_type) + freshness_need + internal_link_rank - duplicate_risk.
- base_priority: docs > api_reference > pricing > legal > blog.
- freshness_need: higher if last_crawl_age > target_sla.
- internal_link_rank: PageRank‑style weight from in‑site linking.
- duplicate_risk: penalty if URL params present / session IDs.
Maintain a fingerprint set (normalized URL + content hash) to avoid loops. Change detection strategies: ETag/Last‑Modified, sitemap lastmod diff, periodic hash diff on high‑volatility pages (pricing, release notes).
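A minimal sketch of the composite score and the fingerprint set, with illustrative weights and hypothetical helper names (the base-priority table, priority_score, and fingerprint are assumptions):

```python
import hashlib

# Illustrative priorities reflecting docs > api_reference > pricing > legal > blog.
BASE_PRIORITY = {"docs": 100, "api_reference": 80, "pricing": 60, "legal": 40, "blog": 20}

def priority_score(path_type: str, last_crawl_age_days: float, target_sla_days: float,
                   internal_link_rank: float, has_params_or_session: bool) -> float:
    """score = base_priority + freshness_need + internal_link_rank - duplicate_risk."""
    freshness_need = max(0.0, last_crawl_age_days - target_sla_days)
    duplicate_risk = 25.0 if has_params_or_session else 0.0
    return BASE_PRIORITY.get(path_type, 10) + freshness_need + internal_link_rank - duplicate_risk

def fingerprint(normalized_url: str, content: str) -> str:
    """Normalized URL + content hash, kept in a set to detect loops and unchanged pages."""
    return hashlib.sha256((normalized_url + "\n" + content).encode("utf-8")).hexdigest()
```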
Rendering & Dynamic Content
Adopt a “render only when necessary” policy:
- Heuristics for dynamic need: low initial HTML text ratio, presence of app root div, critical selectors missing.
- Use headless browser pool with max concurrency to protect origin.
- Block non‑essential resources (ads, analytics) via request interception.
- Script Timeout: abort >6s render to prevent queue starvation.
- Snapshot final DOM, not network waterfall.
Log render_rate, avg_render_time, aborted_renders.
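One possible heuristic for the "render only when necessary" decision is sketched below; the text-ratio threshold, the app-root regex, and the needs_render helper are assumptions for illustration.

```python
import re

APP_ROOT_RE = re.compile(r'<div[^>]+id="(app|root|__next)"', re.I)  # common SPA mount points
TAG_RE = re.compile(r"<[^>]+>")

def needs_render(html: str, critical_selectors_present: bool,
                 min_text_ratio: float = 0.05) -> bool:
    """Send to the headless pool only when the static HTML looks empty or app-shell-like."""
    visible_text = TAG_RE.sub(" ", html)
    text_ratio = len(visible_text.strip()) / max(len(html), 1)
    return (text_ratio < min_text_ratio
            or bool(APP_ROOT_RE.search(html))
            or not critical_selectors_present)
```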
Content Extraction
Steps:
- Main Region Identification: <main>, article, or role=main; fall back to the highest text-density subtree (see the sketch after this list).
- Boilerplate Removal: nav, footer, aside, cookie prompts, newsletter modals.
- Heading Hierarchy Capture: Record H1..H4 sequence; repair skipped levels.
- Code & Preformatted Blocks: Preserve indentation; attach language tag.
- Table Handling: Extract header row; serialize to markdown table + store raw HTML snapshot for future transforms.
- Link Normalization: Convert relative → absolute; store outbound link count as heuristic for hub pages.
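A minimal extraction sketch, assuming BeautifulSoup is available (production pipelines often layer readability-style scoring on top); the boilerplate tag list and the text-length density proxy are simplifications.

```python
from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package

BOILERPLATE_TAGS = ["nav", "footer", "aside", "script", "style", "form"]

def extract_main_text(html: str) -> str:
    """Prefer <main>/<article>/role=main; otherwise fall back to the densest subtree."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(BOILERPLATE_TAGS):
        tag.decompose()
    region = (soup.find("main")
              or soup.find("article")
              or soup.find(attrs={"role": "main"}))
    if region is None:
        # Crude density proxy: the <div> subtree with the most visible text.
        candidates = soup.find_all("div") or [soup]
        region = max(candidates, key=lambda t: len(t.get_text(strip=True)))
    return region.get_text("\n", strip=True)
```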
Asset & Media Handling
- Images: capture alt text; skip decorative (alt="" or role=presentation). Optionally OCR diagrams flagged by missing alt + size threshold.
- Video/Audio: associate a transcript if available (WebVTT/SRT). If no transcript exists and the page type is high priority, enqueue an async transcription job.
- Large Media (>5MB): metadata only; do not inline.
- Attachment Links (PDF): queue secondary extraction (PDF → text) with separate pipeline for heavier cost.
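One way to encode the image rules above as a decision function; the OCR area threshold and the return labels are illustrative assumptions.

```python
from typing import Optional

OCR_AREA_THRESHOLD = 300_000  # px^2; illustrative size threshold for diagram OCR

def image_action(alt: Optional[str], role: Optional[str], width: int, height: int) -> str:
    """Map an image to one of the handling rules listed above."""
    if alt == "" or role == "presentation":
        return "skip_decorative"
    if alt is None and width * height > OCR_AREA_THRESHOLD:
        return "ocr_candidate"   # missing alt + large size -> optional diagram OCR
    return "capture_alt_text"
```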
Language & Locale Handling
Detect locale via URL pattern (/es/, /fr-FR/) plus hreflang tags. Group variants under a canonical slug. If a translation is missing, the fallback strategy records a coverage gap (used later in retrieval fallback). Suppress indexing of machine‑translated placeholders flagged by extremely low lexical diversity.
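A compact sketch of locale detection and canonical-slug grouping, assuming locale prefixes follow the /xx/ or /xx-YY/ pattern; the default locale and helper names are assumptions.

```python
import re

LOCALE_PATH_RE = re.compile(r"^/([a-z]{2})(?:-([A-Z]{2}))?/")  # matches /es/, /fr-FR/, ...

def detect_locale(path: str, hreflang: str = "") -> str:
    """Prefer an explicit hreflang value, then the URL prefix, then an assumed default."""
    if hreflang:
        return hreflang
    m = LOCALE_PATH_RE.match(path)
    if m:
        return m.group(1) + (f"-{m.group(2)}" if m.group(2) else "")
    return "en"  # assumed site default

def canonical_slug(path: str) -> str:
    """Group locale variants under one canonical slug by stripping the locale prefix."""
    return LOCALE_PATH_RE.sub("/", path, count=1)
```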
Normalization & Cleaning
Transformations:
- Trim excessive whitespace, normalize line breaks.
- Standardize heading ladder (if H4 appears without H3 ancestor, promote/demote as needed).
- Convert smart quotes & unicode punctuation to canonical forms while preserving code spans verbatim.
- Metadata enrichment: page_type, updated_at, locale, product_area, slug_hash.
- Compute content_quality_score (density + structural richness + low boilerplate ratio).
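A hedged sketch of the normalization pass and a toy quality score; the punctuation map, weights, and function names are illustrative and would be tuned per corpus.

```python
import re

PUNCT_MAP = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"', "\u2013": "-", "\u2014": "-"}

def normalize_text(text: str) -> str:
    """Whitespace + punctuation normalization; mask code spans before calling this to keep them verbatim."""
    text = text.replace("\r\n", "\n")
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)   # allow at most one blank line
    for src, dst in PUNCT_MAP.items():
        text = text.replace(src, dst)
    return text.strip()

def content_quality_score(text_density: float, heading_depth: int, boilerplate_ratio: float) -> float:
    """Illustrative weighting of density + structural richness - boilerplate; weights are assumptions."""
    return 0.5 * text_density + 0.2 * (min(heading_depth, 4) / 4) - 0.3 * boilerplate_ratio
```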
Deduplication
Techniques:
- Text Shingles (size 5–7) → MinHash signature; compute Jaccard similarity.
- Thresholds: similarity >0.92 is treated as a duplicate; 0.80–0.92 as a near‑duplicate (retain the highest-authority or freshest copy).
- Canonical Selection Heuristics: prefer canonical tag, shorter URL path, higher inbound link count, fresher updated_at.
- Store duplicate_map for audit & future regression checks.
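For illustration, the sketch below computes word shingles and exact Jaccard similarity against the thresholds above; at corpus scale the exact set comparison would be replaced by MinHash signatures, as noted in the list.

```python
import re

def shingles(text: str, k: int = 5) -> set:
    """Word shingles of size k (5-7 per the guidance above)."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 0))}

def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def classify(similarity: float) -> str:
    """Thresholds from the list above: >0.92 duplicate, 0.80-0.92 near-duplicate."""
    if similarity > 0.92:
        return "duplicate"
    if similarity >= 0.80:
        return "near_duplicate"
    return "unique"
```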
Freshness & Recrawl Strategy
Assign recrawl_interval using adaptive decay:
recrawl_interval = base_sla(page_type) * freshness_multiplier(change_rate_bucket).
Collect signals: last_modified header, observed change frequency, sitemap lastmod variance, release cadence (for release notes path). Maintain a priority heap; push pages earlier if linked from “/changelog” recently. Track staleness histogram (P50, P90 days since last refresh). Alert if P90 breaches target (e.g., >30 days for docs).
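A direct transcription of the adaptive-decay formula, with illustrative SLA and multiplier tables:

```python
# Illustrative SLA and multiplier tables; real values are tuned per corpus.
BASE_SLA_DAYS = {"docs": 14, "api_reference": 7, "pricing": 3, "legal": 30, "blog": 45}
FRESHNESS_MULTIPLIER = {"high_churn": 0.5, "medium_churn": 1.0, "low_churn": 2.0}

def recrawl_interval_days(page_type: str, change_rate_bucket: str) -> float:
    """recrawl_interval = base_sla(page_type) * freshness_multiplier(change_rate_bucket)."""
    return BASE_SLA_DAYS.get(page_type, 30) * FRESHNESS_MULTIPLIER.get(change_rate_bucket, 1.0)
```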
Quality Scoring
Per page metrics:
| Metric | Purpose |
|---|---|
| text_density | Detect thin / navigation pages |
| boilerplate_ratio | Ensure stripping effectiveness |
| heading_depth | Structural richness |
| duplicate_flag | Index suppression decision |
| freshness_age_days | Recrawl urgency |
| locale_coverage | Translation completeness |
Aggregate weekly to drive ingestion tuning.
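A small aggregation sketch, assuming each page record carries the metric fields from the table above; the percentile handling is simplified.

```python
from statistics import median, quantiles

def weekly_summary(pages: list) -> dict:
    """Roll per-page metrics into weekly tuning signals; field names follow the table above."""
    ages = [p["freshness_age_days"] for p in pages]
    return {
        "pages": len(pages),
        "duplicate_rate": sum(p["duplicate_flag"] for p in pages) / max(len(pages), 1),
        "mean_text_density": sum(p["text_density"] for p in pages) / max(len(pages), 1),
        "staleness_p50_days": median(ages) if ages else 0,
        "staleness_p90_days": quantiles(ages, n=10)[-1] if len(ages) >= 2 else (ages[0] if ages else 0),
    }
```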
Export Formats
Provide deterministic, reproducible outputs:
- Raw Page Snapshot (clean HTML) for forensic debugging.
- Structured JSON: { url, locale, updated_at, sections: […] }.
- Chunk Manifest: array of { chunk_id, url, heading_path, text, tokens, hash, metadata }.
- Embedding Payload: minimal fields needed for vector DB ingest (chunk_id, text, metadata, version markers).
- Quality Report: CSV/JSON with scoring metrics feeding dashboards.
Version all exports; include schema_version in manifest root.
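A minimal sketch of chunk-manifest records and export, assuming SHA-based IDs and a whitespace token proxy; the field names mirror the manifest description above, and the schema version string is an assumption.

```python
import hashlib
import json

SCHEMA_VERSION = "1.0"  # assumed schema version string

def chunk_record(url: str, heading_path: list, text: str, metadata: dict) -> dict:
    """One chunk-manifest entry; chunk_id stays stable while url + heading_path are unchanged."""
    chunk_id = hashlib.sha1((url + "|" + "/".join(heading_path)).encode("utf-8")).hexdigest()[:16]
    return {
        "chunk_id": chunk_id,
        "url": url,
        "heading_path": heading_path,
        "text": text,
        "tokens": len(text.split()),  # rough proxy; production pipelines use a real tokenizer
        "hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "metadata": metadata,
    }

def write_manifest(path: str, chunks: list) -> None:
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"schema_version": SCHEMA_VERSION, "chunks": chunks}, fh, ensure_ascii=False, indent=2)
```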
Key Takeaways
- Crawl scope discipline prevents index bloat & relevancy decay.
- Focused prioritization + adaptive recrawl sustains freshness efficiently.
- Semantic chunking + stable IDs enable safe incremental re‑embeds.
- Aggressive boilerplate stripping boosts embedding signal-to-noise.
- Structured exports + quality metrics create an auditable ingestion pipeline.