Crawler Politeness Controls for AI Assistants
A polite crawler keeps AI assistants fresh without overloading or antagonizing origin sites. Apply the controls below to crawl safely and predictably.
Robots.txt and allowlists
- Fetch robots.txt before crawling; cache directives per host.
- Let tenants define explicit allowlists and denylists.
- Reject attempts to crawl external domains unless they are explicitly allowlisted; a combined robots/allowlist check is sketched below.
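A minimal sketch of the per-host robots.txt cache and tenant allowlist check, built on Python's standard urllib.robotparser. The CrawlerPolicy class, the allowlist/denylist structure, and the USER_AGENT token are illustrative assumptions, not CrawlBot's actual interface.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "CrawlBot"  # assumed token; substitute your crawler's real UA string


class CrawlerPolicy:
    """Caches robots.txt per host and enforces a tenant allowlist (illustrative)."""

    def __init__(self, allowed_hosts: set[str], denied_hosts: set[str] | None = None):
        self.allowed_hosts = allowed_hosts
        self.denied_hosts = denied_hosts or set()
        self._robots: dict[str, RobotFileParser] = {}  # per-host directive cache

    def _robots_for(self, host: str, scheme: str) -> RobotFileParser:
        # Fetch robots.txt once per host and reuse the parsed directives.
        if host not in self._robots:
            parser = RobotFileParser(f"{scheme}://{host}/robots.txt")
            parser.read()
            self._robots[host] = parser
        return self._robots[host]

    def may_fetch(self, url: str) -> bool:
        parts = urlparse(url)
        host = parts.netloc
        # Reject anything outside the tenant's explicit allowlist.
        if host in self.denied_hosts or host not in self.allowed_hosts:
            return False
        return self._robots_for(host, parts.scheme).can_fetch(USER_AGENT, url)
```

A production version would also expire cached robots.txt entries after a TTL so that changed directives are picked up (and can trigger the alerts described under Monitoring).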
Throttling and retries
- Enforce per-host QPS (e.g., 0.5–1 req/s) with jitter.
- Back off exponentially on 429/5xx; log repeated failures for review.
- Pause crawls if the origin sends a Retry-After header; the throttling sketch below honors it.
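A sketch of the per-host pacing and backoff loop, assuming the requests library; the interval and retry constants mirror the 0.5–1 req/s guidance above, and polite_get is a placeholder name rather than an existing API.

```python
import random
import time

import requests

MIN_INTERVAL = 1.0   # ~1 req/s per host; raise to 2.0 for ~0.5 req/s
MAX_RETRIES = 4

_last_hit: dict[str, float] = {}  # host -> timestamp of last request


def polite_get(host: str, url: str) -> requests.Response | None:
    """Fetch url while pacing per host and backing off on 429/5xx (illustrative)."""
    for attempt in range(MAX_RETRIES):
        # Per-host pacing with jitter so worker bursts do not line up.
        wait = MIN_INTERVAL - (time.monotonic() - _last_hit.get(host, 0.0))
        time.sleep(max(wait, 0.0) + random.uniform(0.0, 0.3))
        _last_hit[host] = time.monotonic()

        resp = requests.get(url, timeout=10)
        if resp.status_code == 429 or resp.status_code >= 500:
            # Honor Retry-After when present, otherwise back off exponentially.
            retry_after = resp.headers.get("Retry-After")
            try:
                delay = float(retry_after) if retry_after else 2 ** attempt
            except ValueError:  # Retry-After may also be an HTTP date
                delay = 2 ** attempt
            time.sleep(delay)
            continue
        return resp
    return None  # caller logs the repeated failure for review
```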
SSRF defenses
- Block requests that resolve to internal or link-local addresses (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, localhost, and 169.254.0.0/16, which includes the 169.254.169.254 metadata endpoint); see the resolution check after this list.
- Use outbound allowlists per tenant/service.
- Run headless renderers in isolated Cloud Run services with minimal permissions.
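A sketch of the address-level deny check using the standard socket and ipaddress modules; the function name and blanket policy are assumptions. A hardened deployment would also pin the resolved IP for the actual connection to defeat DNS rebinding, rather than resolving twice.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def resolves_to_public_address(url: str) -> bool:
    """Return False if any resolved address is private, loopback, or link-local."""
    host = urlparse(url).hostname
    if not host:
        return False
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False  # unresolvable hosts are treated as blocked
    for info in infos:
        addr = ipaddress.ip_address(info[4][0])
        # Covers 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, localhost, and
        # 169.254.0.0/16 (which includes the 169.254.169.254 metadata endpoint).
        if addr.is_private or addr.is_loopback or addr.is_link_local or addr.is_reserved:
            return False
    return True
```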
Crawl budgeting
- Cap pages per run based on plan tier (e.g., demo = 5 pages).
- Track discovered vs processed URLs; alert when budgets exceed expected totals.
- Send conditional requests (If-Modified-Since against the stored Last-Modified value) to skip unchanged pages, as in the budget sketch below.
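A sketch of plan-aware caps combined with conditional GETs, assuming the requests library; the tier limits and the last_modified store are placeholders for whatever the tenant configuration and persistence layer provide.

```python
import requests

PLAN_PAGE_CAPS = {"demo": 5, "starter": 100, "pro": 1000}  # assumed tiers


def crawl_within_budget(plan: str, urls: list[str], last_modified: dict[str, str]) -> list[str]:
    """Fetch up to the plan's page cap, skipping pages that report 304 Not Modified."""
    cap = PLAN_PAGE_CAPS.get(plan, 5)
    fetched: list[str] = []
    for url in urls[:cap]:
        headers = {}
        if url in last_modified:
            # Conditional GET: the origin answers 304 if the page is unchanged.
            headers["If-Modified-Since"] = last_modified[url]
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code == 304:
            continue  # unchanged page; skip reprocessing
        if "Last-Modified" in resp.headers:
            last_modified[url] = resp.headers["Last-Modified"]
        fetched.append(url)
    return fetched
```

Keeping the full discovered list (urls) alongside the fetched subset makes it straightforward to alert when a site exposes far more pages than the budget expects.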
Monitoring
- Log crawl runs (tenant, start/end time, success count, error count); a sample report structure follows this list.
- Expose metrics for QPS, bytes fetched, soft 404s, and blocked URLs.
- Send alerts when error rates spike or when robots directives change.
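One way to structure the per-run record and the error-rate alert; the field names, the 20% threshold, and the logging-based alert hook are illustrative, and the actual metrics backend is out of scope here.

```python
import logging
from dataclasses import asdict, dataclass

log = logging.getLogger("crawl")


@dataclass
class CrawlRunReport:
    tenant: str
    started_at: str
    finished_at: str
    success_count: int
    error_count: int
    bytes_fetched: int
    soft_404_count: int
    blocked_url_count: int

    def error_rate(self) -> float:
        total = self.success_count + self.error_count
        return self.error_count / total if total else 0.0

    def emit(self, alert_threshold: float = 0.2) -> None:
        # Structured log line that downstream dashboards can parse into metrics.
        log.info("crawl_run", extra={"report": asdict(self)})
        if self.error_rate() > alert_threshold:
            # Hook for paging/alerting; here it only logs at warning level.
            log.warning("crawl_run_error_rate_high",
                        extra={"run_tenant": self.tenant, "run_error_rate": self.error_rate()})
```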
CrawlBot defaults
By default, CrawlBot applies robots.txt compliance, per-host QPS throttling, SSRF deny lists, headless render limits, and plan-aware budgets. Copy these patterns to keep your assistant both respectful and reliable.