Crawler Politeness Controls for AI Assistants
A polite crawler keeps AI assistants fresh without overloading or antagonizing origin sites. Apply the controls below to crawl safely and predictably.
Robots.txt and allowlists
- Fetch robots.txt before crawling; cache directives per host.
- Let tenants define explicit allowlists and denylists.
- Reject attempts to crawl external domains unless they are explicitly allowlisted; a combined robots/allowlist check is sketched below.
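A minimal sketch of the per-host robots.txt cache and tenant allowlist check, built on Python's standard urllib.robotparser. The CrawlerPolicy class, the allowlist/denylist structure, and the USER_AGENT token are illustrative assumptions, not CrawlBot's actual interface.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "CrawlBot"  # assumed token; substitute your crawler's real UA string


class CrawlerPolicy:
    """Caches robots.txt per host and enforces a tenant allowlist (illustrative)."""

    def __init__(self, allowed_hosts: set[str], denied_hosts: set[str] | None = None):
        self.allowed_hosts = allowed_hosts
        self.denied_hosts = denied_hosts or set()
        self._robots: dict[str, RobotFileParser] = {}  # per-host directive cache

    def _robots_for(self, host: str, scheme: str) -> RobotFileParser:
        # Fetch robots.txt once per host and reuse the parsed directives.
        if host not in self._robots:
            parser = RobotFileParser(f"{scheme}://{host}/robots.txt")
            parser.read()
            self._robots[host] = parser
        return self._robots[host]

    def may_fetch(self, url: str) -> bool:
        parts = urlparse(url)
        host = parts.netloc
        # Reject anything outside the tenant's explicit allowlist.
        if host in self.denied_hosts or host not in self.allowed_hosts:
            return False
        return self._robots_for(host, parts.scheme).can_fetch(USER_AGENT, url)
```

A production version would also expire cached robots.txt entries after a TTL so that changed directives are picked up (and can trigger the alerts described under Monitoring).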
Throttling and retries
- Enforce per-host QPS (e.g., 0.5–1 req/s) with jitter.
- Back off exponentially on 429/5xx; log repeated failures for review.
- Pause crawls if the origin sends a Retry-After header; the throttling sketch below honors it.
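A sketch of the per-host pacing and backoff loop, assuming the requests library; the interval and retry constants mirror the 0.5–1 req/s guidance above, and polite_get is a placeholder name rather than an existing API.

```python
import random
import time

import requests

MIN_INTERVAL = 1.0   # ~1 req/s per host; raise to 2.0 for ~0.5 req/s
MAX_RETRIES = 4

_last_hit: dict[str, float] = {}  # host -> timestamp of last request


def polite_get(host: str, url: str) -> requests.Response | None:
    """Fetch url while pacing per host and backing off on 429/5xx (illustrative)."""
    for attempt in range(MAX_RETRIES):
        # Per-host pacing with jitter so worker bursts do not line up.
        wait = MIN_INTERVAL - (time.monotonic() - _last_hit.get(host, 0.0))
        time.sleep(max(wait, 0.0) + random.uniform(0.0, 0.3))
        _last_hit[host] = time.monotonic()

        resp = requests.get(url, timeout=10)
        if resp.status_code == 429 or resp.status_code >= 500:
            # Honor Retry-After when present, otherwise back off exponentially.
            retry_after = resp.headers.get("Retry-After")
            try:
                delay = float(retry_after) if retry_after else 2 ** attempt
            except ValueError:  # Retry-After may also be an HTTP date
                delay = 2 ** attempt
            time.sleep(delay)
            continue
        return resp
    return None  # caller logs the repeated failure for review
```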
SSRF defenses
- Block requests that resolve to internal or link-local addresses (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, localhost, and 169.254.0.0/16, which includes the 169.254.169.254 metadata endpoint); see the resolution check after this list.
- Use outbound allowlists per tenant/service.
- Run headless renderers in isolated Cloud Run services with minimal permissions.
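A sketch of the address-level deny check using the standard socket and ipaddress modules; the function name and blanket policy are assumptions. A hardened deployment would also pin the resolved IP for the actual connection to defeat DNS rebinding, rather than resolving twice.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def resolves_to_public_address(url: str) -> bool:
    """Return False if any resolved address is private, loopback, or link-local."""
    host = urlparse(url).hostname
    if not host:
        return False
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False  # unresolvable hosts are treated as blocked
    for info in infos:
        addr = ipaddress.ip_address(info[4][0])
        # Covers 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, localhost, and
        # 169.254.0.0/16 (which includes the 169.254.169.254 metadata endpoint).
        if addr.is_private or addr.is_loopback or addr.is_link_local or addr.is_reserved:
            return False
    return True
```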
Crawl budgeting
- Cap pages per run based on plan tier (e.g., demo = 5 pages).
- Track discovered vs processed URLs; alert when budgets exceed expected totals.
- Send conditional requests (If-Modified-Since against the stored Last-Modified value) to skip unchanged pages, as in the budget sketch below.
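A sketch of plan-aware caps combined with conditional GETs, assuming the requests library; the tier limits and the last_modified store are placeholders for whatever the tenant configuration and persistence layer provide.

```python
import requests

PLAN_PAGE_CAPS = {"demo": 5, "starter": 100, "pro": 1000}  # assumed tiers


def crawl_within_budget(plan: str, urls: list[str], last_modified: dict[str, str]) -> list[str]:
    """Fetch up to the plan's page cap, skipping pages that report 304 Not Modified."""
    cap = PLAN_PAGE_CAPS.get(plan, 5)
    fetched: list[str] = []
    for url in urls[:cap]:
        headers = {}
        if url in last_modified:
            # Conditional GET: the origin answers 304 if the page is unchanged.
            headers["If-Modified-Since"] = last_modified[url]
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code == 304:
            continue  # unchanged page; skip reprocessing
        if "Last-Modified" in resp.headers:
            last_modified[url] = resp.headers["Last-Modified"]
        fetched.append(url)
    return fetched
```

Keeping the full discovered list (urls) alongside the fetched subset makes it straightforward to alert when a site exposes far more pages than the budget expects.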
Monitoring
- Log crawl runs (tenant, start/end time, success count, error count); a sample report structure follows this list.
- Expose metrics for QPS, bytes fetched, soft 404s, and blocked URLs.
- Send alerts when error rates spike or when robots directives change.
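One way to structure the per-run record and the error-rate alert; the field names, the 20% threshold, and the logging-based alert hook are illustrative, and the actual metrics backend is out of scope here.

```python
import logging
from dataclasses import asdict, dataclass

log = logging.getLogger("crawl")


@dataclass
class CrawlRunReport:
    tenant: str
    started_at: str
    finished_at: str
    success_count: int
    error_count: int
    bytes_fetched: int
    soft_404_count: int
    blocked_url_count: int

    def error_rate(self) -> float:
        total = self.success_count + self.error_count
        return self.error_count / total if total else 0.0

    def emit(self, alert_threshold: float = 0.2) -> None:
        # Structured log line that downstream dashboards can parse into metrics.
        log.info("crawl_run", extra={"report": asdict(self)})
        if self.error_rate() > alert_threshold:
            # Hook for paging/alerting; here it only logs at warning level.
            log.warning("crawl_run_error_rate_high",
                        extra={"run_tenant": self.tenant, "run_error_rate": self.error_rate()})
```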
CrawlBot defaults
By default, CrawlBot applies robots.txt compliance, per-host QPS throttling, SSRF deny lists, headless render limits, and plan-aware budgets. Copy these patterns to keep your assistant both respectful and reliable.