Crawler Politeness Controls for AI Assistants

crawler • politeness • ssrf • ai-assistant

A polite crawler keeps an AI assistant's knowledge fresh without overloading or antagonizing the sites it visits. The controls below cover robots handling, throttling, SSRF defenses, crawl budgets, and monitoring.

Robots.txt and allowlists

  • Fetch robots.txt before crawling; cache directives per host.
  • Let tenants define explicit allowlists and denylists.
  • Reject attempts to crawl domains outside the tenant's allowlist; a combined robots/allowlist check is sketched below.
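
A minimal sketch of the first two checks in Python, using the standard library's urllib.robotparser to fetch and cache robots.txt per host. The TENANT_ALLOWLISTS map, the CrawlBot user-agent string, and the is_fetch_allowed helper are illustrative names, not an existing API.

  from urllib.parse import urlparse
  from urllib.robotparser import RobotFileParser

  USER_AGENT = "CrawlBot"                       # assumed crawler user-agent string
  TENANT_ALLOWLISTS = {                         # hypothetical per-tenant allowlists
      "tenant-a": {"docs.example.com", "www.example.com"},
  }
  _robots_cache: dict[str, RobotFileParser] = {}

  def _robots_for(host: str) -> RobotFileParser:
      """Fetch robots.txt once per host and cache the parsed directives."""
      if host not in _robots_cache:
          parser = RobotFileParser(f"https://{host}/robots.txt")
          parser.read()                         # single network fetch per host
          _robots_cache[host] = parser
      return _robots_cache[host]

  def is_fetch_allowed(tenant: str, url: str) -> bool:
      """Allow a fetch only if the host is allowlisted and robots.txt permits it."""
      host = urlparse(url).hostname or ""
      if host not in TENANT_ALLOWLISTS.get(tenant, set()):
          return False                          # external domain, not allowlisted
      return _robots_for(host).can_fetch(USER_AGENT, url)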

Throttling and retries

  • Enforce per-host QPS (e.g., 0.5–1 req/s) with jitter.
  • Back off exponentially on 429/5xx; log repeated failures for review.
  • Pause crawls when the origin sends a Retry-After header; a throttling-and-backoff sketch follows this list.
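
A rough throttling-and-backoff sketch, assuming the requests library. The QPS value, retry limit, and polite_get helper are illustrative defaults rather than CrawlBot's actual settings, and only numeric Retry-After values are handled.

  import random
  import time
  from urllib.parse import urlparse

  import requests

  HOST_QPS = 0.5                                # ~1 request every 2 s per host (illustrative)
  MAX_RETRIES = 4
  _last_hit: dict[str, float] = {}

  def polite_get(url: str) -> requests.Response | None:
      host = urlparse(url).hostname or ""
      min_interval = 1.0 / HOST_QPS
      for attempt in range(MAX_RETRIES):
          # Space out per-host requests and add jitter so hosts are not hit in lockstep.
          wait = _last_hit.get(host, 0.0) + min_interval - time.monotonic()
          time.sleep(max(wait, 0.0) + random.uniform(0.0, 0.25))
          _last_hit[host] = time.monotonic()

          resp = requests.get(url, timeout=10)
          if resp.status_code == 429 or resp.status_code >= 500:
              # Honor a numeric Retry-After when present; otherwise back off exponentially.
              retry_after = resp.headers.get("Retry-After", "")
              delay = int(retry_after) if retry_after.isdigit() else 2 ** attempt
              time.sleep(delay)
              continue
          return resp
      return None                               # repeated failures; caller should log for review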

SSRF defenses

  • Block private, loopback, and link-local address ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 127.0.0.0/8, and 169.254.0.0/16, which includes the 169.254.169.254 metadata endpoint); see the resolver check sketched after this list.
  • Use outbound allowlists per tenant/service.
  • Run headless renderers in isolated Cloud Run services with minimal permissions.
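
One way to implement the address-range block, assuming the crawler resolves hostnames before fetching; the is_safe_target helper is an illustrative name. Python's ipaddress module already classifies private, loopback, and link-local addresses as non-global.

  import ipaddress
  import socket
  from urllib.parse import urlparse

  def is_safe_target(url: str) -> bool:
      """Resolve the host and reject any address that is not globally routable."""
      host = urlparse(url).hostname
      if not host:
          return False
      try:
          infos = socket.getaddrinfo(host, None)
      except socket.gaierror:
          return False
      for info in infos:
          addr = ipaddress.ip_address(info[4][0])
          # is_global is False for 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16,
          # 127.0.0.0/8, and 169.254.0.0/16 (which includes 169.254.169.254).
          if not addr.is_global:
              return False
      # Pin the resolved addresses when fetching, or DNS rebinding can bypass this check.
      return True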

Crawl budgeting

  • Cap pages per run based on plan tier (e.g., demo = 5 pages).
  • Track discovered vs. processed URLs; alert when either count exceeds the configured budget.
  • Use conditional requests (If-Modified-Since against the stored Last-Modified value) to skip unchanged pages, as in the sketch below.
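
A budgeting sketch under assumed plan tiers. The PAGE_BUDGETS values and the within_budget and fetch_if_changed helpers are illustrative, and the conditional fetch only saves work when the origin honors If-Modified-Since with a 304 response.

  import requests

  PAGE_BUDGETS = {"demo": 5, "standard": 200, "enterprise": 2000}   # assumed plan tiers

  def within_budget(plan: str, pages_processed: int) -> bool:
      """Stop a run once the plan's page cap is reached; unknown plans get no budget."""
      return pages_processed < PAGE_BUDGETS.get(plan, 0)

  def fetch_if_changed(url: str, last_modified: str | None) -> requests.Response | None:
      """Issue a conditional GET and skip pages the origin reports as unchanged."""
      headers = {"If-Modified-Since": last_modified} if last_modified else {}
      resp = requests.get(url, headers=headers, timeout=10)
      if resp.status_code == 304:
          return None                           # unchanged since the stored Last-Modified value
      return resp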

Monitoring

  • Log crawl runs (tenant, start/end time, success count, error count); a structured-logging sketch follows this list.
  • Expose metrics for QPS, bytes fetched, soft 404s, and blocked URLs.
  • Send alerts when error rates spike or when robots directives change.
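
A structured-logging sketch for crawl runs; the CrawlRun fields and the 20% error-rate threshold are illustrative, not CrawlBot's actual schema.

  import json
  import logging
  from dataclasses import asdict, dataclass

  logger = logging.getLogger("crawler")

  @dataclass
  class CrawlRun:
      tenant: str
      started_at: str          # ISO 8601 timestamps
      ended_at: str
      success_count: int
      error_count: int
      blocked_urls: int

  def log_crawl_run(run: CrawlRun, error_rate_alert: float = 0.2) -> None:
      """Emit one structured log line per run and flag runs with a high error rate."""
      logger.info("crawl_run %s", json.dumps(asdict(run)))
      total = run.success_count + run.error_count
      if total and run.error_count / total > error_rate_alert:
          # Wire this warning into your alerting channel (pager, Slack webhook, etc.).
          logger.warning("crawl_run error rate above %.0f%% for tenant %s",
                         error_rate_alert * 100, run.tenant)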

CrawlBot defaults

CrawlBot ships with robots.txt compliance, per-host QPS throttling, SSRF denylists, headless-render limits, and plan-aware budgets enabled by default. Copy these patterns to keep your assistant both respectful and reliable.