Help:User journeys/Combatting bots and scrapers

Journey info · Platform: Docker Compose & Kubernetes · Time: ~20 minutes

This is a guide to reducing the load that bots, scrapers, and AI/LLM crawlers can put on a Canasta wiki, using built-in tools such as the CrawlerProtection extension, CrowdSec and Caddy. These are not the only tools you can use; Cloudflare, for example, is an extremely popular tool that you can use outside of Canasta. See the guide Handling web crawlers on mediawiki.org for a more comprehensive listing of tools and approaches that can be used with MediaWiki.

For information on enabling CrowdSec within Canasta, see Help:CrowdSec. For additional information on how Caddy works within Canasta, see Help:Networking and TLS. Most of the measures here are applied the same way on both orchestrators, Docker Compose and Kubernetes: the CrawlerProtection settings, the Caddy User-Agent block in config/Caddyfile.site, and CrowdSec. The one exception is the robots.txt step, which uses a Compose-only mount; the note in that section covers Kubernetes.

Types of crawlers

There are, roughly speaking, four types of crawlers that can hit your site. In generally decreasing order of desirability, they are:

Search-engine crawlers (e.g. Googlebot, Bingbot)
AI search/answer crawlers (e.g. OAI-SearchBot, PerplexityBot)
AI training crawlers (e.g. GPTBot, ClaudeBot, CCBot)
Abusive scrapers that ignore robots.txt or hammer the site, sometimes maliciously

Ideally you can take an approach that blocks the last one or two types while allowing the others access.
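As a rough illustration of this taxonomy, the sketch below sorts a User-Agent string into the first three categories using only the example names listed above. The function name classify_crawler is hypothetical, and the mapping is an assumption for illustration, not a complete list. Abusive scrapers generally cannot be recognized by User-Agent, which is why the later sections rely on behavior and reputation instead.

```shell
# classify_crawler USER_AGENT   (hypothetical helper, illustration only)
# Prints one of: search, ai-search, ai-training, unknown.
# The names are the examples from the list above; real lists are much longer.
classify_crawler() {
  case "$1" in
    *Googlebot*|*bingbot*|*Bingbot*) echo "search" ;;
    *OAI-SearchBot*|*PerplexityBot*) echo "ai-search" ;;
    *GPTBot*|*ClaudeBot*|*CCBot*)    echo "ai-training" ;;
    *)                               echo "unknown" ;;
  esac
}

# Example:
#   classify_crawler "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
```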

Blocking expensive pages with CrawlerProtection

The most effective measure against scrapers is to stop serving them expensive pages at all. Canasta includes the CrawlerProtection extension, which denies anonymous access to resource-intensive action URLs (e.g. ?action=history, which serves page histories) and special pages (Special:WhatLinksHere, Special:RecentChanges, …), while leaving normal article reads and logged-in editors untouched.

You can enable and configure it in either a global settings file under config/settings/global/ (applied to every wiki on the instance) or a per-wiki file under config/settings/wikis/{wiki-id}/ (just that wiki).

Here is a reasonable configuration for CrawlerProtection:

wfLoadExtension( 'CrawlerProtection' );

// Deny anonymous requests with a raw 403 (skips MediaWiki rendering — faster,
// and sheds the most load).
$wgCrawlerProtectionRawDenial = true;
$wgCrawlerProtectionRawDenialText =
	"You must be logged in to view this page." .
	"<br><button onclick=\"history.back()\">Go Back</button>";

// Resource-intensive special pages to deny to anonymous users.
$wgCrawlerProtectedSpecialPages = [
	'mobilediff',
	'recentchangeslinked',
	'recentchanges',
	'relatedchanges',
	'whatlinkshere',
	'specialpages',
	'browse',
	'browsedata',
	'random',
];

// Action URLs to deny (page history is the big one).
$wgCrawlerProtectedActions = [ 'history' ];

No restart is needed for these changes (though if you are using GitOps, you should commit them so they are tracked).

You can also exempt trusted automation, such as monitoring and your own bots, from these restrictions with $wgCrawlerProtectionAllowedIPs.

Behavioral detection with CrowdSec

If CrowdSec is enabled, your engine already bans certain scraping behavior, regardless of IP address. The bundled crowdsecurity/caddy collection includes base-http-scenarios, which contains:

http-crawl-non_statics — aggressive crawling of dynamic pages (bulk content scraping), and
http-bad-user-agent — requests from known scraper/bot user-agents.

Legitimate crawlers (such as Googlebot and Bingbot) are protected from false bans by the bundled seo-bots-whitelist.

You can confirm the scenarios are active with:

canasta crowdsec scenarios

There is nothing to configure here — it is the default.

Blocking IP addresses with CrowdSec

Blocklists pre-emptively block IPs by reputation. In the CrowdSec Console under "Blocklists", subscribe this engine to free lists that catch scraper infrastructure:

Free Proxies — scrapers commonly route through open proxies.
Tor exit nodes — if you do not expect legitimate Tor readers.

You already receive the CrowdSec Community Blocklist via the Central API. CrowdSec's dedicated AI Crawlers blocklist is a paid list, but CrowdSec offers it free to open-source community projects. A public wiki usually qualifies; email community@crowdsec.net to request access. See Help:CrowdSec#Community blocklist for how blocklists reach the engine and how to read canasta crowdsec status.

Block self-identifying AI bots at the edge with Caddy

Well-behaved AI crawlers announce themselves by User-Agent and are not considered "malicious", so CrowdSec's blocklists will not stop them. You can nevertheless block them at the edge in config/Caddyfile.site in the instance directory (it is imported into the site block, and a matched handle short-circuits before MediaWiki):

@ai_scrapers header_regexp User-Agent (?i)(GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|anthropic-ai|Claude-Web|CCBot|PerplexityBot|Perplexity-User|Bytespider|Amazonbot|Meta-ExternalAgent|FacebookBot|Diffbot|ImagesiftBot|Omgilibot|cohere-ai|YouBot|DataForSeoBot|Timpibot)
handle @ai_scrapers {
    respond "Automated AI/LLM scraping of this wiki is not permitted." 403
}

Apply it with:

canasta restart

Two things to get right:

Do not list Google-Extended or Applebot-Extended here — those are robots.txt opt-out tokens, not real user-agents, and never appear in a request.
To stay citeable in AI search results, drop the search crawlers (e.g., OAI-SearchBot, ChatGPT-User, PerplexityBot, Perplexity-User) from the list.

The crawler list is always evolving; the community-maintained ai.robots.txt project is a good source to keep it current.
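When you edit the list, it can help to sanity-check the alternation offline before restarting. The sketch below copies the alternation from the matcher above into a variable and tests sample User-Agent strings with grep. The names AI_SCRAPER_PATTERN and matches_ai_scraper are hypothetical, and grep -E -i is only an approximation of Caddy's regular-expression engine, which should be adequate for a plain alternation like this one.

```shell
# Same alternation as the @ai_scrapers matcher, without the (?i) flag
# (case-insensitivity is supplied by grep -i below).
AI_SCRAPER_PATTERN='GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|anthropic-ai|Claude-Web|CCBot|PerplexityBot|Perplexity-User|Bytespider|Amazonbot|Meta-ExternalAgent|FacebookBot|Diffbot|ImagesiftBot|Omgilibot|cohere-ai|YouBot|DataForSeoBot|Timpibot'

# matches_ai_scraper USER_AGENT
# Exit status 0 if the User-Agent would match the pattern, non-zero otherwise.
matches_ai_scraper() {
  printf '%s\n' "$1" | grep -Eiq -- "$AI_SCRAPER_PATTERN"
}

# Try a few sample strings (the sample values are illustrative).
for ua in "Mozilla/5.0 (compatible; GPTBot/1.2)" \
          "Mozilla/5.0 (compatible; Googlebot/2.1)" \
          "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"; do
  if matches_ai_scraper "$ua"; then
    echo "blocked: $ua"
  else
    echo "allowed: $ua"
  fi
done
```

This only checks the pattern itself; the curl tests in the Validation section remain the way to confirm what Caddy actually serves.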

Using robots.txt

robots.txt asks compliant crawlers to stay out, and is the only way to express the Google/Apple AI-training opt-out tokens. Canasta serves /robots.txt from robots.php (it disallows Special:, MediaWiki: and /w/, and advertises sitemaps). Do not serve a competing /robots.txt from Caddy, which would override it.

Docker Compose only: the procedure below appends to robots.txt through a bind mount in docker-compose.override.yml, which has no Kubernetes equivalent. On Kubernetes there is no first-class way to inject extra-robots.txt, so rely on the other measures above for practical protection; CrawlerProtection and the Caddy User-Agent block both work on Kubernetes. Only the Google/Apple AI-training opt-out tokens, which can be expressed only in robots.txt, are unavailable there.

robots.php appends /var/www/mediawiki/extra-robots.txt to its output when that file exists. That path is not bind-mounted by default, so add a persistent mount via docker-compose.override.yml. Keep the source file under config/ so backups and GitOps capture it automatically (everything under config/ is tracked).

Create the file first. It must exist before the containers start; otherwise Docker creates a directory at that path:

cat > config/extra-robots.txt <<'EOF'
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: CCBot
User-agent: Bytespider
User-agent: Amazonbot
User-agent: Meta-ExternalAgent
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /
EOF

Add the mount (Compose merges this onto the base web volumes, preserving the existing mounts):

services:
  web:
    volumes:
      - ./config/extra-robots.txt:/var/www/mediawiki/extra-robots.txt:ro
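Because a missing source file makes Docker create a directory in its place, a small pre-flight check before restarting can catch the problem early. This is a minimal sketch; the function name check_robots_source is hypothetical, and it assumes you run it from the instance directory.

```shell
# check_robots_source PATH   (hypothetical helper)
# Succeeds only if PATH is a regular, non-empty file.
check_robots_source() {
  if [ -d "$1" ]; then
    echo "ERROR: $1 is a directory; remove it and create the file" >&2
    return 1
  elif [ ! -f "$1" ] || [ ! -s "$1" ]; then
    echo "ERROR: $1 is missing or empty" >&2
    return 1
  fi
  echo "OK: $1 is a regular file"
}

# Example, run from the instance directory:
#   check_robots_source config/extra-robots.txt && canasta restart
```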

Apply it with:

canasta restart

Google-Extended and Applebot-Extended belong here, not in the Caddy block, because they opt your content out of Google/Apple AI training while still allowing those companies' search crawlers to index you. Note that Bytespider has been observed ignoring robots.txt, which is why it also appears in the Caddy User-Agent block above.

A note on backup and GitOps: config/extra-robots.txt is captured by canasta backup (restic) and tracked by GitOps, since it lives under config/. The docker-compose.override.yml that mounts it is also backed up by restic, and GitOps does not ignore it (only docker-compose.override.yml.example is ignored). On a GitOps instance, track it once with canasta gitops add docker-compose.override.yml; it stays tracked thereafter. The whole setup therefore survives a restore or a GitOps pull. See Help:Backup and restore.

Validation

To confirm that CrawlerProtection denies an anonymous history request, but not normal article reads, you can visit the pages while logged out, or run the following:

curl -sI "https://YOUR-WIKI/w/index.php?title=Main_Page&action=history" | head -1  # expect 403
curl -sI "https://YOUR-WIKI/" | head -1                                            # expect 200

To confirm that the edge block returns 403 for a blocked user-agent but serves a normal browser:

curl -sI -A "GPTBot" https://YOUR-WIKI/ | head -1       # expect 403
curl -sI -A "Mozilla/5.0" https://YOUR-WIKI/ | head -1  # expect 200
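If you run these checks often, they can be wrapped in two small helpers that print OK or FAIL per check. This is a sketch under assumptions: the function names status_of and expect_status are hypothetical, YOUR-WIKI is the same placeholder used above, and the default User-Agent is an arbitrary browser-like string.

```shell
# status_of URL [USER_AGENT]   (hypothetical helper)
# Prints the HTTP status code of a HEAD request to URL.
status_of() {
  curl -s -I -o /dev/null -w '%{http_code}' -A "${2:-Mozilla/5.0}" "$1"
}

# expect_status LABEL EXPECTED ACTUAL   (hypothetical helper)
# Prints OK or FAIL and returns non-zero on a mismatch.
expect_status() {
  if [ "$2" = "$3" ]; then
    echo "OK   $1 ($3)"
  else
    echo "FAIL $1 (expected $2, got $3)"
    return 1
  fi
}

# Example:
#   expect_status "history, anonymous" 403 "$(status_of 'https://YOUR-WIKI/w/index.php?title=Main_Page&action=history')"
#   expect_status "GPTBot at the edge" 403 "$(status_of https://YOUR-WIKI/ GPTBot)"
#   expect_status "normal browser"     200 "$(status_of https://YOUR-WIKI/)"
```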

To confirm that your robots.txt file is being served correctly, visit https://YOUR-WIKI/robots.txt and, if you added extra-robots.txt, check that its lines appear in the output.
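The robots.txt check can also be scripted. The sketch below reads a robots.txt body on standard input and reports any expected User-agent tokens that are absent. The function name missing_agents is hypothetical, and the tokens are passed as plain names, so anything containing regular-expression metacharacters would need escaping.

```shell
# missing_agents TOKEN...   (hypothetical helper)
# Reads a robots.txt body on stdin and prints each TOKEN that has no
# "User-agent: TOKEN" line. Returns non-zero if any token is missing.
missing_agents() {
  body=$(cat)
  rc=0
  for token in "$@"; do
    if ! printf '%s\n' "$body" |
         grep -Eiq "^User-agent:[[:space:]]*${token}[[:space:]]*\$"; then
      echo "missing: $token"
      rc=1
    fi
  done
  return $rc
}

# Example:
#   curl -s https://YOUR-WIKI/robots.txt | missing_agents GPTBot Google-Extended Applebot-Extended
```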

Confirm CrowdSec is enforcing the community and any console blocklists:

canasta crowdsec status

Undoing these protections

To disable CrawlerProtection, remove (or comment out) the CrawlerProtection block from your settings file (config/settings/global/ or config/settings/wikis/{wiki-id}/).
To undo Caddy blocking, remove the @ai_scrapers block from config/Caddyfile.site.
See Help:CrowdSec for how to disable CrowdSec. To disable it for only a certain set of trusted IPs, see Help:CrowdSec#Whitelisting trusted IPs.
To undo your robots.txt changes, remove the extra-robots.txt volume line from docker-compose.override.yml (delete the file too if unused). If the override file is now empty, you can remove it entirely.

After changing config/Caddyfile.site or docker-compose.override.yml, run canasta restart. Edits to the settings files take effect without a restart.

Production considerations

Tune, do not carpet-block. Keep search-engine and (optionally) AI-search crawlers allowed so your wiki(s) stay discoverable and citeable; block training crawlers and abusive scrapers. Revisit the User-Agent list periodically against ai.robots.txt.
Wiki farms. The Caddy block applies to the whole site address; on a farm it covers every wiki on that hostname. CrawlerProtection applies to every wiki when placed in config/settings/global/, or to a single wiki when placed in config/settings/wikis/{wiki-id}/.

Troubleshooting

A page anonymous users need returns 403. CrawlerProtection is denying a special page or action they legitimately use — remove it from $wgCrawlerProtectedSpecialPages / $wgCrawlerProtectedActions, or exempt the source IP via $wgCrawlerProtectionAllowedIPs (a config edit takes effect immediately — no restart is needed).
Caddy block does nothing. The directive must be a handle @ai_scrapers { ... } block in config/Caddyfile.site — a bare respond is ordered after reverse_proxy and never fires. Re-apply with canasta restart and re-run the curl -A test.
robots.txt lines missing. Confirm the bind mount resolved to a file, not a directory: docker compose exec web ls -l /var/www/mediawiki/extra-robots.txt. If it is a directory, the mount was created before the host file existed: remove the config/extra-robots.txt directory on the host, create the file in its place, and run canasta restart.
A blocked AI bot still appears in logs. Some crawlers (e.g. Bytespider) spoof or ignore controls; CrowdSec's http-bad-user-agent and the community blocklist catch many of these by behavior and reputation.