Journey info · Platform: Docker Compose · Time: ~20 minutes
This is a guide to reducing the load that bots, scrapers, and AI/LLM crawlers can put on a Canasta wiki, using built-in tools such as the CrawlerProtection extension, CrowdSec and Caddy. These are not the only tools you can use; Cloudflare, for example, is an extremely popular tool that you can use outside of Canasta. See the guide Handling web crawlers on mediawiki.org for a more comprehensive listing of tools and approaches that can be used with MediaWiki.
For information on enabling CrowdSec within Canasta, see Help:CrowdSec. For additional information on how Caddy works within Canasta, see Help:Networking and TLS. Note that Docker Compose is needed for both CrowdSec and Caddy; neither can currently work with Kubernetes.
Types of crawlers
There are, roughly speaking, four types of crawlers that can hit your site. In generally decreasing order of desirability, they are:
- Search-engine crawlers (e.g. Googlebot, Bingbot)
- AI search/answer crawlers (e.g. OAI-SearchBot, PerplexityBot)
- AI training crawlers (e.g. GPTBot, ClaudeBot, CCBot)
- Abusive scrapers that ignore robots.txt or hammer the site, sometimes even maliciously.
Ideally you can take an approach that blocks the last one or two types while allowing the others access.
Blocking expensive pages with CrawlerProtection
The most effective measure against scrapers is to stop serving them expensive pages at all. Canasta includes the CrawlerProtection extension, which denies anonymous access to resource-intensive action URLs (e.g. ?action=history, which serves page histories) and special pages (Special:WhatLinksHere, Special:RecentChanges, …), while leaving normal article reads and logged-in editors untouched.
You can enable and configure it in either a global settings file under config/settings/global/ (applied to every wiki on the instance) or a per-wiki file under config/settings/wikis/{wiki-id}/ (just that wiki).
Here is a reasonable configuration for CrawlerProtection:
wfLoadExtension( 'CrawlerProtection' );
// Deny anonymous requests with a raw 403 (skips MediaWiki rendering — faster,
// and sheds the most load).
$wgCrawlerProtectionRawDenial = true;
$wgCrawlerProtectionRawDenialText =
    "You must be logged in to view this page." .
    "<br><button onclick=\"history.back()\">Go Back</button>";
// Resource-intensive special pages to deny to anonymous users.
$wgCrawlerProtectedSpecialPages = [
    'mobilediff',
    'recentchangeslinked',
    'recentchanges',
    'relatedchanges',
    'whatlinkshere',
    'specialpages',
    'browse',
    'browsedata',
    'random',
];
// Action URLs to deny (page history is the big one).
$wgCrawlerProtectedActions = [ 'history' ];
No restart is needed for these changes (though if you are using GitOps, you should commit them so they are tracked).
You can also exempt trusted automation, such as monitoring and your own bots, from these restrictions with $wgCrawlerProtectionAllowedIPs.
Behavioral detection with CrowdSec
If CrowdSec is enabled, your engine already bans certain scraping behavior, regardless of IP address. The bundled crowdsecurity/caddy collection includes base-http-scenarios, which contains:
- http-crawl-non_statics: aggressive crawling of dynamic pages (bulk content scraping), and
- http-bad-user-agent: requests from known scraper/bot user-agents.
Legitimate crawlers (such as Googlebot and Bingbot) are protected from false bans by the bundled seo-bots-whitelist.
You can call the following to confirm that the scenarios are active:
docker compose exec crowdsec cscli scenarios list
There is nothing to configure here — it is the default.
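If you want to script the check above, the following is a minimal sketch. It assumes the scenario listing is piped in on standard input and simply looks for the two scenario names as text; `has_scenarios` is a hypothetical helper name, not part of Canasta or CrowdSec.

```shell
# Hypothetical helper: read a scenario listing on stdin and report whether
# both scraping-related scenarios named above appear in it.
has_scenarios() {
    listing=$(cat)
    rc=0
    for name in http-crawl-non_statics http-bad-user-agent; do
        if printf '%s\n' "$listing" | grep -q -- "$name"; then
            echo "found: $name"
        else
            echo "MISSING: $name"
            rc=1
        fi
    done
    return $rc
}

# Usage (from the instance directory; -T disables TTY allocation for piping):
#   docker compose exec -T crowdsec cscli scenarios list | has_scenarios
```

The function returns a non-zero status if either name is absent, so it can be used in a larger health-check script.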
Blocking IP addresses with CrowdSec
Blocklists pre-emptively block IPs by reputation. In the CrowdSec Console under "Blocklists", subscribe this engine to free lists that catch scraper infrastructure:
- Free Proxies — scrapers commonly route through open proxies.
- Tor exit nodes — if you do not expect legitimate Tor readers.
You already receive the CrowdSec Community Blocklist via the Central API. CrowdSec's dedicated AI Crawlers blocklist is a paid list, but CrowdSec offers it free to open-source community projects. A public wiki usually qualifies; email community@crowdsec.net to request access. See Help:CrowdSec#Community blocklist for how blocklists reach the engine and how to read canasta crowdsec status.
Block self-identifying AI bots at the edge with Caddy
Well-behaved AI crawlers announce themselves by User-Agent and are not considered "malicious", so CrowdSec's blocklists will not stop them. You can nevertheless block them at the edge in config/Caddyfile.site in the instance directory (it is imported into the site block, and a matched handle short-circuits before MediaWiki):
@ai_scrapers header_regexp User-Agent (?i)(GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|anthropic-ai|Claude-Web|CCBot|PerplexityBot|Perplexity-User|Bytespider|Amazonbot|Meta-ExternalAgent|FacebookBot|Diffbot|ImagesiftBot|Omgilibot|cohere-ai|YouBot|DataForSeoBot|Timpibot)
handle @ai_scrapers {
    respond "Automated AI/LLM scraping of this wiki is not permitted." 403
}
Apply it with:
canasta restart
Two things to get right:
- Do not list Google-Extended or Applebot-Extended here: those are robots.txt opt-out tokens, not real user-agents, and never appear in a request.
- To stay citeable in AI search results, drop the search crawlers (e.g., OAI-SearchBot, ChatGPT-User, PerplexityBot, Perplexity-User) from the list.
The crawler list is always evolving; the community-maintained ai.robots.txt project is a good source to keep it current.
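Before editing the list, it can help to check offline which User-Agent strings the pattern would catch. The sketch below approximates the matcher with `grep -Ei`: Caddy uses Go regular expressions, and `grep -E` does not understand the `(?i)` prefix, so the `-i` flag stands in for it. For a plain alternation like this one the two should behave the same, but treat that as an assumption. `AI_UA_PATTERN` and `ua_verdict` are hypothetical names used only for illustration.

```shell
# The alternation from the @ai_scrapers matcher above, without the (?i)
# prefix: grep -E does not support that flag, so -i is used instead.
AI_UA_PATTERN='(GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|anthropic-ai|Claude-Web|CCBot|PerplexityBot|Perplexity-User|Bytespider|Amazonbot|Meta-ExternalAgent|FacebookBot|Diffbot|ImagesiftBot|Omgilibot|cohere-ai|YouBot|DataForSeoBot|Timpibot)'

# Hypothetical helper: print "blocked" or "allowed" for one User-Agent string.
# The match is an unanchored, case-insensitive substring search.
ua_verdict() {
    if printf '%s\n' "$1" | grep -Eiq -- "$AI_UA_PATTERN"; then
        echo blocked
    else
        echo allowed
    fi
}

# Usage:
#   ua_verdict "Mozilla/5.0 (compatible; GPTBot/1.2)"      # prints: blocked
#   ua_verdict "Mozilla/5.0 (compatible; Googlebot/2.1)"   # prints: allowed
```

If you trim the list in the Caddyfile (for example, to keep the AI search crawlers), trim `AI_UA_PATTERN` the same way so that the offline check stays representative.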
Using robots.txt
robots.txt asks compliant crawlers to stay out, and is the only way to express the Google/Apple AI-training opt-out tokens. Canasta serves /robots.txt from robots.php (it disallows Special:, MediaWiki: and /w/, and advertises sitemaps). Do not serve a competing /robots.txt from Caddy, which would override it.
robots.php appends /var/www/mediawiki/extra-robots.txt to its output when that file exists. That path is not bind-mounted by default, so add a persistent mount via docker-compose.override.yml. Keep the source file under config/ so backups and GitOps capture it automatically (everything under config/ is tracked).
Create the file first (it must exist before up, or Docker creates a directory):
cat > config/extra-robots.txt <<'EOF'
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: CCBot
User-agent: Bytespider
User-agent: Amazonbot
User-agent: Meta-ExternalAgent
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /
EOF
Add the mount (Compose merges this onto the base web volumes, preserving the existing mounts):
services:
  web:
    volumes:
      - ./config/extra-robots.txt:/var/www/mediawiki/extra-robots.txt:ro
Apply it with:
canasta restart
Google-Extended and Applebot-Extended belong here, not in the Caddy block, because they opt your content out of Google/Apple AI training while still allowing those companies' search crawlers to index you. Note that Bytespider has been observed ignoring robots.txt, which is why it also appears in the Caddy User-Agent block above.
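Because Docker creates a directory when the host side of a bind mount does not exist, a quick pre-flight check before restarting can save a cleanup later. This is a minimal sketch; `mount_source_state` is a hypothetical helper name, and it only inspects the host path you pass in.

```shell
# Hypothetical helper: classify a bind-mount source path on the host as
# "file", "directory", or "missing".
mount_source_state() {
    if [ -f "$1" ]; then
        echo file
    elif [ -d "$1" ]; then
        echo directory
    else
        echo missing
    fi
}

# Usage (from the instance directory, before `canasta restart`):
#   mount_source_state config/extra-robots.txt
```

Anything other than `file` suggests the mount source should be fixed before the containers are restarted.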
- A note on backup and GitOps: config/extra-robots.txt is captured by canasta backup (restic) and tracked by GitOps (it lives under config/). The docker-compose.override.yml that mounts it is also backed up by restic and is not ignored by GitOps (only docker-compose.override.yml.example is); on a GitOps instance, track it once with canasta gitops add docker-compose.override.yml and it rides along thereafter. So the whole setup survives a restore or a GitOps pull. See Help:Backup and restore.
Validation
To confirm that CrawlerProtection denies an anonymous history request, but not normal article reads, you can simply visit the pages, or you can call the following:
curl -sI "https://YOUR-WIKI/w/index.php?title=Main_Page&action=history" | head -1 # expect 403
curl -sI "https://YOUR-WIKI/" | head -1 # expect 200
To confirm that the edge block returns 403 for a blocked user-agent but serves a normal browser:
curl -sI -A "GPTBot" https://YOUR-WIKI/ | head -1 # expect 403
curl -sI -A "Mozilla/5.0" https://YOUR-WIKI/ | head -1 # expect 200
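The four checks above can be wrapped in small helpers if you want to re-run them after each change. This is a sketch under the assumption that the status code is the second field of the first response line, which holds for the HTTP/1.1 and HTTP/2 status lines that curl prints; `http_status` and `expect_status` are hypothetical names.

```shell
# Hypothetical helper: extract the numeric status code from `curl -sI`
# output on stdin (carriage returns are stripped first).
http_status() {
    tr -d '\r' | awk 'NR == 1 { print $2 }'
}

# Hypothetical helper: compare an observed status with the expected one.
# Usage: expect_status LABEL EXPECTED ACTUAL
expect_status() {
    if [ "$2" = "$3" ]; then
        echo "ok   $1 ($3)"
    else
        echo "FAIL $1 (expected $2, got $3)"
        return 1
    fi
}

# Usage (replace YOUR-WIKI as in the commands above):
#   expect_status "GPTBot blocked" 403 \
#       "$(curl -sI -A "GPTBot" https://YOUR-WIKI/ | http_status)"
#   expect_status "browser served" 200 \
#       "$(curl -sI -A "Mozilla/5.0" https://YOUR-WIKI/ | http_status)"
```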
To confirm that your robots.txt file is being served correctly, simply visit https://YOUR-WIKI/robots.txt in a browser.
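As an alternative to reading the file by eye, the sketch below lists any expected User-agent tokens that are absent from the served robots.txt. It assumes each token sits on its own `User-agent:` line, as in the extra-robots.txt example above, and that tokens contain no regular-expression metacharacters; `missing_agents` is a hypothetical helper name.

```shell
# Hypothetical helper: read robots.txt on stdin and print each expected
# User-agent token that is absent. Prints nothing if all are present.
# Usage: missing_agents TOKEN...
missing_agents() {
    body=$(tr -d '\r')
    for token in "$@"; do
        printf '%s\n' "$body" \
            | grep -Eiq "^User-agent:[[:space:]]*${token}[[:space:]]*\$" \
            || echo "$token"
    done
}

# Usage:
#   curl -s https://YOUR-WIKI/robots.txt \
#       | missing_agents GPTBot ClaudeBot Google-Extended Applebot-Extended
```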
Confirm CrowdSec is enforcing the community and any console blocklists:
canasta crowdsec status
Undoing these protections
- To disable CrawlerProtection, remove (or comment out) the CrawlerProtection block from your settings file (config/settings/global/ or config/settings/wikis/{wiki-id}/).
- To undo Caddy blocking, remove the @ai_scrapers block from config/Caddyfile.site.
- See Help:CrowdSec for how to disable CrowdSec. To disable it for only a certain set of trusted IPs, see Help:CrowdSec#Whitelisting trusted IPs.
- To undo your robots.txt changes, remove the extra-robots.txt volume line from docker-compose.override.yml (delete the file too if unused). If the override file is now empty, you can remove it entirely.
After any of these changes except the CrawlerProtection settings edit, which takes effect without a restart, you will need to call canasta restart.
Production considerations
- Tune, do not carpet-block. Keep search-engine and (optionally) AI-search crawlers allowed so your wiki(s) stay discoverable and citeable; block training crawlers and abusive scrapers. Revisit the User-Agent list periodically against ai.robots.txt.
- Wiki farms. The Caddy block applies to the whole site address; on a farm it covers every wiki on that hostname. CrawlerProtection applies to every wiki when placed in config/settings/global/, or to a single wiki when placed in config/settings/wikis/{wiki-id}/.
Troubleshooting
- A page anonymous users need returns 403. CrawlerProtection is denying a special page or action they legitimately use: remove it from $wgCrawlerProtectedSpecialPages / $wgCrawlerProtectedActions, or exempt the source IP via $wgCrawlerProtectionAllowedIPs (a config edit takes effect immediately; no restart is needed).
- Caddy block does nothing. The directive must be a handle @ai_scrapers { ... } block in config/Caddyfile.site; a bare respond is ordered after reverse_proxy and never fires. Re-apply with canasta restart and re-run the curl -A test.
- robots.txt lines missing. Confirm the bind mount resolved to a file, not a directory: docker compose exec web ls -l /var/www/mediawiki/extra-robots.txt. If it is a directory, you created the mount before the host file existed: remove it, create extra-robots.txt, and run canasta restart.
- A blocked AI bot still appears in logs. Some crawlers (e.g. Bytespider) spoof or ignore controls; CrowdSec's http-bad-user-agent and the community blocklist catch many of these by behavior and reputation.