Django, Python
🤖 Super Bot Fight 🥊
In March, I wrote about my robots.txt research and how I started proactively and defensively blocking AI Agents in my 🤖 On Robots.txt post. Since then, I have updated my Django projects to add more robots.txt rules.
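For context, Django can serve rules like that straight from a view. Here is a minimal sketch of one way to do it; the URL pattern and the two bots shown are placeholders, not my actual rule set:

# urls.py
from django.http import HttpResponse
from django.urls import path

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""

def robots_txt(request):
    # Serve the rules as plain text so crawlers can parse them
    return HttpResponse(ROBOTS_TXT, content_type="text/plain")

urlpatterns = [
    path("robots.txt", robots_txt),
]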
Earlier this week, I ran across the Blockin’ bots. blog post and this example, in which a mod_rewrite rule blocks AI Agents via their User-Agent strings:
<IfModule mod_rewrite.c>
    RewriteEngine on
    RewriteBase /
    # block "AI" bots
    RewriteCond %{HTTP_USER_AGENT} (AdsBot-Google|Amazonbot|anthropic-ai|Applebot|AwarioRssBot|AwarioSmartBot|Bytespider|CCBot|ChatGPT|ChatGPT-User|Claude-Web|ClaudeBot|cohere-ai|DataForSeoBot|Diffbot|FacebookBot|Google-Extended|GPTBot|ImagesiftBot|magpie-crawler|omgili|Omgilibot|peer39_crawler|PerplexityBot|YouBot) [NC]
    RewriteRule ^ - [F]
</IfModule>
Since none of my projects use Apache, and I was short on time, I decided to leave this war to the bots.
Django Middleware
I asked ChatGPT to convert this snippet to a piece of Django Middleware called Super Bot Fight. After all, if we don’t have time to keep up with bots, then we could leverage this technology to help fight against them.
The snippet it produced passed my eyeball test and, in theory, was good enough:
# middleware.py
from django.http import HttpResponseForbidden
# List of user agents to block
BLOCKED_USER_AGENTS = [
"AdsBot-Google",
"Amazonbot",
"anthropic-ai",
"Applebot",
"AwarioRssBot",
"AwarioSmartBot",
"Bytespider",
"CCBot",
"ChatGPT",
"ChatGPT-User",
"Claude-Web",
"ClaudeBot",
"cohere-ai",
"DataForSeoBot",
"Diffbot",
"FacebookBot",
"Google-Extended",
"GPTBot",
"ImagesiftBot",
"magpie-crawler",
"omgili",
"Omgilibot",
"peer39_crawler",
"PerplexityBot",
"YouBot",
]
class BlockBotsMiddleware:
def __init__(self, get_response):
self.get_response = get_response
def __call__(self, request):
# Check the User-Agent against the blocked list
user_agent = request.META.get("HTTP_USER_AGENT", "")
if any(bot in user_agent for bot in BLOCKED_USER_AGENTS):
return HttpResponseForbidden("Access denied")
response = self.get_response(request)
return response
To use this middleware, you would update your Django settings.py to add it to your MIDDLEWARE setting (adjusting the dotted path to wherever middleware.py lives in your project):
# settings.py
MIDDLEWARE = [
    ...
    "middleware.BlockBotsMiddleware",
    ...
]
Tests?
If this middleware works for you and you care about testing, then these tests should also work:
import pytest
from django.http import HttpResponse
from django.test import RequestFactory

from middleware import BlockBotsMiddleware


@pytest.mark.parametrize("user_agent, should_block", [
    ("AdsBot-Google", True),
    ("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)", False),
    ("ChatGPT-User", True),
    ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3", False),
])
def test_user_agent_blocking(user_agent, should_block):
    # Create a request factory to generate request instances
    factory = RequestFactory()
    request = factory.get("/", HTTP_USER_AGENT=user_agent)

    # Middleware setup
    middleware = BlockBotsMiddleware(get_response=lambda request: HttpResponse())
    response = middleware(request)

    # Check if the response should be blocked or allowed
    if should_block:
        assert response.status_code == 403, f"Request with user agent '{user_agent}' should be blocked."
    else:
        assert response.status_code != 403, f"Request with user agent '{user_agent}' should not be blocked."
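Running these with pytest assumes pytest-django is installed and pointed at a settings module, along the lines of the following sketch (the settings path is whatever your project uses, not a real one from this post):

# pytest.ini
[pytest]
DJANGO_SETTINGS_MODULE = myproject.settings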
Enhancements
To use this code in production, I would normalize the user_agent and BLOCKED_USER_AGENTS values so the comparison is case-insensitive.
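A minimal sketch of that normalization, reusing the BLOCKED_USER_AGENTS list from above (the helper name is hypothetical):

# Lower-case both sides once so "GPTBot", "gptbot", and "GPTBOT" all match
BLOCKED_USER_AGENTS_LOWER = [bot.lower() for bot in BLOCKED_USER_AGENTS]

def is_blocked(user_agent):
    # Compare the lower-cased header against the lower-cased blocklist
    user_agent = user_agent.lower()
    return any(bot in user_agent for bot in BLOCKED_USER_AGENTS_LOWER)

The middleware's __call__ would then call is_blocked(request.META.get("HTTP_USER_AGENT", "")) instead of doing the comparison inline.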
I would also consider storing my list of user agents in a Django model or using a project like django-robots instead of a hard-coded Python list.
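A rough sketch of the model route might look like this; the model and field names are hypothetical, and in practice the list should probably be cached so every request does not hit the database:

# models.py
from django.db import models

class BlockedUserAgent(models.Model):
    # Substring to match against the incoming User-Agent header
    value = models.CharField(max_length=255, unique=True)

    def __str__(self):
        return self.value

The middleware would then build its blocklist from BlockedUserAgent.objects.values_list("value", flat=True) instead of the hard-coded BLOCKED_USER_AGENTS list.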
Thursday April 18, 2024