
haphash: Anti-scraper for haproxy

This is a simple anti-scraper solution for haproxy, using a "hashcash"-style challenge similar to the one Anubis uses. The goal is to be as simple as possible, so it can sit alongside other haproxy rules that control traffic.

Overview

AI crawlers keep breaking the web. I have mostly avoided this problem by having very lightweight pages, but lately I've noticed some scrapers being particularly obnoxious. Many solutions to this problem involve adding another proxy component, but I'm already running haproxy in most places; it is a perfectly fine reverse proxy, and I don't want to make things more complex if I can avoid it.

This uses a haproxy "stick table" to store details of IP addresses. It works by simply allowing IP addresses, rather than setting cookies. As the IP address is stored only in memory and there is no cookie, this likely does not add to any GDPR obligations (this is not legal advice).
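As a rough illustration of the idea (a sketch only -- the table name and parameters here are made up; see haproxy.conf for the real definition), an IP-keyed stick table with a general-purpose tag can record which addresses have passed the challenge:

    # Sketch: one entry per client IP; gpt0 is flipped to 1 once the
    # challenge has been solved. Entries expire after an hour of inactivity.
    backend challenge_ips
        stick-table type ip size 100k expire 1h store gpt0

    frontend www
        # Track the client address and expose its tag to ACLs.
        http-request track-sc0 src table challenge_ips
        acl solved sc_get_gpt0(0) eq 1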

It is expected that this will be combined with haproxy's IP-based rate limiting, with the benefit that this doesn't add another component to the system.
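For instance (again a sketch with made-up names, not lines taken from this repository), a second table tracking request rates lets the same frontend do both jobs:

    # Hypothetical companion table for rate limiting.
    backend rate_limits
        stick-table type ip size 100k expire 10m store http_req_rate(10s)

    frontend www
        http-request track-sc1 src table rate_limits
        # Reject clients making more than 20 requests per 10 seconds.
        http-request deny deny_status 429 if { sc_http_req_rate(1) gt 20 }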

If you want to try it out, my contact page is always protected by it.

The moving parts

challenge.html is the HTML served to clients, templated via haproxy. (Because it is templated, you can't just open it in your browser -- note the double percent signs.)
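haproxy expands the file using its log-format syntax: %[...] sample-fetch expressions are substituted, which is why a literal percent sign in the page's HTML or JavaScript must be written %%. A minimal sketch of serving such a file (the real challenge backend carries more logic than this):

    # Sketch: serve the templated page straight from haproxy.
    backend challenge
        http-request return status 200 content-type "text/html" \
            lf-file /etc/haproxy/challenge.html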

haproxy.conf is a haproxy config snippet that makes use of this. You are expected to adjust it for your deployment. The "challenge" backend is where the majority of the logic lives and should need only tiny changes.

This is small:

$ wc -l haproxy.conf challenge.html
      38 haproxy.conf
      94 challenge.html
     132 total

Set-up

Copy challenge.html to /etc/haproxy/challenge.html (or other suitable location).

From haproxy.conf, add the challenge backend to your haproxy configuration, and add the relevant lines from frontend www to your own frontend section; a sketch of this wiring follows.
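Assuming the illustrative names from the sketches above (the real lines to copy live in haproxy.conf), the frontend wiring might look like:

    frontend www
        bind :80
        default_backend site

        # Track each client IP and check whether it has already solved
        # the challenge (gpt0 set to 1 in the stick table).
        http-request track-sc0 src table challenge_ips
        acl solved sc_get_gpt0(0) eq 1

        # Only challenge unsolved clients on the protected path.
        acl protected path_beg /contact
        use_backend challenge if protected !solved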

To start with, it is recommended that you protect a single path for testing purposes. Note that restarting haproxy will clear the stick table; configure peers to make the allowed IP addresses persist, as sketched below.
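A minimal peers sketch, assuming two haproxy nodes (the names and addresses are placeholders):

    peers haphash_peers
        peer lb1 192.0.2.10:10000
        peer lb2 192.0.2.11:10000

    backend challenge_ips
        # Attaching the table to the peers section syncs it between nodes
        # and lets entries survive reloads via the peers protocol.
        stick-table type ip size 100k expire 1h store gpt0 peers haphash_peers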

The difficulty is set in both the HTML and the haproxy config; it defaults to 4, which is pretty fast to solve.
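Assuming the difficulty counts leading zero hex digits of a SHA-256 hash (the usual hashcash arrangement -- the parameter name and exact check below are assumptions, not the repository's actual logic), difficulty 4 means roughly 16^4 = 65,536 attempts on average. The haproxy side of the verification could then look something like:

    # Hash the submitted answer and require four leading zero hex digits.
    # Note: the sha2 converter needs haproxy built with OpenSSL support.
    http-request set-var(txn.answer_hash) url_param(answer),sha2(256),hex
    acl difficulty_met var(txn.answer_hash) -m beg 0000
    # Mark this IP as allowed once the proof of work checks out.
    http-request sc-set-gpt0(0) 1 if difficulty_met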

License

© David Leadbeater 2025; 0BSD, see COPYING.

Alternatives

Credits
