diff --git a/.github/actions/spelling/expect.txt b/.github/actions/spelling/expect.txt index 73606ea8f..96c0f3546 100644 --- a/.github/actions/spelling/expect.txt +++ b/.github/actions/spelling/expect.txt @@ -142,6 +142,7 @@ healthz hec Hetzner hmc +honeypots homelab hostable htmlc @@ -158,6 +159,7 @@ Imagesift imgproxy impressum inp +Iocaine internets IPTo iptoasn @@ -298,6 +300,7 @@ subrequest SVCNAME tagline tarballs +tarpit tarrif taviso tbn diff --git a/docs/docs/CHANGELOG.md b/docs/docs/CHANGELOG.md index 88e2693eb..75b769600 100644 --- a/docs/docs/CHANGELOG.md +++ b/docs/docs/CHANGELOG.md @@ -13,6 +13,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 +- Added `DENY_AND_REROUTE` action for redirecting denied requests to external AI tarpits ([#61](https://github.com/json-kyle/anubis/issues/61)) - Document missing environment variables in installation guide: `SLOG_LEVEL`, `COOKIE_PREFIX`, `FORCED_LANGUAGE`, and `TARGET_DISABLE_KEEPALIVE` ([#1086](https://github.com/TecharoHQ/anubis/pull/1086)) - Add validation warning when persistent storage is used without setting signing keys - Fixed `robots2policy` to properly group consecutive user agents into `any:` instead of only processing the last one ([#925](https://github.com/TecharoHQ/anubis/pull/925)) diff --git a/docs/docs/admin/policies.mdx b/docs/docs/admin/policies.mdx index f95821dfe..39374e2ab 100644 --- a/docs/docs/admin/policies.mdx +++ b/docs/docs/admin/policies.mdx @@ -66,6 +66,7 @@ There are four actions that can be returned from a rule: | :---------- | :---------------------------------------------------------------------------------------------------------------------------------- | | `ALLOW` | Bypass all further checks and send the request to the backend. | | `DENY` | Deny the request and send back an error message that scrapers think is a success. | +| `DENY_AND_REROUTE` | Deny the request and redirect it to an external URL (e.g. a [tarpit](#tarpits)). 
| | `CHALLENGE` | Show a challenge page and/or validate that clients have passed a challenge. | | `WEIGH` | Change the [request weight](#request-weight) for this request. See the [request weight](#request-weight) docs for more information. | @@ -319,13 +320,71 @@ This would have the Valkey client connect to host `valkey.int.techaro.lol` on po In case your service needs it for risk calculation reasons, Anubis exposes information about the rules that any requests match using a few headers: | Header | Explanation | Example | -| :---------------- | :--------------------------------------------------- | :--------------- | +|:------------------|:-----------------------------------------------------|:-----------------| | `X-Anubis-Rule` | The name of the rule that was matched | `bot/lightpanda` | | `X-Anubis-Action` | The action that Anubis took in response to that rule | `CHALLENGE` | | `X-Anubis-Status` | The status and how strict Anubis was in its checks | `PASS` | Policy rules are matched using [Go's standard library regular expressions package](https://pkg.go.dev/regexp). You can mess around with the syntax at [regex101.com](https://regex101.com), make sure to select the Golang option. +### Deny and Reroute Configuration {#tarpits} + +The `DENY_AND_REROUTE` action allows you to redirect denied requests to external AI tarpits or honeypots. This is useful for sending bot traffic to services like [Nepenthes](https://zadzmo.org/code/nepenthes/) or [Iocaine](https://iocaine.madhouse-project.org/) that specialize in wasting bots' time and resources. 
+ + + +```json +{ + "name": "ai-scrapers-to-tarpit", + "user_agent_regex": "(ChatGPT|GPTBot|Claude-Web|OpenAI|Anthropic)", + "action": "DENY_AND_REROUTE", + "reroute_to": "https://tarpit.example.com/honeypot" +} +``` + + + +```yaml +- name: ai-scrapers-to-tarpit + user_agent_regex: (ChatGPT|GPTBot|Claude-Web|OpenAI|Anthropic) + action: DENY_AND_REROUTE + reroute_to: https://tarpit.example.com/honeypot +``` + + + +The `reroute_to` field must contain an absolute URL (including the scheme like `http://` or `https://`). When this rule matches, Anubis will send a `307 Temporary Redirect` response to redirect the client to the specified URL. + +#### Usage + +**Requirements:** +- `reroute_to` must be an absolute URL with scheme (`http://` or `https://`) +- Returns HTTP 307 Temporary Redirect to the specified URL + +**Examples:** +```yaml +# Redirect suspicious AI scrapers with high weight +- name: ai-to-tarpit + action: DENY_AND_REROUTE + expression: + all: + - userAgent.contains("GPT") || userAgent.contains("Claude") + - weight > 10 + reroute_to: https://tarpit.example.com/honeypot + +# Reroute scrapers probing for unsecured PHP files. This is useful for sites that don't use PHP. +- name: php-scraper-reroute + action: DENY_AND_REROUTE + expression: + all: + - path.endsWith(".php") # PHP files are often targeted by bots + - weight > 5 + reroute_to: https://example.com/not-found +``` + ## Request Weight Anubis rules can also add or remove "weight" from requests, allowing administrators to configure custom levels of suspicion. For example, if your application uses session tokens named `i_love_gitea`: diff --git a/docs/docs/admin/robots2policy.mdx b/docs/docs/admin/robots2policy.mdx index 30f0eab08..7c4f509df 100644 --- a/docs/docs/admin/robots2policy.mdx +++ b/docs/docs/admin/robots2policy.mdx @@ -2,6 +2,7 @@ title: robots2policy CLI Tool sidebar_position: 50 --- +> "LET'S MAKE ROBOTS.TXT GREAT AGAIN!" 
- [Jason Cameron](https://jsn.cam/) The `robots2policy` tool converts robots.txt files into Anubis challenge policies. It reads robots.txt rules and generates equivalent CEL expressions for path matching and user-agent filtering. diff --git a/lib/anubis.go b/lib/anubis.go index 7b8d4f173..45444b67c 100644 --- a/lib/anubis.go +++ b/lib/anubis.go @@ -275,6 +275,17 @@ func (s *Server) checkRules(w http.ResponseWriter, r *http.Request, cr policy.Ch lg.Debug("rule hash", "hash", hash) s.respondWithStatus(w, r, fmt.Sprintf("%s %s", localizer.T("access_denied"), hash), s.policy.StatusCodes.Deny) return true + case config.RuleDenyAndReroute: + s.ClearCookie(w, s.cookieName, cookiePath) + lg.Info("deny and reroute", "reroute_to", cr.RerouteTo) + if cr.RerouteTo == nil || *cr.RerouteTo == "" { + lg.Error("reroute URL is missing for DENY_AND_REROUTE action") + s.respondWithError(w, r, "Internal Server Error: administrator has misconfigured Anubis. Please contact the administrator and ask them to look for the logs around \"maybeReverseProxy.RuleDenyAndReroute\"") + return true + } + // note for others, would it be better to be reverse proxying here? 
+ http.Redirect(w, r, *cr.RerouteTo, http.StatusTemporaryRedirect) + return true case config.RuleChallenge: lg.Debug("challenge requested") case config.RuleBenchmark: @@ -553,6 +564,15 @@ func cr(name string, rule config.Rule, weight int) policy.CheckResult { } } +func crWithReroute(name string, rule config.Rule, weight int, rerouteTo *string) policy.CheckResult { + return policy.CheckResult{ + Name: name, + Rule: rule, + Weight: weight, + RerouteTo: rerouteTo, + } +} + // Check evaluates the list of rules, and returns the result func (s *Server) check(r *http.Request, lg *slog.Logger) (policy.CheckResult, *policy.Bot, error) { host := r.Header.Get("X-Real-Ip") @@ -577,6 +597,8 @@ func (s *Server) check(r *http.Request, lg *slog.Logger) (policy.CheckResult, *p switch b.Action { case config.RuleDeny, config.RuleAllow, config.RuleBenchmark, config.RuleChallenge: return cr("bot/"+b.Name, b.Action, weight), &b, nil + case config.RuleDenyAndReroute: + return crWithReroute("bot/"+b.Name, b.Action, weight, b.RerouteTo), &b, nil case config.RuleWeigh: lg.Debug("adjusting weight", "name", b.Name, "delta", b.Weight.Adjust) weight += b.Weight.Adjust diff --git a/lib/anubis_test.go b/lib/anubis_test.go index 0f6ef7f5c..4aea9df81 100644 --- a/lib/anubis_test.go +++ b/lib/anubis_test.go @@ -666,6 +666,79 @@ func TestRuleChange(t *testing.T) { } } +func TestDenyAndRerouteAction(t *testing.T) { + h := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + t.Log(r.UserAgent()) + w.WriteHeader(http.StatusOK) + fmt.Fprintln(w, "OK") + }) + + pol := loadPolicies(t, "./testdata/deny_and_reroute_test.yaml", 4) + + srv := spawnAnubis(t, Options{ + Next: h, + Policy: pol, + }) + + ts := httptest.NewServer(internal.RemoteXRealIP(true, "tcp", srv)) + defer ts.Close() + + testCases := []struct { + userAgent string + expectedCode int + expectedURL string + }{ + { + userAgent: "REROUTE_ME", + expectedCode: http.StatusTemporaryRedirect, + expectedURL: "https://example.com/tarpit", + 
}, + { + userAgent: "DENY_ME", + expectedCode: http.StatusOK, // From status_codes config + }, + { + userAgent: "ALLOW_ME", + expectedCode: http.StatusOK, + }, + } + + for _, tc := range testCases { + t.Run(tc.userAgent, func(t *testing.T) { + client := &http.Client{ + CheckRedirect: func(req *http.Request, via []*http.Request) error { + // Don't follow redirects, we want to test the redirect response + return http.ErrUseLastResponse + }, + } + + req, err := http.NewRequestWithContext(t.Context(), http.MethodGet, ts.URL, nil) + if err != nil { + t.Fatal(err) + } + + req.Header.Set("User-Agent", tc.userAgent) + + resp, err := client.Do(req) + if err != nil { + t.Fatal(err) + } + defer resp.Body.Close() + + if resp.StatusCode != tc.expectedCode { + t.Errorf("wanted status code %d but got: %d", tc.expectedCode, resp.StatusCode) + } + + if tc.expectedURL != "" { + location := resp.Header.Get("Location") + if location != tc.expectedURL { + t.Errorf("wanted Location header %q but got: %q", tc.expectedURL, location) + } + } + }) + } +} + func TestStripBasePrefixFromRequest(t *testing.T) { testCases := []struct { name string diff --git a/lib/policy/bot.go b/lib/policy/bot.go index 479bccc3a..4bfa90326 100644 --- a/lib/policy/bot.go +++ b/lib/policy/bot.go @@ -14,6 +14,7 @@ type Bot struct { Weight *config.Weight Name string Action config.Rule + RerouteTo *string } func (b Bot) Hash() string { diff --git a/lib/policy/checkresult.go b/lib/policy/checkresult.go index 31737dda5..60934e545 100644 --- a/lib/policy/checkresult.go +++ b/lib/policy/checkresult.go @@ -7,9 +7,10 @@ import ( ) type CheckResult struct { - Name string - Rule config.Rule - Weight int + Name string + Rule config.Rule + Weight int + RerouteTo *string } func (cr CheckResult) LogValue() slog.Value { diff --git a/lib/policy/config/config.go b/lib/policy/config/config.go index 6b5946ae9..a4a3cf92d 100644 --- a/lib/policy/config/config.go +++ b/lib/policy/config/config.go @@ -7,6 +7,7 @@ import ( "io/fs" "net" 
"net/http" + "net/url" "os" "regexp" "strings" @@ -31,22 +32,25 @@ var ( ErrCantSetBotAndImportValuesAtOnce = errors.New("config.BotOrImport: can't set bot rules and import values at the same time") ErrMustSetBotOrImportRules = errors.New("config.BotOrImport: rule definition is invalid, you must set either bot rules or an import statement, not both") ErrStatusCodeNotValid = errors.New("config.StatusCode: status code not valid, must be between 100 and 599") + ErrRerouteURLRequired = errors.New("config.Bot: reroute_to URL is required when using DENY_AND_REROUTE action") + ErrInvalidRerouteURL = errors.New("config.Bot: invalid reroute_to URL") ) type Rule string const ( - RuleUnknown Rule = "" - RuleAllow Rule = "ALLOW" - RuleDeny Rule = "DENY" - RuleChallenge Rule = "CHALLENGE" - RuleWeigh Rule = "WEIGH" - RuleBenchmark Rule = "DEBUG_BENCHMARK" + RuleUnknown Rule = "" + RuleAllow Rule = "ALLOW" + RuleDeny Rule = "DENY" + RuleDenyAndReroute Rule = "DENY_AND_REROUTE" + RuleChallenge Rule = "CHALLENGE" + RuleWeigh Rule = "WEIGH" + RuleBenchmark Rule = "DEBUG_BENCHMARK" ) func (r Rule) Valid() error { switch r { - case RuleAllow, RuleDeny, RuleChallenge, RuleWeigh, RuleBenchmark: + case RuleAllow, RuleDeny, RuleDenyAndReroute, RuleChallenge, RuleWeigh, RuleBenchmark: return nil default: return ErrUnknownAction @@ -65,6 +69,7 @@ type BotConfig struct { Name string `json:"name" yaml:"name"` Action Rule `json:"action" yaml:"action"` RemoteAddr []string `json:"remote_addresses,omitempty" yaml:"remote_addresses,omitempty"` + RerouteTo *string `json:"reroute_to,omitempty" yaml:"reroute_to,omitempty"` // Thoth features GeoIP *GeoIP `json:"geoip,omitempty"` @@ -80,6 +85,7 @@ func (b BotConfig) Zero() bool { b.Action != "", len(b.RemoteAddr) != 0, b.Challenge != nil, + b.RerouteTo != nil, b.GeoIP != nil, b.ASNs != nil, } { @@ -163,8 +169,8 @@ func (b *BotConfig) Valid() error { } } - switch b.Action { - case RuleAllow, RuleBenchmark, RuleChallenge, RuleDeny, RuleWeigh: + switch 
b.Action { // todo(json) refactor to use method above + case RuleAllow, RuleBenchmark, RuleChallenge, RuleDeny, RuleDenyAndReroute, RuleWeigh: // okay default: errs = append(errs, fmt.Errorf("%w: %q", ErrUnknownAction, b.Action)) @@ -180,6 +186,18 @@ func (b *BotConfig) Valid() error { b.Weight = &Weight{Adjust: 5} } + if b.Action == RuleDenyAndReroute { + if b.RerouteTo == nil || *b.RerouteTo == "" { + errs = append(errs, ErrRerouteURLRequired) + } else { + if u, err := url.Parse(*b.RerouteTo); err != nil { + errs = append(errs, fmt.Errorf("%w: %v", ErrInvalidRerouteURL, err)) + } else if !u.IsAbs() { + errs = append(errs, fmt.Errorf("%w: URL must be absolute (include scheme)", ErrInvalidRerouteURL)) + } + } + } + if len(errs) != 0 { return fmt.Errorf("config: bot entry for %q is not valid:\n%w", b.Name, errors.Join(errs...)) } diff --git a/lib/policy/config/config_test.go b/lib/policy/config/config_test.go index 3a96c9c9b..dcd593132 100644 --- a/lib/policy/config/config_test.go +++ b/lib/policy/config/config_test.go @@ -187,6 +187,65 @@ func TestBotValid(t *testing.T) { }, }, }, + { + name: "deny and reroute with valid URL", + bot: BotConfig{ + Name: "reroute-bot", + Action: RuleDenyAndReroute, + UserAgentRegex: p("BadBot"), + RerouteTo: p("https://example.com/tarpit"), + }, + err: nil, + }, + { + name: "deny and reroute with localhost URL", + bot: BotConfig{ + Name: "reroute-localhost", + Action: RuleDenyAndReroute, + UserAgentRegex: p("SpamBot"), + RerouteTo: p("http://localhost:8080/poison"), + }, + err: nil, + }, + { + name: "deny and reroute missing URL", + bot: BotConfig{ + Name: "reroute-missing-url", + Action: RuleDenyAndReroute, + UserAgentRegex: p("BadBot"), + }, + err: ErrRerouteURLRequired, + }, + { + name: "deny and reroute with empty URL", + bot: BotConfig{ + Name: "reroute-empty-url", + Action: RuleDenyAndReroute, + UserAgentRegex: p("BadBot"), + RerouteTo: p(""), + }, + err: ErrRerouteURLRequired, + }, + { + name: "deny and reroute with invalid 
URL", + bot: BotConfig{ + Name: "reroute-invalid-url", + Action: RuleDenyAndReroute, + UserAgentRegex: p("BadBot"), + RerouteTo: p("not-a-valid-url"), + }, + err: ErrInvalidRerouteURL, + }, + { + name: "deny and reroute with malformed URL", + bot: BotConfig{ + Name: "reroute-malformed-url", + Action: RuleDenyAndReroute, + UserAgentRegex: p("BadBot"), + RerouteTo: p("http://[invalid-ipv6"), + }, + err: ErrInvalidRerouteURL, + }, } for _, cs := range tests { @@ -367,4 +426,9 @@ func TestBotConfigZero(t *testing.T) { if b.Zero() { t.Error("config.BotConfig with challenge rules is zero value") } + + b.RerouteTo = p("https://example.com/tarpit") + if b.Zero() { + t.Error("BotConfig with reroute URL is zero value") + } } diff --git a/lib/policy/config/testdata/bad/deny_and_reroute_invalid_url.json b/lib/policy/config/testdata/bad/deny_and_reroute_invalid_url.json new file mode 100644 index 000000000..c454ba36e --- /dev/null +++ b/lib/policy/config/testdata/bad/deny_and_reroute_invalid_url.json @@ -0,0 +1,26 @@ +{ + "bots": [ + { + "name": "invalid-url", + "action": "DENY_AND_REROUTE", + "user_agent_regex": "BadBot", + "reroute_to": "not-a-valid-url" + } + ], + "status_codes": { + "CHALLENGE": 200, + "DENY": 403 + }, + "thresholds": [ + { + "name": "legacy-anubis-behaviour", + "expression": "true", + "action": "CHALLENGE", + "challenge": { + "algorithm": "fast", + "difficulty": 5, + "report_as": 5 + } + } + ] +} \ No newline at end of file diff --git a/lib/policy/config/testdata/bad/deny_and_reroute_missing_url.yaml b/lib/policy/config/testdata/bad/deny_and_reroute_missing_url.yaml new file mode 100644 index 000000000..dc11ebc44 --- /dev/null +++ b/lib/policy/config/testdata/bad/deny_and_reroute_missing_url.yaml @@ -0,0 +1,18 @@ +bots: + - name: "missing-url" + action: "DENY_AND_REROUTE" + user_agent_regex: "BadBot" + # Missing reroute_to field should cause validation error + +status_codes: + CHALLENGE: 200 + DENY: 403 + +thresholds: + - name: "legacy-anubis-behaviour" + 
expression: "true" + action: "CHALLENGE" + challenge: + algorithm: "fast" + difficulty: 5 + report_as: 5 \ No newline at end of file diff --git a/lib/policy/config/testdata/good/deny_and_reroute_basic.json b/lib/policy/config/testdata/good/deny_and_reroute_basic.json new file mode 100644 index 000000000..3e14e722c --- /dev/null +++ b/lib/policy/config/testdata/good/deny_and_reroute_basic.json @@ -0,0 +1,31 @@ +{ + "bots": [ + { + "name": "ai-scrapers-tarpit", + "action": "DENY_AND_REROUTE", + "user_agent_regex": "(ChatGPT|GPTBot|Claude-Web)", + "reroute_to": "https://example.com/tarpit" + }, + { + "name": "allow-legitimate", + "action": "ALLOW", + "user_agent_regex": "(Googlebot|Bingbot)" + } + ], + "status_codes": { + "CHALLENGE": 200, + "DENY": 403 + }, + "thresholds": [ + { + "name": "legacy-anubis-behaviour", + "expression": "true", + "action": "CHALLENGE", + "challenge": { + "algorithm": "fast", + "difficulty": 5, + "report_as": 5 + } + } + ] +} \ No newline at end of file diff --git a/lib/policy/config/testdata/good/deny_and_reroute_basic.yaml b/lib/policy/config/testdata/good/deny_and_reroute_basic.yaml new file mode 100644 index 000000000..b646f7435 --- /dev/null +++ b/lib/policy/config/testdata/good/deny_and_reroute_basic.yaml @@ -0,0 +1,22 @@ +bots: + - name: "ai-scrapers-tarpit" + action: "DENY_AND_REROUTE" + user_agent_regex: "(ChatGPT|GPTBot|Claude-Web)" + reroute_to: "https://example.com/tarpit" + + - name: "allow-legitimate" + action: "ALLOW" + user_agent_regex: "(Googlebot|Bingbot)" + +status_codes: + CHALLENGE: 200 + DENY: 403 + +thresholds: + - name: "legacy-anubis-behaviour" + expression: "true" + action: "CHALLENGE" + challenge: + algorithm: "fast" + difficulty: 5 + report_as: 5 \ No newline at end of file diff --git a/lib/policy/config/testdata/good/deny_and_reroute_mixed.yaml b/lib/policy/config/testdata/good/deny_and_reroute_mixed.yaml new file mode 100644 index 000000000..3918f8e20 --- /dev/null +++ 
b/lib/policy/config/testdata/good/deny_and_reroute_mixed.yaml @@ -0,0 +1,33 @@ +bots: + - name: "reroute-scrapers" + action: "DENY_AND_REROUTE" + user_agent_regex: "ScrapingBot" + reroute_to: "http://localhost:8080/poison" + + - name: "block-bad-actors" + action: "DENY" + user_agent_regex: "BadActor" + + - name: "challenge-suspicious" + action: "CHALLENGE" + user_agent_regex: "SuspiciousBot" + challenge: + algorithm: "fast" + difficulty: 6 + + - name: "allow-good" + action: "ALLOW" + user_agent_regex: "GoodBot" + +status_codes: + CHALLENGE: 200 + DENY: 403 + +thresholds: + - name: "legacy-anubis-behaviour" + expression: "true" + action: "CHALLENGE" + challenge: + algorithm: "fast" + difficulty: 5 + report_as: 5 \ No newline at end of file diff --git a/lib/policy/policy.go b/lib/policy/policy.go index 5493d8dd2..3dfc60cdb 100644 --- a/lib/policy/policy.go +++ b/lib/policy/policy.go @@ -69,8 +69,9 @@ func ParseConfig(ctx context.Context, fin io.Reader, fname string, defaultDiffic } parsedBot := Bot{ - Name: b.Name, - Action: b.Action, + Name: b.Name, + Action: b.Action, + RerouteTo: b.RerouteTo, } cl := checker.List{} diff --git a/lib/testdata/deny_and_reroute_test.yaml b/lib/testdata/deny_and_reroute_test.yaml new file mode 100644 index 000000000..212a89100 --- /dev/null +++ b/lib/testdata/deny_and_reroute_test.yaml @@ -0,0 +1,26 @@ +bots: + - name: "reroute-bot" + action: "DENY_AND_REROUTE" + user_agent_regex: "REROUTE_ME" + reroute_to: "https://example.com/tarpit" + + - name: "deny-bot" + action: "DENY" + user_agent_regex: "DENY_ME" + + - name: "allow-bot" + action: "ALLOW" + user_agent_regex: "ALLOW_ME" + +status_codes: + CHALLENGE: 200 + DENY: 200 + +thresholds: + - name: "legacy-anubis-behaviour" + expression: "true" + action: "CHALLENGE" + challenge: + algorithm: "fast" + difficulty: 5 + report_as: 5 \ No newline at end of file
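
Taken together, the contract this patch establishes for `DENY_AND_REROUTE` is: a matching request receives a `307 Temporary Redirect` whose `Location` header carries the policy's `reroute_to` URL. The standalone sketch below is not Anubis code; the handler and the `tarpitURL` constant are stand-ins. It shows how to observe that contract with `net/http/httptest`, using the same `CheckRedirect`/`ErrUseLastResponse` trick as `TestDenyAndRerouteAction` above:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// tarpitURL is a placeholder; in a real deployment this would be the
// reroute_to value from the Anubis policy file.
const tarpitURL = "https://tarpit.example.com/honeypot"

// probeReroute spins up a stand-in handler that answers the way the patch
// does for DENY_AND_REROUTE (http.Redirect with 307), issues one request
// without following redirects, and reports what came back.
func probeReroute() (int, string) {
	h := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Redirect(w, r, tarpitURL, http.StatusTemporaryRedirect)
	})
	ts := httptest.NewServer(h)
	defer ts.Close()

	client := &http.Client{
		// Stop at the first response so the redirect itself can be inspected.
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}
	resp, err := client.Get(ts.URL)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	return resp.StatusCode, resp.Header.Get("Location")
}

func main() {
	code, loc := probeReroute()
	fmt.Println(code, loc)
}
```

Running it prints `307 https://tarpit.example.com/honeypot`, mirroring what the new test asserts against a real Anubis instance.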