Long story short, my VPS, which forwards traffic to my home servers through Tailscale, got hammered by thousands of requests per minute from Anthropic’s Claude AI crawler, all of them coming from different AWS IPs.

The VPS has a 1 TB monthly cap, but it’s still kinda shitty to get huge spikes like the 13 GB it burned in just a couple of minutes today.

How do you deal with something like this?
I’m only really running a Caddy reverse proxy on the VPS, which forwards my home server’s services through Tailscale.

I’d really like to avoid solutions like Cloudflare, since they screw over CGNAT users very frequently and all that. I don’t think a WAF would help with this at all(?), but rate limiting on the reverse proxy might work.
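For the rate-limiting idea: as far as I can tell Caddy doesn’t ship a rate limiter in the standard build, but the mholt/caddy-ratelimit plugin adds a rate_limit directive. A rough, untested sketch of what I mean (the zone name, limits, and backend address are all placeholders):

```
# Caddyfile — requires a custom Caddy build with github.com/mholt/caddy-ratelimit
example.com {
    rate_limit {
        zone per_ip {             # arbitrary zone name
            key    {remote_host}  # one bucket per client IP
            events 60             # allow at most 60 requests...
            window 1m             # ...per minute per bucket
        }
    }
    reverse_proxy 100.64.0.5:8080 # placeholder Tailscale backend
}
```

Though with the requests coming from thousands of rotating AWS IPs, a per-IP key might not bite much; something coarser than {remote_host} may be needed.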

(The VPS has fail2ban, and I’m using /etc/hosts.deny for manual blocking. There’s a WIP website on my root domain with a robots.txt that should be denying AWS bots as well…)
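Since the IPs rotate constantly anyway, I’m also wondering about blocking on the self-reported User-Agent at the proxy instead of per-IP. A sketch of that (the agent names are the publicly documented ones; nothing stops the bots from changing them):

```
# Caddyfile — refuse self-identified AI crawlers by User-Agent
example.com {
    @aibots header_regexp User-Agent "(?i)(ClaudeBot|anthropic-ai|Claude-Web|GPTBot|CCBot)"
    respond @aibots 403
    reverse_proxy 100.64.0.5:8080 # placeholder Tailscale backend
}
```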

I’m still learning and would really appreciate any suggestions.

  • poVoq@slrpnk.net · 10 hours ago

    It seems any somewhat easy-to-implement solution gets circumvented by them quickly. Some of the bots do respect robots.txt though, if you explicitly add their self-reported user-agent (but they change it from time to time). This repo has a regularly updated list: https://github.com/ai-robots-txt/ai.robots.txt/
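    For example, a minimal robots.txt using a few of the self-reported agents from that list (an illustrative subset; check the repo for the current full set):

    ```
    # Deny a handful of AI crawlers by their self-reported user-agents
    User-agent: ClaudeBot
    User-agent: anthropic-ai
    User-agent: Claude-Web
    User-agent: GPTBot
    User-agent: CCBot
    Disallow: /
    ```

    Consecutive User-agent lines share the Disallow rule that follows, so one group covers all of them.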

    In my experience, git forges are hit especially hard, and the only real solution I’ve found is to put a login wall in front, which kinda sucks, especially for open-source projects you want to self-host.

    Oh, and recently the mlmym (old-Reddit-style) frontend for Lemmy seems to have started attracting AI scraping as well. We had to turn it off on our instance because of that.

    • zoey@lemmy.librebun.com (OP) · 10 hours ago

      > In my experience, git forges are hit especially hard

      Is that why my Forgejo instance has been hammered like crazy twice before…
      Why can’t we have nice things. Thank you!

      EDIT: Hopefully Photon doesn’t get in their sights as well. Though after using the official Lemmy web UI for a while, I do really like it a lot.

      • poVoq@slrpnk.net · 9 hours ago

        Yeah, Forgejo and Gitea. I think it is partially a problem of insufficient caching on the side of these git forges that makes it especially bad, but in the end that is victim blaming 🫠

        Mlmym seems to be a target because it’s mostly JavaScript-free and therefore easier to scrape, I think. But the other Lemmy frontends aren’t well protected either. Lemmy-ui doesn’t even let you easily add a custom robots.txt; you have to manually override it in the reverse proxy.
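        With Caddy, that override can look something like this (just a sketch; the paths and backend address are placeholders):

        ```
        # Caddyfile — serve our own robots.txt instead of lemmy-ui's
        lemmy.example.org {
            handle /robots.txt {
                root * /srv/lemmy-overrides  # directory holding the custom robots.txt
                file_server
            }
            handle {
                reverse_proxy localhost:1234 # placeholder lemmy-ui backend
            }
        }
        ```

        The handle blocks are mutually exclusive, so /robots.txt is served from disk and everything else still goes to lemmy-ui.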