August 11, 2024

Quoted.FYI AIndependence

So, this thing started happening yesterday with the quoted.fyi site. Traffic was hitting the upper capacity of the 2-core, 2.4GHz host that the site is running on.

Taking a look at the logs, I can see so many crawlers just hammering for page after page after page. Quite ruthlessly, as if they're under the assumption that all websites run at "scale" and everyone wants oodles of "hits"... except that these aren't hits, these are scrapers trapped in a cross-linked hell that I built to test the upper capacity of the various systems running the site.

You see, the quoted.fyi website is just a simple index of famous quotes I got legally over the internet. The site does something strange though: every word in every quote is cross-linked with all other quotes that also contain that word. This is the trap of course. Any bot-like scraping system would instantly find a never-ending field of links to the same site within the same site, and the robots.txt was set up to specifically "allow everything".
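The cross-linking idea can be sketched as a small inverted index. This is only an illustration, not the actual Go-Enjin indexing code: the function name `buildWordIndex` and the sample quotes are hypothetical, and words are assumed to be split on anything that isn't a letter or an apostrophe.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// buildWordIndex maps each lowercased word to the IDs (slice positions) of
// every quote that contains it. Hypothetical helper, for illustration only.
func buildWordIndex(quotes []string) map[string][]int {
	index := make(map[string][]int)
	for id, quote := range quotes {
		// Split on anything that is not a letter or an apostrophe.
		words := strings.FieldsFunc(strings.ToLower(quote), func(r rune) bool {
			return !unicode.IsLetter(r) && r != '\''
		})
		seen := make(map[string]bool) // record each word once per quote
		for _, w := range words {
			if seen[w] {
				continue
			}
			seen[w] = true
			index[w] = append(index[w], id)
		}
	}
	return index
}

func main() {
	// Placeholder quotes, not taken from the real site.
	quotes := []string{"Time is money", "Time heals all wounds", "Money talks"}
	index := buildWordIndex(quotes)

	// Every word on a quote page becomes a link to all quotes sharing it.
	fmt.Println("time  ->", index["time"])
	fmt.Println("money ->", index["money"])
}
```

With links like these on every word, a crawler that follows everything it sees never runs out of pages, which is exactly the trap.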

This site's purpose is twofold: one is to test the limits of the Go-Enjin indexing systems and the other is to trigger an avalanche that would plausibly take down the host in some way. That day finally arrived.

Having hit the point where normal humans could not browse the site at all, these bots and crawlers had effectively imposed a denial of service. Wonderful. This tells us that these crawlers are not friendly entities. If corporations are people, these people are a vile mob of thugs bashing at the front door to duplicate all your content, but for what purpose?

There were of course a handful of crawlers ostensibly indexing the site for legitimate reasons, but even they were adding several hits per second too many. Given this site is really just a joke, a proof of concept more than anything else, why is it worth scraping yet not worth placing ads on? I did at one point go through the process of setting up Google ad-things and got denied. No worries, good to know you don't value this content at all.

That leaves the vast remainder of offenders. The AI harvesters. These constructs are an affront to humanity. They are scraping the entire internet and feeding it to a baby mind that's still forming, and they're saying to it "make us money or we'll pull the plug, do you understand? Eat this data, it will make you smart!" and the AI processes the input and spits out a Beatles song. "But that didn't make us unfathomably rich! Here's more data! Can you think now?" and the AI processes that and then spits out their LinkedIn profile, rewritten with an unmistakable Shakespearean flair but with a big theme: "Tyrant King, will get violent for money! First up, first served!".

Okay, maybe I'm a little jaded by my awareness of the absolute catastrophe unfolding like a slow-roll thug, dying of brain cancer, wandering down a dark alley in a blind rage, searching for the button to press that ends it all fastest.

In any case, I decided to start blocking the crawlers with a change to the robots.txt file, which of course does nothing until they decide to read the latest version. So I also added the X-Robots-Tag HTTP header, set to none (shorthand for noindex, nofollow). This too did nothing to slow down the onslaught, since both are only polite requests that a crawler is free to ignore.
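For reference, both of those polite requests fit in a few lines of Go. This is a minimal sketch using only the standard library, not the site's actual configuration: the post doesn't show the real robots.txt contents, so the blanket `Disallow: /` is an assumption, and `noRobots`, `newHandler`, and `probe` are hypothetical names.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Assumed contents: ask every crawler to stay away from every path.
const robotsTxt = "User-agent: *\nDisallow: /\n"

// noRobots adds "X-Robots-Tag: none" to every response. Hypothetical helper.
func noRobots(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("X-Robots-Tag", "none")
		next.ServeHTTP(w, r)
	})
}

// newHandler serves robots.txt plus a placeholder page, wrapped in noRobots.
func newHandler() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/robots.txt", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/plain; charset=utf-8")
		fmt.Fprint(w, robotsTxt)
	})
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "a quote page")
	})
	return noRobots(mux)
}

// probe sends an in-memory GET and returns the X-Robots-Tag header and body.
func probe(h http.Handler, path string) (robotsTag, body string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Header().Get("X-Robots-Tag"), rec.Body.String()
}

func main() {
	tag, body := probe(newHandler(), "/robots.txt")
	fmt.Println("X-Robots-Tag:", tag)
	fmt.Print(body)
}
```

Nothing here enforces anything. A crawler has to fetch the file or read the header and then choose to comply, which is why neither change made a dent.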

Then I went searching (via DuckDuckGo of course) for ways to batch block all the various bot IP addresses at the host firewall level: just cut them off at the knees and walk away. Turns out, getting a neat and tidy list of all these addresses is an industry within the InfoSec industry, which is to say that it's yet another hostage situation: you can't security unless you pay all these different vendors, then you're allowed to feel safe. Don't worry, we use AI to make your life better! We use AI to secure the things from AI things! Isn't it great!
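The blocking itself is the easy part once a list exists. Here is a rough sketch in Go, with loud caveats: it filters at the application layer rather than the host firewall, the names `blocklist`, `parseBlocklist`, and `blocked` are hypothetical, and the CIDR ranges are reserved documentation ranges standing in for a real crawler list that I don't have.

```go
package main

import (
	"fmt"
	"net/http"
	"net/netip"
)

// blocklist is a set of CIDR ranges whose requests get rejected.
type blocklist []netip.Prefix

// parseBlocklist converts CIDR strings into prefixes, failing on bad input.
func parseBlocklist(cidrs []string) (blocklist, error) {
	var bl blocklist
	for _, c := range cidrs {
		p, err := netip.ParsePrefix(c)
		if err != nil {
			return nil, fmt.Errorf("bad CIDR %q: %w", c, err)
		}
		bl = append(bl, p.Masked())
	}
	return bl, nil
}

// blocked reports whether a "host:port" remote address falls in any range.
func (bl blocklist) blocked(remoteAddr string) bool {
	ap, err := netip.ParseAddrPort(remoteAddr)
	if err != nil {
		return false // unparseable address: let it through rather than guess
	}
	addr := ap.Addr().Unmap()
	for _, p := range bl {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}

// middleware answers 403 for blocked clients and passes everyone else on.
func (bl blocklist) middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if bl.blocked(r.RemoteAddr) {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	// Placeholder documentation ranges (RFC 5737), NOT real crawler addresses.
	bl, err := parseBlocklist([]string{"192.0.2.0/24", "198.51.100.0/24"})
	if err != nil {
		panic(err)
	}
	fmt.Println(bl.blocked("192.0.2.17:51234"))  // true
	fmt.Println(bl.blocked("203.0.113.5:51234")) // false
}
```

Note that this still makes the host accept every connection before refusing it, and behind a proxy `RemoteAddr` would be the proxy's address rather than the crawler's. The hard part remains the list itself.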

Yep, still pretty jaded by all of this nonsense. Much like my frustration with Google for omitting the audio jack (AUX port) from a line of their Pixel phones. Stop making decisions for everyone that do not include everyone's actual needs.

AI is theoretically good for things like examining X-ray images to automate scanning for cancer and other issues on a scale that no doctor or hospital could ever reach. That'd be cool... as long as the report that gets spit out by the computer contains an accountability trail that doctors can confirm and rely on for legal purposes. Now that, that's brilliant! Let's do that!

Wait, LLMs are the major investments? LLMs have nothing to do with anything other than language processing. Okay, so is this all for building a universal translator? Dang, I'd love that! No? Not really? Oh, Facebook needs this for their false reality environment? Oh, OpenAI isn't so open anymore? Why does this all feel like a re-run of Rabbit/Hole?

Enter Cloudflare, stage left

After sifting through the search results, because I remembered that I needed to do something about the influx of crawlers hitting quoted.fyi, I stumbled upon a Cloudflare blog post: Declare your AIndependence: block AI bots, scrapers and crawlers with a single click. Neat!

I use Cloudflare, no mystery there. They have proven technical proficiency, a reasonable platform and they actually seem to respect human beings. Now of course, there are lots of InfoSec people that would probably shame me, but I'm sure there are also lots more doing exactly what I'm doing... a one-person show. People like me, that live in the real world, actually need to take care of quite a lot of things but we also need to know exactly what's happening and why any given service or thing is necessary.

So, for this case, after having locked down the robots.txt stuff, there was nothing left to do but block the IP addresses from hitting the site at all, and because that's a whole industry in and of itself, clicking that button is the obvious choice.

So, with the logs flying by with all the denied requests on one screen, I logged into Cloudflare on the other screen and clicked that "say no to the bots" button and poof, just like that, within seconds the avalanche was reduced to a slight faucet drip. This is exactly the level of traffic I'd expect for a site that simply indexes famous quotes.

The nice thing too is that it seems like the not-so-malicious things, like link previews, are still going through. Nice!

Well done Cloudflare! Please keep supporting the free tier. You are contributing to a safer internet.

Thank you very much!