Everyone loves the dream of a free-for-all and open web.
But the reality is: how can someone small protect their blog or content from AI training bots? E.g., are they just supposed to blindly trust that whoever is crawling honestly distinguishes agent traffic from training bots and super-duper respects robots.txt? Get real...
Or fine, say they do respect robots.txt, but then they buy data that may or may not have been shielded through liability layers as "licensed data"?
Unless you're Reddit, X, Google, or Meta, with a scarily unlimited legal budget, you have no power.
> Everyone loves the dream of a free-for-all and open web... But the reality is how can someone small protect their blog or content from AI training bots?
Aren't these statements entirely in conflict? You either have a free-for-all open web or you don't. Blocking AI training bots is not free and open for all.
No, that is not true. It is only true if you just equate "AI training bots" with "people" on some kind of nominal basis without considering how they operate in practice.
It is like saying "If your grocery store is open to the public, why is it not open to this herd of rhinoceroses?" Well, the reason is because rhinoceroses are simply not going to stroll up and down the aisles and head to the checkout line quietly with a box of cereal and a few bananas. They're going to knock over displays and maybe even shelves and they're going to damage goods and generally make the grocery store unusable for everyone else. You can say "Well, then your problem isn't rhinoceroses, it's entities that damage the store and impede others from using it" and I will say "Yes, and rhinoceroses are in that group, so they are banned".
It's certainly possible to imagine a world where AI bots use websites in more acceptable ways --- in fact, it's more or less the world we had prior to about 2022, where scrapers did exist but were generally manageable with widely available techniques. But that isn't the world that we live in today. It's also certainly true that many humans are using websites in evil ways (notably including the humans who are controlling many of these bots), and it's also very true that those humans should be held accountable for their actions. But that doesn't mean that blocking bots makes the internet somehow unfree.
This type of thinking that freedom means no restrictions makes sense only in a sort of logical dreamworld disconnected from practical reality. It's similar to the idea that "freedom" in the socioeconomic sphere means the unrestricted right to do whatever you please with resources you control. Well, no, that is just your freedom. But freedom globally construed requires everyone to have autonomy and be able to do things, not just those people with lots of resources.
You have a problem with badly behaved scrapers, not AI.
I can't disagree with being against badly behaved scrapers. But this is neither a new problem nor an interesting one from the standpoint of making information freely available to everyone, even rhinoceroses, assuming they are well behaved. Blocking bad actors is not the same thing as blocking AI.
The thing is that rhinoceroses aren't well-behaved. Even if some small fraction of them in theory might be well-behaved, the payoff from trying to account for that is too small to bother. If 99% of rhinoceroses aren't well-behaved, the simple and correct response is to ban them all, and then maybe the nice ones can ask for a special permit. You switch from allow-by-default to block-by-default.
Similarly it doesn't make sense to talk about what happens if AI bots were well-behaved. If they are, then maybe that would be okay, but they aren't, so we're not talking about some theoretical (or past) situation where bots were well-behaved and scraped in a non-disruptive fashion. We're talking about the present reality in which there actually are enormous numbers of badly-behaved bots.
Incidentally, I see that in a lot of your responses on this thread you keep suggesting that people's problem is "not with AI" but with something else. But look at your comment that I initially replied to:
> Blocking AI training bots is not free and open for all.
We're not talking about "AI". We're talking about AI training bots. If people want to develop AI as a theoretical construct and train it on datasets they download separately in a non-disruptive way, great. (Well, actually it's still terrible, but for other reasons. :-) ) But that's not what people are responding to in this thread. They're talking about AI training bots that scrape websites in a way that is objectively more harmful than previous generations of scrapers.
ISPs are supposed to disconnect abusive customers. The correct thing to do is probably contact the ISP. Don't complain about the scraping; complain about the DDoS, which is the actual problem and, I'm increasingly beginning to believe, the intent.
But many people feel that the very act of incorporating your copyrighted words into their for-profit training set is itself the bad behavior. It's not about rate-limiting scrapers, it's letting them in the door in the first place.
Why was it OK for Google to incorporate their words into a for-profit search index which has increasingly sucked all the profit out of the system?
My Ithaca friends on Facebook complain incessantly about the very existence of AI, to the extent that I wouldn't want to admit that I ask Copilot how to use Windows Narrator, or ask Junie where the CSS is that makes this text bold, or sometimes have Photoshop draw an extra row of bricks in a photograph for me.
The same people seem to have no problem with Facebook using their words for all things Facebook uses them for, however.
They were okay with it when Google was sending them traffic. Now it often doesn't. Google has broken the social contract of the web. So why should the sites whose work is being scraped be expected to continue upholding their end?
Not only are they scraping without sending traffic, they're doing so much more aggressively than Google ever did; Google, at least, respected robots.txt and kept to the same user-agent. They didn't want to index something that a server didn't want indexed. AI bots, on the other hand, want to index every possible thing regardless of what anyone else says.
> Why was it OK for Google to incorporate their words into a for-profit search index which has increasingly sucked all the profit out of the system?
It wasn't okay, it's just that the reasons it wasn't okay didn't become apparent until later.
> The same people seem to have no problem with Facebook using their words for all things Facebook uses them for, however.
Many of those people will likely have a problem with it later, for reasons that are happening now but that they won't become fully aware of until later.
> My Ithaca friends on Facebook complain incessantly about the very existence of AI to the extent that I would not want to say I ask Copilot how to use Windows Narrator or Junie where the CSS that makes this text bold or sometimes have Photoshop draw an extra row of bricks in a photograph for me.
Good! Why would you willingly confess any of that? I'd be humiliated if I did any of that.
Sure. But we're already talking about a presumption of free and open here. I'm sure people are also reading my words and incorporating them into their own for-profit work. If I cared, I wouldn't make it free and open in the first place.
> You have a problem with badly behaved scrapers, not AI.
And you have a problem understanding that "freedom and openness" extend only to where the rights (i.e., the freedom) of another legal entity begin. When I don't want "AI" (not just the badly-behaved subset) rifling through my website, then I should be well within my rights to disallow just that, in the same way it's your right to allow them access to your playground. It's not rocket science.
This is not what the parent means. What they mean is that such behavior is hypocrisy: you get access to truly free websites whose owners are interested in having smart chatbots trained on the free web, but you block said chatbots while touting a "free Internet" message.
Badly behaved scrapers are not a new problem, but badly behaved scrapers run by multibillion-dollar companies which use every possible trick to bypass every block or restriction or rate limit you put in front of them is a completely new problem on a scale we've never seen before.
You can always stop bots: add a login and password. But people want their content to be accessible to as large an audience as possible while at the same time not wanting that data to be accessible to the same audience via other channels. Logic. Bots are not consuming your data; humans are. At the end of the day humans will eventually read it and take action. For example, ChatGPT will mention your site and the user will visit it.
And no, nothing was different before 2022. Just look at google, the largest bot scraping network in the world. Since 1996.
> And no, nothing was different before 2022. Just look at google, the largest bot scraping network in the world. Since 1996.
I'm sorry, but this statement shows you have no recent experience with these crawlernets.
Google, from the beginning, has done their best to work with server owners. They respect robots.txt. I think they were the first to implement Crawl-Delay. They crawl based on how often things change anyway. They have an additional safeguard that when they notice a slowdown in your responses, they back off.
Compare this with Anthropic. On their website they say they follow robots.txt and Crawl-Delay. I have an explicit ban on Claudebot in there and a Crawl-Delay for everyone else. It ignores both. I sent them an email about this, and their answer didn't address the discrepancy between the docs and the behaviour. They just said they'd add me to their internal whitelist and that I should've sent 429s when they were going too fast. (Fuck off, how about you follow your public documentation?)
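For reference, a robots.txt along the lines described above looks something like this (the user-agent token and delay value are illustrative; check each crawler's docs for the token it actually matches on):

```
# Ban Anthropic's crawler entirely
User-agent: ClaudeBot
Disallow: /

# Everyone else may crawl, but slowly.
# Note: Crawl-delay is a de facto extension, not part of the
# robots.txt standard, and many crawlers ignore it.
User-agent: *
Crawl-delay: 10
```

The whole mechanism is voluntary, which is exactly the complaint: it only works against crawlers that choose to honor it.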
That's just my experience, but if you Google around you'll find that Anthropic is notorious for ignoring robots.txt.
And still, Claudebot is one of the better behaved bots. At least they identify themselves, have a support email they respond to, and use identifiable IP-addresses.
A few weeks ago I spent four days figuring out why I had 20x the traffic I normally have (which maxed out the server, causing user complaints). Turns out there are parties that crawl using millions of (residential) IPs and identify themselves as normal browsers. Only 1 or 2 connections per IP at a time. Randomization of identifying properties. Even Anthropic's 429 solution wouldn't have worked there.
I managed to find a minor identifying property in some of the requests that wasn't catching too many real users. I used that to start firewalling IPs on sight and then their own randomization caused every IP to fall into the trap in the end. But it took days.
In the end I had to firewall nearly 3 million non-consecutive IP addresses.
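The approach described above, finding a small fingerprint in the requests and firewalling matching IPs on sight, can be sketched roughly like this. Everything here is hypothetical: the commenter never disclosed the actual identifying property, so the fingerprint below is a made-up stand-in, and the nftables set name is invented for illustration.

```python
import re

# Hypothetical fingerprint: pretend the rogue crawler always sends a
# distinctive Accept-Language value. The real property the commenter
# found is not known; substitute whatever your own logs reveal.
FINGERPRINT = re.compile(r"Accept-Language: en-US;q=0\.9,\*$")

def ips_to_block(log_lines, seen=None):
    """Scan access-log-style lines (source IP as the first field) and
    return one firewall command per newly seen IP whose request
    matches the fingerprint."""
    seen = set() if seen is None else seen
    cmds = []
    for line in log_lines:
        ip = line.split()[0]
        if ip not in seen and FINGERPRINT.search(line):
            seen.add(ip)
            # "badbots" is an assumed pre-created nftables named set
            cmds.append(f"nft add element inet filter badbots {{ {ip} }}")
    return cmds
```

Run repeatedly over fresh log lines, the randomization works against the crawler: eventually every IP in the pool emits a fingerprinted request and falls into the trap.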
So no, Google in 1996 or 2006 or 2016 is not the same as the modern DDoSing crawlernet.
> "If your grocery store is open to the public, why is it not open to this herd of rhinoceroses?"
What this scenario actually reveals is that the words "open to the public" are not intended to mean "access is completely unrestricted".
It's fine to not want to give completely unrestricted access to something. What's not fine, or at least what complicates things unnecessarily, is using words like "open and free" to describe this desired actually-we-do-want-to-impose-certain-unstated-restrictions contract.
I think people use words like "open and free" to describe the actually-restricted contracts they want to have because they're often among like-minded people for whom these unstated additional restrictions are tacitly understood -- or, simply because it sounds good. But for precise communication with a diverse audience, using this kind of language is at best confusing, at worst disingenuous.
Nobody has ever meant "access is completely unrestricted".
As a trivial example: what website is going to welcome DDoS attacks or hacking attempts with open arms? Is a website no longer "open to the public" if it has DDoS protection or a WAF? What if the DDoS makes the website unavailable to the vast majority of users: does blocking the DDoS make it more or less open?
Similarly, if a concert is "open to the public", does that mean they'll be totally fine with you bringing a megaphone and yelling through the performance? Will they be okay with you setting the stage on fire? Will they just stand there and say "aw shucks" if you start blocking other people from entering?
You can try to rules-lawyer your way around commonly-understood definitions, but deliberately and obtusely misinterpreting such phrasing isn't going to lead to any kind of productive discussion.
>You can try to rules-lawyer your way around commonly-understood definitions
Despite your assertions to the contrary, "actually free to use for any purpose" is a commonly understood interpretation of "free to use for any purpose" -- see permissive software licenses, where licensors famously don't get to say "But I didn't mean big companies get to use it for free too!"
The onus is on the person using a term like "free" or "open" to clarify the restrictions they actually intend, if any. Putting the onus anywhere else immediately opens the way for misunderstandings, accidental or otherwise.
To make your concert analogy actually fit: a scraper is like a company that sends 1000 robots with tape recorders to your "open to the public" concert. They do only the things an ordinary member of the public does; they can't do anything else. The most "damage" they can do is to keep humans who would enjoy the concert from being able to attend if there aren't enough seats; whatever additional costs they cause (air conditioning, let's say) are the same as the costs that would have been incurred by that many humans.
They're basically describing the tragedy of the commons, but where a handful of the people have bulldozers to rip up all the grass and trees.
We can't have nice things because the powerful cannot be held accountable. The powerful are powerful due to their legal teams and money, and power is the ability to carve exceptions to rules.
How so? If you don't want AI bots reading information on the web, you don't actually want a free and open web. The reality of an open web is that such information is free and available for anyone.
> If you don't want AI bots reading information on the web, you don't actually want a free and open web.
This is such a bad faith argument.
We want a town center for the whole community to enjoy! What, you don't like those people shooting up drugs over there? But they're enjoying it too, this is what you wanted right? They're not harming you by doing their drugs. Everyone is enjoying it!
If an AI bot is accessing my site the way that regular users are accessing my site -- in other words everyone is using the town center as intended -- what is the problem?
Seems to be a lot of conflating of badly coded (intentionally or not) scrapers and AI. That is a problem that predates AI's existence.
So if I buy a DDoS service and DDoS your site, is it OK as long as it accesses the site the same way regular people do? Sorry for the extreme example; it's obviously not, but that's how I understand your position as written.
We can also consider 10 exploit attempts per second that my site sees.
Set aside that there's a pretty big difference between AI scraping and illegal drug usage.
If the person using illegal drugs is in no way harming anyone but themselves and not being a nuisance, then yeah, I can get behind that. Put whatever you want in your body, just don't let it negatively impact anyone around you. Seems reasonable?
I think this is actually a good example despite how stark the differences are: both the nuisance AI scrapers and the drug addicts have negative externalities that, while possible for them to self-regulate, they are for whatever reason proving unable to control, and therefore they cause other people to have a bad time.
Other commenters are offering the usual "drugs are freedom" type opinions, but having lived in China and Japan, where drugs are dealt with very strictly (and which basically don't have a drug problem today), I can see the other side of the argument: places feeling dirty and dangerous because of drugs, even if you think of addicts sympathetically as victims who need help, makes everyone else less free to live the lifestyle they would like to have.
More freedom for one group (whether to ruin their own lives for a high; or to train their AI models) can mean less freedom for others (whether to not feel safe walking in public streets; or to publish their little blog in the public internet).
You can want public water fountains without wanting a company attaching a hose to the base to siphon municipal water for corporate use, rendering them unusable for everyone else.
You can want free libraries without companies using their employees' library cards to systematically check out all the books at all times so they don't need to wait if they want to reference one.
Does allowing bots to access my information prevent other people from accessing my information? No. If it did, you'd have a point and I would be against that. So many strange arguments are being made in this thread.
Ultimately it is the users of AI (and I am one of them) that benefit from that service. I put out a lot of open code and I hope that people are able to make use of it however they can. If that's through AI, go ahead.
> Does allow bots to access my information prevent other people from accessing my information? No.
Yes it does, that's the entire point.
The flood of AI bots is so bad that (mainly older) servers are literally being overloaded and (newer servers) have their hosting costs spike so high that it's unaffordable to keep the website alive.
I've had to pull websites offline because badly designed & ban-evading AI scraper bots would run up the bandwidth into the TENS OF TERABYTES, EACH. Downloading the same jpegs every 2-3 minutes into perpetuity. Evidently all that vibe coding isn't doing much good at Anthropic and Perplexity.
Even with my very cheap transfer racks up $50-$100/mo in additional costs. If I wanted to use any kind of fanciful "app" hosting it'd be thousands.
That's a problem with scrapers, not with AI. I'm not sure why there are way more AI scraper bots now than there were search scraper bots back when that was the new thing. But that's still an issue of scrapers and rate limiting, and it has nothing to do with wanting or not wanting AI to read your free and open content.
Do the AI training bots provide free access to the distillation of the content they drain from my site repeatedly? Don't they want a free and open web?
I don't feel a particular need to subsidize multi-billion (even trillion) dollar corporations with my content, bandwidth, and server costs, since their genius vibe-coded bots apparently don't know how to use conditional GETs or caching, let alone parse and respect robots.txt.
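The conditional-GET etiquette being referenced here is genuinely trivial to implement. A minimal sketch of what a polite crawler would do (the header names are the standard HTTP ones; the helper function names are my own):

```python
def conditional_headers(etag=None, last_modified=None):
    """Headers a polite crawler sends on a re-fetch so the server can
    answer 304 Not Modified instead of re-sending the full body."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag  # ETag from the previous response
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers

def should_download(status_code):
    # 304 means the cached copy is still current: skip the transfer.
    return status_code != 304
```

A crawler that cached the ETag or Last-Modified from its first visit would pay a few hundred bytes per re-check instead of re-downloading the same JPEGs into the tens of terabytes.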
Is the problem that they exist, or that they are badly accessing your site? Because two issues are being conflated here. If humans or robots are causing you issues, as both can do, that's bad. But that has nothing to do with AI in particular.
Problem one is they do not honor the conventions of the web and abuse the sites.
Problem two is they are taking content for free, distilling it into a product, and limiting access to that product.
Problem one is not specific to AI and not even about AI.
Problem two is not anything new. Taking freely available content and distilling it into a product is something valuable and potentially worth paying for. People used to buy encyclopedias too. There are countless examples.
And that problem was largely solved by robots.txt. AI scrapers are ignoring robots.txt and beating the hell out of sites. Small sites that have decades worth of quality information are suffering the most. Many of the scrapers are taking extreme measures to avoid being blocked, like using large numbers of distinct IP addresses (perhaps using botnets).
HN people working in these AI companies have commented to say they do this, and the timing correlates with the rise of AI companies/funding.
I haven't tried to find it in my own logs, but others have said blocking an identifiable AI bot soon led to the same pattern of requests continuing through a botnet.
The problem is not AI bot scraping, per se, but "AI bot scraping while disregarding all licenses and ethical considerations".
The word "freedom", while it implies no boundaries, is always bound by ethics, mutual respect, and the "do no harm" principle. The moment you trip any one of these wires and break them, the mechanisms to counter it become active.
Then we cry "but, freedom?!". Freedom also contains the consequences of one's actions.
Freedom without consequences is tyranny of the powerful.
The problem isn't "AI bot scraping while disregarding all licenses and ethical considerations". The problem is "AI bot scraping while ignoring every good practice to reduce bandwidth usage".
> The problem is not AI bot scraping, per se, but "AI bot scraping while disregarding all licenses and ethical considerations".
What licenses? Free and open web. Go crazy. What ethical considerations? Do I police how users use the information on my site? No. If they make a pipe bomb using a 6502 CPU with code taken from my website, am I supposed to do something about that?
Absolutely. If you want to put all kinds of copyright, license, and even payment restrictions on your content go ahead. And if AI companies or people abuse that, that's bad on them.
But I do think that if you're serious about free and open information, then why are you doing that in the first place? It's perfectly reasonable to be restrictive; I write both very open software and very closed software. But I see a lot of people who want to straddle the line when it comes to AI without a rational argument.
Let me try to make my point as compact as possible. I may fail, but please bear with me.
I prefer Free Software to Open Source software. My license of choice is A/GPLv3+. Because, I don't want my work to be used by people/entities in a single sided way. The software I put out is the software I develop for myself, with the hope of being useful for somebody else. My digital garden is the same. My blog is a personal diary in the open. These are built on my free time, for myself, and shared.
See, permissive licenses are for "developer freedom": you can do whatever you want with what you grab, as long as you add a line to the credits. The A/GPL family is different. It wants reciprocity. It empowers the user vs. the developer. You have to give the source. Whoever modifies the source shares the modifications. It stays in the open. It has to stay open.
I demand this reciprocity for what I put out there. The licenses reflect that. It's "restricting the use to keep the information/code open". I share something I spent my time on, and I want it to live in the open, and I want a little respect for putting out what I did. That respect is not fame or superiority. Just don't take it and run with it, keeping all the improvements to yourself.
It's not yours, but ours. You can't keep it to yourself.
When it comes to AI, it's an extension of this thinking. I do not give consent for a faceless corporation to close, twist, and earn money from what I put out for the public good. I don't want a set of corporations to act as middlemen who take what I put out, repackage and corrupt it in the process, and sell it. It's not about money; it's about ethics, doing the right thing, and being respectful. It's about exploitation. The same applies to my photos.
I'm not against AI/LLM/generative technology/etc. I'm against the exploitation of people, artists, musicians, software developers, and other companies. I get equally angry when a company's source-available code is scraped and used for suggestions as when an academic's LGPL high-performance matrix library, developed via grants over the years, is. These things affect people's livelihoods.
I get angry when people say "if we take permission for what we do, AI industry will collapse", or "this thing just learns like humans, this is fair use".
I don't buy their "we're doing something awesome, we need no permission" attitude. No, you need permission to use my content. Because I say so. Read the fine print.
I don't want knowledge to be monopolized by these corporations. I don't want the small fish to be eaten by the bigger one and what remains is buried into the depths of information ocean.
This is why I stopped sharing my photos for now, and my latest research won't be open source for quite some time.
What I put out is for humans' direct consumption. Middlemen are not welcome.
If you have any questions or left any holes up there, please let me know.
I respect the desire for reciprocity, but strong copyleft isn't the only, or even the best, way to protect user freedom or public knowledge. My opinion is that permissive licensing and open access to learn from public materials have created enormous value precisely because they don't pre-empt future uses. Requiring permission for every new kind of reuse (including ML training) shrinks the commons, entrenches incumbents who already have data deals, and reduces the impact of your work. The answer to exploitation is transparency, attribution, and guardrails against republication, not copyright enforced restrictions.
I used to be much more into the GPL than I am now. Perhaps it was much more necessary decades ago or perhaps our fears were misguided. I license all my own stuff as Apache. If companies want to use it, great. It doesn't diminish what I've done. But those who prefer GPL, I completely understand.
> as well as an academic's LGPL high performance matrix library which is developed via grants over the years.
The academic got paid with grants. So now this high performance library exists in the world, paid for by taxes, but it can't be used everywhere. Why is it bad to share this with everyone for any purpose?
> What I put out is for humans' direct consumption. Middlemen are not welcome.
Why? Why must it be direct consumption? I've used AI tools to accomplish things that I wouldn't be able to do on my own in my free time, work that is now open source. Tons of developers this week are benefiting from what I was able to accomplish using a middleman. Not all middlemen, by definition, are bad. Middlemen can provide value. Why is that value not welcome?
> I'm not against AI/LLM/Generative technology/etc. I'm against exploitation of people, artists, musicians, software developers, other companies.
If you define AI/LLM/generative technology/etc. as the exploitation of people, artists, musicians, software developers, and other companies, then you are against it. As software developers, our work directly affects the livelihoods of people. Everything we create is meant to automate some human task. To be a software developer and then complain that AI is going to take away jobs is to be a hypocrite.
Your whole argument is easily addressed by requiring the AI models to be open source. That way, they obviously respect the AGPL and any other open license, and contribute to the information being kept free. Letting these companies knowingly and obviously infringe licenses and all copyright as they do today is obviously immoral, and illegal.
AGPL doesn't pre-empt future uses or require permission for any kind of re-use. You just have to share alike. It's pretty simple.
AGPL lets you take a bunch of data and AI-train on it. You just have to release the data and source code to anyone who uses the model. Pretty simple. You don't have to rent them a bunch of GPUs.
Actually it can be annoying because of the specific mechanism by which you have to share alike - the program has to have a link to its own source code - you can't just offer the source alongside the binary. But it's doable.
Is that really the problem we are discussing? I've had people attack my server and bring it down. But that has nothing to do with being free and open to everyone. A top hacker news post could take my server.
Yes, because a top hacker news post takes your server down because a large number of actual humans are looking to gain actual value from your posts. Meanwhile, you stand to benefit from the HN discussion by learning new things and perspectives from the community.
The AI bot assault, on the other hand, is one company (or a few companies) re-fetching the same data over and over again, constantly, in perpetuity, just in case it's changed, all so they can incorporate it into their training set and make money off of it while giving you zero credit and providing zero feedback.
The refrain here comes down not to "AI" but mostly to "the AI bot assault", which is a different thing. Sure, let's have a discussion about badly behaved and overzealous web scrapers. As for credit, I've asked AI for its references and gotten them. If my information is merely mushed into an AI training model, I'm not sure why I need credit. If you discuss this thread with your friends, are you going to give me credit?
You realize this entire thread is about a pitch from a CDN company trying to solve an issue that has presented itself at such a scale that this is the best option they can think of to keep the web alive, right?
"Use a CDN" is not sufficient when these bots are so incredibly poorly behaved, because you're still paying for that CDN and this bad behavior is going to cost you a fortune in CDN costs (or cost the CDN a fortune instead, which is why Cloudflare is suggesting this).
Ultimately, you have to realize that this is a losing battle, unless we have completely draconian control over every piece of silicon. Captchas are being defeated; at this point they're basically just mechanisms to prove you Really Want to Make That Request to the extent that you'll spend some compute time on it, which is starting to become a bit of a waste of electricity and carbon.
Talented people that want to scrape or bot things are going to find ways to make that look human. If that comes in the form of tricking a physical iPhone by automatically driving the screen physically, so be it; many such cases already!
The techniques you need for preventing DDoS don't need to really differentiate that much between bots and people unless you're being distinctly targeted; Fail2Ban-style IP bans are still quite effective, and basic WAF functionality does a lot.
Don't you have rate-limits? And how much are you paying for the instance where you're hosting it? I've run/helped run projects with something like ~10 req/s easily on $10 VPSs, surely hosting HTML can't cost you that much?
Of course it won't be free, but you can get pretty close to free by employing the typical things you'd put in place to restrict the amount of resources used, like rate limits, caches, and so on.
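The rate limits mentioned above can be as simple as a token bucket per client IP. This is a generic sketch of the standard algorithm, not anyone's specific setup:

```python
import time

class TokenBucket:
    """Minimal token bucket: sustained `rate` requests/second,
    with bursts of up to `capacity` requests."""

    def __init__(self, rate, capacity):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        # Refill proportionally to the time elapsed since the last call.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should answer 429 Too Many Requests
```

In a server you'd keep one bucket per client IP (e.g. in a dict) and reject with a 429 whenever `allow()` returns False; nginx's `limit_req` implements essentially the same idea.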
And? Paying Cloudflare or someone else to block bad actors is required these days unless you have the scale and expertise to do it yourself.
Why is outsourcing this to Cloudflare bad and doing it yourself ok? Am I allowed to buy a license to a rate limiter or do I need to code my own? Am I allowed to use a firewall or is blocking people from probing my server not free enough?
Why are bots or any other user entitled to unlimited visits to my website? The entitlement is kind of unreal at this point
> And? Paying Cloudflare or someone else to block bad actors is required these days unless you have the scale and expertise to do it yourself.
Where are people getting this from? No, Cloudflare or any other CDN is not required for you to host your own stuff. Sure, it's easy, and probably the best way to go if you just wanna focus on shipping, but lets not pretend it's a requirement today.
> Why are bots or any other user entitled to unlimited visits to my website? The entitlement is kind of unreal at this point
I don't think they are, that's why we have rate limiters, right? :) I think the point is that if you're allowing a user to access some content in one way, why not allow that same user to access the content in the same way, but using a different user-agent? That's the original purpose of that header after all, to signal what the user used as an agent on their behalf. Commonly, I use Firefox as my agent for browsing, but I should be free to use any user-agent, if we want the web to remain open and free.
GPL doesn't care if you use it for profit or not (good), it just says that the resultant model needs to be open too. And open models exist in droves nowadays. Even closed models can be distilled into open ones.
The dream is real, man. If you want open content on the Internet, it's never been a better time. My blog is open to all - machine or man. And it's hosted on my home server next to me. I don't see why anyone would bother trying to distinguish humans from AI. A human hitting your website too much is no different from an AI hitting your website too much.
I have a robots.txt that tries to help bots not get stuck in loops, but if they want to, they're welcome to. Let the web be open. Slurp up my stuff if you want to.
Amazonbot seems to love visiting my site, and it is always welcome.
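A robots.txt along these lines might look something like this (the paths are hypothetical, just illustrating the "help bots avoid loops without banning them" idea):

```
# Hypothetical robots.txt: steer bots away from infinite URL spaces,
# without banning them outright
User-agent: *
Disallow: /calendar/   # date pagination generates endless "next month" links
Disallow: /search      # query permutations are unbounded
Crawl-delay: 10        # honored by some crawlers, ignored by others
```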
> I don't see why anyone would bother trying to distinguish humans from AI.
Because a hundred thousand people reading a blog post is more beneficial to the world than an AI scraper bot fetching my (unchanged) blog post a hundred thousand times just in case it's changed in the last hour.
If AI bots were well-behaved, maintained a consistent user agent, used consistent IP subnets, and respected robots.txt, I wouldn't have a problem with them. You could manage your content filtering however you want (or not at all) and that would be that. Unfortunately at the moment, AI bots do everything they can to bypass any restrictions or blocks or rate limits you put on them; they behave as though they're completely entitled to overload your servers in their quest to train their AI bots so they can make billions of dollars on the new AI craze while giving nothing back to the people whose content they're misappropriating.
I've not seen an AI scraper reading a blog post 100,000 times in an hour to see if it's changed. As far as I can tell, that's an NI hallucination. Typical fetch rates are more like 3 times per second (10k per hour), and they fetch a different URL each time.
The only bot that bugs the crap out of me is Anthropic's one. They're the reason I set up a labyrinth using iocaine (https://iocaine.madhouse-project.org/). Their bot was absurdly aggressive, particularly with retries.
It's probably trivial in the whole scheme of things, but I love that Anthropic spent months making about 10 rps against my stupid blog, getting Markov chain responses generated from the text of Moby Dick. (Looks like they haven't crawled my site for about a fortnight now.)
No wonder Anthropic isn't working well! The "Moby Dicked" explanation of the state of AI!
But seriously, why must someone crawl even a significant part of the public Internet to develop an AI? Is it believed that missing some text will cripple the AI?
Isn't there some sort of "law of diminishing returns" where, once some percentage of coverage is reached, further scraping is not cost-effective?
On the contrary, AI training techniques require gigantic amounts of data to do anything, and there is no upper limit whatsoever - the more relevant data you have to train on, the better your model will be, period.
In fact, the biggest thing that is making it unlikely that LLM scaling will continue is that the current LLMs have already been trained on virtually every piece of human text we have access to today. So, without new training data (in large amounts), the only way they'll scale more is by new discoveries on how to train more efficiently - but there is no way to put a predictable timeline on that.
Yes, I obviously agree with you, but I think you've slightly missed my comment's point. CF is making these tools and giving access to them to millions of people.
Well there's open source stuff like https://github.com/TecharoHQ/anubis; one doesn't need a top-down mandated solution coming from a corporation.
In general Cloudflare has been pushing DRMization of the web for quite some time, and while I understand why they want to do it, I wish they didn't always show off as taking the moral high ground.
The actual response to which Anubis was created is seemingly a strange kind of DDOS attack that has been misattributed to LLMs, but is some kind of attacker that makes partial GET requests that are aborted soon after sending the request headers, mostly coming from residential proxies. (Yes, it doesn’t help that the author of Anubis also isn’t fully aware of the mechanics of the attack. In fact, there is no proper write up of the mechanism of the attack which I hope to write about someday).
Having said that, the solution is effective enough, having a lightweight proxy component that issues proof of work tokens to such bogus requests works well enough, as various users on HN seem to point out.
> a strange kind of DDOS attack that has been misattributed to LLMs, but is some kind of attacker that makes partial GET requests that are aborted soon after sending the request headers, mostly coming from residential proxies.
um, no? Where did you get this strange bit of info.
(I've seen more posts with the analysis, including one which showed an AI crawler that would identify itself properly, but once it hit the rate limit would switch to a fake user agent from proxies... but I cannot find it now.)
I self-host lots of stuff. But yes, it is more pain to host a WAF that can handle billions of requests per minute. Even harder to do it for free like Cloudflare. And in the end, the result for the user is exactly the same whether you use a self-hosted WAF or let someone else host it for you.
If you're handling billions of requests per second, you're not a self hoster. That's a commercial service with a dedicated team to handle traffic around the clock. Most ISPs probably don't even operate lines that big
To put that in perspective, even if they're sending empty TCP packets, "several billion" pps is 200 to 1800 gigabits of traffic, depending on what you mean by that. Add a cookieless HTTP payload and you're at many terabits per second. The average self hoster is more likely to get struck by lightning than encounter and need protection from this (even without considering the, probably modest, consequences of being offline a few hours if it does happen)
Edit: off by a factor of 60, whoops. Thanks to u/Gud for pointing that out. I stand by the conclusion though: less likely to occur than getting struck by lightning (or maybe it's around equally likely now? But somewhere in that ballpark) and the consequences of being down for a few hours are generally not catastrophic anyway. You can always still put big brother in front if this event does happen to you and your ISP can't quickly drop the abusive traffic
If somebody decides they hate you, your site that could handle, say, 100,000 legitimate requests per day could suddenly get billions of illegitimate requests.
I have this argument every time self-hosting comes up, and every time I wonder if someone will do it to me to make a point, or if one of the million-odd comments I post will upset someone, or one of the many tools that I host. Yet to happen, idk. It's like arguing that you need to carry a knife on the street at all times because someone might get angry over a look. It happens, we even have a word for it in NL (zinloos geweld, "senseless violence") and commemorative ladybug tiles in sidewalks and everything, but no normal person actually carries weapons 24/7 (drug dealers, sure) or talks only through an intermediary.
I'd suspect other self-hosters just see more of this than I do, except that nobody ever says it happened to them. The only argument I ever hear is that they want to be "safe" while "self hosting with cloudflare". Who's really hosting your shit then?
Not everybody wants to manage some commercial grade packet filter that can handle some DDoSing script kiddie, it’s a strong argument.
But another argument against using the easiest choice, the near monopoly, is that we need a diverse, thriving ecosystem.
We don’t want to end up in a situation where suddenly Cloudflare gets to dictate what is allowed on the web.
We have already lost email to the tech giants, try running your own mail sometime.
The technical aspect is easy, the problem is you will end up in so many spam folders it’s disgusting.
Please do try running your own mail some time. It's not nearly as hard as doomers would have you think. And if you only receive, you don't have any problems at all.
At first, you can use it for less serious stuff until you see how well it works.
But you don't get billions of requests per minute. You get maybe five requests per second (300 per minute) on a bad day. The sites that seem to be getting badly attacked, they get 200 per second, which is still within reach of a self hosted firewall. Think about how many CPU cycles per packet that allows for. Hardly a real DDoS.
The only reason you even want to firewall 200 requests per second is that the code downstream of the firewall takes more than 5ms to service a request, so you could also consider improving that. And if you're only getting <5 and your server isn't overloaded then why block anything at all?
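For a sense of scale, even a naive in-process limiter absorbs that kind of volume with ease; here's a minimal token-bucket sketch in pure Python (the rates are arbitrary, and this is a generic illustration, not any particular firewall's design):

```python
import time

class TokenBucket:
    """Naive token bucket: refills `rate` tokens/sec, bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, capped at capacity
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# 5 requests/sec sustained, bursts of 10 -- the "bad day" from the comment above
bucket = TokenBucket(rate=5, capacity=10)
```

Checking a bucket like this costs a few dozen nanoseconds per request, which is the point: 200 req/s is nothing for the filter itself; it's the work downstream that hurts.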
How does an agent help my website not get crushed by traffic load, and how is this proposal any different from the gatekeeping problem to the open web, except even less transparent and accountable because now access is gated by logic inside an impenetrable web of NN weights?
This seems like slogan-based planning with no actual thought put into it.
It's not a question of languages or frameworks, but hardware. I cannot finance servers large enough to keep up with AI bots constantly scraping my host, bypassing cache directives, or changing IPs to avoid bans.
I have had to disable at least one service because AI bots kept hitting it and it started impacting other stuff I was running that I am more interested in. Part of it was the CPU load on the database rendering dozens of 404s per second (which still required a database call), part of it was that the thumbnail images were being queried over and over again with seemingly different parameters for no reason.
I'm sure there are AI bots that are good and respect the websites they operate on. Most of them don't seem to, and I don't care enough about the AI bubble to support them.
When AI companies stop people from using them as cheap scrapers, I'll rethink my position. So far, there's no way to distinguish any good AI bot from a bad one.
So by a free and open-for-all web you mean only for the tech priests competent enough to build the skills and maintain them in light of changes to the spec (hope these people didn't run across XML/XSLT-dependent techniques building their site), or those with a rich enough family that they can casually learn a skill while not worrying about putting food on the table?
There are going to be bad actors taking advantage of people who cannot fight back without regulations and gatekeepers; suggesting otherwise is about as reasonable as the ancap idea of government.
We have thousands of engineers from these companies right here on Hacker News, and they cry and scream about privacy and data governance on every topic but their own work. If you guys need a mirror to do some self-reflection, I am offering to buy one.
In the recent days, the biggest delu-lulz was delivered by that guy who'd bravely decided to boycott Grok out of... environmental concerns, apparently. It's curious how everybody is so anxious these days, about AI among other things in our little corner of the web. I swear, every other day it's some new big fight against something... bad. Surely it couldn't ALL be attributed to policy in the US!
I don't know about this. This means I'd get sued for using a feed reader on Codeberg[1], or for mirroring repositories from there (e.g. with Forgejo), since both are automated actions that are not caused directly by a user interaction (i.e. bots, rather than user agents).
To be more specific, if we assume good faith upon our fine congresspeople to craft this well... ok yeah, for the hypothetical case I'll continue...
The legal teeth I would advocate for would be targeted at crawlers (a subset of bots) and would not include your usage. It would mandate that Big Corp crawlers (for search indexing, AI data harvesting, etc.) be registered and identify themselves in their requests. This would allow server-side tools to efficiently reject them. Failure to comply would result in fines large enough to change behavior.
Now that I write that out, if such a thing were to come to pass, and it was well received, I do worry that congress would foam at the mouth to expand it to bots more generally, Microsoft-Uncertified-Devices, etc.
Yeah, my main worry here is how we define the unwanted traffic, and how that definition could be twisted by bigcorp lawyers.
If it's too loose and similar to "wanted traffic is how the authors intend the website to be accessed, unwanted traffic is anything else", that's an argument that can be used against adblocks, or in favor of very specific devices like you mention. Might even give slightly more teeth to currently-unenforceable TOS.
If it's too strict, it's probably easier to find loopholes and technicalities that just lets them say "technically it doesn't match the definition of unwanted traffic".
Even if it's something balanced, I bet bigcorp lawyers will find a way to twist the definitions in their favor and set a precedent that's convenient for them.
I know this is a mini-rant rather than a helpful comment that tries to come up with a solution, it's just that I'm pessimistic because it seems the internet becomes a bit worse day by day no matter what we try to do :c
You don't get sued for using a service as it is meant to be used (using an RSS reader on their feed endpoint; cloning repositories that it is their mission to host). First, it doesn't anger anyone, so they wouldn't bother trying to enforce a rule; second, it's a fruitless case, because a judge would say the claim isn't reasonable.
Robots.txt is meant for crawlers, not user agents such as a feed reader or git client
I agree with you, generally you can expect good faith to be returned with good faith (but here I want to make heavy emphasis that I only agree on the judge part iff good faith can be assumed and the judge is informed enough to actually be able to make an informed decision).
But not everyone thinks that's the purpose of robots.txt. Example, quoting Wikipedia[1] (emphasis mine):
> indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.
Quoting the linked `web robots` page[2]:
> An Internet bot, web robot, robot, or simply bot, is a software application that runs automated tasks (scripts) on the Internet, usually with the intent to imitate human activity, such as messaging, on a large scale. [...] The most extensive use of bots is for web crawling, [...]
("usually" implying that's not always the case; "most extensive use" implying it's not the only use.)
Also a quick HN search for "automated robots.txt"[3] shows that a few people disagree that it's only for crawlers. It seems to be only a minority, but the search results are obviously biased towards HN users, so it could be different outside HN.
Besides all this, there's also the question of whether web scraping (not crawling) should also be subject to robots.txt or not; where "web scraping" includes any project like "this site has useful info but it's so unusable that I made a script so I can search it from my terminal, and I cache the results locally to avoid unnecessary requests".
The behavior of alternative viewers like Nitter could also be considered web scraping if they don't get their info from an API[4], and I don't know if I'd consider Nitter the bad actor here.
But yeah, like I said I agree with your comment and your interpretation, but it's not the only interpretation of what robots.txt is meant for.
[4]: I don't know how Nitter actually works or where does it get its data from, I just mention it so it's easier to explain what I mean by "alternative viewer".
But it's the same thing with random software from a random nobody that has no license, or has a license that's not open-source: If I use those libraries or programs, do I think they would sue me? Probably not.
I've run a few hundred small domains for various online stores with an older backend that didn't scale very well for crawlers, and at some point we started blocking by continent.
What we need is to stop fighting robots and start welcoming and helping them. I see zero reasons to oppose robots visiting any website I would build. The only purpose I ever disallowed robots for was preventing search engines from indexing incomplete versions or going down paths which really make no sense for them. Now I think we should write separate instructions for different kinds of robots: a search engine indexer shouldn't open pages which have serious side effects (e.g. placing an order) or display semi-realtime technical details, but an LLM agent may be on a legitimate mission involving exactly that.
Let my site go down, and I'll restart my server a few hours later. I'm a dude with a blog; I'm not making uptime guarantees. I think you're overestimating the harm and how often this happens.
Misbehaving scrapers have been a problem for years, not just since AI. I've written posts on how to properly handle scraping, the legal grey area it puts you in, and how to be a responsible scraper. If companies don't want to be responsible, the solution isn't to abandon an open web. It's to make better law and enforce it.
> What we need is to stop fighting robots and start welcoming and helping them. I see zero reasons to oppose robots visiting any website I would build.
Well, I'm glad you speak for the entire Internet.
Pack it in folks, we've solved the problem. Tomorrow, I'll give us the solution to wealth inequality (just stop fighting efforts to redistribute wealth and political power away from billionaires hoarding it), and next week, we'll finally get to resolve the old question of software patents.
I have the feeling that it's the small players that cause problems.
Dumb bots that don't respect robots.txt or nofollow are the ones trying all combinations of the filters available in your search options and requesting all pages for each such combination.
The number of search pages can easily be exponential in the number of filters you offer.
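To put a number on it: with n independent toggle filters there are 2^n distinct filter combinations, before even counting pagination. A quick illustration (the filter names are made up):

```python
from itertools import combinations

# Hypothetical toggle filters on a store's search page
filters = ["color", "size", "brand", "price", "rating",
           "in_stock", "on_sale", "material", "shipping", "condition"]

# Every subset of filters is a distinct crawlable URL
total = sum(1 for r in range(len(filters) + 1)
              for _ in combinations(filters, r))
print(total)  # 2**10 = 1024 distinct search pages from just 10 toggles
```

Add multi-valued filters or sort orders and the exponent only grows, which is why a dumb crawler can generate more page requests from one search form than your entire real userbase does.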
Bots walking into these traps do it because they are dumb.
But even a small degenerate bot can send more requests than 1M MAUs.
At least that's my impression of the problem we're sometimes facing.
Signed agents seem like a horrific solution. Just serving the traffic is better.
I like the idea of web as a blockchain of content. If you want to pull some data, you have to pay for it with some kind of token. You either buy that token to consume information if you're of the leecher type, or get some by doing contributions that gain back tokens.
It's more or less the same concept as torrents back in the day.
This should be applied to email too. The regular person sends what, 20 emails per day max? Say it costs $0.01 per mail; anyone could pay that $0.20 a day. But if you want to spam 1,000,000 people every day, that becomes a prohibitive $10,000.
I recently found out my website has been blocked by AI agents, when I had never asked for it. It seems to be opt-out by default, but in an obscure way. Very frustrating. I think some of these companies (one in particular) are risking burning a lot of goodwill, although I think they have been on that path for a while now.
You can lock it up with a user account and payment system. The fact that the site is up on the internet doesn't mean you can or cannot profit from it. It's up to you. What I would like is a way to notify my ISP and say: block this traffic to my site.
> What I would like is a way to notify my ISP and say: block this traffic to my site.
I would love that, and make it automated.
A single message from your IP to your router: block this traffic. That router sends it upstream, and it also blocks it. Repeat ad nauseam until the source changes ASN or (if the originator is on the same ASN) it reaches the router nearest the originator, routing table space notwithstanding. Maybe it expires after some auto-expiry, a day or a month or however long your IP lease lasts. Plus, of course, a way to query what blocks I've requested and a way to unblock.
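As a thought experiment, the propagation described above might look like the sketch below. Everything here is hypothetical (the message shape, the hop limit, the expiry field); this is not an existing protocol, though it rhymes with BGP Flowspec:

```python
from dataclasses import dataclass, field

@dataclass
class BlockRequest:
    """Hypothetical 'block this traffic' message a host sends upstream."""
    source_cidr: str           # the traffic to drop
    requester: str             # who asked for the block
    ttl_hops: int = 8          # stop propagating after this many routers
    expires_in_s: int = 86400  # auto-expiry, e.g. the length of an IP lease

@dataclass
class Router:
    name: str
    upstream: "Router | None" = None
    blocks: list = field(default_factory=list)

    def handle(self, req: BlockRequest) -> None:
        self.blocks.append(req)          # install the filter locally
        if self.upstream and req.ttl_hops > 0:
            # push the request one hop closer to the traffic source
            self.upstream.handle(BlockRequest(
                req.source_cidr, req.requester,
                req.ttl_hops - 1, req.expires_in_s))

isp_core = Router("isp-core")
home = Router("home-router", upstream=isp_core)
home.handle(BlockRequest("203.0.113.0/24", "198.51.100.7"))
```

The hard parts a real design would face are exactly the ones this sketch skips: authenticating that the requester actually owns the destination IP, and preventing the mechanism from being abused to block someone else's traffic.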
You can't trust everyone will be polite or follow "standards".
However, you can incentivize good behavior. Say there's a scraping agent: you could make an x402-compatible endpoint and offer them a discount or something.
Kinda like piracy: if you offer a good, simple, cheap service, people will pay for it rather than go through the hassle of pirating.
They normally use a puzzle that the website generates, or they use a proof-of-work-based captcha. I've found proof of work good enough out of the two, and it also means the site owner can run it themselves instead of relying on Cloudflare and third parties.
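A hashcash-style proof of work is simple enough to sketch in a few lines. This is a generic illustration of the idea, not any particular product's scheme:

```python
import hashlib
from itertools import count

def solve(challenge: bytes, difficulty: int = 12) -> int:
    """Client side: find a nonce whose SHA-256 has `difficulty` leading zero bits."""
    target = 1 << (256 - difficulty)
    for nonce in count():
        digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify(challenge: bytes, nonce: int, difficulty: int = 12) -> bool:
    """Server side: a single hash to check, so verification stays nearly free."""
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty))

# The client burns ~2**difficulty hashes on average; the server burns one.
nonce = solve(b"per-request-challenge")
assert verify(b"per-request-challenge", nonce)
```

The asymmetry is the whole point: each extra bit of difficulty doubles the client's cost while the server's verification cost stays constant, which is what makes it self-hostable on modest hardware.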
> But the reality is how can someone small protect their blog or content from AI training bots?
A paywall.
In reality, what some want is to get all the benefits of having their content on the open internet while still controlling who gets to access it. That is the root cause here.
This. We need to get rid of the ad-supported free internet economy. If you want your content to be free, release it and have no issues with AI. If you want to make money off your content, add a paywall.
We need micropayments going forward, Lightning (Bitcoin backend) could be the solution.
> Everyone loves the dream of a free for all and open web.
> protect their blog or content from AI training bots
It strikes me that one needs to choose one of these as their visionary future.
Specifically: a free and open web is one where read access is unfettered to humans and AI training bots alike.
So much of the friction and malfunction of the web stems from efforts to exert control over the flow (and reuse) of information. But this is in conflict with the strengths of a free and open web, chief of which is the stone cold reality that bytes can trivially be copied and distributed permissionlessly for all time.
It's the new "ban cassette tapes to prevent people from listening to unauthorized music," but wrapped in an anti-corporate skin delivered by a massive, powerful corporation that could sell themselves to Microsoft tomorrow.
The AI crawlers are going to get smarter at crawling, and they'll have crawled and cached everything anyway; they'll just be reading your new stuff. They should literally just buy the Internet Archive jointly, and only read everything once a week or so. But people (to protect their precious ideas) will then just try to figure out how to block the IA.
One thing I wish people would stop doing is conflating their precious ideas and their bandwidth. The bandwidth is one very serious issue, because it's a denial-of-service attack, but it can be easily solved. Your precious ideas? Those have to be protected by a court. And I don't actually care if the copyright violation can go both ways; wealthy people seem to be free to steal from the poor at will, even rewarded for it; "normal" (upper-middle-class) people can't even afford to challenge obviously fraudulent copyright claims; and the penalties are comically absurd, the direct result of corruption.
Maybe having pay-to-play justice systems that punish the accused before conviction with no compensation was a bad idea? Even if it helped you to feel safe from black people? Maybe copyright is dumb now that there aren't any printers anymore, just rent-seekers hiding bitfields?
It's not the publishers who need to do the hard work, it's the multi-billion dollar investments into training these systems that need to do the hard work.
We are moving to a position whereby if you or I want to download something without compensating the publisher, that's jail time, but if it's Zuck, Bezos or Musk, they get a free pass.
That's the system that needs to change.
I should not have to defend my blog from these businesses. They should be figuring out how to pay me for the value my content adds to their business model. And if they don't want to do that, then they shouldn't get to operate that model, in the same way I don't get to build a whole set of technologies on papers published by Springer Nature without paying them.
This power imbalance is going to be temporary. These trillion-dollar market cap companies think if they just speed run it, they'll become too big, too essential, the law will bend to their fiefdom. But in the long term, it won't - history tells us that concentration of power into monarchies descends over time, and the results aren't pretty. I'm not sure I'll see the guillotine scaffolds going up in Silicon Valley or Seattle in my lifetime, but they'll go up one day unless these companies get a clue from history as to what they need to do.
It is a service available to Cloudflare customers and is opt-in. I fail to see how they're being gatekeepers when site owners have the option not to use it.
I care more about the dream of a wide open free web than a small time blogger’s fears of their content being trained on by an AI that might only ever emit text inspired by their content a handful of times in their life.
"Okay, that means AI companies can train on your content."
"Well, actually, we need some protections..."
"So you want a closed web with access controls?"
"No no no, I support openness! Can't we just have, like, ethical openness? Where everyone respects boundaries but there's no enforcement mechanism? Why are you making this so black and white?"
> “When we started the “free speech movement,” we had a bold new vision. No longer would dissenters’ views be silenced. With the government out of the business of policing the content of speech, robust debate and the marketplace of ideas would lead us toward truth and enlightenment. But it turned out that freedom of the press meant freedom for those who owned one. The wealthy and powerful dominated the channels of speech. The privileged had a megaphone and used free speech protections to immunize their own complacent or even hateful speech. Clearly, the time has come to denounce the naïve idealism of the past and offer a new movement, Speech 2.0, which will pay more attention to the political economy of media and aim at “free-ish” speech — the good stuff without the bad.”
> Everyone loves the dream of a free for all and open web. But the reality is how can someone small protect their blog or content from AI training bots?
I'm old enough to remember when people asked the same questions of Hotbot, Lycos, Altavista, Ask Jeeves, and -- eventually -- Google.
Then, as now, it never felt like the right way to frame the question. If you want your content freely available, make it freely available... including to the bots. If you want your content restricted, make it restricted... including to the humans.
It's also not clear to me that AI materially changes the equation, since Google has for many years tried to cut out links to the small sites anyway in favor of instant answers.
(FWIW, the big companies typically do honor robots.txt. It's everyone else that does what they please.)
What if I want my content freely available to humans, and not to bots? Why is that such an insane, unworkable ask? All I want is a copyleft protection that specifically allows humans to access my work to their heart's content, but disallows AI use of it in any form. Is that truly so unreasonable?
Yes, it is an unreasonable and absurd ask. You cannot want freedom while restricting it. You forget that it is people that use AI agents, essentially, being cyborgs. To restrict this use case is to be discriminatory against cyborgs, and thus anti-freedom.
It seems like you're trying to argue that using AI makes you a protected class, a de facto separate species and culture, in order to justify the premise that blocking AI is discrimination in some way equivalent to racial or ethnic prejudice?
If so, no. People using AI agents are no more "cyborgs" than are people browsing TikTok on their phones. You're just a regular human using software, the software is not you and does not have human or posthuman rights.
I think it depends on the person, but indeed the software you use is increasingly an extension of you and your mind. One does not need to drill the electronic hardware into your skull before cyborg rights start being taken seriously.
> What if I want my content freely available to humans, and not to bots? Why is that such an insane, unworkable ask?
Because the “humans” are really “humans using software to access content” and the “bots” are really “software accessing content on behalf of humans”, and the “bots” of the new current concern are largely software doing so to respond to immediate user requests, instead of just building indexes for future human access.
It's not unreasonable to ask but I think it probably is unreasonable to expect a strictly technical solution. It feels like we're in the realm of politics, policy, and law.
I don't know which companies, of course. They hide their identity by using a botnet.
This traffic is new, and started around when many AI startups started.
I see traffic from new search engines and other crawlers, but it generally respects robots.txt and identifies itself, or else comes from a small pool of IP addresses.
If others respected robots.txt, we would not need solutions like what Cloudflare is presenting here. Since abuse is rampant, people are looking for mitigations and this CF offering is an interesting one to consider.
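For what it's worth, respecting robots.txt takes a handful of lines with Python's standard library, which makes the rampant ignoring of it hard to excuse. (Normally you'd call `set_url(...)` and `read()` against a live site; here the file content is inlined so the example is self-contained, and the user agent and URLs are made up.)

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /private/
""".splitlines())

# A well-behaved crawler checks before every fetch
assert rp.can_fetch("MyCrawler/1.0", "https://example.com/public/page")
assert not rp.can_fetch("MyCrawler/1.0", "https://example.com/private/page")
```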
How about we discuss, design, and implement a system that charges them for their actions? We could put dark patterns in our sites that impose exactly this cost: some sort of problem-solving thing that harnesses their energetic scraping/LLM tools, directing their energy toward causes that profit our site, in exchange for revealing some content that achieves their scraping mission too. Looks like things like this already exist to a degree.
It's amazing how this catchphrase has reversed meanings for some people. It was previously used against walled gardens and paywalls, but these corporate LLMs are the ultimate walled garden for information because in most cases you can't even find out who created the information in the first place.
"Information wants to be free! That's why I support hiding it behind a chatbot paywall that makes a few people billionaires"
> But the reality is how can someone small protect their blog or content from AI training bots?
First off, there's no harm from well-behaved bots. Badly behaved bots that cause problems for the server are easily detected (by the problems they cause), classified, and blocked or heavily throttled.
Of course, if you mean "protect" in the sense of "keep AI companies from getting a copy" (which you may have, given that you mentioned training) - you simply can't, unless you consider "don't put it on the web" a solution.
It's impossible to make something "public, but not like that". Either you publish or you don't.
If anything, it's a legal issue (copyright/fair use), not a technical one. Technical solutions won't work.
I'm not sure why people are so confused by this. The Mastodon/AP userbase put their public content on a publicly federated protocol then lost their shit and sent me death threats when I spidered and indexed it for network-wide search.
There are upsides and downsides to publishing things you create. One of the downsides is that it will be public and accessible to everyone.
I personally love the idea of a free and open internet and also have no issues with bots scraping or training off of my data.
I would much rather have it open for all, including companies, than the coming dystopian landscape of paywall gates. I don’t care about respecting robots.txt or any other types of rules. If it’s on the internet it’s for all to consume. The moment you start carving out certain parties is the moment it becomes a slippery slope.
I have zero issue with AI agents, if there's a real user behind there somewhere. I DO have a major issue with my sites being crawled extremely aggressively by offenders including Meta, Perplexity and OpenAI - it's really annoying realising that we're tying up several CPU cores on AI crawling. Fewer than we devote to real users and Google et al., but still.
I have some personal apps online and I had to turn on the Cloudflare AI bot protection because one of them had 1.6TB of data accessed by bots in the last month - 1.3 million requests per day, just non-stop hammering with no limits.
You'd think they would have an interest in developing reasonable crawling infrastructure, like Google, Bing or Yandex. Instead they go all in on hosts with no metering. All of the search majors reduce their crawl rate as request times increase.
On one hand these companies announce themselves as sophisticated, futuristic and highly-valued, on the other hand we see rampant incompetence, to the point that webmasters everywhere are debating the best course of action.
I suspect it's because they're dealing with such unbelievable levels of bandwidth and compute for training and inference that the amount required to blast the entire web like this barely registers to them.
Honestly it's just tragedy of the commons. Why put the effort in when you don't have to identify yourself, just crawl and if you get blocked move the job to another server.
At this point I'm blocking several ASNs. Most are cloud provider related, but there are also some repurposed consumer ASNs coming out of the PRC. Long term, this devalues the offerings of those cloud providers, as prospective customers will not be able to use them for crawling.
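For illustration, blocking by ASN usually boils down to matching the client address against the prefixes that ASN announces. A minimal Python sketch, with documentation ranges standing in as placeholders for real announced prefixes (which you'd export from a BGP data source):

```python
import ipaddress

# Hypothetical sketch: block requests originating from specific ASNs by
# matching the client IP against each ASN's announced prefixes.
# The prefixes below are placeholder documentation ranges; in practice
# you'd export the real prefix list for each ASN you want to block.
BLOCKED_PREFIXES = [
    ipaddress.ip_network("203.0.113.0/24"),   # stand-in for a cloud ASN
    ipaddress.ip_network("2001:db8::/32"),    # stand-in for a PRC consumer ASN
]

def is_blocked(client_ip: str) -> bool:
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED_PREFIXES)
```

Returning 403 (or just dropping the connection) for `is_blocked` addresses is then a one-line check in whatever middleware you run.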
I'm seeing around the same, as a fairly constant base load. Even more annoying when it's hitting auth middleware constantly, over and over again somehow expecting a different answer.
I wonder how many CPU cycles are spent because of AI companies scraping content. This factor isn't usually considered when estimating “environmental impact of AI.” What’s the overhead of this on top of inference and training?
To be fair, an accurate measurement would need to consider how many of those CPU cycles would be spent by the human user who is driving the bot. From that perspective, maybe the scrapers can "make up for it" by crawling efficiently, i.e. avoiding loading tracker scripts, images, etc unless necessary to solve the query. This way they'll still burn CPU cycles, but at least it'll be fewer cycles than a human user with a headful browser instance.
Same with me. If there is a real user behind the use of the AI agents and they do not make excessive accesses in order to do what they are trying to do, then I do not have a complaint (the use of AI agents is not something I intend, but that is up to whoever is using them and not up to me). I do not like the excessive crawling.
However, what is more important to me than AI agents, is that someone might want to download single files with curl, or use browsers such as Lynx, etc, and this should work.
Cloudflare is trying to gatekeep which user-initiated agents are allowed to read website content, which is of course very different from scraping websites for training data. Meta, Perplexity and OpenAI all have some kind of web-search functionality where they send requests based on user prompts. These are not requests that get saved to train the next LLM. Cloudflare intentionally blurs the line between both types of bots, and in that sense it is a bait-and-switch where they claim to 'protect content creators' by being the man in the middle and collecting tolls from LLM providers to pay creators (and of course taking a cut for themselves). It's not something they do because it would be fair; there's a financial motivation.
> Cloudflare is trying to gatekeep which user-initated agents are allowed to read website content, which is of course very different from scraping website for training data.
That distinction requires you to take companies which benefit from amassing as much training data as possible at their word when they pinky swear that a particular request is totally not for training, promise.
If you look at the current LLM landscape, the frontier is not being pushed by labs throwing more data at their models - most improvements come from using more compute and improving training methods. In that sense I don't have to take their word for it; more data just hasn't been the bottleneck for a long time.
Just today Anthropic announced that they will begin using their users data for training by default - they still want fresh data so badly that they risked alienating their own paying customers to get some more. They're at the stage of pulling the copper out of the walls to feed their crippling data addiction.
I use uncommon web browsers that don't leak a lot of information. To Cloudflare, I am indistinguishable from a bot.
Privacy cannot exist in an environment where the host gets to decide who accesses the web page. I'm okay with rate limiting or otherwise blocking activity that creates too much load, but trying to prevent automated access is impossible without preventing access from real people.
And god forbid you live in an authoritarian country and must use a VPN to protect your freedom. The internet becomes a captcha hell run by two or three companies.
I've had far fewer issues with my own bots that access cloudflare protected websites, than during my regular browsing with privacy respecting browsers and a VPN.
As a side note: I'm at least thankful Microsoft isn't behind web gatekeeping. Try solving any Microsoft captcha behind a VPN - it's like writing a thesis; you have to dedicate like five minutes of full attention.
The website owner has rights too. Are you arguing they cannot choose to implement such gatekeeping to keep their site operating in a financially viable manner?
The first article of our constitution says people shall be treated equally in equal situations. I presume that most countries have similar clauses but, beyond legalese, it's also simply in line with my ethics to treat everyone equally
There are people behind those connection requests. I don't try to guess on my server who is a bot and who is not; I'll make mistakes and probably bias against people who use uncommon setups (those needing accessibility aids or using e.g. experimental software that improves some aspect like privacy or functionality)
Sure, I have rights as a website owner. I can take the whole thing offline; I can block every 5th request; I can allow each /16 block to make 1000 requests per day; I can accept requests only from clients that have a Firefox user agent string. So long as it's equally applied to everyone and it's not based on a prohibited category such as gender or religious conviction, I am free to decide on such cuts and I'd encourage everyone to apply a policy that they believe is fair
Cloudflare and its competitors, as far as I can tell, block arbitrary subgroups of people based on secret criteria. It does not appear to be applied fairly, such as allowing everyone to make the same number of requests per unit time. I'm probably bothered even more because I happen to be among the blocked subgroup regularly (but far from all the time, just little enough to feel the pain)
If by "our constitution" you mean the U.S. Constitution then no, it says nothing of the sort. The first article of the U.S. Constitution concerns the organization of the legislative branch. You may be referencing the Equal Protection and Due Process clauses, in the Fifth and Fourteenth amendments, but neither of those applies in this situation either since there are no laws or governmental actions at issue here, and random sites on the internet are not universally considered to be public accommodations. Even in the ADA context, the law isn't actually clear, since websites aren't specified anywhere in the text at the federal level and there's no SCOTUS precedent on point.
Some states are more stringent with their own disability regulations or state constitutions, but no state anywhere in the U.S. has a law that says every visitor to a website has to be treated equally.
You can assume it's the USA and that I'm just dead wrong, but the third word of my profile specifies where I'm from and you'd find that this Dutch constitution matches the comment's contents
Equal protection is indeed not the same as equal treatment. No, it really does say that everyone shall be treated equally so long as the circumstances are equal (gelijke behandeling in gelijke gevallen)
I didn't assume, that's why I started my comment with "if by what you mean." Good to know that you were referencing a different place, but it's unrealistic to expect people to delve into your account bio to understand what you intended by "our constitution," especially when the parent comment also contained no geographic or cultural references. Perhaps you know the parent commenter and know that they share your geography? If so, that would also have been helpful context.
As an aside, I'm curious by how that language in the Dutch constitution actually works in practice. Is it just a game of distinguishing between situations or people to excuse disparate conduct? It seems like it would be unworkable if interpreted literally.
I never said there was anything prohibiting them, just that they will be losing users. (Although, blocking some access can be illegal, for example when accessability tools are blocked.)
There's a whole spectrum of gatekeeping on communications with users, from static sites that broadcast their information to anyone, and stores that let you order without even making an account, to organizations that require you to install local software just to access data and perform transactions. The latter means 90%+ of your users will hate you for it, and half will walk away, but it's still very common, collectively costing businesses that do so billions of dollars a year. (https://www.forbes.com/sites/johnkoetsier/2021/02/15/91-of-u...to-install-apps-to-do-business-costing-brands-billions/)
When companies get big enough to have entire departments devoted to tasks, those departments will follow the fads that bring them the most prestige, at the cost of the rest of the company. Eventually the company will lose out to newer, more efficient businesses that forgo fads in favor of serving customers, and the cycle continues.
I'm just pointing out how a new fad is hurting businesses, but I by no means wish to limit their ability to follow it. They just won't be getting my business, nor business from a quickly growing cohort that desires anonymity, or even requires it to get around growing local censorship.
If you put your information freely on the web, you should have minimal expectations on who uses it and how. If you want to make money from it, put up a paywall.
If you want the best of both worlds, i.e. just post freely but make money from ads, or inserting hidden pixels to update some profile about me, well good luck. I'll choose whether I want to look at ads, or load tracking pixels, and my answer is no.
In a lot of circumstances, that is exactly the case. What the open source license stops is redistribution under terms that violate the license, not usage itself. An individual can very well take your open source code, make any changes they want, compile and use it for their own purposes without adhering to the terms of your license - as long as they don't redistribute it.
All "open source" code was already pretty much public domain. All they'd have to do was put a page of OSI-approved licenses up on the site, right? An index of Open Source projects and their authors? Is this more than a week's work to comply?
Free Software is the only place where this is a real abridgement of rights and intention, and it's over. They've already been trained on all of it, and no judge will tell them to stop, and no congressman will tell them to stop.
I'm not talking about ads or pixels, I'm referring to bot operators creating so much traffic that the network bill makes the hosting financially impossible
You have every right to take the content offline, or to put any technical barriers you desire in place to access it - but that's about all you should be able to do.
If you don't want to lose money and don't feel confident that you can protect your content with technical measures, best to take your stuff off the internet.
I also do the same and get caught up by bot blockers.
However, I do believe the host can do whatever they want with my request also.
This issue becomes more complex when you start talking about government sites, since ideally they have a much stronger mandate to serve everyone fairly.
I agree with you, but the website owners just don't seem to understand that they are turning their small problem into a big problem for real people, some of whom will drop off.
Well, if you have a better way to solve this that’s open I’m all ears. But what Cloudflare is doing is solving the real problem of AI bots. We’ve tried to solve this problem with IP blocking and user agents, but they do not work. And this is actually how other similar problems have been solved. Certificate authorities aren’t open and yet they work just fine. Attestation providers are also not open and they work just fine.
> Well, if you have a better way to solve this that’s open I’m all ears.
Regulation.
Make it illegal for a crawler to request the content of a webpage unless the website operator explicitly allows it via robots.txt. Institute a government agency tasked with enforcement. If you as a website operator can show that traffic came from bots, you can open a complaint with the agency and they take care of shaking painful fines out of the offending companies. Force cloud hosts to keep books on who was using which IP addresses. Will it be a 100% fix? No. Will it have a massive chilling effect if done well? Absolutely.
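For reference, the opt-in mechanism such regulation would hang off already exists mechanically: a well-behaved crawler checks the operator's published rules before every fetch. A sketch using Python's stdlib parser (the robots.txt content and bot names here are illustrative):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: one named bot is denied, everyone else allowed.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A crawler that "respects robots.txt" makes exactly this check
# before issuing the request, and skips the fetch on False.
assert not rp.can_fetch("GPTBot", "https://example.com/post/1")
assert rp.can_fetch("SomeOtherBot", "https://example.com/post/1")
```

The whole dispute, of course, is that nothing today forces a crawler to run that check, or to identify itself honestly in the first place.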
The biggest issue right now seems to be people renting their residential IP addresses to scraper companies, who then distribute large scrapes across these mostly distinct IPs. These addresses are from all over the world, not just your own country, so we'll either need a World Government, or at least massive intergovernmental cooperation, for regulation to help.
I don't think we need a world government to make progress on that point.
The companies buying these services are buying them from other companies. Countries or larger blocs like the EU can exert significant pressure on such companies by declaring the use of such services illegal when interacting with websites hosted in the country or bloc, or by companies in them.
It just seems too easy to skirt around via middlemen. The EU (say) could prosecute an EU company directly doing this residential scraping, and it could probably keep tabs on a handful of bank accounts of known bad actors in other countries, and then investigate and prosecute EU companies transferring money to them. But how do you stop an EU company paying a Moldovan company (that has existed for 10 days) for "internet services", that pays a Brazilian company, that pays a Russian company to do the actual residential scraping? And then there's all the crypto channels and other quid pro quo payment possibilities.
Agreed. It might not be THE BEST solution, but it is a solution that appears to work well.
Centralization bad yada yada. But if Cloudflare can get most major AI players to participate, then convince the major CDN's to also participate.... ipso facto columbo oreo....standard.
Many people may not see a problem with walled gardens, but the reality is that we have much less innovation in mobile than on the web, because anyone can spin up a web server, versus publishing an app in the (Apple) App Store.
I'm not sure if things are as fine as you say they are. Certificate authorities were practically unheard of outside of corporate websites (and even then mostly restricted to login pages) until Let's Encrypt normalized HTTPS. Without the openness of Let's Encrypt, we'd still be sharing our browser history and search queries with our ISPs for data mining. Attestation providers have so far refused to revoke attestation for known-vulnerable devices (because customers needing to replace thousands of devices would be an unacceptable business decision), making the entire market rather useless.
That said, what I am missing from these articles is an actual solution. Obviously we don't want Cloudflare to become an internet gatekeeper. It's a bad solution. But it's a bad solution to an even worse problem.
Alternatives do exist, even decentralised ones, in the form of remote attestation ("can't access this website without secure boot and a TPM and a known-good operating system"), paying for every single visit or for subscriptions to every site you visit (which leads to centralisation because nobody wants a subscription to just your blog), or self-hosted firewalls like Anubis that mostly rely on AI abuse being the result of lazy or cheap parties.
People drinking the AI Kool-Aid will tell you to just ignore the problem, pay for the extra costs, and scale up your servers, because it's *the future*, but ignoring problems is exactly why Cloudflare still exists. If ISPs hadn't ignored spoofing, DDoS attacks, botnets within their network, """residential proxies""", and other such malicious acts, Cloudflare would've been an Akamai competitor rather than a middle man to most of the internet.
The current state of the art in AI poisoning is Nightshade from the University of Chicago. It's meant to eventually be an addon to their WebGlaze[1] which is an invite-only tool meant for artists to protect their art from AI mimicry
Nobody is dying because artists are protecting their art
If science taught us anything it's that no data is ever reliable. We are pretty sure about so many things, and it's the best available info so we might as well use it, but in terms of "the internet can be wrong" -> any source can be wrong! And I'd not even be surprised if internet in aggregate (with the bot reading all of it) is right more often than individual authors of pretty much anything
You don't think that the AI companies will take efforts to detect and filter bad data for training? Do you suppose they are already doing this, knowing that data quality has an impact on model capabilities?
If these companies are adding extra code to bypass artists trying to protect their intellectual property from mimicry then that is an obvious and egregious copyright violation
More likely it will push these companies to actually pay content creators for the content they work on to be included in their models.
Are they? Until Let's Encrypt came along and democratised the CA scene, it was a hellhole. Web security depended on how deep your pockets were. One can argue that the same path is being laid in front of us until a Let's Encrypt comes along and democratises it. And since this is about attestation: how are we going to prevent gatekeepers from doing selective attestation with arguable criteria? How will we prevent political forces?
We have far too many gatekeepers as it is. Any attempt to add any more should be treated as an act of aggression.
Cloudflare seems very vocal about its desire to become yet another digital gatekeeper as of late, and so is Google. I want both reduced to rubble if they persist in it.
Several companies are looking to provide a solution for the AI bot problem. Cloudflare stands to make a lot of money if people pick their solution. But Cloudflare backing down won't make the problem go away, and someone else's bad solution will be chosen instead.
The gatekeeping described here is gatekeeping a website owner chooses. It's an alternative to pay walls, bespoke bot detection, or some kind of ID verification. Cloudflare already provides a service, but standardising the service will open up the market (at the cost of competitors adopting Cloudflare's standard).
The freedom of the open web also extends to the owners of the websites people visit.
What do you mean Google "desires" to become a gatekeeper? They have been a gatekeeper for years, since they control the browser everyone uses, and Firefox usage is now in the noise. Google just steers the www where they want it to go. Killing ublock, pushing .webp trash, etc.
An allowlist run by one company that site owners chose to engage with. But the irony of taking an ideological stance about fairness while using AI generated comics for blog posts…
Cloudflare is implementing the (still-emerging) Web Bot Auth standard. We're working on the same at Stytch for https://IsAgent.dev .
The discourse around this is a little wild and I'm glad you said this. The allowlist is a Cloudflare feature and their customers are free to use it. The core functionality involving HTTP Message Signatures is decentralized and open, so anyone can adopt it and benefit.
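For the curious, the core idea of HTTP Message Signatures is that signer and verifier derive an identical "signature base" from the covered request components, so any tampering with those components invalidates the signature. The sketch below is a loose, hypothetical simplification: real Web Bot Auth deployments use asymmetric keys (e.g. Ed25519) so origins can verify against a published public key, and the real signature base format is more involved; HMAC is used here only to keep the sketch self-contained.

```python
import hashlib
import hmac

def signature_base(method: str, authority: str, path: str, key_id: str) -> str:
    # Signer and verifier must derive the exact same base string from
    # the covered request components (simplified from the real format).
    return "\n".join([
        f'"@method": {method}',
        f'"@authority": {authority}',
        f'"@path": {path}',
        f'"@signature-params": ("@method" "@authority" "@path");keyid="{key_id}"',
    ])

def sign(secret: bytes, base: str) -> str:
    return hmac.new(secret, base.encode(), hashlib.sha256).hexdigest()

def verify(secret: bytes, base: str, signature: str) -> bool:
    return hmac.compare_digest(sign(secret, base), signature)

# A bot signs its request; the origin recomputes the base and checks it.
secret = b"demo-shared-secret"
base = signature_base("GET", "example.com", "/post/1", "my-bot-key-1")
sig = sign(secret, base)
assert verify(secret, base, sig)

# Any change to a covered component invalidates the signature.
tampered = signature_base("GET", "example.com", "/admin", "my-bot-key-1")
assert not verify(secret, tampered, sig)
```

The decentralizing part is key distribution: any bot operator can publish a key and sign requests, and any site can verify, without Cloudflare in the loop.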
I can't speak for the other commenter, but I think companies like Midjourney and OpenAI are robber barons exploiting people's creative work in ways that obviously aren't fair, but that our legal system wasn't equipped to prevent.
It's a frying pan/fire choice that could create a de-facto standard we end up depending on, during a critical moment where the hot topic could have a protocol or standards based solution. Cloudflare is actively trying to make a blue ocean for themselves of a real issue affecting everyone.
>But the irony of taking an ideological stance about fairness while using AI generated comics for blog posts…
This is sort of like how email is based on Internet standards but a large percentage of email users use Gmail. The Internet standards Cloudflare is promoting are open, but Cloudflare has a lot of power due to having so many customers.
(What are some good alternatives to Cloudflare?)
Another way the situation is similar: email delivery is often unreliable and hard to implement due to spam filters. A similar thing seems to be happening to the web.
It is a big problem. There is no good alternative to Cloudflare as a free CDN. They put servers all over the world and they are giving them away for free. And making their money on premium serverless services.
Not to mention the big cloud providers are unhinged with their egress pricing.
> Not to mention the big cloud providers are unhinged with their egress pricing.
I always wonder why this status quo persisted even after Cloudflare. Their pricing is indeed so unhinged, that they're not even in consideration for me for things where egress is a variable.
Why is egress seemingly free for Cloudflare or Hetzner but feels like they launch spaceships at AWS and GCP every time you send a data packet to the outside world?
They are just greedy. And they know nobody can compete with them for availability in every country. Except for Cloudflare, which is why it is so popular.
The web doesn't need attestation. It doesn't need signed agents. It doesn't need Cloudflare deciding who's a "real" user agent. It needs people to remember that "public" means PUBLIC and implement basic damn rate limiting if they can't handle the traffic.
The web doesn't need to know if you're a human, a bot, or a dog. It just needs to serve bytes to whoever asks, within reasonable resource constraints. That's it. That's the open web. You'll miss it when it's gone.
Basic damn rate limiting is pretty damn exploitable. Even ignoring botnets (which is impossible), usefully rate limiting IPv6 is anything but basic. If you just pick some prefix from /48 to /64 to key your rate limits on, you'll either be exploitable by IPs from providers that hand out /48s like candy or you'll bucket a ton of mobile users together for a single rate limit.
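To make that trade-off concrete, here's a hypothetical sketch of prefix-keyed rate limiting, with /56 as the compromise prefix length and its failure modes called out in the comments:

```python
import ipaddress

# Hypothetical: key rate-limit buckets on a fixed IPv6 prefix length.
# /56 is a common compromise, but as noted above it either merges many
# mobile users into one bucket or lets a /48 holder rotate through
# 256 distinct /56 buckets to dodge the limit.
RATE_LIMIT_PREFIX = 56

def rate_limit_key(client_ip: str) -> str:
    addr = ipaddress.ip_address(client_ip)
    if addr.version == 4:
        return client_ip  # IPv4: key on the single address
    net = ipaddress.ip_network(f"{client_ip}/{RATE_LIMIT_PREFIX}", strict=False)
    return str(net)

# Two addresses in the same /56 share one bucket (good or bad,
# depending on whether they're one abuser or a whole mobile carrier)...
assert rate_limit_key("2001:db8::1") == rate_limit_key("2001:db8::ffff")
# ...while a /48 holder still gets 256 independent buckets to cycle through.
assert rate_limit_key("2001:db8:0:100::1") != rate_limit_key("2001:db8::1")
```

Whatever prefix you pick, one side of that assertion pair is working against you.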
You make unauthenticated requests cheap enough that you don't care about volume.
Reserve rate limiting for authenticated users where you have real identity. The open web survives by being genuinely free to serve, not by trying to guess who's "real."
A basic Varnish setup should get you most of the way there, no agent signing required!
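The idea in miniature - a hypothetical TTL cache in front of an expensive render path, which is roughly what Varnish does at the HTTP layer:

```python
import time

# Hypothetical miniature of the caching argument: serve unauthenticated
# requests from an in-memory cache with a TTL, so repeated hits never
# touch the expensive render path.
CACHE: dict = {}
TTL_SECONDS = 60
renders = 0  # counts how often the expensive path actually runs

def render_page(path: str) -> str:
    global renders
    renders += 1
    return f"<h1>Hello from {path}</h1>"

def handle(path: str) -> str:
    now = time.monotonic()
    hit = CACHE.get(path)
    if hit and now - hit[0] < TTL_SECONDS:
        return hit[1]  # cache hit: no render, near-zero cost
    body = render_page(path)
    CACHE[path] = (now, body)
    return body

# A thousand bot hits to the same path cost exactly one render.
for _ in range(1000):
    handle("/post/1")
assert renders == 1
```

As the replies below note, this helps with CPU but does nothing for the bandwidth bill: every cache hit is still bytes you pay to serve.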
Your response to unauthenticated requests could be <h1>Hello world</h1> served from memory and your server/link will still fail under a volumetric attack, and you still get the pleasure of paying for the bandwidth.
So no, this advice has been outdated for decades.
Also you're doing some sort of victim blaming where everyone on earth has to engineer their service to withstand DoS instead of outsourcing that to someone else. Abusers outsource their attacks to everyone else's machine (decentralization ftw!), but victims can't outsource their defense because centralization goes against your ideals.
At least lament the naive infrastructure of the internet or something, sheesh.
We started with "AI crawlers are too aggressive" and you've escalated to volumetric DDoS. These aren't the same problem. OpenAI hitting your API too hard is solved by caching, not by Cloudflare deciding who gets an "agent passport."
"Victim blaming"? Can we please leave these therapy-speak terms back in the 2010s where they belong and out of technical discussions? If expecting basic caching is victim blaming, then so is expecting HTTPS, password hashing, or any technical competence whatsoever.
Your decentralization point actually proves mine: yes, attackers distribute while defenders centralize. That's why we shouldn't make centralization mandatory! Right now you can choose Cloudflare. With attestation, they become the web's border control.
The fine article makes it clear what this is really about - Cloudflare wants to be the gatekeeper for agent traffic. Agent attestation doesn't solve volumetric attacks (those need the DDoS protection they already sell, no new proposal required!) They're creating an allowlist where they decide who's "legitimate."
But sure, let's restructure the entire web's trust model because some sites can't configure a cache. That seems proportional.
OpenAI hitting your static, cached pages too hard and costing you terabytes of extra bandwidth that you have to pay for (both in bandwidth itself and data transfer fees) isn't solved by caching.
The post you're replying to points out that, at a certain scale, even caching things in-memory can cause your system to fall over when user agents (e.g. AI scraper bots) are behaving like bad actors, ignoring robots.txt, and fetching every URL twenty times a day while completely ignoring cache headers/last modified/etc.
Your points were all valid when we were dealing with either "legitimate users", "legitimate good-faith bots", and "bad actors", but now the AI companies' need for massive amounts of up-to-the-minute content at all costs means that we have to add "legitimate bad-faith bots" to the mix.
> Agent attestation doesn't solve volumetric attacks (those need the DDoS protection they already sell, no new proposal required!) They're creating an allowlist where they decide who's "legitimate."
Agent attestation solves overzealous AI scraping which looks like a volumetric attack, because if you refuse to provide the content to the bots then the bots will leave you alone (or at least, they won't chew up your bandwidth by re-fetching the same content over and over all day).
Well, your post escalated to the broad claim that I responded to.
You didn't just disagree with AI crawler attestation: you're saying that nobody should distinguish earnest users from everything else because they should bear the cost of serving both, which necessarily entails bad traffic and incidental DoS.
Once again, services like Cloudflare exist because a cache isn't sufficient to deal with arbitrary traffic, and the scale of modern abuse is so large that only a few megacorps can provide the service that people want.
> You make unauthenticated requests cheap enough that you don't care about volume.
In the days before mandatory TLS it was so easy to set up a Squid proxy on the edge of my network and cache every plain-HTTP resource for as long as I want.
Like yeah, yeah, sure, it sucked that ISPs could inject trackers and stuff into page contents, but I'm starting to think the downsides of mandatory TLS outweigh the upsides. We made the web more Secure at the cost of making it less Private. We got Google Analytics and all the other spyware running over TLS and simultaneously made it that much harder for any normal person to host anything online.
Modern AI crawlers are indistinguishable from malicious botnets. There's no longer any rate limiting strategy that's effective - that's entirely the point of what Cloudflare is attempting to solve.
"It needs people to remember that "public" means PUBLIC and implement basic damn rate limiting if they can't handle the traffic."
And publish the acceptable rate.
But anyone who has ever been blocked for sending a _single_ HTTP request with the "wrong" user-agent string knows that the issue website operators are worried about is not necessarily rate (behaviour). Website operators routinely believe there is no such thing as a well-behaved bot. Thus they disregard behaviour and only focus on identity. If their crude heuristics with high probability of false positives suggest "bot" as the identity then their decision is to block, irrespective of behaviour, and ignore any possibility the heuristics may have failed. Operators routinely make (incorrect) assumptions about intent based on identity not behaviour.
Yes, I think that you are right (although rate limiting can sometimes be difficult to work properly).
Delegation of authorization can be useful for things that require it (as in some of the examples given in the article), but public files should not require authorization nor authentication for accessing them. Even if delegation of authorization is helpful for some uses, Cloudflare (or anyone else, other than whoever is delegating the authorization) does not need to be involved in them.
> public files should not require authorization nor authentication for accessing them
Define "public files" in this case?
If I have a server with files, those are my private files. If I choose to make them accessible to the world then that's fine, but they're still private files and no one else has a right to access them except under the conditions that I set.
What Cloudflare is suggesting is that content owners (such as myself, HN, the New York Times, etc.) should be provided with the tools to restrict access to their content if unfettered access to all people is burdensome to them. For example, if AI scraper bots are running up your bandwidth bill or server load, shouldn't you be able to stop them? I would argue yes.
And yet you can't. These AI bots will ignore your robots.txt, they'll change user agents if you start to block their user agents, they'll use different IP subnets if you start to block IP subnets. They behave like extremely bad actors and ignore every single way you can tell them that they're not welcome. They take and take and provide nothing in return, and they'll do so until your website collapses under the weight and your readers or users leave to go somewhere else.
> For example, if AI scraper bots are running up your bandwidth bill or server load, shouldn't you be able to stop them? I would argue yes
I also say yes, but this is not because of a lack of authorization; it is because of excessive server load (which is what you describe).
Allowing other public mirrors of files would be one thing that can help (providing archive files might also sometimes be useful), although that does not actually prevent excessive scraping, given how badly the scrapers behave (which is also what you describe).
Some people may use Cloudflare, but Cloudflare brings its own problems: a lot of legitimate access gets stopped, while not all illegitimate access is necessarily prevented, and it sometimes causes additional problems (occasionally due to misconfiguration, but not always).
> These AI bots will ignore your robots.txt, they'll change user agents if you start to block their user agents, they'll use different IP subnets if you start to block IP subnets
In my experience they change user agents and IP subnets whether or not you block them, and regardless of what else you might do.
I agree with pretty much everything the author has said. I’ve been looking at the problem more on the enterprise side of things: how do you control what agents can and can’t do on a complex private network, let alone the internet.
I’ve actually just built an “identity token” using biscuit that you can delegate however you want after. So I can authenticate (to my service, but it could be federated or something just as well), get a token, then choose to create a delegated identity token from that for my agent. Then my agent could do the same for subagents.
In my system, you then have to exchange your identity token for an authorization token to do anything (single scope, single use).
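That exchange step can be sketched with macaroon-style HMAC chaining — a simplified stand-in for biscuit's offline attenuation, not its actual format (all names and caveat strings below are invented):

```python
import hmac
import hashlib

def attenuate(key: bytes, caveat: str) -> bytes:
    # Derive a new token key bound to an added restriction. Whoever holds
    # the derived key can add further caveats but cannot recover the
    # parent key, i.e. cannot drop restrictions.
    return hmac.new(key, caveat.encode(), hashlib.sha256).digest()

def mint(root_secret: bytes, identity: str):
    # Authenticate once, get an identity token bound to who you are.
    caveats = [f"identity={identity}"]
    return caveats, attenuate(root_secret, caveats[0])

def delegate(caveats, key, caveat: str):
    # Offline delegation: append a caveat, derive a narrower key.
    return caveats + [caveat], attenuate(key, caveat)

def verify(root_secret: bytes, caveats, presented_key: bytes) -> bool:
    # The service replays the chain from its root secret and compares.
    key = root_secret
    for caveat in caveats:
        key = attenuate(key, caveat)
    return hmac.compare_digest(key, presented_key)

# User -> agent -> sub-agent, each step narrower than the last.
secret = b"server-side root secret"
caveats, key = mint(secret, "alice")
caveats, key = delegate(caveats, key, "scope=read")          # agent
caveats, key = delegate(caveats, key, "expires=2025-01-01")  # sub-agent
```

The point of the construction is that delegation requires no round-trip to the issuer, yet no delegate can widen its own authority.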
For the internet, I’ve wondered about exchanging the identity token + a small payment (like a minuscule crypto amount) for an authorization token. Human users would barely spend anything. Bots crawling the web would spend a lot.
Maybe the title means something more like "The web should not have gatekeepers (Cloudflare)". They do seem to say as much toward the end:
>We need protocols, not gatekeepers.
But until we have working protocols, many webmasters literally do need a gatekeeper if they want to realistically keep their site safe and online.
I wish this weren't the case, but I believe the "protocol" era of the web was basically ended when proprietary web 2.0 platforms emerged that explicitly locked users in with non-open protocols. Facebook doesn't want you to use Messenger in an open client next to AIM, MSN, and IRC. And the bad guys won.
The funny thing is that this blog post is complaining about a proposed protocol from Cloudflare (one which will identify bots so that good bots can be permitted). The signup form is just a method to ask Cloudflare (or any other website owner/CDN) to be categorized as a good bot.
It's not a great protocol if you're in the business of scraping websites or selling people bots to access websites for them, but it's a great protocol for people who just want their website to work without being overwhelmed by the bad side of the internet.
The whitelist approach Cloudflare takes isn't good for the internet, but for website owners who are already behind Cloudflare, it's better than the alternative. Someone will need to come up with a better protocol that also serves the website owners' needs if they want Cloudflare to fail here. The AI industry simply doesn't want to cooperate, so their hand must be forced, and only companies like Cloudflare are powerful enough to accomplish that.
Conventional crawlers already have a way to identify themselves, via a JSON file containing a list of IP addresses. Cloudflare is fully aware of this de facto standard.
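Verifying such a claim is a few lines of stdlib code; the range document below is a made-up stand-in in the shape of e.g. Googlebot's published `googlebot.json`:

```python
import ipaddress
import json

# Hypothetical crawler IP-range document, in the shape the large
# crawlers publish: a list of CIDR prefixes the bot's requests
# may legitimately come from. These ranges are documentation/example
# addresses, not any real crawler's.
directory = json.loads("""
{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"},
              {"ipv6Prefix": "2001:db8::/32"}]}
""")

networks = [
    ipaddress.ip_network(p.get("ipv4Prefix") or p.get("ipv6Prefix"))
    for p in directory["prefixes"]
]

def claims_check_out(client_ip: str) -> bool:
    # True if a request claiming this crawler's user agent comes from a
    # published range; a spoofed user agent from elsewhere fails.
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in networks)
```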
I think the reality is, we need identity on both the client and server sides.
At some point soon, if not now, assume everything is generated by AI unless proven otherwise using a decentralized ID.
Likewise, on the server side, assume it’s a bot unless proven otherwise using a decentralized ID.
We can still have anonymity using decentralized IDs. An identity can be an anonymous identity, it’s not all (verified by some central official party) or nothing.
It's called an IP address. Since some ISPs don't assign a fixed IP to a subscriber, a timestamp is nowadays necessary. The combination is traceable to a subscriber who is responsible for the line, either to work with law enforcement if subpoenaed or to not send abusive traffic via the line themselves.
Why law enforcement doesn't do their job, resulting in people not bothering to report things anymore, is imo the real issue here. Third party identification services to replace a failing government branch is pretty ugly as a workaround, but perhaps less ugly than the commercial gatekeepers popping up today
I pretty much use Perplexity exclusively at this point, instead of Google. I'd rather just get my questions answered than navigate all of the ads and slowness that Google provides. I'm fine with paying a small monthly fee, but I don't want Cloudflare being the gatekeeper.
Perhaps a way to serve ads through the agents would be good enough. I'd prefer that to be some open protocol than controlled by a company.
This has been my experience more recently as well, I've finally migrated from google to Brave Search since google was just slow for me.
I also appreciate the AI search results a bit when I'm looking for something very specific (like what the YAML definition for a Docker Swarm deployment constraint looks like), because the AI just gives me the snippet while the search results are 300 Medium blog posts about how to use Docker, and none of them explain the variables or what each does. Even the official Docker documentation website is a mess to navigate and find anything relevant!
Perplexity has been one of the AI companies that created the problem that gave rise to this CF proposal. Why doesn't Perplexity invest more into being a responsible scraper?
Perplexity is the problem Cloudflare and companies like it are trying to solve. The company refuses to take no for an answer and will mislead and fake their way through until they've crawled the content they wanted to crawl.
The problem isn't just that ads can't be served. It's that every technical measure to attempt to block their service produces new ways of misleading website owners and the services they use. Perplexity refuses any attempt at abuse detection and prevention from their servers.
None of this would've been necessary if companies like Perplexity would've just acted like a responsible web service and told their customers "sorry, this website doesn't allow Perplexity to act on your behalf".
The open protocol you want already exists: it's the user agent. A responsible bot will set the correct user agent, maybe follow the instructions in robots.txt, and leave it at that. Companies like Perplexity (and many (AI) scrapers) don't want to participate in such a protocol. They will seek out and abuse any loopholes in any well-intended protocol anyone can come up with.
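For what that minimal protocol looks like in practice, Python's standard library already ships a robots.txt parser; the bot name and rules below are invented:

```python
from urllib import robotparser

# A responsible client parses robots.txt before fetching.
# Hypothetical rules for an imaginary "ExampleBot":
rules = """
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow:
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

private_ok = rp.can_fetch("ExampleBot", "https://example.com/private/page")
public_ok = rp.can_fetch("ExampleBot", "https://example.com/blog/post")
```

A bot that sets an honest user agent and consults these rules is participating in the protocol; one that rotates fake browser agents has opted out of it.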
I don't think anyone wants Cloudflare to have even more influence on the internet, but it's thanks to the growth of inconsiderate AI companies like Perplexity that these measures are necessary. The protocol Cloudflare proposes is open (it's just a signature), the problem people have with it is that they have to ask Cloudflare nicely to permit website owners to track and prevent abuse from bots. For any Azure-gated websites, your bot would need to ask permission there as well, as with Akamai-gated websites, and maybe even individual websites.
A new protocol is a technical solution. Technical solutions work for technical problems. The problem Cloudflare is trying to solve isn't a technical problem; it's a social problem.
Cloudflare slows the whole damn websites down. It takes many seconds to deal with their trash. I hope they crash and burn. Let's get back to very low latency websites without the cloudflare garbage.
Cloudflare as a CDN greatly greatly speeds up the web.
All the custom code they write on top of that to transform HTML for you? Ehhhh... don't use those features. Most are easily reproducible on the backend.
We don't need gatekeepers. We do need to verify agents that act, in a reasonable way, on behalf of a human, vs an agent swarm/bot-mining operation (whether conducted by a large lab or a kid programming Claude Code to DDoS his buddy's Next.js deployment).
By not using Cloudflare your website will be indexed by everyone. The gatekeeper aspect only applies if you use Cloudflare to distribute your website (and even then Cloudflare offers options to control this bot shield thing).
The content you host will only be blocked from being indexed if you decide to use a service that blocks indexing. If you host your content on other people's services, then you never had the power to make that decision anyway.
If you want your content to be indexed, simply don't use Cloudflare. Host your own servers. Use a different CDN if you want the benefits of Cloudflare's networks.
The private tracker community have long figured this out. Put content behind invite-only user registration, and treeban users if they ever break the rules.
This doesn't scale to the general web, does it? I think invite-only might work to build communities, but you end up in the situation we're in today where people are buying/selling invites, and that's with treebans in place.
I do fear the actions of the current bot landscape is going to lead to almost everything going behind auth walls though, and perhaps even paid auth walls.
I've been considering making this for the web. Why wouldn't it scale? Those selling invites would get banned soon enough if the people they distribute their invite to then send abusive traffic. Mystery shoppers can also make that a risky business if it's disallowed to sell invites (forcing them to be mostly free, such that the giver has nothing to gain from inviting someone who is willing to pay)
One of the practical problems I rather saw was bootstrapping: how to convince any website owner to use it, when very few people are on the system? Where should they find someone to get invites from?
As for tracking (auth walls), the website needs not know who you are. They just see random tokens with signatures and can verify the signature. If there's abuse, they send evidence to the tree system, where it could be handled similarly to HN: lots of flags from different systems will make an automated system kick in, but otherwise a person looks at the issue and decides whether to issue a warning or timeout. (Of course, the abuse reporting mechanism can also be abused so, again similar to HN, if you abuse the abuse mechanism then you don't count towards future reports.)
Ideally, we'd not need this and let real judges do the job of convicting people of abuse and computer fraud, but until such time, I'd rather use the internet anonymously with whatever setup I like than face blocks regularly while doing nothing wrong
I don't think it scales, because I'm not sure it scales on private trackers already. I'm not deep into that space, but I think there's a lot of problems with it that will scale as adoption scales, particularly around policing the sale of invites - the hope would be it self-police through treebanning, but I'm not sure it does.
I think a sort of pseudo-anonymous auth system with baked-in invites and treebans that website owners could easily adopt is interesting though. I'm not sure it's a business - for adoption reasons it likely needs to be a protocol - but it's an interesting idea, if it doesn't just turn into a huge admin headache for publishers.
With what they say about authorization, I think X.509 would help. (Although central certificate authorities are often used with X.509, it does not have to be that way; the service you are operating can issue the certificate to you instead, or they can accept a self-signed certificate which is associated with you the first time it is used to create an account on their service.)
You can use the admin certificate issued to you, to issue a certificate to the agent which will contain an extension limiting what it can be used for (and might also expire in a few hours, and also might be revoked later). This certificate can be used to issue an even more restricted certificate to sub-agents.
This is already possible (and would be better than the "fine-grained personal access tokens" that GitHub uses), but does not seem to be commonly implemented. It also improves security in other ways.
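Under the hood this is plain X.509 mechanics, so the chain can be demonstrated with the `openssl` CLI (filenames, subjects, and lifetimes here are illustrative):

```shell
# Admin key + self-signed cert allowed to issue sub-certificates
# (pathlen:1 leaves room for one further level of delegation)
openssl req -x509 -newkey rsa:2048 -nodes -keyout admin.key -out admin.crt \
  -subj "/CN=admin" -days 365 \
  -addext "basicConstraints=critical,CA:TRUE,pathlen:1"

# Agent key + certificate request
openssl req -new -newkey rsa:2048 -nodes -keyout agent.key -out agent.csr \
  -subj "/CN=agent"

# Issue a short-lived agent cert, constrained to client authentication.
# Making this CA:TRUE,pathlen:0 instead would let the agent delegate
# a further-restricted cert to sub-agents.
printf "basicConstraints=critical,CA:FALSE\nextendedKeyUsage=clientAuth\n" > agent.ext
openssl x509 -req -in agent.csr -CA admin.crt -CAkey admin.key \
  -CAcreateserial -days 1 -extfile agent.ext -out agent.crt

# The service verifies the chain against the admin cert it trusts
openssl verify -CAfile admin.crt agent.crt
```

The expiry and the extendedKeyUsage constraint travel inside the certificate itself, so the service needs nothing beyond the admin cert it already trusts.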
So, it can be done in such a way that Cloudflare does not need to issue authorization to you, or necessarily to be involved at all. Google does not need to be involved either.
However, that is only for things that should normally require authorization anyway. Reading public data is not something that should require authorization; the problem there is excessive scraping (there seem to be too many LLM scrapers and others that are far too aggressive) and excessive blocking (e.g. someone using a different web browser, or curl to download one file, or even someone using a common browser and configuration when something strange and unexpected happens, etc.). The above is unrelated to that, so certificates and the like do not help, because they solve a different problem.
What problem does this solve that a basic API key doesn't solve already? The issue with that approach is that you will require accounts/keys/certificates for all hosts you intend to visit, and malicious bots can create as many accounts as they need. You're just adding a registration step to the crawling process.
Your suggested approach works for websites that want to offer AI access as a service to their customers, but the problem Cloudflare is trying to solve is that most AI bots are doing things that website owners don't want them to do. The goal is to identify and block bad actors, not to make things easier for good actors.
Using mTLS/client certificates also exposes people (that don't use AI bots) to the awful UI that browsers have for this kind of authentication. We'll need to get that sorted before an X509-based solution makes any sense.
I used to joke that I worked for the last DotCom startup, a company that got a funding round after the shit hit the fan.
They were working on an idea that looked a bit like an RSS feed for an entire website, where you would run your own spider and then our search engine could hit an endpoint to get a delta instead of having to scan your entire site.
If they’d made the protocol open instead of proprietary, we maybe could have gotten spiders to play nicer since each spider after the first would be cheaper, and eventually maybe someone could build pub sub hooks into common web frameworks to potentially skip the scan entirely for read-mostly websites, generating delta data when your data changed.
But of course when the next round of funding came due nobody was buying.
I thought about this a lot on my last project, where spiders were our customers’ biggest users. One of those apps where customer interactions were intense but brief and the rank in Google mattered equally with all other concerns. Nobody had architected for the actual read/write workflow of the system of course, and that company sold to a competitor after I left. Who migrated all customers to their solution and EOLed ours for being too fat in a down economy.
I wish Cloudflare would roll out AI poisoning attack as protection for their clients (providing bad data cache to AI bots), instead of this. Would work like a charm.
I think about this as a startup founder building a 'proof-of-human' layer on the Internet.
One of the hard parts in this space is what level of transparency should you have. We're advancing the thesis that behavioral biometrics offers robust continuous authentication that helps with bot/human and good/bad, but people are obviously skeptical to trust black-box models for accuracy and/or privacy reasons.
We've defaulted to a lot of transparency in terms of publishing research online (and hopefully in scientific journals), but we've seen the downside: competitors make fake claims about their own best-in-class behavioral tools hidden behind their company walls, in addition to investors constantly worried about an arms race.
As someone genuinely interested (and incentivized!) to build a great solution in this space, what are good protocols/examples to follow?
as a Cloudflare customer, I am happy with their proposition. I personally do not want companies like Perplexity that fake their user-agent and ignore my robots.txt to trespass.
and isn't this why people sign up with Cloudflare in the first place? for bot protection? to me, this is just the same, but with agents.
i love the idea of an open internet, but this requires all parties to be honest. a company like Perplexity that fakes their user-agent to get around blocks disrespects that idea.
my attitude towards agents is positive. if a user used an LLM to access my websites and web apps, i'm all for it. but the LLM providers must disclose who they are - that they are OpenAI, Google, Meta, or the snake oil company Perplexity
Your complaints about "faking their user-agent" reminds me of this 15-year-old but still-relevant, classic post about the history of the user-agent string:
The traditional UA fakery (adding Mozilla to the start and then just tacking on browser engine names) was the result of outdated websites breaking browsers.
The problematic fakery here is that bots are pretending to be people by emulating browsers to prevent rate limits and other technical controls.
That second category has also been with us since the dawn of the internet, but it has always been something worth complaining about. No trustworthy tool or service will pretend to be a real browser, at least not by default.
If AI agents just identified themselves as such, we wouldn't need elaborate schemes to block them when they need to be blocked.
I recently ran a test on the page load reliability of Browserbase and I was shocked to see how unreliable it was for a standard set of websites - the top 100 websites in the US by traffic according to SimilarWeb. 29% of page load requests failed. Without an open standard for agent identification, it will always be a cat and mouse game to trap agents, and many agents will predictably fail simple tasks.
Writing backends that can actually handle public traffic and using authentication for expensive resources are fantastic alternatives.
Also, cheaply rate limiting malicious web clients should be something that is trivial to accomplish with competent web tooling (i.e., on your own servers). If this seems out of scope or infeasible, you might be using the wrong tools for the job.
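As a concreteness check, per-client rate limiting of that kind is a few lines of stdlib code — a token-bucket sketch with invented limits (a real deployment would also evict idle clients):

```python
import time
from collections import defaultdict

class TokenBucket:
    # Cheap per-client rate limiting: each client may burst up to
    # `capacity` requests and refills at `rate` requests per second.
    def __init__(self, rate=5.0, capacity=10.0):
        self.rate, self.capacity = rate, capacity
        self.tokens = defaultdict(lambda: capacity)
        self.stamp = {}

    def allow(self, client, now=None):
        now = time.monotonic() if now is None else now
        last = self.stamp.get(client, now)
        self.stamp[client] = now
        # Refill proportionally to elapsed time, capped at capacity
        tokens = min(self.capacity,
                     self.tokens[client] + (now - last) * self.rate)
        if tokens >= 1.0:
            self.tokens[client] = tokens - 1.0
            return True
        self.tokens[client] = tokens
        return False
```

Keyed on IP, API key, or any other identifier, this runs happily in front of an origin server with no third party involved.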
This sounds pretty unrealistic: the web is not better off if the only people who can host content are locking it behind authentication and/or have significant infrastructure budgets and the ability to create heavily tuned static stacks.
Bankruptcy as a surprise gift is not an alternative. Even those that use big cloud providers like AWS and GCP use CDNs like Cloudflare to protect themselves. And there is no free CDN like Cloudflare.
As part of the AWS free Usage Tier you can get started with Amazon CloudFront for free.
Included in Always Free Tier
1 TB of data transfer out to the internet per month
10,000,000 HTTP or HTTPS Requests per month
2,000,000 CloudFront Function invocations per month
2,000,000 CloudFront KeyValueStore reads per month
10 Distribution Tenants
Free SSL certificates
No limitations, all features available
1 TB per month of data is literally nothing. A kid could rent a VPS for an hour and drain all that. What do you do after that? AWS is not going to stop your bill going up is it?
I don't care about any of those fancy serverless services. I am just talking about the cheapest CDN.
Ah, for cheapest CDN, maybe you're right. I think BlazingCDN can also be cheap, but Cloudflare might be the best deal. OP didn't really say there wasn't any cheaper alternative, just said "no real good alternatives".
Someone can rent a 1Gbps server for cheap (under $50 on OVH) and pull 330TB in a month from your site. That's about $30k of egress on AWS if you don't do anything to stop it.
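The arithmetic behind those figures, with an assumed flat $0.09/GB egress price (real AWS pricing is tiered and varies by region):

```python
# Rough numbers for a 1 Gbps line saturated for a 30-day month.
line_rate_bps = 1e9                       # 1 Gbps link
seconds = 30 * 24 * 3600                  # one month
bytes_pulled = line_rate_bps / 8 * seconds
terabytes = bytes_pulled / 1e12           # roughly 324 TB
egress_cost = bytes_pulled / 1e9 * 0.09   # assumed $0.09/GB egress
```

Even at a generous discount tier, the asymmetry stands: tens of dollars of rented bandwidth can generate tens of thousands of dollars of metered egress.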
AWS needs a dedicated AWS engineer, while any technical person and even some non-technical people have the skills to set up Cloudflare. Especially without surprise bills.
We were supposed to pentest a website on AWS WAF last week. We encountered three types of blocks:
1) hard block without having done any requests yet. No clue why. Same browser (Burp's built-in Chromium), same clean state, same IP address, but one person got a captcha and the other one didn't. It would just say "reload the page to try again" forever. This person simply couldn't use the site at all; not sure if that would happen if you're on any other browser, but since it allowed the other Burp Suite browser, that doesn't seem to be the trigger for this perma-ban. (The workaround was to clone the cookie state from the other consultant, but normal users won't have that option.)
2) captcha. I got so many captchas, like every 4th request. It broke the website (async functionality) constantly. At some point I wanted to try a number of passwords for an admin username that we had found and, to my surprise, it allowed hundreds of requests without captcha. It blocks humans more than this automated bot...
3) "this website is under construction" would sometimes appear. Similar to situation #1, but it seemed to be for specific requests rather than specific persons. Inputting the value "1e9" was fine, "1e999" also fine, but "1e99" got blocked, but only on one specific page (entering it on a different page was fine). Weird stuff. If it doesn't like whatever text you wrote on a support form, I guess you're just out of luck. There's no captcha or anything you can do about it (since it's pretending the website isn't online at all). Not sure if this was AWS or the customer's own wonky mod_security variant.
I dread to think if I were a customer of this place and I urgently needed them (it's not a regular webshop but something you might need in a pinch) and the only thing it ever gives me is "please reload the page to try again". Try what again?? Give me a human to talk to, any number to dial!
On the first fricking pageload I got blocked and couldn't open it at all, no captcha shown. That's a success only insofar as you want to exclude random people who don't have a second person whose cookie state to copy
Also mind that not every request we make is malicious. A lot of it is also seeing what's even there, doing baseline requests, normal things. I didn't get the impression that I got blocked more on malicious requests than normal browsing at all (see also the part where a bot could go to town on a login form while my manual navigation was getting captchas)
Some websites will detect a Burp proxy and act accordingly. If you did your initial page load with any kind of integration like that, that's why the WAF may have blocked your request. I don't know exactly how they do it (my guess is fingerprinting the TLS handshake and TCP packet patterns), but I have seen several services do a great job at blocking any kind of analyzing proxy.
> The same is true online. A cryptographic signature that claims “I am acting on behalf of X” means nothing unless it is tied to something real, like a verifiable infrastructure or a range of IPs. Without that, I can simply hand the passport to another agent, and they can act as if they were me. The passport becomes nothing more than a token anyone can pass around.
at its foundation, the bot issue is in fact 3 main issues:
bots vs humans:
humans are trying to buy tickets that were sold out to a bot
data scraping:
you index my data (real estate listings) not to route traffic to my site as people search for my product, as a search engine would, but rather to become my competitor.
spam (and scam):
digital pollution, or even worse, trying to input credit card, gift cards, passwords, etc.
(obviously there are more, most which will fall into those categories, but those are the main ones)
now, in the era of human-assisted AI, the first issue is no longer an issue, since it is obvious that each of us, the internet users, will soon have an agent built into our browser. so we will all have the speedy automated select, click and checkout at our disposal.
Prior to the LLM era, there were search engines and academic research on the right side of the internet-bot map, and scrapers, and worse, on the wrong side. but now we have legitimate human users extending their interaction with an LLM agent, and on top of it, we have new AI companies, larger and smaller, which hunger for data in order to train their models.
Cloudflare is simply trying to make sense of this, whilst keeping their bot protection relevant.
I do not appreciate the post content whatsoever, since it lacks consistency and maturity (a true understanding of how the internet works, rather than a naive one).
when you talk about "the internet", what exactly are you referring to?
a blog? a bank account management app? a retail website? social media?
those are all part of the internet and each is a complete different type of operation.
EDIT:
I've written a few words about this back in January [1] and in fact suggested something similar:
> Leading CDNs, CAPTCHA providers, and AI vendors—think Cloudflare, Google reCAPTCHA, OpenAI, or Anthropic—could collaborate to develop something akin to a "tokenized machine ID."
This is like saying companies don't need security gates and checkpoints. Unfortunately the world is filled with bad people, and you need security to keep them off your property.
Are bots using a large number of IP addresses simultaneously, so they look like a DDoS attack? Or are they just making ordinary requests from a small number of addresses? If it's the latter, all you need is some kind of fair queuing so those requests compete with each other for access, not with other users.
Cloudflare is dealing with a couple million faked requests every day just from Perplexity users, and Perplexity is far from the worst player in the field.
The problem would be quite easy to solve with basic rate limiting if it weren't for the attempts to bypass access controls.
They're using state of the art obfuscation that makes them indistinguishable from malicious botnets. It's an arms race with billion dollar companies vying to consume the most content before it all collapses
the open web is dead and whatever's left will be locked behind authentication and paywalls
I understand the concerns around a central gatekeeper but I'm confused as to why this specifically is viewed negatively. Don't website owners have to choose to enable cloudflare and to opt-in to this gate that the site owners control?
If this was cloudflare going into some centralized routing of the internet and saying everything must do X then that would be a lot more alarming but at the end of the day the internet is decentralized and site owners are the ones who are using this capability.
Additionally, I don't think that I as an individual website owner would actually want to (or be capable of) knowing which agents are good and bad, and cloudflare doing this would be helpful to me as a site owner as long as they act in good faith. And the moment they stop acting in good faith, I would be able to disable them. This is definitely a problem right now, as unrestricted access means bad bots are taking up many cycles, raising costs and taking away resources from real users.
Site owners are tricked and scared (by Cloudflare) into using Cloudflare when they don't need to. Cloudflare feels the increase in customer growth and the rest of us feel the pain.
I do like Cloudflare in general, but the whole anti-AI push is just another form of the Luddism surrounding AI since 2022. Cloudflare perhaps wisely picked up on this trend and decided to capitalize on it, but I think it would be a mistake to allow it to become their brand.
Commercial, criminal, and state interests have far more resources than you do, and their interests are in direct conflict with yours.
That would be fine, you could walk away and go home, but if you're going to drive on their digital highways, you're going to need "insurance" just to protect you from everyone else.
Ongoing multi-nation WWIII-scale hacking and infiltration campaigns of infrastructure, AI bot crawling, search company and startup crawling, security researchers crawling, and maybe somebody doesn't like your blog and decides to rent a botnet for a week or so.
Bet your ISP shuts you off before then to protect themselves. (Happens all the time via BGP blackholing, DDoS scrubbing services, BGP FlowSpec, etc).
Multi-Tbps DDoS attacks, pervasive scanning of sites for exploits, comically expensive egress bandwidth on services like AWS, and ISPs disallowing hosting services on residential accounts.
Forcing tighter security on the devices causing the multi-Tbps DDoS attacks would be a better option, no? Cheap unsecured IoT devices are a problem.
It's not just computers anymore. Web enabled CCTV, doorbell cameras are all culprits.
And home routers, printers, and end user devices themselves. Residential ISP networks can be infiltrated and remote CVE'd through browser calls at this point from a remote website. It's not even hard.
I think it shouldn't require registering /with/ cloudflare. cloudflare should just look up the .well-known referenced and double check for impersonation, and keep score on how well behaved each one is.
Using completely automated means would leave open the possibility to set up a new signature for every single request, or for batches of requests. The manual step is to cut down on the amount of automated abuse.
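A toy version of that .well-known lookup plus behaviour score (all names invented; HMAC with a shared secret here is a stand-in for the real scheme, which would use asymmetric HTTP message signatures):

```python
import hmac
import hashlib
from collections import defaultdict

# The bot publishes a verification key in a .well-known document;
# the checker verifies each request's signature and keeps a behaviour
# score per identity. Directory contents and scoring are illustrative.
WELL_KNOWN = {"examplebot": b"key-published-at-well-known"}
score = defaultdict(int)

def sign(secret: bytes, path: str) -> str:
    return hmac.new(secret, path.encode(), hashlib.sha256).hexdigest()

def check_request(bot: str, path: str, signature: str) -> bool:
    secret = WELL_KNOWN.get(bot)
    ok = secret is not None and hmac.compare_digest(sign(secret, path), signature)
    score[bot] += 1 if ok else -5  # impersonation attempts cost reputation
    return ok
```

Because the key comes from the bot operator's own .well-known document, no central registration step is needed to verify the claim, only to decide how much to trust it.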
Cloudflare lost a lot of credibility by backing off its "neutral" stance and booting certain sites--some of which were admittedly horrible--from their service. Now it seems they want to be even more of a gatekeeper.
I’m not necessarily coming to the defense of CF’s proposed solution, but it’s ridiculous and rather telling that the article mounts such a strong defense for agents around the notion they are simply completing user-directed tasks the user would otherwise do themselves, while avoiding the blatantly obvious issues of copyright, attribution, resource overusage, etc. presented by agents.
It’s somewhat ironic to let fly the “free and open internet” battle cry on behalf of an industry that is openly destroying it.
> Without that, I can simply hand the passport to another agent, and they can act as if they were me.
This isn't the problem Cloudflare are trying to solve here. AI scraping bots are a trigger for them to discuss this, but this is actually just one instance of a much larger problem — one that Cloudflare have been trying to solve for a while now, and which ~all other cloud providers have been ignoring.
My company runs a public data API. For QoS, we need to do things like blocking / rate-limiting traffic on a per-customer basis.
This is usually easy enough — people send an API key with their request, and we can block or rate-limit on those.
But some malicious (or misconfigured) systems may sometimes just start blasting requests at our API without including an API key.
We usually just want to block these systems "at the edge" — there's no point to even letting those requests hit our infra. But to do that, without affecting any of our legitimate users, we need to have some key by which to recognize these systems, and differentiate them from legitimate traffic.
In the case where they're not sending an API key, that distinguishing key is normally the request's IP address / IP range / ASN.
The problematic exception, then, is Workers/Lambda-type systems (a.k.a. Function-as-a-Service [FaaS] providers) — where all workloads of all users of these systems come from the same pool of shared IP addresses.
---
And, to interrupt myself for a moment, in case the analogy isn't clear: centralized LLM-service web-browsing/tool-use backends, and centralized "agent" orchestrators, are both effectively just FaaS systems, in terms of how the web/MCP requests they originate, relate to their direct inbound customers and/or registered "agent" workloads.
Every problem of bucketing traditional FaaS outbound traffic, also applies to FaaSes where the "function" in question happens to be an LLM inference process.
"Agents" have made this concern more urgent/salient to increasingly-smaller parts of the ecosystem, who weren't previously considering themselves to be "data API providers." But you can actually forget about AI and focus on just solving the problem for the more general category of FaaS hosts — and any solution you come up with will also be a solution applicable to the "agent formulation" of the problem.
---
Back to the problem itself:
The naive approach would be to block the entire FaaS's IP range the first time we see an attack coming from it. (And maybe some API providers can get away with that.)
But as long as we have at least one legitimate customer whose infrastructure has been designed around legitimate use of that FaaS to send requests to us, then we can't just block that entire FaaS's IP range.
(And sure, we could block these IP ranges by default, and then try to get such FaaS-using customers to send some additional distinguishing header in their requests to us, that would take priority over the FaaS-IP-range block... but getting a client engineer to implement an implementation-level change to their stack, by describing the needed change in a support ticket as a resolution to their problem, is often an extreme uphill battle. Better to find a way around needing to do it.)
So we really want/need some non-customer-controlled request metadata to match on, to block these bad FaaS workloads. Ideally, metadata that comes from the FaaS itself.
As it turns out, CF Workers itself already provides such a signal. Each outbound subrequest from a Worker gets forcibly annotated "on the way out" with a request header naming the Worker it came from. We can block on / rate-limit by this header. Works great!
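For illustration, a sketch of consuming that signal at an API edge. Cloudflare sets a `CF-Worker` request header on Worker subrequests identifying the zone that spawned the request; the blocklist entries and function names here are hypothetical.

```python
# Hypothetical blocklist of misbehaving Worker identities.
BLOCKED_WORKERS = {"scraper.example-abuser.workers.dev"}


def worker_bucket(headers):
    """Return a per-workload bucket key for CF Workers subrequests, if present.

    Cloudflare annotates outbound Worker subrequests with a `CF-Worker`
    header naming the originating Worker's zone; other traffic lacks it.
    """
    worker = headers.get("CF-Worker") or headers.get("cf-worker")
    if worker is None:
        return None
    return ("cf-worker", worker.lower())


def should_block(headers):
    """Block (or rate-limit) by workload identity rather than shared IP."""
    bucket = worker_bucket(headers)
    return bucket is not None and bucket[1] in BLOCKED_WORKERS
```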
But other FaaS providers do not provide anything similar. For example, it's currently impossible to determine which AWS Lambda customer is making requests to our API, unless that customer specifically deigns to attach some identifying info to their requests. (I actually reported this as a security bug to the Lambda team, over three years ago now.)
---
So, the point of an infrastructure-level-enforced public-visible workload-identity system, like what CF is proposing for their "signed agents", isn't just about being able to whitelist "good bots."
It's also about having some differentiable key that can cleanly bucket bot traffic, where any given bucket then contains purely legitimate or purely malicious/misbehaving bot traffic; so that if you set up rate-limiting, greylisting, or heuristic blocking by this distinguishing key, then the heuristic you use will ensure that your legitimate (bot) users never get punished, while your misbehaving/malicious (bot) users automatically trip the heuristic. Which means you never need to actually hunt through logs and manually blacklist specific malicious/misbehaving (bot) users.
If you look at this proposal as an extension/enhancement of what CF has already been doing for years with Workers subrequest originating-identity annotation, the additional thing that the "signed agents" would give the ecosystem on behalf of an adopting FaaS, is an assurance that random other bots not running on one of these FaaS platforms, can't masquerade as your bot (in order to take advantage of your preferential rate-limiting tier; or round-robin your and many others' identities to avoid such rate-limiting; or even to DoS-attack you by flooding requests that end up attributed to you.) Which is nice, certainly. It means that you don't have to first check that the traffic you're looking at originated from one of the trustworthy FaaS providers, before checking / trusting the workload-identity request header as a distinguishing key.
But in the end, that's a minor gain, compared to just having any standard at all — that other FaaSes would sign on to support — that would require them to emit a workload-identity header on outbound requests. The rest can be handled just by consuming+parsing the published IP-ranges JSON files from FaaS providers (something our API backend already does for CF in particular.)
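As a sketch of that last step, here is one way to consume a published IP-ranges file. The sample data below is a trimmed stand-in shaped like AWS's public `ip-ranges.json`, not the live file, and the function name is mine:

```python
import ipaddress
import json

# A trimmed, illustrative sample in the shape of AWS's published
# ip-ranges.json; a real consumer would fetch and refresh the live file.
SAMPLE = json.loads("""
{"prefixes": [
  {"ip_prefix": "3.5.140.0/22", "service": "AMAZON", "region": "ap-northeast-2"},
  {"ip_prefix": "52.94.76.0/22", "service": "AMAZON", "region": "us-west-2"}
]}
""")

FAAS_RANGES = [ipaddress.ip_network(p["ip_prefix"]) for p in SAMPLE["prefixes"]]


def from_known_faas(ip: str) -> bool:
    """True if the client IP falls inside a published FaaS/cloud range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in FAAS_RANGES)
```

Traffic matching one of these ranges can then be held to a stricter policy — or, once a workload-identity header standard exists, required to carry one.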
Your ideas are intriguing to me and I wish to subscribe to your newsletter.
Joking aside, I think the ideas and substance are great and sorely needed. However, I can only see the idea of a sort of token chain verification as running into the same UX problems that plagued (plagues?) PGP and more encryption-focused processes. The workflow is too opaque, requires too much specialized knowledge that is out of reach for most people. It would have to be wrapped up into something stupid simple like an iOS FaceID modal to have any hope of succeeding with the general public. I think that's the idea, that these agents would be working on behalf of their owners on their own devices, so it has to be absolutely seamless.
The web doesn't need gatekeepers the way you don't need a bank account, driver's license, or a credit card. You can do without it, but it sure makes it harder to interact with modern society. The days of the mainstream internet being a libertarian frontier are more or less over. The capitalist internet is firmly in charge.
The real question is whether there is more business opportunity in supporting "unsigned" agents than signed ones. My hope is that the industry rejects this because there's more money to be made in catering to agents than blocking them. This move is mostly to create a moat for legacy business.
Also, if agents do become the de-facto way of browsing the internet, I'm not a fan of more ways of being tracked for ads and more ways for censorship groups to have leverage.
But the author is making a strawman argument over a "steelman" argument against signed agents. The strongest argument I can see is not that we don't need gatekeepers, but that regulation is anti-business.
This article can easily be dismissed when, hardly a moment in, you see the headline "Agents Are Inevitable".
I'm sorry, but the "agents" of "agentic AI" are completely different from the original purpose of the World-Wide Web, which was to support user agents. User agents are used directly by users—aka browsers. API access came later, but even then it was often directed by user activity…and otherwise quite normally rate-limited or paywalled.
The idea that now every web server must comply with servicing an insane number of automated bots doing god-knows-what without users even understanding what's happening a lot of the time, or without the consent of content owners to have all their IP scraped into massive training datasets is, well, asinine.
That's not the web we built, that's not the web we signed up for; and yes, we will take drastic measures to block your ass.
Speak for yourself. This is just the semantic web: a web not built just for humans, but also for robots or any other types of agents that may wish to build upon the data. User agents never meant just web browsers, and operators blocking based on it necessitated hiding your identity.
Blocking bots is an absurd and unwinnable proposition, just like DRM; there's always the final, nuclear option of the analog hole, a literal video camera pointed at a monitor and using a keyboard and mouse.
If you really need to, deploy a proof of work shield that doesn't discriminate against user agents, just like what Onionsites do.
> When I’m driving, I hand my phone to a friend and say, “Reply ‘on my way’ to my Mom.” They act on my behalf, through my identity, even though the software has no built-in concept of delegation. That is the world we are entering.
That is a very small part of the world we're entering.
The vast majority of the other use cases will come from even more abusive bots than we have today, filling the internet with spam, disinformation, and garbage. The dead internet is no longer a theory, and the future we're building will make the internet for bots, by bots. Humans will retreat into niche corners of it, and those who wish to participate in the broader internet will either have to live with this or abide by new government regulations that invade their privacy and undermine their security.
So, yes, confirming human identity is the only path forward if we want to make the internet usable by humans, but I do agree that the ideal solution will not come from a single company, or a single government, for that matter. It will be a bumpy ride until we figure this out.
Sorry, the "web" isn't "open" and hasn't been for a while.
Most interaction, publication, and dissemination takes place behind authentication:
Most social media, newspapers, etc. throttle, block, or otherwise truncate non-authenticated clients.
Blogs are an extremely small tranche of information that the average netizen consumes.
Everyone loves the dream of a free for all and open web.
But the reality is: how can someone small protect their blog or content from AI training bots? E.g.: are they just supposed to blindly trust that someone is sending agent bots rather than training bots and super duper respecting robots.txt? Get real...
Or, fine, what if they do respect robots.txt, but they buy data that may or may not have been shielded through liability layers via "licensed data"?
Unless you're reddit, X, Google, or Meta with scary unlimited budget legal teams, you have no power.
Great video: https://www.youtube.com/shorts/M0QyOp7zqcY
> Everyone loves the dream of a free for all and open web... But the reality is how can someone small protect their blog or content from AI training bots?
Aren't these statements entirely in conflict? You either have a free for all open web or you don't. Blocking AI training bots is not free and open for all.
No, that is not true. It is only true if you just equate "AI training bots" with "people" on some kind of nominal basis without considering how they operate in practice.
It is like saying "If your grocery store is open to the public, why is it not open to this herd of rhinoceroses?" Well, the reason is because rhinoceroses are simply not going to stroll up and down the aisles and head to the checkout line quietly with a box of cereal and a few bananas. They're going to knock over displays and maybe even shelves and they're going to damage goods and generally make the grocery store unusable for everyone else. You can say "Well, then your problem isn't rhinoceroses, it's entities that damage the store and impede others from using it" and I will say "Yes, and rhinoceroses are in that group, so they are banned".
It's certainly possible to imagine a world where AI bots use websites in more acceptable ways --- in fact, it's more or less the world we had prior to about 2022, where scrapers did exist but were generally manageable with widely available techniques. But that isn't the world that we live in today. It's also certainly true that many humans are using websites in evil ways (notably including the humans who are controlling many of these bots), and it's also very true that those humans should be held accountable for their actions. But that doesn't mean that blocking bots makes the internet somehow unfree.
This type of thinking that freedom means no restrictions makes sense only in a sort of logical dreamworld disconnected from practical reality. It's similar to the idea that "freedom" in the socioeconomic sphere means the unrestricted right to do whatever you please with resources you control. Well, no, that is just your freedom. But freedom globally construed requires everyone to have autonomy and be able to do things, not just those people with lots of resources.
You have a problem with badly behaved scrapers, not AI.
I can't disagree with being against badly behaved scrapers. But this is neither a new problem nor an interesting one from the standpoint of making information freely available to everyone, even rhinoceroses, assuming they are well behaved. Blocking bad actors is not the same thing as blocking AI.
The thing is that rhinoceroses aren't well-behaved. Even if some small fraction of them might in theory be well-behaved, the benefit of trying to account for that is too small to bother. If 99% of rhinoceroses aren't well-behaved, the simple and correct response is to ban them all, and then maybe the nice ones can ask for a special permit. You switch from allow-by-default to block-by-default.
Similarly it doesn't make sense to talk about what happens if AI bots were well-behaved. If they are, then maybe that would be okay, but they aren't, so we're not talking about some theoretical (or past) situation where bots were well-behaved and scraped in a non-disruptive fashion. We're talking about the present reality in which there actually are enormous numbers of badly-behaved bots.
Incidentally, I see that in a lot of your responses on this thread you keep suggesting that people's problem is "not with AI" but with something else. But look at your comment that I initially replied to:
> Blocking AI training bots is not free and open for all.
We're not talking about "AI". We're talking about AI training bots. If people want to develop AI as a theoretical construct and train it on datasets they download separately in a non-disruptive way, great. (Well, actually it's still terrible, but for other reasons. :-) ) But that's not what people are responding to in this thread. They're talking about AI training bots that scrape websites in a way that is objectively more harmful than previous generations of scrapers.
ISPs are supposed to disconnect abusive customers. The correct thing to do is probably contact the ISP. Don't complain about scraping, complain about the DDOS (which is the actual problem and I'm increasingly beginning to believe the intent.)
Sure, let me just contact that one ISP located in Russia or India, I am sure they will care a lot about my self-hosted blog
Great! How do I get, say, Google's ISP to disconnect them?
But many people feel that the very act of incorporating your copyrighted words into their for-profit training set is itself the bad behavior. It's not about rate-limiting scrapers, it's letting them in the door in the first place.
Why was it OK for Google to incorporate their words into a for-profit search index which has increasingly sucked all the profit out of the system?
My Ithaca friends on Facebook complain incessantly about the very existence of AI, to the extent that I would not want to say that I ask Copilot how to use Windows Narrator, or ask Junie where the CSS is that makes this text bold, or sometimes have Photoshop draw an extra row of bricks in a photograph for me.
The same people seem to have no problem with Facebook using their words for all things Facebook uses them for, however.
They were okay with it when Google was sending them traffic. Now they often don’t. They’ve broken the social contract of the web. So why should the sites whose work is being scraped be expected to continue upholding their end?
Not only are they scraping without sending traffic, they're doing so much more aggressively than Google ever did; Google, at least, respected robots.txt and kept to the same user-agent. They didn't want to index something that a server didn't want indexed. AI bots, on the other hand, want to index every possible thing regardless of what anyone else says.
There's something more obviously nefarious and existential about AI. It takes the idea of "you are the product" to a whole new level.
> Why was it OK for Google to incorporate their words into a for-profit search index which has increasingly sucked all the profit out of the system?
It wasn't okay, it's just that the reasons it wasn't okay didn't become apparent until later.
> The same people seem to have no problem with Facebook using their words for all things Facebook uses them for, however.
Many of those people will likely have a problem with it later, for reasons that are happening now but that they won't become fully aware of until later.
> My Ithaca friends on Facebook complain incessantly about the very existence of AI to the extent that I would not want to say I ask Copilot how to use Windows Narrator or Junie where the CSS that makes this text bold or sometimes have Photoshop draw an extra row of bricks in a photograph for me.
Good! Why would you willingly confess any of that? I'd be humiliated if I did any of that.
Sure. But we're already talking about presumption of free and open here. I'm sure people are also reading my words and incorporating it into their own for-profit work. If I cared, I wouldn't make it free and open in the first place.
> You have a problem with badly behaved scrapers, not AI.
And you have a problem understanding that "freedom and openness" extend only to where the rights (e.g. the freedom) of another legal entity begin. When I don't want "AI" (not just the badly-behaved subset) rifling through my website, then I should be well within my rights to disallow just that, in the same way as it's your right to allow them access to your playground. It's not rocket science.
This is not what the parent means. What they mean is that such behavior is hypocrisy: you are getting access to truly free websites whose owners are interested in having smart chatbots trained on the free web, but you are blocking said chatbots while touting a "free Internet" message.
Badly behaved scrapers are not a new problem, but badly behaved scrapers run by multibillion-dollar companies which use every possible trick to bypass every block or restriction or rate limit you put in front of them is a completely new problem on a scale we've never seen before.
You can always stop bots: add a login/password. But people want their content to be accessible to as large an audience as possible, while at the same time not wanting that data to be accessible to the same audience via other channels. Logic. Bots are not consuming your data; humans are. At the end of the day humans will eventually read it and take action. For example, ChatGPT will mention your site, and the user will visit it.
And no, nothing was different before 2022. Just look at google, the largest bot scraping network in the world. Since 1996.
> And no, nothing was different before 2022. Just look at google, the largest bot scraping network in the world. Since 1996.
I'm sorry, but this statement shows you have no recent experience with these crawlernets.
Google, from the beginning, has done their best to work with server owners. They respect robots.txt. I think they were the first to implement Crawl-Delay. They crawl based on how often things change anyway. They have an additional safeguard that when they notice a slowdown in your responses, they back off.
Compare this with Anthropic. On their website they say they follow robots.txt and Crawl-Delay. I have an explicit ban on Claudebot in there and a Crawl-Delay for everyone else. It ignores both. I sent an email to them about this, and their answer didn't address the discrepancy between the docs and the behaviour. They just said they'll add me to their internal whitelist and that I should've sent 429s when they were going too fast. (Fuck off, how about you follow your public documentation?)
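For reference, the directives the commenter describes look roughly like the reconstruction below (the exact file is theirs; this is a guess at its shape), and Python's stdlib `robotparser` shows what a compliant crawler would conclude from it:

```python
from urllib import robotparser

# Reconstruction of a robots.txt banning Claudebot outright and
# imposing a crawl delay on everyone else.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Disallow: /

User-agent: *
Crawl-delay: 10
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A compliant ClaudeBot would see it is banned outright...
print(rp.can_fetch("ClaudeBot", "https://example.com/post/1"))  # False
# ...and everyone else would wait 10 seconds between requests.
print(rp.crawl_delay("SomeOtherBot"))  # 10
```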
That's just my experience, but if you Google around you'll find that Anthropic is notorious for ignoring robots.txt.
And still, Claudebot is one of the better behaved bots. At least they identify themselves, have a support email they respond to, and use identifiable IP-addresses.
A few weeks ago I spent four days figuring out why I had 20x the traffic I normally have (which maxed out the server, causing user complaints). Turns out there are parties that crawl using millions of (residential) IPs and identify themselves as normal browsers. Only 1 or 2 connections per IP at a time. Randomization of identifying properties. Even Anthropic's 429 solution wouldn't have worked there.
I managed to find a minor identifying property in some of the requests that wasn't catching too many real users. I used that to start firewalling IPs on sight and then their own randomization caused every IP to fall into the trap in the end. But it took days.
In the end I had to firewall nearly 3 million non-consecutive IP addresses.
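At that scale, per-address firewall rules become unmanageable. One standard mitigation (a sketch of the general technique, not necessarily what the commenter did) is to collapse runs of addresses into the fewest covering CIDR prefixes before loading them into an ipset-style list; Python's stdlib can do this directly:

```python
import ipaddress


def collapse_blocklist(ips):
    """Collapse individual addresses into the fewest covering CIDR blocks,
    so runs of adjacent firewall entries become a much shorter prefix list.
    (Truly scattered addresses stay as /32s; only adjacency compresses.)"""
    nets = (ipaddress.ip_network(ip) for ip in ips)
    return [str(n) for n in ipaddress.collapse_addresses(nets)]
```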
So no, Google in 1996 or 2006 or 2016 is not the same as the modern DDoSing crawlernet.
> "If your grocery store is open to the public, why is it not open to this herd of rhinoceroses?"
What this scenario actually reveals is that the words "open to the public" are not intended to mean "access is completely unrestricted".
It's fine to not want to give completely unrestricted access to something. What's not fine, or at least what complicates things unnecessarily, is using words like "open and free" to describe this desired actually-we-do-want-to-impose-certain-unstated-restrictions contract.
I think people use words like "open and free" to describe the actually-restricted contracts they want to have because they're often among like-minded people for whom these unstated additional restrictions are tacitly understood -- or, simply because it sounds good. But for precise communication with a diverse audience, using this kind of language is at best confusing, at worst disingenuous.
Nobody has ever meant "access is completely unrestricted".
As a trivial example: what website is going to welcome DDoS attacks or hacking attempts with open arms? Is a website no longer "open to the public" if it has DDoS protection or a WAF? What if the DDoS makes the website unavailable to the vast majority of users: does blocking the DDoS make it more or less open?
Similarly, if a concert is "open to the public", does that mean they'll be totally fine with you bringing a megaphone and yelling through the performance? Will they be okay with you setting the stage on fire? Will they just stand there and say "aw shucks" if you start blocking other people from entering?
You can try to rules-lawyer your way around commonly-understood definitions, but deliberately and obtusely misinterpreting such phrasing isn't going to lead to any kind of productive discussion.
>You can try to rules-lawyer your way around commonly-understood definitions
Despite your assertions to the contrary, "actually free to use for any purpose" is a commonly understood interpretation of "free to use for any purpose" -- see permissive software licenses, where licensors famously don't get to say "But I didn't mean big companies get to use it for free too!"
The onus is on the person using a term like "free" or "open" to clarify the restrictions they actually intend, if any. Putting the onus anywhere else immediately opens the way for misunderstandings, accidental or otherwise.
To make your concert analogy actually fit: a scraper is like a company that sends 1000 robots with tape recorders to your "open to the public" concert. They do only the things an ordinary member of the public does; they can't do anything else. The most "damage" they can do is to keep humans who would enjoy the concert from being able to attend if there aren't enough seats; whatever additional costs they cause (air conditioning, let's say) are the same as the costs that would have been incurred by that many humans.
They're basically describing the tragedy of the commons, but where a handful of the people have bulldozers to rip up all the grass and trees.
We can't have nice things because the powerful cannot be held accountable. The powerful are powerful due to their legal teams and money, and power is the ability to carve exceptions to rules.
Bingo. Thanks for clarifying exactly my point
I think that was the point. Everyone loves the dream, but the reality is different.
How so? If you don't want AI bots reading information on the web, you don't actually want a free and open web. The reality of an open web is that such information is free and available for anyone.
> If you don't want AI bots reading information on the web, you don't actually want a free and open web.
This is such a bad faith argument.
We want a town center for the whole community to enjoy! What, you don't like those people shooting up drugs over there? But they're enjoying it too, this is what you wanted right? They're not harming you by doing their drugs. Everyone is enjoying it!
If an AI bot is accessing my site the way that regular users are accessing my site -- in other words everyone is using the town center as intended -- what is the problem?
Seems to be a lot of conflating of badly coded (intentionally or not) scrapers and AI. That is a problem that predates AI's existence.
So if I buy a DDoS service and DDoS your site, it's OK as long as it accesses it the same way regular people do? I'm sorry for the extreme example; it's obviously not OK, but that's how I understand your position as written.
We can also consider 10 exploit attempts per second that my site sees.
Unironically, if we want everyone to enjoy the town center, we should let people do drugs.
Clearly you don't want the whole community to enjoy it then. Openness is incompatible with keeping the riff raff out
Set aside that there's a pretty big difference between AI scraping and illegal drug usage.
If the person using illegal drugs is in no way harming anyone but themselves and not being a nuisance, then yeah, I can get behind that. Put whatever you want in your body, just don't let it negatively impact anyone around you. Seems reasonable?
I think this is actually a good example despite how stark the differences are - both the nuisance AI scrapers and the drug addicts have negative externalities that while possible for them to self regulate, they are for whatever reasons proving unable to do so, and therefore cause other people to have a bad time.
Other commenters are voicing the usual "drugs are freedom" type opinions, but having lived in China and Japan, where drugs are dealt with very strictly (and which basically don't have a drug problem today), I can see the other side of the argument: places feeling dirty and dangerous because of drugs - even if you think of addicts sympathetically as victims who need help - make everyone else less free to live the lifestyle they would like to have.
More freedom for one group (whether to ruin their own lives for a high; or to train their AI models) can mean less freedom for others (whether to not feel safe walking in public streets; or to publish their little blog in the public internet).
> just don't let it negatively impact anyone around you.
Exactly! Which is why we don't want AI bots siphoning our bandwidth & processing power.
> information is free and available for anyone.
Bots aren't people.
You can want public water fountains without wanting a company attaching a hose to the base to siphon municipal water for corporate use, rendering them unusable for everyone else.
You can want free libraries without companies using their employees' library cards to systematically check out all the books at all times so they don't need to wait if they want to reference one.
Does allowing bots to access my information prevent other people from accessing my information? No. If it did, you'd have a point and I would be against that. So many strange arguments are being made in this thread.
Ultimately it is the users of AI (and am I one of them) that benefit from that service. I put out a lot of open code and I hope that people are able to make use of it however they can. If that's through AI, go ahead.
> Does allow bots to access my information prevent other people from accessing my information? No.
Yes it does, that's the entire point.
The flood of AI bots is so bad that (mainly older) servers are literally being overloaded and (newer servers) have their hosting costs spike so high that it's unaffordable to keep the website alive.
I've had to pull websites offline because badly designed & ban-evading AI scraper bots would run up the bandwidth into the TENS OF TERABYTES, EACH. Downloading the same jpegs every 2-3 minutes into perpetuity. Evidently all that vibe coding isn't doing much good at Anthropic and Perplexity.
Even with my very cheap transfer racks up $50-$100/mo in additional costs. If I wanted to use any kind of fanciful "app" hosting it'd be thousands.
That's a problem with scrapers, not with AI. I'm not sure why there are way more AI scraper bots now than there were search scraper bots back when that was the new thing. However, that's still an issue of scrapers and rate limiting, and has nothing to do with wanting or not wanting AI to read your free and open content.
This whole discussion is about limiting bots and other unwanted agents, not about AI specifically (AI was just an obvious example)
Do the AI training bots provide free access to the distillation of the content they drain from my site repeatedly? Don't they want a free and open web?
I don’t feel a particular need to subsidize multi-billion, even trillion, dollar corporations with my content, bandwidth, and server costs, since their genius vibe-coded bots apparently don’t know how to use conditional GETs or caching, let alone parse and respect robots.txt.
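The "modified GETs" here are HTTP conditional requests. A minimal client-side sketch (the cache-entry shape is hypothetical): a well-behaved bot stores each response's validators and sends them back, letting the server answer `304 Not Modified` instead of re-serving the same bytes every few minutes.

```python
def conditional_headers(cache_entry):
    """Build revalidation headers from a previous response's validators.

    `cache_entry` is an illustrative dict holding the `ETag` and
    `Last-Modified` values saved from the earlier response. A server that
    sees these can reply 304 with an empty body instead of re-sending
    the same jpeg.
    """
    headers = {}
    if cache_entry.get("etag"):
        headers["If-None-Match"] = cache_entry["etag"]
    if cache_entry.get("last_modified"):
        headers["If-Modified-Since"] = cache_entry["last_modified"]
    return headers
```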
Is the problem they exist or the problem they are badly accessing your site? Because there are two conflating issues here. If humans or robots are causing you issues, as both can do, that's bad. But that has nothing to do with AI in particular.
Problem one is they do not honor the conventions of the web and abuse the sites. Problem two is they are taking content for free, distilling it into a product, and limiting access to that product.
Problem one is not specific to AI and not even about AI.
Problem two is not anything new. Taking freely available content and distilling it into a product is something valuable and potentially worth paying for. People used to buy encyclopedias too. There are countless examples.
At present, problem one is almost entirely AI companies.
And a few decades ago, it would have been search engine scrapers instead.
And that problem was largely solved by robots.txt. AI scrapers are ignoring robots.txt and beating the hell out of sites. Small sites that have decades worth of quality information are suffering the most. Many of the scrapers are taking extreme measures to avoid being blocked, like using large numbers of distinct IP addresses (perhaps using botnets).
There's actually not much evidence of this, since the attack traffic is anonymous.
HN people working in these AI companies have commented to say they do this, and the timing correlates with the rise of AI companies/funding.
I haven't tried to find it in my own logs, but others have said blocking an identifiable AI bot soon led to the same pattern of requests continuing through a botnet.
Did HN people present evidence?
The problem is not AI bot scraping, per se, but "AI bot scraping while disregarding all licenses and ethical considerations".
Freedom, the word, while it implies no boundaries, is always bound by ethics, mutual respect, and the "do no harm" principle. The moment you trip any one of these wires and break it, the mechanisms to counter it become active.
Then we cry "but, freedom?!". Freedom also contains the consequences of one's actions.
Freedom without consequences is tyranny of the powerful.
The problem isn't "AI bot scraping while disregarding all licenses and ethical considerations". The problem is "AI bot scraping while ignoring every good practice to reduce bandwidth usage".
If you ask me "every good practice to reduce bandwidth usage" falls under ethics pretty squarely, too.
While this is certainly a problem, it's not the only problem.
> The problem is not AI bot scraping, per se, but "AI bot scraping while disregarding all licenses and ethical considerations".
What licenses? Free and open web. Go crazy. What ethical considerations? Do I police how users use the information on my site? No. If they make a pipe bomb using a 6502 CPU, with code taken from my website, am I supposed to do something about that?
Creative Commons, GFDL, Unlicense, GPL/AGPL, MIT, WTFPL. Go crazy. I have the freedom to police how users use the information on my site. Yes.
Real examples: my blog is BY-NC-SA and my digital garden is GFDL. You can't take them, mangle them, and sell them. Especially the blog.
AI companies take these posts and sell derivatives, without any references, consent, or compensation. That is the complete opposite of what BY-NC-SA allows.
This is why I'm not uploading any photos I take publicly anymore.
Absolutely. If you want to put all kinds of copyright, license, and even payment restrictions on your content go ahead. And if AI companies or people abuse that, that's bad on them.
But I do think: if you're serious about free and open information, then why are you doing that in the first place? It's perfectly reasonable to be restrictive; I write both very open software and very closed software. But I see a lot of people who want to straddle the line when it comes to AI without a rational argument.
Let me try to make my point as compact as possible. I may fail, but please bear with me.
I prefer Free Software to Open Source software. My license of choice is A/GPLv3+. Because, I don't want my work to be used by people/entities in a single sided way. The software I put out is the software I develop for myself, with the hope of being useful for somebody else. My digital garden is the same. My blog is a personal diary in the open. These are built on my free time, for myself, and shared.
See, permissive licenses are about "developer freedom". You can do whatever you want with what you grab, as long as you add a line to the credits. The A/GPL family is different: it wants reciprocity. It empowers the user over the developer. You have to give out the source. Whoever modifies the source shares the modifications. It stays in the open. It has to stay open.
I demand this reciprocity for what I put out there, and the licenses reflect that. It's "restricting the use to keep the information/code open". I share something I spent my time on, and I want it to live in the open, and I want a little respect for putting out what I did. That respect is not fame or superiority; it's just not taking my work and running with it, keeping all the improvements to yourself.
It's not yours, but ours. You can't keep it to yourself.
When it comes to AI, it's an extension of this thinking. I do not consent to a faceless corporation closing, twisting, and earning money from what I put out for the public good. I don't want a set of corporations to act as middlemen that take what I put out, repackage it (corrupting it in the process), and sell it. It's not about money; it's about ethics, doing the right thing, and being respectful. It's about exploitation. The same applies to my photos.
I'm not against AI/LLM/generative technology/etc. I'm against the exploitation of people: artists, musicians, software developers, other companies. I get equally angry when a company's source-available code is scraped and used for suggestions as when it happens to an academic's LGPL high-performance matrix library, developed over years of grants. This affects people's livelihoods.
I get angry when people say "if we take permission for what we do, AI industry will collapse", or "this thing just learns like humans, this is fair use".
I don't buy their "we're doing something awesome, we need no permission" attitude. No, you need permission to use my content. Because I say so. Read the fine print.
I don't want knowledge to be monopolized by these corporations. I don't want the small fish to be eaten by the bigger one and what remains is buried into the depths of information ocean.
This is why I stopped sharing my photos for now, and my latest research won't be open source for quite some time.
What I put out is for humans' direct consumption. Middlemen are not welcome.
If you have any questions, or if I left any holes up there, please let me know.
I respect the desire for reciprocity, but strong copyleft isn't the only, or even the best, way to protect user freedom or public knowledge. My opinion is that permissive licensing and open access to learn from public materials have created enormous value precisely because they don't pre-empt future uses. Requiring permission for every new kind of reuse (including ML training) shrinks the commons, entrenches incumbents who already have data deals, and reduces the impact of your work. The answer to exploitation is transparency, attribution, and guardrails against republication, not copyright enforced restrictions.
I used to be much more into the GPL than I am now. Perhaps it was much more necessary decades ago or perhaps our fears were misguided. I license all my own stuff as Apache. If companies want to use it, great. It doesn't diminish what I've done. But those who prefer GPL, I completely understand.
> as well as an academic's LGPL high performance matrix library which is developed via grants over the years.
The academic got paid with grants. So now this high performance library exists in the world, paid for by taxes, but it can't be used everywhere. Why is it bad to share this with everyone for any purpose?
> What I put out is for humans' direct consumption. Middlemen are not welcome.
Why? Why must it be direct consumption? I've used AI tools to accomplish things that I wouldn't have been able to do on my own in my free time, and that work is now open source. Tons of developers this week are benefiting from what I was able to accomplish using a middleman. Not all middlemen are bad by definition. Middlemen can provide value. Why is that value not welcome?
> I'm not against AI/LLM/Generative technology/etc. I'm against exploitation of people, artists, musicians, software developers, other companies.
If you define AI/LLM/generative technology/etc. as the exploitation of people, artists, musicians, software developers, and other companies, then you are against it. As software developers, our work directly affects the livelihoods of people. Everything we create is meant to automate some human task. To be a software developer and then complain that AI is going to take away jobs is to be a hypocrite.
Your whole argument is easily addressed by requiring the AI models to be open source. That way, they obviously respect the AGPL and any other open license, and contribute to the information being kept free. Letting these companies knowingly and obviously infringe licenses and all copyright as they do today is obviously immoral, and illegal.
AGPL doesn't pre-empt future uses or require permission for any kind of re-use. You just have to share alike. It's pretty simple.
AGPL lets you take a bunch of data and AI-train on it. You just have to release the data and source code to anyone who uses the model. Pretty simple. You don't have to rent them a bunch of GPUs.
Actually it can be annoying because of the specific mechanism by which you have to share alike - the program has to have a link to its own source code - you can't just offer the source alongside the binary. But it's doable.
How is it available for everyone if the AI bots bring down your server?
Is that really the problem we are discussing? I've had people attack my server and bring it down. But that has nothing to do with being free and open to everyone. A top hacker news post could take my server.
Yes, because a top hacker news post takes your server down because a large number of actual humans are looking to gain actual value from your posts. Meanwhile, you stand to benefit from the HN discussion by learning new things and perspectives from the community.
The AI bot assault, on the other hand, is one company (or a few companies) re-fetching the same data over and over again, constantly, in perpetuity, just in case it's changed, all so they can incorporate it into their training set and make money off of it while giving you zero credit and providing zero feedback.
But then we get to use those AI tools.
The refrain here comes down not to "AI" but mostly to "the AI bot assault", which is a different thing. Sure, let's have a discussion about badly behaved and overzealous web scrapers. As for credit, I've asked AI for its references and gotten them. If my information is merely mushed into an AI training model, I'm not sure why I need credit. If you discuss this thread with your friends, are you going to give me credit?
"If you discuss this thread with your friends are you going to give me credit?"
Yes. How else would I enable my friends to look it up for themselves?
No, you don't "get to" use the AI tools. You have to buy access to them (beyond some free trials).
Everyone can get it from the bots now?
Rate-limits? Use a CDN? Lots of traffic can be a problem whether it's bots or humans.
You realize this entire thread is about a pitch from a CDN company trying to solve an issue that has presented itself at such a scale that this is the best option they can think of to keep the web alive, right?
"Use a CDN" is not sufficient when these bots are so incredibly poorly behaved, because you're still paying for that CDN and this bad behavior is going to cost you a fortune in CDN costs (or cost the CDN a fortune instead, which is why Cloudflare is suggesting this).
Build better
Ultimately, you have to realize that this is a losing battle, unless we have completely draconian control over every piece of silicon. Captchas are being defeated; at this point they're basically just mechanisms to prove you Really Want to Make That Request to the extent that you'll spend some compute time on it, which is starting to become a bit of a waste of electricity and carbon.
Talented people that want to scrape or bot things are going to find ways to make that look human. If that comes in the form of tricking a physical iPhone by automatically driving the screen physically, so be it; many such cases already!
The techniques you need for preventing DDoS don't need to really differentiate that much between bots and people unless you're being distinctly targeted; Fail2Ban-style IP bans are still quite effective, and basic WAF functionality does a lot.
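As a sketch of the Fail2Ban-style idea mentioned above, a sliding-window counter per IP is enough to catch the crudest offenders. This is a toy illustration, not real Fail2Ban code; the class name and thresholds are made up:

```python
import time
from collections import defaultdict, deque

class SlidingWindowBanlist:
    """Toy Fail2Ban-style filter: ban an IP once it exceeds
    max_hits requests within a window of `window` seconds."""

    def __init__(self, max_hits=100, window=10.0, ban_seconds=3600.0):
        self.max_hits = max_hits
        self.window = window
        self.ban_seconds = ban_seconds
        self.hits = defaultdict(deque)   # ip -> timestamps of recent requests
        self.banned_until = {}           # ip -> time the ban expires

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        if self.banned_until.get(ip, 0.0) > now:
            return False
        q = self.hits[ip]
        q.append(now)
        # Drop timestamps that have fallen out of the window.
        while q and q[0] <= now - self.window:
            q.popleft()
        if len(q) > self.max_hits:
            self.banned_until[ip] = now + self.ban_seconds
            q.clear()
            return False
        return True
```

In practice you'd feed this from your access log and push bans into the firewall rather than the application, but the counting logic is the same.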
Nothing is "free". AI bots eat up my blog like crazy, and I have to pay for its hosting.
Don't you have rate-limits? And how much are you paying for the instance where you're hosting it? I've run/helped run projects with something like ~10 req/s easily on $10 VPSs, surely hosting HTML can't cost you that much?
Of course it won't be free, but you can get pretty close to free by employing the typical things you'd put in place to restrict the amount of resources used, like rate-limits, caches, and so on.
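For the rate-limits-and-caches point, a minimal nginx sketch covers most of what a small blog on a cheap VPS needs. The paths and numbers here are illustrative, not a recommendation:

```nginx
# Per-IP rate limit: 5 req/s steady state, short bursts allowed.
limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

server {
    listen 80;

    location / {
        limit_req zone=perip burst=20 nodelay;
        # Serve pre-rendered HTML straight from disk; no app server in the path.
        root /var/www/blog;
        try_files $uri $uri/ =404;
    }
}
```

With static files and a limit like this, a scraper hammering one IP mostly earns itself 503s rather than a bill.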
And? Paying Cloudflare or someone else to block bad actors is required these days unless you have the scale and expertise to do it yourself.
Why is outsourcing this to Cloudflare bad and doing it yourself ok? Am I allowed to buy a license to a rate limiter or do I need to code my own? Am I allowed to use a firewall or is blocking people from probing my server not free enough?
Why are bots or any other user entitled to unlimited visits to my website? The entitlement is kind of unreal at this point
> And? Paying Cloudflare or someone else to block bad actors is required these days unless you have the scale and expertise to do it yourself.
Where are people getting this from? No, Cloudflare or any other CDN is not required for you to host your own stuff. Sure, it's easy, and probably the best way to go if you just wanna focus on shipping, but lets not pretend it's a requirement today.
> Why are bots or any other user entitled to unlimited visits to my website? The entitlement is kind of unreal at this point
I don't think they are, that's why we have rate limiters, right? :) I think the point is that if you're allowing a user to access some content in one way, why not allow that same user to access the content in the same way, but using a different user-agent? That's the original purpose of that header after all, to signal what the user used as an agent on their behalf. Commonly, I use Firefox as my agent for browsing, but I should be free to use any user-agent, if we want the web to remain open and free.
That's a very "BSD is freedom and GPL isn't" kind of philosophy.
Nothing is truly free unless you give equal respect to fellow hobbyists and megacorps using your labor for their profit.
GPL doesn't care if you use it for profit or not (good), it just says that the resultant model needs to be open too. And open models exist in droves nowadays. Even closed models can be distilled into open ones.
The dream is real, man. If you want open content on the Internet, it's never been a better time. My blog is open to all - machine or man. And it's hosted on my home server next to me. I don't see why anyone would bother trying to distinguish humans from AI. A human hitting your website too much is no different from an AI hitting your website too much.
I have a robots.txt that tries to help bots not get stuck in loops, but if they want to, they're welcome to. Let the web be open. Slurp up my stuff if you want to.
Amazonbot seems to love visiting my site, and it is always welcome.
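For reference, the loop-avoidance idea looks something like this. The paths are hypothetical; the usual crawler traps are calendars, pairwise diffs, and search pages, where every link generates a fresh URL forever:

```
# Hypothetical loop-avoidance rules: keep crawlers out of URL spaces
# that generate endless variations of the same content.
User-agent: *
Disallow: /calendar/   # every "next month" link is a new URL, forever
Disallow: /diff/       # pairwise revision diffs explode combinatorially
Disallow: /search      # query pages, not content
Crawl-delay: 10        # nonstandard, but honored by some crawlers
```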
> I don't see why anyone would bother trying to distinguish humans from AI.
Because a hundred thousand people reading a blog post is more beneficial to the world than an AI scraper bot fetching my (unchanged) blog post a hundred thousand times just in case it's changed in the last hour.
If AI bots were well-behaved, maintained a consistent user agent, used consistent IP subnets, and respected robots.txt, I wouldn't have a problem with them. You could manage your content filtering however you want (or not at all) and that would be that. Unfortunately at the moment, AI bots do everything they can to bypass any restrictions or blocks or rate limits you put on them; they behave as though they're completely entitled to overload your servers in their quest to train their AI bots so they can make billions of dollars on the new AI craze while giving nothing back to the people whose content they're misappropriating.
I've not seen an AI scraper reading a blog post 100,000 times in an hour to see if it's changed. As far as I can tell, that's a NI hallucination. Typical fetch rates are more like 3 times per second (10k per hour) and fetch a different URL each time.
The only bot that bugs the crap out of me is Anthropic's one. They're the reason I set up a labyrinth using iocaine (https://iocaine.madhouse-project.org/). Their bot was absurdly aggressive, particularly with retries.
It's probably trivial in the whole scheme of things, but I love that anthropic spent months making about 10rps against my stupid blog, getting markov chain responses generated from the text of Moby Dick. (looks like they haven't crawled my site for about a fortnight now)
No wonder Anthropic isn't working well! The "Moby Dicked" explanation of the state of AI!
But seriously, why must someone scrape even a significant part of the public Internet to develop an AI? Is it believed that missing some text will cripple the AI?
Isn't there some sort of "law of diminishing returns" where, once some percentage of coverage is reached, further scraping is not cost-effective?
On the contrary, AI training techniques require gigantic amounts of data to do anything, and there is no upper limit whatsoever - the more relevant data you have to train on, the better your model will be, period.
In fact, the biggest thing that is making it unlikely that LLM scaling will continue is that the current LLMs have already been trained on virtually every piece of human text we have access to today. So, without new training data (in large amounts), the only way they'll scale more is by new discoveries on how to train more efficiently - but there is no way to put a predictable timeline on that.
It's traditional to include a link when claiming to be invulnerable. :)
Haha, sounds a bit self-promotional to do that but link in profile.
Not claiming that the site is technologically invulnerable. Just that it's not a big deal if LLMs scrape it (which bizarrely they do).
By developing Free Software to combat this hostile software.
Corporations develop hostile AI agents;
capable hackers develop anti-AI agents.
Enough of this defeatist "we have no power" attitude.
Yes, I obviously agree with you. I think you missed the point of my comment a little: CF is making these tools and giving millions of people access to them.
Well there's open source stuff like https://github.com/TecharoHQ/anubis; one doesn't need a top-down mandated solution coming from a corporation.
In general Cloudflare has been pushing DRMization of the web for quite some time, and while I understand why they want to do it, I wish they didn't always show off as taking the moral high ground.
Anubis doesn’t necessarily stop the most well funded actors.
If anything we’ve seen the rise in complaints about it just annoying average users.
The actual thing Anubis was created in response to is seemingly a strange kind of DDoS attack that has been misattributed to LLMs, but is some kind of attacker that makes partial GET requests which are aborted soon after sending the request headers, mostly coming from residential proxies. (Yes, it doesn't help that the author of Anubis also isn't fully aware of the mechanics of the attack. In fact, there is no proper write-up of the mechanism of the attack, which I hope to write someday.)
Having said that, the solution is effective enough, having a lightweight proxy component that issues proof of work tokens to such bogus requests works well enough, as various users on HN seem to point out.
> a strange kind of DDoS attack that has been misattributed to LLMs, but is some kind of attacker that makes partial GET requests that are aborted soon after sending the request headers, mostly coming from residential proxies.
um, no? Where did you get this strange bit of info.
The original reports say nothing of that sort: https://news.ycombinator.com/item?id=42790252 ; and even original motivation for Anubis was Amazon AI crawler https://news.ycombinator.com/item?id=42750420
(I've seen more posts with the analysis, including one which showed an AI crawler which would identify properly, but once it hits the ratelimit, would switch to fake user agent from proxies.. but I cannot find it now)
So basically cloudflare but self-hosted (with all the pain that comes from that)?
What’s so painful about self hosting? I’ve been self hosting since before I hit puberty. If 12 year old me can run a httpd, anyone can.
And if you don’t want to self host, at least try to use services from organisations that aren’t hostile to the open web
I self-host lots of stuff. But yes, it is more painful to host a WAF that can handle billions of requests per minute. Even harder to do it for free, like Cloudflare. And in the end, the result for the user is exactly the same whether you use a self-hosted WAF or let someone else host it for you.
If you're handling billions of requests per second, you're not a self hoster. That's a commercial service with a dedicated team to handle traffic around the clock. Most ISPs probably don't even operate lines that big
To put that in perspective, even if they're sending empty TCP packets, "several billion" pps is 200 to 1800 gigabits of traffic, depending on what you mean by that. Add a cookieless HTTP payload and you're at many terabits per second. The average self hoster is more likely to get struck by lightning than encounter and need protection from this (even without considering the, probably modest, consequences of being offline a few hours if it does happen)
Edit: off by a factor of 60, whoops. Thanks to u/Gud for pointing that out. I stand by the conclusion though: less likely to occur than getting struck by lightning (or maybe it's around equally likely now? But somewhere in that ballpark) and the consequences of being down for a few hours are generally not catastrophic anyway. You can always still put big brother in front if this event does happen to you and your ISP can't quickly drop the abusive traffic
If somebody decides they hate you, your site that could handle, say, 100,000 legitimate requests per day could suddenly get billions of illegitimate requests.
They could. Let me know when it happens
I have this argument every time self hosting comes up, and every time I wonder if someone will do it to me to make a point. Or if one of the like million other comments I post upsets someone or one of the many tools that I host. Yet to happen, idk. It's like arguing whether you need a knife on the street at all times because someone might get angry from a look. It happens, we have a word for it in NL (zinloos geweld) and tiles in sidewalks (lady bug depictions) and everything, but no normal person actually wears weapons 24/7 (drug dealers surely yeah) or has people talk through a middle person
I'd suspect other self hosters just see more shit than I do, were it not for that nobody ever says it happened to them. The only argument I ever hear is that they want to be "safe" while "self hosting with cloudflare". Who's really hosting your shit then?
Not everybody wants to manage some commercial grade packet filter that can handle some DDoSing script kiddie, it’s a strong argument.
But another argument against using the easiest choice, the near monopoly, is that we need a diverse, thriving ecosystem.
We don’t want to end up in a situation where suddenly Cloudflare gets to dictate what is allowed on the web.
We have already lost email to the tech giants, try running your own mail sometime. The technical aspect is easy, the problem is you will end up in so many spam folders it’s disgusting.
What we need are better decentralized protocols.
Please do try running your own mail some time. It's not nearly as hard as doomers would have you think. And if you only receive, you don't have any problems at all.
At first, you can use it for less serious stuff until you see how well it works.
I do, I host my own mail server.
Technically it's not very challenging. The problem is the total dominance of a few actors and a lot of spammers.
I haven't had spam issues since using a catch-all and giving everyone a unique address, blocking ones that receive spam
Won't work if you need a fixed address on a business card or something, but in case you don't...
Waiting for the day they catch on. Then it's time for a challenge-response protocol I guess
To be fair, he did say per minute :-)
Oh, whoops. Divide everything by 60, quick!
That does make it a bit less ludicrous even if I think the conclusion of my response still applies
But you don't get billions of requests per minute. You get maybe five requests per second (300 per minute) on a bad day. The sites that seem to be getting badly attacked, they get 200 per second, which is still within reach of a self hosted firewall. Think about how many CPU cycles per packet that allows for. Hardly a real DDoS.
The only reason you even want to firewall 200 requests per second is that the code downstream of the firewall takes more than 5ms to service a request, so you could also consider improving that. And if you're only getting <5 and your server isn't overloaded then why block anything at all?
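That arithmetic is worth making explicit: with an average service time of t seconds per request, n worker cores saturate at roughly n/t requests per second (a back-of-the-envelope estimate that ignores queueing effects):

```python
def max_sustainable_rps(cores: int, service_time_s: float) -> float:
    """Rough saturation point: cores divided by per-request service time."""
    return cores / service_time_s

# At 5 ms per request, one core tops out near the 200 req/s figure above.
print(round(max_sustainable_rps(1, 0.005)))
# Cut service time to 1 ms (static files, caching) and the same core
# absorbs roughly 1000 req/s.
print(round(max_sustainable_rps(1, 0.001)))
```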
Such entitlement.
How much additional tax money should I spend at work so the AI scum can make 200 searches per second?
Human and 'nice' bots make about 5 per second.
That's a mantra, not a solution.
Sometimes it's a hardware problem, not a software problem.
For that matter, sometimes it's a social/political problem and not a technological problem.
How does an agent help my website not get crushed by traffic load, and how is this proposal any different from the gatekeeping problem to the open web, except even less transparent and accountable because now access is gated by logic inside an impenetrable web of NN weights?
This seems like slogan-based planning with no actual thought put into it.
Whatever is working against the AI doesn’t have to be an AI agent.
So proof of work checks everywhere?
Sure, as long as it doesn't discriminate against user agents.
This is the attitude I like to see. As they say (and I actually hate this phrase because of its past connotations): "freedom isn't free".
“But the reality is how can someone small protect their blog or content from AI training bots?”
Why would you need to?
If your inability to assemble basic HTML forces you to adopt enormous, bloated frameworks that require two full cores of a cpu to render your post…
… or if you think your online missives are a step in the road to content creator riches …
… then I suppose I see the problem.
Otherwise there’s no problem.
It's not a question of languages or frameworks, but of hardware. I cannot finance servers large enough to keep up with AI bots constantly scraping my host, bypassing cache directives, or changing IPs to avoid bans.
I have had to disable at least one service because AI bots kept hitting it and it started impacting other stuff I was running that I am more interested in. Part of it was the CPU load on the database rendering dozens of 404s per second (which still required a database call), part of it was that the thumbnail images were being queried over and over again with seemingly different parameters for no reason.
I'm sure there are AI bots that are good and respect the websites they operate on. Most of them don't seem to, and I don't care enough about the AI bubble to support them.
When AI companies stop people from using them as cheap scrapers, I'll rethink my position. So far, there's no way to distinguish any good AI bot from a bad one.
So by a free and open for all web you mean one open only to the tech priests competent enough to build the skills and maintain them in light of changes to the spec (hope these people didn't build their sites around xml/xslt-dependent techniques), or to those with a family rich enough that they can casually learn a skill while not worrying about putting food on the table?
There’s going to be bad actors taking advantage of people who cannot fight back without regulations and gatekeepers, suggesting otherwise is about as reasonable as ancaps idea of government
We have thousands of these companies' engineers right here on Hacker News, and they cry and scream about privacy and data governance on every topic except their own work. If you guys need a mirror for some self-reflection, I'm offering to buy one.
In the recent days, the biggest delu-lulz was delivered by that guy who'd bravely decided to boycott Grok out of... environmental concerns, apparently. It's curious how everybody is so anxious these days, about AI among other things in our little corner of the web. I swear, every other day it's some new big fight against something... bad. Surely it couldn't ALL be attributed to policy in the US!
I'll contribute for the mirror. The hypocrisy is so loud, aliens in outer space can hear it (and sound doesn't even travel in vacuum).
What we need is some legal teeth behind robots.txt. It won't stop everyone, but Big Corp would be a tasty target for lawsuits.
I don't know about this. This means I'd get sued for using a feed reader on Codeberg[1], or for mirroring repositories from there (e.g. with Forgejo), since both are automated actions that are not caused directly by a user interaction (i.e. bots, rather than user agents).
[1]: https://codeberg.org/robots.txt#:~:text=Disallow:%20/.git/,....
To be more specific: this assumes good faith in our fine congresspeople to craft it well... OK, yeah. Well, for the hypothetical case, I'll continue...
What legal teeth I would advocate would be targeted to crawlers (a subset of bot) and not include your usage. It would mandate that Big Corp crawlers (for search indexing, AI data harvesting, etc.) be registered and identify themselves in their requests. This would allow serverside tools to efficiently reject them. Failure to comply would result in fines large enough to change behavior.
Now that I write that out, if such a thing were to come to pass, and it was well received, I do worry that congress would foam at the mouth to expand it to bots more generally, Microsoft-Uncertified-Devices, etc.
Yeah, my main worry here is how we define the unwanted traffic, and how that definition could be twisted by bigcorp lawyers.
If it's too loose and similar to "wanted traffic is how the authors intend the website to be accessed, unwanted traffic is anything else", that's an argument that can be used against adblocks, or in favor of very specific devices like you mention. Might even give slightly more teeth to currently-unenforceable TOS.
If it's too strict, it's probably easier to find loopholes and technicalities that just lets them say "technically it doesn't match the definition of unwanted traffic".
Even if it's something balanced, I bet bigcorp lawyers will find a way to twist the definitions in their favor and set a precedent that's convenient for them.
I know this is a mini-rant rather than a helpful comment that tries to come up with a solution, it's just that I'm pessimistic because it seems the internet becomes a bit worse day by day no matter what we try to do :c
You don't get sued for using a service as it is meant to be used (using an RSS reader on their feed endpoint; cloning repositories that it is their mission to host). It doesn't anger anyone so they wouldn't bother trying to enforce a rule, and secondly it's a fruitless case because the judge would say it's not a reasonable claim they're making
Robots.txt is meant for crawlers, not user agents such as a feed reader or git client
I agree with you, generally you can expect good faith to be returned with good faith (but here I want to make heavy emphasis that I only agree on the judge part iff good faith can be assumed and the judge is informed enough to actually be able to make an informed decision).
But not everyone thinks that's the purpose of robots.txt. Example, quoting Wikipedia[1] (emphasis mine):
> indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.
Quoting the linked `web robots` page[2]:
> An Internet bot, web robot, robot, or simply bot, is a software application that runs automated tasks (scripts) on the Internet, usually with the intent to imitate human activity, such as messaging, on a large scale. [...] The most extensive use of bots is for web crawling, [...]
("usually" implying that's not always the case; "most extensive use" implying it's not the only use.)
Also a quick HN search for "automated robots.txt"[3] shows that a few people disagree that it's only for crawlers. It seems to be only a minority, but the search results are obviously biased towards HN users, so it could be different outside HN.
Besides all this, there's also the question of whether web scraping (not crawling) should also be subject to robots.txt or not; where "web scraping" includes any project like "this site has useful info but it's so unusable that I made a script so I can search it from my terminal, and I cache the results locally to avoid unnecessary requests".
The behavior of alternative viewers like Nitter could also be considered web scraping if they don't get their info from an API[4], and I don't know if I'd consider Nitter the bad actor here.
But yeah, like I said I agree with your comment and your interpretation, but it's not the only interpretation of what robots.txt is meant for.
[1]: https://en.wikipedia.org/wiki/Robots.txt
[2]: https://en.wikipedia.org/wiki/Internet_bot
[3]: https://hn.algolia.com/?dateRange=all&query=automated%20robo...
[4]: I don't know how Nitter actually works or where does it get its data from, I just mention it so it's easier to explain what I mean by "alternative viewer".
> This means I'd get sued for using a feed reader on Codeberg
you think codeberg would sue you?
Probably not.
But it's the same thing with random software from a random nobody that has no license, or has a license that's not open-source: If I use those libraries or programs, do I think they would sue me? Probably not.
It wouldn’t stop anyone. The bots you want to block already operate out of places where those laws wouldn’t be enforced.
Then that is a good reason to deny the requests from those IPs
I've run a few hundred small domains for various online stores with an older backend that didn't scale very well for crawlers and at some point we started blocking by continent.
It's getting really, really ugly out there.
What we need is to stop fighting robots and start welcoming and helping them. I see zero reasons to oppose robots visiting any website I would build. The only purpose I ever used Disallow for was preventing search engines from indexing incomplete versions, or from going down paths that make no sense for them to follow. Now I think we should write separate instructions for different kinds of robots: a search engine indexer shouldn't open pages that have serious side effects (e.g. placing an order) or that display semi-realtime technical details, but an LLM agent may be on a legitimate mission involving exactly that.
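Robots.txt already has the mechanism for this: per-agent groups. A sketch of what I mean (the agent names and paths here are purely illustrative):

```txt
# Search indexer: keep it away from pages with side effects
# or semi-realtime technical details
User-agent: ExampleSearchBot
Disallow: /checkout/
Disallow: /status/

# Hypothetical LLM agent on a user-driven errand: only keep it out of drafts
User-agent: ExampleAgentBot
Disallow: /drafts/

# Everyone else
User-agent: *
Disallow: /drafts/
Disallow: /checkout/
```

Of course, all of this still relies on the robot choosing to read and honor the file.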
> I see zero reasons to oppose robots visiting any website I would build.
> preventing search engines from indexing incomplete versions or going the paths which really make no sense for them to go.
What will you do when the bots ignore your instructions, and send a million requests a day to these URLs from half a million different IP addresses?
Let my site go down, then restart my server a few hours later. I'm a dude with a blog; I'm not making uptime guarantees. I think you're overestimating the harm and how often this happens.
Misbehaving scrapers have been a problem for years, not just from AI. I've written posts on how to properly handle scraping, the legal grey area it puts you in, and how to be a responsible scraper. If companies don't want to be responsible, the solution isn't to abandon an open web; it's to make better law and to enforce it.
Sue them / press charges. DDoS is a felony.
> What we need is to stop fighting robots and start welcoming and helping them. I see zero reasons to oppose robots visiting any website I would build.
Well, I'm glad you speak for the entire Internet.
Pack it in folks, we've solved the problem. Tomorrow, I'll give us the solution to wealth inequality (just stop fighting efforts to redistribute wealth and political power away from billionaires hoarding it), and next week, we'll finally get to resolve the old question of software patents.
The funny thing about the good old WWW is that the first two W's stand for world-wide.
Which legal teeth?
It should have the same protections as an EULA, where the crawler is the end user, and crawlers should be required to read it and apply it.
So none at all? EULAs are mostly just meant to intimidate you so you won't exercise your inalienable rights.
I have the feeling that it's the small players that cause problems.
Dumb bots that don't respect robots.txt or nofollow are the ones trying all combinations of the filters available in your search options and requesting all pages for each such combination.
The number of search pages can easily be exponential in the number of filters you offer.
Bots walk into these traps because they are dumb. But even a small, degenerate bot can send more requests than 1M MAUs.
At least that's my impression of the problem we're sometimes facing.
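To put a number on the "exponential in the number of filters" point, a quick sketch (the filter counts are made up):

```python
# Each boolean filter doubles the number of distinct result pages a
# dumb crawler can request; pagination multiplies it further.
def filter_combinations(num_boolean_filters: int, pages_per_result: int = 1) -> int:
    return (2 ** num_boolean_filters) * pages_per_result

# A modest faceted search: 12 on/off filters, ~10 result pages each.
print(filter_combinations(12, 10))  # 40960 crawlable URLs from one search page
```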
Signed agents seem like a horrific solution. In many cases just serving the traffic is better.
No we don't.
- Moral rules are never really effective
- Legal threats are never really effective
Effective solutions are:
- Technical
- Monetary
I like the idea of web as a blockchain of content. If you want to pull some data, you have to pay for it with some kind of token. You either buy that token to consume information if you're of the leecher type, or get some by doing contributions that gain back tokens.
It's more or less the same concept as torrents back in the day.
This should be applied to emails too. The regular person sends what, 20 emails per day max? Say it costs $0.01 per email; anyone could pay that. But if you want to spam 1,000,000 every day, that becomes prohibitive.
>This should be applied to emails too. The regular person sends what, 20 emails per day max? Say it costs $0.01 per email; anyone could pay that.
This seems flawed.
Poor people living in 3rd world countries that make like $2.00/day wouldn't be able to afford this.
>But if you want to spam 1,000,000 every day, that becomes prohibitive.
Companies and people with $ can easily pay this with no issues. If it costs $10,000 to send 1M emails that inbox but you profit $50k, it's a non-issue.
I recently found out my website has been blocked by AI agents, when I had never asked for it. It seems to be opt-out by default, but in an obscure way. Very frustrating. I think some of these companies (one in particular) are risking burning a lot of goodwill, although I think they have been on that path for a while now.
Are you talking about Cloudflare? The default seems indeed to be to block AI crawlers when you set up a new site with them.
You can lock it up with a user account and payment system. The fact the site is up on the internet doesn't mean you can or cannot profit from it. It's up to you. What I would like is a way to notify my ISP and say: block this traffic to my site.
> What I would like is a way to notify my ISP and say: block this traffic to my site.
I would love that, and make it automated.
A single message from your IP to your router: block this traffic. That router sends it upstream, which also blocks it. Repeat ad nauseam until the path changes ASN or (if the originator is on the same ASN) it reaches the router nearest the originator, routing table space notwithstanding. Maybe it expires after some auto-expiry: a day, a month, or however long your IP lease lasts. Plus, of course, a way to query what blocks I've requested and a way to unblock.
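Nothing like this exists as a deployed protocol (BGP Flowspec between networks is the closest relative), but the propagation logic itself is simple enough to sketch; everything below, ASNs included, is hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Router:
    asn: int
    upstream: "Router | None" = None
    blocked: set = field(default_factory=set)

def request_block(router, source_ip, source_asn):
    """Block source_ip locally, then propagate upstream until we reach
    a router inside the originator's own ASN (or run out of upstreams)."""
    router.blocked.add(source_ip)
    if router.asn != source_asn and router.upstream is not None:
        request_block(router.upstream, source_ip, source_asn)

core = Router(asn=65001)                 # shares the offender's ASN
isp = Router(asn=64512, upstream=core)
home = Router(asn=64512, upstream=isp)   # "my router"

request_block(home, "203.0.113.7", source_asn=65001)
# The block now exists at home, at the ISP, and at the core router.
```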
> Great video: https://www.youtube.com/shorts/M0QyOp7zqcY
Here's an even greater video: https://www.youtube.com/watch?v=mAUpxN-EIgU&t=4m24s
You can't trust everyone will be polite or follow "standards".
However, you can incentivize good behavior. Say there's a scraping agent: you could make an x402-compatible endpoint and offer them a discount or something.
Kinda like piracy; if you offer a good, simple, cheap service people will pay for it versus go through the hassle of pirating.
Onion sites have bots and scrapers.
They don't use Cloudflare AFAIK.
They normally use a puzzle that the website generates, or a proof-of-work based captcha. Of the two, I've found proof of work good enough, and it also means the site owner can run it themselves instead of being reliant on Cloudflare and third parties.
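For the curious, the core of a proof-of-work captcha is only a few lines; a minimal sketch (the hash choice and difficulty are illustrative, and real deployments also add challenge expiry and signing):

```python
import hashlib
from itertools import count

DIFFICULTY = 4  # require 4 leading zero hex digits; tune per deployment

def solve(challenge: str) -> int:
    """Client side: burn CPU until the hash meets the difficulty target."""
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
        if digest.startswith("0" * DIFFICULTY):
            return nonce

def verify(challenge: str, nonce: int) -> bool:
    """Server side: one hash to check, so verification is nearly free."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * DIFFICULTY)

nonce = solve("example-challenge")
print(verify("example-challenge", nonce))  # True
```

The asymmetry is the point: solving costs thousands of hashes, verifying costs one, so mass scraping gets expensive while a single human request stays cheap.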
> But the reality is how can someone small protect their blog or content from AI training bots?
A paywall.
In reality, what some want is to get all the benefits of having their content on the open internet while still controlling who gets to access it. That is the root cause here.
This. We need to get rid of the ad-supported free internet economy. If you want your content to be free, you release it and have no issues with AI. If you want to make money of your content, add a paywall.
We need micropayments going forward, Lightning (Bitcoin backend) could be the solution.
> If you want your content to be free, you release it and have no issues with AI. If you want to make money of your content, add a paywall.
What about licenses like CC-BY-NC (Creative Commons - Non Commercial)?
What about them? As we can see scrapers don’t care about copyright at all, so public licenses don’t really matter to them either.
Which is really all that cloudflare is building here that people are mad about. It’s a way to give bots access to paywalled content.
Where everyone needs a cloudflare account to be able to pay*
“Everyone” in this context being bot operators who want to access websites who have decided to use cloudflare to block unauthenticated bot traffic.
Which is not everyone.
> Everyone loves the dream of a free for all and open web.
> protect their blog or content from AI training bots
It strikes me that one needs to choose one of these as their visionary future.
Specifically: a free and open web is one where read access is unfettered to humans and AI training bots alike.
So much of the friction and malfunction of the web stems from efforts to exert control over the flow (and reuse) of information. But this is in conflict with the strengths of a free and open web, chief of which is the stone cold reality that bytes can trivially be copied and distributed permissionlessly for all time.
It's the new "ban cassette tapes to prevent people from listening to unauthorized music," but wrapped in an anti-corporate skin delivered by a massive, powerful corporation that could sell themselves to Microsoft tomorrow.
The AI crawlers are going to get smarter at crawling, and they'll have crawled and cached everything anyway; they'll just be reading your new stuff. They should literally just buy the Internet Archive jointly, and only read everything once a week or so. But people (to protect their precious ideas) will then just try to figure out how to block the IA.
One thing I wish people would stop doing is conflating their precious ideas with their bandwidth. The bandwidth is one very serious issue, because it's a denial of service attack. But it can be easily solved. Your precious ideas? Those have to be protected by a court. And I don't actually care if the copyright violation can go both ways; wealthy people seem to be free to steal from the poor at will, even rewarded, while "normal" (upper-middle class) people can't even afford to challenge obviously fraudulent copyright claims, and the penalties are comically absurd and the direct result of corruption.
Maybe having pay-to-play justice systems that punish the accused before conviction with no compensation was a bad idea? Even if it helped you to feel safe from black people? Maybe copyright is dumb now that there aren't any printers anymore, just rent-seekers hiding bitfields?
You might have this the wrong way around.
It's not the publishers who need to do the hard work, it's the multi-billion dollar investments into training these systems that need to do the hard work.
We are moving to a position whereby if you or I want to download something without compensating the publisher, that's jail time, but if it's Zuck, Bezos or Musk, they get a free pass.
That's the system that needs to change.
I should not have to defend my blog from these businesses. They should be figuring out how to pay me for the value my content adds to their business model. And if they don't want to do that, then they shouldn't get to operate that model, in the same way I don't get to build a whole set of technologies on papers published by Springer Nature without paying them.
This power imbalance is going to be temporary. These trillion-dollar market cap companies think if they just speed run it, they'll become too big, too essential, the law will bend to their fiefdom. But in the long term, it won't - history tells us that concentration of power into monarchies descends over time, and the results aren't pretty. I'm not sure I'll see the guillotine scaffolds going up in Silicon Valley or Seattle in my lifetime, but they'll go up one day unless these companies get a clue from history as to what they need to do.
Maybe this is a naive question, but why not just cut an IP off temporarily if it sends too many requests or sends them too fast?
They use many IPs, often not identifiable as the same bot.
It is a service available to Cloudflare customers and is opt-in. I fail to see how they’re being gatekeepers when site owners have option not to use it.
I care more about the dream of a wide open free web than a small time blogger’s fears of their content being trained on by an AI that might only ever emit text inspired by their content a handful of times in their life.
"I want an open web!"
"Okay, that means AI companies can train on your content."
"Well, actually, we need some protections..."
"So you want a closed web with access controls?"
"No no no, I support openness! Can't we just have, like, ethical openness? Where everyone respects boundaries but there's no enforcement mechanism? Why are you making this so black and white?"
> “When we started the “free speech movement,” we had a bold new vision. No longer would dissenters’ views be silenced. With the government out of the business of policing the content of speech, robust debate and the marketplace of ideas would lead us toward truth and enlightenment. But it turned out that freedom of the press meant freedom for those who owned one. The wealthy and powerful dominated the channels of speech. The privileged had a megaphone and used free speech protections to immunize their own complacent or even hateful speech. Clearly, the time has come to denounce the naïve idealism of the past and offer a new movement, Speech 2.0, which will pay more attention to the political economy of media and aim at “free-ish” speech — the good stuff without the bad.”
https://openfuture.eu/paradox-of-open-responses/misunderesti...
> Everyone loves the dream of a free for all and open web. But the reality is how can someone small protect their blog or content from AI training bots?
I'm old enough to remember when people asked the same questions of Hotbot, Lycos, Altavista, Ask Jeeves, and -- eventually -- Google.
Then, as now, it never felt like the right way to frame the question. If you want your content freely available, make it freely available... including to the bots. If you want your content restricted, make it restricted... including to the humans.
It's also not clear to me that AI materially changes the equation, since Google has for many years tried to cut out links to the small sites anyway in favor of instant answers.
(FWIW, the big companies typically do honor robots.txt. It's everyone else that does what they please.)
What if I want my content freely available to humans, and not to bots? Why is that such an insane, unworkable ask? All I want is a copyleft protection that specifically allows humans to access my work to their heart's content, but disallows AI use of it in any form. Is that truly so unreasonable?
Yes, it is an unreasonable and absurd ask. You cannot want freedom while restricting it. You forget that it is people that use AI agents, essentially, being cyborgs. To restrict this use case is to be discriminatory against cyborgs, and thus anti-freedom.
We are lucky that there is no way to detect it.
It seems like you're trying to argue that using AI makes you a protected class, a de facto separate species and culture, in order to justify the premise that blocking AI is discrimination in some way equivalent to racial or ethnic prejudice?
If so, no. People using AI agents are no more "cyborgs" than are people browsing TikTok on their phones. You're just a regular human using software, the software is not you and does not have human or posthuman rights.
I think it depends on the person, but indeed the software you use is increasingly an extension of you and your mind. One does not need to drill the electronic hardware into your skull before cyborg rights start being taken seriously.
Also, I'm not a human.
> What if I want my content freely available to humans, and not to bots? Why is that such an insane, unworkable ask?
Because the “humans” are really “humans using software to access content” and the “bots” are really “software accessing content on behalf of humans”, and the “bots” of current concern are largely software doing so to respond to immediate user requests, rather than just building indexes for future human access.
It's not unreasonable to ask but I think it probably is unreasonable to expect a strictly technical solution. It feels like we're in the realm of politics, policy, and law.
Google (and the others) crawl from a published IP range, with "Google" in the user agent. They read robots.txt. They are very easy to block
The AI scum companies crawl from infected botnet IPs, with the user agent the same as the latest Chrome or Safari.
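For the Google case specifically, Google's documented verification is: reverse-DNS the client IP, check the hostname ends in googlebot.com or google.com, then forward-resolve that hostname and confirm it maps back to the same IP. A sketch with the DNS lookups injected as functions (real code would wrap socket.gethostbyaddr / socket.gethostbyname_ex):

```python
def is_verified_googlebot(ip, reverse_lookup, forward_lookup,
                          suffixes=(".googlebot.com", ".google.com")):
    """reverse_lookup(ip) -> hostname; forward_lookup(host) -> list of IPs."""
    try:
        host = reverse_lookup(ip)
    except OSError:
        return False
    if not host.endswith(suffixes):
        return False
    # Forward-confirm: a spoofed PTR record fails this step.
    return ip in forward_lookup(host)
```

Anyone can put "Google" in a User-Agent header; the forward confirmation is what makes the check trustworthy.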
Why do you think the bots you see are AI scum companies?
Okay. Which, specifically, are the "AI scum" companies you're speaking of?
There are plenty of non-AI companies that also use dubiously sourced IPs and hide behind fake User-Agents.
I don't know which companies, of course. They hide their identity by using a botnet.
This traffic is new, and started around when many AI startups started.
I see traffic from new search engines and other crawlers, but it generally respects robots.txt and identifies itself, or else comes from a small pool of IP addresses.
Don't publish things if you don't want them published.
Get real yourself.
Nonsense.
I'm routinely denied access to websites now.
"Enable JavaScript and unblock cookies to continue."
You could run https://zadzmo.org/code/nepenthes/ to punish the AI scrapers.
Everyone loves a free for all and open web because it works really well.
Basic tools like Anubis and fail2ban are very effective at keeping most of this evil at bay.
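As an illustration of the fail2ban half of that (the jail name, paths, and thresholds below are all made up, and it also needs a matching filter regex, not shown):

```ini
# /etc/fail2ban/jail.d/aggressive-crawlers.local (illustrative)
[aggressive-crawlers]
enabled  = true
port     = http,https
filter   = aggressive-crawlers
logpath  = /var/log/nginx/access.log
# more than 300 matching requests within 60 seconds earns an hour-long ban
maxretry = 300
findtime = 60
bantime  = 3600
```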
Nobody cares about robots.txt, nor should they.
If this is your primary argument against being scraped (viz that your robots.txt said not to) then you’re naive and you’re doing it wrong.
If the internet is open, then data on it is going to be scraped lol. You can’t have it both ways.
It seems the Open Internet is idealistic.
If others respected robots.txt, we would not need solutions like what Cloudflare is presenting here. Since abuse is rampant, people are looking for mitigations and this CF offering is an interesting one to consider.
How about we discuss, design, and implement a system that charges them for their actions? We could put some dark patterns in our sites that impose this cost through some sort of problem-solving task, harvesting the energy of their scraping/LLM tools and directing it toward causes that earn the site a profit, in exchange for revealing some content that fulfils their scraping mission too. It looks like such systems exist to a degree.
Why should your blog be protected? Information wants to be free.
It's amazing how this catchphrase has reversed meanings for some people. It was previously used against walled gardens and paywalls, but these corporate LLMs are the ultimate walled garden for information because in most cases you can't even find out who created the information in the first place.
"Information wants to be free! That's why I support hiding it behind a chatbot paywall that makes a few people billionaires"
> But the reality is how can someone small protect their blog or content from AI training bots?
First off, there's no harm from well-behaved bots. Badly behaved bots that cause problems for the server are easily detected (by the problems they cause), classified, and blocked or heavily throttled.
Of course, if you mean "protect" in the sense of "keep AI companies from getting a copy" (which you may have, given that you mentioned training) - you simply can't, unless you consider "don't put it on the web" a solution.
It's impossible to make something "public, but not like that". Either you publish or you don't.
If anything, it's a legal issue (copyright/fair use), not a technical one. Technical solutions won't work.
I'm not sure why people are so confused by this. The Mastodon/AP userbase put their public content on a publicly federated protocol then lost their shit and sent me death threats when I spidered and indexed it for network-wide search.
There are upsides and downsides to publishing things you create. One of the downsides is that it will be public and accessible to everyone.
I personally love the idea of a free and open internet and also have no issues with bots scraping or training off of my data.
I would much rather have it open for all, including companies, than the coming dystopian landscape of paywall gates. I don’t care about respecting robots.txt or any other types of rules. If it’s on the internet it’s for all to consume. The moment you start carving out certain parties is the moment it becomes a slippery slope.
For what it’s worth, I think CF will lose this battle and fundamentally feeding the bots will just become normal and wanted
I have zero issue with AI agents if there's a real user behind there somewhere. I DO have a major issue with my sites being crawled extremely aggressively by offenders including Meta, Perplexity, and OpenAI; it's really annoying realising that we're tying up several CPU cores on AI crawling (less than on real users and Google et al., but still).
I've some personal apps online and I had to turn the Cloudflare AI bot protection on because one of them got 1.6TB of data accessed by bots in the last month, 1.3 million requests per day, just non-stop hammering with no limits.
They're getting to the point of 200-300 RPS for some of my smaller marketing sites, hallucinating URLs like crazy. It's fucking insane.
You'd think they would have an interest in developing reasonable crawling infrastructure, like Google, Bing or Yandex. Instead they go all in on hosts with no metering. All of the search majors reduce their crawl rate as request times increase.
On one hand these companies announce themselves as sophisticated, futuristic and highly-valued, on the other hand we see rampant incompetence, to the point that webmasters everywhere are debating the best course of action.
I suspect it's because they're dealing with such unbelievable levels of bandwidth and compute for training and inference that the amount required to blast the entire web like this barely registers to them.
Honestly it's just tragedy of the commons. Why put the effort in when you don't have to identify yourself, just crawl and if you get blocked move the job to another server.
At this point I'm blocking several ASNs. Most are cloud provider related, but there are also some repurposed consumer ASNs coming out of the PRC. Long term, this devalues the offerings of those cloud providers, as prospective customers will not be able to use them for crawling.
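Blocking an ASN in practice means blocking the prefixes it announces; the prefixes below are documentation ranges standing in for a real list you'd pull from BGP/whois data:

```python
import ipaddress

# Hypothetical announced prefixes of an ASN being blocked.
BLOCKED_PREFIXES = [ipaddress.ip_network(p)
                    for p in ("198.51.100.0/24", "203.0.113.0/24")]

def is_blocked(ip):
    """True if the address falls inside any blocked prefix."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_PREFIXES)

print(is_blocked("203.0.113.42"))  # True
```

In production you'd load these into an ipset or nftables set rather than checking per request in application code.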
I'm seeing around the same, as a fairly constant base load. Even more annoying when it's hitting auth middleware constantly, over and over again somehow expecting a different answer.
I wonder how many CPU cycles are spent because of AI companies scraping content. This factor isn't usually considered when estimating “environmental impact of AI.” What’s the overhead of this on top of inference and training?
To be fair, an accurate measurement would need to consider how many of those CPU cycles would be spent by the human user who is driving the bot. From that perspective, maybe the scrapers can “make up for it” by crawling efficiently, i.e. avoid loading tracker scripts, images, etc unless necessary to solve the query. This way they’ll still burn CPU cycles but at least it’ll be less cycles than a human user with a headful browser instance.
Same with me. If there is a real user behind the use of the AI agents and they do not make excessive accesses in order to do what they are trying to do, then I do not have a complaint (the use of AI agents is not something I intend, but that is up to whoever is using them and not up to me). I do not like the excessive crawling.
However, what is more important to me than AI agents, is that someone might want to download single files with curl, or use browsers such as Lynx, etc, and this should work.
Cloudflare is trying to gatekeep which user-initiated agents are allowed to read website content, which is of course very different from scraping websites for training data. Meta, Perplexity, and OpenAI all have some kind of web-search functionality where they send requests based on user prompts; these are not requests that get saved to train the next LLM. Cloudflare intentionally blurs the line between both types of bots, and in that sense it's a bait-and-switch: they claim to 'protect content creators' by being the man in the middle, collecting tolls from LLM providers to pay creators (and of course taking a cut for themselves). It's not something they do because it would be fair; there's a financial motivation.
> Cloudflare is trying to gatekeep which user-initated agents are allowed to read website content, which is of course very different from scraping website for training data.
That distinction requires you to take companies which benefit from amassing as much training data as possible at their word when they pinky swear that a particular request is totally not for training, promise.
If you look at the current LLM landscape, the frontier is not being pushed by labs throwing more data at their models; most improvements come from using more compute and improving training methods. In that sense I don't have to take their word for it: more data just hasn't been the problem for a long time.
Just today Anthropic announced that they will begin using their users data for training by default - they still want fresh data so badly that they risked alienating their own paying customers to get some more. They're at the stage of pulling the copper out of the walls to feed their crippling data addiction.
> I DO have a major issue with my sites being crawled extremely aggressively by offenders including Meta, Perplexity and OpenAI
Gee, if only we had, like, one central archive of the internet. We could even call it the internet archive.
Then, all these AI companies could interface directly with that single entity on terms that are agreeable.
Internet Archive is missing enormous chunks of the internet though. And I don't mean weird parts of the internet, just regional stuff.
Not even news articles from top 10 news websites from my country are usually indexed there.
You think they care about that? They'd still crawl like this just in case, which is why they don't rate limit at the moment.
I use uncommon web browsers that don't leak a lot of information. To Cloudflare, I am indistinguishable from a bot.
Privacy cannot exist in an environment where the host gets to decide who accesses the web page. I'm okay with rate limiting or otherwise blocking activity that creates too much of a load, but trying to prevent automated access is impossible without preventing access from real people.
And god forbid you live in an authoritarian country and must use VPN to protect your freedom. Internet becomes captcha hell run by 2-3 companies.
I've had far fewer issues with my own bots that access cloudflare protected websites, than during my regular browsing with privacy respecting browsers and a VPN.
As a side note: I'm at least thankful Microsoft isn't behind web gatekeeping. Try solving any Microsoft captcha behind a VPN; it's like writing a thesis, you've got to dedicate like 5 minutes of full attention.
The website owner has rights too. Are you arguing they cannot choose to implement such gatekeeping to keep their site operating in a financially viable manner?
The first article of our constitution says people shall be treated equally in equal situations. I presume that most countries have similar clauses but, beyond legalese, it's also simply in line with my ethics to treat everyone equally
There are people behind those connection requests. I don't try to guess on my server who is a bot and who is not; I'll make mistakes and probably bias against people who use uncommon setups (those needing accessibility aids or using e.g. experimental software that improves some aspect like privacy or functionality)
Sure, I have rights as a website owner. I can take the whole thing offline; I can block every 5th request; I can allow each /16 block to make 1000 requests per day; I can accept requests only from clients that have a Firefox user agent string. So long as it's equally applied to everyone and it's not based on a prohibited category such as gender or religious conviction, I am free to decide on such cuts and I'd encourage everyone to apply a policy that they believe is fair
Cloudflare and its competitors, as far as I can tell, block arbitrary subgroups of people based on secret criteria. It does not appear to be applied fairly, such as allowing everyone to make the same number of requests per unit time. I'm probably bothered even more because I happen to be among the blocked subgroup regularly (but far from all the time, just little enough to feel the pain)
If by "our constitution" you mean the U.S. Constitution then no, it says nothing of the sort. The first article of the U.S. Constitution concerns the organization of the legislative branch. You may be referencing the Equal Protection and Due Process clauses, in the Fifth and Fourteenth amendments, but neither of those applies in this situation either since there are no laws or governmental actions at issue here, and random sites on the internet are not universally considered to be public accommodations. Even in the ADA context, the law isn't actually clear, since websites aren't specified anywhere in the text at the federal level and there's no SCOTUS precedent on point.
Some states are more stringent with their own disability regulations or state constitutions, but no state anywhere in the U.S. has a law that says every visitor to a website has to be treated equally.
You can assume it's the USA and that I'm just dead wrong, but the third word of my profile specifies where I'm from and you'd find that this Dutch constitution matches the comment's contents
Equal protection is indeed not the same as equal treatment. No, it really does say that everyone shall be treated equally so long as the circumstances are equal (gelijke behandeling in gelijke gevallen)
I didn't assume, that's why I started my comment with "if by what you mean." Good to know that you were referencing a different place, but it's unrealistic to expect people to delve into your account bio to understand what you intended by "our constitution," especially when the parent comment also contained no geographic or cultural references. Perhaps you know the parent commenter and know that they share your geography? If so, that would also have been helpful context.
As an aside, I'm curious by how that language in the Dutch constitution actually works in practice. Is it just a game of distinguishing between situations or people to excuse disparate conduct? It seems like it would be unworkable if interpreted literally.
I never said there was anything prohibiting them, just that they will be losing users. (Although, blocking some access can be illegal, for example when accessability tools are blocked.)
There's a whole spectrum of gatekeeping on communications with users, from static sites that broadcast their information to anyone, and stores that let you order without even making an account, to organizations that require you to install local software just to access data and perform transactions. The latter means 90%+ of your users will hate you for it, and half will walk away, but it's still very common, collectively costing businesses that do so billions of dollars a year. (https://www.forbes.com/sites/johnkoetsier/2021/02/15/91-of-u... to-install-apps-to-do-business-costing-brands-billions/)
When companies get big enough to have entire departments devoted to tasks, those departments will follow the fads that bring them the most prestige, at the cost of the rest of the company. Eventually the company will lose out to newer, more efficient businesses that forgo fads in favor of serving customers, and the cycle continues.
I'm just pointing out how a new fad is hurting businesses, but by no means wish to limit their ability to do so. They just won't be getting my business, nor business from a quickly growing cohort that desires anonymity, or even requires it to get around growing local censorship.
If you put your information freely on the web, you should have minimal expectations on who uses it and how. If you want to make money from it, put up a paywall.
If you want the best of both worlds, i.e. just post freely but make money from ads, or inserting hidden pixels to update some profile about me, well good luck. I'll choose whether I want to look at ads, or load tracking pixels, and my answer is no.
> If you put your information freely on the web, you should have minimal expectations on who uses it and how.
Does this only apply to "information" or should we treat all open source code as public domain?
In a lot of circumstances, that is exactly the case. What the open source license stops is redistribution under terms that violate the license, not usage itself. An individual can very well take your open source code, make any changes they want, compile and use it for their own purposes without adhering to the terms of your license - as long as they don't redistribute it.
All "open source" code was already pretty much public domain. All they'd have to do is put a page of OSI-approved licenses up on the site, right? An index of Open Source projects and their authors? Is this more than a week's work to comply?
Free Software is the only place where this is a real abridgement of rights and intention, and it's over. They've already been trained on all of it, and no judge will tell them to stop, and no congressman will tell them to stop.
I'm not talking about ads or pixels, I'm referring to bot operators creating so much traffic that the network bill makes the hosting financially impossible
> my answer is no.
Rights for me, but not for thee?
You have every right to take the content offline, or to put any technical barriers you desire in place to access it - but that's about all you should be able to do.
If you don't want to lose money and don't feel confident that you can protect your content with technical measures, best to take your stuff off the internet.
I also do the same and get caught up by bot blockers.
However, I do believe the host can do whatever they want with my request also.
This issue becomes more complex when you start talking about government sites, since ideally they have a much stronger mandate to serve everyone fairly.
I agree with you, but the website owners just don't seem to understand that they are making their small problem into a big problem for real people, some of which will drop off.
Well, if you have a better way to solve this that’s open I’m all ears. But what Cloudflare is doing is solving the real problem of AI bots. We’ve tried to solve this problem with IP blocking and user agents, but they do not work. And this is actually how other similar problems have been solved. Certificate authorities aren’t open and yet they work just fine. Attestation providers are also not open and they work just fine.
> Well, if you have a better way to solve this that’s open I’m all ears.
Regulation.
Make it illegal to request the content of a webpage by crawler if a website operator doesn't explicitly allow it via robots.txt. Institute a government agency that is tasked with enforcement. If you as a website operator can show that traffic came from bots, you can open a complaint with the government agency and they take care of shaking painful fines out of the offending companies. Force cloud hosts to keep books on who was using what IP addresses. Will it be a 100% fix? No. Will it have a massive chilling effect if done well? Absolutely.
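As a sketch of what such an opt-in regime could look like in a site's robots.txt (the directives are standard robots.txt syntax; the bot names are made-up examples):

```text
# Hypothetical robots.txt under the opt-in regime described above:
# crawling would be illegal unless a rule explicitly permits it.

User-agent: FriendlySearchBot
Allow: /

User-agent: GPTBot
Disallow: /

# Everyone else: no explicit Allow, so no lawful crawling.
User-agent: *
Disallow: /
```

The legal change is what gives the `Disallow` lines teeth; the file format itself already exists and is already ignored by the worst offenders.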
The biggest issue right now seems to be people renting their residential IP addresses to scraper companies, who then distribute large scrapes across these mostly distinct IPs. These addresses are from all over the world, not just your own country, so we'll either need a World Government, or at least massive intergovernmental cooperation, for regulation to help.
I don't think we need a world government to make progress on that point.
The companies buying these services are buying them from other companies. Countries or larger blocs like the EU can exert significant pressure on such companies by declaring the use of such services illegal when interacting with websites hosted in the country or bloc, or by companies in them.
It just seems too easy to skirt around via middlemen. The EU (say) could prosecute an EU company directly doing this residential scraping, and it could probably keep tabs on a handful of bank accounts of known bad actors in other countries, and then investigate and prosecute EU companies transferring money to them. But how do you stop an EU company paying a Moldovan company (that has existed for 10 days) for "internet services", that pays a Brazilian company, that pays a Russian company to do the actual residential scraping? And then there's all the crypto channels and other quid pro quo payment possibilities.
this is hilarious
you are either from the EU or living a couple decades in the past
Agreed. It might not be THE BEST solution, but it is a solution that appears to work well.
Centralization bad yada yada. But if Cloudflare can get most major AI players to participate, then convince the major CDN's to also participate.... ipso facto columbo oreo....standard.
yep, that's why I am writing this now :)
You can see it in the web vs mobile apps.
Many people may not see a problem with walled gardens, but the reality is that we have much less innovation on mobile than on the web, because anyone can spin up a web server, whereas on mobile you have to publish an app through the App Store (Apple).
I'm not sure if things are as fine as you say they are. Certificate authorities were practically unheard of outside of corporate websites (and even then mostly restricted to login pages) until Let's Encrypt normalized HTTPS. Without the openness of Let's Encrypt, we'd still be sharing our browser history and search queries with our ISPs for data mining. Attestation providers have so far refused to revoke attestation for known-vulnerable devices (because customers needing to replace thousands of devices would be an unacceptable business decision), making the entire market rather useless.
That said, what I am missing from these articles is an actual solution. Obviously we don't want Cloudflare to become an internet gatekeeper. It's a bad solution. But: it's a bad solution to an even worse problem.
Alternatives do exist, even decentralised ones, in the form of remote attestation ("can't access this website without secure boot and a TPM and a known-good operating system"), paying for every single visit or for subscriptions to every site you visit (which leads to centralisation because nobody wants a subscription to just your blog), or self-hosted firewalls like Anubis that mostly rely on AI abuse being the result of lazy or cheap parties.
People drinking the AI Kool-Aid will tell you to just ignore the problem, pay for the extra costs, and scale up your servers, because it's *the future*, but ignoring problems is exactly why Cloudflare still exists. If ISPs hadn't ignored spoofing, DDoS attacks, botnets within their network, """residential proxies""", and other such malicious acts, Cloudflare would've been an Akamai competitor rather than a middle man to most of the internet.
Certificate authorities don't block humans if they 'look' like a bot
AI poisoning is a better protection. Cloudflare is capable of serving stashes of bad data to AI bots as protective barrier to their clients.
AI poisoning is going to get a lot of people killed, because the AI won't stop being used.
The current state of the art in AI poisoning is Nightshade[0] from the University of Chicago. It's meant to eventually be an addon to their WebGlaze[1], which is an invite-only tool meant for artists to protect their art from AI mimicry.
Nobody is dying because artists are protecting their art
[0] https://nightshade.cs.uchicago.edu/whatis.html
[1] https://glaze.cs.uchicago.edu/webglaze.html
By that logic AI is already killing people. We can't presume that whatever can be found on the internet is reliable data, can we?
If science taught us anything, it's that no data is ever reliable. We are pretty sure about so many things, and it's the best available info, so we might as well use it. But in terms of "the internet can be wrong": any source can be wrong! And I'd not even be surprised if the internet in aggregate (with the bot reading all of it) is right more often than individual authors of pretty much anything.
Okay, let them
You don't think that the AI companies will take efforts to detect and filter bad data for training? Do you suppose they are already doing this, knowing that data quality has an impact on model capabilities?
> The current state of the art in AI poisoning is Nightshade[0] from the University of Chicago. It's meant to eventually be an addon to their WebGlaze[1] which is an invite-only tool meant for artists to protect their art from AI mimicry
If these companies are adding extra code to bypass artists trying to protect their intellectual property from mimicry then that is an obvious and egregious copyright violation
More likely it will push these companies to actually pay content creators for the content they work on to be included in their models.
[0] https://nightshade.cs.uchicago.edu/whatis.html
[1] https://glaze.cs.uchicago.edu/webglaze.html
They will learn to pay for high quality data instead of blindly relying on internet contents.
Are they? Until Let's Encrypt came along and democratised the CA scene, it was a hell hole. Web security depended on how deep your pockets were. One can argue that the same path is being laid in front of us until a Let's Encrypt comes along and democratises it. And since this is about attestation, how are we going to prevent gatekeepers from doing "selective attestation with arguable criteria"? How will we prevent political pressure?
We have far too many gatekeepers as it is. Any attempt to add any more should be treated as an act of aggression.
Cloudflare seems very vocal about its desire to become yet another digital gatekeeper as of late, and so is Google. I want both reduced to rubble if they persist in it.
Several companies are looking to provide a solution for the AI bot problem. Cloudflare stands to make a lot of money if people pick their solution. But Cloudflare backing down won't make the problem go away, and someone else's bad solution will be chosen instead.
The gatekeeping described here is gatekeeping a website owner chooses. It's an alternative to pay walls, bespoke bot detection, or some kind of ID verification. Cloudflare already provides a service, but standardising the service will open up the market (at the cost of competitors adopting Cloudflare's standard).
The freedom of the open web also extends to the owners of the websites people visit.
What do you mean Google "desires" to become a gatekeeper? They have been a gatekeeper for years, since they control the browser everyone uses, and Firefox usage is now in the noise. Google just steers the www where they want it to go. Killing ublock, pushing .webp trash, etc.
> An allowlist run by ONE company?
An allowlist run by one company that site owners chose to engage with. But the irony of taking an ideological stance about fairness while using AI generated comics for blog posts…
Cloudflare is implementing the (still-emerging) Web Bot Auth standard. We're working on the same at Stytch for https://IsAgent.dev .
The discourse around this is a little wild and I'm glad you said this. The allowlist is a Cloudflare feature and their customers are free to use it. The core functionality involving HTTP Message Signatures is decentralized and open, so anyone can adopt it and benefit.
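To make concrete how the open core works: Web Bot Auth builds on HTTP Message Signatures (RFC 9421), where a bot signs each request with a private key and any origin can verify against the bot's published public key, with no central party in the verification path. The sketch below is illustrative only (it uses Ed25519 via the `cryptography` package, and its signature base is simplified relative to RFC 9421, which also specifies component ordering and an `@signature-params` line):

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric import ed25519

def signature_base(method: str, authority: str, path: str) -> bytes:
    # Simplified stand-in for the RFC 9421 signature base: the covered
    # request components, one per line, in an order both sides agree on.
    return (
        f'"@method": {method}\n'
        f'"@authority": {authority}\n'
        f'"@path": {path}'
    ).encode()

def sign_request(priv: ed25519.Ed25519PrivateKey,
                 method: str, authority: str, path: str) -> bytes:
    # The bot operator signs the covered components of each request.
    return priv.sign(signature_base(method, authority, path))

def verify_request(pub: ed25519.Ed25519PublicKey, sig: bytes,
                   method: str, authority: str, path: str) -> bool:
    # Any origin holding the bot's published public key can check the
    # signature itself -- no allowlist operator required.
    try:
        pub.verify(sig, signature_base(method, authority, path))
        return True
    except InvalidSignature:
        return False
```

The contested part is not this mechanism but who maintains the directory of trusted public keys.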
> An allowlist run by one company that site owners chose to engage with.
Exactly, no problem with that, just hinting that's not a protocol.
> But the irony of taking an ideological stance about fairness while using AI generated comics for blog posts
Wait, what?
> Wait, what?
I was referring to the following image:
https://substackcdn.com/image/fetch/$s_!zRK-!,w_1250,h_703,c...
I know the image, what I do not understand is the argument between using it being incompatible with "fairness" and "openness"
I can't speak for the other commenter, but I think companies like Midjourney and OpenAI are robber barons exploiting people's creative work in ways that obviously aren't fair, but that our legal system wasn't equipped to prevent.
GenAI image generation is not fair.
also: “Cloudelare” ;-P
It's a frying pan/fire choice that could create a de-facto standard we end up depending on, during a critical moment where the hot topic could have a protocol or standards based solution. Cloudflare is actively trying to make a blue ocean for themselves of a real issue affecting everyone.
>But the irony of taking an ideological stance about fairness while using AI generated comics for blog posts…
"But you participate in society!"
This is sort of like how email is based on Internet standards but a large percentage of email users use Gmail. The Internet standards Cloudflare is promoting are open, but Cloudflare has a lot of power due to having so many customers.
(What are some good alternatives to Cloudflare?)
Another way the situation is similar: email delivery is often unreliable and hard to implement due to spam filters. A similar thing seems to be happening to the web.
It is a big problem. There is no good alternative to Cloudflare as a free CDN. They put servers all over the world and they are giving them away for free. And making their money on premium serverless services.
Not to mention the big cloud providers are unhinged with their egress pricing.
> Not to mention the big cloud providers are unhinged with their egress pricing.
I always wonder why this status quo persisted even after Cloudflare. Their pricing is indeed so unhinged, that they're not even in consideration for me for things where egress is a variable.
Why is egress seemingly free for Cloudflare or Hetzner but feels like they launch spaceships at AWS and GCP every time you send a data packet to the outside world?
They are just greedy. And they know nobody can compete with them for availability in every country. Except for Cloudflare, which is why it is so popular.
The web doesn't need attestation. It doesn't need signed agents. It doesn't need Cloudflare deciding who's a "real" user agent. It needs people to remember that "public" means PUBLIC and implement basic damn rate limiting if they can't handle the traffic.
The web doesn't need to know if you're a human, a bot, or a dog. It just needs to serve bytes to whoever asks, within reasonable resource constraints. That's it. That's the open web. You'll miss it when it's gone.
Basic damn rate limiting is pretty damn exploitable. Even ignoring botnets (which is impossible), usefully rate limiting IPv6 is anything but basic. If you just pick some prefix from /48 to /64 to key your rate limits on, you'll either be exploitable by IPs from providers that hand out /48s like candy or you'll bucket a ton of mobile users together for a single rate limit.
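The /48-vs-/64 dilemma can be made concrete with a small sketch (illustrative, not from the thread; the prefix length, window, and budget are made-up parameters): keying the limiter on a middle-ground /56 shrinks both failure modes without eliminating either, which is the commenter's point.

```python
import ipaddress
import time
from collections import defaultdict

PREFIX_LEN = 56     # compromise between /48 and /64 keying
WINDOW_SECS = 60
MAX_REQUESTS = 100

_buckets: dict = defaultdict(list)  # bucket key -> request timestamps

def allow_request(client_ip: str, now: float = None) -> bool:
    """Return True if this request fits the per-prefix budget."""
    now = time.time() if now is None else now
    addr = ipaddress.ip_address(client_ip)
    if addr.version == 6:
        # A /56 key limits a single /48 holder to 256 buckets instead of
        # 65536 (vs keying on /64), while splitting a carrier's /48 into
        # 256 buckets instead of one (vs keying on /48). Both failure
        # modes remain, only smaller.
        key = str(ipaddress.ip_network(f"{client_ip}/{PREFIX_LEN}",
                                       strict=False))
    else:
        key = client_ip  # IPv4: key on the single address
    hits = _buckets[key]
    hits[:] = [t for t in hits if now - t < WINDOW_SECS]  # slide window
    if len(hits) >= MAX_REQUESTS:
        return False
    hits.append(now)
    return True
```

Against a botnet spread over thousands of unrelated prefixes, no choice of `PREFIX_LEN` helps, which is why "basic damn rate limiting" stops being basic.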
You make unauthenticated requests cheap enough that you don't care about volume. Reserve rate limiting for authenticated users where you have real identity. The open web survives by being genuinely free to serve, not by trying to guess who's "real."
A basic Varnish setup should get you most of the way there, no agent signing required!
Your response to unauthenticated requests could be <h1>Hello world</h1> served from memory and your server/link will still fail under a volumetric attack, and you still get the pleasure of paying for the bandwidth.
So no, this advice has been outdated for decades.
Also you're doing some sort of victim blaming where everyone on earth has to engineer their service to withstand DoS instead of outsourcing that to someone else. Abusers outsource their attacks to everyone else's machine (decentralization ftw!), but victims can't outsource their defense because centralization goes against your ideals.
At least lament the naive infrastructure of the internet or something, sheesh.
We started with "AI crawlers are too aggressive" and you've escalated to volumetric DDoS. These aren't the same problem. OpenAI hitting your API too hard is solved by caching, not by Cloudflare deciding who gets an "agent passport."
"Victim blaming"? Can we please leave these therapy-speak terms back in the 2010s where they belong and out of technical discussions? If expecting basic caching is victim blaming, then so is expecting HTTPS, password hashing, or any technical competence whatsoever.
Your decentralization point actually proves mine: yes, attackers distribute while defenders centralize. That's why we shouldn't make centralization mandatory! Right now you can choose Cloudflare. With attestation, they become the web's border control.
The fine article makes it clear what this is really about - Cloudflare wants to be the gatekeeper for agent traffic. Agent attestation doesn't solve volumetric attacks (those need the DDoS protection they already sell, no new proposal required!) They're creating an allowlist where they decide who's "legitimate."
But sure, let's restructure the entire web's trust model because some sites can't configure a cache. That seems proportional.
OpenAI hitting your static, cached pages too hard and costing you terabytes of extra bandwidth that you have to pay for (both in bandwidth itself and data transfer fees) isn't solved by caching.
The post you're replying to points out that, at a certain scale, even caching things in-memory can cause your system to fall over when user agents (e.g. AI scraper bots) are behaving like bad actors, ignoring robots.txt, and fetching every URL twenty times a day while completely ignoring cache headers/last modified/etc.
Your points were all valid when we were dealing with either "legitimate users", "legitimate good-faith bots", and "bad actors", but now the AI companies' need for massive amounts of up-to-the-minute content at all costs means that we have to add "legitimate bad-faith bots" to the mix.
> Agent attestation doesn't solve volumetric attacks (those need the DDoS protection they already sell, no new proposal required!) They're creating an allowlist where they decide who's "legitimate."
Agent attestation solves overzealous AI scraping which looks like a volumetric attack, because if you refuse to provide the content to the bots then the bots will leave you alone (or at least, they won't chew up your bandwidth by re-fetching the same content over and over all day).
Well, your post escalated to the broad claim that I responded to.
You didn't just disagree with AI crawler attestation: you're saying that nobody should distinguish earnest users from everything else because they should bear the cost of serving both, which necessarily entails bad traffic and incidental DoS.
Once again, services like CloudFlare exist because a cache isn't sufficient to deal with arbitrary traffic, and the scale of modern abuse is so large that only a few megacorps can provide the service that people want.
> You make unauthenticated requests cheap enough that you don't care about volume.
In the days before mandatory TLS it was so easy to set up a Squid proxy on the edge of my network and cache every plain-HTTP resource for as long as I want.
Like yeah, yeah, sure, it sucked that ISPs could inject trackers and stuff into page contents, but I'm starting to think the downsides of mandatory TLS outweigh the upsides. We made the web more Secure at the cost of making it less Private. We got Google Analytics and all the other spyware running over TLS and simultaneously made it that much harder for any normal person to host anything online.
You can still do that, you have the caching reverse proxy at the edge of the network be the thing that terminates TLS.
Not really. At minimum you will break all of these sites on the HSTS preload list: https://source.chromium.org/chromium/chromium/src/+/main:net...
Public key pinning was rejected so you just need your proxy to also supply a certificate that's trusted by your clients.
I guess you should start a Cloudflare competitor that just puts a cheap Varnish VM in front of websites to solve bots forever.
What you're proposing is that a lot of small websites should simply shut down, in the name of the open internet. The goals seem self contradictory.
Modern AI crawlers are indistinguishable from malicious botnets. There's no longer any rate-limiting strategy that's effective; that's entirely the point of what Cloudflare is attempting to solve.
"It needs people to remember that "public" means PUBLIC and implement basic damn rate limiting if they can't handle the traffic."
And publish the acceptable rate.
But anyone who has ever been blocked for sending a _single_ HTTP request with the "wrong" user-agent string knows that the issue website operators are worried about is not necessarily rate (behaviour). Website operators routinely believe there is no such thing as a well-behaved bot. Thus they disregard behaviour and only focus on identity. If their crude heuristics with high probability of false positives suggest "bot" as the identity then their decision is to block, irrespective of behaviour, and ignore any possibility the heuristics may have failed. Operators routinely make (incorrect) assumptions about intent based on identity not behaviour.
Yes, I think that you are right (although rate limiting can sometimes be difficult to work properly).
Delegation of authorization can be useful for things that require it (as in some of the examples given in the article), but public files should not require authorization nor authentication for accessing it. Even if delegation of authorization is helpful for some uses, Cloudflare (or anyone else, other than whoever is delegating the authorization) does not need to be involved in them.
> public files should not require authorization nor authentication for accessing it
Define "public files" in this case?
If I have a server with files, those are my private files. If I choose to make them accessible to the world then that's fine, but they're still private files and no one else has a right to access them except under the conditions that I set.
What Cloudflare is suggesting is that content owners (such as myself, HN, the New York Times, etc.) should be provided with the tools to restrict access to their content if unfettered access to all people is burdensome to them. For example, if AI scraper bots are running up your bandwidth bill or server load, shouldn't you be able to stop them? I would argue yes.
And yet you can't. These AI bots will ignore your robots.txt, they'll change user agents if you start to block their user agents, they'll use different IP subnets if you start to block IP subnets. They behave like extremely bad actors and ignore every single way you can tell them that they're not welcome. They take and take and provide nothing in return, and they'll do so until your website collapses under the weight and your readers or users leave to go somewhere else.
> For example, if AI scraper bots are running up your bandwidth bill or server load, shouldn't you be able to stop them? I would argue yes
I also say yes, but this is not because of a lack of authorization; it is because of excessive server load (which is what you describe).
Allowing other public mirrors of files would be one thing that can be helpful (providing archive files might also sometimes be useful), although that does not actually prevent excessive scraping, due to the scrapers' bad behaviour (which is also what you describe).
Some people may use Cloudflare, but Cloudflare has its own problems: a lot of legitimate access is also blocked, while not all illegitimate access is necessarily prevented, and it sometimes causes additional problems (this might sometimes be due to misconfiguration, but not necessarily always).
> These AI bots will ignore your robots.txt, they'll change user agents if you start to block their user agents, they'll use different IP subnets if you start to block IP subnets
In my experience they change user agents and IP subnets whether or not you block them, and regardless of what else you might do.
> within reasonable resource constraints
And let’s all hold hands and sing koombaya
I agree with pretty much everything the author has said. I’ve been looking at the problem more on the enterprise side of things: how do you control what agents can and can’t do on a complex private network, let alone the internet.
I’ve actually just built an “identity token” using biscuit that you can delegate however you want after. So I can authenticate (to my service, but it could be federated or something just as well), get a token, then choose to create a delegated identity token from that for my agent. Then my agent could do the same for subagents.
In my system, you then have to exchange your identity token for an authorization token to do anything (single scope, single use).
For the internet, I’ve wondered about exchanging the identity token + a small payment (like a minuscule crypto amount) for an authorization token. Human users would barely spend anything. Bots crawling the web would spend a lot.
Maybe the title means something more like "The web should not have gatekeepers (Cloudflare)". They do seem to say as much toward the end:
>We need protocols, not gatekeepers.
But until we have working protocols, many webmasters literally do need a gatekeeper if they want to realistically keep their site safe and online.
I wish this weren't the case, but I believe the "protocol" era of the web was basically ended when proprietary web 2.0 platforms emerged that explicitly locked users in with non-open protocols. Facebook doesn't want you to use Messenger in an open client next to AIM, MSN, and IRC. And the bad guys won.
But like I said, I hope I'm wrong.
>We need protocols, not gatekeepers
The funny thing is that this blog post is complaining about a proposed protocol from Cloudflare (one which will identify bots so that good bots can be permitted). The signup form is just a method to ask Cloudflare (or any other website owner/CDN) to be categorized as a good bot.
It's not a great protocol if you're in the business of scraping websites or selling people bots to access websites for them, but it's a great protocol for people who just want their website to work without being overwhelmed by the bad side of the internet.
The whitelist approach Cloudflare takes isn't good for the internet, but for website owners who are already behind Cloudflare, it's better than the alternative. Someone will need to come up with a better protocol that also serves the website owners' needs if they want Cloudflare to fail here. The AI industry simply doesn't want to cooperate, so their hand must be forced, and only companies like Cloudflare are powerful enough to accomplish that.
Conventional crawlers already have a way to identify themselves, via a json file containing a list of IP addresses. Cloudflare is fully aware of this defacto standard.
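For reference, consuming that de-facto standard takes only a few lines. The URL and JSON field names below follow Google's published Googlebot list and would differ for other crawlers; treat both as assumptions if you adapt this:

```python
import ipaddress
import json
import urllib.request

# Big crawlers publish a JSON file of their egress IP ranges; a site
# operator can verify that a request claiming to be that crawler really
# comes from one of them.
GOOGLEBOT_RANGES_URL = (
    "https://developers.google.com/search/apis/ipranges/googlebot.json"
)

def load_ranges(url: str = GOOGLEBOT_RANGES_URL) -> list:
    """Fetch and parse the published prefix list into network objects."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = json.load(resp)
    nets = []
    for entry in data.get("prefixes", []):
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if prefix:
            nets.append(ipaddress.ip_network(prefix))
    return nets

def is_official_crawler(client_ip: str, nets: list) -> bool:
    """True if client_ip falls inside any published crawler range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in nets)
```

The limitation, of course, is that this only identifies crawlers that want to be identified; the scrapers this thread complains about publish nothing.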
I think the reality is, we need identity on both the client and server sides.
At some point soon, if not now, assume everything is generated by AI unless proven otherwise using a decentralized ID.
Likewise, on the server side, assume it’s a bot unless proven otherwise using a decentralized ID.
We can still have anonymity using decentralized IDs. An identity can be an anonymous identity, it’s not all (verified by some central official party) or nothing.
It comes down to different levels of trust.
Decoupling identity and trust is the next step.
It's called an IP address. Since some ISPs don't assign a fixed IP to a subscriber, a timestamp is nowadays necessary. The combination is traceable to a subscriber who is responsible for the line, either to work with law enforcement if subpoenaed or to not send abusive traffic via the line themselves
Why law enforcement doesn't do their job, resulting in people not bothering to report things anymore, is imo the real issue here. Third party identification services to replace a failing government branch is pretty ugly as a workaround, but perhaps less ugly than the commercial gatekeepers popping up today
DID spec, also used in ATProto, is quite flexible. It would be nice to see it used in more places and processes
https://www.w3.org/TR/did-1.1/
I pretty much use Perplexity exclusively at this point, instead of Google. I'd rather just get my questions answered than navigate all of the ads and slowness that Google provides. I'm fine with paying a small monthly fee, but I don't want Cloudflare being the gatekeeper.
Perhaps a way to serve ads through the agents would be good enough. I'd prefer that to be some open protocol than controlled by a company.
This has been my experience more recently as well, I've finally migrated from google to Brave Search since google was just slow for me.
I also appreciate the AI search results a bit when im looking for something very specific (like what the yaml definition for a docker swarm deployment constraint looks like) because the AI just gives me the snippet while the search results are 300 medium blog posts about how to use docker and none of them explain the variables/what each does. Even the official docker documentation website is a mess to navigate and find anything relevant!
Perplexity has been one of the AI companies that created the problem that gave rise to this CF proposal. Why doesn't Perplexity invest more into being a responsible scraper?
https://blog.cloudflare.com/perplexity-is-using-stealth-unde...
Re-read what I wrote.
Perplexity is the problem Cloudflare and companies like it are trying to solve. The company refuses to take no for an answer and will mislead and fake their way through until they've crawled the content they wanted to crawl.
The problem isn't just that ads can't be served. It's that every technical measure to attempt to block their service produces new ways of misleading website owners and the services they use. Perplexity refuses any attempt at abuse detection and prevention from their servers.
None of this would've been necessary if companies like Perplexity would've just acted like a responsible web service and told their customers "sorry, this website doesn't allow Perplexity to act on your behalf".
The open protocol you want already exists: it's the user agent. A responsible bot will set the correct user agent, maybe follow the instructions in robots.txt, and leave it at that. Companies like Perplexity (and many (AI) scrapers) don't want to participate in such a protocol. They will seek out and abuse any loopholes in any well-intended protocol anyone can come up with.
I don't think anyone wants Cloudflare to have even more influence on the internet, but it's thanks to the growth of inconsiderate AI companies like Perplexity that these measures are necessary. The protocol Cloudflare proposes is open (it's just a signature); the problem people have with it is that they have to ask Cloudflare nicely to permit website owners to track and prevent abuse from bots. For any Azure-gated websites, your bot would need to ask permission there as well, as with Akamai-gated websites, and maybe even individual websites.
A new protocol is a technical solution. Technical solutions work for technical problems. The problem Cloudflare is trying to solve isn't a technical problem; it's a social problem.
>but I don't want Cloudflare being the gatekeeper
Cloudflare is not the gatekeeper, it's the owner of the site that blocks Perplexity that's "gatekeeping" you. You're telling me that's not right?
Cloudflare is being really annoying lately. It looks like they desperately want to close the web to collect their 30% cut of AI crawling fees.
Cloudflare slows the whole damn websites down. It takes many seconds to deal with their trash. I hope they crash and burn. Let's get back to very low latency websites without the cloudflare garbage.
Cloudflare as a CDN greatly greatly speeds up the web.
All the custom code they write on top of that to transform HTML for you? Ehhhh... don't use those features. Most are easily reproducible on the backend.
We don't need gatekeepers. We do need to verify agents that act, in a reasonable way, on behalf of a human vs an agent swarm/bot-mining operation (whether conducted by a large lab or a kid programming Claude Code to ddos his buddy's next.js deployment).
So Cloudflare becomes the gatekeeper then?
I kind of want my site to be indexed with agents and used without any interference
By not using Cloudflare your website will be indexed by everyone. The gatekeeper aspect only applies if you use Cloudflare to distribute your website (and even then Cloudflare offers options to control this bot shield thing).
I want it to be indexed by everyone, that's the whole point.
So what then Cloudflare can use all these websites as leverage against Google, OpenAI and Microsoft? I kind of want my content to be indexed.
The content you host will only be blocked from being indexed if you decide to use a service that blocks indexing. If you host your content on other people's services, then you never had the power to make that decision anyway.
If you want your content to be indexed, simply don't use Cloudflare. Host your own servers. Use a different CDN if you want the benefits of Cloudflare's networks.
The private tracker community has long figured this out. Put content behind invite-only user registration, and treeban users if they ever break the rules.
This doesn't scale to the general web, does it? I think invite-only might work to build communities, but you end up in the situation we're in today where people are buying/selling invites, and that's with treebans in place.
I do fear the current bot landscape is going to push almost everything behind auth walls though, and perhaps even paid auth walls.
I've been considering making this for the web. Why wouldn't it scale? Those selling invites would get banned soon enough if the people they distribute invites to then send abusive traffic. Mystery shoppers can also make that a risky business if selling invites is disallowed (forcing them to be mostly free, so the giver has nothing to gain from inviting someone who is willing to pay).
The bigger practical problem I saw was bootstrapping: how do you convince any website owner to use it when very few people are on the system? Where would they find someone to get invites from?
As for tracking (auth walls), the website needs not know who you are. They just see random tokens with signatures and can verify the signature. If there's abuse, they send evidence to the tree system, where it could be handled similarly to HN: lots of flags from different systems will make an automated system kick in, but otherwise a person looks at the issue and decides whether to issue a warning or timeout. (Of course, the abuse reporting mechanism can also be abused so, again similar to HN, if you abuse the abuse mechanism then you don't count towards future reports.)
Ideally, we'd not need this and let real judges do the job of convicting people of abuse and computer fraud, but until such time, I'd rather use the internet anonymously with whatever setup I like than face blocks regularly while doing nothing wrong
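The token scheme described above can be sketched in a few lines. This uses a shared-secret HMAC for brevity (a real system would use public-key signatures so sites can verify without holding the signing key), and the flag threshold of 3 is made up, mirroring the HN-style automation mentioned:

```python
import hashlib
import hmac
import secrets

# Hypothetical "tree system" signing key (shared secret for this sketch only).
TREE_KEY = secrets.token_bytes(32)

def issue_token() -> tuple[str, str]:
    """The tree system hands out a random, pseudonymous token plus a signature."""
    token = secrets.token_hex(16)
    sig = hmac.new(TREE_KEY, token.encode(), hashlib.sha256).hexdigest()
    return token, sig

def verify(token: str, sig: str) -> bool:
    """A website checks the signature without learning who the visitor is."""
    expected = hmac.new(TREE_KEY, token.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)

# Abuse handling: enough flags from *different* sites trigger an automated
# timeout; otherwise a human reviews, as described above.
FLAG_THRESHOLD = 3
flags: dict[str, set[str]] = {}

def report_abuse(token: str, reporting_site: str) -> str:
    flags.setdefault(token, set()).add(reporting_site)
    return "timeout" if len(flags[token]) >= FLAG_THRESHOLD else "pending"
```

The key property is that websites only ever see random tokens, never identities; accountability lives in the invite tree behind the signing key.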
I don't think it scales, because I'm not sure it scales on private trackers already. I'm not deep into that space, but I think there are a lot of problems with it that will grow as adoption grows, particularly around policing the sale of invites. The hope would be that it self-polices through treebanning, but I'm not sure it does.
I think a sort of pseudo-anonymous auth system with baked-in invites and treebans that website owners could easily adopt is interesting though. I'm not sure it's a business — for adoption reasons it likely needs to be a protocol — but it's an interesting idea, if it doesn't just turn into a huge admin headache for publishers.
With what they say about authorization, I think X.509 would help. (Although central certificate authorities are often used with X.509, it does not have to be that way; the service you are operating can issue the certificate to you instead, or they can accept a self-signed certificate which is associated with you the first time it is used to create an account on their service.)
You can use the admin certificate issued to you, to issue a certificate to the agent which will contain an extension limiting what it can be used for (and might also expire in a few hours, and also might be revoked later). This certificate can be used to issue an even more restricted certificate to sub-agents.
This is already possible (and would be better than the "fine-grained personal access tokens" that GitHub uses), but does not seem to be commonly implemented. It also improves security in other ways.
So, it can be done in such a way that Cloudflare does not need to issue authorization to you, or necessarily to be involved at all. Google does not need to be involved either.
However, that only covers things that should require authorization anyway. Reading public data should not require authorization; the problems there are excessive scraping (LLM scrapers and others hitting sites far too hard) and excessive blocking (e.g. someone using a different web browser, or curl to download one file, or even someone on a common browser and configuration where something strange and unexpected happens). The above is unrelated to that, so certificates and the like do not help, because they solve a different problem.
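The restricted-delegation idea above (admin credential → short-lived agent credential → further-restricted sub-agent credential) can be sketched without full X.509 machinery, here with HMAC chaining in the style of macaroons. All names, scopes, and lifetimes are hypothetical:

```python
import hashlib
import hmac
import json
import time

def _sign(key: bytes, payload: dict) -> bytes:
    return hmac.new(key, json.dumps(payload, sort_keys=True).encode(),
                    hashlib.sha256).digest()

def delegate(parent_sig: bytes, scope: set[str], expires_at: float):
    """Issue a link in the chain; each link can only narrow the scope."""
    payload = {"scope": sorted(scope), "expires_at": expires_at}
    return payload, _sign(parent_sig, payload)

def verify_chain(root_key: bytes, chain, action: str, now: float) -> bool:
    key = root_key
    allowed = None
    for payload, sig in chain:
        if not hmac.compare_digest(sig, _sign(key, payload)):
            return False                  # broken signature chain
        if now > payload["expires_at"]:
            return False                  # a link has expired
        scope = set(payload["scope"])
        allowed = scope if allowed is None else (allowed & scope)
        key = sig                         # the next link is signed with this sig
    return allowed is not None and action in allowed

root = b"admin-secret"  # stands in for the admin certificate's key
now = time.time()
link1 = delegate(root, {"read", "write"}, now + 7200)   # agent: 2 hours
link2 = delegate(link1[1], {"read"}, now + 600)         # sub-agent: read-only
```

Real X.509 gives you the same shape (plus revocation and standard tooling), which is the point being made: the mechanism exists, it's just rarely deployed this way.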
What problem does this solve that a basic API key doesn't solve already? The issue with that approach is that you will require accounts/keys/certificates for all hosts you intend to visit, and malicious bots can create as many accounts as they need. You're just adding a registration step to the crawling process.
Your suggested approach works for websites that want to offer AI access as a service to their customers, but the problem Cloudflare is trying to solve is that most AI bots are doing things that website owners don't want them to do. The goal is to identify and block bad actors, not to make things easier for good actors.
Using mTLS/client certificates also exposes people (that don't use AI bots) to the awful UI that browsers have for this kind of authentication. We'll need to get that sorted before an X509-based solution makes any sense.
I used to joke that I worked for the last DotCom startup, a company that got a funding round after the shit hit the fan.
They were working on an idea that looked a bit like an RSS feed for an entire website, where you would run your own spider and then our search engine could hit an endpoint to get a delta instead of having to scan your entire site.
If they’d made the protocol open instead of proprietary, we might have gotten spiders to play nicer, since each spider after the first would be cheaper to serve. Eventually someone could have built pub/sub hooks into common web frameworks, generating delta data whenever your data changed and potentially skipping the scan entirely for read-mostly websites.
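The delta idea fits in a few lines — a hypothetical change log plus the endpoint a spider would poll instead of re-crawling the whole site:

```python
# Hypothetical sketch: the site keeps a change log with a monotonically
# increasing cursor, and a spider asks a /delta endpoint for everything
# since its last visit instead of re-scanning every page.
change_log: list[tuple[int, str, str]] = []   # (cursor, url, event)
_cursor = 0

def record_change(url: str, event: str = "updated") -> int:
    """Called by the site (or a pub/sub hook in the web framework) on writes."""
    global _cursor
    _cursor += 1
    change_log.append((_cursor, url, event))
    return _cursor

def delta_since(cursor: int) -> list[tuple[int, str, str]]:
    """What GET /delta?since=<cursor> would return to a spider."""
    return [entry for entry in change_log if entry[0] > cursor]

record_change("/posts/1", "created")
checkpoint = record_change("/posts/1", "updated")
record_change("/posts/2", "created")
```

A spider remembers its last cursor and fetches only the handful of changed URLs, which is why each additional spider after the first would have been nearly free to serve.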
But of course when the next round of funding came due nobody was buying.
I thought about this a lot on my last project, where spiders were our customers’ biggest users. One of those apps where customer interactions were intense but brief and the rank in Google mattered equally with all other concerns. Nobody had architected for the actual read/write workflow of the system of course, and that company sold to a competitor after I left. Who migrated all customers to their solution and EOLed ours for being too fat in a down economy.
I wish Cloudflare would roll out AI poisoning attack as protection for their clients (providing bad data cache to AI bots), instead of this. Would work like a charm.
I can see a future where I don't use the internet at all.
Maybe not the Internet for me, but certainly the web. But I totally agree with the sentiment.
While I concur with the effective tech, I don't think this is something that's a net win for society.
Just because you can, doesn't mean you should and I don't feel any one entity (private or public) should be an arbiter on these matters.
This is something that can, and should, be negotiated at the "last virtual mile".
>Just because you can, doesn't mean you should and I don't feel any one entity (private or public) should be an arbiter on these matters.
What do you mean by private? Should I not be allowed to block AI agents on my sites using Cloudflare?
100% needs to be done at the last mile.
I think about this as a startup founder building a 'proof-of-human' layer on the Internet.
One of the hard parts in this space is what level of transparency should you have. We're advancing the thesis that behavioral biometrics offers robust continuous authentication that helps with bot/human and good/bad, but people are obviously skeptical to trust black-box models for accuracy and/or privacy reasons.
We've defaulted to a lot of transparency in terms of publishing research online (and hopefully in scientific journals), but we've seen the downside: competitors make fake claims about best-in-class in-house behavioral tools hidden behind their company walls, in addition to investors constantly worrying about an arms race.
As someone genuinely interested (and incentivized!) to build a great solution in this space, what are good protocols/examples to follow?
as a Cloudflare customer, I am happy with their proposition. I personally do not want companies like Perplexity that fake their user-agent and ignore my robots.txt to trespass.
and isn't this why people sign up with Cloudflare in the first place? for bot protection? to me, this is just the same, but with agents.
i love the idea of an open internet, but this requires all parties to be honest. a company like Perplexity that fakes their user-agent to get around blocks disrespects that idea.
my attitude towards agents is positive. if a user used an LLM to access my websites and web apps, i'm all for it. but the LLM providers must disclose who they are - that they are OpenAI, Google, Meta, or the snake oil company Perplexity
Your complaints about "faking their user-agent" reminds me of this 15-year-old but still-relevant, classic post about the history of the user-agent string:
https://webaim.org/blog/user-agent-string-history/
TLDR the UA string has always been "faked", even in the scenarios you might think are most legitimate.
The traditional UA fakery (adding Mozilla to the start and then just tacking on browser engine names) was a workaround for outdated websites that sniffed user-agent strings and would otherwise break in newer browsers.
The problematic fakery here is that bots are pretending to be people by emulating browsers to prevent rate limits and other technical controls.
That second category has also been with us since the dawn of the internet, but it has always been something worth complaining about. No trustworthy tool or service will pretend to be a real browser, at least not by default.
If AI agents just identified themselves as such, we wouldn't need elaborate schemes to block them when they need to be blocked.
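If agents did identify themselves, per-agent policy would reduce to a lookup table rather than fingerprinting. A sketch with made-up agent names:

```python
# With honest self-identification, a site's bot policy is a dictionary
# lookup, not an arms race. Agent names here are hypothetical.
AGENT_POLICY = {
    "GoodBot": "allow",        # identifies itself, behaves
    "GreedyCrawler": "block",  # identifies itself, but owner opted out
}

def classify(user_agent: str) -> str:
    """Return the site's policy for a request, based on declared identity."""
    for name, policy in AGENT_POLICY.items():
        if name.lower() in user_agent.lower():
            return policy
    return "allow"  # default: treat unknown UAs as regular visitors
```

The whole elaborate-scheme problem exists because bots that would land in the "block" row instead masquerade as the default case.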
The point is "should everyone just have an account with Cloudflare then"
Brought to you by substack. ;-) Seriously though, great post and a great conversation starter.
I actually thought about this before publishing it hahaha
Good thing they are not the only place to post!
I recently ran a test on the page load reliability of Browserbase and I was shocked to see how unreliable it was for a standard set of websites - the top 100 websites in the US by traffic according to SimilarWeb. 29% of page load requests failed. Without an open standard for agent identification, it will always be a cat and mouse game to trap agents, and many agents will predictably fail simple tasks.
https://anchorbrowser.io/blog/page-load-reliability-on-the-t...
Here's to working together to develop a new protocol that works for agents and website owners alike.
I would love to get off Cloudflare but there are no real good alternatives
Writing backends that can actually handle public traffic and using authentication for expensive resources are fantastic alternatives.
Also, cheaply rate limiting malicious web clients should be something that is trivial to accomplish with competent web tooling (i.e., on your own servers). If this seems out of scope or infeasible, you might be using the wrong tools for the job.
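For reference, the kind of cheap rate limiting meant here fits in a few lines — a per-client token bucket (the burst and refill numbers are illustrative):

```python
import time

# A minimal per-client token bucket. Capacity 10 and 1 request/sec refill
# are made-up numbers; tune per endpoint.
class TokenBucket:
    def __init__(self, capacity: float = 10, refill_per_sec: float = 1):
        self.capacity = capacity
        self.refill = refill_per_sec
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Keyed by client IP here; an API key or ASN works the same way.
buckets: dict[str, TokenBucket] = {}

def allow_request(client_ip: str) -> bool:
    return buckets.setdefault(client_ip, TokenBucket()).allow()
```

In production you'd back the bucket map with an LRU or Redis so it doesn't grow unbounded, but the core logic really is this small.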
If it were this easy, we wouldn't have had about 10 HN posts on the topic in the last few months.
The technical skills of the majority of the HN community are way below those of the typical computing community a generation ago.
Even if you write the best backend in the world where do you host them? AFAIK Cloudflare is the only free CDN.
GitHub pages?
You still have the network traffic issues which is very substantial
This sounds pretty unrealistic: the web is not better off if the only people who can host content are locking it behind authentication and/or have significant infrastructure budgets and the ability to create heavily tuned static stacks.
AWS is an alternative no?
Bankruptcy as a surprise gift is not an alternative. Even those that use big cloud providers like AWS and GCP use CDNs like Cloudflare to protect themselves. And there is no free CDN like Cloudflare.
> And there is no free CDN like Cloudflare.
Their pricing page says:
> No-nonsense Free Tier
> As part of the AWS free Usage Tier you can get started with Amazon CloudFront for free.
> Included in Always Free Tier:
> - 1 TB of data transfer out to the internet per month
> - 10,000,000 HTTP or HTTPS requests per month
> - 2,000,000 CloudFront Function invocations per month
> - 2,000,000 CloudFront KeyValueStore reads per month
> - 10 Distribution Tenants
> - Free SSL certificates
> - No limitations, all features available
1 TB per month of data is literally nothing. A kid could rent a VPS for an hour and drain all that. What do you do after that? AWS is not going to stop your bill from going up, is it?
I don't care about any of those fancy serverless services. I am just talking about the cheapest CDN.
Ah, for cheapest CDN, maybe you're right. I think BlazingCDN can also be cheap, but Cloudflare might be the best deal. OP didn't really say there wasn't any cheaper alternative, just said "no real good alternatives".
> Included in Always Free Tier
> 1 TB of data
Someone can rent a 1Gbps server for cheap (under $50 on OVH) and pull 330TB in a month from your site. That's about $30k of egress on AWS if you don't do anything to stop it.
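The arithmetic behind those numbers, assuming a flat ~$0.09/GB egress rate (an approximation; actual AWS tiers vary):

```python
# Back-of-the-envelope for the scenario above: one 1 Gbps server running
# flat out for a month against an AWS-hosted site.
gbps = 1
seconds_per_month = 30 * 24 * 3600
gb_pulled = gbps / 8 * seconds_per_month   # 1 Gbps = 0.125 GB/s
cost = gb_pulled * 0.09                    # assumed ~$0.09/GB egress

print(f"{gb_pulled / 1000:.0f} TB pulled, ~${cost:,.0f} egress")
```

That's roughly 324 TB and about $29k, so an attacker spending under $50 on a rented server can run up a five-figure bill.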
True, CloudFlare DDoS protection is unmatched, they just eat the cost for free.
AWS needs a dedicated AWS engineer, while any technical person (and some non-technical people) can set up Cloudflare, especially without surprise bills.
I thought the whole point of paying a fortune for AWS was to avoid having a dedicated engineer. It’s the cobol of the 21st century.
I always hear this, but honestly I'm not sure it's true.
It's hard to assess the validity of this versus Cloudflare having a really good marketing department.
I've used neither, so I can't say, but I've also never seen anyone truly explain why/why-not.
Why not use both and find out? Cloudflare is much less technical than AWS, but still a bit technical.
We were supposed to pentest a website on AWS WAF last week. We encountered three types of blocks:
1) hard block without having done any requests yet. No clue why. Same browser (Burp's built-in Chromium), same clean state, same IP address, but one person got a captcha and the other one didn't. It would just say "reload the page to try again" forever. This person simply couldn't use the site at all; not sure if that would happen if you're on any other browser, but since it allowed the other Burp Suite browser, that doesn't seem to be the trigger for this perma-ban. (The workaround was to clone the cookie state from the other consultant, but normal users won't have that option.)
2) captcha. I got so many captchas, like every 4th request. It broke the website (async functionality) constantly. At some point I wanted to try a number of passwords for an admin username that we had found and, to my surprise, it allowed hundreds of requests without captcha. It blocks humans more than this automated bot...
3) "this website is under construction" would sometimes appear. Similar to situation#1, but it seemed to be for specific requests rather than specific persons. Inputting the value "1e9" was fine, "1e999" also fine, but "1e99" got blocked, but only on one specific page (entering it on a different page was fine). Weird stuff. If it doesn't like whatever text you wrote on a support form, I guess you're just out of luck. There's no captcha or anything you can do about it (since it's pretending the website isn't online at all). Not sure if this was AWS or the customer's own wonky mod_security variant
I dread to think if I were a customer of this place and I urgently needed them (it's not a regular webshop but something you might need in a pinch) and the only thing it ever gives me is "please reload the page to try again". Try what again?? Give me a human to talk to, any number to dial!
Shouldn't this be seen as success? You weren't a normal user, you were trying to penetrate the site, and you got a bunch of friction?
On the first fricking pageload I got blocked and couldn't open it at all, no captcha shown. That's a success only insofar as you want to exclude random people who don't have a second person whose cookie state to copy
Also mind that not every request we make is malicious. A lot of it is also seeing what's even there, doing baseline requests, normal things. I didn't get the impression that I got blocked more on malicious requests than normal browsing at all (see also the part where a bot could go to town on a login form while my manual navigation was getting captchas)
Some websites will detect a Burp proxy and act accordingly. If you did your initial page load with any kind of integration like that, that's why the WAF may have blocked your request. I don't know exactly how they do it (my guess is fingerprinting the TLS handshake and TCP packet patterns), but I have seen several services do a great job at blocking any kind of analyzing proxy.
I hear you, but I find it suspicious. I mean CloudFront is used by over 10% of all CDN content online, and is used by Amazon itself.
It wouldn't just randomly block something.
It must be based on something no?
> The same is true online. A cryptographic signature that claims “I am acting on behalf of X” means nothing unless it is tied to something real, like a verifiable infrastructure or a range of IPs. Without that, I can simply hand the passport to another agent, and they can act as if they were me. The passport becomes nothing more than a token anyone can pass around.
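The quoted concern is easy to demonstrate: bearer-token verification (JWTs included) checks the signature, not the presenter, so anyone holding the token passes. A stdlib-only HS256-style sketch:

```python
import base64
import hashlib
import hmac
import json

# Minimal JWT-style token (HS256). This is a demonstration of the bearer
# property, not a replacement for a real JWT library.
SECRET = b"server-side-secret"

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_token(claims: dict) -> str:
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    sig = b64url(hmac.new(SECRET, f"{header}.{payload}".encode(),
                          hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

def verify_token(token: str) -> bool:
    # Note what is *not* checked: who sent the request. Any holder of the
    # token — the original agent or one it was handed to — verifies the same.
    header, payload, sig = token.split(".")
    expected = b64url(hmac.new(SECRET, f"{header}.{payload}".encode(),
                               hashlib.sha256).digest())
    return hmac.compare_digest(sig, expected)

token = make_token({"sub": "agent-on-behalf-of-X"})
```

Binding the claim to something the holder can't transfer (an IP range, verifiable infrastructure, mTLS) is exactly the gap the quoted passage is pointing at.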
how does this person think jwt’s work?
Hi, "this person" here. Cloudflare will block a request that has a JWT because "it does not come from a person".
What I was trying to say is that even the discussion "is this a bot 100% sure or not" makes no sense.
at its foundation, the bots issue is in fact 3 main issues:

- bots vs humans: humans trying to buy tickets that were already sold out to a bot.
- data scraping: you index my data (real estate listings) not to route traffic to my site as people search for my product, the way a search engine would, but to become my competitor.
- spam (and scam): digital pollution, or even worse, attempts to input stolen credit cards, gift cards, passwords, etc.

(obviously there are more, most of which will fall into those categories, but those are the main ones)
now, in the human-assisted AI era, the first issue is no longer an issue, since it is obvious that each of us, the internet users, will soon have an agent built into our browser. so we will all have the speedy automated select, click and checkout at our disposal.
Prior to the LLM era, there were search engines and academic research on the right side of the internet-bots map, and scrapers (and worse) on the wrong side. But now we have legitimate human users extending their interaction with an LLM agent, and on top of that, new AI companies large and small that hunger for data to train their models.
Cloudflare is simply trying to make sense of this, while keeping their bot protection relevant.
I do not appreciate the post content whatsoever, since it lacks consistency and maturity (a true understanding of how the internet works, rather than a naive one).
when you talk about "the internet", what exactly are you referring to? a blog? a bank account management app? a retail website? social media?
those are all part of the internet and each is a complete different type of operation.
EDIT:
I've written a few words about this back in January [1] and in fact suggested something similar:
https://blog.tarab.ai/p/bot-management-reimagined-in-the

This is like saying companies don't need security gates and checkpoints. Unfortunately the world is filled with bad people, and you need security to keep them off your property.
If the broader economic system wasn't based on what is essentially theft, security wouldn't be as necessary as it is.
Are bots using a large number of IP addresses simultaneously, so they look like a DDoS attack? Or are they just making ordinary requests from a small number of addresses? If it's the latter, all you need is some kind of fair queuing so those requests compete with each other for access, not with other users.
Bots are probing for access from various servers, eventually falling back to executing requests from residential IP addresses: https://blog.cloudflare.com/perplexity-is-using-stealth-unde...
Cloudflare is dealing with a couple million faked requests every day just from Perplexity users, and Perplexity is far from the worst player in the field.
The problem would be quite easy to solve with basic rate limiting if it weren't for the attempts to bypass access controls.
Often it is rotating residential proxies. It is virtually impossible to mitigate this behavior at the IP level.
They're using state of the art obfuscation that makes them indistinguishable from malicious botnets. It's an arms race with billion dollar companies vying to consume the most content before it all collapses
the open web is dead, and whatever's left will be locked behind authentication and paywalls
Discussion from yesterday: https://news.ycombinator.com/item?id=45055452
I understand the concerns around a central gatekeeper but I'm confused as to why this specifically is viewed negatively. Don't website owners have to choose to enable cloudflare and to opt-in to this gate that the site owners control?
If this was cloudflare going into some centralized routing of the internet and saying everything must do X then that would be a lot more alarming but at the end of the day the internet is decentralized and site owners are the ones who are using this capability.
Additionally, I don't think that I, as an individual website owner, would actually want to (or be capable of) knowing which agents are good and bad, so Cloudflare doing this would be helpful to me as a site owner, as long as they act in good faith. And the moment they stop acting in good faith, I can disable them. This is definitely a problem right now: unrestricted bot access means bad bots are burning many cycles, raising costs and taking resources away from real users.
Site owners are tricked and scared (by Cloudflare) into using Cloudflare when they don't need to. Cloudflare enjoys the customer growth, and the rest of us feel the pain.
I do like Cloudflare in general, but the whole anti-AI push is just another form of the Luddism surrounding AI since 2022. Cloudflare perhaps wisely picked up on this trend and decided to capitalize on it, but I think it would be a mistake to allow it to become their brand.
I would love that vision to become reality but what Cloudflare is doing is unfortunately necessary atm.
Ok, I'll bite. Why is turning the Internet into a walled garden necessary now?
Commercial, criminal, and state interests have far more resources than you do, and their interests are in direct conflict with yours.
That would be fine — you could walk away and go home — but if you're going to drive on their digital highways, you're going to need "insurance" just to protect you from everyone else.
Ongoing multi-nation WWIII-scale hacking and infiltration campaigns of infrastructure, AI bot crawling, search company and startup crawling, security researchers crawling, and maybe somebody doesn't like your blog and decides to rent a botnet for a week or so.
Bet your ISP shuts you off before then to protect themselves. (Happens all the time via BGP blackholing, DDoS scrubbing services, BGP FlowSpec, etc).
Multi-Tbps DDoS attacks, pervasive scanning of sites for exploits, comically expensive egress bandwidth on services like AWS, and ISPs disallowing hosting services on residential accounts.
Forcing tighter security on the devices causing the multi-Tbps DDoS attacks would be a better option, no? Cheap unsecured IoT devices are a problem.
It's not just computers anymore. Web-enabled CCTV and doorbell cameras are all culprits.
And home routers, printers, and end user devices themselves. Residential ISP networks can be infiltrated and remote CVE'd through browser calls at this point from a remote website. It's not even hard.
How would you secure someone else's devices?
or we could just require postage. (HTTP status code 402)
one potential solution: https://www.l402.org/
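A sketch of what postage could look like at the HTTP layer. Note the header name and invoice format here are invented for illustration, not taken from the L402 spec:

```python
# HTTP 402 "postage" sketch. A request without a payment receipt gets a
# 402 plus a challenge; a request with a valid receipt gets the content.
# "X-Payment-Receipt" and the invoice string are hypothetical.

def valid_receipt(receipt: str) -> bool:
    # Placeholder: a real system would verify a payment preimage/receipt
    # against the invoice it issued.
    return receipt == "paid-abc123"

def handle_request(headers: dict) -> tuple[int, dict, str]:
    """Return (status, response_headers, body) for an incoming request."""
    receipt = headers.get("X-Payment-Receipt")
    if receipt is None:
        challenge = {
            "WWW-Authenticate": 'Payment invoice="pay-abc123", amount="10 sats"'
        }
        return 402, challenge, "Payment Required"
    if not valid_receipt(receipt):
        return 402, {}, "Invalid receipt"
    return 200, {}, "<html>the actual page</html>"
```

Scrapers would have to pay per request, which prices out bulk crawling while staying negligible for individual human visitors.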
I think it shouldn't require registering /with/ Cloudflare. Cloudflare should just look up the referenced .well-known key, double-check for impersonation, and keep score on how well-behaved each one is.
Using completely automated means would leave open the possibility of setting up a new signature for every single request, or for batches of requests. The manual step is there to cut down on the amount of automated abuse.
This is one of the main points :+1:
Don't think the base game plan here is necessarily all that bad. It being concentrated in one for profit entity however very much is
Should we use a public blockchain for this? It's good for that: storing public keys, verifying signatures, etc. None of that token stuff, though.
No
I suppose it’s time AI proselytization rediscovered the tragedy of the commons.
Cloudflare lost a lot of credibility by backing off its "neutral" stance and booting certain sites — some admittedly horrible — from their service. Now it seems they want to be even more of a gatekeeper.
"In the 90s, Microsoft tried to “embrace and extend” the web, but failed. And that failure was a blessing."
Basically MS tried to kill the web with their Win95 release, the infamous Internet Explorer and their shitty IIS/Frontpage tandem.
I deeply hate them since that day.
many people don't remember/know history though
Good. Accelerate.
I’m not necessarily coming to the defense of CF’s proposed solution, but it’s ridiculous and rather telling that the article mounts such a strong defense for agents around the notion they are simply completing user-directed tasks the user would otherwise do themselves, while avoiding the blatantly obvious issues of copyright, attribution, resource overusage, etc. presented by agents.
It’s somewhat ironic to let fly the “free and open internet” battle cry on behalf of an industry that is openly destroying it.
Wait till these robots get out in the real world and start overwhelming real world resources.
> Without that, I can simply hand the passport to another agent, and they can act as if they were me.
This isn't the problem Cloudflare are trying to solve here. AI scraping bots are a trigger for them to discuss this, but this is actually just one instance of a much larger problem — one that Cloudflare have been trying to solve for a while now, and which ~all other cloud providers have been ignoring.
My company runs a public data API. For QoS, we need to do things like blocking / rate-limiting traffic on a per-customer basis.
This is usually easy enough — people send an API key with their request, and we can block or rate-limit on those.
But some malicious (or misconfigured) systems, may sometimes just start blasting requests at our API without including an API key.
We usually just want to block these systems "at the edge" — there's no point to even letting those requests hit our infra. But to do that, without affecting any of our legitimate users, we need to have some key by which to recognize these systems, and differentiate them from legitimate traffic.
In the case where they're not sending an API key, that distinguishing key is normally the request's IP address / IP range / ASN.
The problematic exception, then, is Workers/Lambda-type systems (a.k.a. Function-as-a-Service [FaaS] providers) — where all workloads of all users of these systems come from the same pool of shared IP addresses.
---
And, to interrupt myself for a moment, in case the analogy isn't clear: centralized LLM-service web-browsing/tool-use backends, and centralized "agent" orchestrators, are both effectively just FaaS systems, in terms of how the web/MCP requests they originate, relate to their direct inbound customers and/or registered "agent" workloads.
Every problem of bucketing traditional FaaS outbound traffic, also applies to FaaSes where the "function" in question happens to be an LLM inference process.
"Agents" have made this concern more urgent/salient to increasingly-smaller parts of the ecosystem, who weren't previously considering themselves to be "data API providers." But you can actually forget about AI, and focus on just solving the problem for the more-general category of FaaS hosts — and any solution you come up with, will also be a solution applicable to the "agent formulation" of the problem.
---
Back to the problem itself:
The naive approach would be to block the entire FaaS's IP range the first time we see an attack coming from it. (And maybe some API providers can get away with that.)
But as long as we have at least one legitimate customer whose infrastructure has been designed around legitimate use of that FaaS to send requests to us, then we can't just block that entire FaaS's IP range.
(And sure, we could block these IP ranges by default, and then try to get such FaaS-using customers to send some additional distinguishing header in their requests to us, that would take priority over the FaaS-IP-range block... but getting a client engineer to implement an implementation-level change to their stack, by describing the needed change in a support ticket as a resolution to their problem, is often an extreme uphill battle. Better to find a way around needing to do it.)
So we really want/need some non-customer-controlled request metadata to match on, to block these bad FaaS workloads. Ideally, metadata that comes from the FaaS itself.
As it turns out, CF Workers itself already provides such a signal. Each outbound subrequest from a Worker gets forcibly annotated "on the way out" with a request header naming the Worker it came from. We can block on / rate-limit by this header. Works great!
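That bucketing logic is simple once the platform supplies the header. A sketch that prefers a workload-identity header and falls back to ASN — the `cf-worker` header name reflects what Cloudflare attaches to Worker subrequests, but the limit and the `asn_of` helper here are hypothetical:

```python
from collections import Counter

# Per-bucket request counts within some window (window reset omitted).
request_counts: Counter = Counter()
LIMIT_PER_WINDOW = 100   # made-up per-bucket budget

def bucket_key(headers: dict, remote_ip: str, asn_of) -> str:
    """Pick the distinguishing key for rate limiting.

    A platform-enforced workload identity gives a clean per-workload
    bucket; without it we fall back to a shared ASN bucket, where one
    bad workload can get the whole FaaS range punished.
    """
    worker = headers.get("cf-worker")     # e.g. "someworker.example.workers.dev"
    if worker:
        return f"worker:{worker}"
    return f"asn:{asn_of(remote_ip)}"

def allow(headers: dict, remote_ip: str, asn_of) -> bool:
    key = bucket_key(headers, remote_ip, asn_of)
    request_counts[key] += 1
    return request_counts[key] <= LIMIT_PER_WINDOW
```

With the header present, each bucket is purely one workload's traffic, so a rate-limit heuristic punishes exactly the misbehaving bot and nobody else — the property argued for above.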
But other FaaS providers do not provide anything similar. For example, it's currently impossible to determine which AWS Lambda customer is making requests to our API, unless that customer specifically deigns to attach some identifying info to their requests. (I actually reported this as a security bug to the Lambda team, over three years ago now.)
---
So, the point of an infrastructure-level-enforced public-visible workload-identity system, like what CF is proposing for their "signed agents", isn't just about being able to whitelist "good bots."
It's also about having some differentiable key that can cleanly bucket bot traffic, where any given bucket then contains purely legitimate or purely malicious/misbehaving bot traffic; so that if you set up rate-limiting, greylisting, or heuristic blocking by this distinguishing key, then the heuristic you use will ensure that your legitimate (bot) users never get punished, while your misbehaving/malicious (bot) users automatically trip the heuristic. Which means you never need to actually hunt through logs and manually blacklist specific malicious/misbehaving (bot) users.
If you look at this proposal as an extension/enhancement of what CF has already been doing for years with Workers subrequest originating-identity annotation, the additional thing that the "signed agents" would give the ecosystem on behalf of an adopting FaaS, is an assurance that random other bots not running on one of these FaaS platforms, can't masquerade as your bot (in order to take advantage of your preferential rate-limiting tier; or round-robin your and many others' identities to avoid such rate-limiting; or even to DoS-attack you by flooding requests that end up attributed to you.) Which is nice, certainly. It means that you don't have to first check that the traffic you're looking at originated from one of the trustworthy FaaS providers, before checking / trusting the workload-identity request header as a distinguishing key.
But in the end, that's a minor gain, compared to just having any standard at all — that other FaaSes would sign on to support — that would require them to emit a workload-identity header on outbound requests. The rest can be handled just by consuming+parsing the published IP-ranges JSON files from FaaS providers (something our API backend already does for CF in particular.)
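To sketch that consuming+parsing step: the JSON shape below mirrors the `prefixes`/`ip_prefix` layout of AWS's published ip-ranges.json, but the sample prefixes and provider names are illustrative, not real provider ranges:

```python
# Sketch: classify a client IP against published FaaS egress ranges.
# SAMPLE_RANGES_JSON stands in for a provider's real published file.
import ipaddress
import json

SAMPLE_RANGES_JSON = """
{"prefixes": [
  {"ip_prefix": "203.0.113.0/24", "provider": "faas-a"},
  {"ip_prefix": "198.51.100.0/24", "provider": "faas-b"}
]}
"""

def load_ranges(doc):
    """Parse a published ranges file into (network, provider) pairs."""
    prefixes = json.loads(doc)["prefixes"]
    return [(ipaddress.ip_network(p["ip_prefix"]), p["provider"])
            for p in prefixes]

def provider_for(ip, ranges):
    """Return the provider whose ranges contain this IP, or None."""
    addr = ipaddress.ip_address(ip)
    for net, provider in ranges:
        if addr in net:
            return provider
    return None
```

In practice you'd refresh the file periodically and use a prefix trie for lookup speed, but a linear scan shows the shape of it.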
Related:
Web Bot Auth
https://news.ycombinator.com/item?id=45055452
and associated blog post:
The age of agents: cryptographically recognizing agent traffic
https://blog.cloudflare.com/signed-agents/
Your ideas are intriguing to me and wish to subscribe to your newsletter.
Joking aside, I think the ideas and substance are great and sorely needed. However, I can only see the idea of a sort of token chain verification as running into the same UX problems that plagued (plagues?) PGP and more encryption-focused processes. The workflow is too opaque, requires too much specialized knowledge that is out of reach for most people. It would have to be wrapped up into something stupid simple like an iOS FaceID modal to have any hope of succeeding with the general public. I think that's the idea, that these agents would be working on behalf of their owners on their own devices, so it has to be absolutely seamless.
Otherwise, rock on.
The web doesn't need gatekeepers the way you don't need a bank account, driver's license, or a credit card. You can do without it, but it sure makes it harder to interact with modern society. The days of the mainstream internet being a libertarian frontier are more or less over. The capitalist internet is firmly in charge.
The real question is whether there is more business opportunity in supporting "unsigned" agents than signed ones. My hope is that the industry rejects this because there's more money to be made in catering to agents than blocking them. This move is mostly to create a moat for legacy business.
Also, if agents do become the de-facto way of browsing the internet, I'm not a fan of more ways of being tracked for ads and more ways for censorship groups to have leverage.
But the author is attacking a strawman rather than the steelman argument against signed agents. The strongest argument I can see is not that we don't need gatekeepers, but that regulation is anti-business.
This article can easily be dismissed when, hardly a moment in, you see the headline "Agents Are Inevitable".
I'm sorry, but the "agents" of "agentic AI" are completely different from the original purpose of the World-Wide Web, which was to support user agents. User agents are used directly by users—aka browsers. API access came later, but even then it was often directed by user activity…and otherwise quite normally rate-limited or paywalled.
The idea that every web server must now service an insane number of automated bots doing god-knows-what, often without users even understanding what's happening, and without the consent of content owners to have all their IP scraped into massive training datasets, is, well, asinine.
That's not the web we built, that's not the web we signed up for; and yes, we will take drastic measures to block your ass.
Speak for yourself. This is just the semantic web: a web not built just for humans, but also for robots or any other types of agents that may wish to build upon the data. User agents never meant just web browsers, and operators blocking based on it necessitated hiding your identity.
Blocking bots is an absurd and unwinnable proposition, just like DRM; there's always the final, nuclear option of the analog hole, a literal video camera pointed at a monitor and using a keyboard and mouse.
If you really need to, deploy a proof of work shield that doesn't discriminate against user agents, just like what Onionsites do.
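For what it's worth, such a shield is simple to sketch. Here's a hashcash-style challenge in the spirit of the onion-service proof-of-work defense; the difficulty and hash choice are arbitrary for illustration:

```python
# Sketch: hashcash-style proof of work. The client burns CPU finding
# a nonce; the server verifies with a single hash. No user-agent
# discrimination involved. Difficulty of 12 bits is illustrative.
import hashlib
import itertools

DIFFICULTY_BITS = 12  # leading zero bits required of the digest

def leading_zero_bits(digest):
    """Count leading zero bits in a byte string."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def solve(challenge):
    """Client side: search for a nonce meeting the difficulty target."""
    for nonce in itertools.count():
        digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
        if leading_zero_bits(digest) >= DIFFICULTY_BITS:
            return nonce

def verify(challenge, nonce):
    """Server side: one hash to check what took many hashes to find."""
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS
```

The asymmetry is the whole trick: verification is one hash, solving is ~2^12 hashes here, and the server can raise the difficulty under load.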
> When I’m driving, I hand my phone to a friend and say, “Reply ‘on my way’ to my Mom.” They act on my behalf, through my identity, even though the software has no built-in concept of delegation. That is the world we are entering.
That is a very small part of the world we're entering.
The vast majority of other use cases will come from even more abusive bots than we have today, filling the internet with spam, disinformation, and garbage. The dead internet is no longer a theory, and the future we're building will make the internet for bots, by bots. Humans will retreat into niche corners of it, and those who wish to participate in the broader internet will either have to live with this, or abide by new government regulations that invade their privacy and undermine their security.
So, yes, confirming human identity is the only path forward if we want to make the internet usable by humans, but I do agree that the ideal solution will not come from a single company, or a single government, for that matter. It will be a bumpy ride until we figure this out.