Can there be a challenge that actually does some maliciously useful compute? Like make their crawlers mine bitcoin or something.
Did you just use the words “useful” and “bitcoin” in the same sentence? o_O
The saddest part is, we thought crypto was the biggest waste of energy ever and then the LLMs entered the chat.
ouch. I never made that comparison, but that is on point.
Bro couldn’t even bring himself to mention protein folding because that’s too socialist I guess.
You’re 100% right. I just grasped at the first example I could think of where the crawlers could do free work. Yours is much better. Left is best.
LLMs can’t do protein folding. A specifically-trained Machine Learning model called AlphaFold did. Here’s the paper.
Developing, training, and fine-tuning that model was a research effort led by two guys who got a Nobel for it. AlphaFold can’t do conversation or give you hummus recipes; it knows shit about the structure of human language, but it can identify patterns in the domain where it has been specifically and painstakingly trained.
It wasn’t “hey chatGPT, show me how to fold a protein” is all I’m saying, and the “superhuman reasoning capabilities” of current LLMs still fall ridiculously short of much simpler problems.
They can’t mine bitcoin either, so technical feasibility wasn’t the goal of my reply.
The crawlers for LLMs are not themselves LLMs.
Hey dipshits:
The number of mouth-breathers who think every fucking “AI” is a fucking LLM is too damn high.
AlphaFold is not a language model. It is specifically designed to predict the 3D structure of proteins, using a neural network architecture that reasons over a spatial graph of the protein’s amino acids.
- Not every artificial intelligence is a deep neural network algorithm.
- Not every deep neural network algorithm is a generative adversarial network.
- Not every generative adversarial network is a language model.
- Not every language model is a large language model.
Fucking fart-sniffing twats.
$ ./end-rant.sh
I went back and added “malicious” because I knew it wasn’t useful in reality. I just wanted to express the idea of the AI crawlers doing free work. But you’re right, bitcoin sucks.
To be fair: it’s a great tool for scamming people (think ransomware) :/
Not without also making real users mine bitcoin, or driving them away because their performance tanked.
The Monero community spent a long time trying to find a “useful PoW” function. The problem is that most computations that are useful are not also easy to verify as correct. JavaScript optimization was one direction that got pursued pretty far.
But at the end of the day, a crypto that actually intends to withstand attacks from major governments requires a system that is decentralized, trustless, and verifiable, and the only solutions that have been found to date involve algorithms for which a GPU or even custom ASIC confers no significant advantage over a consumer-grade CPU.
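To make that verification asymmetry concrete, here is a minimal hash-puzzle sketch in Python (the session token and difficulty are made up): finding the nonce costs tens of thousands of hashes, checking it costs exactly one. A “useful” computation, like a folded protein or an optimized JavaScript function, has no comparably cheap check.

```python
import hashlib
import itertools

def solve(token: bytes, difficulty: int) -> int:
    """Grind nonces until sha256(token + nonce) starts with `difficulty`
    zero hex digits -- roughly 16**difficulty hashes of work."""
    for nonce in itertools.count():
        digest = hashlib.sha256(token + str(nonce).encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce

def verify(token: bytes, nonce: int, difficulty: int) -> bool:
    """One hash, no matter how long solving took."""
    digest = hashlib.sha256(token + str(nonce).encode()).hexdigest()
    return digest.startswith("0" * difficulty)

token = b"made-up-session-token"           # hypothetical per-visitor value
nonce = solve(token, difficulty=4)         # slow: the visitor/crawler pays this
assert verify(token, nonce, difficulty=4)  # cheap: the server pays only this
```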
I mean, we really have to ask ourselves - as a civilization - whether human collaboration is more important than AI data harvesting.
I think every company in the world has been telling everyone for a few months now that what matters is AI data harvesting. There’s not even a hint of it being a question. You either accept the AI overlords or get out of the internet. Our ONLY purpose is to feed the machine; anything else is irrelevant. Play along or you shall be removed.
get out of the internet.
At some point, this would be the best option, sadly
I know this is the most ridiculous idea, but we need to pack our bags and build a new internet protocol, to separate ourselves from the rest, at least for a while. Either way, most “modern” internet things (looking at you, JavaScript) are not modern at all, and starting over might help more than any of us could imagine.
Like Gemini?
From the official website:
Gemini is a new internet technology supporting an electronic library of interconnected text documents. That’s not a new idea, but it’s not old fashioned either. It’s timeless, and deserves tools which treat it as a first class concept, not a vestigial corner case. Gemini isn’t about innovation or disruption, it’s about providing some respite for those who feel the internet has been disrupted enough already. We’re not out to change the world or destroy other technologies. We are out to build a lightweight online space where documents are just documents, in the interests of every reader’s privacy, attention and bandwidth.
Yep! That was exactly the protocol on my mind. One thing, though: the Fediverse would need to be ported to Gemini, or at least a new protocol would need to be created for it.
Won’t the bots just adapt and move there too?
We had a trust-based system for so long. No one is forced to honor robots.txt, but most big players did. It almost restores my faith in humanity a little bit. And then the AI companies came and destroyed everything. This is why we can’t have nice things.
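That trust is literally one `if` statement the crawler runs against itself. A rough sketch with Python’s standard library (the site and user-agent string are placeholders): nothing enforces the answer, the crawler just chooses whether to care.

```python
from urllib.robotparser import RobotFileParser

# A polite crawler checks robots.txt before fetching; nothing but convention
# forces it to. The site and user-agent string here are placeholders.
rp = RobotFileParser("https://example.org/robots.txt")
rp.read()

if rp.can_fetch("SomeCrawler/1.0", "https://example.org/some/page"):
    print("allowed by robots.txt")
else:
    print("disallowed: a polite crawler stops here, a rude one just doesn't ask")
```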
Big players are the ones behind most AIs though.
Is there a migration tool? If not, it would be awesome to be able to migrate everything, including issues and stuff. I bet even more people would move.
Codeberg has very good migration tools built in. You need to do one repo at a time, but it can move issues, releases, and everything.
There are migration tools, but not a good bulk one that I could find. It worked for my repos, except for my Unreal Engine fork.
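If someone wants a poor man’s bulk migration in the meantime: Codeberg runs Forgejo, which exposes a Gitea-compatible `POST /repos/migrate` endpoint, so a short loop can approximate it. This is only a sketch; the owner name, tokens, and repo list are placeholders, and the exact field names should be double-checked against the current Forgejo API docs.

```python
import requests

CODEBERG_API = "https://codeberg.org/api/v1"
CODEBERG_TOKEN = "placeholder-codeberg-token"
SOURCE_TOKEN = "placeholder-github-token"   # only needed for private repos
OWNER = "alice"                             # placeholder source account

# In practice you'd list these via the source forge's API.
repos = ["my-project", "another-project"]

for name in repos:
    resp = requests.post(
        f"{CODEBERG_API}/repos/migrate",
        headers={"Authorization": f"token {CODEBERG_TOKEN}"},
        json={
            "clone_addr": f"https://github.com/{OWNER}/{name}.git",
            "repo_name": name,
            "service": "github",       # lets the migration pull issues etc. too
            "auth_token": SOURCE_TOKEN,
            "issues": True,
            "labels": True,
            "milestones": True,
            "releases": True,
            "wiki": True,
        },
        timeout=600,
    )
    resp.raise_for_status()
    print(f"migrated {name}")
```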
Eventually we’ll have “defensive” and “offensive” LLMs managing all kinds of electronic warfare automatically, effectively nullifying each other.
Places like Cloudflare and Akamai are already using machine learning to detect bot traffic at the network level. You need similar machine learning to evade them. And since most of these scrapers are for AI companies, I’d expect a lot of the scrapers to be LLM-generated.
Obligatory AI ≠ LLM. How would scrapers benefit from the LLMs they help train? The defense is obvious: LLM-generated slop traps against scrapers already exist.
Question: do those artificial-stupidity bots want to steal the issues, or the code? Because why are they wasting a lot of resources scraping millions of pages when they could steal everything via SSH (once a month, not 120 times a second)?
That would require having someone with real intelligence running the scraper.
I knew that was the worse option. Use the one that traps them in an infinite maze.
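The maze doesn’t take much to build, either. A minimal sketch of the idea in Python/Flask (the route, word list, and link count are all made up): every page is deterministic filler plus ten links deeper into the same trap, so a link-following scraper never runs out of URLs while real users never see it.

```python
import random
from flask import Flask

app = Flask(__name__)
WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "protein", "nonce", "maze"]

@app.route("/maze/<slug>")
def maze(slug):
    # Seed from the URL so every page is stable (and cacheable) but unique.
    rng = random.Random(slug)
    text = " ".join(rng.choices(WORDS, k=300))
    links = " ".join(
        f'<a href="/maze/{rng.getrandbits(64):x}">more</a>' for _ in range(10)
    )
    return f"<html><body><p>{text}</p><p>{links}</p></body></html>"

# Run with: flask --app thisfile run  (or app.run() for a quick local test)
```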
And once again a web application firewall (WAF) was defeated, and it turns out that blocklists and bot-detection tools like fail2ban are the way to go…
Who could have seen this coming…
Just provide a full dump.zip plus incremental daily dumps and they won’t have to scrape?
Isn’t that an obvious solution? I mean, it’s public data, it’s out there, do you want it public or not?
Do you want it only on OpenAI and Google but nowhere else? If so, then good luck with the piranhas.
The Wikimedia Foundation does just that, and still, their infrastructure is under stress because of AI scrapers.
Dumps or no dumps, these AI companies don’t care. They feel like they’re entitled to taking or stealing what they want.
That’s crazy; it makes no sense. It takes as much bandwidth and processing power on the scraper’s side to process and use the data as it takes to serve it.
They also have an open API that makes scraping entirely unnecessary.
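For anyone who hasn’t used it: one API request returns the plain text of an article, no HTML parsing, no crawling. A hedged sketch (the article title and user-agent string are placeholders, and the parameters are worth checking against the MediaWiki API docs):

```python
import requests

# Fetch the plain-text extract of one article via the MediaWiki API,
# instead of crawling and parsing the rendered HTML.
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,
        "titles": "Proof of work",
        "format": "json",
    },
    headers={"User-Agent": "polite-example/0.1 (contact@example.org)"},
    timeout=30,
)
pages = resp.json()["query"]["pages"]
for page in pages.values():
    print(page["extract"][:500])
```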
Here are the relevant quotes from the article you posted
“Scraping has become so prominent that our outgoing bandwidth has increased by 50% in 2024.”
“At least 65% of our most expensive requests (the ones that we can’t serve from our caching servers and which are served from the main databases instead) are performed by bots.”
“Over the past year, we saw a significant increase in the amount of scraper traffic, and also of related site-stability incidents: Site Reliability Engineers have had to enforce on a case-by-case basis rate limiting or banning of crawlers repeatedly to protect our infrastructure.”
And it’s Wikipedia! The entire data set is trained INTO the models already; it’s not like encyclopedic facts change that often to begin with!
The only thing I can imagine is that it’s part of a larger ecosystem issue: dumps and API access are rare enough, and untrustworthy enough, that the scrapers just scrape everything rather than take the time to save bandwidth by relying on dumps.
Maybe it’s a consequence of the 2023 API wars, when it became clear that data repositories would leverage their position as pools of knowledge to extract rent from search and AI. Places like Wikipedia and other wikis and forums are getting hammered as a result of that war.
If the internet weren’t becoming a warzone, there really wouldn’t be a need for more than one scraper to scrape a site. Even a hostile site like Facebook would only need to be scraped once, and then the data could be shared efficiently over a torrent swarm.
I think the issue is that the scrapers are collecting text fully automatically, jumping from link to link like a search-engine indexer.
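Which is basically the dumbest possible crawler. A sketch of the pattern being described (the start URL and page limit are made up): fetch a page, regex out every href, queue them all, repeat, with no awareness that a dump or an API would hand over the same data for a fraction of the cost.

```python
from collections import deque
from urllib.parse import urljoin
import re
import requests

# Naive breadth-first crawler: blindly follows every link it finds,
# with no notion of dumps, APIs, or caching.
seen, queue = set(), deque(["https://example.org/"])
while queue and len(seen) < 100:
    url = queue.popleft()
    if url in seen:
        continue
    seen.add(url)
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        continue
    for href in re.findall(r'href="([^"]+)"', html):
        queue.append(urljoin(url, href))

print(f"fetched {len(seen)} pages")
```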