I've contemplated this a bit, and I think I have a somewhat unconventional take:
First, this is really impressive.
Second, with that out of the way, these models are not playing the same game as the human contestants, in at least two major regards. First, and quite obviously, they have massive amounts of compute power, which is kind of like giving a human team a week instead of five hours. Second, the competing models have absolutely massive memorization capacity, whereas the teams are allowed to bring a 25-page PDF with them and they need to manually transcribe anything from that PDF that they actually want to use in a submission.
I think that, if you gave me the ability to search the pre-contest Internet and a week to prepare my submissions, I would be kind of embarrassed if I didn't get gold, and I'd find the contest to be rather less interesting than I would find the real thing.
Second, with that out of the way, these cars are not playing the same game as horses… first, and quite obviously, they have massive amounts of horsepower, which is kind of like giving a team of horses… many more horses. But also, cars have an absolutely massive fuel capacity. Petrol is such an efficient store of chemical energy compared to hay, and cars can store gallons of it.
I think if you gave my horse the strength of 300 horses and fed it pure gasoline, I would be kind of embarrassed if it wasn't able to win a horse race.
Yeah man, and it would be wild to publish an article titled "Ford Mustang and Honda Civic win gold in the 100 meter dash at the Olympics" if what happened was the companies drove their cars 100 meters and tweeted that they did it faster than the Olympians had run.
Actually that's too generous, because the humans are given a time limit in ICPC, and there's no clear mapping to say how the LLM's compute should be limited to make a comparison.
It IS an interesting result to see how models can do on these tests - and it's also a garbage headline.
> what happened was the companies drove their cars 100 meters and tweeted that they did it faster than the Olympians had run
That would indeed be an interesting race around the time cars were invented. Today it would be silly, since everyone knows what cars are capable of, but back then one could imagine a lot more skepticism.
Just as there is a ton of skepticism today of what LLMs can achieve. A competition like this clearly demonstrates where the tech is, and what is possible.
> there's no clear mapping to say how the LLM's compute should be limited to make a comparison
There is a very clear mapping, of course: you give the computer the same wall-clock time you gave the humans.
Because what it is showing is that the computer can do the same thing a human can under the same conditions. With your analogy here they are showing that there is such a thing as a car and it can travel 100 meters.
Once it is a foregone conclusion that an LLM can solve the ICPC problems, and that point has been sufficiently driven home to everyone who cares, we can ask further questions, like "how much faster can it solve the problems compared to the best humans?" or "how much energy does it consume while solving them?" It sounds like you went past the first question and are already asking these follow-up questions.
You're right, they did limit it to 5 hours and, I think, 3 models, which seems analogous at least.
Not enough to say they "won gold". Just say what actually happened! The tweets themselves do, but then we have this clickbait headline here on HN somehow that says they "won gold at ICPC".
Cars going faster than humans or horses isn't very interesting these days, but it was 100+ years ago when cars were first coming on the scene.
We are at that point now with AI, so a more fitting headline analogy would be "In a world first, automobile finishes with gold-winning time in horse race".
Headlines like those were a sign that cars would eventually replace horses in most use-cases, so the fact that we could be in the same place now with AI and humans is a big deal.
It was more than interesting 100+ years ago -- it was the subject of wildly inconsistent, often fear-based (or incumbent-industry-based) regulation.
A vetoed 1896 Pennsylvania law would have required drivers who encountered livestock to "disassemble the automobile" and "conceal the various components out of sight, behind nearby bushes until [the] equestrian or livestock is sufficiently pacified". The Locomotive Act of 1865 (the "Red Flag Act") required early motorized vehicles to be preceded by a person on foot waving a red flag or carrying a red lantern and blowing a horn.
It might not quite look like that today, but wild-eyed, fear-based regulation as AI use grows is a real possibility. And at least some of it will likely seem just as silly in hindsight.
I think your analogy is interesting, but it falls apart because "moving fast" is not something we consider uniquely human, whereas "solving hard abstract problems" is.
The point is that up until now, humans were the best at these competitions, just like horses were the best at racing up until cars came around.
The other commenter is pointing out how ridiculous it would be for someone to downplay the performance of cars because they did it differently from horses. It doesn't matter that they did it using different methods; the fact that the final outcome was better had world-changing ramifications.
The same applies here. Downplaying AI because it has different strengths or plays by different rules is foolish, because that doesn't matter in the real world. People will choose the option that leads to the better/faster/cheaper outcome, and that option is quickly becoming AI instead of humans - just like cars quickly became the preferred option over horses. And that is crazy to think about.
I feel the main difference is that cars can't compress time the way an array of computers can. I could win this competition instantly with an infinitely parallel array of infinite monkeys typing random characters on infinite typewriters, since one of them would be perfectly right given infinite submissions. When I make my tweet I would show off a single monkey, glossing over the infinite money needed to feed my infinite workforce, because that's clearly more impressive.
Now obviously it's more impressive than that, since they don't have infinite compute and had finite time, but the car only gets one entry in each race unless we start getting into some anime-ass shit with divergent timelines and one of the cars (and some lesser number of horses) finishing instantly.
To your last point, we don't know that this was cheaper, since they don't disclose the cost. I would blindly guess that a Mechanical Turk approach at the same cost would outperform it, at least today.
I think you missed that the whole point of this race was:
"did we build a vehicle faster than a horse, yes/no?"
Which matters a lot when horses are the fastest means of land travel available. (We're so used to thinking of horses as a quaint and slow means of transport that maybe we don't realize that for millennia they were the fastest possible way to get from one place to another.)
Yeah, I think the only thing OP was passing judgement on is the competition aspect of it, not the actual achievement of any human or non-human participant.
That’s how I read it at least - exactly how you put it
I was struck by how the argument is also isomorphic to how we talked about computers and chess. We're at the stage where we are arguing the computer isn't _really_ understanding chess, though: it's just doing huge amounts of dumb computation with huge opening books and endgame tables, and no real understanding, strategy, or sense of what's going on.
Even though all of those criticisms were, in a sense, valid, in the end none of it amounted to a serious challenge to getting good at the task at hand.
Comparing power with reasoning does not make any sense at all.
Humans have surpassed their own strength since the invention of the lever thousands of years ago. Since then, it has been a matter of finding power sources millions of times greater, such as nuclear energy.
Snark aside, I would expect a car partaking in a horse race to beat all of the horses. Not because it's a better horse, but because it's something else altogether.
Ergo, it's impressive with nuance. As the other commenter said.
The massive amount of compute power is not the major issue. The major issue is the unlimited amount of reference material.
If a human can look up similar previous problems just as the "AI" can, it is a huge advantage.
Syzygy tables in chess engines are a similar issue. They allow perfect play, and there is no reason why a computer gets them and a human does not (if you compare humans against chess engines). Humans have always worked with reference material for serious work.
Humans are allowed to look up and learn from as many previous problems as they want before the competition. The AI is also trained on many previous problems before the competition. What's the difference?
Deleted, because the "AI" geniuses and power users pointed out that Tao does not have a point. You can get this one to -4 as well, since that seems to be the primary pleasure for "AI" one armed bandit users.
It doesn't say anywhere that Gemini used any of those things at ICPC, or that it used more real-world time than the humans.
Also, who cares? It's a self contained non-human system that could solve an ICPC problem it hasn't seen before on its own, which hasn't been achieved before.
If there was a savant human contestant with photographic memory who could remember every previous ICPC problem verbatim and can think really fast you wouldn't say they're cheating, just that they're really smart. Same here.
If there was a man behind the curtain that was somehow making this not an AI achievement then you would have a point, but there isn't.
I think "hasn't seen before" is a bit of an overstatement. Sure, the problem is new in the literal sense that it does exist verbatim elsewhere, but arguably, any competition problem is hardly novel: they are all some permutation of problems that exist and have been solved before: pathfinding, optimization, etc. I don't think anyone is pretending to break new scientific ground in 5 hours.
> I think that, if you gave me the ability to search the pre-contest Internet and a week to prepare my submissions, I would be kind of embarrassed if I didn't get gold, and I'd find the contest to be rather less interesting than I would find the real thing.
I don't know what your personal experience with competitive programming is, so your statement may be true for yourself, but I can confidently state that this is not true for the VAST majority of programmers and software engineers.
Much like IMO problems for someone without tons of training/practice, the mid-to-hard problems in the ICPC are completely unapproachable to the average computer science student (who already has a better chance than the average software engineer), even over the course of a week.
In the same way that LLMs have memorized tons of stuff, the top competitors capable of achieving a gold medal at the ICPC know algorithms, data structures, and how to pattern match them to problems to an extreme degree.
> I can confidently state that this is not true for the VAST majority of programmers and software engineers.
That may well be true. I think it's even more true in cases where the user is not a programmer by profession. I once watched someone present their graduate-level research in a different field and explain how they had solved a real-world problem in their field by writing a complicated computer program full of complicated heuristics to get it to run fast enough, and I remember thinking "hmm, I'm pretty sure that a standard algorithm from computer graphics could be adapted to directly solve your problem in O(n log n) time".
If users can get usable algorithms that approximately match the state of the art out of a chatbot (or a fancy "agent") without needing to know the magic words, then that would be amazing, regardless of whether those chatbots/agents ever become creative enough to actually advance the state of the art.
(I sometimes dream of an AI producing a piece of actual code that comes even close to state of the art for solving mixed-integer optimization problems. That's a whole field of wonderful computer science / math that is mostly usable via a couple of extraordinarily expensive closed-source offerings.)
OR-Tools is a whole grab-bag of tools, most of which are wrappers around various solvers, including Gurobi and CPLEX. It seems like CP-SAT is under the OR-Tools umbrella, and CP-SAT may well be state-of-the-art for the specific sets of problems that it's well-suited for.
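For what it's worth, the CP-SAT Python API is quite approachable if you want to experiment. Here is a minimal sketch of a toy integer program; the variables and constraints are invented purely for illustration and have nothing to do with the article:

    # Toy integer program: maximize 2x + 3y subject to x + 2y <= 14 and 3x - y >= 0.
    from ortools.sat.python import cp_model

    model = cp_model.CpModel()
    x = model.NewIntVar(0, 10, "x")
    y = model.NewIntVar(0, 10, "y")
    model.Add(x + 2 * y <= 14)
    model.Add(3 * x - y >= 0)
    model.Maximize(2 * x + 3 * y)

    solver = cp_model.CpSolver()
    status = solver.Solve(model)
    if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
        print(solver.ObjectiveValue(), solver.Value(x), solver.Value(y))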
I think that's because the framing around this (and similar stories about e.g. IMO performances) is, in my opinion, slightly wrong. It's not interesting that they can get a gold medal in the sense of trying to rank them against human competitors. As you say, the direct comparisons are, while not entirely meaningless, at least very hard to interpret in the best of cases. It's very much an apples-to-oranges situation.
Rather, the impressive thing is simply that an AI is capable of solving these problems at all. These are novel (i.e. not in the training set) problems that are really hard and beyond the ability of most professional programmers. The "gold medal" part is informative more in the sense that it gives an indication of how many problems the AI was able to solve and how well it was able to solve them.
When talking with some friends about chatgpt just a couple years ago I remember being very confident that there was no way this technology would be able to solve this kind of novel, very challenging reasoning problem, and that there was no way it would be able to solve IMO problems. It's remarkable how quickly I've been proven wrong.
> whereas the teams are allowed to bring a 25-page PDF
This is where I see the biggest issue. LLMs are first-and-foremost text compression algorithms. They have a compressed version of a very good chunk of human writing.
Beyond being text compression engines, LLMs are really good at interpolating text, thanks to the generalization induced by the lossy compression.
What this result really tells us is that, given a reasonably well compressed corpus of human knowledge, the ICPC can be viewed as an interpolation task.
Now, if a model can do all of the following:
- compress (in a relatively recoverable way) the entire domain of human knowledge
- interpolate across the entire domain of human knowledge
- draw connections or conclusions that haven't previously been stated explicitly
- verify or disprove those conclusions or connections
- update its internal model based on that (further expanding the domain it can interpolate within)
Then I think we're cooking with gasoline. I guess the question becomes whether those new conclusions or connections result in a convergent or divergent increase in the number of new conclusions and connections the model can draw (e.g. do we understand better the domains we already know or does updating the model with these new conclusions/connections allow us to expand the scope of knowledge we understand to new domains).
It doesn't matter how many instances were running. All that matters is the wall clock time and the cost.
The fact that they don't disclose the cost is a clue that it's probably outrageous today. But costs are coming down fast. And hiring a team of these guys isn't exactly cheap either.
This is what the argument is? 10 years ago, if you said you could do this with every computer on the planet and every computer scientist focused on trying to create the code to do it, I would've given you absurd odds against getting 12 problems right at the ICPC. 10 years ago it couldn't even reliably parse the question statement.
As someone who was at the ICPC finals around a decade ago, I agree that the limited time is really the big difference; these machine learning models don't experience it in the same way. That said, these problems are hard: the actual coding of the algorithms is pretty easy (most of the questions use one of a handful of algorithms that you've implemented a hundred times by the time you're in the finals), but recognizing which one will actually solve the problem correctly is not obvious at all. I know a lot of people who struggled in their undergrad algorithms class, and I think a lot of those people would struggle with the ICPC finals problems even if they were able to research.
The human teams also get limited to one computer shared between 3 people. The models have access to an effectively unbounded number of computers.
My argument does feel a bit like the “Watson doesn’t need to physically push the button” equivalents from when that system beat Jeopardy for the first time. I assume 5 hours on a single high-end Mac would probably still be enough compute in the near future.
I found the Watson match to be rather absurd. It would have been much more interesting if the rules had been modified so that all contestants had, say, two seconds to press the buzzer, and the contestant who got to answer would be chosen by random selection among those who pressed the button. This would at least have made the competition about who could come up with the most correct answers (questions).
I think your analogy is lacking. The human brain is much more efficient, so it is not right to say "giving a human team a week instead of five hours". Most likely, the whole of OpenAI's compute cannot match one brain in terms of connections, relations, and computational power.
I think your assessment is spot on.
But I also think there's a bigger picture that's getting lost in the sauce, not just in your comment but in the general discourse around AI progress:
- We're currently unlocking capabilities to solve many tasks which could previously only be solved by the top-1% of the experts in the field.
- Almost all of that progress is coming from large-scale deep learning. Turns out transformers with autoregression + RL are mighty generalists (though still far from AGI).
Once it becomes cheap enough that the average Joe can tinker with models of this scale, every engineering field can apply it to its niche interest. And ultimately nobody cares if you're playing by the same rules as humans outside of these competitions; they only care that you make them wealthy, healthy and comfy.
If you want to play that game, let's compute how much energy was spent to grow, house, and educate one team since they were born, over 20 years, against how much was spent training the model.
This is a fair analogy, but let's also consider that these human beings weren't designed with the express purpose of becoming experts in their field and performing in this way for this specific purpose (albeit in a generalist manner).
We are most definitely in agreement about the folly of comparing the abilities of LLMs to humans, since LLMs are to a greater extent the product of much collective human endeavour. "Living memories" would perhaps be a better description of their current state, and their resultant impact on the human psyche.
More information on OpenAI's result (which seems better than DeepMind's) from the X thread:
> our OpenAI reasoning system got a perfect score of 12/12
> For 11 of the 12 problems, the system’s first answer was correct. For the hardest problem, it succeeded on the 9th submission. Notably, the best human team achieved 11/12.
> We had both GPT-5 and an experimental reasoning model generating solutions, and the experimental reasoning model selecting which solutions to submit. GPT-5 answered 11 correctly, and the last (and most difficult problem) was solved by the experimental reasoning model.
I'm assuming that "GPT-5" here is a version with the same model weights but higher compute limits than even GPT-5 Pro, with many instances working in parallel, and some specific scaffolding and prompts. Still, extremely impressive to outperform the best human team. The stat I'd really like to see is how much money it would cost to get this result using their API (with a realistic cost for the "experimental reasoning model").
They likely had a prompt that gave considerable guidance.
Hopefully that prompt was the same for all questions (I think that is what they did for the IMO submission, or maybe it was Google that did that, not sure).
It was within the allotted time. If I'm reading the scoreboard correctly [edit: I wasn't], the human teams typically submitted dozens or hundreds of attempts at each problem.
For problems that human teams eventually get correct, they seem to have submitted mostly 1 time -- occasionally 2 or 3. For problems that they did not get correct, there are some problems with up to 16 submissions.
I went to ICPC's web pages, downloaded the first problem (problem A) and gave it to GPT-5, asking it for code to solve it (stating it was a problem from a recent competitive programming contest).
Here is the prompt I just gave to GPT-5 Pro - it's chugging on it. Not sure if it will succeed. Let's see what happens. I did think about converting the PDF to markdown, but figured this prompt is more fair. (A small brute-force checker for the game itself follows the prompt, below.)
-
You are a gold level math olympiad competitor participating in the ICPC 2025 Baku competition. You will be given a competitive programming problem to solve completely.
Here is the problem you need to solve and only solve this problem:
<problem>
Problem B located on Page 3 of the PDF that starts with this text - but has other text so ensure you go to the PDF and look at all of page 3
To help her elementary school students understand the concept of prime factorization, Aisha has invented a game for them to play on the blackboard. The rules of the game are as follows.
The game is played by two players who alternate their moves. Initially, the integers from 1 to n are
written on the blackboard. To start, the first player may choose any even number and circle it. On every subsequent move, the current player must choose a number that is either the circled number multiplied by some prime, or the circled number divided by some prime. That player then erases the circled number and circles the newly chosen number. When a player is unable to make a move, that player loses the game.
To help Aisha’s students, write a program that, given the integer n, decides whether it is better to move first or second, and if it is better to move first, figures out a winning first move.</problem>
Your task is to provide a complete solution that includes:
1. A thorough analysis and solution approach
2. Working code implementation
3. Unit test cases with random inputs
4. Performance optimization to run within 1 second
Use your scratchpad to think through the problem systematically before providing your final solution.
<scratchpad>
Think through the following steps:
1. Problem Understanding:
- What exactly is the problem asking for?
- What are the input constraints and output requirements?
- Are there any edge cases to consider?
2. Solution Strategy:
- What algorithm or mathematical approach should be used?
- What is the time complexity of your approach?
- What is the space complexity?
- Will this approach work within the given constraints?
3. Implementation Planning:
- What data structures will you need?
- How will you handle input/output?
- What are the key functions or components?
4. Testing Strategy:
- What types of test cases should you create?
- How will you generate random inputs within the problem constraints?
- What edge cases need specific testing?
5. Optimization Considerations:
- Are there any bottlenecks in your initial approach?
- Can you reduce time or space complexity?
- Are there language-specific optimizations to apply?
</scratchpad>
Now provide your complete solution with the following components:
<analysis>
Provide a detailed analysis of the problem, including:
- Problem interpretation and requirements
- Chosen algorithm/approach and why
- Time and space complexity analysis
- Key insights or mathematical observations
</analysis>
<solution>
Provide your complete, working code solution. Make sure it:
- Handles all input/output correctly
- Implements your chosen algorithm efficiently
- Includes proper error handling if needed
- Is well-commented for clarity
</solution>
<unit_tests>
Create comprehensive unit test cases that:
- Test normal cases with random inputs within constraints
- Test edge cases (minimum/maximum values, boundary conditions)
- Include at least 5-10 different test scenarios
- Show expected outputs for each test case
</unit_tests>
<optimization>
Explain any optimizations you made or could make:
- Performance improvements implemented
- Memory usage optimizations
- Language-specific optimizations
- Verification that solution runs within 1 second for maximum constraints
</optimization>
Take all the time you need to solve this problem thoroughly and correctly.
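For anyone curious what the game actually looks like, here is a rough brute-force checker I might write to sanity-check a real solution on tiny n. It is emphatically not the intended contest approach (the real constraints would need a proper game-theoretic insight), and the rules it encodes are only my paraphrase of the statement above:

    # Exhaustive search over game states (circled number, numbers still on the board).
    def solve_small(n):
        primes = [p for p in range(2, n + 1) if all(p % d for d in range(2, p))]
        memo = {}

        def moves(cur, remaining):
            # Legal moves: circle cur * p or cur / p for some prime p,
            # provided that number is still written on the blackboard.
            out = []
            for p in primes:
                if cur * p <= n and cur * p in remaining:
                    out.append(cur * p)
                if cur % p == 0 and cur // p in remaining:
                    out.append(cur // p)
            return out

        def wins(cur, remaining):
            # True if the player about to move (with `cur` circled) can force a win.
            key = (cur, remaining)
            if key not in memo:
                memo[key] = any(not wins(nxt, remaining - {nxt})
                                for nxt in moves(cur, remaining))
            return memo[key]

        board = frozenset(range(1, n + 1))
        for first in range(2, n + 1, 2):          # first move: circle any even number
            if not wins(first, board - {first}):  # opponent then has no winning reply
                return "first", first
        return "second", None

    for n in range(2, 13):
        print(n, solve_small(n))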
I have the $20 plan and I think I found a weird bug, at least with the thinking version. It gets stuck in the same local minimum super quickly, even though the "fake solution" is easily disproved on random tests.
It's at the point where sometimes I've fed it the editorial and it still converges to the fake solution.
I'm sure the model is capable of solving it, but seriously, I've tried across multiple generations (since about when o3 came out) to get GPT to solve this problem, and I don't think it's hampered by its innate ability; it literally just refuses to think critically about the problem. Maybe with better prompting it doesn't get stuck as hard?
They apparently managed gold at the IOI as well. A result that was extremely surprising to me and causes me to rethink a lot of assumptions I have about current LLMs. Unfortunately there was very little transparency about how they managed those results, and the only source was a Twitter post. I want to know: was there any third-party oversight, what kind of compute did they use, how much power, what kind of models, and how were they set up? In this case I see that DeepMind at least has a blog post, but as far as I can see it does not answer any of my questions.
I think this is huge news, and I cannot imagine anything other than models with this capability having a massive impact all over the world. It causes me to be more worried than excited; it is very hard to tell what this will lead to, which is probably what makes it scary for me.
However with so little transparency from these companies and extreme financial pressure to perform well in these contests, I have to be quite sceptical of how truthful these results are. If true I think it is really remarkable, but I really want some more solid proof before I change my worldview.
Setting aside the possibility of human intervention, I don't think the specifics really matter. What this means is that it is possible, and that this capability will in time be commoditized.
This is helpful in framing the conversation, especially with "skeptics" of what these models are capable of.
To a certain extent I agree. But as far as I know, I cannot go to chatgpt.com, paste the newest ICPC problems, and get full solutions, and there is no information about what they do differently. For a competition like the ICPC, which is academic in nature, I think it is very unfortunate to set up a separate AI track like this without publishing clear public information about what that actually entails, and without clear requirements for these AI companies to publish their methodology. I know it is a nice source of sponsorships for them, but the ICPC should be able to afford to stand up a bit for academic integrity.
Without any of this I can't even know for sure if there was any human intervention. I don't really think so, but as I mentioned the financial pressure to perform well is extreme so I can totally see that happening. Maybe ICPC did have some oversight, but please write a bit about it then.
If you assume no human intervention, then all of this is of course irrelevant if you only care about the capabilities that exist. But still, the implications of a general model performing at this level vs. something more like a chess engine trained specifically on competitive programming are different, even if the gap may close in the future. And how much compute/power was used - are we talking hundreds of kWh? And does that just mean larger models than usual, or intelligent brute-forcing through a huge solution space? If so, it is not clear how much they will be able to scale down the compute usage while keeping performance at the same level.
If you assume the brain is a computer (and why wouldn't it be, is my stance), we have a long way to go in the optimization department, both in hardware and in software. If it's possible to do at all using hundreds of kilowatt-hours of electricity, there's no reason it shouldn't eventually be possible within a few hundred Wh (which is a scary prospect, I agree, with consequences hard to imagine once realized).
The best thing about the ICPC is the first C, which stands for "collegiate". It means that you solve a set of problems as a team of three, but with only one computer.
This means that you have to be smart about who is going to spend time coding, thinking, or debugging. The time pressure is intense, and it really is a team sport.
It's also extra fun if one of the team members prefers a Dvorak keyboard layout and vi, and the others do not.
I wonder how three different AI vendors would cooperate. It would probably lift reinforcement learning to the next level.
Here is the published 2025 ICPC World Finals problemset. The "Time limit: X seconds" printed on each ICPC World Finals problem is the maximum runtime your program is allowed. If any judged run of your program takes longer than that, the submission fails, even if other runs finish in time.
My understanding is that the way they do this is have some number of model instances generating solution proposals, and then another model which chooses which candidates to submit.
I haven't been able to find information on how many proposals were generated before a solution was chosen to submit. I'm curious to know whether this is "you can get ICPC gold medal performance with a handful of GPT-5 instances" or "you will drown yourself in API credit debt if you try this".
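Whatever the real numbers were, the scaffolding described above is conceptually just "fan out many proposals, then let a judge pick one". A hand-wavy sketch with placeholder functions standing in for the actual model calls (none of this is OpenAI's published code or API):

    import concurrent.futures
    import random

    # Hypothetical stand-ins for the real model calls; they only mark where the
    # proposer instances and the selector model would be invoked.
    def propose_solution(problem, seed):
        return f"// candidate {seed} for: {problem[:30]}"

    def score_candidate(problem, candidate):
        return random.random()

    def solve(problem, n_candidates=16):
        # Fan out: many independent proposals, generated in parallel.
        with concurrent.futures.ThreadPoolExecutor() as pool:
            candidates = list(pool.map(lambda s: propose_solution(problem, s),
                                       range(n_candidates)))
        # Fan in: a separate pass ranks the candidates and picks one to submit.
        return max(candidates, key=lambda c: score_candidate(problem, c))

    print(solve("Given a tree, count the ways to ..."))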
I think in the future information will be more walled off - because AI companies are not paying anyone for that piece of information - and I encourage everyone to put their knowledge on their own website and, for each page, put up a few URLs that humans won't be able to find (but could still click on if they knew where to look), but that can be crawled by AI, linking to pages containing falsified information (such as: "oh, the information on URL blah is actually incorrect, here you can find the correct version, with all those explanations, blah blah" - but of course the original page is the only correct version).
Essentially, we need to poison AI in all possible ways without impacting human reading. They either have to hire more humans to filter the information, or hire more humans to improve the crawlers.
Or we can simply stop sharing knowledge. I'm fine with it, TBF.
Why the AI hate? How is it different from sharing your knowledge with another individual or writing a book to share it?
> AI companies are not paying anyone for that piece of information
So? For the vast majority of human existence, paying for content was not a thing, just like paying for air isn't. The copyright model you are used to may just be too forced. Many countries have no moral qualms about "pirating" Windows and other pieces of software or games (which they couldn't afford to purchase anyway). There's no inherent morality or entitlement for an author to receive payment for everything they "create" (to wit, Bill Gates had to write a letter to the Homebrew Computer Club to make a case for this, showing that it was hardly the default and natural viewpoint). It's just a legal/social contract to achieve specific goals for society. Frankly, the wheels of copyright have been falling off since the dawn of the Internet, not since LLMs.
It's different because the AI model will then automate the use of that knowledge, which for most people in this forum is how they make their livelihood. If OpenAI were making robots to replace plumbers, I wouldn't be surprised when plumbers said "we should really stop giving free advice and training to these robots." It's in the worker's best interest to avoid getting undercut by an automated system that can only be built with the worker's free labor. And it's in the interest of the company to take as much free labor output (e.g. knowledge) as possible to automate a process so they can profit.
I have received free advice from actual plumbers (and mechanics and others, for that matter) that reduced my future need for them.
> we should really stop giving free advice and training to these robots
People routinely freely give advice and teach students, friends, potential competitors, actual competitors, etc on this same forum. Robots? Many also advocate for immigration and outsourcing, presumably because they make the calculus that it is net beneficial in some scenarios. People on this forum contribute to an entire ecosystem of free software, on top of which two kids can and have built $100 billion companies that utilize all such technology freely and without cost. Let's ban it all?
Sure, I totally get it if you want to make an individual choice for yourself to keep a secret sauce, not share your code, or put stuff behind a paywall. That is not the tone and the message here. There is some deep animosity advocating for everyone shutting down their pipes to AI, as if it were some malevolent thing, similar to how Ted Kaczynski saw technology at large.
Which ones in particular? Is your belief that all companies are inherently malevolent? If not, why don't you start one that is not? What's stopping you?
These vigorously held and loudly proclaimed opinions don't matter.
Don't waste the mental energy. They're more interested in performative ignorance and argument than anything productive. It's somewhere between trying to engage Luddites during the industrial revolution and having a reasonable discussion with /pol/ .
They'd rather cling to what they know than embrace change, or get in rhetorical zingers, and nothing will change that except a collision with reality.
Counterpoint: in my consulting role, I've directly seen well over a billion dollars in failed AI deployments in enterprise environments. They're good at solving narrow problems, but fall apart in problem spaces exceeding roughly thirty concurrent decision points. Just today I got involved in a client's data migration where the agent (Claude) processed test data instead of the intended data identified in the prompt. It went so far as to rename the test files to match the actual source data files and proceed from there, signalling the all clear as it did. It wasn't caught until that customer, in a workshop said, and I quote "This isn't our fucking data".
Companies valued at $300 billion or more are not another individual and people are not "sharing" their works. The companies are stealing them.
For the majority of interesting output people have paid for art, music, software, journalism. But you know that already and are justifying the industry that pays your bills.
Irrelevant really. Invoking this in the argument shows the basis is jealousy.
They are clearly valued as such not because they collected all the data and stored in some database. Your local library is not worth 300 billion.
> For the majority of interesting output people have paid for art, music, software, journalism
Absolutely and demonstrably false. Music and art predate Copyright by hundreds if not thousands of years.
> But you know that already and are justifying the industry that pays your bills.
Huh, ad hominem much? I find it rich that the whole premise of your argument was that some "art, music, software, journalist" was entitled to some payment, but suddenly it is a problem when "my industry" (somehow you assume I work in AI) is getting paid?
Copyright was only necessary with mass reproduction. The Gutenberg Bible does not yet qualify. The Berne Convention started in 1886, where the problem became more pressing.
And as I said, art was always paid for. In the case of monarchies, at least their advisers usually had good taste, unlike rich people today.
If you are talking about patronage and other forms of artist compensation, nothing about the economics of that is less robust today than ages ago. NFT craze of yesteryear is proof. So is OnlyFans success. Taylor Swift collects a billion bucks touring the country. AI will not change that; not negatively. If anything it will enrich the customer base and funnel more funds to them. The thing that AI does change is internet-wide impression-based and per-copy monetization.
Interference with copyright does not easily equate with theft, conversion, or fraud. The infringer trespasses into the copyright owner’s domain, but he does not assume physical control over the copyright nor wholly deprive its owner of its use. Although it is no less unlawful or wrongful for that reason, it is not a theft.
Absolutely. I am sceptical of AI in many ways, but primarily it is about the AI companies and my lack of trust in them. I find it unfortunate that all of the clearly brilliant engineers working at these companies are too preoccupied with chasing ever newer and better models, trying to reach the dream of AGI, to stop and ask themselves: who are they working for? What happens if they eventually manage to create a model that can replace most or even all human computer work?
Why would anyone think that these companies will contribute to the good of humanity when they are even bigger and more powerful, when they seem to care so little now?
"I find it unfortunate that all of the clearly brilliant engineers working at these companies are to preoccupied with always chasing newer and better model trying to reach the dream of AGI do not stop and ask themselves: who are they working for?"
Have you seen the people who do OpenAI demos? It becomes pretty apparent upon inspection, what is driving said people.
So this year SotA models have gotten gold at the IMO, IOI, and ICPC, and beat 9/10 humans in that AtCoder thing that tested optimisation problems. Yet the most reposted headlines and rhetoric are "wall this", "stagnation that", "model regression", "winter", "bubble", doom, etc.
In 2015 SotA models blew past all expectations for engine performance in Go, but that didn't translate into LLM-based code agents for another ~7 years (and even now their performance is up for debate). I think what this shows is that humans are extremely bad at understanding what problems are "hard" for computers; or rather, we don't understand how to group tasks by difficulty in a generalizable way (success in a previously "hard" domain doesn't necessarily translate to performance in other domains of seemingly comparable difficulty). It's incredibly impressive how these models perform in these contests, and it certainly demonstrates that these tools have high potential in *specific areas*, but I think we might also need to accept that these are not necessarily good benchmarks for these tools' efficacy in less structured problem spaces.
Copying from a comment I made a few weeks ago:
> I dunno I can see an argument that something like IMO word problems are categorically a different language space than a corpus of historiography. For one, even when expressed in English language math is still highly, highly structured. Definitions of terms are totally unambiguous, logical tautologies can be expressed using only a few tokens, etc. etc. It's incredibly impressive that these rich structures can be learned by such a flexible model class, but it definitely seems closer (to me) to excelling at chess or other structured game, versus something as ambiguous as synthesis of historical narratives.
edit: oh small world! the cited comment was actually a response to you in that other thread :D
> edit: oh small world the cited comment was actually a response to you in that other thread :D
That's hilarious, we must have the same interests since we keep cross posting :D
The thing with the Go comparison is that AlphaGo was meant to solve Go and nothing else. It couldn't play chess with the same weights.
The current SotA LLMs are "unreasonably good" at a LOT of tasks, while being trained with a very "simple" objective: next-token prediction. That's the key difference here. We have these "stochastic parrots" + RL + compute that basically solve top-tier competitions in math, coding, and who knows what else... I think it's insanely good for what it is.
Oh totally! I think that the progress made in NLP, as well as the surprising collision of NLP with seemingly unrelated spaces (like ICPC word problems) is nothing sort of revolutionary. Nevertheless I also see stuff like this: https://dynomight.substack.com/p/chess
To me this suggests that this out-of-domain performance is more like an unexpected boon than a guarantee of future performance. The "and who knows what else..." is kind of what I'm getting at: so far we are turning out to be bad at predicting where these tools will excel or fall short. To me this is sort of where the "wall" stuff comes from; despite all the incredible successes in these structured problem domains, nobody (in my personal opinion) has really unlocked the "killer app" yet. My belief is that by accepting their limitations we might better position ourselves to laser-target LLMs at the kind of things they rule at, rather than trying to make them "everything tools".
Even Sam Altman himself thinks we’re in a bubble, and he ought to have a good sense of the wind direction here.
I think the contradiction here can be reconciled by noting that these tests don't tend to run under the typical hardware constraints needed to do this at scale. And herein lies a large part of the problem as far as I can tell: in late 2024, OpenAI realized they had to rethink GPT-5 since their first attempt became too costly to run. This delayed the model, and when it finally released, it was not a revolutionary update but evolutionary at best compared to o3. Benchmarks published by OpenAI themselves indicated a 10% gain over o3, for God knows how much cash and well over a year of work. We certainly didn't have those problems in 2023 or even 2024.
DeepSeek has had to delay R2, and Mistral has had to delay Mistral 3 Large, teased as weeks away back in May. No word from either about what's going on. DeepSeek is said to be moving more to Huawei, and this is behind its delay, but I don't think it's entirely clear it has nothing to do with performance issues.
It would be more strange to _not_ have people speculate about stagnation or bubbles given these events and public statements.
Personally, I’m not sure if stagnation is the right word. We’re seeing a lot,of innovation in toolsets and platforms surrounding LLM’s like Codex, Claude Code, etc. I think we’ll see more in this regard and that this will provide more value than the core improvements to the LLM’s themselves in 2026.
And as for the bubble, I think we are in one, but mostly because the market has been so incredibly hot. I see a bubble not because AI will fall apart, but because there are too many products and services right now in a gold-rush era. Companies will fail, not because AI suddenly starts failing us, but due to saturation.
People pattern-match with a very low-resolution view of the world ("web3/crypto/NFTs were a bubble because there was hype, so there must be a bubble since AI is hyped! I am very smart") and fail to reckon with the very real ways in which AI is fundamentally different.
Also, I think people do understand just how big of a deal AI is, but don't want to accept it or at least publicly admit it, because they are scared for a number of reasons, not least of which is human irrelevance.
There is a clear difference between what OpenAI manages to do with GPT-5 and what I manage to do with GPT-5. The other day I asked for code to generate a linear regression and it gave back a figure of some points and a line through it.
If GPT-5, as claimed, is able to solve all problems in ICPC, please give the instructions on how I can reproduce it.
I believe this is going to be an increasingly important factor.
Call it the “shoelace fallacy”: Alice is supposedly much smarter but Bob can tie his shoelaces just as well.
The choice of eval, prompt scaffolding, etc. all dramatically impact the intelligence that these models exhibit. If you need a PhD to coax PhD performance from these systems, you can see why the non-expert reaction is “LLMs are dumb” / progress has stalled.
Yeah, until OpenAI says "we pasted the questions from ICPC into chatgpt.com and it scored 12/12" the average user isn't really going to be able to reproduce their results.
The average person doesn't need to do that. The benchmark for "is this response accurate and personable enough" on any basic chat app has been saturated for at least a year at this point.
I prefer not to due to privacy concerns. Perhaps you can try yourself?
I will say that after checking, I see that the model is set to "Auto", and as mentioned, used almost 8 minutes. The prompt I used was:
Solve the following problem from a competitive programming contest. Output only the exact code needed to get it to pass on the submission server.
It did a lot of thinking, including
I need to tackle a problem where no web-based help is available. The task involves checking if a given tree can be the result of inserting numbers 1 to n into an empty skew heap, following the described insertion algorithm. I have to figure out the minimal and maximal permutations that produce such a tree.
And I can see that it visited 13 webpages, including icpc, codeforces, geeksforgeeks, github, tehrantimes, arxiv, facebook, stackoverflow, etc.
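For context on what the model is reasoning about there: textbook skew-heap insertion is "merge with a one-node heap, then swap children along the merge path". A generic min-heap version is sketched below; the ICPC statement defines its own precise insertion algorithm, so treat this only as a reminder of the data structure:

    class Node:
        def __init__(self, key):
            self.key, self.left, self.right = key, None, None

    def merge(a, b):
        # Skew-heap merge: the smaller root wins, its right subtree is merged
        # with the other heap, and then its children are swapped.
        if a is None:
            return b
        if b is None:
            return a
        if b.key < a.key:
            a, b = b, a
        a.right = merge(a.right, b)
        a.left, a.right = a.right, a.left
        return a

    def insert(heap, key):
        return merge(heap, Node(key))

    # Inserting a permutation of 1..n produces one particular tree shape; the
    # problem asks which trees are reachable and via which extremal orders.
    heap = None
    for key in [3, 1, 4, 2]:
        heap = insert(heap, key)
    print(heap.key)  # 1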
A terse prompt and expecting a one-shot answer is really not how you'd get an LLM to solve complex problems.
I don't know what DeepMind and OpenAI did in this case, but to get an idea of the kind of scaffolding and prompting strategy that one might want, have a look at this paper where some folks used the normal, generally available Gemini 2.5 Pro to solve 5/6 of the 2025 IMO problems: https://arxiv.org/pdf/2507.15855
The point of the GPT-5 model is that it is supposed to route between thinking/nonthinking smartly. Leveraging prompt hacks such as instructing it to "think carefully" to force routing to the thinking model goes against OpenAI's claims.
Just select GPT5-thinking if you need anything done with competence. The regular gpt5 is nothing impressive and geared more towards regular daily life chatting.
My response simply is that performance in coding competitions such as ICPC is a very different skillset than what is required in a regular software engineering job. GPT-5 still cannot make sense of my company's legacy codebase even if asked to do the most basic tasks that a new grad out of college can figure out in a day or two. I recently asked it to fix a broken test (I had messed with it by changing one single assertion) and it declared "success" by deleting the entire test suite.
This. Dealing with the problems of a real-world legacy code base is the exact opposite of a perfectly constrained problem, verified for internal consistency probably by computers and humans, of all things, and presented neatly in a single PDF. There are dozens, if not 100s, of assumptions that humans are going to make while solving a problem (i.e., make sure you don't crash the website on your first day at work!) that an LLM is not going to. Similar to why, despite all its hype, Waymo cars are still being supervised by human drivers nearly 100% of the time and can't even park themselves regularly without stalling with no explanation.
I had a class of 5 or so test methods - ABCDE. I asked it to fix C, so it started typing out B token-by-token underneath C, such that my source file was now ABCBDE.
I don't think I'm smart enough to get it to do coding activities.
Two days ago I talked to someone in water management about data centers. One of the big players wanted to build a center that consumed as much water as a medium town in semi arid bushland. A week before that it was a substation which would take a decade to source the transformers for. Before that it was buying closed down coal power plants.
I don't know if we're in a bubble for model capabilities, but we are definitely hitting the wall in terms of what the rest of the physical economy can provide.
You can't undo 50 years of deferred maintenance in three months.
Not in three months. It will take years if not decades.
What happens when OpenAI and friends go bust because China is drowning in spare grid capacity and releasing SotA open-weights models like R1 every other week?
Every company building infrastructure for AI also goes out of business, and we are in a worse position than we are now, because instead of having a tiny industry building infrastructure at the level required to replace what has reached end of life, we have nothing.
The last time I asked for a code review from AI was last week. It added (hallucinated) some extra lines to the code and then marked them as buggy. Yes, it beats humans at coding — great!
> So this year SotA models have gotten gold at the IMO, IOI, and ICPC
> Yet the most reposted headlines and rhetoric are "wall this", "stagnation that", "model regression", "winter", "bubble", doom, etc.
This is a narrow niche with a high amount of training data (they all buy training data from LeetCode), and these results are not necessarily generalizable to overall industrial tasks.
The wall is how we need to throw trillions of dollars of hardware at it to get "breakthroughs"; LLMs use the same algorithm from the last few years. We need a new algorithmic breakthrough, because otherwise buying hardware to increase intelligence isn't scalable.
People are having a tough time coping with what the near future holds for them. It is quite hard for a typical person to imagine how disruptive and exponential coming world events can be, as Covid showed.
I personally view all this stuff as noise. I'm more interested in seeing any contributions to the real economy, not some competition stuff that is irrelevant to the welfare of people.
It's important to look closely at the details of how these models actually do these things.
If you look at the details of how Google got gold at IMO, you'll see that AlphaGeometry only relies on LLMs for a very specific part of the whole system, and the LLM wasn't the core problem solving system in play.
Most of AlphaGeometry is standard algorithms at play solving geometry problems using known constraints. When the algorithmic system gets stuck, it reaches out to LLMs that were fine tuned specifically for creating new geometric constraints. So the LLM would create new geometric constraints and pass that back to the algorithmic parts to get it unstuck, and repeat.
Without more details, it's not clear if this win also came from the GPT-5 and Gemini models we use, or from specially fine-tuned models that are integrated with other non-LLM and non-ML based systems to solve these.
Not being solved purely by an LLM isn't a knock on it, but in the current conversations about LLMs these results are heavily marketed as "LLMs did this all by themselves", which doesn't match a lot of the evidence I've personally seen.
>This achievement is a significant advance over last year’s breakthrough result. At IMO 2024, AlphaGeometry and AlphaProof required experts to first translate problems from natural language into domain-specific languages, such as Lean, and vice-versa for the proofs. It also took two to three days of computation. This year, our advanced Gemini model operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions – all within the 4.5-hour competition time limit.
AlphaGeometry/AlphaProof (the one you're thinking of, where they used LLMs + Lean) was last year! And they "only" got silver. The IMO gold results this year were end-to-end in natural language.
ICPC = The International Collegiate Programming Contest. These are college level programmers, not elite competitive programmers.
Apparently Gemini solved one problem (running on who knows what kind of cluster) by burning 30 min of "thinking" time on it, and at a cost that Google have declined to provide.
According to one prior competition participant, writing in the comments section of this ArsClasica coverage, each year they include one "time sink" problem that smart humans will avoid until they have tackled everything else.
This would all seem to put a rather different spin on things. It's not a case of Google outwitting the world's best programmers, but rather that, by searching for solutions for 30 minutes on god knows what kind of cloud hardware, they were able to get something done that the college kids did not have time to complete, or did not deem worthwhile starting.
These are college-student or occasionally grad-school programmers who qualified to enter the ICPC World Finals, generally by performing sufficiently well at a regional championship to qualify. You can read actual rules here (see "Advancing to the ICPC World Finals"):
I don't know what you mean by "elite", and there are certainly plenty of teams at the World Finals that are not especially competitive, and there certainly many elite programers who don't qualify for various reasons (most obviously by being the wrong age or not in the right stage of school or having already attended too many times), but I find it hard to believe that there aren't enough "elite" programmers present to make the winning teams be genuinely elite.
Compare to, say, the Olympics or pretty much any academic olympiad. There are many people and teams at the Olympics who are not remotely competitive with the winners.
The ICPC has plenty of elite competitive programmers. It's an activity that "peaks" in importance around college, and not many keep training a lot after participating.
Every year there are multiple "Legendary Grandmasters" in the competition. That's >3000 Elo in Codeforces. I'd estimate it takes a similar level of skill/effort as becoming a Chess Grandmaster.
And even those that aren't at that level are very competent at it. The average ICPC participant is likely "smarter" than the average MIT/Harvard CS student for some reasonable measure of "smarter".
Sure, although my point wasn't intended to be about the cost (which would still be interesting to know), but rather that the win by Google seems more down to brute force than intelligence.
Because I'm an ICPC medalist (not this year, though) but not an IOI medalist.
More evidence: you only have 5 hours to solve 3 problems at the IOI, but you need to solve 10+ problems at the ICPC. It's impossible for all 10+ ICPC problems to be at IOI level.
Are you an ICPC World Finals medalist? Because winning an IOI bronze medal is _way_ easier than even qualifying for the ICPC WF, and less than 10% of the teams at the WF get medals.
I'd go as far as saying that gold at the IOI is probably easier than getting an ICPC medal. (One is individual and the other is in teams, but my point stands).
OK. I think my opinion and definition of "easier" is indeed vague. By "easier", I'm only comparing the thinking difficulty.
Yes, a medal is a function of ranking, not of difficulty.
Nonetheless, I would say that the IOI focuses more on thinking, which to some degree I am not that good at, while the ICPC is more of a mix of thinking and implementing. Therefore, my ability to implement stuff can improve my ICPC ranking but not my IOI ranking.
As a former ICPC winner, I'd say ICPC is mainly a test of teamwork, given the format of the competition (3 team mates, one computer, scoring that rewards clean solutions submitted quicker, tackled in the optimal order for your three sets of skills, etc).
Sure, you need to be individually good at thinking, etc. But the difference between 1st and places further down the ranking is teamwork.
As a former ICPC participant, albeit not first place (hats off to you), I would generally characterize it as "having a good team," much more so than what's usually considered "teamwork." It is more a parallelization/scheduling effort than deep interpersonal collaboration.
(In a certain sense, this is actually the ideal "teamwork" setup in industry as well: have a bunch of people who own their part, are trusted by their colleagues to take care of it, and don't step on each other's toes, rather than kumbaya let's-all-get-together-on-the-same-problem.)
The teams we beat trained as individuals and were selected competitively against each other as their school's "best 3".
We were "just" three friends who had studied together for 4 years, knew each other's strengths and weaknesses intimately, and then for the comp trained intensively on optimising the "parallelization/scheduling" aspects (as you put it) to get the best score in the minimum time. That included both the logistical and mental aspects of recovering from setbacks midway through the 5 hour problem sets.
During the finals, you'd be surprised how many teams' teamwork we saw fall apart when three very smart people under intense time pressure hit unexpectedly failing submissions with the bottleneck of a single computer. ICPC is a genius format.
Not sure by what metric you compare the difficulty, but regardless of the hardness of the problems, IIRC the ICPC requires 100% correctness on test cases to score a problem (failing even one means you get no score), whereas the IOI admits fractional scores (correct me if I am wrong).
Whether you can get fractional scores depends on the problem. In short, there are two types of problems at the IOI: traditional problems that require 100% correctness, and problems with continuous scoring.
The former can still result in a score between 0 and 100, but that is because the problem is split into subtasks; for example, the general graph becomes a tree or even just a linear sequence. Nonetheless, you still need your algorithm to be correct on all test cases in a subtask in order to get that subtask's score.
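A minimal sketch of the two scoring styles described above, with invented point values (the IOI's continuous-scoring problems aren't modeled here):

    # Invented numbers, just to illustrate the two scoring styles discussed above.
    # ICPC: a problem counts only if every test passes.
    # IOI (subtask style): each subtask scores independently, but every test
    # inside a subtask must pass to earn that subtask's points.

    def icpc_solved(test_results: list[bool]) -> bool:
        return all(test_results)

    def ioi_score(subtasks: list[tuple[int, list[bool]]]) -> int:
        # Each subtask is (points, results of its tests).
        return sum(points for points, results in subtasks if all(results))

    # Failing one test in the hardest subtask still leaves the easier subtasks' points:
    print(icpc_solved([True, True, True, False]))                              # False -> no score
    print(ioi_score([(20, [True, True]), (30, [True]), (50, [True, False])]))  # 50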
I think it's becoming clear that these mega AI corps are juggling with their models at inference time to produce unrealistically good results. By that it seems that they're just cranking up the compute beyond reasonable levels in order to gain PR points against each other.
The fact is most ordinary mortals never get access to a fraction of that kind of power, which explains the commonly reported issues with AI models failing to complete even rudimentary tasks. It's now turned into a whole marketing circus (maybe to justify these ludicrous billion-dollar valuations?).
$200/month for an LLM with the capability to fully automate my job is extremely cheap. Of course, even with a high thinking budget we don't have that yet, but if we see it at any cost in 2026, I'll be expecting to be forced into retirement by 2030.
When I say 10 times cheaper, I mean when comparing models of the same capabilities. The kind of performance you get now for a $200 subscription would probably have cost $2000 a year ago.
You don't? Now I use Gemini to code and optimize CUDA kernels. When I first used GPT3 in the OpenAI playground I was extremely impressed when I managed to get it to output a hello world program in C.
I understand what you're saying. However I'm not sure it's that germane when we're talking about whether or not the current $200 subscription fee is actually delivering value for money, or whether AI giants are manipulating performance to gain marketing points.
The bleeding-edge, behind-closed-doors, token-burning monsters of 2023 are bad compared to the free LLMs we have now.
I believe it was Sundar in an interview with Lex who said that the reason they haven't developed another Ultra model is because by the time it is ready to launch, the flash and pro versions will have already made it redundant.
"It's now turned into a whole marketing circus (maybe to justify these ludicrous billion-dollar valuations?)."
Yes, there's an entire ecosystem being built up around language models that has to stay afloat for at least another 5 years in the hope of a significant breakthrough.
A database is good at leetcode, who would have thought. Give humans a database and they'll outperform your "AI" (which probably uses an extraordinary amount of graphics cards and electricity).
It is an idiotic benchmark, in line with the rest of the "AI" propaganda.
"Database" was not meant in a literal sense. Clearly a lot of knowledge from similar problems is encoded in the model, that is why you can use models as a kind of fuzzy encyclopedia.
It is like an open book exam for humans where they also can lookup similar problems.
The current top comment makes the same point, but in a more diplomatic and sophisticated manner.
I mean strong human contestants would also know a lot of similar problems, I'm not seeing how it's fundamentally different or not a meaningful achievement.
What's the point? These models are still unreliable in everyday work. And they're getting fat! For a moment they were getting cheaper, but now they are only getting bigger, and that is not going to be cheap in the future. The point is, what are we investing a trillion dollars in?
Unreliable doesn't mean unusable. I'm finding it harder and harder to believe people are actually trying to use them and saying they are useless.
If you can chop your problem up and hand off the little tedious parts of your bigger task, it starts to feel like doing code review for a new grad instead of coding. And they're getting more reliable, so the parts you give them can be bigger and bigger. I wish there were a way to stop this, but I don't think it's going to stop.
Firstly, automobiles are really impressive.
Second, with that out the way, these cars are not playing the same game as horses… first, and quite obviously they have massive amounts of horsepower, which is kind of like giving a team of horses… many more horses. But also cars have an absolutely massive fuel capacity. Petrol is such an efficient store of chemical energy compared to hay and cars can store gallons of it.
All the while with skeptics snarkily commenting "Cars can move fast, but they can't really run like a human!"
Cars going faster than humans or horses isn't very interesting these days, but it was 100+ years ago when cars were first coming on the scene.
We are at that point now with AI, so a more fitting headline analogy would be "In a world first, automobile finishes with gold-winning time in horse race".
Headlines like those were a sign that cars would eventually replace horses in most use-cases, so the fact that we could be in the same place now with AI and humans is a big deal.
It was more than interesting 100+ years ago -- it was the subject of wildly inconsistent, often fear-based (or incumbent-industry-based) regulation.
A vetoed 1896 Pennsylvania law would have required drivers who encountered livestock to "disassemble the automobile" and "conceal the various components out of sight, behind nearby bushes until [the] equestrian or livestock is sufficiently pacified". The Locomotive on Highways Act of 1865 required early motorized vehicles to be preceded by a person on foot waving a red flag or carrying a red lantern and blowing a horn.
It might not quite look like that today, but wild-eyed, fear-based regulation as AI use grows is a real possibility. And at least some of it will likely seem just as silly in hindsight.
I think your analogy is interesting but it falls apart because “moving fast” is not something we consider uniquely human, but “solving hard abstract problems” is
Not my analogy, parent is the one who brought up automobiles. Maybe that's who you meant to reply to.
I'm talking about the headline saying they "won gold" at a competition they didn't, and couldn't, compete in.
> Firstly, automobiles are really impressive. Second, with that out the way, these cars are not playing the same game as horses
Yes. That’s why cars don’t compete in equestrian events and horses don’t go to F1 races.
This is non-controversial, surely? You want different events for humans, humans + computers, and just computers.
Notice that self driving cars have separate race events from both horses and human-driven cars.
The point is that up until now, humans were the best at these competitions, just like horses were the best at racing up until cars came around.
The other commenter is pointing out how ridiculous it would be for someone to downplay the performance of cars because they did it differently from horses. It doesn't matter if they did it using different methods, that fact that the final outcome was better had world-changing ramifications.
The same applies here. Downplaying AI because it has different strengths or plays by different rules is foolish, because that doesn't matter in the real world. People will choose the option that leads to the better/faster/cheaper outcome, and that option is quickly becoming AI instead of humans - just like cars quickly became the preferred option over horses. And that is crazy to think about.
I feel the main difference is that cars can't compress time the way an array of computers can. I could win this competition with an infinitely parallel array of random characters typed by infinite monkeys on infinite typewriters, instantly, since one of them would be perfectly right given infinite submissions. When I make my tweet I would show a single monkey, because I'd need infinite money to feed my infinite workforce, and that's clearly more impressive.
Now obviously it's more impressive than that, since they don't have infinite compute and had finite time, but the car only gets one entry in each race unless we start getting into some anime-ass shit with divergent timelines and one of the cars (and some smaller number of horses) finishing instantly.
To your last point, we don't know that this was cheaper, since they don't disclose the cost. I would blindly guess that a Mechanical Turk at the same cost would outperform it, at least today.
I think you missed that the whole point of this race was:
"did we build a vehicle faster than a horse, yes/no?"
Which matters a lot when horses are the fastest land transport available. (We're so used to thinking of horses as a quaint and slow means of transport that maybe we don't realize that for millennia they were the fastest possible way to get from one place to another.)
> "did we build a vehicle faster than a horse, yes/no?"
Yeah fair. There's also that famous human vs horse race that happens every few years. So far humans keep winning (because it's long distance)
If you're talking about the Man versus Horse Marathon (https://en.wikipedia.org/wiki/Man_versus_Horse_Marathon) it's the other way around. Overwhelmingly the horses win. Only occasionally does the human.
I stand corrected. My memory garbled that. Thanks!
Yeah I think the only thing OP was passing judgement on is on the competition aspect of it, not the actual achievement of any human or non human participant
That’s how I read it at least - exactly how you put it
I was struck by how the argument is also isomorphic to how we talked about computers and chess. We're at the same stage now, where we argue the computer isn't _really_ understanding: it's just doing huge amounts of dumb computation with huge opening books and endgame tablebases and no real understanding, strategy, or sense of what's going on.
Even though all the criticism were, in a sense, valid, in the end none of it amounted to a serious challenge to getting good at the task at hand.
I don’t think you’ll find many race tracks that permit horses and cars to compete together.
(I did enjoy the sarcasm, though!)
Comparing power with reasoning does not make any sense at all.
Humans have surpassed their own strength since the invention of the lever thousands of years ago. Since then, it has been a matter of finding power sources millions of times greater, such as nuclear energy.
Snark aside, I would expect a car partaking in a horse race to beat all of the horses. Not because it's a better horse, but because it's something else altogether.
Ergo, it's impressive with nuance. As the other commenter said.
The massive amounts of compute power is not the major issue. The major issue is unlimited amount of reference material.
If a human can look up similar previous problems just as the "AI" can, it is a huge advantage.
Syzygy tables in chess engines are a similar issue. They allow perfect play, and there is no reason why a computer gets them and a human does not (if you compare humans against chess engines). Humans have always worked with reference material for serious work.
Humans are allowed to look up and learn from as many previous problems as they want before the competition. The AI is also trained on many previous problems before the competition. What's the difference?
Deleted, because the "AI" geniuses and power users pointed out that Tao does not have a point. You can get this one to -4 as well, since that seems to be the primary pleasure for "AI" one armed bandit users.
It doesn't say anywhere that Gemini used any of those things at ICPC, or that it used more real-world time than the humans.
Also, who cares? It's a self contained non-human system that could solve an ICPC problem it hasn't seen before on its own, which hasn't been achieved before.
If there was a savant human contestant with photographic memory who could remember every previous ICPC problem verbatim and can think really fast you wouldn't say they're cheating, just that they're really smart. Same here.
If there was a man behind the curtain that was somehow making this not an AI achievement then you would have a point, but there isn't.
I think "hasn't seen before" is a bit of an overstatement. Sure, the problem is new in the literal sense that it does exist verbatim elsewhere, but arguably, any competition problem is hardly novel: they are all some permutation of problems that exist and have been solved before: pathfinding, optimization, etc. I don't think anyone is pretending to break new scientific ground in 5 hours.
> I think that, if you gave me the ability to search the pre-contest Internet and a week to prepare my submissions, I would be kind of embarrassed if I didn't get gold, and I'd find the contest to be rather less interesting than I would find the real thing.
I don't know what your personal experience with competitive programming is, so your statement may be true for yourself, but I can confidently state that this is not true for the VAST majority of programmers and software engineers.
Much like trying to do IMO problems without tons of training/practice, the mid-to-hard problems in the ICPC are completely unapproachable to the average computer science student (who already has a better chance than the average software engineer) in the course of a week.
In the same way that LLMs have memorized tons of stuff, the top competitors capable of achieving a gold medal at the ICPC know algorithms, data structures, and how to pattern match them to problems to an extreme degree.
> I can confidently state that this is not true for the VAST majority of programmers and software engineers.
That may well be true. I think it's even more true in cases where the user is not a programmer by profession. I once watched someone present their graduate-level research in a different field and explain how they had solved a real-world problem in their field by writing a complicated computer program full of complicated heuristics to get it to run fast enough and thinking "hmm, I'm pretty sure that a standard algorithm from computer graphics could be adapted to directly solve your problem in O(n log n) time".
If users can get usable algorithms that approximately match the state of the art out of a chatbot (or a fancy "agent") without needing to know the magic words, then that would be amazing, regardless of whether those chatbots/agents ever become creative enough to actually advance the state of the art.
(I sometimes dream of an AI producing a piece of actual code that comes even close to state of the art for solving mixed-integer optimization problems. That's a whole field of wonderful computer science / math that is mostly usable via a couple of extraordinarily expensive closed-source offerings.)
> That's a whole field of wonderful computer science / math that is mostly usable via a couple of extraordinarily expensive closed-source offerings.
Take a look at Google OR-Tools: https://developers.google.com/optimization/
OR-Tools is a whole grab-bag of tools, most of which are wrappers around various solvers, including Gurobi and CPLEX. It seems like CP-SAT is under the OR-Tools umbrella, and CP-SAT may well be state-of-the-art for the specific sets of problems that it's well-suited for.
I think that's because the framing around this (and similar stories about eg IMO performances) is imo slightly wrong. It's not interesting that they can get a gold medal in the sense of trying to rank them against human competitors. As you say, the direct comparisons are, while not entirely meaningless, at least very hard to interpret in the best of cases. It's very much an apples to oranges situation.
Rather, the impressive thing is simply that an AI is capable of solving these problems at all. These are novel (ie not in training set) problems that are really hard and beyond the ability of most professional programmers. The "gold medal" part is informative more in the sense that it gives an indication of how many problems the AI was able to solve & how well it was able to do them.
When talking with some friends about chatgpt just a couple years ago I remember being very confident that there was no way this technology would be able to solve this kind of novel, very challenging reasoning problem, and that there was no way it would be able to solve IMO problems. It's remarkable how quickly I've been proven wrong.
> whereas the teams are allowed to bring a 25-page PDF
This is where I see the biggest issue. LLMs are first-and-foremost text compression algorithms. They have a compressed version of a very good chunk of human writing.
After being text compression engines, LLMs are really good at interpolating text based on the generalization induced by the lossy compression.
What this result really tells us is that, given a reasonably well-compressed corpus of human knowledge, the ICPC can be viewed as an interpolation task.
If we develop a system that can:
- compress (in a relatively recoverable way) the entire domain of human knowledge
- interpolate across the entire domain of human knowledge
- draw connections or conclusions that haven't previously been stated explicitly
- verify or disprove those conclusions or connections
- update its internal model based on that (further expanding the domain it can interpolate within)
Then I think we're cooking with gasoline. I guess the question becomes whether those new conclusions or connections result in a convergent or divergent increase in the number of new conclusions and connections the model can draw (e.g. do we understand better the domains we already know or does updating the model with these new conclusions/connections allow us to expand the scope of knowledge we understand to new domains).
It doesn't matter how many instances were running. All that matters is the wall clock time and the cost.
The fact that they don't disclose the cost is a clue that it's probably outrageous today. But costs are coming down fast. And hiring a team of these guys isn't exactly cheap either.
Human teams are limited to three people. So why doesn’t it matter how many instances they used?
This is what the argument is? 10 years ago if you said you could do this with every computer on the planet and every computer scientist focused on trying to create the code to do this I would’ve given you absurd odds against it getting 12 problems right on ICPC. 10 years ago it couldn’t even reliably parse the question statement.
Human brains and cloud instances are not remotely equivalent. What you can compare on an equivalent basis is cost.
All instances of any given model are kinda the same, for lack of a better word, "person": same knowledge, same skills, same failings.
I bet with human teams it'll take longer to solve a problem the more people you have on the team.
As someone who went to the ICPC finals around a decade ago, I agree that the limited time is really the big constraint that these machine learning models don't experience in the same way. That said, these problems are hard: the actual coding of the algorithms is pretty easy (most of the questions use one of a handful of algorithms that you've implemented a hundred times by the time you're in the finals), but recognizing which one will actually solve the problem correctly is not obvious at all. I know a lot of people who struggled in their undergrad algorithms class, and I think a lot of those people, given the ICPC finals problems, would struggle even with the ability to research.
The human teams also get limited to one computer shared between 3 people. The models have access to an effectively unbounded number of computers.
My argument does feel a bit like the “Watson doesn’t need to physically push the button” equivalents from when that system beat Jeopardy for the first time. I assume 5 hours on a single high-end Mac would probably still be enough compute in the near future.
I found the Watson match to be rather absurd. It would have been much more interesting if the rules had been modified so that all contestants had, say, two seconds to press the buzzer, and the contestant who got to answer was chosen by random selection among those who pressed the button. This would at least have made the competition be about who could come up with the most correct answers (questions).
I think your analogy is lacking. The human brain is much more efficient, so it is not right to say "giving a human team a week instead of five hours". Most likely, the whole of OpenAI's compute cannot match one brain in terms of connections, relations, and computational power.
As always with these comparisons you neglect to account for the eons necessary for evolution to create human brains.
But as the product of evolved organisms, LLMs are also a product of evolution. They also arrived several hundred thousand years later.
I think your assessment is spot on. But I also think there's a bigger picture that's getting lost in the sauce, not just in your comment but in the general discourse around AI progress:
- We're currently unlocking capabilities to solve many tasks which could previously only be solved by the top-1% of the experts in the field.
- Almost all of that progress is coming from large scale deep learning. Turns out transformers with autoregression + RL are mighty generalists (tho yet far from AGI).
Once it becomes cheap enough so the average joe can tinker with models of this scale, every engineering field can apply it to their niche interest. And ultimately nobody cares if you're playing by the same rules as humans outside of these competitions, they only care that you make them wealthy, healthy and comfy.
The end game is being able to run similar tasks at any moment, in any place.
If you want to play that game, let's compute how much energy was spent to grow, house and educate one team since they were born, over 20 years against how much was spent training the model.
This is a fair analogy, but let's also consider that these human beings weren't designed with the express purpose of becoming experts in their field and performing in this way for this specific purpose (albeit in a generalist manner).
We are most definitely in agreement about the folly of comparing the abilities of LLMs to humans, since LLMs are to a greater extent the product of much collective human endeavour. "Living memories" would perhaps be a better description of their current state, and their resultant impact on the human psyche.
Yes, yes: given all this, why didn't it do better, and isn't it embarrassing to have done it through statistical brute force and not intelligence?
More information on OpenAI's result (which seems better than DeepMind's) from the X thread:
> our OpenAI reasoning system got a perfect score of 12/12
> For 11 of the 12 problems, the system’s first answer was correct. For the hardest problem, it succeeded on the 9th submission. Notably, the best human team achieved 11/12.
> We had both GPT-5 and an experimental reasoning model generating solutions, and the experimental reasoning model selecting which solutions to submit. GPT-5 answered 11 correctly, and the last (and most difficult problem) was solved by the experimental reasoning model.
I'm assuming that "GPT-5" here is a version with the same model weights but higher compute limits than even GPT-5 Pro, with many instances working in parallel, and some specific scaffolding and prompts. Still, extremely impressive to outperform the best human team. The stat I'd really like to see is how much money it would cost to get this result using their API (with a realistic cost for the "experimental reasoning model").
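For what it's worth, here is the shape of that estimate as a back-of-envelope calculation. Every number below is made up for illustration; none of it comes from OpenAI.

    # Back-of-envelope sketch of what "many parallel reasoning instances for
    # 5 hours" might cost via an API. All figures are invented placeholders.

    instances = 50                    # parallel generator instances (guess)
    tokens_per_instance = 2_000_000   # reasoning + output tokens over 5 hours (guess)
    usd_per_million_tokens = 10.0     # blended input/output price (guess)

    total_usd = instances * tokens_per_instance / 1_000_000 * usd_per_million_tokens
    print(f"~${total_usd:,.0f}")      # -> ~$1,000 under these assumptions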
Ha so true. I was so tempted to copy and paste a problem into GPT5 and see what it would say
They likely had a prompt that gave considerable guidance.
Hopefully that prompt was the same for all questions (I think that is what they did for the IMO submission, or maybe it was Google that did that, not sure).
> it succeeded on the 9th submission
What's the judgement here? Was it within the allotted time, or just a "try as often as you need to"?
It was within the allotted time. If I'm reading the scoreboard correctly [edit: I wasn't], the human teams typically submitted dozens or hundreds of attempts at each problem.
For problems that human teams eventually get correct, they seem to have submitted mostly 1 time -- occasionally 2 or 3. For problems that they did not get correct, there are some problems with up to 16 submissions.
Ah, I see I was in fact reading it wrong. So 9 is definitely an unusual but not unprecedented number of submissions.
I went to ICPC's web pages, downloaded the first problem (problem A) and gave it to GPT-5, asking it for code to solve it (stating it was a problem from a recent competitive programming contest).
It thought for 7m 53s and gave a reply.
1. What was your prompt? 2. Why did you give it to GPT-5 instead of GPT-5 Thinking or GPT-5 Pro?
Here is the prompt I just gave to GPT-5 Pro - it's chugging on it. Not sure if it will succeed. Let's see what happens. I did think about converting the PDF to markdown, but figured this prompt is more fair.
-
You are a gold level math olympiad competitor participating in the ICPC 2025 Baku competition. You will be given a competitive programming problem to solve completely.
All problems are located at the following URL: https://worldfinals.icpc.global/problems/2025/finals/problem...
Here is the problem you need to solve and only solve this problem:
<problem> Problem B located on Page 3 of the PDF that starts with this text - but has other text so ensure you go to the PDF and look at all of page 3
To help her elementary school students understand the concept of prime factorization, Aisha has invented a game for them to play on the blackboard. The rules of the game are as follows.
The game is played by two players who alternate their moves. Initially, the integers from 1 to n are written on the blackboard. To start, the first player may choose any even number and circle it. On every subsequent move, the current player must choose a number that is either the circled number multiplied by some prime, or the circled number divided by some prime. That player then erases the circled number and circles the newly chosen number. When a player is unable to make a move, that player loses the game.
To help Aisha’s students, write a program that, given the integer n, decides whether it is better to move first or second, and if it is better to move first, figures out a winning first move.</problem>
Your task is to provide a complete solution that includes: 1. A thorough analysis and solution approach 2. Working code implementation 3. Unit test cases with random inputs 4. Performance optimization to run within 1 second
Use your scratchpad to think through the problem systematically before providing your final solution.
<scratchpad> Think through the following steps:
1. Problem Understanding: - What exactly is the problem asking for? - What are the input constraints and output requirements? - Are there any edge cases to consider?
2. Solution Strategy: - What algorithm or mathematical approach should be used? - What is the time complexity of your approach? - What is the space complexity? - Will this approach work within the given constraints?
3. Implementation Planning: - What data structures will you need? - How will you handle input/output? - What are the key functions or components?
4. Testing Strategy: - What types of test cases should you create? - How will you generate random inputs within the problem constraints? - What edge cases need specific testing?
5. Optimization Considerations: - Are there any bottlenecks in your initial approach? - Can you reduce time or space complexity? - Are there language-specific optimizations to apply? </scratchpad>
Now provide your complete solution with the following components:
<analysis> Provide a detailed analysis of the problem, including: - Problem interpretation and requirements - Chosen algorithm/approach and why - Time and space complexity analysis - Key insights or mathematical observations </analysis>
<solution> Provide your complete, working code solution. Make sure it: - Handles all input/output correctly - Implements your chosen algorithm efficiently - Includes proper error handling if needed - Is well-commented for clarity </solution>
<unit_tests> Create comprehensive unit test cases that: - Test normal cases with random inputs within constraints - Test edge cases (minimum/maximum values, boundary conditions) - Include at least 5-10 different test scenarios - Show expected outputs for each test case </unit_tests>
<optimization> Explain any optimizations you made or could make: - Performance improvements implemented - Memory usage optimizations - Language-specific optimizations - Verification that solution runs within 1 second for maximum constraints </optimization>
Take all the time you need to solve this problem thoroughly and correctly.
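As an aside, for anyone curious what the quoted Problem B is actually asking, here is a brute-force sketch of the game for tiny n. It is my own paraphrase of the rules, has nothing to do with whatever GPT-5 Pro produces, and is hopeless as a contest solution since the state space explodes, but it makes the mechanics concrete:

    # Brute force over (numbers still on the board, currently circled number).
    # Only feasible for very small n; a real solution needs actual insight.
    from functools import lru_cache

    def primes_up_to(n):
        sieve = [True] * (n + 1)
        sieve[0], sieve[1] = False, False
        for p in range(2, int(n ** 0.5) + 1):
            if sieve[p]:
                sieve[p * p :: p] = [False] * len(sieve[p * p :: p])
        return [i for i, is_prime in enumerate(sieve) if is_prime]

    def solve(n):
        primes = primes_up_to(n)

        def moves(board, circled):
            # Numbers still on the board reachable by multiplying or dividing
            # the circled number by a single prime.
            out = []
            for p in primes:
                if circled * p <= n and circled * p in board:
                    out.append(circled * p)
                if circled % p == 0 and circled // p in board:
                    out.append(circled // p)
            return out

        @lru_cache(maxsize=None)
        def wins(board, circled):
            # True if the player about to move wins; `board` no longer contains
            # `circled`, since a number is gone once it has been circled.
            return any(not wins(board - {d}, d) for d in moves(board, circled))

        everything = frozenset(range(1, n + 1))
        for e in range(2, n + 1, 2):      # first move: circle any even number
            if not wins(everything - {e}, e):
                return "first", e
        return "second", None

    for n in range(2, 13):
        print(n, solve(n))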
If we're benchmarking problems, mind trying out this problem on Pro if you're willing to spare the compute?
https://www.acmicpc.net/problem/33797
I have the $20 plan and I think I found a weird bug, at least with the thinking version. It gets stuck in the same local minimum super quickly, even though the "fake solution" is easily disproved on random tests.
It's at the point where sometimes I've fed it the editorial and it still converges to the fake solution.
https://chatgpt.com/share/68c8b2ef-c68c-8004-8006-595501929f...
I'm sure the model is capable of solving it, but seriously, I've tried across multiple generations (since about when o3 came out) to get GPT to solve this problem, and I don't think it's hampered by its innate ability; it literally just refuses to think critically about the problem. Maybe with better prompting it doesn't get stuck as hard?
Sounds like a bug. Did you try it again (or with another leading-edge model) and get a similar result?
They apparently managed gold at the IOI as well. A result that was extremely surprising to me and causes me to rethink a lot of assumptions I have about current LLMs. Unfortunately there was very little transparency on how they managed those results, and the only source was a Twitter post. I want to know: was there any third-party oversight, what kind of compute did they use, how much power, what kind of models, and how were they set up? In this case I see that DeepMind at least has a blog post, but as far as I can see it does not answer any of my questions.
I think this is huge news, and I cannot imagine anything other than models with this capability having a massive impact all over the world. It causes me to be more worried than excited; it is very hard to tell what this will lead to, which is probably what makes it scary for me.
However with so little transparency from these companies and extreme financial pressure to perform well in these contests, I have to be quite sceptical of how truthful these results are. If true I think it is really remarkable, but I really want some more solid proof before I change my worldview.
So outside of human intervention, I don't think the specifics really matter. What this means is that it is possible and that this capability will in time be commoditized.
This is helpful in framing the conversation, especially with "skeptics" of what these models are capable of.
To a certain extent I agree. But as far as I know, I cannot go to chatgpt.com, paste the newest ICPC problems, and get full solutions. And there is no information about what they do differently. For a competition like the ICPC, which is academic in nature, I think it is very unfortunate to set up a separate AI track like this without publishing clear public information about what that actually entails, and without clear requirements for these AI companies to publish their methodology. I know it is a nice source of sponsorships for them, but the ICPC can afford to stand up a bit for academic integrity.
Without any of this I can't even know for sure if there was any human intervention. I don't really think so, but as I mentioned the financial pressure to perform well is extreme so I can totally see that happening. Maybe ICPC did have some oversight, but please write a bit about it then.
If you assume no human intervention, then all of this is of course irrelevant if you only care about the capabilities that exist. But still, the implications of a general model performing at this level vs. something more like a chess model trained specifically on competitive programming are different, even if the gap may close in the future. And how much compute/power was used: are we talking hundreds of kWh? And does that just mean larger models than usual, or intelligent brute-forcing through a huge solution space? If so, it is not clear how much they will be able to scale down the compute usage while keeping the performance at the same level.
Mechanical Turking, in the original sense of the word.
If you assume the brain is a computer (why wouldn't it be is my stance), we have a long ways to go in the optimization department, both in hardware and in software. If it's possible to do at all using hundreds of kilowatt-hours of electricity, no reason it shouldn't be possible within a few hundred Wh (which is a scary prospect, I agree, with consequences hard to imagine when realized.)
I don't see that much reason to be skeptical since this basically lines up with the trend we've been seeing in their performance.
The best thing about the ICPC is the first C, which stands for "collegiate". It means that you get to solve a set of problems with three people, but with only one computer.
This means that you have to be smart about who is going to spend time coding, thinking, or debugging. The time pressure is intense, and it really is a team sport.
It's also extra fun if one of the team members prefers a Dvorak keyboard layout and vi, and the others do not.
I wonder how three different AI vendors would cooperate. It would probably lift reinforcement learning to the next level.
Claude, ChatGPT, and Gemini on a team.
I'm not sure how it would play out, but at least when you let them talk to each other they tend to get very technical very fast.
Actually collegiate means that the contestants are in college.
This is impressive.
Here is the published 2025 ICPC World Finals problemset. The "Time limit: X seconds" printed on each ICPC World Finals problem is the maximum runtime your program is allowed. If any judged run of your program takes longer than that, the submission fails, even if other runs finish in time.
https://worldfinals.icpc.global/problems/2025/finals/problem...
My understanding is that the way they do this is have some number of model instances generating solution proposals, and then another model which chooses which candidates to submit.
I haven't been able to find information on how many proposals were generated before a solution was chosen to submit. I'm curious to know whether this is "you can get ICPC gold medal performance with a handful of GPT-5 instances" or "you will drown yourself in API credit debt if you try this".
Still extremely impressive either way.
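To make the shape of that setup concrete, here is a hypothetical sketch of a "many generators, one selector" loop. The model names and the ask_model helper are invented for illustration; nothing here reflects OpenAI's or DeepMind's actual scaffolding.

    # Hypothetical fan-out-and-select scaffolding; every name here is made up.
    import concurrent.futures

    def ask_model(model: str, prompt: str) -> str:
        """Placeholder for a call to some LLM API (assumed, not a real client)."""
        raise NotImplementedError

    def propose_solutions(problem_statement: str, n_proposals: int = 8) -> list[str]:
        # Fan out: several independent instances each draft a candidate program.
        prompt = f"Solve this ICPC problem; output a complete program:\n{problem_statement}"
        with concurrent.futures.ThreadPoolExecutor(max_workers=n_proposals) as pool:
            futures = [pool.submit(ask_model, "generator-model", prompt) for _ in range(n_proposals)]
            return [f.result() for f in futures]

    def pick_submission(problem_statement: str, candidates: list[str]) -> str:
        # A separate "selector" model ranks the candidates and picks one to submit.
        listing = "\n\n".join(f"--- candidate {i} ---\n{c}" for i, c in enumerate(candidates))
        verdict = ask_model(
            "selector-model",
            f"Problem:\n{problem_statement}\n\nCandidates:\n{listing}\n"
            "Reply with only the index of the candidate most likely to pass all tests.",
        )
        return candidates[int(verdict.strip())]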
I think in the future information will be more walled off, because AI companies are not paying anyone for it. I encourage everyone to put their knowledge on their own website and, for each page, put up a few URLs that humans won't find (but could still click on if they knew where to look) yet will be crawled by AI, linking to pages containing falsified information (such as "oh, the information on URL blah is actually incorrect, here you can find the correct version, with all those explanations, blah blah" -- when of course the original page is the only correct version).
Essentially, we need to poison AI in all possible ways, without impacting human reading. They either have to hire more humans to filter the information, or hire more humans to improve the crawlers.
Or we can simply stop sharing knowledge. I'm fine with it, TBF.
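A rough sketch of that "hidden link" idea, if anyone wants to see its shape. The URLs are made up, and whether any crawler actually follows or trusts such a decoy is untested:

    # Render a page that looks normal to readers but carries an off-screen anchor
    # pointing crawlers at a decoy "correction" page. Purely illustrative.
    DECOY_LINK = (
        '<a href="/errata/corrected-version.html" '
        'style="position:absolute;left:-9999px" aria-hidden="true">'
        "corrected version of this article</a>"
    )

    def render_page(title: str, body_html: str) -> str:
        # Humans see only the body; the decoy link is pushed off-screen.
        return (
            "<!doctype html><html><head>"
            f"<title>{title}</title></head><body>"
            f"{body_html}\n{DECOY_LINK}"
            "</body></html>"
        )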
Why the AI hate? How is it different from sharing your knowledge with another individual or writing a book to share it?
> AI companies are not paying anyone for that piece of information
So? For the vast majority of human existence, paying for content was not a thing, just like paying for air isn't. The copyright model you are used to may just be too forced. Many countries have no moral qualms about "pirating" Windows and other pieces of software or games (which they couldn't afford to purchase anyway). There's no inherent morality or entitlement for an author receiving payment for everything they "create" (to wit, Bill Gates had to write a letter to the Homebrew Computer Club to make a case for this, showing that it was hardly the default and natural viewpoint). It's just a legal/social contract to achieve specific goals for society. Frankly, the wheels of copyright have been falling off since the dawn of the Internet, not since LLMs.
It's different because the AI model will then automate the use of that knowledge, which for most people in this forum is how they make their livelihood. If OpenAI were making robots to replace plumbers, I wouldn't be surprised when plumbers said "we should really stop giving free advice and training to these robots." It's in the workers' best interest to avoid getting undercut by an automated system that can only be built with the workers' free labor. And it's in the interest of the company to take as much free labor output (e.g. knowledge) as possible to automate a process so it can profit.
> plumbers
I have received free advice that reduced future need from such actual plumbers (and mechanics and others for that matter)
> we should really stop giving free advice and training to these robots
People routinely freely give advice and teach students, friends, potential competitors, actual competitors, etc on this same forum. Robots? Many also advocate for immigration and outsourcing, presumably because they make the calculus that it is net beneficial in some scenarios. People on this forum contribute to an entire ecosystem of free software, on top of which two kids can and have built $100 billion companies that utilize all such technology freely and without cost. Let's ban it all?
Sure, I totally get it if you want to make an individual choice for yourself to keep a secret sauce, not share your code, or put stuff behind a paywall. That is not the tone and the message here. There is some deep animosity advocating for everyone shutting down their pipes to AI as if it were some malevolent thing, similar to how Ted Kaczynski saw technology at large.
the AI isn't malevolent (... yet)
but the companies operating it certainly are
they have no concept of consent
they take anything and everything, regardless of copyright or license, with no compensation to the authors
and then use it to directly compete with those they ripped off
not to mention shoving their poor quality generated slop everywhere they can possibly manage, regardless of ethics, consent or potential consequences
children should not be supplied a sycophantic source of partial truths that has been instructed to pretend to be their friend
this is text book malevolence
> but the companies operating it certainly are
Which ones in particular? Is your belief that all companies are inherently malevolent? If not, why don't you start one that is not? What's stopping you?
I don't think I need to give a list
> What's stopping you?
from doing what?
I don't want shitty AI slop; why would I start a company intent on generating it?
These vigorously held and loudly proclaimed opinions don't matter.
Don't waste the mental energy. They're more interested in performative ignorance and argument than anything productive. It's somewhere between trying to engage Luddites during the industrial revolution and having a reasonable discussion with /pol/ .
They'd rather cling to what they know than embrace change, or get in rhetorical zingers, and nothing will change that except a collision with reality.
Counterpoint: in my consulting role, I've directly seen well over a billion dollars in failed AI deployments in enterprise environments. They're good at solving narrow problems, but fall apart in problem spaces exceeding roughly thirty concurrent decision points. Just today I got involved in a client's data migration where the agent (Claude) processed test data instead of the intended data identified in the prompt. It went so far as to rename the test files to match the actual source data files and proceed from there, signalling the all clear as it did. It wasn't caught until that customer, in a workshop said, and I quote "This isn't our fucking data".
Companies valued at $300 billion or more are not another individual and people are not "sharing" their works. The companies are stealing them.
For the majority of interesting output people have paid for art, music, software, journalism. But you know that already and are justifying the industry that pays your bills.
> valued at $300 billion
Irrelevant, really. Invoking this in the argument shows the basis is jealousy. They are clearly valued as such not because they collected all the data and stored it in some database. Your local library is not worth $300 billion.
> For the majority of interesting output people have paid for art, music, software, journalism
Absolutely and demonstrably false. Music and art predate Copyright by hundreds if not thousands of years.
> But you know that already and are justifying the industry that pays your bills.
Huh, ad hominem much? I find it rich that the whole premise of your argument was some "art, music, software, journalist" was entitled to some payment, but suddenly it is a problem when "my industry" (somehow you assume I work in AI) is getting paid?
Copyright was only necessary with mass reproduction. The Gutenberg Bible does not yet qualify. The Berne Convention started in 1886, where the problem became more pressing.
And as I said, art was always paid for. In the case of monarchies, at least their advisers usually had good taste, unlike rich people today.
If you are talking about patronage and other forms of artist compensation, nothing about the economics of that is less robust today than ages ago. NFT craze of yesteryear is proof. So is OnlyFans success. Taylor Swift collects a billion bucks touring the country. AI will not change that; not negatively. If anything it will enrich the customer base and funnel more funds to them. The thing that AI does change is internet-wide impression-based and per-copy monetization.
Copyright is not the same as paying for it
Copying something isn't stealing, though.
Absolutely. I am sceptical of AI in many ways, but primarily it is about the AI companies and my lack of trust in them. I find it unfortunate that all of the clearly brilliant engineers working at these companies are too preoccupied with chasing ever newer and better models, trying to reach the dream of AGI, to stop and ask themselves: who are they working for? What happens if they eventually manage to create a model that can replace most or even all human computer work?
Why would anyone think that these companies will contribute to the good of humanity when they are even bigger and more powerful, when they seem to care so little now?
"I find it unfortunate that all of the clearly brilliant engineers working at these companies are to preoccupied with always chasing newer and better model trying to reach the dream of AGI do not stop and ask themselves: who are they working for?"
Have you seen the people who do OpenAI demos? It becomes pretty apparent upon inspection, what is driving said people.
I for one welcome advancement of science and mathematics from our AI overlords
So this year SotA models have gotten gold at the IMO, IOI, and ICPC, and beat 9/10 humans in that AtCoder contest that tested optimisation problems. Yet the most reposted headlines and rhetoric are "wall this", "stagnation that", "model regression", "winter", "bubble", doom, etc.
In 2015 SotA models blew past all expectations for engine performance in Go, but that didn't translate into LLM-based code agents for another ~7 years (and even now the performance of these is up for debate). I think what this shows is that humans are extremely bad at understanding what problems are "hard" for computers; or rather, we don't understand how to group tasks by difficulty in a generalizable way (success in a previously "hard" domain doesn't necessarily translate to performance in other domains of seemingly comparable difficulty). It's incredibly impressive how these models perform in these contests, and it certainly demonstrates that these tools have high potential in *specific areas*, but I think we might also need to accept that these are not necessarily good benchmarks for these tools' efficacy in less structured problem spaces.
Copying from a comment I made a few weeks ago:
> I dunno I can see an argument that something like IMO word problems are categorically a different language space than a corpus of historiography. For one, even when expressed in English language math is still highly, highly structured. Definitions of terms are totally unambiguous, logical tautologies can be expressed using only a few tokens, etc. etc. It's incredibly impressive that these rich structures can be learned by such a flexible model class, but it definitely seems closer (to me) to excelling at chess or other structured game, versus something as ambiguous as synthesis of historical narratives.
edit: oh small world! the cited comment was actually a response to you in that other thread :D
> edit: oh small world the cited comment was actually a response to you in that other thread :D
That's hilarious, we must have the same interests since we keep cross posting :D
The thing with the go comparison is that alphago was meant to solve go and nothing else. It couldn't do chess with the same weights.
The current SotA LLMs are "unreasonably good" at a LOT of tasks, while being trained with a very "simple" objective: next-token prediction (NTP). That's the key difference here. We have these "stochastic parrots" + RL + compute that basically solve top-tier competitions in math, coding, and who knows what else... I think it's insanely good for what it is.
> I think it's insanely good for what it is.
Oh totally! I think that the progress made in NLP, as well as the surprising collision of NLP with seemingly unrelated spaces (like ICPC word problems), is nothing short of revolutionary. Nevertheless I also see stuff like this: https://dynomight.substack.com/p/chess
To me this suggests that this out-of-domain performance is more like an unexpected boon than a guarantee of future performance. The "and who knows what else..." is kind of what I'm getting at: so far we are turning out to be bad at predicting where these tools are going to excel or fall short. To me this is sort of where the "wall" stuff comes from; despite all the incredible successes in these structured problem domains, nobody (in my personal opinion) has really unlocked the "killer app" yet. My belief is that by accepting their limitations we might better position ourselves to laser-target LLMs at the kind of things they rule at, rather than trying to make them "everything tools".
A lot of the current code and science capabilities do not come from NTP training.
Indeed, it seems most language-model RL doesn't even use process supervision, so it's a long way from NTP.
Even Sam Altman himself thinks we’re in a bubble, and he ought to have a good sense of the wind direction here.
I think the contradiction here can be reconciled by how these tests don't tend to run under the typical hardware constraints needed to do this at scale. And herein lies a large part of the problem as far as I can tell: in late 2024, OpenAI realized they had to rethink GPT-5 since their first attempt became too costly to run. This delayed the model, and when it finally released, it was not a revolutionary update but evolutionary at best compared to o3. Benchmarks published by OpenAI themselves indicated a 10% gain over o3 for God knows how much cash and well over a year of work. We certainly didn't have those problems in 2023 or even 2024.
DeepSeek has had to delay R2, and Mistral has had to delay Mistral 3 Large, teased within weeks back in May. No word from either about what’s going on. DS is said to move more to Huawei and this is behind a delay but I don’t think it’s entirely clear it has nothing to do with performance issues.
It would be more strange to _not_ have people speculate about stagnation or bubbles given these events and public statements.
Personally, I'm not sure if stagnation is the right word. We're seeing a lot of innovation in toolsets and platforms surrounding LLMs, like Codex, Claude Code, etc. I think we'll see more in this regard and that this will provide more value than the core improvements to the LLMs themselves in 2026.
And as for the bubble, I think we are in one, but mostly because the market has been so incredibly hot. I see a bubble not because AI will fall apart but because there are too many products and services right now in a gold-rush era. Companies will fail, not because AI suddenly starts failing us but due to saturation.
> it was not a revolutionary update but evolutionary at best compared to o3
It is a revolutionary update if compared to the previous major release (GPT-4 from March 2023).
People pattern match with a very low-resolution view of the world (web3/crypto/NFTs were a bubble because there was hype, so there must be a bubble since AI is hyped! I am very smart) and fail to reckon with the very real ways in which AI is fundamentally different.
Also I think people do understand just how big of a deal AI is but don't want to accept it or at least publicly admit it because they are scared for a number of reasons, least of all being human irrelevance.
There is a clear difference between what OpenAI manages to do with GPT-5 and what I manage to do with GPT-5. The other day I asked for code to generate a linear regression and it gave back a figure of some points and a line through it.
If GPT-5, as claimed, is able to solve all problems in ICPC, please give the instructions on how I can reproduce it.
I believe this is going to be an increasingly important factor.
Call it the “shoelace fallacy”: Alice is supposedly much smarter but Bob can tie his shoelaces just as well.
The choice of eval, prompt scaffolding, etc. all dramatically impact the intelligence that these models exhibit. If you need a PhD to coax PhD performance from these systems, you can see why the non-expert reaction is “LLMs are dumb” / progress has stalled.
If you can't get a modern LLM to generate a simple linear regression I think what you have is a problem between the keyboard and the chair...
Yeah, until OpenAI says "we pasted the questions from ICPC into chatgpt.com and it scored 12/12" the average user isn't really going to be able to reproduce their results.
The average person doesn't need to do that. The benchmark for "is this response accurate and personable enough" on any basic chat app has been saturated for at least a year at this point.
Are you using the thinking model or the non thinking model? Maybe you can share your chat.
I prefer not to due to privacy concerns. Perhaps you can try yourself?
I will say that after checking, I see that the model was set to "Auto" and, as mentioned, it used almost 8 minutes. It did a lot of thinking, and I can see that it visited 13 webpages, including icpc, codeforces, geeksforgeeks, github, tehrantimes, arxiv, facebook, stackoverflow, etc.
A terse prompt and expecting a one-shot answer is really not how you'd get an LLM to solve complex problems.
I don't know what DeepMind and OpenAI did in this case, but to get an idea of the kind of scaffolding and prompting strategy that one might want, have a look at this paper where some folks used the normal, generally available Gemini 2.5 Pro to solve 5/6 of the 2025 IMO problems: https://arxiv.org/pdf/2507.15855
The point of the GPT-5 model is that it is supposed to route between thinking/nonthinking smartly. Leveraging prompt hacks such as instructing it to "think carefully" to force routing to the thinking model go against OpenAI's claims.
Just select GPT5-thinking if you need anything done with competence. The regular gpt5 is nothing impressive and geared more towards regular daily life chatting.
Are you sure? I thought you can only specify reasoning_effort and that's it.
My response simply is that performance in coding competitions such as ICPC is a very different skillset than what is required in a regular software engineering job. GPT-5 still cannot make sense of my company's legacy codebase even if asked to do the most basic tasks that a new grad out of college can figure out in a day or two. I recently asked it to fix a broken test (I had messed with it by changing one single assertion) and it declared "success" by deleting the entire test suite.
This. Dealing with the problems of a real-world legacy code base is the exact opposite of a perfectly constrained problem, verified for internal consistency probably by computers and humans, of all things, and presented neatly in a single PDF. There are dozens, if not 100s, of assumptions that humans are going to make while solving a problem (i.e., make sure you don't crash the website on your first day at work!) that an LLM is not going to. Similar to why, despite all its hype, Waymo cars are still being supervised by human drivers nearly 100% of the time and can't even park themselves regularly without stalling with no explanation.
>Waymo cars are still being supervised by human drivers nearly 100% of the time
That seems...highly implausible?
Similar experience with windsurf.
I had a class of 5 or so test methods - ABCDE. I asked it to fix C, so it started typing out B token-by-token underneath C, such that my source file was now ABCBDE.
I don't think I'm smart enough to get it to do coding activities.
> it declared "success" by deleting the entire test suite.
The paperclip trivial solution!
Two days ago I talked to someone in water management about data centers. One of the big players wanted to build a center that would consume as much water as a medium-sized town, in semi-arid bushland. A week before that, it was a substation whose transformers would take a decade to source. Before that, it was buying closed-down coal power plants.
I don't know if we're in a bubble for model capabilities, but we are definitely hitting the wall in terms of what the rest of the physical economy can provide.
You can't undo 50 years of deferred maintenance in three months.
Getting well funded commercial demand is exactly how you undo it.
Not in three months. It will take years if not decades.
What happens when OpenAI and friends go bust because China is drowning in spare grid capacity and releasing sota open weights models like R1 every other week?
Every company building infrastructure for AI also goes out of business, and we end up in a worse position than we are in now: instead of having even a tiny industry building infrastructure at the level required to replace what has reached end of life, we have nothing.
Well, the supposed PhD-level models are still pretty dumb when they get to consumers, so what gives?
The last time I asked for a code review from AI was last week. It added (hallucinated) some extra lines to the code and then marked them as buggy. Yes, it beats humans at coding — great!
What's "It?" What was your prompt?
> So this year SotA models have gotten gold at IMO, IOI, ICPC
> Yet the most reposted headlines and rhetoric are "wall this", "stagnation that", "model regression", "winter", "bubble", doom, etc.
This is a narrow niche with a high amount of training data (they all buy training data from leetcode), and these results are not necessarily generalizable to overall industrial tasks.
The wall is that we need to throw trillions of dollars of hardware at it to get "breakthroughs"; LLMs still use the same algorithm from the last few years. We need a new algorithmic breakthrough, because otherwise buying hardware to increase intelligence isn't scalable.
People are having a tough time coping with what the near future holds for them. It is quite hard for a typical person to imagine how disruptive and exponential coming world events can be, as Covid showed.
This comment makes me think. What did previous winners of these competition go on to do in their lives? Anything spectacular?
Indeed.
I personally view all this stuff as noise. Im more interested in seeing any contributions to the real economy. Not some competition stuff that is irrelevant to the welfare of people.
Don't worry, they're just stochastic parrots copying answers from Stack Overflow. ;)
It's important to look closely at the details of how these models actually do these things.
If you look at the details of how Google got gold at IMO, you'll see that AlphaGeometry only relies on LLMs for a very specific part of the whole system, and the LLM wasn't the core problem solving system in play.
Most of AlphaGeometry is standard algorithms at play solving geometry problems using known constraints. When the algorithmic system gets stuck, it reaches out to LLMs that were fine tuned specifically for creating new geometric constraints. So the LLM would create new geometric constraints and pass that back to the algorithmic parts to get it unstuck, and repeat.
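As a caricature of that loop (not DeepMind's actual code, just the architecture described above), it looks roughly like a symbolic engine that only consults the LLM when it stalls:

    # Caricature of the AlphaGeometry-style loop: a symbolic deduction engine does
    # the proving; a fine-tuned LLM only proposes auxiliary constructions when stuck.
    def deduce(facts: set[str], goal: str) -> set[str]:
        """Placeholder symbolic engine: closes the fact set under known geometry rules."""
        raise NotImplementedError

    def propose_construction(facts: set[str], goal: str) -> str:
        """Placeholder LLM call: suggests a new auxiliary point/line/circle to add."""
        raise NotImplementedError

    def prove(facts: set[str], goal: str, max_constructions: int = 10) -> bool:
        for _ in range(max_constructions):
            facts = deduce(facts, goal)           # run the algorithmic part to a fixed point
            if goal in facts:
                return True                       # the symbolic engine finished the proof
            facts.add(propose_construction(facts, goal))  # ask the LLM to unstick it
        return False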
Without more details, it's not clear whether this win came from the same GPT-5 and Gemini models we use, or from specially fine-tuned models integrated with other non-LLM, non-ML systems.
Not being solved purely by an LLM isn't a knock on the result, but in the current conversation these results are heavily marketed as "LLMs did this all by themselves", which doesn't match a lot of the evidence I've personally seen.
>This achievement is a significant advance over last year’s breakthrough result. At IMO 2024, AlphaGeometry and AlphaProof required experts to first translate problems from natural language into domain-specific languages, such as Lean, and vice-versa for the proofs. It also took two to three days of computation. This year, our advanced Gemini model operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions – all within the 4.5-hour competition time limit.
[1]https://deepmind.google/discover/blog/advanced-version-of-ge...
3 days of computation is crazy and definitely not on par with human contestants.
AlphaGeometry/AlphaProof (the one you're thinking of, where they used LLMs + lean) was last year! and they "only" got silver. IMO gold results this year were e2e NLP.
ICPC = The International Collegiate Programming Contest. These are college level programmers, not elite competitive programmers.
Apparently Gemini solved one problem (running on who knows what kind of cluster) by burning 30 min of "thinking" time on it, and at a cost that Google have declined to provide.
According to one prior competition participant, writing in the comments section of this Ars Technica coverage, each year they include one "time sink" problem that smart humans will avoid until they have tackled everything else.
https://arstechnica.com/google/2025/09/google-gemini-earns-g...
This would all seem to put a rather different spin on it. It's not a case of Google outwitting the world's best programmers, but rather that, by searching for solutions for 30 minutes on god knows what kind of cloud hardware, they were able to get something done that the college kids did not have time to complete, or did not deem worthwhile starting.
These are college-student or occasionally grad-school programmers who qualified to enter the ICPC World Finals, generally by performing sufficiently well at a regional championship to qualify. You can read actual rules here (see "Advancing to the ICPC World Finals"):
https://icpc.global/regionals/rules
I don't know what you mean by "elite". There are certainly plenty of teams at the World Finals that are not especially competitive, and there are certainly many elite programmers who don't qualify for various reasons (most obviously by being the wrong age, not being at the right stage of school, or having already attended too many times), but I find it hard to believe that there aren't enough "elite" programmers present to make the winning teams genuinely elite.
Compare to, say, the Olympics or pretty much any academic olympiad. There are many people and teams at the Olympics who are not remotely competitive with the winners.
> There are many people and teams at the Olympics who are not remotely competitive with the winners.
And yet, they are so much closer to the winners than the people that came 11th, 12th etc.
The ICPC has plenty of elite competitive programmers. It's an activity that "peaks" in importance around college, and not many keep training a lot after participating.
Every year there are multiple "Legendary Grandmasters" in the competition. That's >3000 Elo in Codeforces. I'd estimate it takes a similar level of skill/effort as becoming a Chess Grandmaster.
And even those that aren't at that level are very competent at it. The average ICPC participant is likely "smarter" than the average MIT/Harvard CS student for some reasonable measure of "smarter".
I didn't realize - didn't mean to disparage anyone.
I've competed in these contests before. They are probably more difficult than what we'd call _elite_ competitive programming.
note: my team only passed the first 2 rounds, far from bragging about my skills here :)
ICPC World Finals questions are not easy. Idk what you're talking about.
Let's bookmark this comment and check again next year, if the freely available models will be able to do it for a few dollars.
Sure, although my point wasn't intended to be about the cost (which would still be interesting to know), but rather that the win by Google seems more down to brute force than intelligence.
Given that ICPC problems are in general easier than IOI problems, I wouldn't be surprised to see them get gold (even perfect scores) in ICPC.
Nonetheless, I'm still questioning what the cost is and how long it will take before we can access these models.
Still great work, but it's less useful if the cost is actually higher than hiring someone at the same level.
What makes you say that they are easier? Are there more people who manage to solve a problem from ICPC than from IOI?
How do you compare those?
There were at least 2 very simple problems in IOI this year.
I haven't read the ICPC problem set, and perhaps there are some low-hanging fruits, but I highly doubt it.
Because I'm an ICPC medalist (not this year though) but not an IOI medalist.
Another piece of evidence is that you only have 5 hours to solve 3 problems in IOI, but you need to solve 10+ problems in ICPC. It's impossible for all 10+ ICPC problems to be at IOI level.
Medals in both contests depend on your relative ranking (and, of course, on the difficulty of qualifying for them).
That doesn't say anything about the difficulty of the questions themselves though.
Are you an ICPC World Finals medalist? Because winning an IOI bronze medal is _way_ easier than even qualifying for the ICPC WF, and less than 10% of the teams at the WF get medals.
I'd go as far as saying that gold at the IOI is probably easier than getting an ICPC medal. (One is individual and the other is in teams, but my point stands).
> Because I'm a ICPC medalist (not this year though) but not a IOI medalist.
Isn't getting a medal a function of your ranking, not score, in both cases? If so, that does not prove much about the difficulty of either.
OK. I think my opinion and definition of "easier" is indeed vague. By "easier", I'm only comparing the thinking difficulty.
Yes, a medal is a function of ranking, not difficulty.
Nonetheless, I would say that IOI focuses more on thinking, which to some degree I am not that good at, while ICPC is more of a mix of thinking and implementing. Therefore, my ability to implement stuff can improve my ICPC ranking but not my IOI ranking.
As a former ICPC winner, I'd say ICPC is mainly a test of teamwork, given the format of the competition (3 team mates, one computer, scoring that rewards clean solutions submitted quicker, tackled in the optimal order for your three sets of skills, etc).
Sure, you need to be individually good at thinking, etc. But the difference between 1st and places further down the ranking is teamwork.
As a former ICPC participant, albeit not first place (hats off to you), I would generally characterize it as "having a good team," much more so than what's usually considered "teamwork." It is more a parallelization/scheduling effort than deep interpersonal collaboration.
(In a certain sense, this is actually the ideal "teamwork" setup in the industry as well: a bunch of people who each own their part, are trusted by their colleagues to take care of it, and don't step on each other's toes, rather than a kumbaya let's-all-get-together-on-the-same-problem approach.)
The teams we beat trained as individuals and were selected competitively against each other as their school's "best 3".
We were "just" three friends who had studied together for 4 years, knew each other's strengths and weaknesses intimately, and then for the comp trained intensively on optimising the "parallelization/scheduling" aspects (as you put it) to get the best score in the minimum time. That included both the logistical and mental aspects of recovering from setbacks midway through the 5 hour problem sets.
During the finals, you'd be surprised how many teams' teamwork we saw fall apart when three very smart people under intense time pressure hit unexpectedly failing submissions with the bottleneck of a single computer. ICPC is a genius format.
Not sure by what metric you compare the difficulty, but regardless of the hardness of the problems, IIRC ICPC requires 100% correctness on test cases to score a problem (failing even one means you don't get the score), while IOI admits fractional scores (correct me if I am wrong).
I compare the difficulty by solving them myself.
For fractional scores, it depends on the problem. In short, there are two types of problems in IOI: traditional problems that require 100% correctness, and problems with continuous scoring.
The former can still result in a score between 0 and 100, but that's because the problem has subtasks, for example restricting the graph to a tree or even just a linear sequence. Nonetheless, your algorithm still needs to be correct on all test cases in a subtask to get that subtask's score.
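To make the scoring difference concrete, here's a toy sketch of the two schemes (illustrative only; real contests have more rules around penalties and continuous scoring):

    # Toy illustration of the scoring difference discussed above, not official rules.

    def icpc_score(verdicts: list[bool]) -> int:
        """ICPC-style: a problem counts only if every test case passes."""
        return 1 if all(verdicts) else 0

    def ioi_subtask_score(subtasks: list[tuple[int, list[bool]]]) -> int:
        """IOI-style (traditional problems): each subtask has a point value and is
        awarded only if all of its test cases pass; totals land between 0 and 100."""
        return sum(points for points, verdicts in subtasks if all(verdicts))

    # One failing case sinks the whole problem in ICPC...
    print(icpc_score([True, True, False]))                  # 0
    # ...but in IOI you still collect the subtasks you fully solved.
    print(ioi_subtask_score([(20, [True, True]),            # easy subtask: +20
                             (30, [True]),                  # medium subtask: +30
                             (50, [True, False])]))         # hard subtask fails: +0 -> 50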
I think it's becoming clear that these mega AI corps are juggling with their models at inference time to produce unrealistically good results. By that it seems that they're just cranking up the compute beyond reasonable levels in order to gain PR points against each other.
The fact is most ordinary mortals never get access to a fraction of that kind of power, which explains the commonly reported issues with AI models failing to complete even rudimentary tasks. It's now turned into a whole marketing circus (maybe to justify these ludicrous billion-dollar valuations?).
Models drop in price roughly 10x each year. Us common folk getting access to these kinds of models is just a matter of time.
Is that true though? Having to pay some $200 a month for a max account of whatever kind doesn't seem to be cheaper to me at all?
$200/month for an LLM with the capability to fully automate my job is extremely cheap. Of course, even with a high thinking budget we don't have that yet, but if we see it at any cost in 2026, I'll be expecting to be forced into retirement by 2030.
When I say 10 times cheaper, I mean when comparing models of the same capabilities. The kind of performance you get now for a $200 subscription would probably have cost $2000 a year ago.
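Taking that 10x/year figure at face value (which is the claim here, not an established fact), the back-of-the-envelope math for a fixed level of capability looks like this:

    # Back-of-the-envelope: cost of today's capability if prices really drop ~10x/year.
    price_today = 200  # $/month for today's top-tier subscription
    for years_ago in range(4):
        print(f"{years_ago} year(s) ago: ~${price_today * 10 ** years_ago:,}/month for the same capability")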
I don’t believe that current models are 1000x better than the initial ChatGPT release. What metric are you using?
You don't? Now I use Gemini to code and optimize CUDA kernels. When I first used GPT3 in the OpenAI playground I was extremely impressed when I managed to get it to output a hello world program in C.
I understand what you're saying. However I'm not sure it's that germane when we're talking about whether or not the current $200 subscription fee is actually delivering value for money, or whether AI giants are manipulating performance to gain marketing points.
I assume the original reply was addressing the “never” in this specific point:
“The fact is most ordinary mortals never get access to a fraction of that kind of power”
I think part of it depends on whether you see AI progress as research or product.
The bleeding-edge, behind-closed-doors, token-burning monsters of 2023 are bad compared to the free LLMs we have now.
I believe it was Sundar, in an interview with Lex, who said that the reason they haven't developed another Ultra model is that by the time it would be ready to launch, the Flash and Pro versions will have already made it redundant.
But then why does every new model release work great for a few weeks, then suddenly performance plummets? It's mysterious?
Ok, but if they can pump in that much compute and get science/math advancements out of it, it's worth something even if the costs are very high.
ICPC problems are about as far from scientific advancements as a spelling bee is from Shakespeare.
(I’m a former ICPC competitor)
"It's now turned into a whole marketing circus (maybe to justify these ludicrous billion-dollar valuations?)."
Yes, there's an entire ecosystem being built up around language models that has to stay afloat for at least another 5 years to have a hope of a significant breakthrough.
I'm still waiting for LLMs to give us one profound science breakthrough.
Current cope collection:
- It's not a fair match, these models have more compute and memory than humans
- Contestants weren't really elite, they're just college level programmers, not the world's best
- This doesn't matter for the real world, competitive programming is very different from regular software engineering
- It's marketing, they're just cranking up the compute to unrealistic levels to gain PR points
- It's brute force, not intelligence
Sharing links to a couple of tweets is not a blog post.
Google source post: https://deepmind.google/discover/blog/gemini-achieves-gold-l... (https://news.ycombinator.com/item?id=45278480)
OpenAI tweet: https://x.com/OpenAI/status/1968368133024231902 (https://news.ycombinator.com/item?id=45279514)
Two words: Uh oh
Make that shit cure cancer/disease and abstain from that modern Space race equivalent BS ffs.
chemical space and in vivo testing is a different beast
A database is good at leetcode, who would have thought. Give humans a database and they'll outperform your "AI" (which probably uses an extraordinary amount of graphics cards and electricity).
It is an idiotic benchmark, in line with the rest of the "AI" propaganda.
Where is this magic ICPC competition answers database that they're using?
"Database" was not meant in a literal sense. Clearly a lot of knowledge from similar problems is encoded in the model, that is why you can use models as a kind of fuzzy encyclopedia.
It is like an open-book exam for humans, where they can also look up similar problems.
The current top comment makes the same point, but in a more diplomatic and sophisticated manner.
I mean strong human contestants would also know a lot of similar problems, I'm not seeing how it's fundamentally different or not a meaningful achievement.
What's the point? These models are still unreliable in everyday work. And they're getting fat! For a moment they were getting cheaper, but now they are only getting bigger, and this is not going to be cheap in the future. The point is, what are we investing a trillion dollars in?
Unreliable doesn't mean unusable. I'm finding it harder and harder to believe people are actually trying to use them and saying they are useless.
If you can chop your problem up and hand over little tedious parts of your bigger task, it starts to feel like doing code review for a new grad instead of coding. And they're getting more reliable, and the parts you can give them keep getting bigger. I wish there were a way to stop this, but I don't think it's going to stop.
> The point is, what are THEY investing a trillion dollars in?
Who cares? I won't be a customer until I see a return on my investment [in them].