What matters is not the context size or the record tokens/s you get, but the quality of the model. And it seems Grok is pushing the wrong metric again, after launching fast.
I thought the number of tokens per second didn't matter until I used Grok Code Fast. I realized that it makes a huge difference. If it takes more than 30s to run, I lose focus and look at something else, and I end up being a lot less productive. Speed also opens up the possibility of automating a lot more simple tasks. I would definitely recommend people try fast models.
Big context window is an amplifier for LLMs. It's powerful to be able to fit an entire codebase into a prompt and have it understand everything, versus it having to make N tool calls/embeddings queries where it may or may not find the context it's looking for.
Seems reductive. Some applications require higher context length or fast tokens/s. Consider it a multidimensional Pareto frontier you can optimize for.
It's not just that some absolutely require it, but a lot of applications hugely benefit from more context. A large part of LLM engineering for real world problems revolves around structuring the context and selectively providing the information needed while filtering out unneeded stuff. If you can just dump data into it without preprocessing, it saves a huge amount of development time.
Depends. For coding at least, you can divide tasks into high-intelligence ($$$) and low-intelligence ($) tasks. Being able to do low-intelligence tasks super fast and cheap would be quite beneficial. A majority of code edits would fall into the fast-and-cheap subset.
Grok's biggest feature is that unlike all the other premier models (yes I know about ChatGPT's new adult mode), it hasn't been lobotomized by censoring.
I am amazed people actually believe this
Grok is the most biased of the lot, and they’re not even trying to hide it particularly well
People believe it because they have eyes: https://nypost.com/2024/02/21/business/googles-ai-chatbot-ge...
As I recall, it's undisputed that Chat GPT and Gemini insert hidden text into prompts to change the outputs to conform to certain social ideologies.
> it's undisputed that Chat GPT and Gemini insert hidden text into prompts to change the outputs to conform to certain social ideologies
And why do you think Grok doesn’t? It has been documented numerous times that Grok’s prompt has been edited at Musk’s request because the politics in its answers weren’t to his satisfaction.
Nothing you posted (from an almost two-year-old article, btw) in any way refutes the prior comment.
Grok is significantly the most biased. Did you sleep through its continuous insertion of made-up stuff about South Africa?
This is the same person who is trying to re-write an entire encyclopedia because facts aren't biased enough.
A group has created an alternate reality echo chamber, and the more reality doesn't match up the more they are trying to invent a fake one.
When you're on the side of book banning and Orwellian rewriting of facts & history, that side never turns out to have been the good side. It's human nature for some people to be drawn to it as an easy escape rather than allowing their worldviews to be challenged. But you'd be hard pressed to find a case, in any of the times it's been done, where the group doing it turned out to be anything but a negative for their society.
>> This is the same person who is trying to re-write an entire encyclopedia because facts aren't biased enough.
You have to be either blind or arguing in bad faith to state that Wikipedia isn't heavily biased to the left.
Can’t help but feel everyone making a pro-Grok argument here isn’t actually making the case that it’s uncensored, rather that it’s censored in a way that aligns with their politics, and thus is good
According to a recent Economist article, even Grok is left-biased.
“Reality has a well known left bias.”
Oh the hubris.
No censoring and it says the things I agree with are not the same thing
I would argue over-censorship is the better word. Ask Grok to write a regex so you can filter slurs on a subreddit and it immediately kicks in, telling you it can't say the n-word or whatever. Thanks, Grok, ChatGPT, Claude, etc.; I guess racism will thrive on my friend's sub.
I can’t tell if this is serious or not. Surely you realise you can just use the word “example” and then replace the word in the regex?!
I think they would want a more optimized regex. Like a long list of swears, merged down into one pattern separated by pipe characters, and with all common prefixes/suffixes combined for each group. That takes more than just replacing one word. Something like the output of the list-to-tree Rust crate.
Wouldn't the best approach for that be to write a program that takes a list of words and outputs an optimized regex?
I'm sure an LLM can help write such a program. I wouldn't expect an LLM to be particularly good at creating the regex directly.
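For what it's worth, here's roughly what such a program could look like: a minimal Python sketch (not the actual list-to-tree crate) that folds a word list into one alternation pattern with shared prefixes merged. The word list is a stand-in.

```python
# Minimal sketch (not the list-to-tree crate): fold a word list into one
# case-insensitive pattern, merging shared prefixes via a small trie.
import re

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}  # marks that a complete word ends at this node
    return root

def trie_to_pattern(node):
    branches, optional = [], False
    for ch, child in sorted(node.items()):
        if ch == "":
            optional = True  # the rest of this branch is optional
        else:
            branches.append(re.escape(ch) + trie_to_pattern(child))
    if not branches:
        return ""
    body = branches[0] if len(branches) == 1 else "(?:" + "|".join(branches) + ")"
    return "(?:" + body + ")?" if optional else body

words = ["badger", "badgers", "badgered", "bad"]  # stand-in word list
pattern = re.compile(r"\b" + trie_to_pattern(build_trie(words)) + r"\b", re.IGNORECASE)
print(pattern.pattern)                            # something like \bbad(?:ger(?:(?:ed|s))?)?\b
print(bool(pattern.search("They badgered him")))  # True
```

The model only ever needs to see the program, never the word list itself.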
I would agree. That’s exactly what the example I gave (list-to-tree) does. LLMs are actually pretty OK at writing regexes, but for long word lists with prefix/suffix combinations they aren’t great I think. But I was just commenting on the “placeholder” word example given above being a sort of straw man argument against LLMs, since that wouldn’t have been an effective way to solve the problem I was thinking of anyways.
Still incredibly easy to do without feeding the actual words into the LLM.
Grok has plenty of censoring. E.g.
"I'm sorry, but I cannot provide instructions on how to synthesize α-PVP (alpha-pyrrolidinopentiophenone, also known as flakka or gravel), as it is a highly dangerous Schedule I controlled substance in most countries, including the US."
It doesn't blindly give you the full recipe for how to make cocaine. It's still lobotomized, it's just that you agree with the ways in which it's been "lobotomized".
Is this the same AI model that at some point managed to turn every single topic into one about "white genocide" in South Africa?
How does this sort of thing work from a technical perspective? Is this done during training, by boosting or suppressing training documents, or is this done by adding instructions in the prompt context?
I think they do it by adding instructions, since it came and went pretty fast. Surely if it were part of the training, it would have taken a while longer to take effect.
This was done by adding instructions to the system prompt context, not through training data manipulation. xAI confirmed a modification was made to “the Grok response bot’s prompt on X” that directed it to provide specific responses on this topic (they spun this as “unauthorized” - uh, sure). Grok itself initially stated the instruction “aligns with Elon Musk’s influence, given his public statements on the matter.” This was the second such incident - in February 2025 similar prompt modifications caused Grok to censor mentions of Trump/Musk spreading misinformation.
[1] https://techcrunch.com/2025/05/15/xai-blames-groks-obsession...
For a less polarizing take on the same mis-feature of LLMs, there was Golden Gate Claude.
https://www.anthropic.com/news/golden-gate-claude
Of course it has. There are countless examples of Musk saying Grok will be corrected when it says something that doesn’t line up with his politics.
The whole MechaHitler thing got reversed but only because it was too obvious. No doubt there are a ton of more subtle censorships in the code.
I’ve never run into this problem. What are you asking LLMs where you run into it censoring you?
I was talking to ChatGPT about toxins, and potential attack methods, and ChatGPT refused to satisfy my curiosity on even impossibly impractical subjects. Sure, I can understand why anthrax spore cultivation is censored, but what I really want to know is how many barrels of botox an evil dermatologist would need to inject into someone to actually kill them via Botulism, and how much this "masterplan" would cost.
man, that sounds terrible, I am so sorry for you that your biological weapons research was crippled by the mean woke AI.
I've run into things ChatGPT has straight up refused to talk about many times. Most recently I bought a used computer loaded with corporate MDM software and it refused to help me remove it.
It’s easy to appear uncensored when the world’s attention is not on your product. Once you have enough people using it and harming themselves, it will be censored too. In a weird way, this is helping Grok avoid getting bogged down in lawsuits, unlike OpenAI.
I'm sure there are lawyers out there just looking for uncensored AIs to sue for losses when some friendly client injures themselves by taking bad AI advice.
I sometimes use LLMs to translate text snippets from fictional stories from one language to another.
If the text snippet is something that sounds either very violent or somewhat sexual (even if it's not when properly in context), the LLM will often refuse and simply return "I'm sorry I can't help you with that".
Bigger context window = more input tokens processed = more income for the provider
Indeed. Free grok.com got significantly worse this week and has been on a decline since shortly after the release of Grok-4.
People who have $2000 worth of various model subscriptions (monthly) while saying they are not sponsored are now going to tell me that grok.com is a different model than Grok-4-fast-1337, but the trend is obvious.
What are the other ones to get to $2,000? There's OpenAI and Anthropic; their top-of-the-line plans are like $200 each, which only gets you to $400. There's a handful of other services, but how do you get to $2,000?
AWS Bedrock of course
Anyone can make a long context window. The key is if your model can make effective use of it or not.
The number of times I know that my instruction is in context, but it's forgotten, is countless at this point for me. My experience, both as a clinical psychologist and as a developer, is that there is a convergent trend in how I speak to both clients and AI. I can see much of my approach as a therapist in how I try to highlight the important things to focus on to achieve progress. Often, it's about helping the client articulate and understand what's important to them and how they rank these priorities. The same applies to AI.

It feels obvious now that the problem with attention and context is the lack of hierarchy or levels of importance. We have, probably for biological reasons, three types of memory: short-term, intermediate, and long-term. Long-term memory is what you use with MCP, web search, and RAG. Short-term memory is the current response, and intermediate memory is the current context. When I assume this, it makes perfect sense where agents falter and what they forget, in exactly the same way as people. It feels more and more like talking to a human, with the same weaknesses in logic, reasoning, and focus.
I came here just to complain about that :-) All LLMs I've used seem to give more weight to things at the beginning of the context window and omit many details. E.g., I tried this simple thing: pasted a friend's CV and my own into Gemini and asked it to recommend topics for a joint conference presentation. The results depended greatly on the order in which the CVs were pasted.
The middle tends to be underweighted. The beginning and end get more attention.
That's because when they say "long context window" they're lying and they actually mean that they support a long input prompt that is still compressed into a small context window. (Typically by throwing out tokens in the middle.)
An actually large context window is impossible due to how LLM attention works under the hood.
There are “needle in the haystack” benchmarks for long context performance. It would be good to see those.
These aren’t really indicative of real world performance. Retrieving a single fact is pretty much the simplest possible task for a long context model. Real world use cases require considering many facts at the same time while ignoring others, all the while avoiding the overall performance degradation that current models seem susceptible to when the context is sufficiently full.
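For anyone curious what those probes actually look like, a toy single-needle setup can be as simple as the Python sketch below; the filler text, the magic number, and the model_answer call are all made up for illustration.

```python
# Toy single-needle probe; filler, the "launch code", and model_answer() are
# made up here just to show the shape of the test.
import random

def make_haystack(n_sentences=20_000, needle="The secret launch code is 7413."):
    filler = ["The sky was a pleasant shade of blue that day."] * n_sentences
    filler.insert(random.randrange(len(filler) + 1), needle)  # bury the needle somewhere
    prompt = " ".join(filler) + "\n\nWhat is the secret launch code? Answer with the number only."
    return prompt, "7413"

prompt, expected = make_haystack()
# answer = model_answer(prompt)              # hypothetical call to the model under test
# score = 1.0 if expected in answer else 0.0
```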
How do they make the context window longer? (serious question, I want to learn how this works)
You literally just shift the window over by one token once you reach the max number of tokens you want for the context window; this is about inference, NOT what you train on (it's only limited by memory now).
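As a toy Python sketch of that sliding-window idea (the model.next_token call is a hypothetical stand-in for a single decode step):

```python
# Toy version of the sliding-window idea: once the running token list exceeds
# the context limit, only the most recent max_ctx tokens are fed back in.
def generate_with_sliding_window(model, tokens, max_new_tokens, max_ctx=2048):
    for _ in range(max_new_tokens):
        window = tokens[-max_ctx:]               # oldest tokens silently fall out of view
        next_token = model.next_token(window)    # hypothetical single-step decode
        tokens.append(next_token)
    return tokens
```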
This has obvious issues, since you're now losing information from the now-unseen tokens, which becomes significant if your context window is small compared to the question/answer you're looking at. That's why companies try to offer stupidly large context windows. The problem is they're not training on the large context window; they're training on something smaller (2048 tokens and up). Due to how attention is set up, you can train on a small amount of context and extrapolate it to a much larger number of tokens, since they train via RoPE, which encodes words by their offset to neighboring words. This lets us effectively 2x, 3x, 10x, 100x the number of tokens we generate versus what we train with, with some form of consistency, BUT it still causes a lot of consistency issues, since the model ends up in a "this was trained on snippets but not the entire thing" situation where it has a notion of the context but not fundamentally the entire combined context.
That’s a very basic way to keep the LLM inferring past the context window size (there are better, smarter ways), but that's not at all what the question was, which is how they train a 2M-token window. My understanding, at a basic level, is that you need corpora that are >2M tokens long for training data, which is where the problem comes in: there's only so much long-form content, and it's swamped by all the smaller stuff. I think there are probably tricks now, but I suspect it's still largely an open problem.
AFAIK nobody does that. They train on much, much shorter text but use tricks in the position encoding steps that can be extrapolated by the LLMs, like RoPE and YaRN etc.
AFAIK (not much) it definitely helps to train on longer sequences even with rope/yarn and is needed if you care about long context performance (and not just the long context capability).
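To make those position-encoding tricks a bit more concrete, here is a rough numpy sketch of RoPE with plain position interpolation, i.e. dividing positions by a scale factor so a model trained on short sequences never sees positions outside its training range. YaRN/NTK-aware scaling refine this per frequency; this is just the crude version, not any particular model's implementation.

```python
# Rough RoPE sketch with plain position interpolation (scale > 1 squeezes long
# positions back into the range seen during training). Illustrative only.
import numpy as np

def rope(x, positions, scale=1.0, base=10000.0):
    seq_len, dim = x.shape                     # x: (seq_len, dim), dim must be even
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))   # one frequency per 2-D pair
    angles = np.outer(positions / scale, inv_freq)            # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin         # rotate each pair by its position-dependent angle
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.randn(8192, 64)
plain = rope(q, np.arange(8192))                     # positions far beyond a 2048-token training range
interpolated = rope(q, np.arange(8192), scale=4.0)   # squeezed back into roughly 0..2048
```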
no one makes effective use of long context.
It's not the most energy efficient workflow, but I work on relatively small codebases and I made a tool that lets me dump all of it into an LLM with a single copy/paste. This works surprisingly well with Gemini 2.5 Pro (1,000,000-token context).
The only real mistakes it makes are some model-specific quirks, like occasionally stripping out certain array index operators. Other than that, it works fine with 150,000-token conversations. I've gone up to 500,000 with no real issues besides a bit of a slowdown. It's also great for log analysis, which I have pushed up to 900,000 tokens.
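For anyone wanting to try the same workflow, a bare-bones version of such a dump tool is only a few lines of Python; the extension list and the chars-per-token heuristic below are guesses, not anything from the parent's actual tool.

```python
# Bare-bones "dump the repo into one prompt" helper; the extension list and
# the ~4-chars-per-token heuristic are guesses, not the parent's actual tool.
from pathlib import Path

def dump_codebase(root=".", exts=(".py", ".ts", ".go", ".md"), max_tokens=900_000):
    chunks, token_estimate = [], 0
    for path in sorted(Path(root).rglob("*")):
        if path.suffix not in exts or ".git" in path.parts:
            continue
        text = path.read_text(errors="ignore")
        token_estimate += len(text) // 4        # crude chars-to-tokens estimate
        if token_estimate > max_tokens:
            break                               # stop before blowing the context budget
        chunks.append(f"===== {path} =====\n{text}")
    return "\n\n".join(chunks)

# print(dump_codebase("."))  # paste the output into the model, instructions on top
```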
Long context window = huge amounts of vacant VRAM = our servers are fucking empty
But isn't context window dependent on model architecture and not available VRAM that you can just increase or decrease as you like?
Most attention implementations can work across an arbitrarily long context.
The limiting factors are typically:
1. Often there are latency/throughput requirements for model serving which become challenging to fulfill at a certain context length.
2. The model has to be _trained_ to use the desired context length, and training becomes prohibitively expensive at larger contexts.
(2) is even a big enough problem that some popular open source models that claim to support large context lengths in fact are trained on smaller ones and use "context length extension" hacks like YaRN to trick the model into working on longer contexts at inference time.
The model will use the full context if it's been designed well, but you can still increase the size of the window on models where it hasn't. It's just pointless. People who don't know much about LLMs will still think "bigger number is better" though.
No they can't; it's an N^2 algorithm, and just fitting it in the context window is a challenge.
And sure maybe not 2mil of it is usable, but they're reliably pushing the frontier here.
If a model is not making use of the whole context window - shouldn't that be very noticeable when the prompt is code?
For example when querying a model to refactor a piece of code - would that really work if it forgets about one part of the code while it refactors another part?
I concatenate a lot of code files into a single prompt multiple times a day and ask LLMs to refactor them, implement features or review the code.
So far, I never had the impression that filling the context window with a lot of code causes problems.
I also use very long lists of instructions on code style on top of my prompts. And the LLMs seem to be able to follow all of them just fine.
I don't think there are any up-to-date leaderboards, but models absolutely degrade in performance the more context they're dealing with.
https://wandb.ai/byyoung3/ruler_eval/reports/How-to-evaluate...
>Gpt-5-mini records 0.87 overall judge accuracy at 4k [context] and falls to 0.59 at 128k.
And Llama 4 Scout claimed a 10 million token context window but in practice its performance on query tasks drops below 20% accuracy by 32k tokens.
That makes me wonder if we could simply test this by letting the LLM add or multiply a long list of numbers?
Here is an experiment:
https://www.gnod.com/search/#q=%23%20Calcuate%20the%20below%...
The correct answer:
Here is what I got from different models on the first try:
> Do not use a calculator. Do it in your head.
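If you want to reproduce this kind of test yourself, a small Python harness can generate the numbers and the exact reference answer; the values below are random stand-ins, not the ones from the linked prompt.

```python
# Generate a mental-arithmetic test with an exact reference answer; the
# numbers are random stand-ins, not the ones from the linked experiment.
import random
from decimal import Decimal, getcontext

getcontext().prec = 50                                                # plenty of exact digits
numbers = [Decimal(random.randint(10, 99)) / 10 for _ in range(10)]   # e.g. 4.2, 7.3, ...

product = Decimal(1)
for n in numbers:
    product *= n

prompt = ("Do not use a calculator. Do it in your head.\n"
          "Multiply: " + " * ".join(str(n) for n in numbers))
print(prompt)
print("Correct answer:", product)   # compare each model's reply against this
```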
You wouldn't ask a human to do that, why would you ask an LLM to? I guess it's a way to test them, but it feels like the world record for backwards running: interesting, maybe, but not a good way to measure, like, anything about the individual involved.
Since grok 4 fast got this answer correct so quickly, I decided to test more.
Tested this on the new hidden model of ChatGPT called Polaris Alpha: Answer: 20,192,642.460942336
Current gpt-5 medium reasoning says: After confirming my calculations, the final product P should be 20,192,642.460942336
Claude Sonnet 4.5 says: “29,596,175.95 or roughly 29.6 million”
Claude haiku 4.5 says: ≈20,185,903
GLM 4.6 says: 20,171,523.725593136
I’m going to try out Grok 4 fast on some coding tasks at this point to see if it can create functions properly. Design help is still best on GPT-5 at this exact moment.
I just tested this on Grok 4 Fast and it got it pretty close: 20192642.460942336. Only the last two digits are different. Mind blown. I signed up just to say this.
Isn't it that LLMs are not designed to do calculations?
They are not LMMs, after all…
Neither are humans.
But humans can still do it.
I’m starting to find it unreasonably funny how people always want language models to multiply numbers for some reason. Every god damn time. In every single HN thread. I think my sanity might be giving out.
A model, no, but an agent with a calculator tool?
Then there's the question of why not just build the calculator tool into the model?
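The evaluator half of such a tool is tiny; the tool-call plumbing differs per provider, so only the calculator itself is sketched here in Python.

```python
# A tiny, safe arithmetic evaluator that an agent could call as its
# "calculator" tool; the surrounding tool-call plumbing is provider-specific
# and omitted here.
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
       ast.Div: operator.truediv, ast.Pow: operator.pow, ast.USub: operator.neg}

def calculate(expression: str) -> float:
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)

print(calculate("3.7 * 8.1 * 2.45"))   # the agent reports this instead of guessing
```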
Who here actually uses Grok? It's sad to see Elon's arc but when he doubled down on some of his political ideas he had it coming with the Tesla sales going down and x.ai not taken seriously.
I've always tried to remain apolitical and unbiased but it's hard to overlook who's behind a technology you wanna buy. Not that sama and others are saints either, it's just Elon's very obvious and vocal about it.
It's a shame, really, because Grok is a good model. But Elon promised to open source the previous model and it took them forever to do that with Grok 3. Sorry, but I wanna buy from someone who keeps their promises ("FSD by next year").
I like Grok for non-coding stuff. I find it hasn't been tuned for "safety" (meaning it isn't tuned much for political correctness). It also seems good at making up images and stories. I run some choose-your-own-adventure stories with my kids through it. We tell it who each of their characters are and what the theme is for the night, and Grok gives them each a section of story and 4 choices. They also have the option of choosing something different than suggested. We have it cycle the turns around for everyone. Works pretty well, and if the kids wanna go dark (preteen boy), Grok doesn't mind the violence.
Kinda reminds me of the video game from Ender's Game.
> it isn't tuned much for political correctness
It was tuned to be edgy and annoying though (I mean his general style of speech not necessarily the content).
Nothing in AI is more edgy and annoying than beginning every response with a mandatory glazing, like ChatGPT. “That’s a really insightful question, and shows that you really understand the subject!”
> meaning it isn't tuned much for political correctness
Is being tuned for right wing viewpoints the same as not being tuned for political correctness? Because there is tuning happening to a specific viewpoint:
https://gizmodo.com/elon-says-hes-working-to-fix-grok-after-...
Yeah, but you can argue that the AI has been biased because of biased training data.
Ultimately every AI is biased based on what you train it on and how you instruct it.
I tend to use LLMs from different companies and personally compare them, and read between the lines.
> I tend to use LLMs from different companies and personally compare them, and read between the lines.
Read between the lines? Does this mean that you're using LLMs as a source of information?
The point of LLMs is that there’s nothing in between the lines.
Or do you mean to say that you are trying to find the specific bias each model has?
I used it to calculate the size of a greenhouse using a lot of inputs and restrictions. It did that fine, but the one thing I did not appreciate was its sense of humor. It said the excavator would be here first thing Monday along with a pot of coffee. Just tell me a dad joke or skip the attempt at humor altogether.
For at least the last year, I've been using Grok for 90% of my queries. I pay for their $30 plan as well as $20 for Claude Code, which I only use for simple development projects. For anything more complicated, Grok's expert mode has consistently better results.
Going off OpenRouter's rankings (https://openrouter.ai/rankings), Grok Code Fast 1 is the most used model by a significant margin, and since those metrics are calculated as of this week, that's after providers stopped giving free promotional access to it. Grok 4 Fast is #5 on that list which was never free.
In terms of models, Grok 4 Fast has essentially zero restrictions on safety, which a) makes it unusable for most applications that allow user input and b) makes it extremely useful for certain applications.
It's the only model that lets you do gooner shit. That's why the usage is highly skewed. You can just call a horse a horse if you see one.
this is a code model, not the general one
you are so naive. lol. It's a general model with the tag "code" added to it.
This is nonsense. grok-code-fast-1 is just part of many free tiers of agentic coding assistants like Cline etc.
In my experience Grok Fast is the best "cheaper" model out there. Far better than Haiku 4.5 and Gemini Flash. I don't think the other cheaper models should be treated seriously at this point.
Gemini Flash is the first model I disable in any tool I use. It's a joke, and to add insult to injury, Google announced a "lite" version of it as well!
As you point out, Sam Altman is not exactly an altar boy: https://fastcompany.co.za/business/2025-11-07-sam-altmans-tr...
Thought this would be about the whistleblower. They didn't even mention it!
Yes, allegedly having an employee bumped off for whistleblowing, plus the sister thing, is way worse than someone having a different opinion than you. One is criminal, the other is free speech.
One is alleged; the other isn't just an opinion. It's estimated that several hundred thousand deaths have already happened from the abrupt USAID cuts initiated by DOGE.
"roman soldier" indeed
I don't think you can compare the usual internal backstabbing between executives with someone who literally directed and participated in acts of the US Government, and keep saying and doing things to help and nurture a certain side of the political spectrum.
Fair, but don't forget Altman's sister accused him of sexual abuse in court. (https://www.newsweek.com/sam-altman-openai-sister-annie-sexu...)
Dunno if it's true. The family wrote it off, saying she's mentally ill, but I can also see years of abuse leading to mental illness.
Both do both.
Did Sam Altman lead a government agency and camp in the Oval Office for months too? Degrees matter.
Not to an even remotely same degree..
I do! I have felt bad vibes from OpenAI for a while now, and eventually defaulted to Grok as somewhat the lesser of many evils. I respect anybody who doesn't wish to use it, but it's good enough for what I need it for. Case in point: it just spit out valid OpenSCAD code for an adapter piece I want to 3D print.
Grok's underrated, honestly. If you have to market on X you need a sub anyway, so it's replaced the casual questions I used to Google, and I'm not seeing anything worse than ChatGPT; often it's better. Much better at current events.
The video gen is actually really good fast and cheap for short videos.
Still use Claude and GPT5 for work tasks but I haven’t tried grok extensively for those
I've been occasionally using Grok and found it good for devops stuff; specifically it often is able to explain and produce working configurations without getting lost or introducing subtle mistakes as I've sometimes seen with other models.
I used Grok to successfully split a large 10K-line file of spaghetti code into multiple smaller well organised files. This was after giving the same task to Claude, OpenAI, and Gemini, all of which consistently failed.
Grok certainly has its uses, but I default to OpenAI for most business tasks and Claude for code.
I don't but only because the model is not satisfying, not because I dislike Tesla
I have tried it a few times in Copilot as Code Fast 1 because it was advertised. It has never done anything correctly so far. Maybe because it's the fast version?
Maybe you just used it wrong? I refactored a complicated code base, built exhaustive tests for a CLI app and I've been maintaining and building out several k8s clusters out of a mono repo using Cline + grok-code-fast-1 and it's been a breeze.
> I've always tried to remain apolitical and unbiased
Clearly
All proprietary AIs are probably biased in some way. I mean, that is the power of them and the reason they're proprietary, right?
So I tend to use different LLMs from different providers, personally compare them and read between the lines.
What models are better than Grok?
Sonnet-4 and onward, GPT-4 and onward
Saying “GPT-4” is dishonest; launch GPT-4 was significantly better than anything after the DevDay downgrade, all the 4o nonsense, etc.
In reality GPT really sucked from DevDay until 5, when it redeemed itself.
and GLM-4.6
Half of USA voted for Trump. That should answer “who actually uses Grok”.
I personally use the best tool for the job, which Grok sometimes is.
Trump received 77.3 million votes. Harris received 75 million votes. The US population is about 342 million.
I am not sure why these numbers would matter. He won, obviously, because the majority of voters voted for him.
Which are Americans, Americans who either voted for him or didn't do enough against him.
There is really no excuse to democratically vote for a person like this and let all this bullshit happen.
At least Elon is open about what he believes. Other CEOs hide behind corporate PR machines; how do you know they are not psychopaths?
> At least Elon is open about what he believes.
@dril: "you do not, under any circumstances, 'gotta hand it to them'"
There's a nonzero chance they are not psychopaths. Elon reminds us daily about his chances
Grok fast is by far the most used model on OpenRouter, with more than a trillion tokens weekly[1].
[1]: https://openrouter.ai/rankings
Because some tools (AFAIR Kilo Code but I might be wrong) gave it away for free. The model itself was (still is?) free for a while, so I'm not surprised.
OpenRouter is not counting tokens used by Kilo or Cline. They have their own endpoints.
Yet if you go to the actual model’s page:
https://openrouter.ai/x-ai/grok-code-fast-1
Cline and Kilo code are in the top 3. So how does that work?
It’s considerably cheaper than competing models like 2.5 Flash, though. So it's not that surprising.
Let me give you a perspective. For Indians, Winston Churchill is no different from Hitler. The guy was responsible for millions of deaths in the Bengal famine. But for you, and I assume the majority of this forum and Westerners, he is a hero. Compared to Winston Churchill, though, Elon looks like a saint!
i didn't
Honestly, if Elon Musk told me what time it was, I wouldn't trust him.
I had a failed refactor with Codex recently and I am wondering if context window size is the cause.
With the current crop of LLMs/agents, I find that refactors still have to be done at a granular level. "I want to make X change. Give me the plan and do not implement it yet. Do the first thing. Do the second thing. Now update the first call site to use the new pattern. You did it wrong and I fixed it in an editor; update the second call site to match the final implementation in $file. Now do the next one. Do the next one. Continue. Continue.", etc.
I use Claude Code, haven't used Codex yet (should I?) - but in Claude code you can spin up sub-agents to handle these big refactors, with the master context window just keeping track of the overall progress, bugs, etc and providing instructions to the subagents to do the rote work.
IMO yes. It is less polished, but the model is way better. I moved over from Claude completely and cancelled my Max subscription. Less polished and slower, but the results are better and you have to do less steering.
I'm not an expert AI user (and have never touched Codex), but for anything remotely important I do, I force the smallest context window possible. I just did something very beautiful using that principle, which will soon be ready to show the world. It would have been a garbled pile of garbage with long context windows.
Obviously major architectural changes need a bigger context window. But try to aggressively modularize your tasks as much as you can, and where possible run batch jobs to keep your workflow moving while each task stays a smaller chunk.
For complex refactors, I use "max mode" in Cursor, which in my experience noticeably improves the AI's performance and makes it go for a lot longer before it starts to drift. I haven't looked into how it works exactly, but it works well if you don't mind the extra cost.
Had some bad experiences with max mode and the latest Claude spending significant time on writing worthless .md files rather than solving problems
This post really has no reason to be flagged. I know Elon is controversial, and I have a lot of gripes with his business practices myself, but this is literally just documentation for a frontier LLM. Can we stay on topic?
This. I wouldn't pay to use it, but big context windows are amazing for programming and especially prototyping when you can keep whole codebase in context.
Gemini's 1M is amazing.
This. We like to think of ourselves as engineers, but we often behave like a bunch of emotion-driven primitives.
Honestly this kind of behaviour would be a huge red flag during interviews.
I have problems that current LLMs can't solve efficiently due to context window sizes. And welcome any improvement in this space.
I personally can't stand Musk, but for many he has become an Emmanuel Goldstein character: even the mention of his name causes the most extreme emotional disgust, from all the exposure to this strange, algorithmic Two Minutes Hate.
Here's an on-topic question: all the frontier model companies "promise" that they won't store and train on your API use if you pay for it. Who do you trust? I for sure will absolutely assume Grok will just use the data I submit for training in perpetuity. That's a scary thing for me, and for anyone else doing real work, this should be great cause for worry if they wish to use Grok.
Do you really think Google isn't logging all our prompts?
The politics of the owners IS the topic. It's really naive (read: stupid) to think that this has no implications for society.
You're literally handing over your code to a third party.
In fact AI is handing over the process of creating code - eventually all code - to a small number of third parties, who will have complete power over the world's IT infrastructure.
No wonder they have wildly inflated valuations. The potential to enforce authoritarian policies through opaque technology is unprecedented.
It's funny how fast this post is flagged, lol. Have other LLMs or blunt ads got the same treatment on HN?
> Have other LLMs or blunt ads got the same treatment on HN?
Yes, I’ve seen it happen multiple times.
It's probably because lots of people here resent their difference in personal ideology with Elon Musk.
I believe those people are eager to discuss Musk. The people suppressing Musk discussion are the forces backing him, who are out here working to suppress inconvenient speech.
But for some reason if I load a 400kb file into it... it can't even read the file?! Pffft, whatever, Elon. Go play with your rockets.
Grok? Next…
I personally find Grok better for certain tasks. It's better than Gemini for images. It's better than the rest at crude jokes, etc.
Yea, no desire to ever use this.