How many people here actually write custom CUDA/Triton kernels? An extremely small handful of people write them (and they're all on one Discord server), and their work then gets integrated into PyTorch, Megatron-LM, vLLM, SGLang, etc. The rest of us in the ML/AI ecosystem have absolutely no incentive to migrate off of Python due to network effects, even though I think it's a terrible language for maintainable production systems.
If Mojo focuses on systems software (and gets rid of exceptions - Chris, please <3) it will be a serious competitor to Rust and Go. It has all the performance and safety of Rust with a significantly easier learning curve.
Mojo doesn't have C++-like exceptions, but it does support throwing. The codegen approach is basically like Go's (where you conceptually return a bool + error) but with Python-style syntax, which makes it way more ergonomic than Go.
We have a public roadmap and are charging hard on improving the language; check out https://docs.modular.com/mojo/roadmap/ to learn more.
-Chris
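For readers unfamiliar with that model, here is a rough, purely conceptual sketch in plain Python (not Mojo syntax, and not Modular's actual codegen; the parse_port names are made up for the example). The first function is the raise/try style you write; the second is roughly the Go-like (result, error) shape it gets lowered to.

    # Illustrative only: "what the programmer writes" vs. the conceptual
    # (result, error) lowering described above.
    def parse_port(text: str) -> int:
        value = int(text)                      # may raise ValueError
        if not (0 < value < 65536):
            raise ValueError("port out of range")
        return value

    def parse_port_lowered(text: str):
        try:
            value = int(text)
        except ValueError as e:
            return None, e                     # explicit (result, error) pair
        if not (0 < value < 65536):
            return None, ValueError("port out of range")
        return value, None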
I think part of the reason why just a few people write custom CUDA/Triton kernels is that it's really hard to do well. Languages like Mojo aim to make that much easier, so hopefully more people will be able to write them (and do other interesting things with GPUs that are too technically challenging right now).
Plenty of people do, many more than are in that server -- I asked some of my former coworkers and none knew about it, but we all spent a whole lot of hours tuning CUDA kernels together :). You have one perspective on this sector, but it's not the only one!
Some example motivations:
- Strange synchronization/coherency requirements
- Working with new hardware / new strategies that Nvidia & co. haven't fine-tuned yet
- Just wanting to squeeze out some extra performance
Which Discord server? I want in!
Not OP, but my guess is GPU MODE. https://discord.gg/gpumode
The Mojo discord and forums are all listed here: https://www.modular.com/community
Probably tens of thousands of people. You do know that CUDA is used for more than just AI/ML?
I guess given all the hype, people tend to forget what GPGPU is used for; it's like the common meme of asking why CUDA when there is PyTorch.
Signing up to try a programming language (Mojo) is as bad as logging in to your terminal before using it (Warp).
Co-founder here. There isn't any signup - that was 2+ years ago and we've been iterating a lot with the community and listening to feedback - which has been wonderful. Go freely and install with Pip, UV, Pixi etc -> https://docs.modular.com/mojo/manual/install
I'm very excited for Mojo - more about the programming language itself than all the ML stuff.
I've been using this in Julia since 2022. :D
I would be truly interested if you could expand on this. I know I can do my own research, but I'm starting down the path of what could be called "performance Python" or something similar, and real-world stories help.
My use case is realtime audio processing (VST plugins).
Metal.jl can be used to write GPU kernels in Julia to target an Apple Silicon GPU. Or you can use KernelAbstractions.jl to write once in a high-level CUDA-like language to target NVIDIA/AMD/Apple/Intel GPUs. For best performance, you'll want to take advantage of vendor-specific hardware, like Tensor Cores in CUDA or Unified Memory on Mac.
You also get an ever-expanding set of Julia GPU libraries. In my experience, these are more focused on the numerical side than on ML.
If you want to compile an executable for an end user, that functionality was added in Julia 1.12, which hasn't been released yet. Early tests with the release candidate suggest that it works, but I would advise waiting to get a better developer experience.
I'm very interested in this field (realtime audio + GPU programming). How do you deal with the latency? Do you send single or multiple vectors/buffers to the GPU?
Also, since samples in one channel need to be processed sequentially, does that mean mono audio processing won't benefit much from GPU programming? Or maybe you are dealing with spectral signal processing?
I (vaguely) think the Mojo guys' goal makes a lot of sense. And I understand why they thought Python was the way to start.
But I just think Python is not the right language to try to turn into this super-optimized parallel processing system they are trying to build.
But their target market is Python programmers, I guess. So I'm not sure what a better option would be.
It would be interesting for them to develop their own language and make it all work. But "yet another programming language" is a tough sell.
What language do you think they should have based Mojo off of? I think Python syntax is great for tensor manipulation.
This is attempt number two; it was already tried before with Swift for TensorFlow.
Guess why it wasn't a success, or why Julia is having adoption issues among the same community.
Or why, although Zig basically has Modula-2's type system, it gets more hype than Modula-2 ever has since 1978 (Modula-2 is even part of GCC nowadays).
Syntax and familiarity matters.
I think the only Zig hype I'm seeing is about its compiler and compatibility. Those might well be the same two reasons why you never hear about Modula-2.
I am older than Modula-2, so I heard a lot, many of the folks hyping Zig still think the world started with UNIX.
The syntax is based on Python, but the runtime is not. So there's nothing inconsistent about the contrast between the Python language and Mojo's use as a highly parallel processing system.
Exactly, the idea of not having to learn yet another new language is very compelling.
Except by all accounts they succeeded. I believe they have the fastest matmul on NVIDIA chips in the industry.
I was under the impression that their uptake is slow or non-existent. Am I wrong on that?
Is it really faster than cuBLAS?
In some things, yes. They're mostly identical in performance, though.
CUTLASS would like to have a word with you.
evidence?
Modular/Mojo is faster than NVIDIA's libraries on their own chips, and open source instead of a binary blob. See the four-part series that culminates in https://www.modular.com/blog/matrix-multiplication-on-blackw... for Blackwell, for example.
thanks
I'm interested to see how this shakes out now that they are well past the proof of concept stage and have something that can replace CUDA on Nvidia hardware without nerfing performance in addition to other significant upsides.
Just the notion of replacing the parts of LLVM that force it to remain single threaded would be a major sea change for developer productivity.
I like Chris Lattner but the ship sailed for a deep learning DSL in like 2012. Mojo is never going to be anything but a vanity project.
Nah. There's huge alpha here, as one might say. I feel like this comment could age even more poorly than the infamous dropbox comment.
Even with JAX, PyTorch, HF Transformers, whatever you want to throw at it, the DX for cross-platform GPU programming that's compatible with large language model requirements specifically is extremely bad.
I think this may end up being the most important thing that Lattner has worked on in his life (and yes, I am aware of his other projects!).
Comments like this view the ML ecosystem in a vacuum. New ML models are almost never written (all LLMs, for example, are basically GPT-2 with extremely marginal differences), and the algorithms themselves are the least of the problem in the field. The 30% improvements you get from kernels and compiler tricks are absolute peanuts compared to the 500%+ improvements you get from upgrading hardware, adding load balancing and routing, KV and prefix caching, optimized collective ops, etc. On top of that, the difficulty of even just migrating Torch to the C++11 ABI to access fp8 optimizations is nigh insurmountable in large companies.
I say the ship sailed in 2012 because that was around when it was decided to build Tensorflow around legacy data infrastructure at Google rather than developing something new, and the rest of the industry was hamstrung by that decision (along with the baffling declarative syntax of Tensorflow, and the requirement to use Blaze to build it precluding meaningful development outside of Google).
The industry was so desperate to get away from it that they collectively decided that downloading a single giant library with every model definition under the sun baked into it was the de facto solution to loading Torch models for serving, and today I would bet you that easily 90% of deep learning models in production revolve around either TensorRT, or a model being plucked from Huggingface’s giant library.
The decision to halfass machine learning was made a long time ago. A tool like Mojo might work at a place like Apple that works in a vacuum (and is lightyears behind the curve in ML as a result), but it just doesn’t work on Earth.
If there’s anyone that can do it, it’s Lattner, but I don’t think it can be done, because there’s no appetite for it nor is the talent out there. It’s enough of a struggle to get big boy ML engineers at Mag 7 companies to even use Python instead of letting Copilot write them a 500 line bash script. The quality of slop in libraries like sglang and verl is a testament to the futility of trying to reintroduce high quality software back into deep learning.
Thank you for the kind words! Are you saying that AI model innovation stopped at GPT-2 and everyone has performance and gpu utilization figured out?
Are you talking about NVIDIA Hopper or any of the rest of the accelerators people care about these days? :). We're talking about a lot more performance and TCO at stake than traditional CPU compilers.
I’m saying actual algorithmic (as in not data) model innovation has never been a significant part of the revenue generation in the field. You get your random forest, or ResNet, or BERT, or MaskRCNN, or GPT-2-with-One-Weird-Trick, and then you spend four hours trying to figure out how to preprocess your data.
On the flipside, far from figuring out GPU efficiency, most people with huge jobs are network bottlenecked. And that’s where the problem arises: solutions for collective comms optimization tend to explode in complexity because, among other reasons, you now have to package entire orchestrators in your library somehow, which may fight with the orchestrators that actually launch the job.
Doing my best to keep it concise, but Hopper is a good case study. I want to use Megatron! Suddenly you need FP8, which means the CXX11 ABI, which means recompiling Torch along with all those nifty toys like flash attention, flashinfer, vLLM, whatever. Ray, jsonschema, Kafka, and a dozen other things also need to match the same glibc and libstdc++ versions. So using that as an example, suddenly my company needs C++ CI/CD pipelines, dependency management, etc. when we didn't before. And I just spent three commas on these GPUs. And most likely, I haven't made a dime on my LLMs, or autonomous vehicles, or weird cyborg slavebots.
So what all that boils down to is just that there’s a ton of inertia against moving to something new and better. And in this field in particular, it’s a very ugly, half-assed, messy inertia. It’s one thing to replace well-designed, well-maintained Java infra with Golang or something, but it’s quite another to try to replace some pile of shit deep learning library that your customers had to build a pile of shit on top of just to make it work, and all the while fifty college kids are working 16 hours a day to add even more in the next dev release, which will of course be wholly backwards and forwards incompatible.
But I really hope I’m wrong :)
And comments like this forget that there is more to AI and ML than just LLMs or even NNs.
PyTorch didn't even start until 2016, and it took a lot of market share from TensorFlow.
I don't know if this is a language that will catch on, but I guarantee there will be another deep learning focused language that catches on in the future.
Now that NVIDIA has finally gotten serious about Python tooling and JIT compilers for CUDA, I also see it becoming harder, and those I can use natively on Windows instead of having to live in WSL land.
To be fair, Triton is in active use, and this should be even more ergonomic for Python users than Triton. I don't think it's a sure thing, but I wouldn't say it has zero chance either.
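For context on the ergonomics being compared, here is roughly what a minimal custom kernel looks like in Triton today, a standard vector-add sketch assuming CUDA tensors (the add_kernel/add names and the BLOCK_SIZE choice are just this example's):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)                     # which block this program handles
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements                     # guard the ragged tail
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        n = out.numel()
        grid = (triton.cdiv(n, 1024),)                  # one program per 1024-element block
        add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
        return out

The bet being discussed is that Mojo can make this kind of kernel authoring at least as approachable while compiling ahead of time and targeting more than one vendor.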
Tritonlang itself is a deep learning DSL.
> I like Chris Lattner but the ship sailed for a deep learning DSL in like 2012.
Nope. There's certainly room for another alternative that's more performant and portable than the rest, without the hacks needed to get there.
Maybe you caught the wrong ship, but Mojo is a speedboat.
> Mojo is never going to be anything but a vanity project.
I'll come back in 10 years and we'll see whether your comment needs to be studied like the famous Dropbox one.
Any actual reasoning for that claim?
Apologies for a noob (and off-topic) question, but what stops Apple from competing with Nvidia?
We need a Pythonic language that is compatible with the Python ecosystem, designed for machine learning use cases, compiles directly to an executable with direct, specialized access to the low-level GPU cores, and is as fast as Rust.
The closest to that is Mojo, which borrows many of Rust's ideas and builds in type safety, with the aim of being compatible with the existing Python ecosystem, which is great.
I've never heard a sound argument against Mojo, and I continue to see the weakest arguments, which go along these lines:
"I don't want to learn another language"
"It will never take off because we don't need another deep learning DSL"
"It's bad that a single company owns the language just like Google and Golang, Microsoft and C# and Apple and Swift".
Well, I prefer tools that are extremely fast, save time, and make lots of money, rather than treating spinning up hundreds of costly VMs as the solution. If Mojo excels in performance and reduces cost, then I'm all for that; even better if it achieves Python compatibility.
In an alternative reality, Chris invented Mojo at Apple (instead of Swift).
If one language were used for iOS apps and GPU programming, with some compatibility with Python, it would be pretty neat.
The argument against Mojo is that it replaces CUDA (that you get for free with the hardware) with something that you need to license.
By itself, that's not so bad. Plenty of "buy, don't build" choices out there.
However, every other would-be Mojo user also knows that. And they don't want to build on top of an ecosystem that's not fully open.
Why don't Mathematica/MATLAB have pytorch-style DL ecosystems? Because nobody in their right mind would contribute for free to a platform owned by Wolfram Research or Mathworks.
I'm hopeful that Modular can navigate this by opening up their stack.
I really want to like Mojo, but you nailed what gives me pause. Not to take the anecdotal example of Polars too far, but I get the sense that the current gravity in Python for net-new stuff that needs to be written outside Python (a ton of highly performant numpy/scipy/pytorch ecosystem stuff aside, obviously) is for it to be written in Rust when necessary.
I'm not an expert, and though I wouldn't be surprised if Mojo ends up being a better language than Rust for the use case we're discussing, I'm not confident it will ever catch up to Rust in ecosystem and escape velocity as a sane general-purpose compiled systems language. It really does feel like Rust has replaced C++ for net-new buildouts that would've previously needed its power.
> The argument against Mojo is that it replaces CUDA (that you get for free with the hardware) with something that you need to license.
You realize that CUDA isn't open source or planned to be open source in the future, right?
Meanwhile parts of Mojo are already open source with the rest expected to be opened up next year.
The parent said free, not open source. I want Mojo to succeed, but I'm also doubtful of the business model.
Do you get a functional version of CUDA free with AMD's much more reasonably priced hardware?
Mojo is planned to be both free and open source by the end of next year and it's not vendor locked to extremely expensive hardware.
To take full advantage of Mojo you will need Modular's ecosystem, and they need to pay the VCs back somehow.
Also, as of today anything CUDA works out of the box on Windows; Mojo might eventually work outside WSL, some day.