How many people here actually write custom CUDA/Triton kernels? An extremely small handful of people write them (and they're all on one Discord server), and their work then gets integrated into PyTorch, Megatron-LM, vLLM, SGLang, etc. The rest of us in the ML/AI ecosystem have absolutely no incentive to migrate off of Python due to network effects, even though I think it's a terrible language for maintainable production systems.
If Mojo focuses on systems software (and gets rid of exceptions - Chris, please <3) it will be a serious competitor to Rust and Go. It has all the performance and safety of Rust with a significantly easier learning curve.
Mojo doesn't have C++-like exceptions, but it does support throwing. The codegen approach is basically like Go's (where you conceptually return a bool + error) but with Python-style syntax, which makes it way more ergonomic than Go.
We have a public roadmap and are charging hard on improving the language; check out https://docs.modular.com/mojo/roadmap/ to learn more.
-Chris
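For readers unfamiliar with that model, here is a rough, purely conceptual sketch in plain Python (not Mojo syntax, and not Modular's actual codegen; the parse_port names are made up for the example). The first function is the raise/try style you write; the second is roughly the Go-like (result, error) shape it gets lowered to.

    # Illustrative only: "what the programmer writes" vs. the conceptual
    # (result, error) lowering described above.
    def parse_port(text: str) -> int:
        value = int(text)                      # may raise ValueError
        if not (0 < value < 65536):
            raise ValueError("port out of range")
        return value

    def parse_port_lowered(text: str):
        try:
            value = int(text)
        except ValueError as e:
            return None, e                     # explicit (result, error) pair
        if not (0 < value < 65536):
            return None, ValueError("port out of range")
        return value, None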
I think part of the reason why just a few people write custom CUDA/Triton kernels is that it's really hard to do well. Languages like Mojo aim to make that much easier, so hopefully more people will be able to write them (and do other interesting things with GPUs that are too technically challenging right now).
Plenty of people do, many more than are in that server -- I asked some of my former coworkers and none knew about it, but we all spent a whole lot of hours tuning CUDA kernels together :). You have one perspective on this sector, but it's not the only one!
Some example motivations:
- Strange synchronization/coherency requirements
- Working with new hardware / new strategies that Nvidia & co. haven't fine-tuned yet
- Just wanting to squeeze out some extra performance
Which Discord server? I want in!
Not OP, but my guess is GPU MODE. https://discord.gg/gpumode
The Mojo discord and forums are all listed here: https://www.modular.com/community
Probably tens of thousands of people. You do know that CUDA is used for more than just AI/ML?
I guess given all the hype, people tend to forget what GPGPU is used for; it's like the common meme of asking why CUDA when there is PyTorch.
Signing up to try a programming language (Mojo) is as bad as logging in to your terminal before using it (Warp).
Co-founder here. There isn't any signup - that was 2+ years ago and we've been iterating a lot with the community and listening to feedback - which has been wonderful. Go freely and install with Pip, UV, Pixi etc -> https://docs.modular.com/mojo/manual/install
I'm very excited for Mojo - more about the programming language itself than all the ML stuff.
I've been using this in Julia since 2022. :D
I would be truly interested if you could expand on this. I know I can do my own research, but I'm starting down the path of what could be called "performance Python" or something similar, and real-world stories help.
My use case is realtime audio processing (VST plugins).
Metal.jl can be used to write GPU kernels in Julia to target an Apple Silicon GPU. Or you can use KernelAbstractions.jl to write once in a high-level CUDA-like language to target NVIDIA/AMD/Apple/Intel GPUs. For best performance, you'll want to take advantage of vendor-specific hardware, like Tensor Cores in CUDA or Unified Memory on Mac.
You also get an ever-expanding set of Julia GPU libraries. In my experience, these are more focused on the numerical side than on ML.
If you want to compile an executable for an end user, that functionality was added in Julia 1.12, which hasn't been released yet. Early tests with the release candidate suggest that it works, but I would advise waiting to get a better developer experience.
I'm very interested in this field (realtime audio + GPU programming). How do you deal with the latency? Do you send single or multiple vectors/buffers to the GPU?
Also, since samples in one channel need to be processed sequentially, does that mean mono audio processing won't benefit much from GPU programming? Or maybe you are dealing with spectral signal processing?
I (vaguely) think the Mojo guys' goal makes a lot of sense. And I understand why they thought Python was the way to start.
But I just think Python is not the right language to try to turn into this super-optimized parallel processing system they are trying to build.
But their target market is Python programmers, I guess. So I'm not sure what a better option would be.
It would be interesting for them to develop their own language and make it all work. But "yet another programming language" is a tough sell.
What language do you think they should have based Mojo off of? I think Python syntax is great for tensor manipulation.
This is attempt number two; it was already tried before with Swift for TensorFlow.
Guess why it wasn't a success, or why Julia is having adoption issues among the same community.
Or why, although Zig basically has Modula-2's type system, it gets more hype than Modula-2 ever has since 1978 (Modula-2 is even part of GCC nowadays).
Syntax and familiarity matters.
I think the only Zig hype I'm seeing is about its compiler and compatibility. Those might well be the same two reasons why you never hear about Modula-2.
I am older than Modula-2, so I heard a lot, many of the folks hyping Zig still think the world started with UNIX.
The syntax is based on Python, but the runtime is not. So there's nothing inconsistent about the contrast between the Python language and Mojo's use as a highly parallel processing system.
Exactly, the idea of not having to learn yet another new language is very compelling.
Except by all accounts they succeeded. I believe they have the fastest matmul on NVIDIA chips in the industry.
I was under the impression that their uptake is slow or non-existent. Am I wrong on that?
Is it really faster than cuBLAS?
In some things, yes. They're mostly identical in performance, though.
CUTLASS would like to have a word with you.
evidence?
Modular/Mojo is faster than NVIDIA's libraries on their own chips, and open source instead of a binary blob. See the four-part series that culminates in https://www.modular.com/blog/matrix-multiplication-on-blackw... for Blackwell, for example.
thanks
I'm interested to see how this shakes out now that they are well past the proof of concept stage and have something that can replace CUDA on Nvidia hardware without nerfing performance in addition to other significant upsides.
Just the notion of replacing the parts of LLVM that force it to remain single threaded would be a major sea change for developer productivity.
I like Chris Lattner but the ship sailed for a deep learning DSL in like 2012. Mojo is never going to be anything but a vanity project.
Nah. There's huge alpha here, as one might say. I feel like this comment could age even more poorly than the infamous dropbox comment.
Even with JAX, PyTorch, HF Transformers, whatever you want to throw at it, the DX for cross-platform GPU programming that's compatible with large language model requirements specifically is extremely bad.
I think this may end up being the most important thing that Lattner has worked on in his life (and yes, I am aware of his other projects!).
Comments like this view the ML ecosystem in a vacuum. New ML models are almost never written (all LLMs, for example, are basically GPT-2 with extremely marginal differences), and the algorithms themselves are the least of the problem in the field. The 30% improvements you get from kernels and compiler tricks are absolute peanuts compared to the 500%+ improvements you get from upgrading hardware, adding load balancing and routing, KV and prefix caching, optimized collective ops, etc. On top of that, the difficulty of even just migrating Torch to the C++11 ABI to access fp8 optimizations is nigh insurmountable in large companies.
I say the ship sailed in 2012 because that was around when it was decided to build Tensorflow around legacy data infrastructure at Google rather than developing something new, and the rest of the industry was hamstrung by that decision (along with the baffling declarative syntax of Tensorflow, and the requirement to use Blaze to build it precluding meaningful development outside of Google).
The industry was so desperate to get away from it that they collectively decided that downloading a single giant library with every model definition under the sun baked into it was the de facto solution to loading Torch models for serving, and today I would bet you that easily 90% of deep learning models in production revolve around either TensorRT, or a model being plucked from Huggingface’s giant library.
The decision to halfass machine learning was made a long time ago. A tool like Mojo might work at a place like Apple that works in a vacuum (and is lightyears behind the curve in ML as a result), but it just doesn’t work on Earth.
If there’s anyone that can do it, it’s Lattner, but I don’t think it can be done, because there’s no appetite for it nor is the talent out there. It’s enough of a struggle to get big boy ML engineers at Mag 7 companies to even use Python instead of letting Copilot write them a 500 line bash script. The quality of slop in libraries like sglang and verl is a testament to the futility of trying to reintroduce high quality software back into deep learning.
Thank you for the kind words! Are you saying that AI model innovation stopped at GPT-2 and everyone has performance and gpu utilization figured out?
Are you talking about NVIDIA Hopper or any of the rest of the accelerators people care about these days? :). We're talking about a lot more performance and TCO at stake than traditional CPU compilers.
I’m saying actual algorithmic (as in not data) model innovation has never been a significant part of the revenue generation in the field. You get your random forest, or ResNet, or BERT, or MaskRCNN, or GPT-2-with-One-Weird-Trick, and then you spend four hours trying to figure out how to preprocess your data.
On the flipside, far from figuring out GPU efficiency, most people with huge jobs are network bottlenecked. And that’s where the problem arises: solutions for collective comms optimization tend to explode in complexity because, among other reasons, you now have to package entire orchestrators in your library somehow, which may fight with the orchestrators that actually launch the job.
Doing my best to keep it concise, but Hopper is a good case study. I want to use Megatron! Suddenly you need FP8, which means the CXX11 ABI, which means recompiling Torch along with all those nifty toys like flash attention, flashinfer, vLLM, whatever. Ray, jsonschema, Kafka, and a dozen other things also need to match the same glibc and libstdc++ versions. So using that as an example, suddenly my company needs C++ CI/CD pipelines, dependency management, etc. when we didn't before. And I just spent three commas on these GPUs. And most likely, I haven't made a dime on my LLMs, or autonomous vehicles, or weird cyborg slavebots.
So what all that boils down to is just that there’s a ton of inertia against moving to something new and better. And in this field in particular, it’s a very ugly, half-assed, messy inertia. It’s one thing to replace well-designed, well-maintained Java infra with Golang or something, but it’s quite another to try to replace some pile of shit deep learning library that your customers had to build a pile of shit on top of just to make it work, and all the while fifty college kids are working 16 hours a day to add even more in the next dev release, which will of course be wholly backwards and forwards incompatible.
But I really hope I’m wrong :)
And comments like this forget that there is more to AI and ML than just LLMs or even NNs.
PyTorch didn't even start until 2016, and it took a lot of market share from TensorFlow.
I don't know if this is a language that will catch on, but I guarantee there will be another deep learning focused language that catches on in the future.
Now that NVIDIA has finally gotten serious about Python tooling and JIT compilers for CUDA, I also see it becoming harder, and those I can use natively on Windows instead of having to live in WSL land.
To be fair, Triton is in active use, and this should be even more ergonomic for Python users than Triton. I don't think it's a sure thing, but I wouldn't say it has zero chance either.
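For context on the ergonomics being compared, here is roughly what a minimal custom kernel looks like in Triton today, a standard vector-add sketch assuming CUDA tensors (the add_kernel/add names and the BLOCK_SIZE choice are just this example's):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)                     # which block this program handles
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements                     # guard the ragged tail
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        n = out.numel()
        grid = (triton.cdiv(n, 1024),)                  # one program per 1024-element block
        add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
        return out

The bet being discussed is that Mojo can make this kind of kernel authoring at least as approachable while compiling ahead of time and targeting more than one vendor.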
Tritonlang itself is a deep learning DSL.
> I like Chris Lattner but the ship sailed for a deep learning DSL in like 2012.
Nope. There's certainly room for another alternative that's more performant and portable than the rest, without the hacks needed to get there.
Maybe you caught the wrong ship, but Mojo is a speedboat.
> Mojo is never going to be anything but a vanity project.
I'll come back in 10 years and we'll see whether your comment needs to be studied like the famous Dropbox one.
Any actual reasoning for that claim?
Apologies for a noob (and off-topic) question, but what stops Apple from competing with Nvidia?
We need a Pythonic language that is compatible with the Python ecosystem, designed for machine learning use cases, compiles directly to an executable with direct, specialized access to the low-level GPU cores, and is as fast as Rust.
The closest to that is Mojo, which borrows many of Rust's ideas and builds in type safety, with the aim of being compatible with the existing Python ecosystem, which is great.
I've never heard a sound argument against Mojo, and I continue to see the weakest arguments, which go along these lines:
"I don't want to learn another language"
"It will never take off because we don't need another deep learning DSL"
"It's bad that a single company owns the language just like Google and Golang, Microsoft and C# and Apple and Swift".
Well, I prefer tools that are extremely fast, save time, and make lots of money, rather than treating spinning up hundreds of costly VMs as the solution. If Mojo excels in performance and reduces cost, then I'm all for that; even better if it achieves Python compatibility.
In an alternative reality, Chris invented Mojo at Apple (instead of Swift).
If one language were used for iOS apps and GPU programming, with some compatibility with Python, it would be pretty neat.
The argument against Mojo is that it replaces CUDA (that you get for free with the hardware) with something that you need to license.
By itself, that's not so bad. Plenty of "buy, don't build" choices out there.
However, every other would-be Mojo user also knows that. And they don't want to build on top of an ecosystem that's not fully open.
Why don't Mathematica/MATLAB have pytorch-style DL ecosystems? Because nobody in their right mind would contribute for free to a platform owned by Wolfram Research or Mathworks.
I'm hopeful that Modular can navigate this by opening up their stack.
I really want to like Mojo, but you nailed what gives me pause. Not to take the anecdotal example of Polars too far, but I get the sense that the current gravity in Python for net-new stuff that needs to be written outside Python (a ton of highly performant numpy/scipy/pytorch ecosystem stuff aside, obviously) is for it to be written in Rust when necessary.
I'm not an expert, and though I wouldn't be surprised if Mojo ends up being a better language than Rust for the use case we're discussing, I'm not confident it will ever catch up to Rust in ecosystem and escape velocity as a sane general-purpose compiled systems language. It really does feel like Rust has replaced C++ for net-new buildouts that would've previously needed its power.
> The argument against Mojo is that it replaces CUDA (that you get for free with the hardware) with something that you need to license.
You realize that CUDA isn't open source or planned to be open source in the future, right?
Meanwhile parts of Mojo are already open source with the rest expected to be opened up next year.
The parent said free, not open source. I want Mojo to succeed, but I'm also doubtful of the business model.
Do you get a functional version of CUDA free with AMD's much more reasonably priced hardware?
Mojo is planned to be both free and open source by the end of next year and it's not vendor locked to extremely expensive hardware.
To take full advantage of Mojo you will need Modular's ecosystem, and they need to pay the VCs back somehow.
Also, as of today anything CUDA works out of the box on Windows; Mojo might eventually work outside WSL, some day.