You have to be extremely careful with benchmarks like these. It's very easy to measure and conclude different things. In this case, it really only measures the compiler's ability to optimise the `for` loop construct - nothing else. It doesn't say anything much about the languages' general "speed", whatever that means (which is generally the implication with tests like this).
Some points to consider:
- `for` is not how you'd write an optimised loop in all languages.
- Be VERY suspicious of microbenchmarks on C, C++, and similar - the compilers are smarter than you think and you won't get the code you wrote.
- Languages do arrays differently - added memory effects on performance.
- Languages do numbers differently.
- Summing and modulo make for a somewhat contrived example. Modulo is one of the slower instructions out there.
Another fave of mine by this person: https://benjdd.com/az/
Wow what a creative way to visualize the concept.
I think curves make it tricky to maintain objective perception, but the layout contributes significantly to the idea that internal latency is a different order from external.
It’s a billion times integer modulus, summed. Div/mod is by far the slowest arithmetic instruction on a CPU, so it underestimates C/rust performance. I guess the author chose that to prevent the C compiler from optimizing the loop away.
Author here: Yeah, the mod was added to reduce risk of loop(s) getting optimized away.
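For anyone who hasn't opened the repo: each program reads an integer u from the command line, runs 10k outer x 100k inner iterations of j % u into an array, folds in a randomly chosen value, and prints one array element. A rough Python sketch of that shape - based on the Elixir port further down the thread, not the author's exact code:

    import random
    import sys

    def run(u: int) -> int:
        r = random.randrange(10_000)       # random value/index, chosen once
        a = [0] * 10_000                   # 10k-element array
        for i in range(10_000):            # 10k outer iterations
            for j in range(100_000):       # 100k inner iterations -> 1B total
                a[i] += j % u              # the modulo that keeps compilers honest
            a[i] += r                      # runtime randomness folded into every element
        return a[r]                        # a data-dependent element gets printed

    if __name__ == "__main__":
        print(run(int(sys.argv[1])))

The runtime input and the printed, data-dependent result are what make it hard (in principle) for a compiler to delete the work outright.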
Did you verify this through disassembly? These loops can still be replaced by a closed form expression and I wouldn’t be surprised if LLVM figured this out.
If it turned the inner loop into a closed form expression, I would expect the outer loop to go through 10k iterations a lot faster than needing half a second.
Running the C code through Godbolt with the same Clang and -O3 optimizations, it looks like you're right: there are still 2 loops being performed. The loops are unrolled to perform 4 iterations at a time, but it's otherwise the same; other than that it didn't do too much with the loops. Hilariously, there is a god awful set of shifts and magic numbers... to save a single modulo operation from running once on the line declaring 'r'.
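For reference, the closed form the parent comments are alluding to is simple: the inner sum of j % u over j = 0..n-1 is some number of full cycles of 0..u-1 plus a partial tail. A Python sketch (mod_sum is a hypothetical helper name, not anything from the repo):

    def mod_sum(n: int, u: int) -> int:
        # sum of (j % u) for j in range(n), in O(1):
        # q full cycles each summing to u*(u-1)/2, plus a tail 0 + 1 + ... + (t-1)
        q, t = divmod(n, u)
        return q * u * (u - 1) // 2 + t * (t - 1) // 2

And since, at least in the ports quoted in this thread, every array element ends up as the same inner sum plus r, the whole benchmark collapses to printing mod_sum(100_000, u) + r - which is why it matters whether the compiler spots this or actually grinds through the billion iterations.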
See here for why the Go version performs worse https://news.ycombinator.com/item?id=42251571
I'm surprised to see javascript so high up on the list! I would have thought PHP was faster, honestly. But I did run the benchmark in my browser and saw that the browser JS in chrome is slightly slower than Deno at ~1.2-1.3s. Even though firefox runs closer to 3 seconds, I'm still surprised that browser js is that much faster than some of those closer to the metal.
It's impressive, but even more interesting: it could still be faster! Java is JIT-compiled and sits at #3, so theoretically any of these languages could get there.
I'm working on a web app in JS, the sort of thing I think people on here will enjoy and I'll definitely share it when it's presentable. It's all in JS, including the graphical aspects of it (I'm drawing to a canvas). For per-pixel rendering I have 4 levels of nested loops. I feel like I've done everything in my power to optimize the process, and it's still not at the ideal speed. Really, it ought to be about 5-10 times faster than it is now.
I've been thinking of getting it into WASM somehow - it's really just one function that needs to be replaced by a WASM module - but one of the things that has given me pause is benchmarks like this, as well as other anecdotes, demonstrating I'd only get a minor speedup from WASM. I guess part of this is due to modern JS engines being so optimized.
Just about all of the relevant code is adding and subtracting integers, but there's some Boolean logic. There's also some multiplication in there that I could eliminate if I really had to, although at this point I'm not certain it would make a significant difference. Most of the data that contributes to the value of each pixel is spread across a tree of objects, and there's some polymorphism involved too. The point is, it's not like I'm just dealing with a bunch of contiguous data in memory that is easily cached by the CPU, which might speed things up considerably. I guess I'm just venting now, lamenting a nightmare of my own creation...
For the record - I can't imagine any tangible value in writing a graphics engine (and form library) for an environment that was literally designed as a graphics engine for forms. But, that disclaimer aside, I've considered WASM for graphics for a game engine and I have had some trial success with allocating a shared array buffer, writing to it in WASM and reading from it at a JS-friendly rate of 30-60 fps (whatever requestAnimationFrame is serving). The idea is that you treat the shared array buffer as your "screen" and write to it from the WASM as if it were an image buffer, while on the JS side you read it as an image buffer onto the canvas. That gives you a nice, finite bandwidth, an easy-to-scale maximum for how much data you can pass through, and neither JS nor WASM ever has to pass any time-critical data.
Of course, this setup leaves a very thin JS client - it's essentially just rendering the same image over and over. So if you've already got everything done in JS, it would mean porting all of your functional code over to something that can compile into WASM. Not a huge lift, but not fun either.
And then, on the other hand, if it's just rendering you're seeing as a bottleneck, JS works pretty well with WebGL and WebGPU for simple form rendering. A little loop unrolling might be a lot easier than porting the rest of the app.
Anyway, best of luck on it!
Have you done any benchmarks to compare WASM performance with JS? I have been doing some benchmark test cases after a large refactor of my game engine written in TypeScript, which is built entirely on top of shared buffer arrays (using the ECS pattern, or so it's called, I guess). I'm done benching for a while now and am OK with the results. But I also compared my handcrafted quadtree sort/query algorithm with an off-the-shelf Rust implementation that matched the test requirements and saw only a 30% gain, and figured anything lower than that would not be worth the effort, assuming Rust compiled to WASM is the same or slower. I never bothered to test compiled WASM because I could not figure out how to share my buffer arrays with the Rust code using wasm-bindgen fast enough to justify spending more time on more benching lol
Aside - a long time ago, I worked on a high school math problem which involved summing the thousand fractions 0/1, 1/2, 2/3, ... up to 999/1000, so I did the obvious iteration.
(display (foldl + 0 (map (lambda (x) (/ x (+ x 1))) (range 0 1000))))
The solution matched and I was happy. Then I thought wouldn't it be neat to do this in a dozen languages and benchmark... but got nowhere with that idea. None of the languages supported fraction addition without jumping through a whole bunch of hoops. I wonder if it's still the same situation. If someone wants to give this a shot, the answer if you do this for the first ten is 17819/2520.
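For what it's worth, Python's standard library now makes this a non-hoop exercise via exact rationals - a quick sketch (of correctness, not speed):

    from fractions import Fraction

    # x/(x+1) for the first ten terms; matches the 17819/2520 above
    print(sum(Fraction(x, x + 1) for x in range(10)))    # 17819/2520

    # and the full thousand-term sum from the original problem
    print(sum(Fraction(x, x + 1) for x in range(1000)))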
In Haskell the numerator and denominator of the Rational type are unbounded integers, so one of the (many equivalent) ways of writing it is:
% is the constructor for Rational numbers, i.e. one half is written (and shown as) 1 % 2. I'll spare people the full result.
How in the world is PyPy faster than CPython by almost 2 orders of magnitude? Is PyPy normally faster in other benchmarks? If so, why? Can some Python expert pitch in?
PyPy uses just-in-time compilation, so for arithmetic-heavy computations not involving the GC, it should absolutely be faster.
This is the exact code you would expect pypy to run faster.
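A quick way to feel that gap without running the full benchmark is to time just the inner loop under both interpreters - a hypothetical snippet, absolute numbers will vary by machine:

    import timeit

    u = 7
    # one inner iteration of the benchmark: 100k modulo-and-add operations
    t = timeit.timeit(lambda: sum(j % u for j in range(100_000)), number=100)
    print(f"{t:.3f}s for 100 runs")

Running the same file with python3 and then with pypy3 should show a dramatic difference, since a tight integer-arithmetic loop is exactly what PyPy's tracing JIT handles well.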
Golang folks: after looking at the code, I am surprised and don't understand why the Go version is slow compared to C/C++/Rust. Can anyone please explain? Thank you.
Go's int defaults to 64 bits on a 64 bit machine while C/Rust/Java are all defaulting to 32 bit on the author's machine.
Editing the code and running locally to get a difference factor, it looks like the same Go code with int takes 1.5x the amount of time as the int32 code on my machine. That puts it ~right next to Java (assuming the perf change maps 1:1). Remove the GC and it comes in ~right next to C++.
Looking at the compiler assembly output on Godbolt it's fun to see all the extra panic info and checks Go adds before going to the loop whereas the C just segfaults from blasting forward anyways, sometimes with no detail.
Anyways, that's why I don't like these types of "benchmarks" (micro tests made into a media post). By the time someone actually looks into them to see what's up everyone has already seen the results and moved on.
Author here: Yeah, I was surprised that there don't seem to be many options for extra optimizations in the Go compiler. Would be curious if Go experts have more insight or other compiler recommendations.
I doubt it's the GC kicking in, but you could run it with the following environment variable set just to ensure that it's not doing anything.
EDIT: Trying it out quickly shows a small improvement actually, but so small as to likely be noise as I was doing other things on the machine.

I tried this with elixir, and on my m3 mac, it ran in 6.5s. Perl took over 40s :P

    #!/usr/bin/env elixir
    u =
      case System.argv() do
        [arg] -> String.to_integer(arg)
        _ -> raise "Please provide a valid integer as the first argument."
      end

    r = :rand.uniform(10_000) - 1
    a = :array.new(10_000, default: 0)

    a =
      Enum.reduce(0..9_999, a, fn i, acc ->
        inner_sum =
          Enum.reduce(0..99_999, 0, fn j, sum ->
            sum + rem(j, u)
          end)

        updated_value = :array.get(i, acc) + inner_sum + r
        :array.set(i, updated_value, acc)
      end)

    IO.puts(:array.get(r, a))
Oh, the programs are supposed to take an integer from the command line? Then I wonder how many of them would fail when given an integer that's just a wee bit too large (like, say (2^129)-1).
I don't think that's a concern. My assumption is that the point of taking user input is to prevent a compiler from precomputing the result.
It blows my mind to see that speed difference between different languages. Even in the slowest language there, we have lots of ways we could optimize this contrived code, but that's not something we can always do in the real world. Seeing the visualization here really underscores how fast some languages are.
> Seeing the visualization here really underscores how fast some languages are.
Yes and no.
It emphasizes that there are some scenarios where language choice can have an outsized effect on performance when running on modern architectures, which is a reminder some people could use or have never really witnessed.
But it's a very particular scenario that relatively few projects will encounter at a scale that meaningfully impacts them.
In particular, C and Rust are much better positioned to benefit from instruction pipelining and avoid cache busting because there are no conditionals and little-to-no out-of-loop code to execute. Even with a massive loop like this, that can be a real world scenario in things like data/signal processing and numerical computation. When you're doing those things, language choice can make a huge difference that may not be apparent without understanding how they work or at least seeing visualizations like this.
But even in C and Rust, most applications don't have to do this kind of thing. Most are dominated by smaller loops, with branching that breaks pipelining, complexity that busts cpu cache, calls and jumps that might interfere with both etc. In those much more common cases, the practical performance difference is going to be quite a lot less significant, and the fluency/ergonomics/safety/convenience of working in some other language can easily matter more.
Programming style is as much a determiner of pipeline and cache friendly code as domain. It’s a set of techniques that can be learned.
I like the extreme boost that PyPy gets here, which is much more than you usually get with PyPy. On one hand it shows the potential of PyPy, but it is testing just one area where PyPy is strong; other workloads depend on performance that is not as well optimized.
My take, though, is that branchy, old-AI, and heavy code (say, keeping every fact in a triple store rather than as a Python object, with the benefit of having an index for every permutation of ?s-?p-?o) that cannot be handled with numpy or similar tools benefits greatly from PyPy, but not 100x.
I thought the emphasis was on the visualization part, and not so much on the operations that are being benchmarked. Most benchmarks are presented as static bar charts or similar, but this particular site presented them as moving circles, which I thought was a very effective illustration of speed.
Personally I have gone at most 5 levels deep. It worked reasonably well with aggressive break-condition checking in my specific case.
How deep have people actually written useful nested loops?
This does not seem like a realistic benchmark target.
Perhaps it's unrealistic in the sense that this is unrealistically optimisable code.
Real-world code has closures, allocations, and pointer dereferencing. This code is just iterations, additions and modulus.
> This code is just iterations...
Which is what the average programmer out there writes. Therefore, I think this is actually a good realistic benchmark.
Precisely - it's bad code, which is what most code is!
What's a Github repository with good code?
Nobody's ever seen it!
All code is bad, some is useful
Useful is good. Ergo transitive property.
What the average programmer out there writes is just the tip of the iceberg of what gets executed. And an incredibly small tip.
Nor does it pretend to be.
It's literally just "1 Billion nested loop iterations", methods and results.
You make of that whatever you want.
The other problem with it is that there are other variables involved besides pure speed, which is CONSISTENCY of speed.
Folks are always surprised to see just how fast Java is able to perform on benchmarks - it has a reputation as such a slow language, so how come it's able to execute almost as fast as C++?
But Java's problem isn't execution performance. It's:
* Startup performance. It takes time to instantiate the JVM. That matters for some, not for others.
* Consistent performance. This is the nature of Garbage Collection. It can surprise you when it runs and drop a performance blip when you least expect it.
Most developers who think they need the fastest performance actually need CONSISTENT performance more than the fastest.
I even see some video games (say Dome Keeper) that periodically go out to lunch even on a top of the line gaming PC, and I understand that kind of game is often written in C#, which has a garbage collector. Thing is, I remember playing games on the much weaker PS Vita that were written in C#, but I never remember them pausing like that.
VR games are now the frontier of gaming (in terms of the industry outdoing itself) and there you're approaching hard real time requirements because it is so awful to get seasick. I appreciate it.
I think it might have more to do with the specific implementation than the tool.
Most incarnations of C# offer a GC.Collect method and an ability to configure the threading model around collections. Used properly, you can keep maximum delays well-bounded. You still have to work your ass off to minimize allocations, but when some do inevitably occur due to framework internals, etc., you don't want to have to clean up 3 hours worth of it at once. Do it every frame or scene change.
These hitches (in unityland anyways) usually mean the dev is allocating a lot of short lived objects in their render loop. Even little heuristics like keeping the new keyword out of Update() and using object pools prevent the majority of things like this for us over on the unity side.
Dome Keeper's dev Bippinbits is a Godot studio. Some comments on their opinions that I can't read until I dig out my twitter login lol. https://x.com/Bippinbits/status/1702351798302851398
Godot has some perf warts still being churned out, so I don't know what all they've got to do to finesse around them. Even if GDScript was greasy gcc lightning, it's really easy to crunch yourself into some compromises when you're trying to eat your way out of the bottom of ticket mountain before Nextfest or you starve or whatever. Small team gamedev is wild.
> * Consistent performance. This is the nature of Garbage Collection. It can surprise you when it runs and drop a performance blip when you least expect it.
This has been getting much better in Java in recent years. Shenandoah has been production-ready since Java 15, and we've been seeing further improvements since then.
> Processing thread stacks concurrently gives us reliable sub-millisecond pauses in JDK 17.
https://developers.redhat.com/articles/2021/09/16/shenandoah...
The animation is confusing...is one trip to the end and back the 1B iterations? I thought maybe they were counting up and once the slowest reached the end they would all stop but that didn't seem to be it. Maybe if it did (and there was a counter on each showing how many bounces they made), it'd be a little more clear.
Interesting nonetheless.
Looks like they are just bouncing back and forth at relative speeds to each other; the speed of each ball is all that matters. I also found this diagram confusing at first.
Looking at the source code, is the random value generated before the main loop iterations to stop the compiler from shortcutting all the iterations? Does that work on some of the better compilers out there? The random value is only generated once. I would like to see a similar benchmark on creating/destroying objects in memory.
If you are interested in looking into the details of microbenchmarks of this type, there is always https://benchmarksgame-team.pages.debian.net/benchmarksgame/... which has a clearly defined methodology and lets you understand wall-clock time vs CPU cost too.
The benchmark code linked in the submission unfortunately suffers from the mistakes that happen whenever you write a benchmark for the first time. The visuals are pretty but might as well have been based on something more comprehensive.
(also, please stop measuring performance with loops that perform no work and fibonacci sequences, this was bad 30 years ago, this is bad today too)
Cool, thanks! In another lifetime I had an internship at an embedded software company, where the job was to write tests and optimize compute or throughput for various chips and chip firmware, and reading the functions here sparked some alarm bells in my head. Still a fun idea though, and cool graphics.
Yeah, the random value prevents compile-time pre-calculation and the print makes sure the loops actually get run and values actually get calculated (as modern compilers are wont to simply omit code that clearly doesn't have consequence).
I would have loved to see C# on this list. Wonder if he's accepting PR's.
There are...a lot of PRs (and 4 as of this moment which add C#).
Note that these benchmarks ignore object-related stuff, which is usually the expensive part.
Would be nice to know max RAM usage by each implementation.
I wrote this to probe the question of why "slow" languages are popular and valid when used in conjunction with fast ones: https://tobeva.com/articles/beyond/
It's interesting to imagine the head scratching going on in this thread. All these clever people, stumped. 'How can this be!???'. Quite the post, bravo OP.
But seriously, other comments have this nailed. This 'benchmark' isn't comparing apples with apples. There are too many quirks, optimisations and design decisions (of a compiler) that have not been taken into account.
Also, as others have repeated, this isn't indicative of actual usage - real world usage. In that regard, this type of 'benchmark' isn't scientific at all, and worse, is misleading to a naive reader.
While it's not (necessarily) indicative of the languages as a whole, it looks pretty apples to apples to me.
These are very simple programs with no fancy tricks going on.
It's comparing a granny smith apple to a mcintosh apple by hitting them with a baseball bat and inferring the texture of the apple from that. It uses them in a way that almost nobody does in real life, and is a vague proxy to only one of several important factors. If you use this test as a basis for deciding which you would like, you're going to have a bad time, whereas you'll be fine if you merely look at what other people like.
Is the "way" it's using things affecting performance significantly?
If not then it's a pretty good proxy to one or two dimensions of performance, and a mix of similar benchmarks could paint a good fraction of a reasonable picture.
If you want to buy a bushel of apples and hit them with an array of objects at varying velocities and then try to construct a model that can use the resulting splatter patterns to infer the texture of the apple as experienced by humans, you're welcome to do so.
Personally, I'm going to bite the apple.
I do not accept your analogy.
This program is a perfectly reasonable "bite". It's not doing weird stuff.
The program is a useless toy that nobody would write and expect to be performant in real life. I can write an optimized version of the algorithm that prints the exact same output (barring edge cases) without looping. With this new algorithm (really just a single expression), the C program is 500x faster than before and the python script is 5000x faster than before. So python is only 10x slower than C in this example (and I'd imagine that this is measuring more overhead and less arithmetic).
Presumably the C compiler has a relative advantage at optimizing the looped algorithm vs the arithmetic algorithm. How much? Is C 100x faster than python, or was the compiler able to get closer to the non-looping algorithm than python? How close? Is it something about the arithmetic or the looping that's natively faster in C? What part of it was optimized by the compiler? Since nobody does large loops in python, would python be more comparable if we introduced a C-backed library like numpy?
What other objects do you think we should bash this apple with?
Do you think a slightly more complex piece of math that can't be replaced with an equation would have significantly different performance from the benchmark?
I don't. So the simple math just makes things easier, it doesn't harm results.
Your suggestion to skip the math entirely is the apple-smash.
> Since nobody does large loops in python
Oh I doubt that.
Numpy would be fine as a different entry.
I don't know whether or not a more complex piece of math would change the outcome, so I would err on the side of not making assumptions until I had done more benchmarks and investigated the inner workings more. You probably know more about C than I do and might be able to make that assumption more confidently, but I think most people know less than either of us, and will read too much into a single statistic without context.
Nobody who cares about performance enough to look at a benchmark like this would do large loops in pure python.
> Nobody who cares about performance enough to look at a benchmark like this would do large loops in pure python.
Because they already know it would be very slow, and putting specific numbers on that is helpful for them and for non-veterans too. Even with some wiggle room for "microbenchmarks are weird".
Doing Python or R right means using vector/matrix operations in e.g. numpy. Those will be faster.
Just think twice before using a for loop. Most of the time it is not needed.
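As a sketch of what that looks like for this benchmark's shape (mirroring the vectorised R in the reply below; bench and its parameters are just illustrative names):

    import numpy as np

    def bench(u: int, r: int) -> int:
        j = np.arange(100_000, dtype=np.int64)
        inner = int((j % u).sum())                        # the 100k inner loop as one vector op
        a = np.zeros(10_000, dtype=np.int64) + inner + r  # every element gets inner + r
        return int(a[r])

The per-element interpreter overhead disappears because the loop runs inside numpy's C code, which is the whole trick.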
If you write the R code using vectorised operations it's orders of magnitude faster (148 seconds down to 1.5 milliseconds on posit.cloud).
The nested loop becomes:

    j <- sum((seq_len(100000) - 1) %% u)
    a <- a + j + r
I disagree w.r.t. python. You shouldn't need a numerical computing library to do a for loop generally.
Kick the habit. For loops are what "goto" was 20 years ago: error-prone and not possible to abstract, for example for parallel execution.
I think the comparison to goto is a little extreme, but actually not too crazy. I could see a future where the accepted best practice is to use parallelized operations for everything, and only use sequential loops in cases where the underlying concept is inherently sequential (gradient descent, iterative hashing, etc.).