> 40x faster trigonometry: Speed-up standard library functions like std::sin in just 3 lines of code.
Huh, ok, let's see how...
* By limiting the expansion to a few
* terms, we can approximate `sin(x)` as:
*
* sin(x) ≈ x - (x^3)/3! + (x^5)/5!
*
* This reduces the computational cost but comes with reduced accuracy.
I see. "reduced accuracy" is an understatement. It's just horrifically wrong for inputs outside the range of [-2, 2]
You can always get more accuracy by expanding those 3 lines to handle more of the Taylor components… but it’s important to remember that this is still educational material.
You can find more complete examples in my SimSIMD (https://github.com/ashvardanian/SimSIMD), but they also often assume that at a certain part of a kernel, a floating point number is guaranteed to be in a certain range. This can greatly simplify the implementation for kernels like Atan2. For general-purpose inputs, go to SLEEF (https://sleef.org). Just remember that every large, complicated optimization starts with a small example.
Educational material that misinforms its readers isn't educational, and it's insanely counterproductive.
People have already ragged on you for doing Taylor approximation, and I'm not the best expert on the numerical analysis of implementing transcendental functions, so I won't pursue that further. But there are still several other unaddressed errors in your trigonometric code:
* If your function is going to omit range reduction, say so upfront. Saying "use me to get a 40× speedup because I omit part of the specification" is misleading to users, especially because you should assume that most users are not knowledgeable about floating-point and thus they aren't even going to understand they're missing something without you explicitly telling them!
* You're doing polynomial evaluation via `x * a + (x * x) * b + (x * x * x) * c`, which is not the common way of doing so, and it's also a slow way of doing so. If you're trying to be educational, do it via the `((((x * c) * x + b) * x) + a) * x` technique -- that's how it's done, that's how it should be done.
* Also, doing `x / 6.0` is a disaster for performance, because fdiv is one of the slowest operations you can do. Why not do `x * (1.0 / 6.0)` instead? (Both of these fixes are shown in the sketch after this list.)
* Doing really, really dumb floating-point code and then relying on -ffast-math to make the compiler unpick your dumbness is... a bad way of doing stuff. Especially since you're recommending people go for it for the easy speedup and saying absolutely nothing about where it can go catastrophically wrong. As Simon Byrne said, "Friends don't let friends use -ffast-math" (and the title of my explainer on floating-point will invariably be "Floating Point, or How I Learned to Start Worrying and Hate -ffast-math").
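For concreteness, here is a minimal sketch of those fixes applied to the quoted 3-term Taylor example (this is not the repo's actual code, just an illustration): nested evaluation in x², compile-time reciprocal constants instead of divisions, and an assert documenting the missing range reduction.

```cpp
#include <cassert>
#include <cmath>

// Illustrative sketch only: 3-term Taylor sin with nested (Horner-style)
// evaluation in x^2, reciprocal constants instead of fdiv, and an assert
// that documents the absence of range reduction.
inline double sin_taylor3(double x) {
    assert(-M_PI / 2 <= x && x <= M_PI / 2 && "no range reduction here");
    double x2 = x * x;
    // x - x^3/3! + x^5/5!  ==  x * (1 - x^2/6 + x^4/120)
    return x * (1.0 + x2 * (-1.0 / 6.0 + x2 * (1.0 / 120.0)));
}
```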
I'd say `x * a + (x * x) * b + (x * x * x) * c` is likely faster (subject to the compiler being reasonable) than `((((x * c) * x + b) * x) + a) * x` because it has a shorter longest instruction dependency chain. Add/Mul have higher throughput than latency, so the latency chain dominates performance and a few extra instructions will just get hidden away by instruction level parallelism.
Also x/6 vs x*(1/6) is not as bad as it used to be; fdiv keeps getting faster. On my Zen 2 it's 10 cycles latency and 0.33/cycle throughput for (vector) div, and 3 cycles latency and 2/cycle throughput for (vector) add. So about 1/3 speed, worse if you have a lot of divs and the pipeline fills up. Going back to the Pentium, the difference is ~10x and you don't get to hide it with instruction-level parallelism.
* The first expression has a chain of 4 instructions, each of which cannot start before the previous one has finished, `(((x * x) * x) * c) + the rest`, vs the entire expression being such a chain in the second version. Using fma instructions changes this a bit, making all the adds in both expressions 'free', but this changes precision and needs -ffast-math or such, which I agree is dangerous and generally ill advised.
-ffast-math has about 15 separate compiler flags that go into it, and on any given piece of code, about 3-5 of them are disastrous for your numerical accuracy (but which 3-5 changes by application). If you do the other 10, you get most of the benefit without the inaccuracy. -ffast-math is especially dumb because it encourages people to go for all of them or nothing.
Someone is still putting tremendous effort into this project, so I reckon it would be worthwhile to submit this obviously well-thought-through criticism as a PR for the repo!
For the range reduction, I've always been a fan of using revolutions rather than radians as the angle measure as you can just extract fractional bits to range reduce. Note that this is at the cost of a more complicated calculus.
I can't for the life of me find the Sony presentation, but the fastest polynomial calculation is somewhere between Horner's method (which has a huge dependency tree in terms of pipelining) and full polynomial evaluation (which has redundancy in calculation).
Totally with you on not relying on fast math! Not that I had much choice when I was working on games because that decision was made higher up!
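The usual middle ground between Horner's method and fully expanded evaluation is Estrin's scheme (whether that matches the Sony slides, I can't say). A hedged sketch for a generic degree-7 polynomial, where `c[0..7]` are placeholder coefficients and the explicit `std::fma` calls assume FMA hardware (and, as noted above, contraction changes rounding relative to separate mul/add):

```cpp
#include <cmath>

// Estrin's scheme for c0 + c1*x + ... + c7*x^7: pair up terms so the partial
// products are independent and can overlap in the pipeline, while keeping the
// longest dependency chain much shorter than Horner's.
inline double estrin7(double x, const double c[8]) {
    double x2 = x * x;
    double x4 = x2 * x2;
    double p01 = std::fma(c[1], x, c[0]);   // c0 + c1*x
    double p23 = std::fma(c[3], x, c[2]);   // c2 + c3*x
    double p45 = std::fma(c[5], x, c[4]);   // c4 + c5*x
    double p67 = std::fma(c[7], x, c[6]);   // c6 + c7*x
    double p03 = std::fma(p23, x2, p01);    // (c0 + c1*x) + (c2 + c3*x)*x^2
    double p47 = std::fma(p67, x2, p45);    // (c4 + c5*x) + (c6 + c7*x)*x^2
    return std::fma(p47, x4, p03);          // p03 + p47*x^4
}
```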
I don't know who the target audience is supposed to be, but who would be the type of person who tries to implement performance critical numerical codes but doesn't know the implications of Taylor expanding the sine function?
People who found that sin is the performance bottleneck in their code and are trying to find a way to speed it up.
One of the big problems with floating-point code in general is that users are largely ignorant of floating-point issues. Even something as basic as "0.1 + 0.2 != 0.3" shouldn't be that surprising to a programmer if you spend about five minutes explaining it, but the evidence is clear that it is a shocking surprise to a very large fraction of programmers. And that's the most basic floating-point issue, the one you're virtually guaranteed to stumble across if you do anything with floating-point; there's so many more issues that you're not going to think about until you uncover them for the first time (e.g., different hardware gives different results).
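For readers who haven't run into it yet, that first surprise fits in a couple of lines:

```cpp
#include <cstdio>

int main() {
    // Neither 0.1 nor 0.2 is exactly representable in binary floating point,
    // and the two rounding errors do not cancel out.
    std::printf("%.17g\n", 0.1 + 0.2);        // 0.30000000000000004
    std::printf("%d\n", (0.1 + 0.2) == 0.3);  // 0
}
```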
Thanks to a lot of effort by countless legions of people, we're largely past the years when it was common for different hardware to give different results. It's pretty much just contraction, FTZ/DAZ, funsafe/ffast-math, and NaN propagation. Anyone interested in practical reproducibility really only has to consider the first two among the basic parts of the language, and they're relatively straightforward to manage.
Divergent math library implementations are the other main category, and for many practical cases, you might have to worry about the parallelization factor changing things. For completeness' sake, I might as well add in approximate functions, but if you are using an approximate inverse square root instruction, well, you should probably expect that to differ on different hardware.
On the plus side, x87 excess precision is largely a thing of the past, and we've seen some major pushes towards getting rid of FTZ/DAZ (I think we're at the point where even the offload architectures are mandating denormal support?). Assuming Intel figures out how to fully get rid of denormal penalties on its hardware, we're probably a decade or so out from making -ffast-math no longer imply denormal flushing, yay. (Also, we're seeing a lot of progress on high-speed implementations of correctly-rounded libm functions, so I also expect to see standard libraries require correctly-rounded implementations as well).
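For anyone wondering what managing FTZ/DAZ looks like in practice: on x86 they are two bits in the per-thread MXCSR register, and GCC's -ffast-math links in startup code that switches them on. A small x86-specific sketch of toggling them explicitly via the SSE intrinsics headers (other architectures have their own equivalents):

```cpp
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE
#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE

// Flush subnormal results to zero (FTZ) and treat subnormal inputs as zero
// (DAZ) for the current thread. Faster on hardware with subnormal penalties,
// but the results then differ from strict IEEE 754 behavior.
void enable_ftz_daz() {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}
```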
The definition I use for determinism is "same inputs and same order = same results", down to the compiler level. All modern compilers on all modern platforms that I've tested take steps to ensure that for everything except transcendental and special functions (where it'd be an unreasonable guarantee).
I'm somewhat less interested in correctness of the results, so long as they're consistent. rlibm and related are definitely neat, but I'm not optimistic they'll become mainstream.
There are lots of cases where you can get away with moderate accuracy. Rotating a lot of batched sprites would be one of them; could easily get away with a mediocre Taylor series approximation, even though it's leaving free accuracy on the table compared to minimax.
But not having _range reduction_ is a bigger problem, I can't see many uses for a sin() approximation that's only good for half a wave. And as others have said, if you need range reduction for the approximation to work in its intended use case, that needs to be included in the benchmark because you're going to be paying that cost relative to `std::sin()`.
> tries to implement performance critical numerical codes but doesn't know the implications of Taylor expanding the sine function?
That would be me, I’m afraid. I know little about Taylor series, but I’m pretty sure it’s less than ideal for the use case.
Here’s a better way to implement faster trigonometry functions in C++: https://github.com/Const-me/AvxMath/blob/master/AvxMath/AvxM... That particular source file implements that for 4-wide FP64 AVX vectors.
I am genuinely quite surprised that the sine approximation is the eyeball catcher in that entire repo.
It will only take a 5-line PR to add Horner’s method and Chebyshev’s polynomials and probably around 20 lines of explanations, and everyone passionate about the topic is welcome to add them.
There are more than enough examples in the libraries mentioned above ;)
It's eye catching because it's advertised as a 40x speedup without any caveats.
Oh, I've missed that part. It's hard to fit README's bullet points on a single line, and I've probably removed too many relevant words.
I'll update the README statement in a second, and already patched the sources to explicitly focus on the [-π/2, π/2] range.
Thanks!
I'd suggest simply adding `assert(-M_PI/2 <= x && x <= M_PI/2)` to your function. It won't slow down the code in optimized builds, and makes it obvious that it isn't designed to work outside that range even if people copy/paste the code without reading it or any comments.
Also, it would be good to have even in a "production" use of a function like this, in case something outside that range reaches it by accident.
Yes, that’s a very productive suggestion! Feel free to open a PR, or I’ll just patch it myself in a couple of hours when I’m back to the computer. Thanks!
No. Do not use Taylor series approximations in your real code. They are slow and inaccurate. You can do much, much better with some basic numerical analysis. Chebyshev and Remez approximations will give you more bang for your buck every time.
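To make that concrete, here is a hedged sketch of one such fit: project sin onto Chebyshev polynomials over [-pi/2, pi/2] and evaluate with Clenshaw's recurrence. In real code the coefficients would be generated offline (and a Remez/minimax fit, e.g. from Sollya, squeezes out a bit more accuracy at the same degree); they are computed at runtime here only to keep the example self-contained.

```cpp
#include <array>
#include <cmath>
#include <cstdio>

constexpr int N = 8;  // number of Chebyshev terms

// Coefficients c_j = (2/N) * sum_k f(cos(theta_k)) * cos(j*theta_k),
// sampling f at the Chebyshev nodes theta_k = pi*(k+0.5)/N.
std::array<double, N> chebyshev_coeffs() {
    std::array<double, N> c{};
    for (int j = 0; j < N; ++j) {
        double sum = 0;
        for (int k = 0; k < N; ++k) {
            double theta = M_PI * (k + 0.5) / N;
            sum += std::sin(std::cos(theta) * M_PI / 2) * std::cos(j * theta);
        }
        c[j] = 2.0 / N * sum;
    }
    return c;
}

// Evaluate 0.5*c0 + sum_{j>=1} c_j * T_j(t) with Clenshaw's recurrence,
// after mapping x from [-pi/2, pi/2] onto t in [-1, 1].
double sin_chebyshev(double x, const std::array<double, N> &c) {
    double t = x / (M_PI / 2);
    double b1 = 0, b2 = 0;
    for (int j = N - 1; j >= 1; --j) {
        double b0 = 2 * t * b1 - b2 + c[j];
        b2 = b1;
        b1 = b0;
    }
    return t * b1 - b2 + 0.5 * c[0];
}

int main() {
    auto c = chebyshev_coeffs();
    std::printf("%.17g vs %.17g\n", sin_chebyshev(1.0, c), std::sin(1.0));
}
```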
> but it’s important to remember that this is still educational material.
Then it should be educating on the applicability and limitations of things like this instead of just saying "reduced accuracy" and hoping the reader notices the massive issues? Kinda like the ffast-math section does.
This is kind of a dumb objection. If your sine function has good accuracy in [-pi/2, pi/2], you can compute all other values by shifting the argument and/or multiplying the result by -1.
But then you have to include this in the benchmark and it will no longer be 40x faster.
There are a bunch of real situations where you can assume the input will be in a small range. And while reducing from [-pi;pi] or [-2*pi;2*pi] or whatever is gonna slow it down somewhat, I'm pretty sure it wouldn't be too significant, compared to the FP arith. (and branching on inputs outside of even that expanded target expected range is a fine strategy realistically)
Most real math libraries will do this with only a quarter of the period, accounting for both sine and cosine in the same numerical approximation. You can then do range reduction into the region [0, pi/2) and run your approximation, flipping the X or Y axis as appropriate for either sine or cosine. This can be done branchlessly and in a SIMD-friendly way, and is far better than using a higher-order approximation to cover a larger region.
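A hedged sketch of that idea (branchy for readability; the branchless/SIMD form replaces the switch with sign and selection masks). The `kernel_*` functions stand in for small-range polynomial approximations and are just std::sin/std::cos here to keep the sketch self-contained; real libraries also need an extended-precision reduction path for huge |x|, which this skips.

```cpp
#include <cmath>

// Quarter-period range reduction: fold x into r around the nearest multiple
// of pi/2 (so |r| <= pi/4 plus rounding), then pick +/-sin(r) or +/-cos(r)
// based on the quadrant.
static double kernel_sin(double r) { return std::sin(r); }  // placeholder polynomial
static double kernel_cos(double r) { return std::cos(r); }  // placeholder polynomial

double sin_reduced(double x) {
    double k = std::nearbyint(x * (2.0 / M_PI));  // nearest multiple of pi/2
    double r = x - k * (M_PI / 2.0);
    switch (static_cast<long long>(k) & 3) {      // quadrant of x
        case 0: return  kernel_sin(r);
        case 1: return  kernel_cos(r);
        case 2: return -kernel_sin(r);
        default: return -kernel_cos(r);
    }
}
```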
> branching on inputs outside of the target expected range is a fine strategy realistically
Branches at this scale are actually significant, and so will drastically impact being able to achieve 40x faster as claimed.
That's only if they're unpredictable; sure, perhaps on some workload it'll be unpredictable whether the input to sin/cos is greater than 2*pi, but I'm pretty sure on most it'll be nearly-always a "no". Perhaps not an optimization to take in general, but if you've got a workload where you're fine with 0.5% error, you can also spend a couple seconds thinking about what range of inputs to handle in the fast path. (hence "target expected range" - unexpected inputs getting unexpected branches isn't gonna slow down things if you've calibrated your expectations correctly; edited my comment slightly to make it clearer that that's about being out of the expanded range, not just [-pi/2,pi/2])
Assuming an even distribution over a single iteration of sin, that is [0,pi], this will have a ~30% misprediction rate. That's not rare.
I'm of course not suggesting branching in cases where you expect a 30% misprediction rate. You'd do branchless reduction from [-2*pi;2*pi] or whatever you expect to be frequent, and branch on inputs with magnitude greater than 2*pi if you want to be extra sure you don't get wrong results if usage changes.
Again, we're in a situation where we know we can tolerate a 0.5% error, we can spare a bit of time to think about what range needs to be handled fast or supported at all.
Those reductions need to be part of the function being benchmarked, though. Assuming a range limitation of [-pi,pi] even would be reasonable, there's certainly cases where you don't need multiple revolutions around a circle. But this can't even do that, so it's simply not a substitute for sin, and claiming 40x faster is a sham
At least do not name the function "sin". A former game dev works at my company and he uses similar tricks all the time. It makes the code much harder to read, and unless you are computing "sin" a lot, the speedup is not measurable.
At pi/2 that approximation gets you 1.0045, i.e. half a percent off, so it's not particularly good at that. (still could be sufficient for some uses though; but not the best achievable even with that performance)
Good argument reduction routines are not exactly easy for a novice to write, so I think this is a valid objection.
And it's not even using the Horner scheme for evaluating the polynomial.
It's not useless if it's good enough for the problem at hand.
Kaze Emanuar has two entire videos dedicated to optimizing sin() on the Nintendo 64 and he's using approximations like this without issues in his homebrew:
I came to these videos expecting someone else pushing Taylor series, but this video series was surprisingly good in terms of talking about hacking their way through some numerical analysis. These videos started with Taylor series, but did not end there. They came up with some hacky but much better polynomial approximations, which are respectable. Production math libraries also use polynomial approximations. They just don't use Taylor series approximations.
My small angle sin function is even faster: sin(x) = x
Oof this is bad. If you're going to ask people to approximate, use a Chebyshev approximation please. You will do sin(x) faster than this and more accurately.
The following comes from the complete opposite side of computing, microcontrollers. I've been working on an embedded system where the heap is about 256 KiB and the biggest stack has 4 KiB. I do write idiomatic modern C++ for the most part (even if I hate C++ metaprogramming with a passion), but not all tricks are suitable in all situations:
- CTRE is fine as long as you don't overflow the stack. I once tried to validate a string for an HTTP proxy configuration with an exhaustive regex; CTRE tried to allocate 5 KiB of stack 40 call frames in and therefore crashed the embedded system with a stack overflow. I had to remove port validation from the regex (matching a number between 1 and 65535 was a bridge too far) and check that part by hand instead. I've also had to dumb down other CTRE regexes in my code for similar reasons.
- Several constraints and design decisions led me to mostly ditch JSON internally and write my own BSON library. Instead of the traditional dynamically-allocated tree of nodes approach, it works directly in-place, so I can give it a std::vector with a chunk of reserved memory upfront and not worry about memory allocation or fragmentation later on (a simplified sketch of the idea follows this comment). One major benefit is that since there are no string escape sequences, I can return a std::string_view directly for string values. There are downsides to this approach, mostly revolving around modifications: one needs to be very careful not to invalidate iterators (which are raw pointers into the underlying buffer) while doing so, and adding/removing entries towards the beginning of a large document is expensive due to the memmove().
- I ditched newlib for picolibc and exterminated anything that pulled in the C/C++ standard library locale code (that was over 130 kilobytes of Flash altogether IIRC), which includes among other things C++ streams (they are bad for plenty of other reasons too, but mine was program size).
You seem to have mostly aimed for throughput and raw performance in your benchmarks, which is fine for a generic desktop or server-class system with a MMU and plenty of resources. Just wanna point out that other environments will have different constraints that will mandate different kinds of optimizations, like memory usage (heap/stack/program size), dynamic memory fragmentation, real-time/jitter...
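Since the in-place approach in the second bullet may sound abstract, here is a simplified stand-in (not the commenter's BSON code, and not real BSON): records are stored as [u32 length][bytes] in one pre-reserved buffer, and lookups hand back std::string_view into that buffer, so parsing allocates nothing.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string_view>
#include <vector>

// Return a view of the n-th length-prefixed record in `buf`, without copying
// or allocating. The views are invalidated by any mutation of the buffer,
// which mirrors the iterator-invalidation caveat mentioned above.
std::optional<std::string_view> nth_record(const std::vector<std::byte> &buf, std::size_t n) {
    std::size_t offset = 0;
    while (offset + sizeof(std::uint32_t) <= buf.size()) {
        std::uint32_t len;
        std::memcpy(&len, buf.data() + offset, sizeof(len));  // alignment-safe read
        offset += sizeof(len);
        if (len > buf.size() - offset) return std::nullopt;   // truncated record
        if (n-- == 0)
            return std::string_view(reinterpret_cast<const char *>(buf.data() + offset), len);
        offset += len;
    }
    return std::nullopt;
}
```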
I’ve done C++ on a Cortex-M0+ with 8KB of flash. Code size is a big issue. You have to disable a bunch of stuff (no exceptions, nothing that does dynamic allocation) but you can still use classes, virtual methods, templates, constexpr, etc. These are all things that are a pain to do in C and usually require a bunch of gross macros.
As a former C++ programmer now writing C, I think this is only true for templates, and even there, if you limit yourself somewhat, it is also fine in C. For constexpr it depends what you use it for. If it's something expensive to compute, I would just run a program at build time (caching the output) and include the result. This seems preferable to me anyhow. The same for tests.
Yeah, embedded C++ is a wildly different experience from vanilla. I've worked in large embedded C++ codebases where we couldn't use the STL and had to use homegrown containers for everything.
I wonder how Rust is stacking up (no pun intended) in the embedded game these days.
Very true! I'd also go for similar optimizations when processing texts or sparse linear algebra on Nvidia and AMD GPUs. You only have ~50 KB of constant memory, ~50 MB of shared memory, and ~50 GB of global memory. It is BIG compared to microcontrollers but very little compared to the scope of problems often solved on GPUs. So many optimizations revolve around compressed representations and coalesced memory accesses.
I am still looking for a short example of such CUDA kernels, and I would love to see more embedded examples if you have thoughts ;)
I haven't had to reach for them so far, either professionally or personally, but custom memory allocators (slab allocation, bump allocators...) and allocation strategies are something I've been meaning to look into. Too bad that the one game I've done reverse-engineering on used dynamic memory allocation for just about everything, with an allocator that uses a singly-linked list of used/free chunks that wouldn't look out of place in the 1980s.
I'm aware that the C++ standard library has polymorphic allocators alongside a couple of memory resource implementations. I've also heard that the dynamic dispatch of the polymorphic allocators could bring a speed penalty compared to a statically dispatched allocator or the standard std::allocator that uses operator new(), but I have no concrete data to judge either way.
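One low-friction way to experiment with that is the standard std::pmr machinery: monotonic_buffer_resource is essentially a bump allocator that can sit on a stack buffer. A minimal sketch (the dynamic-dispatch concern is about memory_resource::allocate being a virtual call, which a statically dispatched custom allocator avoids):

```cpp
#include <array>
#include <cstddef>
#include <memory_resource>
#include <string>
#include <vector>

int main() {
    // A fixed stack buffer backs a monotonic (bump) resource: each allocation
    // just advances a pointer, and everything is released wholesale when the
    // resource goes out of scope.
    std::array<std::byte, 4096> buffer;
    std::pmr::monotonic_buffer_resource arena(buffer.data(), buffer.size());

    std::pmr::vector<std::pmr::string> names(&arena);  // strings reuse the arena too
    names.emplace_back("sensor/temperature");
    names.emplace_back("sensor/humidity");
}
```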
> CTRE is fine as long as you don't overflow the stack
Which is to say CTRE is mostly not fine if you use it on user-provided strings, regardless of target environment. It's heavily recursion-based, never spills to the heap, and otherwise has no safeguards for memory use or recursion depth.
Once I've given up validating the port number with the regex, it no longer blew up the stack:
^http:\/\/([a-z0-9.-]+)\/?:([1-9][0-9]{0,4})$
I'll admit I haven't done a thorough job of auditing the stack usage afterwards, but not all regexes look like Perl codegolf. For simple, straightforward patterns I don't see any problems using CTRE, but I'd be interested to see some proof to the contrary if you have some.
The problem can occur in general if there is a greedy match within the regex: https://github.com/hanickadot/compile-time-regular-expressio...
Although it looks like this got fixed for simple patterns.
Not sure I'd reach for C++ or regexes in such a constrained micro environment. Anything where you don't directly understand the precise memory use is probably out.
The NumWorks N0100 graphical calculator had 1 MiB of Flash and 256 KiB of RAM. It packed seven mathematical apps (calculation, grapher, equations, statistics, regression, sequences, distributions) with a decently powerful maths engine/equation typesetter written in C++ and a MicroPython shell. They've paid a fair amount of attention to details in order to fit all of that in (least of all no STL), but C++ wielded correctly for embedded is no more of a memory hog than C.
Our target has ~1.5 MiB of Flash for program code and 512 KiB of RAM. We're using half of the former and maybe a third of the latter; the team barely paid any attention to program size or memory consumption. One day the project lead became slightly concerned about that, and by the end of the day I had shaved off 20% of Flash and RAM usage by going for the lowest-hanging fruit.
I find it a bit amusing to call a 250 MHz STM32H5 MCU a constrained micro environment, if anything it's a bit overkill for what we need.
> least of all no STL
That's certainly a restricted dialect of C++.
> I find it a bit amusing to call a 250 MHz STM32H5 MCU a constrained micro environment, if anything it's a bit overkill for what we need.
I took an "embedded" systems class in college 15+ years ago that targeted a 32-bit ARM with megabytes of ram, so using these kBs of RAM micros in 2025 definitely feels like a constrained environment to me. The platforms I work on with C++ professionally have, ya know, hundreds of gigabytes of RAM (and our application gets ~100% of it).
Why choose Abseil for associative containers? It was a big deal when it first appeared on the scene, and it's certainly still better than the standard library for the most part, but there are now a number of alternatives that have since improved upon its weaknesses. (I've been especially impressed by Boost's unordered_flat_map.)
I have increasingly soured on Abseil over the past couple of years. At Google, we've seen an exodus of many of the core library maintainers, and some of the more recent design choices have been questionable at best.
I've never tried `unordered_flat_map` and would probably prefer to keep Boost, Poco, Qt, and other mega-frameworks outside of this repo to make the builds leaner, but I'm open to alternatives. Which are the best to look into?
boost::unordered is a header-only library.
I thought the same! And boost::unordered_flat_map is indeed awesome.
This is a hodgepodge of random stuff. Completely incoherent. Would not really point anyone at this to try and dig out gems.
I agree that it may not be the most readable format. So far, the best-structured piece on similar topics I’ve seen is Algorithmica: <https://en.algorithmica.org/hpc>.
I am sure it overlaps in terms of topics, maybe even some examples, but the homepage suggests that the book is about 500 pages long. I generally don’t have a time budget to read that much, and in most cases, I want to play with the code more than read the text, especially when some new IO_uring feature, a metaprogramming tool, or an Assembly instruction comes out.
Another observation is that people don’t like to read into 5000-word essays on each topic. At best, those become training materials for the next LLMs and will affect future code only indirectly…
I’m all ears for alternative formats if you have recommendations, as I generally reimplement such projects every couple of years ;)
This could be like, 20 really good blog posts? A series? Idk. They don't have to be 5000+ words each :-).
Is there one for C? I do not like the "std::" whatnots.
Agreed. I'm not seeing any use of an automatic profiler to find the hot spots of one's code. If you don't know those, you're aiming blind at what to optimize and by how much. The only profiling mentioned is something manual you have to add to each scope you want to time.
In so, so much code, basic big-O analysis, concurrency analysis (minimize mutable state / coordination), and even a little thought about CPU caches (including TLB) yields huge dividends.
It does need some explanation and sectioning in between the readme and code comments, for the purpose of readability. I started reading the C++ source and wasn't sure what was happening, then realised the file is humongous and stopped. Perhaps splitting it into separate files with individual explanations would help. Also worth keeping in mind that with complex or intricate code, no one has any idea what you're doing unless you explicitly tell them.
I used such a structure in a benchmark suite https://github.com/MC-DeltaT/cpu-performance-demos
This is great! There has been far too much sophistry in C++ where the justification for some contortion is "performance", with no empirical data.
I am surprised about CTRE giving good results—I will admit I have thought of it more as a parlor trick than a viable engine. I will need to dig into that more. I also want to dig into the OpenMP & TBB threadpool benchmarks to see whether Boost::ASIO threadpool can be added into it.
I was also surprised about CTRE and I can imagine it being a viable tool for implementing parsers for 99% of use cases, where you may not need all the SIMD bells and whistles.
A word of caution, though: I remember the throughput differing vastly between GCC and MSVC builds. The latter struggles with heavy meta-programming and expression templates. I don't know why.
Oh MSVC, bless. Mingw setup might be a pain, but the dividends accrue over time.
--
This was a good reminder that I need to pay more attention to Unum's projects. I noticed this older blog article, https://www.unum.cloud/blog/2021-01-03-texts, and that brings up some questions. First, in 2025, is UStore a wholesale replacement for UDisk or are the two complementary? Second, what is the current Unum approach for replacing full-text search engines (e.g., ElasticSearch)?
I wish I'd had a short answer :)
For years, I've had a hope to build it in the form of an open-core project: open-source SotA solutions for Storage, Compute, and AI Modeling built bottom up. You can imagine the financial & time burden of building something like that with all the weird optimizations and coding practices listed above.
A few years in, with millions spent out of my pocket, without any venture support or revenue, I've decided to change gears and focus on a few niche workloads until some of the Unum tools become industry standards for something. USearch was precisely that, a toy Vector Search engine that would still, hopefully, be 10x better than alternatives, in one way or another: <https://www.unum.cloud/blog/2023-11-07-scaling-vector-search...>.
Now, ScyllaDB (through the Rust SDK) and YugaByte (through the C++ SDK) are the most recent DBMSs to announce features built on USearch, joining the ranks of many other tech products leveraging some of those optimizations. Last year I was playing around with different open-source growth & governance ideas, looking for a way to organize a more collaborative, rather than competitive, environment among our upstream users — no major releases, just occasional patches here and there.
It’s kickass that USearch is in DuckDB. That’s something I will play with, for sure.
ElasticSearch has always seemed geared too much towards concurrent querying with mixed workloads, and then it gets applied to logs… and, well, with logs you care about detection of known query sets at ingest, indexing speed, compression, and ability to answer queries over large cold indices in cheap object storage. And often when searching logs, you want exact matching, preferably with regex. Part of me wants to play with rolling my own crazy FM-index library, part of me thinks it might be easier to play games with Parquet dictionary tables (get factors out of regexps, check dictionary tables for the factors for great win), and part of me thinks I will be better off waiting to see what comes of the Rust-based ES challengers.
Will definitely follow announcements to come with StringZilla.
std::regex is such a nightmare, I didn't take the time to run the code myself but I'd be curious if you would see the same delta if you swapped it for boost::regex or re2.
You can beat CTRE throughput with a non-compile-time solution. My first recommendation will be to look at HyperScan. It has remarkable throughput: https://github.com/intel/hyperscan
I’ve only avoided it in the tutorial, as I want to keep the build system lean. I wouldn’t be surprised if it’s 10x faster than Boost in the average case.
To echo Ash, HyperScan is the performance king. Use it when you're shipping on x64, have control over the pattern sets, and don't mind building it and working through its idiosyncrasies. It has some nonstandard matching semantics, reflecting its focus on network intrusion detection systems (NIDS), and does not have Unicode support.
For general purpose usage, Google's RE2 and PCRE2 in JIT mode will offer pretty good performance. Zoltan Herczeg's work on the PCRE2's JIT is underappreciated. Both these options are widely available and portable.
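For a sense of what the portable options look like, a minimal RE2 sketch (assuming the re2 library is installed and linked; PCRE2's JIT is used through its own C API):

```cpp
#include <re2/re2.h>

#include <cstdio>
#include <string>

int main() {
    // Compile once, reuse many times; RE2 guarantees linear-time matching.
    RE2 pattern("([a-z0-9.-]+):(\\d+)");
    std::string host;
    int port = 0;
    if (RE2::FullMatch("example.com:8080", pattern, &host, &port))
        std::printf("host=%s port=%d\n", host.c_str(), port);
}
```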
There is no "coding FPGA". Verilog/VHDL are not programming languages, they are "hardware description languages" (HDL). That means when you write those languages literally what you're doing is specifying every single wire in the circuit. People that do FPGA work aren't even called programmers, they're called designers.
> High-Level Synthesis
Something that absolutely not a single professional designer uses.
I agree that the sentiment around HLS is very bad, but there aren't many good examples online showcasing the difference. Do you think HDL vs HLS tradeoffs will be visible on something like 8-bit unsigned integer 3x3x3 or 4x4x4 matrix multiplication?
I've started implementing this, but organizing a good testing environment was time-consuming. If someone here has a sandbox and the right skill set, 3x3x3 or 4x4x4 matrix multiplications can be a perfect candidate for FPGA ports and case studies!
Showing the tradeoffs between complexity & throughput of HLS and HDL implementations is the goal. I assume uint8 should be sufficient to make the point?
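As a strawman for that comparison, here is a rough sketch of what the 4x4 uint8 version might look like in Vitis-style HLS C++ (the pragmas are hints to the synthesis tool and are ignored by a regular compiler; the exact pragmas and interface choices are assumptions, not a tested design). An HDL version would describe the same multiply-accumulate array explicitly, register by register.

```cpp
#include <cstdint>

// 4x4 x 4x4 matrix multiply over uint8 inputs with a 32-bit accumulator.
// ARRAY_PARTITION exposes the operands as registers; PIPELINE asks the tool
// to start one output element per clock cycle.
void matmul4x4(const std::uint8_t a[4][4], const std::uint8_t b[4][4], std::uint32_t c[4][4]) {
#pragma HLS ARRAY_PARTITION variable=a complete dim=2
#pragma HLS ARRAY_PARTITION variable=b complete dim=1
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
#pragma HLS PIPELINE II=1
            std::uint32_t acc = 0;
            for (int k = 0; k < 4; ++k)
                acc += static_cast<std::uint32_t>(a[i][k]) * b[k][j];
            c[i][j] = acc;
        }
}
```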
Intel has always had terrible subnormal performance. It's not that difficult to implement in HW, and even if you still want to optimize for the normalized case, we're talking about a 1 cycle penalty, not an order(s)-of-magnitude penalty.
Hey! Your Google Benchmark post was one of my go-to resources when I started picking that up a couple years ago. I love to see the focus on performance benchmarking here, and the repository is laid out well. Nice work!
Most hardware-level observations, like the latency of various memory accesses or numeric operations, would be the same for the Rust code. As for higher-level abstractions, I've already started porting them to Rust <https://github.com/ashvardanian/less_slow.rs>.
Next, it would be exciting to implement a concurrent job-stealing graph algorithm in both languages to get a feel for their ergonomics in non-trivial problems. I can imagine it looks very different in Rust and C++, but before I get there, I'm looking for best practices for implementing nested associative containers with shared stateful allocators in Rust.
I’m definitely interested in seeing this kind of content in Rust, have you looked at Rayon’s implementation for work stealing yet? Can result in some very nice high-level code.
Great question! This has been top of mind for me for the last 2–3 years.
Short answer: sadly, no. I love the "usability" promise of coroutines—and even have 2–3 FOSS projects that could be rewritten entirely around C++ or Rust coroutines for better debuggability and extensibility—but my experiments show that the runtime cost of most coroutine‑like abstractions is simply too high. Frankly, I’m not even sure if a better design is possible on modern hardware.
This leads me to conclude that, despite my passion for SIMD and superscalar execution, the highest‑impact new assembly instructions that x86 and Arm could standardize would center on async execution and lightweight context switching... yet I haven’t seen any movement in that direction.
⸻
I also wrote toy examples for various range/async/stream models in C++, Rust, and Python, with measured latencies in inline comments:
Aside from coroutines (toy hand-rolled implementations and commonly used libraries), I've also played around with C++ executors, senders & receivers, but didn't have much success with them either. Maybe it's a skill issue.
> my experiments show that the runtime cost of most coroutine‑like abstractions is simply too high
Which runtime cost do you mean?
The main one I am aware of is a heap allocation per coroutine, though this can in some cases be elided if the coroutine is being called from another coroutine.
The other cost I am aware of is the initializing of the coroutine handle, but I think this is just a couple of pointers.
In both cases I would expect these overheads to be relatively modest compared to the cost of the I/O itself, though it's definitely better to elide the heap allocation when possible.
I don't know much about coroutine libraries like unifex (which I think your test is using), but a hand-coded prototype I was playing with doesn't seem to add much overhead: https://godbolt.org/z/8Kc1oKf15
My exploration into coroutines and I/O is only in the earliest stages, so I won't claim any of this to be definitive. But I am very interested in this question of whether the overhead is low enough to be a good match for io_uring or not.
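To make the "heap allocation per coroutine" point concrete, here is a toy lazy task (not unifex and not the prototype linked above, just a hedged sketch): every call to `add()` below creates a coroutine frame, which is the allocation that HALO may or may not elide depending on how the task escapes.

```cpp
#include <coroutine>
#include <cstdio>
#include <exception>
#include <utility>

// Minimal lazy task: suspends immediately, runs when resumed, keeps the
// result in the promise until the handle is destroyed.
struct task {
    struct promise_type {
        int value = 0;
        task get_return_object() {
            return task{std::coroutine_handle<promise_type>::from_promise(*this)};
        }
        std::suspend_always initial_suspend() noexcept { return {}; }
        std::suspend_always final_suspend() noexcept { return {}; }
        void return_value(int v) noexcept { value = v; }
        void unhandled_exception() { std::terminate(); }
    };

    explicit task(std::coroutine_handle<promise_type> h) : handle(h) {}
    task(task &&other) noexcept : handle(std::exchange(other.handle, {})) {}
    ~task() { if (handle) handle.destroy(); }

    int run() { handle.resume(); return handle.promise().value; }

    std::coroutine_handle<promise_type> handle;
};

task add(int a, int b) { co_return a + b; }  // frame allocated here (unless elided)

int main() {
    task t = add(2, 3);
    std::printf("%d\n", t.run());  // 5
}
```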
The cost of a context switch consists of two parts, one of which can be subdivided:
1. register save/restore
1a. user-space only registers
1b. full register save/restore, required in kernel space
2. the cost of the TLB flush, which is in turn proportional to the working set size of the switched-to process (i.e. if you don't touch much memory after the context switch, the cost is lower than if you do)
I am not sure that any assembler instructions could address either of these.
This is impressive work—feels like it should be packaged as a handbook for high-performance computing or even a practical guide for HFT developers. Great resource!
We’ve all been there. The feeling of staring at a codebase, trying to find optimizations in every corner—to speed things up, streamline processes, and keep teams happy.
But then you come across something like this Less Slow C++. It’s not just about being clever or showy—it’s about solving real problems in the most efficient way possible. And that’s why it stands out in a world full of quick fixes and overcomplicated solutions.
Keep pushing for clarity, efficiency, and above all, results. With tools like this, the sky really is the limit.
> 40x faster trigonometry: Speed-up standard library functions like std::sin in just 3 lines of code.
Huh, ok, let's see how...
I see. "reduced accuracy" is an understatement. It's just horrifically wrong for inputs outside the range of [-2, 2]https://www.wolframalpha.com/input?i=plot+sin+x%2C+x+-+%28x%...
It cannot handle a single interval of a sin wave, much less the repeating nature? What an absolutely useless "optimization"
You can always get more accuracy by expanding those 3 lines to handle more of the Taylor components… but it’s important to remember that this is still educational material.
You can find more complete examples in my SimSIMD (https://github.com/ashvardanian/SimSIMD), but they also often assume that at a certain part of a kernel, a floating point number is guaranteed to be in a certain range. This can greatly simplify the implementation for kernels like Atan2. For general-purpose inputs, go to SLEEF (https://sleef.org). Just remember that every large, complicated optimization starts with a small example.
Educational material that misinforms its readers isn't educational, and it's insanely counterproductive.
People have already ragged on you for doing Taylor approximation, and I'm not the best expert on the numerical analysis of implementing transcendental functions, so I won't pursue that further. But there's still several other unaddressed errors in your trigonometric code:
* If your function is going to omit range reduction, say so upfront. Saying "use me to get a 40× speedup because I omit part of the specification" is misleading to users, especially because you should assume that most users are not knowledgeable about floating-point and thus they aren't even going to understand they're missing something without you explicitly telling them!
* You're doing polynomial evaluation via `x * a + (x * x) * b + (x * x * x) * c` which is not the common way of doing so, and also, it's a slow way of doing so. If you're trying to be educational, do it via the `((((x * c) * x + b) * x) + a) * x` technique--that's how it's done, that's how it should be done.
* Also, doing `x / 6.0` is a disaster for performance, because fdiv is one of the slowest operations you can do. Why not do `x * (1.0 / 6.0)` instead?
* Doing really, really dumb floating-point code and then relying on -ffast-math to make the compiler unpick your dumbness is... a bad way of doing stuff. Especially since you're recommending people go for it for the easy speedup and saying absolutely nothing about where it can go catastrophically wrong. As Simon Byrne said, "Friends don't let friends use -ffast-math" (and the title of my explainer on floating-point will invariably be "Floating Point, or How I Learned to Start Worrying and Hate -ffast-math").
I'd say `x * a + (x * x) * b + (x * x * x) * c` is likely faster (subject to the compiler being reasonable) than `((((x * c) * x + b) * x) + a) * x` because it has a shorter longest instruction dependency chain. Add/Mul have higher throughput than latency, so the latency chain dominates performance and a few extra instructions will just get hidden away by instruction level parallelism.
Also x/6 vs x(1/6) is not as bad as it used to be, fdiv keeps getting faster. On my zen2 its 10 cycles latency and 0.33/cycle throughput for (vector) div, and 3 latency and 2/cycle throughput for (vector) add. So about 1/3 speed, worse if you have a lot of divs and the pipeline fills up. Going back to Pentium the difference is ~10x and you don't get to hide it with instruction parallelism.
* The first expression has a chain of 4 instructions that cannot be started before the last one finished `(((x * x) * x) * c) + the rest` vs the entire expression being a such a chain in the second version. Using fma instructions changes this a bit, making all the adds in both expressions 'free' but this changes precision and needs -ffast-math or such, which I agree is dangerous and generally ill advised.
-ffast-math has about 15 separate compiler flags that go into it, and on any given piece of code, about 3-5 of them are disastrous for your numerical accuracy (but which 3-5 changes by application). If you do the other 10, you get most of the benefit without the inaccuracy. -ffast-math is especially dumb because it encourages people to go for all of them or nothing.
Someone is still putting tremendous effort into this project so I reckon it would be worthwhile to submit this, obviously well thought through, criticism as a PR for the repo!
For the range reduction, I've always been a fan of using revolutions rather than radians as the angle measure as you can just extract fractional bits to range reduce. Note that this is at the cost of a more complicated calculus.
I can't for the life of me find the Sony presentation, but the fastest polynomial calculation is somewhere between Horner's method (which has a huge dependency tree in terms of pipelining) and full polynomial evaluation (which has redundancy in calculation).
Totally with you on not relying on fast math! Not that I had much choice when I was working on games because that decision was made higher up!
I don't know who the target audience is supposed to be, but who would be the type of person who tries to implement performance critical numerical codes but doesn't know the implications of Taylor expanding the sine function?
People who found that sin is the performance bottleneck in their code and are trying to find way to speed it up.
One of the big problems with floating-point code in general is that users are largely ignorant of floating-point issues. Even something as basic as "0.1 + 0.2 != 0.3" shouldn't be that surprising to a programmer if you spend about five minutes explaining it, but the evidence is clear that it is a shocking surprise to a very large fraction of programmers. And that's the most basic floating-point issue, the one you're virtually guaranteed to stumble across if you do anything with floating-point; there's so many more issues that you're not going to think about until you uncover them for the first time (e.g., different hardware gives different results).
Thanks to a lot of effort by countless legions of people, we're largely past the years when it was common for different hardware to give different results. It's pretty much just contraction, FTZ/DAZ, funsafe/ffast-math, and NaN propagation. Anyone interested in practical reproducibility really only has to consider the first two among the basic parts of the language, and they're relatively straightforward to manage.
Divergent math library implementations is the other main category, and for many practical cases, you might have to worry about parallelization factor changing things. For completeness' sake, I might as well add in approximate functions, but if you using an approximate inverse square root instruction, well, you should probably expect that to be differ on different hardware.
On the plus side, x87 excess precision is largely a thing of the past, and we've seen some major pushes towards getting rid of FTZ/DAZ (I think we're at the point where even the offload architectures are mandating denormal support?). Assuming Intel figures out how to fully get rid of denormal penalties on its hardware, we're probably a decade or so out from making -ffast-math no longer imply denormal flushing, yay. (Also, we're seeing a lot of progress on high-speed implementations of correctly-rounded libm functions, so I also expect to see standard libraries require correctly-rounded implementations as well).
The definition I use for determinism is "same inputs and same order = same results", down to the compiler level. All modern compilers on all modern platforms that I've tested take steps to ensure that for everything except transcendental and special functions (where it'd be an unreasonable guarantee).
I'm somewhat less interested in correctness of the results, so long as they're consistent. rlibm and related are definitely neat, but I'm not optimistic they'll become mainstream.
There are lots of cases where you can get away with moderate accuracy. Rotating a lot of batched sprites would be one of them; could easily get away with a mediocre Taylor series approximation, even though it's leaving free accuracy on the table compared to minimax.
But not having _range reduction_ is a bigger problem, I can't see many uses for a sin() approximation that's only good for half wave. And as others have said, if you need range reduction for the approximation to work in its intended use case, that needs to be included in the benchmark because you're going to be paying that cost relative to `std::sin()`.
> tries to implement performance critical numerical codes but doesn't know the implications of Taylor expanding the sine function?
That would be me, I’m afraid. I know little about Taylor series, but I’m pretty sure it’s less than ideal for the use case.
Here’s a better way to implement faster trigonometry functions in C++ https://github.com/Const-me/AvxMath/blob/master/AvxMath/AvxM... That particular source file implements that for 4-wide FP64 AVX vectors.
I am genuinely quite surprised that the sine approximation is the eyeball catcher in that entire repo.
It will only take a 5-line PR to add Horner’s method and Chebyshev’s polynomials and probably around 20 lines of explanations, and everyone passionate about the topic is welcome to add them.
There are more than enough examples in the libraries mentioned above ;)
It's eye catching because it's advertised as a 40x speedup without any caveats.
Oh, I've missed that part. It's hard to fit README's bullet points on a single line, and I've probably removed too many relevant words.
I'll update the README statement in a second, and already patched the sources to explicitly focus on the [-π/2, π/2] range.
Thanks!
I'd suggest simply adding `assert(-M_PI/2 <= x && x <= M_PI/2)` to your function. It won't slow down the code in optimized builds, and makes it obvious that it isn't designed to work outside that range even if people copy/paste the code without reading it or any comments.
Also, it would be good to have even in a "production" use of a function like this, in case something outside that range reaches it by accident.
Yes, that’s a very productive suggestion! Feel free to open a PR, or I’ll just patch it myself in a couple of hours when I’m back to the computer. Thanks!
No. Do not use Taylor series approximations in your real code. They are slow and inaccurate. You can do much, much better with some basic numerical analysis. Chebyshev and Remez approximations will give you more bang for your buck every time.
> but it’s important to remember that this is still educational material.
Then it should be educating on the applicability and limitations of things like this instead of just saying "reduced accuracy" and hoping the reader notices the massive issues? Kinda like the ffast-math section does.
This is kind of a dumb objection. If your sine function has good accuracy in [-pi/2, pi/2], you can compute all other values by shifting the argument and/or multiplying the result by -1.
But then you have to include this in the benchmark and it will no longer be 40x faster.
There are a bunch of real situations where you can assume the input will be in a small range. And while reducing from [-pi;pi] or [-2*pi;2*pi] or whatever is gonna slow it down somewhat, I'm pretty sure it wouldn't be too significant, compared to the FP arith. (and branching on inputs outside of even that expanded target expected range is a fine strategy realistically)
Most real math libraries will do this with only a quarter of the period, accounting for both sine and cosine in the same numerical approximation. You can then do range reduction into the region [0, pi/2) and run your approximation, flipping the X or Y axis as appropriate for either sine or cosine. This can be done branchlessly and in a SIMD-friendly way, and is far better than using a higher-order approximation to cover a larger region.
> branching on inputs outside of the target expected range is a fine strategy realistically
branches at this scale are actually significant, and so will drastically impact being able to achieve 40x faster as claimed
That's only if they're unpredictable; sure, perhaps on some workload it'll be unpredictable whether the input to sin/cos is grater than 2*pi, but I'm pretty sure on most it'll be nearly-always a "no". Perhaps not an optimization to take in general, but if you've got a workload where you're fine with 0.5% error, you can also spend a couple seconds thinking about what range of inputs to handle in the fast path. (hence "target expected range" - unexpected inputs getting unexpected branches isn't gonna slow down things if you've calibrated your expectations correctly; edited my comment slightly to make it clearer that that's about being out of the expanded range, not just [-pi/2,pi/2])
Assuming an even distribution over a single iteration of sin, that is [0,pi], this will have a ~30% misprediction rate. That's not rare.
I'm of course not suggesting branching in cases where you expect a 30% misprediction rate. You'd do branchless reduction from [-2*pi;2*pi] or whatever you expect to be frequent, and branch on inputs with magnitude greater than 2*pi if you want to be extra sure you don't get wrong results if usage changes.
Again, we're in a situation where we know we can tolerate a 0.5% error, we can spare a bit of time to think about what range needs to be handled fast or supported at all.
Those reductions need to be part of the function being benchmarked, though. Assuming a range limitation of [-pi,pi] even would be reasonable, there's certainly cases where you don't need multiple revolutions around a circle. But this can't even do that, so it's simply not a substitute for sin, and claiming 40x faster is a sham
At least do not name the function "sin". One former game dev works in my company and he is using similar tricks all the time. It makes code so hard to read and unless you are computing "sin" a lot speedup is not measurable.
At pi/2 that approximation gets you 1.0045, i.e. half a percent off, so it's not particularly good at that. (still could be sufficient for some uses though; but not the best achievable even with that performance)
Good argument reduction routines are not exactly easy for a novice to write, so I think this is a valid objection.
And it's not even using the Horner scheme for evaluating the polynomial.
It's not useless if it's good enough for the problem at hand.
Kaze Emanuar has two entire videos dedicated to optimizing sin() on the Nintendo 64 and he's using approximations like this without issues in his homebrew:
I came to these videos expecting someone else pushing Taylor series, but this video series was surprisingly good in terms of talking about hacking their way through some numerical analysis. These videos started with Taylor series, but did not end there. They came up with some hacky but much better polynomial approximations, which are respectable. Production math libraries also use polynomial approximations. They just don't use Taylor series approximations.
My small angle sin function is even faster: sin(x) = x
Oof this is bad. If you're going to ask people to approximate, use a Chebyshev approximation please. You will do sin(x) faster than this and more accurately.
The following comes from the complete opposite side of computing, microcontrollers. I've been working on an embedded system where the heap is about 256 KiB and the biggest stack has 4 KiB. I do write idiomatic modern C++ for the most part (even if I hate C++ metaprogramming with a passion), but not all tricks are suitable in all situations:
- CTRE is fine as long as you don't overflow the stack. I tried once to validate a string for a HTTP proxy configuration with an exhaustive regex, CTRE tried to allocate 5 KiB of stack 40 call frames in and therefore crashed the embedded system with a stack overflow. I've had to remove port validation from the regex (matching a number between 1 and 65535 was a bridge too far) and check that part by hand instead. I've also had to dumb down other CTRE regexes too in my code for similar reasons.
- Several constraints and design decisions led me to mostly ditch JSON internally and write my own BSON library. Instead of the traditional dynamically-allocated tree of nodes approach it works directly in-place, so I can give it a std::vector with a chunk of reserved memory upfront and not worry about memory allocation or fragmentation later on. One major benefit is that since there are no string escape sequences, I can return directly a std::string_view for string values. There are downsides to this approach, mostly revolving around modifications: one needs to be very careful not to invalidate iterators (which are raw pointers to the underlying buffer) while doing so and adding/removing entries towards the beginning of a large document is expensive due to the memmove().
- I ditched newlib for picolibc and exterminated anything that pulled in the C/C++ standard library locale code (that was over 130 kilobytes of Flash altogether IIRC), which includes among other things C++ streams (they are bad for plenty of other reasons too, but mine was program size).
You seem to have mostly aimed for throughput and raw performance in your benchmarks, which is fine for a generic desktop or server-class system with a MMU and plenty of resources. Just wanna point out that other environments will have different constraints that will mandate different kinds of optimizations, like memory usage (heap/stack/program size), dynamic memory fragmentation, real-time/jitter...
I’ve done C++ on a Cortex-M0+ with 8KB of flash. Code size is a big issue. You have to disable a bunch of stuff (no exceptions, nothing that does dynamic allocation) but you can still use classes, virtual methods, templates, constexpr, etc. These are all things that are a pain to do in C and usually require a bunch of gross macros.
As a former C++ programmer now writing C, I think this only true for templates, but it if you limited to somewhat this is also fine. For constexpr it depends what you use it for. If it something expensive to compute I would just run a program at build time (caching the output) and include the result. This seems preferable to me anyhow. The same for tests.
Yeah, embedded C++ is a wildly different experience from vanilla. I've worked in large embedded C++ codebases where we couldn't use the STL and had to use homegrown containers for everything.
I wonder how Rust is stacking up (no pun intended) in the embedded game these days.
Very true! I'd also go for similar optimizations when processing texts or sparse linear algebra on Nvidia and AMD GPUs. You only have ~50 KB of constant memory, ~50 MB of shared memory, and ~50 GB of global memory. It is BIG compared to microcontrollers but very little compared to the scope of problems often solved on GPUs. So many optimizations revolve around compressed representations and coalesced memory accesses.
I am still looking for a short example of such CUDA kernels, and I would love to see more embedded examples if you have thoughts ;)
I haven't had to reach for them so far either professionally or personally, but custom memory allocators (slab allocation, bump allocator...) and allocation strategies is something I've been meaning to look into. Too bad that the one game I've done reverse-engineering on used dynamic memory allocation for just about everything, with an allocator that uses a singly-linked list of used/free chunks that wouldn't look out of place in the 1980s.
I'm aware that the C++ standard library has polymorphic allocators alongside a couple of memory resource implementations. I've also heard that the dynamic dispatch for the polymorphic allocators could bring some optimization or speed penalties compared to a statically dispatched allocator or the standard std::allocator that uses operator new(), but I have no concrete data to judge either way.
> CTRE is fine as long as you don't overflow the stack
Which is to say CTRE is mostly not fine, if you use it on user-provided strings, regardless of target environment. It's heavily recursion based, with never spilling to the heap and otherwise no safeguards for memory use/recursion depth.
The regex which was overflowing the stack was something like this (simplified and from memory):
Once I've given up validating the port number with the regex, it no longer blew up the stack: I'll admit I haven't done a thorough job of auditing the stack usage afterwards, but not all regexes look like Perl codegolf. For simple, straightforward patterns I don't see any problems using CTRE, but I'd be interested to see some proof to the contrary if you have some.The problem can occur in general if there is a greedy match within the regex: https://github.com/hanickadot/compile-time-regular-expressio...
Although it looks like that this got fixed for simple patterns.
Not sure I'd reach for C++ or regexes in such a constrained micro environment. Anything where you don't directly understand the precise memory use is probably out.
The NumWorks N0100 graphical calculator had 1 MiB of Flash and 256 KiB of RAM. It packed seven mathematical apps (calculation, grapher, equations, statistics, regression, sequences, distributions) with a decently powerful maths engine/equation typesetter written in C++ and a MicroPython shell. They've paid a fair amount of attention to details in order to fit all of that in (least of all no STL), but C++ wielded correctly for embedded is no more of a memory hog than C.
Our target has ~1.5 MiB of Flash for program code and 512 KiB of RAM. We're using half of the former and maybe a third of the latter, the team barely paid any attention to program size or memory consumption. One day the project lead became slightly concerned about that and by the end of the day I shed off 20% of Flash and RAM usage going for the lowest hanging fruits.
I find it a bit amusing to call a 250 MHz STM32H5 MCU a constrained micro environment, if anything it's a bit overkill for what we need.
> least of all no STL
That's certainly a restricted dialect of C++.
> I find it a bit amusing to call a 250 MHz STM32H5 MCU a constrained micro environment, if anything it's a bit overkill for what we need.
I took an "embedded" systems class in college 15+ years ago that targeted a 32-bit ARM with megabytes of ram, so using these kBs of RAM micros in 2025 definitely feels like a constrained environment to me. The platforms I work on with C++ professionally have, ya know, hundreds of gigabytes of RAM (and our application gets ~100% of it).
Why choose Abseil for associative containers? It was a big deal when it first appeared on the scene, and it's certainly still better than the standard library for the most part, but there are now a number of alternatives that have since improved upon its weaknesses. (I've been especially impressed by Boost's unordered_flat_map.)
I have increasingly soured on Abseil over the past couple of years. At Google, we've seen an exodus of many of the core library maintainers, and some of the more recent design choices have been questionable at best.
I've never tried `unordered_flat_map` and would probably prefer to keep Boost, Poco, Qt, and other mega-frameworks outside of this repo to make the builds leaner, but I'm open to alternatives. Which are the best to look into?
boost::unordered is a header-only library.
I thought the same! And boost::unordered_flat_map is indeed awesome.
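A minimal sketch, assuming Boost 1.81+ is available; the interface is essentially drop-in for std::unordered_map, but the open-addressing layout avoids the per-node allocations:

```cpp
#include <boost/unordered/unordered_flat_map.hpp>
#include <cstdio>
#include <string>

// boost::unordered_flat_map is header-only and open-addressing: keys and
// values live in one contiguous table rather than in individually allocated
// nodes, which is where most of its speedup over std::unordered_map comes from.
int main() {
    boost::unordered_flat_map<std::string, int> counts;
    counts["cache"] += 1;
    counts["miss"] += 2;
    for (auto const& [key, value] : counts)
        std::printf("%s: %d\n", key.c_str(), value);
}
```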
This is a hodgepodge of random stuff. Completely incoherent. Would not really point anyone at this to try and dig out gems.
I agree that it may not be the most readable format. So far, the best-structured piece on similar topics I’ve seen is Algorithmica: <https://en.algorithmica.org/hpc>.
I am sure it overlaps in terms of topics, maybe even some examples, but the homepage suggests that the book is about 500 pages long. I generally don’t have a time budget to read that much, and in most cases, I want to play with the code more than read the text, especially when some new IO_uring feature, a metaprogramming tool, or an Assembly instruction comes out.
Another observation is that people don't like to read 5000-word essays on each topic. At best, those become training materials for the next LLMs and will affect future code only indirectly…
I’m all ears for alternative formats if you have recommendations, as I generally reimplement such projects every couple of years ;)
This could be like, 20 really good blog posts? A series? Idk. They don't have to be 5000+ words each :-).
Is there one for C? I do not like the "std::" whatnots.
Agreed. I'm not seeing any use of an automatic profiler to find the hot spots in one's code. If you don't know those, you're aiming blind at what to optimize and by how much. The only profiling mentioned is something manual you have to add to each scope you want to time.
In so, so much code, basic big-O analysis, concurrency analysis (minimize mutable state / coordination), and even a little thought about CPU caches (including TLB) yields huge dividends.
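For reference, a minimal sketch of the kind of manual scope timer being referred to (the names here are made up for the illustration); a sampling profiler like `perf` finds the hot spots without having to instrument each scope by hand.

```cpp
#include <chrono>
#include <cstdio>
#include <string_view>

// RAII scope timer: construct it at the top of a block, and it prints the
// block's wall-clock duration when the block exits.
struct ScopedTimer {
    std::string_view label;
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    ~ScopedTimer() {
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                      std::chrono::steady_clock::now() - start)
                      .count();
        std::printf("%.*s: %lld ns\n", (int)label.size(), label.data(), (long long)ns);
    }
};

int main() {
    ScopedTimer timer{"hot loop"};
    volatile double sum = 0;
    for (int i = 0; i < 1'000'000; ++i) sum = sum + i * 0.5;
}
```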
It does need some explanation and sectioning in between the readme and code comments, for the purpose of readability. I started reading the C++ source and wasn't sure what was happening, then realised the file is humongous and stopped. Perhaps splitting it into separate files with individual explanations would help. Also worth keeping in mind that with complex or intricate code, no one has any idea what you're doing unless you explicitly tell them.
I used such a structure in a benchmark suite https://github.com/MC-DeltaT/cpu-performance-demos
This is great! There has been far too much sophistry in C++ where the justification for some contortions is "performance", with no empirical data.
I am surprised about CTRE giving good results—I will admit I have thought of it more as a parlor trick than a viable engine. I will need to dig into that more. I also want to dig into the OpenMP & TBB threadpool benchmarks to see whether a Boost.Asio thread pool can be added to them.
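For anyone who wants to try that comparison, a minimal sketch of the Boost.Asio side (assuming Boost 1.66+, where `boost::asio::thread_pool` first appeared):

```cpp
#include <boost/asio/post.hpp>
#include <boost/asio/thread_pool.hpp>
#include <atomic>
#include <cstdio>

// A four-thread Boost.Asio pool: tasks are posted to the pool's internal queue,
// and join() blocks until all of them have run.
int main() {
    boost::asio::thread_pool pool(4);
    std::atomic<int> sum{0};
    for (int i = 0; i < 1000; ++i)
        boost::asio::post(pool, [&sum, i] { sum += i; });
    pool.join();
    std::printf("%d\n", sum.load()); // 499500
}
```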
I was also surprised about CTRE and I can imagine it being a viable tool for implementing parsers for 99% of use cases, where you may not need all the SIMD bells and whistles.
A word of caution, though: I remember the throughput differing vastly between GCC and MSVC builds. The latter struggles with heavy meta-programming and expression templates. I don't know why.
Oh MSVC, bless. MinGW setup might be a pain, but the dividends accrue over time.
--
This was a good reminder that I need to pay more attention to Unum's projects. I noticed this older blog article, https://www.unum.cloud/blog/2021-01-03-texts, and that brings up some questions. First, in 2025, is UStore a wholesale replacement for UDisk or are the two complementary? Second, what is the current Unum approach for replacing full-text search engines (e.g., ElasticSearch)?
I wish I had a short answer :)
For years, I've hoped to build it in the form of an open-core project: open-source SotA solutions for Storage, Compute, and AI Modeling, built bottom-up. You can imagine the financial & time burden of building something like that with all the weird optimizations and coding practices listed above.
A few years in, with millions spent out of my pocket, without any venture support or revenue, I've decided to change gears and focus on a few niche workloads until some of the Unum tools become industry standards for something. USearch was precisely that, a toy Vector Search engine that would still, hopefully, be 10x better than alternatives, in one way or another: <https://www.unum.cloud/blog/2023-11-07-scaling-vector-search...>.
Now, ScyllaDB (through the Rust SDK) and YugaByte (through the C++ SDK) are the most recent DBMSs to announce features built on USearch, joining the ranks of many other tech products leveraging some of those optimizations. Last year I was playing around with different open-source growth & governance ideas, looking for a way to organize a more collaborative, rather than competitive, environment among our upstream users: no major releases, just occasional patches here and there.
It was an interesting period, but now I'm again deep in "CUDA-GDB" land, and the next major release will be precisely around Full-Text Search in StringZilla <https://github.com/ashvardanian/stringzilla>, which will be integrated into both USearch <https://github.com/unum-cloud/usearch> and somewhere else ;)
It’s kickass that USearch is in DuckDB. That’s something I will play with, for sure.
ElasticSearch has always seemed geared too much towards concurrent querying with mixed workloads, and then it gets applied to logs… and, well, with logs you care about detection of known query sets at ingest, indexing speed, compression, and ability to answer queries over large cold indices in cheap object storage. And often when searching logs, you want exact matching, preferably with regex. Part of me wants to play with rolling my own crazy FM-index library, part of me thinks it might be easier to play games with Parquet dictionary tables (get factors out of regexps, check dictionary tables for the factors for great win), and part of me thinks I will be better off waiting to see what comes of the Rust-based ES challengers.
Will definitely follow announcements to come with StringZilla.
> Part of me wants to play with rolling my own crazy FM-index library
Oh, absolutely, go for it! And check out what Pesho <https://github.com/pesho-ivanov> & Ragnar are doing <https://github.com/RagnarGrootKoerkamp> ;)
std::regex is such a nightmare. I didn't take the time to run the code myself, but I'd be curious whether you would see the same delta if you swapped it for boost::regex or RE2.
I think there's a case to be made for libraries like https://github.com/foonathan/lexy and/or https://github.com/boostorg/parser instead of reaching for a regex in the first place.
You can beat CTRE throughput with a non-compile-time solution. My first recommendation would be to look at HyperScan; it has remarkable throughput: https://github.com/intel/hyperscan
I’ve only avoided it in the tutorial, as I want to keep the build system lean. I wouldn’t be surprised if it’s 10x faster than Boost in the average case.
To echo Ash, HyperScan is the performance king. Use it when you're shipping on x64, have control over the pattern sets, and don't mind building it and working through its idiosyncrasies. It has some nonstandard matching semantics, reflecting its focus on network intrusion detection systems (NIDS), and does not have Unicode support.
For general-purpose usage, Google's RE2 and PCRE2 in JIT mode will offer pretty good performance. Zoltan Herczeg's work on PCRE2's JIT is underappreciated. Both of these options are widely available and portable.
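A minimal RE2 sketch, assuming the library is installed; the point of RE2 is that matching is guaranteed linear-time in the input, so it cannot blow up the way backtracking engines can:

```cpp
#include <re2/re2.h>
#include <cassert>
#include <string>

// RE2 compiles the pattern once into an automaton; matching is linear in the
// input length, with no exponential backtracking cases.
int main() {
    RE2 pattern(R"(([\w.]+):(\d{1,5}))"); // compile once, reuse many times
    assert(pattern.ok());

    std::string host;
    int port = 0;
    if (RE2::FullMatch("example.com:8080", pattern, &host, &port))
        assert(host == "example.com" && port == 8080);
}
```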
If you're taking priority requests:
"How coding FPGA differs from GPU and what is High-Level Synthesis, Verilog, and VHDL? #36" Yes please!
There is no "coding FPGA". Verilog/VHDL are not programming languages; they are "hardware description languages" (HDLs). That means when you write in those languages, what you're literally doing is specifying every single wire in the circuit. People who do FPGA work aren't even called programmers; they're called designers.
> High-Level Synthesis
Something that absolutely not a single professional designer uses.
I agree that the sentiment around HLS is very bad, but there aren't many good examples online showcasing the difference. Do you think the HDL vs. HLS tradeoffs would be visible on something like an 8-bit unsigned integer 3x3x3 or 4x4x4 matrix multiplication?
I've started implementing this, but organizing a good testing environment was time-consuming. If someone here has a sandbox and the right skill set, 3x3x3 or 4x4x4 matrix multiplications can be a perfect candidate for FPGA ports and case studies!
What data type/precision? How much parallelism versus iterative? Is this fully in fabric or part of a SoC? Easy to get mired in many details.
HW design has so many degrees of freedom, which makes it fun but challenging.
What specific topics are you trying to address with this?
Showing the tradeoffs between the complexity & throughput of HLS and HDL implementations is the goal. I assume uint8 should be sufficient to make the point?
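To make the HLS half of that comparison concrete, a minimal sketch of a 4x4 uint8 matrix multiply written the HLS way (the pragmas assume Vitis HLS; a regular C++ compiler ignores them, so the same source doubles as the functional reference model). The HDL version would spell out the multiplier array, accumulators, and control logic explicitly, which is exactly the tradeoff worth showcasing.

```cpp
#include <cstdint>

// 4x4 uint8 matmul in HLS style: plain C++ plus tool-specific pragmas asking
// for partitioned arrays and a pipelined inner loop (initiation interval 1).
void matmul4x4(const uint8_t a[4][4], const uint8_t b[4][4], uint32_t c[4][4]) {
#pragma HLS ARRAY_PARTITION variable=a complete dim=2
#pragma HLS ARRAY_PARTITION variable=b complete dim=1
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
#pragma HLS PIPELINE II=1
            uint32_t acc = 0; // wide accumulator: 4 products of up to 255*255 each
            for (int k = 0; k < 4; ++k)
                acc += static_cast<uint32_t>(a[i][k]) * b[k][j];
            c[i][j] = acc;
        }
}
```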
This is only from a cursory look, but it all looks really concise and practical - thank you for putting it all together!
I did not know about denormalized (subnormal) floats. I sometimes wonder about them while staring at my GPU as it multiplies matrices.
I made a bunch of bad measurements until someone reminded me to:
It had a 50x performance impact on Intel, as benchmarked on `c7i.metal-48xl` instances. Here is that section in the repo, with more notes on AVX-512, AMX, and other instructions: <https://github.com/ashvardanian/less_slow.cpp/blob/8f32d65cc...>.
Intel has always had terrible subnormal performance. It's not that difficult to implement in HW, and even if you still want to optimize for the normalized case, we're talking about a 1-cycle penalty, not an order(s)-of-magnitude penalty.
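For context, the usual knob for this class of subnormal slowdown on x86 is the FTZ/DAZ pair of MXCSR bits. A minimal sketch of toggling them, assuming an SSE-capable target:

```cpp
#include <pmmintrin.h> // _MM_SET_DENORMALS_ZERO_MODE
#include <xmmintrin.h> // _MM_SET_FLUSH_ZERO_MODE
#include <cstdio>

int main() {
    // Treat subnormal results as zero (FTZ) and subnormal inputs as zero (DAZ),
    // trading strict IEEE-754 semantics for predictable throughput.
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

    volatile float tiny = 1e-39f;        // subnormal in binary32
    volatile float scaled = tiny * 0.5f;
    std::printf("%g\n", (double)scaled); // prints 0 once FTZ/DAZ are enabled
}
```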
Hey! Your Google Benchmark post was one of my go-to resources when I started picking that up a couple years ago. I love to see the focus on performance benchmarking here, and the repository is laid out well. Nice work!
> Less Slow C, C++, and Assembly Code
I expected C code too, not only .cpp and .S files.
The less_slow.cpp file uses a lot of C++-isms.
This may require fixing, or removing "C" from the list.
I would pay some real money for the rust equivalent of this kind of material. Anyone got a book or repo they like and would recommend?
Most hardware-level observations, like the latency of various memory accesses or numeric operations, would be the same for the Rust code. As for higher-level abstractions, I've already started porting them to Rust <https://github.com/ashvardanian/less_slow.rs>.
Next, it would be exciting to implement a concurrent job-stealing graph algorithm in both languages to get a feel for their ergonomics in non-trivial problems. I can imagine it looks very different in Rust and C++, but before I get there, I'm looking for best practices for implementing nested associative containers with shared stateful allocators in Rust.
In C++, I've implemented them like this: <https://github.com/ashvardanian/less_slow.cpp/blob/8f32d65cc...>, even though I haven't seen many people doing that in public codebases. Any good examples for Rust?
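For comparison purposes, this is not the repo's code (the link above has that), just a minimal sketch of the C++ pattern in question: nested pmr containers where the outer map, its string keys, and the inner vectors all draw from one shared arena.

```cpp
#include <memory_resource>
#include <string>
#include <unordered_map>
#include <vector>

// polymorphic_allocator performs uses-allocator construction, so the nested
// pmr::string keys and pmr::vector values inherit the outer map's resource.
int main() {
    std::pmr::monotonic_buffer_resource arena;
    std::pmr::unordered_map<std::pmr::string, std::pmr::vector<int>> graph{&arena};

    auto it = graph.try_emplace(std::pmr::string{"node", &arena}).first;
    it->second.push_back(42); // the inner vector already allocates from `arena`
}
```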
I’m definitely interested in seeing this kind of content in Rust, have you looked at Rayon’s implementation for work stealing yet? Can result in some very nice high-level code.
> Let me know if you have ideas for other cool, obscure topics to cover!
Everything that Kris Jusiak has under https://github.com/qlibs/ is worth a look.
Awesome, thanks.
Does C++ have a good async ('coroutine') story for io_uring yet?
Great question! This has been top of mind for me for the last 2–3 years.
Short answer: sadly, no. I love the "usability" promise of coroutines—and even have 2–3 FOSS projects that could be rewritten entirely around C++ or Rust coroutines for better debuggability and extensibility—but my experiments show that the runtime cost of most coroutine‑like abstractions is simply too high. Frankly, I’m not even sure if a better design is possible on modern hardware.
This leads me to conclude that, despite my passion for SIMD and superscalar execution, the highest‑impact new assembly instructions that x86 and Arm could standardize would center on async execution and lightweight context switching... yet I haven’t seen any movement in that direction.
⸻
I also wrote toy examples for various range/async/stream models in C++, Rust, and Python, with measured latencies in inline comments:
Aside from coroutines (toy hand-rolled implementations and commonly used libraries), I've also played around with C++ executors, senders & receivers, but didn't have much success with them either. May be a skill issue.
> my experiments show that the runtime cost of most coroutine‑like abstractions is simply too high
Which runtime cost do you mean?
The main one I am aware of is a heap allocation per coroutine, though this can in some cases be elided if the coroutine is being called from another coroutine.
The other cost I am aware of is the initializing of the coroutine handle, but I think this is just a couple of pointers.
In both cases I would expect these overheads to be relatively modest compared to the cost of the I/O itself, though it's definitely better to elide the heap allocation when possible.
I don't know much about coroutine libraries like unifex (which I think your test is using), but a hand-coded prototype I was playing with doesn't seem to add much overhead: https://godbolt.org/z/8Kc1oKf15
If we can compile with -fno-exceptions, the code is even tighter: https://godbolt.org/z/5Yo8Pqvd5
My exploration into coroutines and I/O is only in the earliest stages, so I won't claim any of this to be definitive. But I am very interested in this question of whether the overhead is low enough to be a good match for io_uring or not.
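To make the per-coroutine cost concrete (this is a hand-rolled toy, not unifex or any other library's API): a minimal lazy task where every call to `make_answer()` creates a coroutine frame, which by default means one heap allocation unless the compiler can prove the frame's lifetime and elide it (HALO).

```cpp
#include <coroutine>
#include <cstdio>
#include <exception>
#include <utility>

// A minimal lazy task returning an int: suspended at start, resumed on get().
struct Task {
    struct promise_type {
        int value = 0;
        Task get_return_object() {
            return Task{std::coroutine_handle<promise_type>::from_promise(*this)};
        }
        std::suspend_always initial_suspend() noexcept { return {}; }
        std::suspend_always final_suspend() noexcept { return {}; }
        void return_value(int v) noexcept { value = v; }
        void unhandled_exception() { std::terminate(); }
    };

    explicit Task(std::coroutine_handle<promise_type> h) : handle_(h) {}
    Task(Task&& other) noexcept : handle_(std::exchange(other.handle_, {})) {}
    ~Task() { if (handle_) handle_.destroy(); }

    int get() {
        handle_.resume();              // run the body to completion
        return handle_.promise().value;
    }

  private:
    std::coroutine_handle<promise_type> handle_;
};

Task make_answer() { co_return 42; }   // frame allocated here (unless elided)

int main() {
    Task t = make_answer();
    std::printf("%d\n", t.get());      // prints 42
}
```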
> lightweight context switching
the cost of a context switch consists of two parts, one of which can be subdivided:
1. register save/restore
2. the cost of the TLB flush, which is in turn proportional to the working set size of the switched-to process (i.e. if you don't touch much memory after the context switch, the cost is lower than if you do)
I am not sure that any assembler instructions could address either of these.
What do you have in mind?
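One way to see the split between those two costs is user-space context switching, e.g. POSIX `ucontext` (a minimal sketch, Linux-flavoured and deprecated on some platforms): it pays the register save/restore but never leaves the address space, so there is no TLB flush.

```cpp
#include <ucontext.h>
#include <cstdio>

// Two contexts in one process: swapcontext() saves/restores registers and the
// stack pointer, but the page tables (and thus the TLB) stay untouched.
static ucontext_t main_ctx, coro_ctx;
static char coro_stack[64 * 1024];

static void coro_body() {
    std::puts("inside the switched-to context");
    swapcontext(&coro_ctx, &main_ctx); // back to main: registers only
}

int main() {
    getcontext(&coro_ctx);
    coro_ctx.uc_stack.ss_sp = coro_stack;
    coro_ctx.uc_stack.ss_size = sizeof(coro_stack);
    coro_ctx.uc_link = &main_ctx;
    makecontext(&coro_ctx, coro_body, 0);

    swapcontext(&main_ctx, &coro_ctx); // into coro_body
    std::puts("back in main");
}
```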
https://github.com/NVIDIA/stdexec/blob/main/examples/io_urin...
This is impressive work—feels like it should be packaged as a handbook for high-performance computing or even a practical guide for HFT developers. Great resource!
Looks like AI slop.
Could also be just slop or failure to organize.
We’ve all been there. The feeling of staring at a codebase, trying to find optimizations in every corner—to speed things up, streamline processes, and keep teams happy.
But then you come across something like this Less Slow C++. It’s not just about being clever or showy—it’s about solving real problems in the most efficient way possible. And that’s why it stands out in a world full of quick fixes and overcomplicated solutions.
Keep pushing for clarity, efficiency, and above all, results. With tools like this, the sky really is the limit.
Is this account a bot? It was flagged as such early on and is now shadowbanned. (Only people with showdead on will see the comments).