This looks like most of it was vibecoded.
Unnecessary comments like:
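(Hypothetical, not lifted from this repo, but the genre of comment in question looks like this:)

    // This method adds two numbers and returns the sum
    static int add(int a, int b) {
        // Add a and b together
        int sum = a + b;
        // Return the computed sum
        return sum;
    }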
can be found throughout the source, and the project's landing page is a good example of typical SOTA models' outputs when asked for a frontend landing page.

Okay, but is that a bad thing?
If the author doesn't understand their own code, I probably won't
Vibe coding doesn't mean the author doesn't understand their code. It's likely that they don't want carpal tunnel from typing out trivial code and hence offload that labor to a machine.
JNI for io_uring is not trivial code.
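(To give a sense of why: even with the newer java.lang.foreign API rather than JNI, binding a single liburing call takes real ceremony, before you've touched any SQE/CQE struct layouts. A rough sketch, assuming liburing is installed and a recent JDK:)

    import java.lang.foreign.*;
    import java.lang.invoke.MethodHandle;

    public class UringBind {
        public static void main(String[] args) throws Throwable {
            try (Arena arena = Arena.ofConfined()) {
                Linker linker = Linker.nativeLinker();
                SymbolLookup liburing = SymbolLookup.libraryLookup("liburing.so.2", arena);

                // int io_uring_queue_init(unsigned entries, struct io_uring *ring, unsigned flags)
                MethodHandle queueInit = linker.downcallHandle(
                        liburing.find("io_uring_queue_init").orElseThrow(),
                        FunctionDescriptor.of(ValueLayout.JAVA_INT,
                                ValueLayout.JAVA_INT, ValueLayout.ADDRESS, ValueLayout.JAVA_INT));

                // struct io_uring is a couple hundred bytes; over-allocate rather
                // than spell out the whole native layout in this sketch
                MemorySegment ring = arena.allocate(512);
                int rc = (int) queueInit.invoke(256, ring, 0);
                System.out.println("io_uring_queue_init -> " + rc);
            }
        }
    }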
"Vibe-coding" means the author deliberately does not understand their code. "AI-assisted engineering" is what you are thinking of.
For your pet project? No. For something you're building for others to use? Almost certainly yes.
You do realize that it's possible to ask AI to write code and then read the code yourself to ensure it's valid, right? I usually try to strip the pointless comments, but it's not the end of the world if people leave them in.
Yeah but you're leaving out a crucial part: the code is full of useless comments.
That leaves 2 options:
- they didn't read the code themselves to ensure it's valid
- they did read the code themselves but left the useless comments
No matter which happened, it shows they're a bad developer, and I don't want to run their code.
> I usually try to strip the pointless comments
You could add your own instead, explaining how things work?
> It's possible to ask AI to write code and then read the code yourself
Sure, but then it would not be vibecoding.
>> It's possible to ask AI to write code and then read the code yourself
> Sure, but then it would not be vibecoding.
Wait, what?
AI-assisted coding/engineering becomes "vibe coding" when you decide to abdicate any understanding of what you are building, instead focusing only on the outcome.
Vibe-coding as originally defined (by Karpathy?) implied not reading the code at all, just trying it and pasting back any error codes; repeat ad infinitum until it works or you give up.
Now the term has evolved into "using AI in coding" (usually with a hint of non rigor/casualness), but that's not what it originally meant.
The comments aren’t the problem.
27us roundtrip is not really state of the art for zero-copy IPC; about 1us would be. What is causing this overhead?
Asking for those who, like me, haven't yet taken the time to find technical information on that webpage:
What exactly does that roundtrip latency number measure (especially your 1us)? Does zero copy imply mapping pages between processes? Is there an async kernel component involved (like I would infer from "io_uring") or just two user space processes mapping pages?
27us and 1us are both an eternity and definitely not SOTA for IPC. The fastest possible way to do IPC is with a shared memory resident SPSC queue.
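For reference, the shape of that (my own minimal sketch, not this project's code, needs JDK 22+): a memory-mapped file shared between two processes, head and tail counters on separate cache lines, and release/acquire publication through java.lang.foreign:

    import java.lang.foreign.*;
    import java.lang.invoke.VarHandle;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    class SpscRing {
        static final long SLOTS = 1024, SLOT_SIZE = 64;      // one cache line per slot
        static final VarHandle LONG = ValueLayout.JAVA_LONG.varHandle();
        final MemorySegment seg;                             // [head pad][tail pad][slots...]

        SpscRing(Path file) throws Exception {
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                seg = ch.map(FileChannel.MapMode.READ_WRITE, 0,
                        128 + SLOTS * SLOT_SIZE, Arena.global());
            }
        }

        // Producer side: fill the slot first, then publish it with a release store.
        boolean offer(long value) {
            long tail = (long) LONG.getVolatile(seg, 64L);
            long head = (long) LONG.getVolatile(seg, 0L);
            if (tail - head == SLOTS) return false;          // queue full
            seg.set(ValueLayout.JAVA_LONG, 128 + (tail % SLOTS) * SLOT_SIZE, value);
            LONG.setRelease(seg, 64L, tail + 1);
            return true;
        }

        // Consumer side: acquire-load the tail so the slot write is visible.
        Long poll() {
            long head = (long) LONG.getVolatile(seg, 0L);
            long tail = (long) LONG.getAcquire(seg, 64L);
            if (head == tail) return null;                   // queue empty
            long v = seg.get(ValueLayout.JAVA_LONG, 128 + (head % SLOTS) * SLOT_SIZE);
            LONG.setRelease(seg, 0L, head + 1);
            return v;
        }
    }

With both sides spinning, there's no syscall on the hot path; the handoff cost is just the cache-line transfer described below.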
The actual (one-way cross-core) latency on modern CPUs varies by quite a lot [0], but a good rule of thumb is 100ns + 0.1ns per byte.
This measures the time for core A to write one or more cache lines to a shared memory region, and core B to read them. The latency is determined by the time it takes for the cache coherence protocol to transfer the cache lines between cores, which shows up as a number of L3 cache misses.
Interestingly, at the hardware level, in-process vs inter-process is irrelevant. What matters is the physical location of the cores which are communicating. This repo has some great visualizations and latency numbers for many different CPUs, as well as a benchmark you can run yourself:
[0] https://github.com/nviennot/core-to-core-latency
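A crude single-JVM way to see that number yourself (my sketch, not from that repo; Java can't pin threads on its own, so pin the process with taskset and treat the result as indicative):

    import java.util.concurrent.atomic.AtomicLong;

    public class PingPong {
        // Two threads bounce a counter through one cache line.
        public static void main(String[] args) throws InterruptedException {
            final AtomicLong flag = new AtomicLong(0);
            final int iters = 1_000_000;

            Thread ponger = new Thread(() -> {
                for (long i = 1; i <= iters; i += 2) {
                    while (flag.get() != i) { }    // spin until pinged
                    flag.set(i + 1);               // pong
                }
            });
            ponger.start();

            long start = System.nanoTime();
            for (long i = 0; i < iters; i += 2) {
                flag.set(i + 1);                   // ping
                while (flag.get() != i + 2) { }    // spin until ponged
            }
            long elapsed = System.nanoTime() - start;
            ponger.join();
            // Every flag transition is roughly one cross-core cache-line handoff.
            System.out.printf("~%d ns per one-way handoff%n", elapsed / iters);
        }
    }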
It may or may not be good, depending on a number of factors.
I did read the original Linux zero-copy papers from Google, for example, and at the time (when using TCP) the juice was worth the squeeze when the payload was larger than 10 kilobytes (or 20? I don't remember right now and I'm on mobile).
Also, a common technique is batching, so you amortise the round-trip cost (this used to be the point of sendmmsg/recvmmsg) over, say, 10 payloads.
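The JDK doesn't expose sendmmsg, but the gathering-write flavor of the same trick looks like this (illustrative sketch):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;

    // One syscall flushes the whole batch instead of one write() per payload,
    // so the fixed per-call cost is paid once per batch of ~10.
    static void sendBatch(SocketChannel ch, ByteBuffer[] batch) throws IOException {
        long remaining = 0;
        for (ByteBuffer b : batch) remaining += b.remaining();
        while (remaining > 0) {
            remaining -= ch.write(batch);    // gathering write across all buffers
        }
    }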
So yeah that number alone can mean a lot or it can mean very little.
In my experience people that are doing low latency stuff already built their own thing around msg_zerocopy, io_uring and stuff :)
io_uring is a tool for maximizing throughput, not minimizing latency. So the correct measure is transactions per millisecond, not milliseconds per transaction.
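Back-of-the-envelope with the 27us figure from this thread, via Little's law (concurrency = throughput x latency):

    1 request in flight at 27us  ->  ~37k transactions/s ceiling
    64 SQEs in flight at 27us    ->  ~2.4M transactions/s ceiling

That's the sense in which io_uring buys throughput rather than latency.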
Little’s Law applies when the task monopolizes the time of the worker. When it is alternating between IO and compute, it can be off by a factor of two or more. And when it’s only considering IO, things get more muddled still.
It's not exactly local IPC. The roundtrip benchmark stat is for a TCP server-client ping/pong call with a 2 KB payload, though the TCP connection is over local loopback (127.0.0.1).
Source: https://github.com/mvp-express/myra-transport/blob/main/benc...
indeed, you can get a packet from one box to another in 1-2us
with io_uring? How? I tried everything in the book
It's not exactly local IPC. The roundtrip benchmark stat is for a TCP server-client ping/pong call with a 2 KB payload, though the TCP connection is over local loopback (127.0.0.1).
The payload is encoded using myra-codec FFM MemorySegment directly into a pre-registered buffer in an io_uring SQE on the server. Similarly, on the client side the CQE writes the encoded payload directly into a client-provided MemorySegment. The whole process saves a few syscalls. It is also zero-copy.
Source: https://github.com/mvp-express/myra-transport/blob/main/benc...
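For anyone who hasn't used the FFM API, the shape of that server-side encode step is roughly this (my illustration; the ring calls are hypothetical stand-ins, not myra's actual API):

    import java.lang.foreign.Arena;
    import java.lang.foreign.MemorySegment;
    import java.lang.foreign.ValueLayout;

    public class ZeroCopyEncodeSketch {
        public static void main(String[] args) {
            Arena arena = Arena.ofShared();
            MemorySegment buf = arena.allocate(2048, 64);   // 2 KB buffer, registered once
            // ring.registerBuffer(0, buf);                 // hypothetical binding call

            // "Encoding" lays fields straight into the registered segment; there is
            // no intermediate byte[] and nothing to copy before the SQE is submitted.
            buf.set(ValueLayout.JAVA_INT, 0, 42);                  // message id
            buf.set(ValueLayout.JAVA_LONG, 8, System.nanoTime());  // timestamp
            // ring.prepSendFixed(fd, buf, 16, 0);          // hypothetical: SQE references buffer 0
        }
    }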
P.S.: I had posted this as a reply to jeffrey but wasn't able to see it. Hence, reposting as a direct reply to the main post for visibility as well.
Disclaimer: I am the author of https://mvp.express. I would love feedback and critical suggestions/advice.
Thanks -RR
Pretty much what NateB said* - but that might leave you at "what's wrong with that? that's how I could get it done"
There's WAY too much content, way too many names, and stuff that feels subtly off. I'm 37, been on this site for 16 years. I'm assuming the target audience here is enterprise Java developers, which isn't my home, so I'm sure I'm missing some stuff that's idiomatic in that culture.
But the vast, vast amount of things that are completely unfamiliar tells me something else is going on and it's not good.
Like I bet this is f'ing cool, otherwise you wouldn't put in the effort to share it. But you're better off having something super brief** in a GitHub README than a pseudo-marketing site that's straining to fit a cool technical thing into the wrong template.
* https://news.ycombinator.com/item?id=46255661
** what you wrote is great! "The payload is encoded using myra-codec FFM MemorySegment directly into a pre-registered buffer in an io_uring SQE on the server. Similarly, on the client side the CQE writes the encoded payload directly into a client-provided MemorySegment. The whole process saves a few syscalls. It is also zero-copy." -- whereas the site looks like it wants to sell N different products with confusing flowcharts, but really, you're just geeked out, did something cool, and want to share the technical details. So it's designed for the wrong audience.
In my opinion, adding Kryo to the benchmark is somewhat disingenuous, as it does not require a message schema definition while MyraCodec/SBE/FlatBuffers do.
The only one that is schemaless and zero-copy is Apache Fory, which is missing from the benchmark.
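For context on the schema point: Kryo will serialize an arbitrary object with no schema compiled in advance, which is exactly the flexibility SBE/FlatBuffers trade away for speed. A sketch against Kryo 5.x:

    import com.esotericsoftware.kryo.Kryo;
    import com.esotericsoftware.kryo.io.Output;
    import java.io.ByteArrayOutputStream;

    public class KryoSketch {
        static class Trade {                      // plain class, no .proto/.fbs/.xml anywhere
            long id = 1;
            String symbol = "ACME";
            double price = 9.99;
        }

        public static void main(String[] args) {
            Kryo kryo = new Kryo();
            kryo.setRegistrationRequired(false);  // accept unregistered classes: fully schemaless
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (Output out = new Output(bytes)) {
                kryo.writeClassAndObject(out, new Trade());
            }
            System.out.println(bytes.size() + " bytes, no schema needed");
        }
    }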