IMO the important tradeoff here is that a few microseconds spent sanitizing memory save millions of dollars of headache when memory-unsafe languages fail (which happens regularly).
I agree. I almost feel like this should be a flag on `free`: if you pass in 1 or something as a second argument (or maybe via a `free_safe` function or something), it will automatically `memset` whatever it's freeing with 0's, and then do the normal freeing.
Alternatively, just make free do that by default, adding a fast_and_furious_free which doesn't do it, for the few hotspots where that tiny bit of performance is actually needed.
The default case should be the safe correct one, even if it “breaks” backward compatibility. Without it, we will forever be saddled with the design mistakes of the past.
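A minimal sketch of that idea as a wrapper rather than a new flag (the name free_safe and the explicit size parameter are made up here; memset_explicit is C23, and portable C has no way to ask the allocator for an allocation's size):

    #include <stdlib.h>
    #include <string.h>

    /* Sketch only. The caller passes the size because standard C
     * cannot query it (glibc's malloc_usable_size is an extension).
     * memset_explicit is C23; on older toolchains, explicit_bzero or
     * a volatile-pointer loop can stand in. */
    void free_safe(void *p, size_t n) {
        if (p == NULL)
            return;
        memset_explicit(p, 0, n);  /* zero it out...              */
        free(p);                   /* ...then do the normal freeing */
    }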
Non-deterministic latency is a drawback, but garbage collection is not inherently slower than manual memory management/reference counting/etc. Depending on the usage pattern it can be faster. It's a set of trade-offs
The author seems to be unaware that Mongo internally develops in a private repo and commits are published later to the public one with https://github.com/google/copybara. All of the confusion around dates is due to this.
I was definitely unaware. I suspected something like this may be up when I talked about the zero-review of the apparent PR "I’m not aware of Mongo’s public review practices". This is great to know though. Updating the piece now to mention this and explain the date discrepancy
In something like a database zeroing or poisoning on free is probably a good idea. (These days probably all allocators should do it by default.)
Allocators are an interesting place to focus on for security. Chris did amazing work there for Blink that eventually rolled out to all of Chromium. The docs are a fun read.
To the extent that any of this was ever true, it hasn’t been true for at least a decade. After the WiredTiger acquisition they really got their engineering shit together. You can argue it was several years too late but it did happen.
I got heavily burned pre-wiredtiger and swore to never use it again. Started a new job which uses it and it’s been… Painless, stable and fast with excellent support and good libraries. They did turn it around for sure.
A highly cited reason for using mongo is that people would rather not figure out a schema. (N=3/3 for “serious” orgs I know using mongo).
That sort of inclination to push off doing the right thing now, at the cost of a headache down the line, probably overlaps with "let's just make the db publicly exposed" instead of doing the work of setting up an internal network.
> A highly cited reason for using mongo is that people would rather not figure out a schema.
Which is such a cop out, because there is always a schema. The only questions are whether it is designed, documented, and where it's implemented. Mongo requires some very explicit schema decisions, otherwise performance will quickly degrade.
Fowler describes it as Implicit vs Explicit schema, which feels right.
Kleppmann chooses "schema-on-read" vs "schema-on-write" for the same concept, which I find harder to grasp mentally, but it describes when schema validation needs to occur.
There is a surprising amount of important data in various Mongo instances around the world. Particularly within high finance, with multi-TB setups sprouting up here and there.
I suspect that this is in part due to historical inertia and exposure to SecDB designs.[0] Financial instruments can be hideously complex and they certainly are ever-evolving, so I can imagine a fixed schema for an essentially constantly shifting time-series universe would be challenging. When financial institutions began to adopt the SecDB model, MongoDB was available as a high-volume, "schemaless" KV store, with a reasonably good scaling story.
Combine that with the relatively incestuous nature of finance (they tend to poach and hire from within their own ranks) and the average tenure of an engineer in one organisation being less than 4 years, and you have an osmotic process of spreading "this at least works in this type of environment" knowledge. Add the naturally risk-averse nature of finance[ß] and you can see how one successful early adoption will quickly proliferate across the industry.
ß: For an industry that loves to take financial risks - with other people's money of course, they're not stupid - the players in high finance are remarkably risk-averse when it comes to technology choices. Experimentation with something new and unknown carries a potentially unbounded downside with limited, slowly emerging upside.
I'd argue that there's a schema; it's just defined dynamically by the queries themselves. Given how much of the industry seems fine with dynamic typing in languages, it's always been weird to me how diehard people seem to be about this with databases. There have been plenty of legitimate reasons to be skeptical of mongodb over the years (especially in the early days), but this one really isn't any more of a big deal than using Python or JavaScript.
Yes there's a schema, but it's hard to maintain. You end up with 200 separate code locations rechecking that the data is in the expected shape. I've had to fix too many such messes at work after a project ground to a halt. Ironically, some people will do schemaless but use a statically typed lang for the regular backend code, which doesn't buy you much. I'd totally do dynamic there. But a DB schema is so little effort for the strong foundation it sets for your code.
Sometimes it comes from a misconception that your schema should never have to change as features are added, and so you need to cover all cases with 1-2 omni tables. Often named "node" and "edge."
> Ironically some people will do schemaless but use a statically typed lang for regular backend code, which doesn't buy you much. I'd totally do dynamic there.
I honestly feel like the opposite, at least if you're the only consumer of the data. I'd never really go out of my way to use a dynamically typed language, and I'm already going to have to do something to get the data into my own language's types; at that point, it doesn't really make a huge difference to me what format it used to be in. When there are a variety of clients being used, though, this logic might not apply.
The adage I always tell people is that in any successful system, the data will far outlive the code. People throw away front ends and middle layers all the time. This becomes so much harder to do if the schema is defined across a sprawling middle layer like you describe.
We just sit a data persistence service in front of mongo, and so we can enforce some controls for everything there if we need them, but quite often we don't.
It’s probably better to check what you’re working on than blindly assuming this thing you’ve gotten from somewhere is the right shape anyway.
As someone who has done a lot of Ruby coding I would say using a statically typed database is almost a must when using a dynamically typed language. The database enforces the data model and the Ruby code was mostly just glue on top of that data model.
That's fair, I could see an argument for "either the schema or the language needs to enforce the schema". It's not obvious to me that one of the two "only one of them does" models deserves much more criticism than the other, though.
It's possible you didn't intend it, but your parent comment definitely came off as snarky, so I don't think you should be surprised that people responded in kind. You're honestly doing it again with the "let's stop feeling attacked" bit; whether you mean it or not, your phrasing comes across as pretty patronizing, and overall combined with the apparent dislike of people disagreeing with you after the snark it comes across as passive-aggressive. In general it's not going to go over well if you dish out criticism but can't take it.
In any case, you quite literally said there was a "lack of schemas", and I disagreed with that characterization. I certainly didn't feel attacked by it; I just didn't think it was the most accurate way to view things from a technical perspective.
NoSQL is used for high availability of data at scale - iMessage famously uses it for message threads, EA famously uses it for gaming matchmaking.
What you do is have both SQL and NoSQL. The NoSQL is basically caches of resources for high availability. Imagine you are making a social media app... Yes of course you have a SQL database that stores all the data, but you maintain API caches of posts in NoSQL.
Why? This gets to some of your other black vs white insults: NoSQL is typically WAY FASTER than SQL. That's why you use it. It's way faster to read a JSON file from a hard drive than it is to query a SQL database, always has been. So why not use NoSQL for EVERYTHING? Well, because you have duplicated data everywhere since it's not relational, it's just giant caches essentially. You also will get slow queries when the documents get huge.
Anyway you need both. It's not an either/or thing. I cannot believe this many years later people do not know the purpose of SQL and NoSQL and do not understand that it is not a competition at all. You want both!
Because nobody uses mongo for the reasons you listed. They use redis, dynamo, scylla or any number of enriched KV stores.
Mongo has spent its entire existence pretending to be a SQL database by poorly reinventing everything you get for free in postgres or mysql or cockroach.
False. Mongo never pretended to be a SQL database. But some dimwits insisted on using it for transactions, for whatever reason, and so it got transactional support, way later in life, and only for non-sharded clusters in the initial release. People who know what they are doing have been using MongoDB for reliable horizontally-scalable document storage basically since 3.4. With proper complex indexing.
Scylla! Yes, it will store and fetch your simple data very quickly with very good operational characteristics. Not so good for complex querying and indexing.
Yeah fair, I was being a bit lazy here when writing my comment. I've used nosql professionally quite a bit, but always set up by others. When working on personal projects I reach for SQL first because I can throw something together and don't need ideal performance. You're absolutely right that they both have their place.
That being said, the question was genuine: because I don't keep up with the ecosystem, I don't know whether it's ever valid practice to have a NoSQL db exposed to the internet.
What they wrote was pretty benign. They just asked how common it is for Mongo to be exposed. You seem to have taken that as a completely different statement
I mean they said it's rarely used when in fact it's widely used by some of the world's biggest companies at the highest scale the internet knows. The other guy had a harsher comment sure, maybe I should duplicate my reply to them, but who knows what kinds of rules that breaks on this site lmao Happy Christmas & New Year buddy!
It could be because when you leave an SQL server exposed it often turns into much worse things. For example, without additional configuration, PostgreSQL will default into a configuration that can own the entire host machine. There is probably some obscure feature that allows system process management, uploading a shell script or something else that isn't disabled by default.
The end result is "everyone" kind of knows that if you put a PostgreSQL instance up publicly facing without a password or with a weak/default password, it will be popped in minutes and you'll find out about it because the attackers are lazy and just running crypto-mine malware, etc.
No one. If you aren't in the administration's good graces and something shitty happens unrelated to you, you've put a target on your back as suspect #1.
I'm still thinking about the hypothetical optimism brought by the OWASP Top 10, hoping that the major flaws would be solved, when buffer overflow has been there since the beginning... in 2003.
I mean, give everyone footguns and you'll find that this is unavoidable, forever. Thoughts and prayers to the Mongo devs until we migrate to a language that prevents this error.
Evidence of no exploitations? It's usually hard to prove a negative, except when you have all the logs at your fingertips you can sift through. Unless they don't, of course. In which case the point stands: they don't actually know at this point in time, if they can even know about it at all.
Specifically, it looks like the exfiltration primitive relies on errors being emitted, and those errors are what leak the data. They're also rather characteristic. One wouldn't reasonably expect MongoDB to hold onto all raw traffic data flowing in and out, but would absolutely expect them to have the error logs, at least for some time back.
I feel like that's an issue not with what they said, but with what they did. It would be better for them to have checked this quickly, but it would have been worse for them to claim they had when they hadn't. What you're saying isn't wrong, but it's not really an answer to the question you're replying to.
> "No evidence of exploitation” is a pretty bog standard report
It is standard, yes. The problem with it as a statement is that it's true even if you've collected exactly zero evidence. I can say I don't have evidence of anyone being exploited, and it's definitely true.
It's not really my bar, I just explored this on behalf of the person you were replying to because I found it mildly interesting.
It is also a pretty standard response indeed. But now that it was highlighted, maybe it does deserve some scrutiny? Or is saying silly, possibly misleading things okay if that's what everyone has always been doing?
Is it true that Ubisoft got hacked and 900GB of data from their database was leaked due to Mongobleed? I'm seeing a lot of posts on social media under the #ubisoft tags today. Can someone on HN confirm?
Details are still emerging; an update in the last hour was that at least 5 different hacking groups were in Ubisoft's systems, and yeah, some might have gotten in via bribes rather than MongoDB: https://x.com/vxunderground/status/2005483271065387461
Almost always when you hear about emails or payment info leaking (or when Twitter stored passwords in plaintext lol), it's from logs. And a lot of times logs are in NoSQL, because the data is only ever needed in that same JSON format and in a very highly available way (all you Heroku users tailing logs all day, yw). And then almost nobody encrypts phone numbers, emails, etc. when those end up in logs.
There's basically no security around logs actually. They're just like snapshots of the backend data being sent around and nobody ever cares about it.
Anyway it has nothing to do with the choice to use NoSQL, it has more to do with how neglected security is around it.
Btw, in case you are wondering: in both the Twitter plaintext-password case and the Rainbow Six Siege data leak you mention, it was logs that leaked. NoSQL-backed logs, sure, but it's more about the data security around logging IMO.
If it is, it's less fluffy and empty than most of the LLM prose we're usually fed. It's well explained and has enough detail to not be overwhelming.
Honestly, aside from the "<emoji> impact" section that really has an LLM smell (but remember that some people legitimately do this, which is how it got into the LLM training corpus), this feels more like LLM-assisted writing (translated? reworded? grammar-checked?) than a pure "explain this" prompt.
I did some research with it, and used it to help create the ASCII art a bit. That's about it.
I was afraid that adding the emoji would trigger someone to think it's AI.
In any case, nowadays I basically always get at least one comment calling me an AI on a post that's relatively popular. I assume it's more a sign of the times than the writing...
Thank you for the clarification! I'm sorry for engaging in the LLM hunt, I don't usually do. Please keep writing, this was a really good breakdown!
In hindsight, I would not even have thought about it if not for the comment I replied to. LLM prose fails to keep me reading whole paragraphs; I find myself skipping roughly the second half of every one, which was definitely not the case for your article. I did somewhat skip at the emoji heading, not because of LLMs, but because of a saturation of emojis in some contexts that don't really need them.
I should have written "this could be LLM-assisted" instead of "this more feels like LLM-assisted", but, well, words.
Again, sorry, don't get discouraged by the LLM witch hunt.
I’m about ready to start flagging every comment that complains about the source material being LLM-generated. It’s tiresome, pointless, and adds absolutely nothing useful to the discussion.
If the material is wrong, explain why. Otherwise, shut up.
A few years back I patched the memory allocator used by the Cloudflare Workers runtime to overwrite all memory with a static byte pattern on free, so that uninitialized allocations contain nothing interesting.
We expected this to hurt performance, but we were unable to measure any impact in practice.
Everyone still working in memory-unsafe languages should really just do this IMO. It would have mitigated this Mongo bug.
> OpenBSD uses 0xdb to fill newly allocated memory and 0xdf to fill memory upon being freed. This helps developers catch "use-before-initialization" (seeing 0xdb) and "use-after-free" (seeing 0xdf) bugs quickly.
Looks like this is the default in OpenBSD.
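To make the "catch bugs quickly" part concrete, here's a deliberately buggy sketch (what you actually observe depends on the allocator junking freed memory, as OpenBSD's malloc does):

    #include <stdlib.h>

    struct node { struct node *next; };

    int main(void) {
        struct node *n = malloc(sizeof *n);
        if (!n)
            return 1;
        n->next = NULL;
        free(n);
        /* Deliberate use-after-free: with free-junking enabled, the
         * freed memory reads back as 0xdf bytes, so this load yields
         * the instantly recognizable pointer 0xdfdfdfdfdfdfdfdf
         * instead of plausible-looking stale data. */
        return n->next != NULL;
    }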
I like this. The only information leaking is whether the memory range was previously used. I suppose you may want to control for that; I'd be surprised if OpenBSD didn't provide a flag to just set freed memory to the same value as never-allocated memory.
This makes me curious. This bit of information – knowing whether the memory range was previously used or not – how could it be exploited?
Recent macOS versions zero out memory on free, which improves the efficacy of memory compression. Apparently it’s a net performance gain in the average case
I wonder if Apple Silicon has hardware acceleration for memory zeroing... Knowing Apple, I wouldn't be surprised.
ARM in general does, or at least some modern variants. Various docs for Android and LLVM suggest it's part of the Memory Tagging Extension.
> A few years back I patched the memory allocator used by the Cloudflare Workers runtime to overwrite all memory with a static byte pattern on free, so that uninitialized allocations contain nothing interesting.
Note that many malloc implementations will do this for you given an appropriate environment, e.g. setting MALLOC_CONF to opt.junk=free will do this on FreeBSD.
FYI, at least in C/C++, the compiler is free to throw away assignments to any memory pointed to by a pointer if said pointer is about to be passed to free(), so depending on how you did this, the lack of perf impact could be because your compiler removed the assignment. This will even affect a call to memset().
see here: https://godbolt.org/z/rMa8MbYox
I patched the free() implementation itself, not the code that calls free().
I did, of course, test it, and anyway we now run into the "freed memory" pattern regularly when debugging (yes including optimized builds), so it's definitely working.
However, if you recast to volatile, the compiler will keep it:
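(The snippet itself didn't survive the thread; the following is a reconstruction based on the replies below, which name memset_v:)

    #include <string.h>

    /* Take memset's address through a volatile function pointer. The
     * compiler can no longer assume which function the call reaches,
     * so it cannot treat the store as removable. */
    static void *(* volatile memset_v)(void *, int, size_t) = memset;

    void wipe(void *p, size_t n) {
        memset_v(p, 0, n);
    }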
That code is not guaranteed to work. Declaring memset_v as volatile means that the variable has to be read, but does not imply that the function must be called; the compiler is free to compile the function call as "tmp = memset_v; if (tmp != memset) tmp(...)" relying on its knowledge that in the likely case of equality the call can be optimized away.
Whilst the C standard doesn't guarantee it, both LLVM and GCC _do_. They have implementation-defined that it will work, so are not free to optimise it away.
[0] https://llvm.org/docs/LangRef.html#llvm-memset-intrinsics
[1] https://gitweb.git.savannah.gnu.org/gitweb/?p=gnulib.git;a=b...
Relying on implementation behavior is the perfect way to introduce a hidden-in-plain-sight vulnerability.
Most C++ programs written before P0593R6 depended on implementation behaviour, and were graciously allowed to not be undefined behaviour just 5 years ago. C++ as a language standard is mostly irrelevant; what one should care about is what the compiler authors consider valid code.
Using pragmas, attributes and optimisation guarantees is the point of implementation-defined behaviour in the first place.
The Linux kernel extensively uses gcc extensions. That doesn't inherently make it insecure.
You have to rely on implementation for anything to do with what happens to memory after it is freed, or really almost anything to do with actual bytes in RAM.
Yeah the C committee is wrong here
I don't see why?
The C committee gave you memset_explicit. But note that there is still no guarantee that information cannot leak. This is generally a very hard problem, as information can leak in many different ways; it may have been copied by the compiler. Fully memory-safe languages (so "Safe Rust", but not necessarily real-world Rust) would offer a bit more protection by default, but then there are still side-channel issues.
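For reference, a sketch of the contrast (memset_explicit is C23 and not yet available everywhere):

    #include <string.h>

    /* A plain memset may be treated as a dead store once the
     * optimizer can see that nothing reads the buffer afterwards
     * (e.g. it is about to be freed or go out of scope). */
    void scrub_maybe(char *buf, size_t n) {
        memset(buf, 0, n);
    }

    /* memset_explicit (C23) is specified so that the store is always
     * performed, regardless of optimization. */
    void scrub_always(char *buf, size_t n) {
        memset_explicit(buf, 0, n);
    }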
Because, for the 1384th time, they're pretending they can ignore what the programmer explicitly told them to do
Creating memset_explicit won't fix existing code. "Oh but what if maybe" is just cope.
If I do memset then free then that's what I want to do
And the way things go I won't be surprised if they break memset_explicit for some other BS reason and then make you use memset_explicit_you_really_mean_it_this_time
Your problem is not the C committee but your lack of understanding of how optimizing compilers work. WG14 could, of course, specify that a compiler has to do exactly what you tell it to do. And in fact, every compiler supports this already, in most cases even by default: just do not turn on optimization. But this is not what most people want.
Once you accept that optimizing compilers do, well, optimizations, the question is what should be allowed and what not. Inlining "memset" and eliminating dead stores are both simply optimizations which people generally want.
If you want a store not to be eliminated by a compiler, you can make it volatile. The C standard says this cannot be deleted by optimizations. The criticism of this was that later undefined behavior could "undo" it by "travelling in time". We made it clear in ISO C23 that this is not allowed (and I believe it never was), against protests from some compiler folks. Compilers still do not fully conform to this, which shows the limited power WG14 has to change reality.
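The volatile escape hatch described above, as a sketch:

    #include <stddef.h>

    /* Stores through a volatile-qualified pointer are observable
     * behavior in C, so the optimizer must keep every one of them
     * (the tradeoff: byte-at-a-time stores, no free vectorization). */
    void scrub(void *p, size_t n) {
        volatile unsigned char *vp = p;
        while (n--)
            *vp++ = 0;
    }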
Nope, it is the C committee.
> Once you accept that optimizing compilers do, well, optimizations
Why in tarnation is it optimizing out a write through a pointer right before a function call that takes said pointer? Imagine it is any other function besides free; see how ridiculous that sounds?
It's been many years since C compilers started making pathological-but-technically-justifiable optimizations that work against the programmer. The problem is the vast sea of "undefined behavior" — if you are not a fully qualified language lawyer versed in every nook and cranny of the C standard, prepare to be surprised.
Many of us who don't like working under such conditions have just moved on to other languages.
I agree that compilers were too aggressive in exploiting UB, but this is not the topic of this thread which has nothing to do with UB. But also the situation with UB is in practice not too bad. While compilers broke some old code which caused frustration, when writing new code most UB can easily be dealt with in practice by following some basic ground rules (e.g. no unsafe casts, being careful with pointer arithmetic) and by activating some compiler flags. It is not anything that should cause much trouble when programming in C.
Because it is a dead store. Removing dead stores does not sound ridiculous to me and neither is it to anybody using an optimizing compiler in the last decades.
Tree shaking is pretty standard. Optimising out the write sounds fine to me, with the exception of a volatile pointer. That, there, is a mistake.
Optimizing out a write to (example) an array on the stack seems fine to me.
Optimizing out a function call to a heap pointer (especially memset) seems wrong to me. You called the function, it should call the function!
But again, this is the C language not wearing a seatbelt or checking the tire pressure to save 10 seconds on a 2-hour trip.
The whole point of the optimizer is that it can detect inefficiencies by treating every statement as some combination of simple, fundamental operations. The compiler is not seeing "call memset() on pointer to heap", it's seeing "write of variable size" just before "deallocation". For some, optimizing that will be a problem, for others, not optimizing it will leave performance on the table.
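Concretely, this is the pattern at issue; optimizing compilers may drop the memset entirely (compare the godbolt link upthread):

    #include <stdlib.h>
    #include <string.h>

    /* The buffer is deallocated immediately after the memset, so no
     * conforming reader can ever observe the zeros; the store is
     * therefore removable as dead. */
    void discard(char *secret, size_t n) {
        memset(secret, 0, n);
        free(secret);
    }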
There are still ways to obtain the desired behavior. Just put a call to a DLL or SO that implements what you need. The compiler cannot inspect the behavior of functions across module boundaries, so it cannot tell whether removing the call preserves semantics or not (for example, it could be that the external function sends the contents of the buffer to a file), so it will not remove it.
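A sketch of that approach (wipe_extern is a hypothetical name; its definition would live in a separately compiled shared object):

    #include <stdlib.h>

    /* Declaration only: the definition is in another module. Without
     * LTO, the compiler must assume the call has arbitrary side
     * effects and cannot remove it. */
    void wipe_extern(void *p, size_t n);

    void destroy(void *p, size_t n) {
        wipe_extern(p, n);  /* opaque call: survives optimization */
        free(p);
    }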
A modern compiler may also completely remove malloc / free pairs and move the computation to the stack. And I do not see what this has to do with C, it should be the same for most languages. C gives you tools to express low-level intent such as "volatile", but one has to use them.
Strong disagree. In C, malloc and free are functions, and I expect no magic to happen when calling a function. If malloc and free were keywords like sizeof, it would have been different.
Your problem is that you're treating words such as "function" and "call" as if they had meaning outside of the language itself (or, more specifically, outside of the C abstract machine), when the point of the compiler is precisely to melt away the language parts of the specified program and be left with a concrete program that matches its behavior. If you view a binary in a disassembler, you will not find any "functions" or "calls". Maybe that particular architecture happens to have a "call" instruction to jump to "functions", but these words are merely homophones with what C refers to as "functions" and "calls".
When you "call" a "function" in the source you're not specifying to the compiler that you want a specific opcode in the generated executable, you're merely specifying a particular observable behavior. This is why optimizations such as inlining and TCO are valid. If the compiler can prove that a heap allocation can be turned into a stack allocation, or even removed altogether (e.g. free(malloc(1ULL << 50))), the fact that these are exposed to the programmer as "functions" he can "call" poses no obstacle.
Closest to what you say that I can find is 5.1.2.3 §4 of N3096:

> In the abstract machine, all expressions are evaluated as specified by the semantics. An actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced (including any caused by calling a function or through volatile access to an object)
Problem is, calling an external library function has a needed side effect of calling that library function. I do not see language that allows simply not doing that, based on assumed but unknown function behaviour.

The behavior of the standard functions is not unknown; it is at least partially specified. If a user overrides them under the mistaken assumption that a call in source translates in a 1-to-1 correspondence to a call in the binary, that's their problem.
You should read "7.1.4 1 Use of library functions". Also "calling a function" is not a side effect.
Thanks, I did read it! Things like footnote 236: "This means that an implementation is required to provide an actual function for each library function, even if it also provides a macro for that function", where the macro is shown to use a compiler builtin as an example.
Again, could you please explain how a compiler can decide to remove a call to a function in an external dynamically loaded library that is not known at compile time, simply based on the name of the function (i.e. not because the call is unreachable)? I do not see any such language in the standard.
And yes, calling an unknown function from a dynamically loaded library totally is a side effect.
> Again, could you please explain how a compiler can decide to remove a call to a function in an external dynamically loaded library that is not known at compile time, simply based on the name of the function (i.e. not because the call is unreachable)? I do not see any such language in the standard.
> And yes, calling an unknown function from a dynamically loaded library totally is a side effect.
The thing is that malloc/free aren't "unknown function[s]". From the C89 standard:
> All external identifiers declared in any of the headers are reserved, whether or not the associated header is included.
And from the C23 standard:
> All identifiers with external linkage in any of the following subclauses (including the future library directions) and errno are always reserved for use as identifiers with external linkage
malloc/free are defined in <stdlib.h> and so are reserved names, so compilers are able to optimize under the assumption that malloc/free will have the semantics dictated by the standard.
In fact, the C23 standard explicitly provides an example of this kind of thing:
> Because external identifiers and some macro names beginning with an underscore are reserved, implementations can provide special semantics for such names. For example, the identifier _BUILTIN_abs could be used to indicate generation of in-line code for the abs function. Thus, the appropriate header could specify
    #define abs(x) _BUILTIN_abs(x)

> for a compiler whose code generator will accept it.

In C, malloc and free often used to be macros.
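In that spirit, a hypothetical libc header excerpt (illustrative only; real headers differ):

    #include <stddef.h>

    /* A real function must still exist (7.1.4 requires it), but a
     * macro may route ordinary calls to a compiler builtin such as
     * GCC/Clang's __builtin_malloc. */
    void *malloc(size_t size);                    /* the actual function */
    #define malloc(size) __builtin_malloc(size)  /* builtin dispatch    */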
As they're freely replaceable through loading, and designed for that, I would strongly suggest that they are among the most magical areas of the C standard.
We get a whole section for those in the standard: 7.24.3 Memory management functions
Hell, malloc is allowed to return you _less than you asked for_:
> The pointer returned if the allocation succeeds is suitably aligned so that it may be assigned to a pointer to any type of object with a fundamental alignment requirement and size less than or equal to the size requested
I read the text as saying the object size can be less than or equal to the returned memory size. Anyway, section 7 is library. As you say, replacing through loading is a common thing to do; surely the compiler is not free to simply elide an external library function at will? This is not C++, after all; it must be sensible.
If the function is equivalent to a no-op, and not explicitly marked volatile for side effects, the compiler absolutely can elide it. If there is a side effect in hardware or wider systems like the OS, then it must be marked volatile. If the code is just code, then a function call that does effectively nothing will probably become nothing.
That was one of the first optimisations we had, back with Fortran and COBOL, before C existed; and as B started life as a stripped-down Fortran compiler, the history carried through.
The K&R book describes the buddy system for malloc, and how its design makes it suitable for compiler optimisations - including ignoring a write to a pointer that does nothing, because the pointer will no longer be valid.
Where exactly does K&R specify the buddy system?
You are literally scaring me now. I'd understand such things being done when statically linking or running a JIT, but for a "normal" program, which implementation malloc() will link against is not known during compilation. How can the compiler go, like, "eh, I'll assume free(malloc(x)) is a NOP and drop it" and not break most existing code?
> but for "normal" program which function implementation malloc() will link against is not known during compilation. How can compiler go, like, "eh, I'll assume free(malloc(x)) is NOP and drop it" and not break most existing code?
I'd suspect that eliding suitable malloc/free pairs would not break most existing code because most existing code simply does not depend on malloc/free doing anything other than and/or beyond what the C standard requires.
How would you propose that eliding free(malloc(x)) would break "most" existing code, anyways?
As an example, user kentonv wrote: "I patched the memory allocator used by the Cloudflare Workers runtime to overwrite all memory with a static byte pattern on free". And compiler would, like, "nah, let's leave all that data on stack".
Or somebody would try to plug in mimalloc/jemalloc or a debug allocator and wonder what's going on.
>As an example, user kentonv wrote: "I patched the memory allocator used by the Cloudflare Workers runtime to overwrite all memory with a static byte pattern on free". And compiler would, like, "nah, let's leave all that data on stack".
Such a program would continue to function as normal; the dirty data would just be left on the stack. If the developer wants to clear that data too, they'd just have to modify the compiler to overwrite the stack just before (or just after) moving the stack pointer.
>Or somebody would try to plug in mimalloc/jemalloc or a debug allocator and wonder what's going on.
Again, that wouldn't be broken. They would see that no dynamic allocations were performed during that particular section. Which would be correct.
I'm a bit skeptical either example is representative of "most" existing software. If anything, the mere existence of __builtin_malloc and its default use should hint that most existing software doesn't care about malloc/free actually being called. That being said...
> As an example, user kentonv wrote: "I patched the memory allocator used by the Cloudflare Workers runtime to overwrite all memory with a static byte pattern on free". And compiler would, like, "nah, let's leave all that data on stack".
Strictly speaking, I don't think eliding malloc/free would "break" those programs because that behavior is there for security if/when something else goes wrong, not as part of the software's regular intended functionality (or at least I sure hope nothing relies on that behavior for proper functioning!).
> Or somebody would try to plug in mimalloc/jemalloc [] and wonder what's going on.
Why would mimalloc/jemalloc/some other general-purpose allocator care that it doesn't have to execute a matching malloc/free pair any more than the default allocator?
I'm not sure debug allocators would care either? If you're trying to debug mismatched malloc/free pairs then the ones the compiler elides are the ones you don't care about anyways since those are the ones that can be statically proven to be "self-contained" and/or correct. If you're gathering statistics then you probably care more about the malloc/free calls that do occur (i.e., the ones that can't be elided), not those that don't.
In any case, if you want to use a malloc/free implementation that promises more than the C standard does (e.g., special byte pattern on free, statistics/debug info tracking, etc.) there's always -fno-builtin-malloc (or memset_explicit if you're lucky enough to be using C23). Of course, the tradeoff is that you give up some potential performance.
Thank you for putting it in a much more correct and understandable language than I could. That is exactly what I am talking about: if you call __builtin_malloc (e.g. via macro definition in the libc header), the compiler is free to do whatever it wants. However, calling the "malloc" library function should call the "malloc" library function, and anything else is unacceptable and a bug. There should be no case where the compiler could assume anything about a function it does not see, based simply on its name. Neither malloc nor strlen.
> That is exactly what I am talking about: if you call __builtin_malloc (e.g. via macro definition in the libc header), compiler is free to do whatever it wants. However, calling "malloc" library function should call "malloc" library function, and anything else is unacceptable and a bug.
I think that's an overly narrow reading of the footnote. I don't see an obvious reason why "such names" in the footnote should only cover "some macro names beginning with an underscore" and not also "external identifiers". And if implementations are allowed to define special semantics for "external identifiers", then... well, that's exactly what they did!
In addition, there's still the as-if rule. The semantics of malloc/free are defined by the C standard; if the compiler can deduce that there is no observable difference between a version of the program that calls them and a version that does not, why does it matter whether the call is emitted? A function call in and of itself is not a side effect, and since the C standard dictates what malloc/free do, the compiler knows their possible side effects.
Furthermore, the addition of memset_explicit and its footnote ("The intention is that the memory store is always performed (i.e. never elided), regardless of optimizations. This is in contrast to calls to the memset function (7.26.6.1)") implies that eliding calls is in fact acceptable behavior when optimizations are enabled. If eliding calls were not permissible when optimizing then what's the point of memset_explicit?
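To make the as-if point concrete, a hypothetical example (not from any real codebase): compilers like Clang, and often GCC, at -O2 can compile this whole function down to "return 42", emitting no call to malloc or free at all.

    #include <stdlib.h>

    int f(void) {
        int *p = malloc(sizeof *p); /* allocation has no observable effect */
        if (p == NULL) return 0;
        *p = 42;
        int v = *p;
        free(p);                    /* the whole malloc/free pair can be elided */
        return v;
    }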
> There should be no case where the compiler can assume anything about a function it does not see based simply on its name.
Again, external identifiers defined by the C standard are reserved. Reserved external identifiers aren't just for show. From the C89 standard:
> If the program defines an external identifier with the same name as a reserved external identifier, even in a semantically equivalent form, the behavior is undefined.
And from C23:
> If the program declares or defines an identifier in a context in which it is reserved (other than as allowed by 7.1.4), the behavior is undefined.
This means that yes, under modern compilers' interpretation of UB, compilers can assume things about functions based on their names, because modern compilers generally optimize assuming UB does not happen. The compiler does not need to see the function's implementation because, as far as the standard is concerned, the compiler is the implementation.
Ah yes, N2625 "What we think we reserve". Basically any C program containing a variable or function named "top", "END", "strict", "member" and so on is non-conforming and subject to undefined behaviour, so they define "potentially reserved" identifiers, and as usual compiler vendors go and do the sane, right thing.
That paper isn't relevant here. From the paper (emphasis added):
> 7.1.3 Reserved Identifiers
> [snip]
> Macro names and identifiers with external linkage that are specified in the C standard library clauses.
> This proposal does not propose any changes to these reserved identifiers.
Furthermore, that paper doesn't make the use of reserved external identifiers not UB, so there's no change there either.
Newer versions of C++ (and C, apparently) have functions so that the cast isn't necessary ( https://en.cppreference.com/w/c/string/byte/memset.html ).
Zeroing memory should absolutely be the default behavior for any generic allocator in 2025.
If you need better performance, write your own allocator optimized for your specific use case — it's not that hard.
Besides, if you don't need to clear old allocations, there are likely other optimizations you'll be able to find which would never fly in a system allocator.
You know, I never even considered doing that, but it makes sense; whatever overhead is incurred by writing that static byte pattern is still almost certainly minuscule compared to the overhead of something like a garbage collector.
IMO the important tradeoff here is that a few microseconds of time spent sanitizing memory saves the millions of dollars of headache when memory-unsafe languages fail (which happens regularly).
I agree. I almost feel like this should be like a flag in `free`. Like if you pass in 1 or something as a second argument (or maybe a `free_safe` function or something), it will automatically `memset` whatever it's freeing with 0's, and then do the normal freeing.
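Roughly, a sketch of that hypothetical free_safe on glibc (malloc_usable_size and explicit_bzero are glibc extensions; as discussed upthread, a plain memset here could be optimized away as a dead store):

    #include <malloc.h>   /* malloc_usable_size (glibc extension) */
    #include <string.h>   /* explicit_bzero (glibc extension) */
    #include <stdlib.h>

    /* Hypothetical free_safe as proposed above: wipe, then free. */
    void free_safe(void *p) {
        if (p != NULL) {
            /* explicit_bzero, unlike memset, is guaranteed by glibc
               not to be elided by the compiler. */
            explicit_bzero(p, malloc_usable_size(p));
        }
        free(p); /* free(NULL) is a no-op, so this is safe either way */
    }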
Alternatively, just make free do that by default, adding a fast_and_furious_free which doesn't do it, for the few hotspots where that tiny bit of performance is actually needed.
The default case should be the safe correct one, even if it “breaks” backward compatibility. Without it, we will forever be saddled with the design mistakes of the past.
https://news.ycombinator.com/item?id=46417221
Non-deterministic latency is a drawback, but garbage collection is not inherently slower than manual memory management/reference counting/etc. Depending on the usage pattern it can be faster. It's a set of trade-offs
Is this the same as enabling `init_on_free=1` in the kernel?
The author seems to be unaware that Mongo internally develops in a private repo and commits are published later to the public one with https://github.com/google/copybara. All of the confusion around dates is due to this.
I was definitely unaware. I suspected something like this might be up when I wrote, regarding the zero-review of the apparent PR, "I’m not aware of Mongo’s public review practices". This is great to know though. Updating the piece now to mention this and explain the date discrepancy.
The author of this post is incorrect about the timeline. Our Atlas clusters were upgraded days before the CVE was announced.
thanks! updated
In something like a database zeroing or poisoning on free is probably a good idea. (These days probably all allocators should do it by default.)
Allocators are an interesting place to focus on for security. Chris did amazing work there for Blink that eventually rolled out to all of Chromium. The docs are a fun read.
https://blog.chromium.org/2021/04/efficient-and-safe-allocat...
https://chromium.googlesource.com/chromium/src/+/master/base...
How often are mongo instances exposed to the internet? I'm more of an SQL person, and for those I know it's pretty uncommon, but it does happen.
From my experience, Mongo DB's entire raison d'etre is "laziness".
* Don't worry about a schema.
* Don't worry about persistence or durability.
* Don't worry about reads or writes.
* Don't worry about connectivity.
This is basically the entire philosophy, so it's not surprising at all that users would also not worry about basic security.
To the extent that any of this was ever true, it hasn’t been true for at least a decade. After the WiredTiger acquisition they really got their engineering shit together. You can argue it was several years too late but it did happen.
I got heavily burned pre-wiredtiger and swore to never use it again. Started a new job which uses it and it’s been… Painless, stable and fast with excellent support and good libraries. They did turn it around for sure.
Although interestingly, for all the mongo deployments I've managed, the first time I saw a cluster publicly exposed without SSL, it was postgres :)
Not only that, but authentication is much harder than it needs to be to set up (and is off by default).
I'm sure there are publicly exposed MySQLs too
There are many more exposed MySQLs than MongoDBs:
https://www.shodan.io/search?query=mongodb https://www.shodan.io/search?query=mysql https://www.shodan.io/search?query=postgresql
But this must be proportional to the overall popularity.
Most of your points are wrong. Maybe only the first one is valid-ish.
Ultimate webscale!
A highly cited reason for using mongo is that people would rather not figure out a schema. (N=3/3 for “serious” orgs I know using mongo).
That sort of inclination to push off doing the right thing now, rather than saving yourself a headache down the line, probably overlaps with “let’s just make the db publicly exposed” instead of doing the work of setting up an internal network.
> A highly cited reason for using mongo is that people would rather not figure out a schema.
Which is such a cop out, because there is always a schema. The only questions are whether it is designed, documented, and where it's implemented. Mongo requires some very explicit schema decisions, otherwise performance will quickly degrade.
Fowler describes it as Implicit vs Explicit schema, which feels right.
Kleppmann chooses "schema-on-read" vs "schema-on-write" for the same concept, which I find harder to grasp mentally, but it describes when schema validation needs to occur.
I would have hoped that there would be no important data in mongoDB.
But now we can at least rest assured that the important data in mongoDB is just very hard to read, given the lack of schemas.
Probably all of that nasty "schema" work and tech debt will finally be done by hackers trying to make use of that information.
There is a surprising amount of important data in various Mongo instances around the world. Particularly within high finance, with multi-TB setups sprouting up here and there.
I suspect that this is in part due to historical inertia and exposure to SecDB designs.[0] Financial instruments can be hideously complex and they are certainly ever-evolving, so I can imagine a fixed schema for an essentially constantly shifting time-series universe would be challenging. When financial institutions began to adopt the SecDB model, MongoDB was available as a high-volume, "schemaless" KV store with a reasonably good scaling story.
Combine that with the relatively incestuous nature of finance (they tend to poach and hire from within their own ranks) and the average tenure of an engineer in one organisation being less than 4 years, and you have an osmotic process of spreading "this at least works in this type of environment" knowledge. Add the naturally risk-averse nature of finance[ß] and you can see how one successful early adoption will quickly proliferate across the industry.
0: This was discussed at HN back in the day too: https://calpaterson.com/bank-python.html
ß: For an industry that loves to take financial risks - with other people's money of course, they're not stupid - the players in high finance are remarkably risk-averse when it comes to technology choices. Experimentation with something new and unknown carries a potentially unbounded downside with limited, slowly emerging upside.
I'd argue that there's a schema; it's just defined dynamically by the queries themselves. Given how much of the industry seems fine with dynamic typing in languages, it's always been weird to me how diehard people seem to be about this with databases. There have been plenty of legitimate reasons to be skeptical of mongodb over the years (especially in the early days), but this one really isn't any more of a big deal than using Python or JavaScript.
Yes there's a schema, but it's hard to maintain. You end up with 200 separate code locations rechecking that the data is in the expected shape. I've had to fix too many such messes at work after a project ground to a halt. Ironically some people will do schemaless but use a statically typed lang for regular backend code, which doesn't buy you much. I'd totally do dynamic there. But a DB schema is so little effort for the strong foundation it sets for your code.
Sometimes it comes from a misconception that your schema should never have to change as features are added, and so you need to cover all cases with 1-2 omni tables. Often named "node" and "edge."
> Ironically some people will do schemaless but use a statically typed lang for regular backend code, which doesn't buy you much. I'd totally do dynamic there.
I honestly feel the opposite, at least if you're the only consumer of the data. I'd never really go out of my way to use a dynamically typed language, and I'm already going to have to do something to get the data into my own language's types anyway, at which point it doesn't really make a huge difference what format it used to be in. When there are a variety of clients being used, though, this logic might not apply.
The adage I always tell people is that in any successful system, the data will far outlive the code. People throw away front ends and middle layers all the time. This becomes so much harder to do if the schema is defined across a sprawling middle layer like you describe.
We just sit a data persistence service in front of mongo, so we can enforce some controls for everything there if we need them, but quite often we don't.
It’s probably better to check what you’re working on than blindly assuming this thing you’ve gotten from somewhere is the right shape anyway.
As someone who has done a lot of Ruby coding, I would say using a statically typed database is almost a must when using a dynamically typed language. The database enforces the data model, and the Ruby code was mostly just glue on top of that data model.
That's fair, I could see an argument for "either the schema or the language needs to enforce schema". It's not obvious to me that one of the two "only one of them does" models deserves much more criticism than the other, though.
What's weird to me is when dynamic typers don't acknowledge the tradeoff of quality vs upfront work.
I never said mongodb was wrong in that post, I just said it accumulated tech debt.
Let's stop feeling attacked over the negatives of tradeoffs
It's possible you didn't intend it, but your parent comment definitely came off as snarky, so I don't think you should be surprised that people responded in kind. You're honestly doing it again with the "let's stop feeling attacked" bit; whether you mean it or not, your phrasing comes across as pretty patronizing, and overall combined with the apparent dislike of people disagreeing with you after the snark it comes across as passive-aggressive. In general it's not going to go over well if you dish out criticism but can't take it.
In any case, you quite literally said there was a "lack of schemas", and I disagreed with that characterization. I certainly didn't feel attacked by it; I just didn't think it was the most accurate way to view things from a technical perspective.
Whatever horrors there are with mongo, it's still better than the shitshow that is Zope's ZODB.
Are you guys serious with these takes?
You very often have both NoSQL and SQL at scale.
NoSQL is used for high availability of data at scale - iMessage famously uses it for message threads, EA famously uses it for gaming matchmaking.
What you do is have both SQL and NoSQL. The NoSQL is basically caches of resources for high availability. Imagine you are making a social media app... Yes of course you have a SQL database that stores all the data, but you maintain API caches of posts in NoSQL.
Why? This gets to some of your other black vs white insults: NoSQL is typically WAY FASTER than SQL. That's why you use it. It's way faster to read a JSON file from a hard drive than it is to query a SQL database, always has been. So why not use NoSQL for EVERYTHING? Well, because you have duplicated data everywhere since it's not relational, it's just giant caches essentially. You also will get slow queries when the documents get huge.
Anyway you need both. It's not an either/or thing. I cannot believe this many years later people do not know the purpose of SQL and NoSQL and do not understand that it is not a competition at all. You want both!
Because nobody uses mongo for the reasons you listed. They use redis, dynamo, scylla or any number of enriched KV stores.
Mongo has spent its entire existence pretending to be a SQL database by poorly reinventing everything you get for free in postgres or mysql or cockroach.
Redis and Dynamo are NoSQL, genius, and I said NoSQL the entire time.
False. Mongo never pretended to be a SQL database. But some dimwits insisted on using it for transactions, for whatever reason, and so it got transactional support way later in life, and only for non-sharded clusters in the initial release. People that know what they are doing have been using MongoDB for reliable, horizontally-scalable document storage basically since 3.4. With proper complex indexing.
Scylla! Yes, it will store and fetch your simple data very quickly with very good operational characteristics. Not so good for complex querying and indexing.
Yeah fair, I was being a bit lazy here when writing my comment. I've used nosql professionally quite a bit, but always set up by others. When working on personal projects I reach for SQL first because I can throw something together and don't need ideal performance. You're absolutely right that they both have their place.
That being said, the question was genuine: because I don't keep up with the ecosystem, I don't know if it's ever valid practice to have a nosql db exposed to the internet.
What they wrote was pretty benign. They just asked how common it is for Mongo to be exposed. You seem to have taken that as a completely different statement
I mean, they said it's rarely used when in fact it's widely used by some of the world's biggest companies at the highest scale the internet knows. The other guy had a harsher comment, sure, maybe I should duplicate my reply to them, but who knows what kinds of rules that breaks on this site, lmao. Happy Christmas & New Year, buddy!
They did not say it's rarely used at all.
The article links to a shodan scan reporting 213K exposed instances https://www.shodan.io/search?query=Product%3A%22MongoDB%22
It could be because when you leave an SQL server exposed it often turns into much worse things. For example, without additional configuration, PostgreSQL will default into a configuration that can own the entire host machine. There is probably some obscure feature that allows system process management, uploading a shell script or something else that isn't disabled by default.
The end result is "everyone" kind of knows that if you put a PostgreSQL instance up publicly facing without a password or with a weak/default password, it will be popped in minutes and you'll find out about it because the attackers are lazy and just running crypto-mine malware, etc.
My university has one exposed to the internet, and it's still not patched. Everyone is on holiday and I have no idea who to contact.
No one. If you aren't in the administration's good graces and something shitty happens that's unrelated to you, you've put a target on your back as suspect #1.
"Look at me. I'm the DBA now"
-JS devs after "Signing In With Facebook" to MongoDB Atlas
AKA me
Sorry guys, I broke it
For a long time, the default install had it binding to all interfaces and with authentication disabled.
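These days locking that down is a couple of settings in mongod.conf (standard options; the path and the rest of the config will vary per install):

    # /etc/mongod.conf
    net:
      bindIp: 127.0.0.1        # listen on localhost only, not all interfaces
    security:
      authorization: enabled   # require authentication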
Often. Lots of data leaks happened because of this. People spin it up in a cloud VM and forget it has a public IP all the time.
I'm still thinking about the optimism of the OWASP Top 10, with its hope that major flaws would eventually be solved, and the fact that buffer overflow has been on it since the beginning... in 2003.
I mean, give everyone footguns and you'll find this is unavoidable, forever. Thoughts and prayers to the Mongo devs until we migrate to a language that prevents this error.
> On Dec 24th, MongoDB reported they have no evidence of anybody exploiting the CVE
Absence of evidence is not evidence of absence...
What would you prefer them to say?
Evidence of no exploitation? It's usually hard to prove a negative, except when you have all the logs at your fingertips to sift through. Unless they don't, of course. In which case the point stands: they don't actually know at this point in time, if they can even know about it at all.
Specifically, it looks like the exfiltration primitive relies on errors being emitted, and those errors are what leak the data. They're also rather characteristic. One wouldn't reasonably expect MongoDB to hold onto all raw traffic data flowing in and out, but one would absolutely expect them to have the error logs, at least for some time back.
I feel like that's an issue not with what they said, but with what they did. It would have been better for them to have checked this quickly, but it would have been worse for them to say they had when they hadn't. What you're saying isn't wrong, but it's not really an answer to the question you're replying to.
“No evidence of exploitation” is a pretty bog standard report I think? Made on Christmas Eve no less.
Do other CVE reports come with stronger statements? I'm not sure they do. But maybe you can provide some counterexamples that meet your bar.
> "No evidence of exploitation” is a pretty bog standard report
It is standard, yes. The problem with it as a statement is that it's true even if you've collected exactly zero evidence. I can say I don't have evidence of anyone being exploited, and it's definitely true.
It's not really my bar, I just explored this on behalf of the person you were replying to because I found it mildly interesting.
It is also a pretty standard response indeed. But now that it was highlighted, maybe it does deserve some scrutiny? Or is saying silly, possibly misleading things okay if that's what everyone has always been doing?
Why is anyone using mongo for literally anything?
Easy replication. I suppose it's faster than Postgres's JSONB, too.
I would rather not use it, but I see that there are legitimate cases where MongoDB or DynamoDB is a technically appropriate choice.
because it is "web scale"
ref: https://www.youtube.com/watch?v=b2F-DItXtZs
Whenever anyone writes about mongodb or redis I hear it in that voice.
Right? When they came out, it was all about NoSQL, which then turned out to just mean key-value databases, of which there are plenty.
This is a nasty ad repositorium datorum argumentation which I cannot tolerate.
I laughed.
Every time someone posts about NoSQL a thousand "programmers" reveal they have never had to support a lot of traffic lol
Nah, this time it was just you.
Related:
MongoBleed
https://news.ycombinator.com/item?id=46394620
> In C/C++, this doesn’t happen. When you allocate memory via `malloc()`, you get whatever was previously there.
What would break if the compiler zero'd it first? Do programs rely on malloc() giving them the data that was there before?
That's what calloc() is for
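For illustration, a toy snippet of the difference:

    #include <stdlib.h>

    int main(void) {
        int *a = malloc(4 * sizeof *a); /* contents indeterminate: may hold
                                           stale bytes from earlier frees */
        int *b = calloc(4, sizeof *b);  /* guaranteed all-bits-zero */
        free(a);
        free(b);
        return 0;
    }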
It takes time to zero out memory.
Is it true that Ubisoft got hacked and 900GB of data from their database was leaked due to MongoBleed? I am seeing a lot of posts on social media under the #ubisoft tags today. Can someone on HN confirm?
I read that hack was made possible by Ubisoft’s support staff taking bribes.
Details are still emerging; an update in the last hour was that at least 5 different hacking groups were in Ubisoft's systems, and yeah, some might have got there via bribes rather than MongoDB https://x.com/vxunderground/status/2005483271065387461
I’ll give you $1000 to run Mongo.
TLDR: Blame logs not NoSQL.
Almost always, when you hear about emails or payment info leaking (or when Twitter stored passwords in plaintext lol), it's from logs. And a lot of the time logs are in NoSQL, because the data is only ever needed in that same JSON format and in a very highly available way (all you Heroku users tailing logs all day, yw), and then almost nobody encrypts phone numbers and emails etc. whenever those end up in logs.
There's basically no security around logs actually. They're just like snapshots of the backend data being sent around and nobody ever cares about it.
Anyway it has nothing to do with the choice to use NoSQL, it has more to do with how neglected security is around it.
Btw, in case you are wondering: both the Twitter plaintext password case and the Rainbow Six Siege data leak you mention were logs that leaked. NoSQL-backed logs, sure, but it's more about the data security around logging IMO.
MongoDB has always sucked... But it's webscale (sic)
Do yourself a favour, use ToroDB instead (or even straight PostgreSQL's JSONB).
This has many similarities to the Heartbleed vulnerability: it involves trusting lengths from an attacker, leading to unauthorized revelation of data.
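A hypothetical sketch of the shared bug class (not the actual MongoDB or OpenSSL code): a length field arrives in the attacker's message and is used without checking it against the number of bytes that actually arrived.

    #include <string.h>

    void echo_payload(const unsigned char *msg, size_t msg_len,
                      unsigned char *out) {
        /* Attacker-controlled length field from the message header. */
        size_t len = ((size_t)msg[0] << 8) | msg[1];
        /* BUG: missing check that 2 + len <= msg_len. */
        memcpy(out, msg + 2, len); /* can over-read adjacent heap memory
                                      and echo it back to the attacker */
    }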
Have all Atlas clusters been auto-updated with a fix?
Yes. Apparently before Dec 19 too.
"MongoBleed Explained by an LLM"
If it is, it's less fluffy and empty than most of the LLM prose we're usually fed. It's well explained and has enough detail to not be overwhelming.
Honestly, aside from the "<emoji> impact" section that really has an LLM smell (but remember that some people legitimately do this, since it's in the LLM training corpus), this feels more like LLM-assisted (translated? reworded? grammar-checked?) than a pure "explain this" prompt.
I didn't use AI in writing the post.
I did some research with it, and used it to help create the ASCII art a bit. That's about it.
I was afraid that adding the emoji would trigger someone to think it's AI.
In any case, nowadays I basically always get at least one comment calling me an AI on a post that's relatively popular. I assume it's more a sign of the times than the writing...
Thank you for the clarification! I'm sorry for engaging in the LLM hunt, I don't usually do that. Please keep writing, this was a really good breakdown!
In hindsight, I would not even have thought about it if not for the comment I replied to. LLM prose fails to keep me reading whole paragraphs; I find myself skipping roughly the second half of every one, which was definitely not the case for your article. I did skip a bit at the emoji heading, not because of LLMs, but because of a saturation of emojis in contexts that don't really need them.
I should have written "this could be LLM-assisted" instead of "this feels more like LLM-assisted", but, well, words.
Again, sorry, don't get discouraged by the LLM witch hunt.
I’m about ready to start flagging every comment that complains about the source material being LLM-generated. It’s tiresome, pointless, and adds absolutely nothing useful to the discussion.
If the material is wrong, explain why. Otherwise, shut up.
Though the source article was human written, the public exploit was developed with an LLM.
https://x.com/dez_/status/2004933531450179931