I searched for several Skia classes and it never found the actual Skia repo, just forks and references from unrelated repos. It also failed to find several classes entirely. Skia exists in GitHub as well as in Chromium CodeSearch so it should have come up at least twice.
As a sanity check, "fwrite" only has 8 references in the entire database.
Yeah, agreed, I think the migration didn't actually work.
Yeah, I just searched for “driver_register”, a call that would show up in a large number of Linux drivers in the open source Linux kernel, not to mention other public-facing repos, and it only returned two results, neither from the mainline Linux kernel repo.
I have been contemplating postgres hosted by AWS vs locally using SQLite with my Django app. I have low volume in my app and low concurrent traffic. I may have joins in the future since the data is highly relational.
I still chose postgres managed by AWS, mainly to reduce the operational overhead, though I keep wondering whether I should have just gone with sqlite3.
I host my stuff on railway and while I love SQLite, I usually go with their Postgres offering instead. It's actually less work for me and although I haven't tested this scientifically, it's hard to imagine the SQLite way would save me serious money.
I'd argue no relational db scales reads vertically better than SQLite. For writes, you can batch them or distribute them to attached dbs. But, either way, you may lose some transaction guarantees.
Reading without write contention is not a terribly difficult problem. You could use any database and it'd work fine. It's the mutations that distinguish db engines. Sqlite is indeed close to ideal but the comparison to other databases (at scale no less) is without substance.
Generally to scale reads appropriately (to >10k readers) you need things like connection pooling and effective replication to some form of readable secondaries (which can maintain transaction guarantees if you do synchronous writes, which... maybe not great) - I don't think SQLite has either of those.
Fascinating read. Those suggesting Mongo are missing the point.
The author clearly mentioned that they want to work in the relational space. Choosing Mongo would require refactoring a significant part of the codebase.
Also, databases like MySQL, Postgres, and SQLite have neat indexing features that Mongo still lacks.
The author wanted to move away from a client-server database.
And finally, while wanting a 6.4TB single binary is wild, I presume that’s exactly what’s happening here. You couldn’t do that with Mongo.
This comment appears to be the only place between article and thread where mongo getting recommended is even mentioned. Maybe stop giving them free press. I certainly haven't heard anyone seriously suggest them for about ten years now.
Edit: ok the other guy mentioning mongo is clearly being sarcastic
Presumably the database needs to be distributed to servers, too. The engine needs to access something. This is a necessity whether or not it's referred to as a binary.
Without reading the article, I always quickly estimate what a DRAM based in-memory database would cost. You build a mesh network of FPGA PCBs with 12 x 12.5 Gbps links interconnecting them. Each $100 FPGA has 25 GB/s DDR3 or DDR2 controllers with up to 256 GB. DDR3 DIMMs have been less than $1 per GB for many years.
6.4x($100+256x4x$1)=$7,193.6 worst case pricing
So this database would cost less than $8000 in DRAM hardware and could be searched/indexed in parallel in much less than a second.
Now you can buy old DDR2 chips harvested from ewaste at less than $1 per GB, so this database could cost as little as $1000 including the labour for harvesting.
You can make a wafer scale integration with SRAM that holds around 1 terabyte of SRAM for $30K including mask set costs. This database would cost you $210,000. Note that SRAM is much faster than DDR DRAM.
You would need 15000/7 customers with a database of this size to cover the manufacturing of the wafers. I'm sure there are more than 2143 customers that would buy such a database.
Please invest in my company, I need $450,000,000 up front to manufacture these 1 terabyte wafers in a single batch.
We will sprinkle in the million core small reconfigurable processors[1] in between the terabyte SRAM for free.
For the observant: we have around 50 trillion transistors[2] to play with on a 300mm wafer. Our secret sauce is a 2 transistor design making SRAM 5 times denser than Apple Silicon SRAM densities at 3nm on their M4.
Actually, we would not need reconfigurable processors; we would intersperse Morphle Logic in between the SRAM. Now you could reprogram the search logic to do video compression/decompression or encryption by reconfiguring the logic pattern[3]. Total search of each byte would take a few hundred nanoseconds.
The IRAM paper from 1997 mentions 180nm. These wafers cost $750 today (including mask set and profit margin), and now you would need less than a million in investment up front to start mass manufacturing. You would just need 3600 times the wafer amount compared to the latest 3nm transistor density on a wafer.
> 64000x($100+256x$4)=$7,193,600 worst case pricing
Not sure how you arrived at this calculation. 256x$4 already accounts for 256GB. The database in the OP is 6400GB large. So shouldn't it be 25x($100+256x$4) = $28100?
FWIW in practice this number should be much, MUCH lower. You can get 1TB of ECC DDR4 for ~$2k, probably lower if you buy wholesale.
You might be right, I did the math too quickly and can't fix the posting.
As they say, these 'back of the envelope' calculations only get you 'ballpark' estimates.
The $450 million is also an estimate that could be off by an order of a magnitude.
My argument still stands: in-memory databases on SRAM wafers are much faster and cheaper than customers (managers) realize; there's no need for database software.
In 1997 the IRAM paper was still thinking about 180nm DRAM chips; in 2025 we can do 3nm wafer scale integrations, so Moore's law gives us affordable terabyte in-memory databases that cost less than the programmer needed to write and manage a database.
In 1981 Dan Ingalls wrote[1] in Design Principles Behind Smalltalk: "An operating system is a collection of things that don't fit into a language. There shouldn't be one."
In 2025 I argue that "a database is a collection of tricks to shuffle data between compute and storage memory. There shouldn't be one."
I would go one step further and say a von Neumann architecture or Harvard architecture microprocessor is a collection of gates to compute in separated memory storage. There shouldn't be one.
I’ll bet you some CERN PhD student has a forgotten 100 TB detector calibration database in sqlite somewhere in the dead caverns of collaboration effort.
It could really happen.
It's an organization with an unpredictable return on investment; in practice, there are no real negative consequences if they waste public money or if the work turns out to be useless (unless it's too obvious to external people).
It's somewhat part of investing into experimental science.
I've been there (when it was 100GBs scale, 15 years ago); trust me, it can and does happen :)
Sabine Hossenfelder has recently released a video about it:
https://youtu.be/shFUDPqVmTg?si=xZAQZ725UEcO8lf_
Sabine really represents the “apathetic disillusioned” - no hope for fundamental physics. I left after we discovered the Higgs, based on similar observations; there was too much invested in groupthink. If you have to raise $50B to ask “what happens if…”, then the “if” has to be damned likely, or you have to share the risk with a large group.
My own conclusion, back then, was that the collider paradigm had run its course. Without economical tools, and no theory to guide experiments, the field was stuck. I’m not apathetic though, and believe there are ways for both theorists and experimentalists to break the stall.
Theorists could dig into the backlog of shortcuts and dirty tricks that underlies the machinery of QFT - are there other ways of probing the quantum fields? Experimentalists can at least get behind novel acceleration schemes like laser plasma wakefields, to reduce the massive capital risk of conducting model-free searches for new signatures, or, like the theorists, hunt for alternative ways to “excite the quantum fields”.
These may not be recipes, but as a community, particle physics has been way too focused on chasing resonances with ever larger machines.
I think that there are two problems here: one with physics and another with society. I'm somewhat familiar with the academic world, and what she (and the letter) say about career scientists who optimize for formal metrics without any regard for real scientific value strikes too close to home. They value complacency above all else, original ideas and challengers to incumbents are driven away, and nobody really cares whether studies replicate or data is massaged, whereas being even remotely ideologically impure kills your career instantly. I think society across the globe is slowly starting to realize this state of affairs, and the backlash against science as a whole will be enormous.
In the eyes of the taxpayers, there's not that much of a difference between particle physicists who have been spending billions of dollars on bigger and bigger colliders, and psychologists who popularized non-replicating studies based on a sample of 35 university students but refused to touch IQ (one of the most, if not the most, replicated concepts in psychology), or biologists who invented new species out of thin air based on the most minuscule differences between populations just to stop a development project they oppose for ideological reasons. All of this will cause severe backlash. The baby will be thrown out with the bathwater, and trust and funding for science as a whole will suffer for decades.
There's one big difference to taxpayers: the psychologists and biologists don't cost billions of dollars.
National funding agencies are fundamentally in the business of choosing "this, not that" because of the constraints of finite funding. Just like with VC math, it's difficult to imagine that the benefits of the billion-dollar collider outweigh the opportunity cost of investing in a large nation's worth of researchers exploring smaller, cheaper ideas.
That's too specific a message, with too much nuance, to try to get across to the general public. When trust in science goes down, it goes down as a whole.
The current US Presidential administration is in the process of throwing the entire bathroom out, tub and all. We're going to have substantially less reliable data over the next decade.
Thankfully, we have more than one country in the world, and we can compare outcomes when different countries pursue different strategies. Results may surprise. 2.5 years ago I, like the rest of HN, was sure that Twitter wouldn't be able to function a few months after the layoffs, but here we are.
I did not see many people betting real money on Twitter not running, did you? You have to filter out the momentary hype.
> The current US Presidential administration is in the process of throwing the entire bathroom out, tub and all.
Are they? Maybe. I have not seen any popularized document detailing their plans, and they are taking large steps that will have consequences. Since there is no well-popularized plan, uncertainty and doubt are filling that void - a consequence which is easy to anticipate (as are some of the consequences of uncertainty and doubt).
> We're going to have substantially less reliable data over the next decade.
With no well-popularized plan it's hard to tell, and even harder because there are people hard at work on their own visions of the future that do not 100% line up with the current administration's plans, and those people will influence the next decade as well.
https://polymarket.com/event/will-twitter-report-any-outages...
Yes, here we are; I spend almost no time there now since every time I drop by all I see is political noise and culture wars. Most people I used to follow now either take part in the shit show or have lost interest in posting.
Huh. I found myself nodding along, perhaps a bit hesitant to jump into conclusions[0], but then she drops from a final emotional outburst straight into... a sponsorship segment pushing some bullshit anti-data-broker data broker service. Feels like VPNs all over again, and I honestly can't treat the earlier parts of the video seriously anymore.
I've seen her writings here and there over the years, and I remember she was generally respected as an opinionated expert. What happened?
--
[0] - I've seen my share of dramas which started with a message like this video, and where I thought I had a good picture, until some time later some critical details came to light and I ended up flipping my stance on it 180°.
>What happened?
Respected and opinionated experts gotta eat
Are things going that bad for scientists these days?
(I'm not just reacting to the existence of a sponsorship section alone - more to the choice of the sponsor, the presentation, and the very juxtaposition of a "trust me while I quote from private conversation" and "this video was sponsored by a data broker company".)
She wraps up the video and unofficially tells people to leave via 'let me know in the comments', then delivers unemotive capitalist brand advertisement speak about a product, and the video ends. I honestly can't see the merit in your performative outrage here. It's well known why people take advertisement deals, and its placement here comes after all the content has concluded.
You can find ways to dismiss opinions you don't like, but targeting individual contributors that use sponsorships to maintain or alleviate financial stability is an interesting attack. I guess if that's your only critique, you have no claims to any other negative critique about her points? Just a literal ad hominem? "What happened" to actual discourse?
> You can find ways to dismiss opinions you don't like, but targeting individual contributors that use sponsorships to maintain or alleviate financial stability is an interesting attack.
> Just a literal ad hominem?
I do not read the comment as an attack, but as an outsider to the field trying to judge the opinion they are absorbing.
Some people are going to expect an expert to have an outside income stream that comes from their expertise, so in-video ads are not necessary. That is not a heuristic that is going to be right 100% of the time, but it is not the worst heuristic, nor an attack.
I can understand an expert moving to science communication / popularization, and I understand the trappings of such an enterprise. However, creators have a choice in what they advertise and how they do it, and the choices they make reflect on them and on how their message is received.
I'd say, my own emotions aside, she made a pragmatic mistake here - first delivering a quite powerful bomb that she explicitly acknowledges we need to trust her about (as it's an excerpt of a private conversation, not possible to independently verify), only to tell us - also explicitly - that the video was sponsored by a company. This is the kind of stuff you find in late-stage capitalism jokes, or movies about corporate utopia where this juxtaposition is exaggerated for effect. I did not see it coming, and I struggle to understand why it did.
But more generally, while I wouldn't describe it as "targeting individual contributors that use sponsorships to maintain or alleviate financial stability", I do believe that what kinds of sponsorships people choose and how they fulfill the sponsor's conditions does directly reflect on trustworthiness of the entire message - to think otherwise would be to believe that humans are capable of compartmentalizing their activities into high-integrity and low-integrity parts, which is something I don't believe humans are capable of over long-term. Maybe I'm wrong about that - if psychology says it's normal, then perhaps I need to reconsider my heuristic. But if I'm right, then this is directly on-topic and I believe it's right to bring it up, to the extent the creator/speaker is asking the audience to take them on trust. Aka. on authority. Which means it applies doubly so to the experts - their choice of what and how they advertise matters more, because they're lending their credibility to both the message and the ad that pays for it.
> I'd say, my own emotions aside, she made a pragmatic mistake here - first delivering a quite powerful bomb that she explicitly acknowledges we need to trust her about (as it's an excerpt of a private conversation, not possible to independently verify), only to tell us - also explicitly - that the video was sponsored by a company. This is the kind of stuff you find in late-stage capitalism jokes, or movies about corporate utopia where this juxtaposition is exaggerated for effect. I did not see it coming, and I struggle to understand why it did.
I agree here. I would say historically this was a more common view. What I observe is that more people are disregarding this, in part due to feedback loops between creators and the very large audiences that web platforms provide today. If your message connects with a large enough group of people with enough loyalty/trust (such that, say, weird sponsor messages do not affect their viewership), you can safely disregard the rest of the audience to a large degree. Delivering with emotion, "delivering a quite powerful bomb", etc. helps build that loyal group of followers, but also leads to a feedback cycle that can make things more one-sided/hyperbolic.
This has a knock-on effect on people, at least those like me, who now devalue many similar emotional pleas without evidence, for both good and ill. After all, if people are incentivized to deliver impassioned speeches and "bombs", statistically there are going to be more people who do so inappropriately; it sometimes seems somewhat normalized just because of how much I see it. To me, the message she delivered had little to no impact, because it was not delivered with either reasonable evidence or at least the start of a plan or solution to the issue she is describing. This knock-on effect makes things worse to some extent, though, since creators already in that feedback loop have even less incentive to reach out to someone like me: they have to overcome the additional barrier, and would not win the same level of loyalty even if they did.
> But if I'm right, then this is directly on-topic and I believe it's right to bring it up, to the extent the creator/speaker is asking the audience to take them on trust. Aka. on authority. Which means it applies doubly so to the experts - their choice of what and how they advertise matters more, because they're lending their credibility to both the message and the ad that pays for it.
I agree it is on topic and relevant, but it did not register to me at all in this video, because the trust was lost when there was only impassioned delivery without hard evidence or an action plan to help address the issue. It likely does not register to those in the loyal, impassioned group of followers either.
Independent of potential regulation, creating more high-trust, potentially collaborative, sources of information seems like a partial way to counterbalance some of this.
I'm not trying to dismiss her opinion. I'm surprised that one day I was reading her essays and watching her talks, and now I suddenly discover she's become another YouTube influencer.
It would be merely off-putting if it were tacked onto a video about hard, verifiable facts - but to make a video that explicitly asks the audience to trust her on her word, and then end it by saying the video itself was sponsored by a company? Surely she's aware of the optics? Even the regular YouTube influencers usually know better and skip the sponsorship section when making a complaint video.
I'm also not committing an ad hominem here (if anything, maybe some adjacent fallacy). This is the lens through which I view all YouTube channels, and it applies here, and doubly so given that this is not a video about independently verifiable facts - it's all based on an e-mail she claims she received, and she acknowledges that directly. We have to trust her at her word.
Now, I don't know about you, but if someone in one moment tells me some information, and then in the next moment starts giving me "capitalist brand advertisement speak" about some dubious product, I'm going to take it as a sign that they don't actually care about my well-being, as manipulating someone into a bad deal for profit is plain malice[0]. Additionally, I might question that person's integrity - depending on how obviously they're knowingly pushing a bad product, or how indifferent they are to what they're promoting. Which, in turn, will make me question everything they just told me before - after all, if they just demonstrated they're fine with lies or bullshit now, why should I assume they held themselves to the highest standards of scientific and personal truth moments before?
I'm honestly done complaining over YouTube creators; I just accepted that many of the well-known pop-sci channels turned into content marketing schemes. At least I stopped being surprised by exaggerations and inaccuracies in the main parts. I commented this time only because I totally didn't expect to see a reputable scientist doing this kind of stuff.
--
[0] - Yes, I stand firm on this, and yes, if you scale this view up, you end up considering the entire field of "marketing communication" (covering a subset of the intersection of sales, marketing, and advertising) as a cancer on modern society. I wrote an article about this some time ago, which I've been told apparently popped up on HN last week.
You seem to somehow connect it to public spending. But "lost resources" happen in every kind of company and are just a function of size. A small office will lose some random notes, a data processing company will have a few TB hanging around due to forgotten cleanup processes. It just happens and one of the functions of your AWS TAM is to ask you sometimes about an unused bucket they noticed. The more you spend every month, the more things can become a rounding error in cost.
Ah yes, if only there was some use for this electricity-malarkey.
> It's an organization with an unpredictable return on investment
Yes, science has extremely unpredictable return on investment.
What's your suggestion? Don't try?
I've been using RWMutex'es around SQLite calls as a precaution since I couldn't quite figure out if it was safe for concurrent use. This is perhaps overkill?
Since I do a lot of cross-compiling I have been using https://modernc.org/sqlite for SQLite. Does anyone have some knowledge/experience/observations to share on concurrency and SQLite in Go?
> I've been using RWMutex'es around SQLite calls as a precaution since I couldn't quite figure out if it was safe for concurrent use. This is perhaps overkill?
You should not share sqlite connections between threads (or anything even remotely resembling threads): while the serialized mode ensures you won't corrupt the database, there is still per-connection state which may / will cause issues eventually e.g. [1][2][3]. Note that [1] does apply to read-only connections.
You can use separate connections concurrently. If you're using transactions and WAL mode you should have a pool of read-only connections (growable) and a single read/write connection. If you're not using multi-statement transactions (autocommit mode) then you can just have a pool.
[1] https://sqlite.org/c3ref/errcode.html
[2] https://sqlite.org/c3ref/last_insert_rowid.html (less of an issue now that sqlite has RETURNING)
[3] https://sqlite.org/c3ref/changes.html (possibly same)
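For what it's worth, here's a minimal Go sketch (not from the linked pages; the driver name, file name, and `users` table are my own assumptions) of sidestepping last_insert_rowid()'s per-connection state by using RETURNING, so the generated id comes back on the same statement:

```
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"

	_ "modernc.org/sqlite"
)

func main() {
	db, err := sql.Open("sqlite", "file:app.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ctx := context.Background()
	if _, err := db.ExecContext(ctx,
		`CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)`); err != nil {
		log.Fatal(err)
	}

	var id int64
	// RETURNING (SQLite 3.35+) hands the new id back with the INSERT itself,
	// so it cannot be confused with another connection's insert.
	err = db.QueryRowContext(ctx,
		`INSERT INTO users(name) VALUES (?) RETURNING id`, "alice").Scan(&id)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("inserted id:", id)
}
```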
If this is Go and database/sql (as the modernc tidbit points to) this comment is ill advised.
Go database/sql handles ensuring each actual database connection is only used by a single goroutine at a time (before being put back into the pool for reuse), and no sane SQLite driver should have issues with this.
If you're working in Go with database/sql you're meant to not have to worry about goroutines or mutexes. The API you're using (database/sql) is goroutine safe (unless the driver author really messed things up).
To be clear: each database/sql "connection" is actually a pool of connections, which are already handed out (and returned) in a goroutine-safe way. The advice is to use two connection pools: a read-only one of size N, and a read-write one of size 1.
Regardless, mutexes are not needed.
If you're using database/sql you don't need locks around database access.
All sane Go SQLite drivers should be concurrency safe.
That said, (and I'm biased but) there's something fishy about modernc concurrency: https://github.com/cvilsmeier/go-sqlite-bench#concurrent
Querying data with 2 threads shouldn't be 2x slower than using 1 thread; 4 threads shouldn't be 6x slower; or 8 threads 15x.
The folks at GoToSocial have long maintained a concurrency fix; I don't know if it's related to this performance issue.
Still, the advice about using read-only and read-write connections holds, for all drivers.
Why? Because in SQLite reads can be concurrent, but writes are exclusive. If you're using WAL, a single write can be concurrent with many reads, otherwise a writer also blocks readers. And SQLite handles all of this.
But SQLite locks are polling locks: a connection tries to acquire a lock; if it can't, it sleeps for a bit and tries again; rinse, repeat. When the other connection finally releases its lock, everyone else is sleeping. There's a lot of sleeping involved; also with exponential backoff.
Using read-only and read-write connections "fixes" this. Make the read-only a pool of N connections, and the read-write a single connection. Then, all waiting happens in Go channels, which are fast at waking waiters and don't involve needless sleeping/polling. As a bonus, context cancellation will be respected for this kind of blocking.
On this final point, my driver goes to great lengths to respect context cancellation, both of CPU queries (which you need to forcefully interrupt), and busy waiting on locks (which also doesn't work for most other drivers). So you can set a large busy timeout (one minute is the default) with the confidence that cancellation will work regardless. The dual connection strategy still offers performance benefits, though.
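To make the dual-pool advice concrete, here's a rough sketch of how it can look with database/sql. It is not the article's code: the driver name ("sqlite", i.e. modernc.org/sqlite), the file name, and the openPools helper are illustrative assumptions, and busy-timeout/DSN syntax varies by driver.

```
// Rough sketch of the "two pools" pattern: one read pool sized to the CPU
// count and one write pool capped at a single connection, so writers queue
// in Go instead of spinning on SQLite's busy handler.
package main

import (
	"database/sql"
	"log"
	"runtime"

	_ "modernc.org/sqlite"
)

func openPools(path string) (read, write *sql.DB, err error) {
	// Writer pool: exactly one underlying connection.
	write, err = sql.Open("sqlite", path)
	if err != nil {
		return nil, nil, err
	}
	write.SetMaxOpenConns(1)

	// journal_mode=WAL is persistent in the database file, so running it
	// once on the write handle is enough; it lets readers proceed while a
	// write is in flight.
	if _, err = write.Exec(`PRAGMA journal_mode=WAL`); err != nil {
		return nil, nil, err
	}

	// Reader pool: many connections for concurrent SELECTs. Ideally these
	// are opened read-only, and given a per-connection busy_timeout, via
	// whatever DSN parameters your driver supports.
	read, err = sql.Open("sqlite", path)
	if err != nil {
		return nil, nil, err
	}
	read.SetMaxOpenConns(runtime.NumCPU())
	return read, write, nil
}

func main() {
	read, write, err := openPools("file:app.db")
	if err != nil {
		log.Fatal(err)
	}
	defer read.Close()
	defer write.Close()
	// Route SELECTs through `read` and INSERT/UPDATE/DELETE through `write`.
}
```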
https://www.sqlite.org/threadsafe.html
Three modes are supported: one where the library does no locking and you can only use the library from one thread at a time; one where global structures are locked but not connections so you can use each connection from one thread at a time; and one where everything is locked.
For actual concurrency you probably want one database connection per thread.
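If you want to see which of those modes your build defaults to, one quick check (a sketch of mine, not from the linked page; whether THREADSAFE appears in the list depends on how the library was compiled, and the driver/file names assume modernc.org/sqlite) is to read the compile options:

```
// Prints the THREADSAFE compile-time option of the linked SQLite build:
// 0 = single-thread, 1 = serialized (the usual default), 2 = multi-thread.
package main

import (
	"database/sql"
	"fmt"
	"log"
	"strings"

	_ "modernc.org/sqlite"
)

func main() {
	db, err := sql.Open("sqlite", "file:app.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.Query(`PRAGMA compile_options`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var opt string
		if err := rows.Scan(&opt); err != nil {
			log.Fatal(err)
		}
		if strings.HasPrefix(opt, "THREADSAFE=") {
			fmt.Println("SQLite compiled with", opt)
		}
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```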
Do you really need concurrency? My understanding is reads are like… picoseconds, because everything happens in memory. You don’t have a separate server to call.
This guy is great: https://fractaledmind.github.io/2024/04/15/sqlite-on-rails-t...
https://fractaledmind.github.io/images/railsworld-2024/111.p...
https://fractaledmind.github.io/2024/10/16/sqlite-supercharg...
You're off by two orders of magnitude.
More like six?
It would be more constructive if you guys posted the correct numbers and a source.
A quick search suggests SSD reads are in the milliseconds. I’ve found that, or microseconds, to be roughly accurate. SQLite queries, however, can take seconds (or more in pathological cases) depending on how well optimized the indices are for the query being run, load, etc. This is one of the reasons I don’t love using SQLite in Bun servers. It makes me nervous that a bad query will bring down the single thread of the app.
It is easy to assume most programmers today know roughly how long things take, given this has existed in various forms since Jeff/Peter first published it:
https://colin-scott.github.io/personal_website/research/inte...
(This version is more useful since it shows change over time)
That’ll teach me to post first thing in the morning.
Something something, Cunningham's Law.
It would be nice for it to work in concurrent contexts, as even using mutexes adds responsibility (for not messing up).
From my limited experience and understanding, I believe concurrent reads are OK, and concurrent writes are queued internally. As long as you don't mix up the order of the writes you're doing, SQLite can queue them as best it can.
My sole reference: https://www.sqlite.org/lockingv3.html
The database file itself is safe for concurrent use, but the internal sqlite3 database structure itself can be unsafe to share across thread, depending on how your sqlite3 library was compiled. See https://www.sqlite.org/threadsafe.html
Thanks!
> I couldn't quite figure out if it was safe for concurrent use
This has been a big issue I have found when researching SQLite and trying to determine if it's suitable. I can't figure out the right way to do certain stuff, or the docs are difficult/outdated, etc.; in the end I always default back to Postgres.
There is a good chance this is the article the author mentioned reading: https://kerkour.com/sqlite-for-servers
I noted this on my website, some time ago [1]:
> Making use of sql.DB from multiple goroutines is safe, since Go manages its own internal mutex.
Then I quote the database/sql documentation [2]:
> DB is a database handle representing a pool of zero or more underlying connections. It's safe for concurrent use by multiple goroutines.
[1]: https://stigsen.dev/notes/go-database-thread-safety/ [2]: https://pkg.go.dev/database/sql@go1.23.3#DB
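As a tiny illustration of that guarantee (my own sketch; the driver and the trivial query are just placeholders), several goroutines can share one *sql.DB with no extra locking, because the pool hands each goroutine its own underlying connection:

```
package main

import (
	"database/sql"
	"log"
	"sync"

	_ "modernc.org/sqlite"
)

func main() {
	db, err := sql.Open("sqlite", "file:app.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			var n int
			// No application-level mutex needed around this call.
			if err := db.QueryRow(`SELECT 1`).Scan(&n); err != nil {
				log.Println(err)
			}
		}()
	}
	wg.Wait()
}
```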
sqlite supports concurrent access. It's optional, though. It also requires help from the os/file system (that's usually only an issue in specific contexts).
So... you can get your ducks in a row in terms of checking sqlite docs, your sqlite config and compile-time options, and your runtime environment, and then remove the mutexes (modernc is another variable, but I don't know about that). Or, if it's working for you, you can leave them in.
Reasons to remove them might be... it's a pain to maintain them; it might help performance (sqlite is inherently one writer multiple readers, and since you're using RW locks, your locks may already align with sqlite's ability to use concurrency). If those aren't issues for you then you can leave them in.
Would be very interested too.
Yup, they win. My biggest SQLite database is 1.7TB with, as of just now, 2,314,851,188 records (all JSON documents with a few keyword indexes via json_extract).
Works like a charm, as in: the web app consuming the API linked to it returns paginated results for any relevant search term within a second or so, for a handful of concurrent users.
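For anyone curious what "keyword indexes via json_extract" can look like in practice, here's a hedged sketch (the table, column, and key names are made up, and the driver is assumed to be modernc.org/sqlite): an expression index on a JSON field, which a WHERE clause using the same expression can then hit instead of scanning every row.

```
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "modernc.org/sqlite"
)

func main() {
	db, err := sql.Open("sqlite", "file:docs.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	stmts := []string{
		`CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, body TEXT)`,
		// Expression index: queries must repeat the exact same
		// json_extract expression for SQLite to be able to use it.
		`CREATE INDEX IF NOT EXISTS docs_kw ON docs (json_extract(body, '$.keyword'))`,
	}
	for _, s := range stmts {
		if _, err := db.Exec(s); err != nil {
			log.Fatal(err)
		}
	}

	var id int64
	err = db.QueryRow(
		`SELECT id FROM docs WHERE json_extract(body, '$.keyword') = ? LIMIT 1`,
		"sqlite").Scan(&id)
	switch {
	case err == sql.ErrNoRows:
		fmt.Println("no match")
	case err != nil:
		log.Fatal(err)
	default:
		fmt.Println("first match:", id)
	}
}
```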
I think FS-level compression would be a perfect match. Has anyone tried it successfully on large SQLite DBs? (I tried but btrfs failed to do so, and I didn't get to the bottom of why).
I did a small benchmark for compression with VFS and column compression a while ago: https://logdy.dev/blog/post/part-3-log-file-compression-with... https://logdy.dev/blog/post/part-4-log-file-compression-with... It all depends on the use cases and read/write patterns. Imo, if well designed, it could yield added value.
> I think FS-level compression would be a perfect match. Has anyone tried it successfully on large SQLite DBs?
I've had decent success with `sqlite-zstd`[0] which is row-level compression but only on small (~10GB) databases. No reason why it couldn't work for bigger DBs though.
[0] https://github.com/phiresky/sqlite-zstd
> My biggest SQLite database is 1.7TB with
What do you run this on? Just some aws vpc with a huge disk attached?
I can see that you're a user of AWS. Check some prices on dedicated servers one day. They're an order of magnitude cheaper than similar AWS instances, and more powerful because all compute and storage resources are local and unshared.
They do have a higher price floor, though. There are no $5/month dedicated servers anywhere - the cheapest is more like $40. There are $5/month virtual servers outside of AWS which are cheaper and more powerful than $5/month AWS instances.
A Windows Server VM on a self-hosted Hyper-V box, which has a whole bunch of 8TB NVMe drives; this VM has a 4TB virtual volume on one of those (plus a much smaller OS volume on another).
How do you backup a file like that?
Using the SQLite backup API, which pretty much corresponds to the .backup CLI command. It doesn't block any reads or writes, so the performance impact is minimal, even if you do it directly to slow-ish storage.
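For anyone curious what that looks like from Go: a minimal sketch, assuming the mattn/go-sqlite3 driver (other drivers expose the online backup API differently) and hypothetical live.db / backup.db paths. The batch size of 1024 pages per step is arbitrary.

```
package main

import (
	"context"
	"database/sql"
	"log"

	sqlite3 "github.com/mattn/go-sqlite3"
)

// backup copies the live database into dstPath using SQLite's online backup
// API, stepping a batch of pages at a time so the source stays readable and
// writable while the copy runs (a write from another connection restarts the copy).
func backup(ctx context.Context, srcPath, dstPath string) error {
	srcDB, err := sql.Open("sqlite3", srcPath)
	if err != nil {
		return err
	}
	defer srcDB.Close()

	dstDB, err := sql.Open("sqlite3", dstPath)
	if err != nil {
		return err
	}
	defer dstDB.Close()

	srcConn, err := srcDB.Conn(ctx)
	if err != nil {
		return err
	}
	defer srcConn.Close()

	dstConn, err := dstDB.Conn(ctx)
	if err != nil {
		return err
	}
	defer dstConn.Close()

	// database/sql hides the driver connections, so reach them via Raw.
	return dstConn.Raw(func(dstRaw any) error {
		return srcConn.Raw(func(srcRaw any) error {
			bk, err := dstRaw.(*sqlite3.SQLiteConn).Backup("main", srcRaw.(*sqlite3.SQLiteConn), "main")
			if err != nil {
				return err
			}
			defer bk.Finish()

			for {
				done, err := bk.Step(1024) // copy up to 1024 pages per step
				if err != nil {
					return err
				}
				if done {
					return nil
				}
			}
		})
	})
}

func main() {
	if err := backup(context.Background(), "live.db", "backup.db"); err != nil {
		log.Fatal(err)
	}
}
```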
> It doesn't block any reads or writes.
That's neat! I bet it keeps growing a WAL file while the backup is ongoing right?
Hard to imagine doing it any other way, which is probably fine up until you hit some larger file sizes.
That copies the entire file each time (not just deltas).
You may find sqlite_rsync better.
I use zfs snapshots, they work in diffs so they're very cheap to store, create, and replicate.
sqlite_rsync is a new tool created by the SQLite team. It might be useful.
Ctrl-c, Ctrl-v
That's not great advice for a large database (and I wouldn't recommend it for small databases either). That incidentally works when the db is small enough that the copy is nearly atomic, but with a big copy, you can end up with a corrupt database. SQLite is designed such that a crashed system at any time does not corrupt a database. It's not designed such that a database can be copied linearly from beginning to end while it's being written to without corruption. Simply copying the database is only good enough if you can ensure that there are no write transactions opened during the entire copy.
A reliable backup will need to use the .backup command, the .dump command, the backup API[0], or filesystem or volume snapshots to get an atomic snapshot.
[0]: https://www.sqlite.org/backup.html
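One more single-statement option, not mentioned above but handy for scripted backups: VACUUM INTO writes a transactionally consistent copy of the database to a new file. A minimal Go sketch, with a hypothetical destination path:

```
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3" // assumption: any Go SQLite driver will do
)

func main() {
	db, err := sql.Open("sqlite3", "live.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// VACUUM INTO (SQLite 3.27+) snapshots the database into a fresh,
	// defragmented file; the destination must not already exist.
	if _, err := db.Exec(`VACUUM INTO '/backups/live-copy.db'`); err != nil {
		log.Fatal(err)
	}
}
```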
You should use mongodb. It’s web scale
(References <https://www.youtube.com/watch?v=b2F-DItXtZs>)
This hits so much harder 15 years on.
Additionally, Xtranormal missed out on the generative video curve.
Someone might take that as advice.
MongoDB earns $1.7b a year in revenue.
A whole lot of people have already taken that advice.
Captive customers my friend.
nit: this code snippet has a typo (dbWrite wasn't defined)
```
dbRead, _ := connectSqliteDb("dbname.db")
defer dbRead.Close()
dbRead.SetMaxOpenConns(runtime.NumCPU())

dbRead, _ := connectSqliteDb("dbname.db")
defer dbWrite.Close()
dbWrite.SetMaxOpenConns(1)
```
Yeah, the second `dbRead` should prob be `dbWrite`:
dbWrite, _ := connectSqliteDb("dbname.db")
I was a bit confused by the code, then assumed it was a typo
searchcode doesn't seem to work for me. All queries (even the ones recommended by the site) unfortunately return zero results. Maybe it got hugged?
https://searchcode.com/?q=re.compile+lang%3Apython
It does mention doing load shedding:
> By default searchcode prioritizes system survival (hey its a free service!), and as such might do some load shedding, which can mean you don't see results you expect. You can do some things to help with this.
I don’t know whether that explains the results. Some similar queries, e.g., searching for `math.ceiling`, do return many results.
It doesn't list Go types from several k8s projects on GitHub that I contribute to. I feel something is buggy about the filtering as well. I guess he will take some time to iron out all the issues; I suspect not all his data got migrated into the new DB, and the DB size should be far greater than 6TB. That feels rather low for GitHub.
But I liked his tip about SQLite driver scalability to avoid that stupid "database is locked" error that I too have faced regularly. NumCPU connections for readers and a single writer; I will try that out.
Even taking off the language filter, you only get 5 results for a very common function!
I had to scroll down the search page and select the sources and languages to get a result
I searched for several Skia classes and it never found the actual Skia repo, just forks and references from unrelated repos. It also failed to find several classes entirely. Skia exists in GitHub as well as in Chromium CodeSearch so it should have come up at least twice.
As a sanity check, "fwrite" only has 8 references in the entire database.
Yeah, agreed, I think the migration didn't actually work.
Yeah, I just searched for "driver_register", a call that would show up in a large number of Linux drivers in the open source Linux kernel, not to mention other public-facing repos, and it only returned two results, neither from the mainline Linux kernel repo.
I use marmot, which gives me multi-master SQLite.
It uses a simple CRDT table structure and allows me to have many live SQLite instances in all data centers.
A NATS JetStream server is used as the core, with all the SQLite DBs connected to it.
It operates off the WAL and is simple to install.
It also replicates all blobs to S3, with the directory structure kept in the SQLite db.
With a Cloudflare domain, the user's request is automatically sent to the nearest DB.
So it replaces Cloudflare's D1 system for free. A 4 euro Hetzner box is enough.
Mine is holding about 200 gb of data.
https://github.com/maxpert/marmot
Wait ...
Is dbWrite ever declared? I know it's just an example, but still... Now I have to ask myself whether this error resulted from relying on a human, or not relying on one.
Typo
I've been looking for a service just like searchcode, to try and track down obscure source code. All the best, hope it can be sustainable for you.
I have been contemplating Postgres hosted by AWS vs. SQLite running locally with my Django app. I have low volume in my app and low concurrent traffic. I may have joins in the future since the data is highly relational.
I still chose Postgres managed by AWS, mainly to reduce the operational overhead. I keep thinking I should have just gone with sqlite3, though.
I host my stuff on Railway, and while I love SQLite, I usually go with their Postgres offering instead. It's actually less work for me, and although I haven't tested this scientifically, it's hard to imagine the SQLite way would save me serious money.
I'd argue no relational db scales reads vertically better than SQLite. For writes, you can batch them or distribute them to attached dbs, but either way you may lose some transaction guarantees.
Reading without write contention is not a terribly difficult problem; you could use any database and it'd work fine. It's the mutations that distinguish db engines. SQLite is indeed close to ideal, but the comparison to other databases (at scale, no less) is without substance.
Generally, to scale reads appropriately (to >10k readers) you need things like connection pooling and effective replication to some kind of readable secondaries (which can maintain transaction guarantees if you do synchronous writes, which... maybe isn't great). I don't think SQLite has either of those.
Fascinating read. Those suggesting Mongo are missing the point.
The author clearly mentioned that they want to work in the relational space. Choosing Mongo would require refactoring a significant part of the codebase.
Also, databases like MySQL, Postgres, and SQLite have neat indexing features that Mongo still lacks.
The author wanted to move away from a client-server database.
And finally, while wanting a 6.4TB single binary is wild, I presume that’s exactly what’s happening here. You couldn’t do that with Mongo.
This comment appears to be the only place, between the article and the thread, where Mongo getting recommended is even mentioned. Maybe stop giving them free press. I certainly haven't heard anyone seriously suggest them for about ten years now.
Edit: ok the other guy mentioning mongo is clearly being sarcastic
The binary would only contain the database engine; not the database. It’s probably a few MB, if I had to guess.
Presumably the database needs to be distributed to servers, too. The engine needs to access something. This is a necessity whether or not it's referred to as a binary.
The person suggesting MongoDB in the replies isn't serious, it's just a meme reference.
Is the site like grep.app?
Without reading the article, I always quickly estimate what a DRAM-based in-memory database would cost. You build a mesh network of FPGA PCBs with 12 x 12.5 Gbps links interconnecting them. Each $100 FPGA has 25 GB/s DDR3 or DDR2 controllers with up to 256 GB. DDR3 DIMMs have been less than $1 per GB for many years.
6.4x($100+256x4x$1)=$7,193.6 worst case pricing
So this database would cost less than $8,000 in DRAM hardware and could be searched/indexed in parallel in much less than a second.
Now you can buy old DDR2 chips harvested from e-waste at less than $1 per GB, so this database could cost as little as $1,000 including the labour for harvesting.
You can make a wafer-scale integration with SRAM that holds around 1 terabyte of SRAM for $30K including mask set costs. This database would cost you $210,000. Note that SRAM is much faster than DDR DRAM.
You would need 15000/7 customers of this size database to cover the manufacturing of the wafers. I'm sure there are more than 2143 customers that would buy such a database.
Please invest in my company, I need $450,000,000 up front to manufacture these 1 terabyte wafers in a single batch. We will sprinkle in the million-core small reconfigurable processors[1] in between the terabyte SRAM for free.
For the observant: we have around 50 trillion transistors[2] to play with on a 300mm wafer. Our secret sauce is a 2-transistor design making SRAM 5 times denser than Apple Silicon's SRAM density at 3nm on their M4.
Actually we would not need reconfigurable processors, we would intersperse Morphle Logic in between the SRAM. Now you could reprogram the search logic to do video compression/decompression or encryption by reconfiguring the logic pattern[3]. Total search of each byte would be a few hundred nanoseconds.
The IRAM paper from 1997 mentions 180nm. These wafers cost $750 today (including mask set and profit margin), so now you would need less than a million in investment up front to start mass manufacturing. You would just need 3600 times the number of wafers compared to the latest 3nm transistor density on a wafer.
[1] Intelligent RAM (IRAM): chips that remember and compute (1997) https://sci-hub.ru/10.1109/ISSCC.1997.585348
[2] Merik Voswinkel - Smalltalk and Self Hardware https://www.youtube.com/watch?v=vbqKClBwFwI&t=262s
[3] Morphle Logic https://github.com/fiberhood/MorphleLogic/blob/main/README_M...
> 64000x($100+256x$4)=$7,193,600 worst case pricing
Not sure how you arrived at this calculation. 256x$4 already accounts for 256GB. The database in the OP is 6400GB large. So shouldn't it be 25x($100+256x$4) = $28100?
FWIW in practice this number should be much, MUCH lower. You can get 1TB of ECC DDR4 for ~$2k, probably lower if you buy wholesale.
You might be right, I did the math too quickly and can't fix the posting. As they say, these "back of the envelope" calculations only get you "ballpark" estimates.
The $450 million is also an estimate that could be off by an order of magnitude.
My argument still stands: in-memory databases built on SRAM wafers are much faster and cheaper than customers (managers) realize, with no need for database software.
In 1997 the IRAM paper was still thinking about 180nm DRAM chips; in 2025 we can do 3nm wafer-scale integration, so Moore's law gives us affordable terabyte-scale in-memory databases that cost less than the programmer you would otherwise pay to write and manage the database.
In 1981 Dan Ingalls wrote[1] in Design Principles Behind Smalltalk: "An operating system is a collection of things that don't fit into a language. There shouldn't be one."
In 2025 I argue that "a database is a collection of tricks to shuffle data between compute and storage memory. There shouldn't be one."
I would go one step further and say a von Neumann architecture or Harvard architecture microprocessor is a collection of gates to compute in separated memory storage. There shouldn't be one.
[1] https://www.cs.virginia.edu/~evans/cs655/readings/smalltalk....