Because you'll slowly start building the individual pieces of the database over the file system, until you've just recreated a database. Database didn't spawn out of nothing, people were writing to raw files on disk, and kept on solving the same issues over and over (data definitions, indexes, relations, cache/memory management, locks, et al).
So your question is really: Why does the industry focus on reusable solutions to hard problems, rather than piecemeal recreating them in every project? And when phrased that way, the answer is self-evident: productivity/cost/ease.
Historically, Unix (and many other operating systems) stored file names as an unsorted list so using the FS as KV store had O(N) lookup times whereas a single-file hashed database like dbm, ndbm, gdbm, bdb, etc gave you O(1) access.
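To make the contrast concrete, here's a minimal sketch using Python's stdlib dbm module (a descendant of exactly that family) as a single-file hashed KV store; the filename is illustrative:

```python
# Using the stdlib dbm module as a single-file hashed KV store.
# Keys and values are stored as bytes; lookups go through a hash,
# unlike scanning an unsorted directory listing for a filename.
import dbm

with dbm.open("cache.db", "c") as db:   # "c" = create if missing
    db[b"user:42"] = b'{"name": "alice"}'
    print(db[b"user:42"])               # b'{"name": "alice"}'
    print(b"user:99" in db)             # False
```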
If you're using a relational DB as a relational database, it gives you a lot the FS doesn't give you. And even if you're using a relational database as a plain key-value store, SQLite is about 35% faster than the filesystem [1]
Perhaps one of the biggest users of the filesystem as a KV store is git -- (not an llm, I just wanted to use --). .git/objects/xx/xxxxx maps the sha1 content hash to the compressed data, fanned out by the first two hex characters of the hash. However git also uses a database of sorts (.git/objects/pack/....). To sum up the git pack-objects man page: it's more efficient.
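For anyone curious, the loose-object scheme is easy to reproduce: git hashes a "blob &lt;size&gt;\0" header plus the content, zlib-compresses the whole thing, and uses the first two hex characters of the digest as the directory name. A sketch:

```python
# How git derives a loose-object path from content (the KV scheme above).
import hashlib
import zlib

def loose_object_path(content: bytes) -> str:
    store = b"blob %d\x00" % len(content) + content   # header + content
    digest = hashlib.sha1(store).hexdigest()          # 40 hex chars
    # What actually lands on disk is zlib-compressed:
    _compressed = zlib.compress(store)
    # First 2 hex chars become the directory, remaining 38 the filename.
    return f".git/objects/{digest[:2]}/{digest[2:]}"

print(loose_object_path(b"hello\n"))
# .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a
```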
1. https://www.sqlite.org/fasterthanfs.html
The tabular SQL database generally regarded as the fastest, SQLite, measures only about 3x faster than the file system on average, assuming you are using well optimized approaches in each system. That said, the file system can be faster depending on the operation.
The advantages of a database:
* Locking - Tables apply locking conventions to prevent race conditions when multiple operations write competing changes to the same data. If only a single application has access to the data, this can be solved with queues, but locking is necessary when multiple applications write to the same data at the same time.
* API - SQL provides a grammar that many people are familiar with. In practice this all goes to shit when you have to write a bunch of stored procedures, SQL functions, or table triggers. I really don't like SQL.
* References - In RDBMSs, records in one table can reference records in another table using foreign keys that point to unique identifiers (typically the primary key) of the other table. This is also solved auto-magically in languages where objects are passed by reference, but that isn't the file system.
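The references point is the one that's genuinely painful to replicate over raw files: the engine refuses writes that would dangle. A minimal sqlite3 sketch (table and column names are made up for illustration):

```python
# Foreign key enforced by the engine rather than by application code.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
con.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("""CREATE TABLE books (
    id INTEGER PRIMARY KEY,
    title TEXT,
    author_id INTEGER REFERENCES authors(id)
)""")
con.execute("INSERT INTO authors (id, name) VALUES (1, 'Le Guin')")
con.execute("INSERT INTO books (title, author_id) VALUES ('The Dispossessed', 1)")
try:
    # No author 99 exists, so the engine rejects this row outright.
    con.execute("INSERT INTO books (title, author_id) VALUES ('Ghost', 99)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```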
---
If your database, data system, whatever is in memory only, there are very few real advantages to using something like SQL. If the data is on disk, the file system is the lower level, and it is designed to optimize access by something like SQL -- for example via a Logical Volume Manager that can create data volumes spanning different hardware.
I think there are a lot of good answers here, but it really comes down to the type of content being stored and access patterns.
A database is a data structure with (generally) many small items that need to be precisely updated, read and manipulated.
A lot of files don't necessarily have this access pattern (for instance rendering a large video file) ... a filesystem has a generic access pattern and is a lower level primitive than a database.
For this same reason you even have different kinds of database for different access patterns and data types (e.g. Elasticsearch for full text search, MongoDB for JSON documents, Postgres for relational data).
Filesystem is generic and low-level, database is a higher order abstraction.
Just like any other abstraction that helps you do things more efficiently. Database is an abstraction over the more crude file system. It is similar to asking the question "why not write direct assembly code instead of a programming language". The answer is the same.
More of a sidenote than an answer but a database system can be faster than using the disk directly: https://sqlite.org/fasterthanfs.html#approx.
It turns out having a defined abstraction like a database makes things faster than having to rely on a rawer interface like filesystems because you can then reduce the number of system calls and context switches necessary. If you wanted to optimize that in your own code rather than relying on a database, you'd end up reinventing a database system of sorts, when (probably) better solutions already exist.
But this would only relate to local databases, wouldn't it? If you have to connect to a Postgres server or something similar, the latency for queries would be far higher than using the file system.
It turns out to be quite unreasonably difficult to put something aside for a time -- from milliseconds to decades -- and then go back and find it again. The difficulty is all in the "find", because things that anyone might care about finding have various aspects that different people may care about. Filesystems sit somewhere between the electronic equivalent of a pile of papers on one's desk and the full capabilities of the relational model; but there is no useful place in between.
A database is a file system when you get down to it. The reason people use them is to abstract up a layer so you can query the data and get the results you want instead of having to iterate through direct reads of a disk, then having to read, parse, and filter what you want from those reads. You could always write code to help do those things direct from disk, but you know what you have just written if you do so? A database!
I'd say the filesystem is a database.
It would be straightforward, for instance, to implement a lot of the functionality of a filesystem in a database with BLOBs. Random access might be a hassle, but people are getting used to "filesystem-like" systems that, like S3, are bad at random access.
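To illustrate, here's a toy "filesystem in a database" along those lines -- paths as keys, contents as BLOBs, with ranged reads done via substr() inside the query. The schema and helper names are made up for this sketch:

```python
# A minimal "filesystem" over SQLite BLOBs: paths as keys, bytes as values.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE files (path TEXT PRIMARY KEY, data BLOB)")

def write_file(path: str, data: bytes) -> None:
    con.execute("INSERT OR REPLACE INTO files VALUES (?, ?)", (path, data))

def read_range(path: str, offset: int, length: int) -> bytes:
    # substr() is 1-indexed; it extracts a byte range inside the query.
    row = con.execute("SELECT substr(data, ?, ?) FROM files WHERE path = ?",
                      (offset + 1, length, path)).fetchone()
    return row[0]

write_file("/etc/motd", b"hello blob world")
print(read_range("/etc/motd", 6, 4))  # b'blob'
```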
> You could always write code to help do those things direct from disk, but you know what you have just written if you do so? A database!
Yes, but that's my point. Why is this not a part of the standard library / typical package with very little friction with the rest of the code, instead of a separate program that the standard library / typical packages provide in an attempt to reduce the friction?
Or are you making the general point that databases already existed prior to the standard libraries etc, and this is just a case of interfacing with an existing technology instead of rebuilding from scratch?
Because a reasonably well optimized database with support for indexes, data integrity enforcement, transactions, and all the other important things we expect from a good (relational) database is complex enough that it takes a rather large codebase to do it reasonably well. It’s not something you slap together out of a handful of function calls.
ETA: look at SQLite for an example — it’s a relatively recent and simple entrant in the field and the closest you’ll find in the mainstream to a purely filesystem based RDBMS. How would you provide a stdlib that would let you implement something like that reasonably simply? What would be the use case for it?
I do that a little bit, for a simple cache where I don't need it to be always up to date -- close enough is fine -- and only on a single instance.
Some of my horizontally scaled services have like 500 MB of disk.