This is a really clever design.
The cost estimates are particularly notable: if they're right, that's about $3/day for 6TB/day of written data, 2TB/day of deletes, and 50K read queries.
Storing all those TBs of data in S3 is where the real cost lies. At S3 Standard's ~$0.023/GB-month, storing 8TB/day * 30 = 240TB costs about $5,520, and if you retain all data your monthly bill grows by another $5,520 every month.
I think the idea is that the deletes would eventually be compacted, so only the net 6 - 2 = 4TB/day sticks around and it's ultimately half as much, but I digress.
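For concreteness, here's a minimal back-of-the-envelope sketch of that math in Python, assuming a flat $0.023/GB-month rate (real S3 pricing is tiered and region-dependent):

```python
# Rough S3 storage math for the thread above. The flat $0.023/GB-month
# rate is an assumption (S3 Standard, ignoring tiered/regional pricing).
RATE_PER_GB_MONTH = 0.023
WRITES_TB_PER_DAY = 6
DELETES_TB_PER_DAY = 2
DAYS = 30

# Everything retained: writes and delete markers both occupy space.
gross_tb = (WRITES_TB_PER_DAY + DELETES_TB_PER_DAY) * DAYS      # 240 TB
print(f"gross: {gross_tb} TB -> +${gross_tb * 1000 * RATE_PER_GB_MONTH:,.0f}/month")

# After compaction, the deleted rows and their tombstones are gone,
# so only the net 6 - 2 = 4 TB/day accumulates: half the bill.
net_tb = (WRITES_TB_PER_DAY - DELETES_TB_PER_DAY) * DAYS        # 120 TB
print(f"net:   {net_tb} TB -> +${net_tb * 1000 * RATE_PER_GB_MONTH:,.0f}/month")
```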
The cost isn't that bad all things considered. Hot, durable and available data ain't that cheap, especially in the cloud. Self-hosting is within an order of magnitude.
I think ideally you could map retention of cold data onto the file objects themselves: with a key-space naming strategy plus lifecycle rules, you could expire the data that's no longer needed and save on storage costs (as much as possible, hopefully). A sketch of what that might look like is below.
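Something like this, for instance (a minimal boto3 sketch; the bucket name, the cold/ prefix, and the retention windows are all hypothetical):

```python
# Sketch: expire cold data with an S3 lifecycle rule, assuming objects are
# written under a "cold/" key prefix (hypothetical naming scheme).
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-table-bucket",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [{
            "ID": "expire-cold-data",
            "Filter": {"Prefix": "cold/"},  # key-space naming does the targeting
            "Status": "Enabled",
            # Optionally tier down to a cheaper class before deleting outright.
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER_IR"}],
            "Expiration": {"Days": 90},
        }]
    },
)
```

The catch is that retention then has to line up with how keys are laid out, since lifecycle rules match on key prefixes and tags rather than anything inside the files.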
I just want to be able to append metadata to the end of a Parquet file without rewriting the whole file. Tombstones could be baked into the Parquet file this way.
It does work with "one more file", but it's not good for performance.
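For what it's worth, the "one more file" version looks roughly like this (a pyarrow sketch; the file layout and the row_id column are hypothetical): the Parquet file stays immutable, deletes go into a tiny sidecar, and readers anti-join against it:

```python
# Sketch of the "one more file" tombstone approach: the data file stays
# immutable, deletes go into a small sidecar Parquet file, and reads
# filter against it. File names and the "row_id" column are hypothetical.
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

def delete_rows(tombstone_path: str, row_ids: list[int]) -> None:
    """Record deletes by rewriting the sidecar (it stays tiny)."""
    try:
        existing = pq.read_table(tombstone_path)["row_id"].to_pylist()
    except FileNotFoundError:
        existing = []
    ids = sorted(set(existing) | set(row_ids))
    pq.write_table(pa.table({"row_id": pa.array(ids, pa.int64())}), tombstone_path)

def read_live(data_path: str, tombstone_path: str) -> pa.Table:
    """Read the data file and drop any tombstoned rows."""
    table = pq.read_table(data_path)
    try:
        dead = pq.read_table(tombstone_path)["row_id"].to_pylist()
    except FileNotFoundError:
        return table  # no deletes yet
    mask = pc.invert(pc.is_in(table["row_id"], value_set=pa.array(dead, pa.int64())))
    return table.filter(mask)
```

Every read pays for the extra GET plus the anti-join, which is exactly the performance problem above.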
That's the whole reason Iceberg, Delta, and Hudi exist, right?
Not as easy as just appending metadata to a Parquet file, but on the other hand, Parquet never was, and probably shouldn't be, designed with that functionality in mind.
Yeah. Or just sub out the data with null bytes. Something like that could be nice too.
Are you familiar with Parquet? You can't do that at all; you need to rewrite the whole file. Column chunks are compressed and encoded, and the footer indexes their byte offsets, so overwriting bytes in place would corrupt the file.
Yeah, I phrased it poorly. I meant in an ideal format with the benefits of Parquet, like a columnar file structure. I very much understand that it's not possible with Parquet today, for the reasons you mentioned and others.