'Networking was a substantial cost and required experimentation. We did not use DHCP as most enterprise switches don’t support it and we wanted public IPs for the nodes for convenient and performant access from our servers. While this is an area where we would have saved time with a cloud solution, we had our networking up within days and kinks ironed out within ~3 weeks.'
Where does the switch choice come into whether you DHCP? Wth would you want public IPs.
It really feels like they wanted 30 PB of storage accessible over HTTP and literally nothing else. No redundancy, no NAT, dead simple nginx config + some code to track where to find which file on the filesystem. I like that.
This was not written by a network person, quite clearly. Hopefully it's just a misunderstanding, otherwise they do need someone with literally any clue about networks.
yeah misunderstanding we'll update the post-- separately it's true that we aren't network specialists and the network wrangling was prob disproportionately hard for us/ shouldn't have taken so long.
I assume your actual training is being done somewhere else? Did you try getting colocation space in the same datacentre as somewhere with the compute - it would have reduced your internet costs even further.
yeah the cost calculus is very different for gpus, it absolutely makes sense for us to be using cloud there. also hardly any datacenters can support the power density, esp in downtown sf
Many switches are L3 capable, making them in effect a router. Considering their internet lines appear to be hooked up to their 100 Gbps switch, I'd guess this is one of the L3 ones.
No DHCP doesn't mean public IPs nor impact the need for NAT, it just means the hosts have to be explicitly configured with IP addresses, default gateways if they need egress, and DNS.
Those IPs you end up assigning manually could be private ones or routable ones. If private, authorized traffic could be bridged onto the network by anything, such as a random computer with 2 NICs, one of which is connected eventually to the Internet and one of which is on the local network.
If public, a firewall can control access just as well as using NAT can.
I mean generally above a certain size of deployment DHCP is much more trouble than it's worth.
DHCP is really only worth it when your hosts are truly dynamic (i.e. not controlled by you). Otherwise it's a lot easier to handle IP allocation as part of the asset lifecycle process.
Heck even my house IoT network is all static IPs because at the small scale it's much more robust to not depend on my home router for address assignment - replacing a smart bulb is a big enough event, so DHCP is solely for bootstrapping in that case.
At the enterprise level unpacking a server and recording the asset IDs etc is the time to assign IP addresses.
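As a rough illustration of assigning addresses at asset intake instead of via DHCP, here is a minimal sketch. The asset IDs, the `10.20.0.0/24` subnet, and the choice to reserve the first host address for the gateway are all made up for the example; nothing here reflects the actual network in the post.

```python
import ipaddress

def assign_static_ips(asset_ids, subnet, reserved=1):
    """Map each asset ID to a fixed address from the subnet, in intake order."""
    network = ipaddress.ip_network(subnet)
    # Skip the first `reserved` host addresses (assumed to be the gateway).
    hosts = list(network.hosts())[reserved:]
    if len(asset_ids) > len(hosts):
        raise ValueError("subnet too small for the number of assets")
    return {asset: str(ip) for asset, ip in zip(asset_ids, hosts)}

# Hypothetical asset IDs recorded while unpacking servers.
plan = assign_static_ips(
    ["rack1-node01", "rack1-node02", "rack1-node03"], "10.20.0.0/24"
)
for asset, ip in plan.items():
    print(asset, ip)
```

The resulting table can live alongside the asset records, so the address is known before the machine is ever powered on.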
Just wanted to say, thanks for doing this! Now the old rant...
I started my career when on-prem was the norm and remember so much trouble. When you have long-lived hardware, eventually, no matter how hard you try, you just start to treat it as a pet and state naturally accumulates. Then, as the hardware starts to be not good enough, you need to upgrade. There's an internal team that presents the "commodity" interface, so you have to pick out your new hardware from their list and get the cost approved (it's a lot harder to just spend a little more and get a little more). Then your projects are delayed by them racking the new hardware and you properly "un-petting" your pets so they can respawn on the new devices, etc.
Anyways, when cloud came along, I was like, yeah we're switching and never going back. Buuut, come to find out that's part of the master plan: it's a no-brainer good deal until you and everyone in your org/company/industry forgets HTF to rack their own hardware, and then it starts to go from no-brainer to brainer. And basically unless you start to pull back and rebuild that muscle, it will go from brainer to no-brainer bad deal. So thanks for building this muscle!
Yeah from memory on-prem was always cheaper, it just removed a lot of logistic obstacles and made everything convenient under one bill.
IIRC the wisdom of the time cloud started becoming popular was to always be on-prem and use cloud to scale up when demand spiked. But over time temporarily scaling up became permanent, and devs became reliant on instantly spawning new machines for things other than spikes in demand and now everyone defaults to cloud and treats it as the baseline. In the process we lost the grounding needed to assess the real cost of things and predictably the cost difference between cloud and on-prem has only widened.
we're in a pretty unique situation in that very early on we fundamentally can't afford the hyperscaler clouds to cover operations, so we're forced to develop some expertise. turned out to be reasonably chill and we'll prob stick with it for the foreseeable future, but we have seen a little bit of the state-creep you mention so tbd.
I'm not op, but thanks for this. Like I mentioned in another comment, the wholesale move to the cloud has caused so many skills to become atrophied. And it's good that someone is starting to exercise that skill again, like you said. The hyperscalers are mostly to blame for this, the marketing FUD being that you can't possibly do it yourself, there are too many things to keep track of, let us do it (while conveniently leaving out how eye-wateringly expensive they are in comparison).
it means that even after negotiating much better terms than baseline we run into the fact that cloud providers just have a higher cost basis for the more premium/general product.
Nice writeup. All of the technical detail is great!
I'm curious about the process of getting colo space. Did you use a broker? Did you negotiate, and if so, how large was the difference in price between what you initially were quoted and what you ended up paying?
We reached out to almost every colocation space in SF/some in Fremont to get quotes. There wasn't a difference between the quote price and what we ended up paying, though we did negotiate terms + one-time costs.
For a workload of that size you would be able to negotiate private pricing with AWS or any cloud provider, not just CloudFlare. You can get a private pricing deal on S3 with as little as half a PB. Not saying that your overall expenses would be cheaper w/a CSP than DIY, but it's not exactly an apples to apples comparison of taking full retail prices for the CSPs against eBayed equipment and free labor (minus the cost of the pizza).
egress costs are the crux for AWS and they didn't budge when we tried to negotiate that with them, it's just entirely unusable for AI training otherwise. I think the cloudflare private quote is pretty representative of the cheaper end of managed object-bucket storage.
obv as we took on this project the delta between our cluster and the next-best option got smaller, in part bc the ability to host it ourselves gives us negotiating leverage, but managed bucket products are fundamentally overspecced for simple pretraining dumps. glacier does a nice job fitting the needs of archival storage for a good cost, but there's nothing similar for ML needs atm.
You could get pretty close to a cost of $1/TB/month using Hetzner's sx135 with 8x22TB, so 140TB in raidz1, for 240 eur. Maybe you get a better rate if you rent 200 of them. Someone else takes care of a lot of risks and you can sleep well at night.
yeah it's totally plausible that we go with something like this in the future. We have similar offers where we could separate out either the financing, the build-out, or both and just do the software.
(for Hetzner in particular it was a massive pain when we were trying to get CPU quotas with them for other data operations, and we prob don't want to have it in Europe, but it's been pretty easy to negotiate good quotes on similar deals locally now that we've shown we can do it ourselves)
I don't think Hetzner provides locations in SF. Those 100GBit connections don't do much if they need to connect outside the city the rest of the equipment is in, but maybe peering has gotten better and my views are outdated.
“Solve computer use” and previous work is audio conversation model. How do these go together? Is the idea to replace keyboard and mouse with spoken commands? a la Star Trek
just general research work. Once the recipes are efficient enough the modality is a smaller detail.
On the product side we're trying to orient more towards 'productive work assistant' rather than the default pull of audio models towards being an 'ai friend'.
>We threw a hard drive stacking party in downtown SF and got our friends to come, offering food and custom-engraved hard drives to all who helped. The hard drive stacking started at 6am and continued for 36 hours (with a break to sleep), and by the end of that time we had 30 PB of functioning hardware racked and wired up.
I've mentioned this story before, but we had massive drive failures when bringing up multiple disk arrays. We got them racked on a Friday afternoon, and then I wrote a quick and dirty shell script to read/write data back and forth between them over the weekend, set to kick in after they finished striping the raid arrays. By quick and dirty I mean there was no logging, and just a bunch of commands saved as .sh. Came in on Monday to find massive failures in all of the arrays, but no insight into whether they failed during the stripe or during the stress test. It was close to a 50% failure rate. Turned out to be a bad batch from the factory. Multiple customers of our vendor were complaining. All the drives were replaced by the manufacturer. It just delayed the storage being available to production. After that, not one of them failed in the next 12 months before I left for another job.
The disk failure rates are very low compared to a decade ago. I used to change more than a dozen disks every week back then. Now it's an eyebrow-raising event which I seldom see.
I think following Backblaze's hard disk stats is enough at this point.
Backblaze reports an annual failure rate of 1.36% [0]. Since their cluster uses 2,400 drives, they would likely see ~32 failures a year (extra ~$4,000 annual capex, almost negligible).
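The arithmetic behind that estimate can be reproduced in a few lines. This sketch takes the 1.36% rate and 2,400 drives from the comment and the $125 per-drive price mentioned elsewhere in the thread; it assumes the used drives fail at the same rate as Backblaze's fleet, which may not hold.

```python
def expected_annual_failures(drive_count, annual_failure_rate):
    """Expected number of drive failures per year for a fleet."""
    return drive_count * annual_failure_rate

def replacement_cost(failures, cost_per_drive):
    """Yearly spend if every failed drive is replaced at cost_per_drive."""
    return failures * cost_per_drive

failures = expected_annual_failures(2400, 0.0136)  # figures from the thread
cost = replacement_cost(failures, 125)             # $125/drive, from the thread
print(f"{failures:.1f} failures/year, ${cost:,.0f}/year")
```

That works out to roughly 33 drives and about $4,000 a year, in line with the comment.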
They mentioned the cluster uses used enterprise drives. I can see the desire to save money but agree, that is going to be one expensive mistake down the road.
I should also note personally for home cluster use, I learned quickly that used drives didn’t seem to make sense. Too much performance variability.
we don't have perfect metrics here but this seems to match our experience; a lot of failures happened shortly after install before the bulk of the data download onto the heap, so actual data loss is lower than hardware failure rates
Used drives make sense if maintaining your home server is a hobby. It's fun to diagnose and solve problems in home servers, and failing drives give me a reason to work on the server. (I'm only half-joking, it's kind of fun)
With 30PB it's likely they will simply let capacity fall as drives fail.
They apparently have zero need for redundancy in their use case, and the failure rate won't be high enough to take out a significant percentage of their capacity.
yeah, exactly! we have a 100G uplink, and then we use nginx secure links that we then just curl from the machines using HTTP. (funnily HTTPS adds overhead so we just pre-sign URLs)
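For readers unfamiliar with nginx secure links, here is a minimal sketch of how a client-side signer could produce such a URL. It assumes the server uses the `secure_link` module with `secure_link $arg_md5,$arg_expires;` and `secure_link_md5 "$secure_link_expires$uri <secret>";`. That config, the hostname, the path, and the secret are all illustrative guesses, not details from the post.

```python
import base64
import hashlib
import time

def sign_url(base_url, uri, secret, ttl_seconds, now=None):
    """Build a URL with md5 and expires arguments for nginx's secure_link module.

    Assumes the (hypothetical) server-side expression
        secure_link_md5 "$secure_link_expires$uri <secret>";
    """
    expires = int(now if now is not None else time.time()) + ttl_seconds
    digest = hashlib.md5(f"{expires}{uri} {secret}".encode()).digest()
    # nginx expects base64url without the trailing '=' padding.
    token = base64.urlsafe_b64encode(digest).decode().rstrip("=")
    return f"{base_url}{uri}?md5={token}&expires={expires}"

# Hypothetical host, object path, and secret; `now` is pinned for repeatability.
url = sign_url(
    "http://storage.example", "/shard-0001.tar",
    "hypothetical-secret", 3600, now=1_700_000_000,
)
print(url)
```

The signed URL can then be fetched with plain `curl` over HTTP; nginx rejects requests whose hash does not match or whose expiry has passed.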
DWDM tech improvements have outpaced nearly every other form of technology growth, so the same single pair of fiber that used to carry 10 Mbps can now carry 20 Tbps, which is a 2,000,000x multiplier. The same somewhat-fixed supply of fiber can go a very long way today, so the price pressure for access is less than you might expect.
I'm now envisioning a poster with a strand of fiber wearing aviators with large font size Impact font reading Dark Fiber with literal laser beams coming out of the eyes.
yeah that's why we started paying people near the second half- not super clearly stated in the blogpost, but the novelty definitely wore off with plenty of drives left to stack, so we switched strategies to get it done in time.
I think everyone who showed up for a couple hours as part of the party had a good time tho, and the engraved hard drives we were giving out weren't cheap :p
It's quite cheap to just store data at rest, but I'm pretty confused by the training and networking set up here. It sounds like from other comments that you're not going to put the GPUs in the same location, so you'll be doing all training over X 100 Gbps lines between sites? Aren't you going to end up totally bottlenecked during pretraining here?
yeah we just have the 100gig link, atm that's about all the gpu clusters can pull but we'll prob expand bandwidth and storage as we scale.
I guess worth noting that we do have a bunch of 4090s in the colo and it's been super helpful for e.g. calculating embeddings and such for data splits.
How did you arrive at the decision of not putting the GPU machines in the colo? Were the power costs going to be too high? Or do you just expect to need more physical access to the GPU machines vs the storage ones?
When I was working at sfcompute prior to this we saw multiple datacenters literally catch on fire bc the industry was not experienced with the power density of h100s. Our training chips just aren't a standard package in the way JBODs are.
I wonder if they'll go with "toploaders" - like Backblaze Storage Pods - later. They have better density and faster setup, as they don't have to screw in every drive.
They got used drives. I wonder if they did any testing? I've gotten used drives that were DOA, which showed up in tests - SMART tests, short and long, then writing pseudorandom data to verify capacity.
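The "write pseudorandom data, then verify" step can be sketched as below. This is a toy version that runs against a scratch file, not a raw device, and the chunk size and seed naming are arbitrary choices. A real burn-in would also need to bypass the page cache so reads actually hit the disk, which this sketch does not attempt.

```python
import hashlib
import os
import tempfile

CHUNK = 1 << 20  # 1 MiB per write

def pattern_chunk(seed, index, size=CHUNK):
    """Deterministic pseudorandom bytes for chunk `index`, derived with SHAKE-256."""
    return hashlib.shake_256(f"{seed}:{index}".encode()).digest(size)

def write_pattern(path, total_bytes, seed):
    """Fill the target with the seeded pattern and force it to stable storage."""
    with open(path, "wb") as f:
        for i in range(total_bytes // CHUNK):
            f.write(pattern_chunk(seed, i))
        f.flush()
        os.fsync(f.fileno())

def verify_pattern(path, total_bytes, seed):
    """Return the index of the first mismatching chunk, or None if all match."""
    with open(path, "rb") as f:
        for i in range(total_bytes // CHUNK):
            if f.read(CHUNK) != pattern_chunk(seed, i):
                return i
    return None

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "scratch.bin")  # stand-in for a drive under test
    write_pattern(target, 4 * CHUNK, seed="drive-A")
    clean_result = verify_pattern(target, 4 * CHUNK, seed="drive-A")
    # Simulate corruption by flipping one byte inside the third chunk.
    with open(target, "r+b") as f:
        f.seek(2 * CHUNK + 10)
        original = f.read(1)
        f.seek(2 * CHUNK + 10)
        f.write(bytes([original[0] ^ 0xFF]))
    corrupt_result = verify_pattern(target, 4 * CHUNK, seed="drive-A")

print(clean_result, corrupt_result)
```

Because the pattern is regenerated from the seed, nothing needs to be kept in memory between the write pass and the verify pass, and a drive that misreports its capacity shows up as a mismatch when later chunks overwrite earlier ones.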
yeah we're very interested in trying toploaders, we'll do a test rack next time we expand and switch to that if it goes well.
w.r.t. testing the main thing we did was try to buy a bit from each supplier a month or two ahead of time, so by the time we were doing the full build that rack was a known variable. We did find one drive lot which was super sketchy and just didn't include it in the bulk orders later. diversity in suppliers helps a lot with tail risk
"don't have to screw in every drive" is relative, but at least tool-less drive carriers are a thing now.
A lot of older toploaders from vendors like Dell are not tool-free. If you bought vendor drives and one fails, you RMA it and move on. However if you want to replace failed drives in the field, or want to go it alone from the start with refurbished drives... you'll be doing a lot of screwing. They're quite fragile and the plastic snaps easily. It's pretty tedious work.
Their electricity costs are $10K per month or about $120K per year. At an interest rate of 7% that's $1.7M of capital tied up in power bills.
At that rate I wonder if it makes sense to do a massive solar panel and battery installation. They're already hosting all of their compute and storage on prem, so why not bring electricity generation on prem as well?
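The $1.7M figure follows from treating the power bill as a perpetuity: the lump sum that, invested at the given rate, would pay the bill forever. A quick sketch, using the $10K/month and 7% from the comment:

```python
def capitalized_cost(annual_cost, interest_rate):
    """Present value of paying annual_cost forever, discounted at interest_rate."""
    return annual_cost / interest_rate

annual_power = 10_000 * 12  # $10K/month, from the comment
print(f"${capitalized_cost(annual_power, 0.07):,.0f}")
```

That is the budget a solar-plus-battery installation would have to beat, before accounting for its own maintenance and finite lifetime.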
And not just any video data, they specifically mentioned screen recordings for agentic computer uses. A very specific kind of video. My guess is they have a partnership with someone like Rewind.ai
Not included is the overhead of dealing with maintenance. S3/R2 generally don’t require a dedicated ops person for care and feeding. This type of setup will likely require someone to spend 5 hours a week dealing with it.
a) 5hrs/week is negligible compared to that potential AWS bill.
b) They seem tolerant of failures so it's not going to be anything like 5hrs/week of physical maintenance. It will be bursty though (eg. box died, time to replace it...) but assuming they have spares of everything sitting around / already racked it shouldn't be a big deal.
5h a week is basically 3 days a month. So if you have an issue that takes a couple of days per month to fix, which seems very fair, you're at that point.
I once had about three racks full of servers under my control, admittedly they weren't a ton of disks, but still the hardware maintenance effort was pretty much negligible over a few years (until it all went to the cloud).
The majority of server wrangling work I spent dealing with OS updates and, most annoyingly, OpenStack. But that's something you can't escape even if you run your stuff in the cloud...
With S3/R2 whatever, you do get away from it. You dump a bunch of files on them and then retrieve them. OS Updates, Disk Failures, OpenStack, additional hardware? Pssh, that's S3 company problem, not yours.
$LastJob we ran a ton of Azure Web App Containers, a lot of OS work no longer existed so it's possible with Cloud to remove a lot of OS toil.
The biggest part that is always missing in such comparisons is the employee salaries. In the calculation they give $354k of total cost per year. But now add the cost of staff in SF to operate that thing.
The biggest part missing from the opposing side is: Their view is very much rooted in the pre-Cloud hardware infrastructure world, where you'd pay sysadmins a full salary to sit in a dark room to monitor these servers.
The reality nowadays is: the on-prem staff is covered in the colo fees, which is split between everyone coloing in the location and reasonably affordable. The software-level work above that has massively simplified over the past 15 years, and effectively rivals the volume of work it would take to run workloads in the cloud (do you think managing IAM and Terraform is free?)
> do you think managing IAM and Terraform is free?
No, but I would argue that a SaaS offering, where the whole maintenance of the storage system is handled for you, actually requires fewer maintenance hours than hosting 30 PB in a colo.
In terraform you define the S3 bucket and run terraform apply. Afterwards the company's credit card is the limit. Setting up and operating 30 PB yourself is an entirely different story.
yeah colo help has been great, we had a power blip and without any hassle they covered the cost and installation of UPSes for every rack, without us needing to think abt it outside of some email coordination.
Small startup teams can sometimes get away with datacenter management being a side task that gets done on an as-needed basis at first. It will come with downtime and your stability won't be anywhere near as good as Cloudflare or AWS no matter how well you plan, though.
Every real-world colocation or self-hosting project I've ever been around has underestimated their downtime and rate of problems by at least an order of magnitude. The amount of time lost to driving to the datacenter, waiting for replacement parts to arrive, and scrambling to patch over unexpected failure modes is always much higher than expected.
There is a false sense of security that comes in the early days of the project when you think you've gotten past the big issues and developed a system that's reliable enough. The real test is always 1-2 years later when teams have churned, systems have grown, and the initial enthusiasm for playing with hardware has given way to deep groans whenever the team has to draw straws to see who gets to debug the self-hosted server setup this time or, worse, drive to the datacenter again.
fwiw our first test rack has been up for about a year now and the full cluster has been operational for training for the past ~6 months. having it right down the block from our office has been incredibly helpful, I am a bit worried abt what e.g. Fremont would look like if we expand there.
I think another big crux here is that there isn't really any notion of cluster-wide downtime, aside from e.g. a full datacenter power outage (which we've had ig, and now have UPSes in each rack kindly provided and installed by our datacenter). On the software/network level the storage isn't really coordinated in any manner, so failures of one machine only reflect as a degradation to the total theoretical bandwidth for training. This means that there's generally no scrambling and we can just schedule maintenance at our leisure. Last time I drew straws for maintenance I clocked a 30min round-trip to walk over and plug a crash cart into each of the 3 problematic machines to reboot and re-initialize and that was it.
Again having it right by the office is super nice, we'll need to really trust our kvm setup before considering anything offsite.
> The amount of time lost to driving to the datacenter, waiting for replacement parts to arrive, and scrambling to patch over unexpected failure modes is always much higher than expected.
I don't have this experience at all. Our colo handled almost all work. the only time i ever went to the server farm was to build out whole new racks. Even replacing servers the colo handled for us at good cost.
Our reliability came from software not hardware, though of course we had hundreds of spares sitting by, the defense in depth (multiple datacenters, each datacenter having 2 'brains' which could hotswap, each client multiply backed up on 3-4 machines)...
servers going down was fairly commonplace, servers dying was commonplace. i think once we had a whole rack outage when the switch died, and we flipped it to the backup.
Yes these things can be done and a lot cheaper than paying AWS.
> Our reliability came from software not hardware, though of course we had hundreds of spares sitting by, the defense in depth (multiple datacenters, each datacenter having 2 'brains' which could hotswap, each client multiply backed up on 3-4 machines)...
Of course, but building and managing the software stack, managing hundreds of spares across locations, spanning across datacenters, having a hotswap backup system is not a simple engineering endeavor.
The only way to reach this point is to invest a very large amount of time into it. It requires additional headcount or to put other work on pause.
I was trying to address the type of buildout in this article: Small team, single datacenter, gets the job done but comes with tradeoffs.
The other type of self buildout that you describe is ideal when you have a larger team and extra funds to allocate to putting it all together, managing it, and staffing it. However, once you do that it's not fair to exclude the cost of R&D and the ongoing headcount needs.
It's tempting to sweep it under the rug and call it part of the overall engineering R&D budget, but there is unquestionably a large cost associated with what you described as opposed to spinning up an AWS or Cloudflare account and having access to your battle-tested storage system a few minutes later.
To be fair, what's described here is much more robust than what you get with a simple AWS setup. At a minimum that's a multi-region setup, but if the DCs have different owners I'd even compare it to a multi-cloud setup.
not caring about redundancy/reliability is really nice, each healthy HDD is just the same +20TB of pretraining data and every drive lost is the same marginal cost.
I've built and maintained similar setups (10PB range). Honestly, you just shove disks into it, and when they fail you replace them. You need folks around to handle things like controller / infrastructure failure, but hopefully you're paying them to do other stuff, too.
You are under the assumption that only Ceph (and similar complex software) requires staff, whereas plain 30 PB can be operated basically just by rebooting from time to time.
I think that anyone with actual experience of operating thousands of physical disks in datacenters would challenge this assumption.
we have 6 months of experience operating thousands of physical disks in datacenters now! it's about a couple hours a month of employee time in steady-state.
How about all the other infrastructure? Since you are obviously not using the cloud, you must have massive amounts of GPUs and operating systems. All of that has to be working together, it's not just a matter of watching the physical disks and all is set.
Don't get me wrong, I buy the actual numbers regarding hardware costs, but in addition to that presenting the rest as basically a one man show in terms of maintenance hours is the point where I'm very sceptical.
oh we use cloud gpus, infiniband h100s absolutely aren't something we want to self-host. not aws tho, they're crazy overpriced; mithril and sfcompute!
we also use cloudflare extensively for everything that isn't the core heap dataset, the convenience of buckets is totally worth it for most day-to-day usage.
the heap is really just the main pretraining corpus and nothing else.
How is it going to work when the GPU is in the cloud and the storage is miles away in a local colo in SF down the street? I was under the impression that the GPUs have to go multiple times over the training dataset, which means transferring 30 PB multiple times in and out of the clouds. Is the data link even fast enough? How much are you charged in data transfer fees?
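A back-of-the-envelope check on the "is the link fast enough" question, as a sketch: it takes the 30 PB and 100 Gbps figures from the thread, uses decimal petabytes, and assumes the link can be driven at full line rate with no protocol overhead, which is optimistic.

```python
def transfer_days(dataset_bytes, link_bits_per_second, utilization=1.0):
    """Days needed to move dataset_bytes once over a link at the given utilization."""
    seconds = dataset_bytes * 8 / (link_bits_per_second * utilization)
    return seconds / 86_400

PB = 10**15  # decimal petabyte
full_pass = transfer_days(30 * PB, 100e9)
print(f"{full_pass:.1f} days per full pass at line rate")
```

So one complete pass over the corpus takes on the order of four weeks at best, and proportionally longer at lower utilization.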
Assuming that they end up hiring a full time ops person at 500k annually total costs (250k base for a data center wizard), then that's 42k extra a month, or ~$70k a month all-in. Still 200k per month lower than their next best offering.
This concern troll that everyone trots out when anyone brings up running their own gear is just exhausting. The hyperscalers have melted people’s brains to a point where they can’t even fathom running shit for themselves.
Yes, drives are going to fail. Yes, power supplies are going to burn out. Yes, god, you’re going to get new parts. Yes, you will have to actually talk to vendors.
Big. Deal. This shit is -not- hard.
For the amount of money you save by doing it like that, you should be clamoring to do it yourself. The concern trolling doesn’t make any sort of argument against it, it just makes you look lazy.
Very good point. There was something on the HN front page like this about self-hosted email, too.
I point out to people that AWS is between ten and one hundred times more expensive than a normal server. The response is "but what if I only need it to handle peak load three hours a day?" Then you still come out ahead with your own server.
We have multiple colo cages. We handle enough traffic - terabytes per second - that we'll never move those to cloud. Yet management always wants more cloud. While simultaneously complaining about how we're not making enough money.
I don't think the answer is so black-and-white. IMO This only realistically applies to larger companies or ones that either push lots of traffic or have a need for large amounts of compute/storage/etc.
But for smaller groups that don't have large/sustained workloads, I think they can absolutely save money compared to colo/dedicated servers using one of multiple different kinds of AWS services.
I have several customers that coast along just fine with a $50/mo EC2 instance or less, compared to hundreds per month for a dedicated server... I wouldn't call that "ten times" by any stretch.
> We kept this obsessively simple instead of using MinIO or Ceph because we didn’t need any of the features they provided; it’s much, much simpler to debug a 200-line program than to debug Ceph, and we weren’t worried about redundancy or sharding. All our drives were formatted with XFS.
What do you plan to do if you start getting corruption and bitrot? The complexity of S3 comes with a lot of hard guarantees for data integrity.
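One lightweight way to at least detect bitrot on plain XFS drives, without adopting Ceph or MinIO, is a checksum manifest that gets re-verified periodically. The sketch below is a generic illustration, not the authors' tooling; the file layout and the choice of SHA-256 are assumptions.

```python
import hashlib
import os
import tempfile

def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 without loading it all into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

def build_manifest(root):
    """Record a checksum for every file under root, keyed by relative path."""
    manifest = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = sha256_file(full)
    return manifest

def find_corrupt(root, manifest):
    """Return relative paths that are missing or whose contents changed."""
    bad = []
    for rel, expected in manifest.items():
        full = os.path.join(root, rel)
        if not os.path.exists(full) or sha256_file(full) != expected:
            bad.append(rel)
    return sorted(bad)

with tempfile.TemporaryDirectory() as root:
    # Two stand-in data files on a pretend drive.
    for name, payload in [("a.bin", b"alpha" * 1000), ("b.bin", b"beta" * 1000)]:
        with open(os.path.join(root, name), "wb") as f:
            f.write(payload)
    manifest = build_manifest(root)
    before = find_corrupt(root, manifest)
    # Simulate silent corruption of one file.
    with open(os.path.join(root, "b.bin"), "r+b") as f:
        f.write(b"X")
    after = find_corrupt(root, manifest)

print(before, after)
```

This only detects damage; recovery would still mean re-downloading the affected files from their original source, which fits a setup that deliberately skips redundancy.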
Aren't those netapp shelves pretty old at this point? See a lot of people recommending against them even for homelab type uses. You can get those 60 drive SuperMicro JBODs for pretty cheap now, and those aren't too old, would have been my choice.
Plus, the TCO is already way under the cloud equiv. so might as well spend a little more to get something much newer and more reliable
The cost difference is huge. Modern compute is just so much bigger than one would think. Hurricane Electric is incredibly cheap too. And Digital Realty in the city are pretty good. The funny thing is that the Monkeybrains guys will make room for you at $75/amp but that isn't competitive when a 9654 based system pulls 2+ amps at peak.
Still fun for someone wanting to stick a computer in a DC though.
Networking is surprisingly hard but we also settled for the cheapo life QSFP instead of the new Cisco switches that do 800 Gbps that are coming. Great writeup.
One that would be fun is about the mechanics of layout and cabling and that sort of thing. Learning all that manually was a pain in the ass. It's not just written down somewhere and I should have done it when I was doing it but now I no longer am doing it and so can't provide good photos.
My question isn't why do it yourself. Quick back-of-the-envelope math shows AWS being much more expensive. My question is why San Francisco? It's one of the most expensive real estate markets in the US (#2 residential, #1 commercial), and electricity is expensive. $0.71/kWh peak residential rate! A jaunt down 280 to San Jose's gonna be cheaper, at the expense of having to take that drive to get hands on. But I'm sure you can find someone who's capable of running a DC that lives in San Jose and needs a job so the SF team doesn't have to commute down to South Bay. Now obviously there's something to be said for having the rack in the office, I know of at least two (three, now) in San Francisco, it just seems like a weird decision if you're already worrying about money to the point of not using AWS.
Article says their recurring cost is $17.5k, they'll spend at least that amount in terms of human time tending to their cluster if they have to drive to it. It's also a question of magnitudes, going from $0.5m/mo to $0.05m/mo (hard costs plus the extra headaches of dealing with cluster) is an order of magnitude, even if you could cut another order of magnitude it wouldn't be as impactful.
Problem when you self-roll this is that you inevitably make mistakes and the cycle time of going down and up ruins everything. Access trumps everything.
You can get a DC guy but then he doesn't have much to do post setup and if you contract that you're paying mondo dollars anyway to get it right and it's a market for lemons (lots of bullshitters out there who don't know anything).
Is it correct that you have zero data redundancy? This may work for you if you're just hoarding videos from YouTube, but not for most people who require an assurance that their data is safe. Even for you, it may hurt proper benchmarking, reproducibility, and multi-iteration training if the parent source disappears.
Did you do any kind of redundancy at least (eg: putting every 10 disks in RAID 5 or RAID Z1)? Or I suppose your training application doesn't mind if you shed a few terabytes of data every so often?
atm we don't and we're a bit unsure whether it's a free lunch wrt adding complexity. there's a really nice property of having isolated hard drives where you can take any individual one and `sudo mount` it and you have a nice chunk of training data, and that's something anyone can feel comfortable touching without any onboarding to some software stack
I do appreciate the scrappiness of your solution. Used drives for a storage cluster is like /r/homelab on steroids. And since it's pretraining data, I suppose data integrity isn't critical.
Most venture-backed startups would have just paid the AWS or Cloudflare tax. I certainly hope your VCs appreciate how efficient you are being with their capital :)
$125/disk, 12k/mo depreciation cost which i assume means disk failures, so ~100 disks/mo or 1200/yr, which is half of their disks a year - seems like a lot.
It's an accounting term. You need to report the value of your company's assets each reporting cycle. This allows you to report company profit more accurately, since the 2400 drives are likely not worth what the company originally paid. It's stated as a tax write-off but people get confused by that term (they think X written off == X less tax paid). It's better to describe it as a way to more accurately report profit (which may end up with less company tax paid but obviously not 1:1 since company tax is not 100%).
So anyway you basically pretend you resold the drives today. Here they are assuming in 3 years time no one will pay anything for the drives. Somewhat reasonable to be honest since the setup's bespoke and you'll only get a fraction of the value of 3 year old drives if you resold them.
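To make the depreciation point concrete, here is a rough sketch using only numbers quoted in this thread ($125/disk, 2400 drives, $12k/mo, written down to zero over 3 years). It assumes plain straight-line depreciation, which the thread implies but does not state outright; the function name is just for illustration.

```python
# Straight-line depreciation using only figures quoted in this thread.
DRIVE_COUNT = 2400
PRICE_PER_DRIVE_USD = 125
MONTHLY_DEPRECIATION_USD = 12_000
USEFUL_LIFE_MONTHS = 36  # three years, written down to zero (assumed straight-line)


def straight_line_monthly(cost_usd: float, life_months: int, salvage_usd: float = 0.0) -> float:
    """Monthly depreciation when an asset loses value evenly over its life."""
    return (cost_usd - salvage_usd) / life_months


drive_cost = DRIVE_COUNT * PRICE_PER_DRIVE_USD
implied_asset_base = MONTHLY_DEPRECIATION_USD * USEFUL_LIFE_MONTHS
drive_monthly = straight_line_monthly(drive_cost, USEFUL_LIFE_MONTHS)

print(f"drive purchase cost:        ${drive_cost:,}")
print(f"asset base implied by 12k:  ${implied_asset_base:,}")
print(f"depreciation on drives only: ${drive_monthly:,.2f}/mo")
```

Under that assumption the $12k/mo corresponds to the purchase price being spread over 36 months, not to roughly 100 drives dying every month.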
The networking stuff seems....odd.
'Networking was a substantial cost and required experimentation. We did not use DHCP as most enterprise switches don’t support it and we wanted public IPs for the nodes for convenient and performant access from our servers. While this is an area where we would have saved time with a cloud solution, we had our networking up within days and kinks ironed out within ~3 weeks.'
Where does the switch choice come into whether you DHCP? Wth would you want public IPs.
It really feels like they wanted 30 PB of storage accessible over HTTP and literally nothing else. No redundancy, no NAT, dead simple nginx config + some code to track where to find which file on the filesystem. I like that.
This was not written by a network person, quite clearly. Hopefully it's just a misunderstanding, otherwise they do need someone with literally any clue about networks.
yeah misunderstanding we'll update the post-- separately it's true that we aren't network specialists and the network wrangling was prob disproportionately hard for us/ shouldn't have taken so long.
I assume your actual training is being done somewhere else? Did you try getting colocation space in the same datacentre as somewhere with the compute - it would have reduced your internet costs even further.
yeah the cost calculus is very different for gpus, it absolutely makes sense for us to be using cloud there. also hardly any datacenters can support the power density, esp in downtown sf
> Wth would you want public IPs.
So anyone can download 30 PB of data with ease of course.
They didn't seem to want to use a router. Purpose-built 100 Gbps routers are a bit expensive, but you can also turn a computer into one.
Many switches are L3 capable, making them in effect a router. Considering their internet lines appear to be hooked up to their 100 Gbps switch, I'd guess this is one of the L3 ones.
> Wth would you want public IPs.
Possibly to avoid needing NAT (or VPN) gateway that can handle 100Gbps.
No DHCP doesn't mean public IPs nor impact the need for NAT, it just means the hosts have to be explicitly configured with IP addresses, default gateways if they need egress, and DNS.
Those IPs you end up assigning manually could be private ones or routable ones. If private, authorized traffic could be bridged onto the network by anything, such as a random computer with 2 NICs, one of which is connected eventually to the Internet and one of which is on the local network.
If public, a firewall can control access just as well as using NAT can.
I know, I was specifically answering the question of "why the hell would you want public IPs".
I don't know why their network setup wouldn't support DHCP, that's extremely common especially in "enterprise" switches via DHCP forwarding.
I don't know what they're doing, but Mikrotik can perhaps route that → https://mikrotik.com/product/ccr2216_1g_12xs_2xq#fndtn-testr... and is about the cost of their used thing.
And I think this would be a banger for IPv6 if they really "need" public IPs.
Exactly what I came in to say, CCR2216 can do this for < $2k, and does it well.
I mean generally above a certain size of deployment DHCP is much more trouble than it's worth.
DHCP is really only worth it when your hosts are truly dynamic (i.e. not controlled by you). Otherwise it's a lot easier to handle IP allocation as part of the asset lifecycle process.
Heck even my house IoT network is all static IPs because at the small scale it's much more robust to not depend on my home router for address assignment - replacing a smart bulb is a big enough event, so DHCP is solely for bootstrapping in that case.
At the enterprise level unpacking a server and recording the asset IDs etc is the time to assign IP addresses.
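A minimal sketch of what "assign IPs as part of the asset lifecycle" could look like, using Python's standard `ipaddress` module. The asset IDs, the helper name `assign_static_ips`, and the subnet (a reserved documentation range) are all hypothetical, not anything from the article.

```python
import ipaddress


def assign_static_ips(asset_ids, subnet, reserved=1):
    """Map each asset ID to a fixed host address in `subnet`, in inventory order.

    `reserved` is how many leading host addresses to skip (e.g. the gateway).
    """
    network = ipaddress.ip_network(subnet)
    hosts = list(network.hosts())[reserved:]
    if len(asset_ids) > len(hosts):
        raise ValueError(f"{subnet} has room for {len(hosts)} assets, got {len(asset_ids)}")
    return {asset: hosts[i] for i, asset in enumerate(asset_ids)}


# Hypothetical inventory; 192.0.2.0/24 is a documentation-only range.
inventory = ["node-001", "node-002", "node-003"]
network = ipaddress.ip_network("192.0.2.0/24")
gateway = next(network.hosts())
for asset, address in assign_static_ips(inventory, "192.0.2.0/24").items():
    print(f"{asset}: address={address}/{network.prefixlen} gateway={gateway}")
```

The same table works whether the range is private or publicly routable, which is the point made above: skipping DHCP only changes how addresses get onto the hosts.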
As a fan of eBay for homelab gear, I appreciate the can-do scrappiness of doing it for a startup.
To adapt the old enterprise information infrastructure saying for startups:
"Nobody Ever Got Fired for Buying eBay"
Just wanted to say, thanks for doing this! Now the old rant...
I started my career when on-prem was the norm and remember so much trouble. When you have long-lived hardware, eventually, no matter how hard you try, you just start to treat it as a pet and state naturally accumulates. Then, as the hardware starts to be not good enough, you need to upgrade. There's an internal team that presents the "commodity" interface, so you have to pick out your new hardware from their list and get the cost approved (it's a lot harder to just spend a little more and get a little more). Then your projects are delayed by them racking the new hardware and you properly "un-petting" your pets so they can respawn on the new devices, etc.
Anyways, when cloud came along, I was like, yeah we're switching and never going back. Buuut, come to find out that's part of the master plan: it's a no-brainer good deal until you and everyone in your org/company/industry forgets HTF to rack their own hardware, and then it starts to go from no-brainer to brainer. And basically unless you start to pull back and rebuild that muscle, it will go from brainer to no-brainer bad deal. So thanks for building this muscle!
Yeah from memory on-prem was always cheaper, it just removed a lot of logistic obstacles and made everything convenient under one bill.
IIRC the wisdom of the time cloud started becoming popular was to always be on-prem and use cloud to scale up when demand spiked. But over time temporarily scaling up became permanent, and devs became reliant on instantly spawning new machines for things other than spikes in demand and now everyone defaults to cloud and treats it as the baseline. In the process we lost the grounding needed to assess the real cost of things and predictably the cost difference between cloud and on-prem has only widened.
we're in a pretty unique situation in that very early on we fundamentally can't afford the hyperscaler clouds to cover operations, so we're forced to develop some expertise. turned out to be reasonably chill and we'll prob stick with it for the foreseeable future, but we have seen a little bit of the state-creep you mention so tbd.
Wanna see us do it again?
I'm not op, but thanks for this. Like I mentioned in another comment, the wholesale move to the cloud has caused so many skills to become atrophied. And it's good that someone is starting to exercise that skill again, like you said. The hyperscalers are mostly to blame for this, the marketing FUD being that you can't possibly do it yourself, there are too many things to keep track of, let us do it (while conveniently leaving out how eye-wateringly expensive they are in comparison).
The other thing the cloud does not let you do is make trade offs.
Sometimes you can afford not to have a triple-redundant 1000GB network, or a simple single machine with RAID may have acceptable downtime.
yeah this
it means that even after negotiating much better terms than baseline we run into the fact that cloud providers just have a higher cost basis for the more premium/general product.
Everyone should give AWS the middle finger and start doing this. Beyond cost, it's a matter of sovereignty over one's computing and data.
Nice writeup. All of the technical detail is great!
I'm curious about the process of getting colo space. Did you use a broker? Did you negotiate, and if so, how large was the difference in price between what you initially were quoted and what you ended up paying?
We reached out to almost every colocation space in SF/some in Fremont to get quotes. There wasn't a difference between the quote price and what we ended up paying, though we did negotiate terms + one-time costs.
For a workload of that size you would be able to negotiate private pricing with AWS or any cloud provider, not just CloudFlare. You can get a private pricing deal on S3 with as little as half a PB. Not saying that your overall expenses would be cheaper w/a CSP than DIY, but it's not exactly an apples to apples comparison of taking full retail prices for the CSPs against eBayed equipment and free labor (minus the cost of the pizza).
egress costs are the crux for AWS and they didn't budge when we tried to negotiate that with them, it's just entirely unusable for AI training otherwise. I think the cloudflare private quote is pretty representative of the cheaper end of managed object-bucket storage.
obv as we took on this project the delta between our cluster and the next-best option got smaller, in part bc the ability to host it ourselves gives us negotiating leverage, but managed bucket products are fundamentally overspecced for simple pretraining dumps. glacier does a nice job fitting the needs of archival storage for a good cost, but there's nothing similar for ML needs atm.
You could get pretty close to the cost 1$/TB/month using Hetzner's sx135 with 8x22TB so 140TB in raidz1 for 240 eur. Maybe you get a better rate if you rent 200 of them. Someone else takes care of a lot of risks and you can sleep well at night
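For a sense of scale, a quick back-of-the-envelope version of that rental option, using only the figures quoted above (240 EUR per server, 140 TB usable each) and taking 30 PB as 30,000 TB. No currency conversion or volume discount is assumed.

```python
import math

SERVER_PRICE_EUR = 240       # per server per month, as quoted above
USABLE_TB_PER_SERVER = 140   # 8x22TB in raidz1, as quoted above
TARGET_TB = 30_000           # 30 PB, taking 1 PB = 1000 TB


def eur_per_tb_month(price_eur: float, usable_tb: float) -> float:
    """Monthly rental cost per usable terabyte."""
    return price_eur / usable_tb


servers_needed = math.ceil(TARGET_TB / USABLE_TB_PER_SERVER)
monthly_total_eur = servers_needed * SERVER_PRICE_EUR

print(f"{eur_per_tb_month(SERVER_PRICE_EUR, USABLE_TB_PER_SERVER):.2f} EUR/TB/month")
print(f"{servers_needed} servers, {monthly_total_eur:,} EUR/month for 30 PB")
```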
yeah it's totally plausible that we go with something like this in the future. We have similar offers where we could separate out either the financing, the build-out, or both and just do the software.
(for Hetzner in particular it was a massive pain when we were trying to get CPU quotas with them for other data operations, and we prob don't want to have it in Europe, but it's been pretty easy to negotiate good quotes on similar deals locally now that we've shown we can do it ourselves)
I don't think Hetzner provides locations in SF. Those 100GBit connections don't do much if they need to connect outside the city the rest of the equipment is in, but maybe peering has gotten better and my views are outdated.
You're good. The speed of light through a glass fiber is still just as slow as it ever was.
Had the pleasure of helping rack drives! Nothing more fun than an insane amount of data :P
Thanks for helping!!!
“Solve computer use” and previous work is audio conversation model. How do these go together? Is the idea to replace keyboard and mouse with spoken commands? a la Star Trek
Make me transparent aluminum!
just general research work. Once the recipes are efficient enough the modality is a smaller detail.
On the product side we're trying to orient more towards 'productive work assistant' rather than the default pull of audio models towards being an 'ai friend'.
>We threw a hard drive stacking party in downtown SF and got our friends to come, offering food and custom-engraved hard drives to all who helped. The hard drive stacking started at 6am and continued for 36 hours (with a break to sleep), and by the end of that time we had 30 PB of functioning hardware racked and wired up.
So how many actual man hours for 2400 drives?
around 250
I love this story. This is true hacking and startup cost awareness.
Thanks!! :)
No mention of disk failure rates? curious how it's holding up after a few months
I've mentioned this story before, but we had massive drive failures when bringing up multiple disk arrays. We get them racked on a friday afternoon, and then I wrote a quick and dirty shell script to read/write data back and forth between them over the weekend that was to kick in after they finished striping the raid arrays. By quick and dirty I mean there was no logging, and just a bunch of commands saved as .sh. Came in on Monday to find massive failures in all of the arrays, but no insight into when they failed during the stripe or during stressing them. It was close to 50% failure rate. Turned out to be a bad batch from the factory. Multiple customers of our vendor were complaining. All the drives were replaced by the manufacturer. It just delayed the storage being available to production. After that, not one of them failed in the next 12 months before I left for another job.
> next 12 months before I left for another job
Heh, that's a clever solution to the problem of managing storage through the full 10 year disk lifecycle.
The disk failure rates are very low when compared to a decade ago. I used to change more than a dozen disks every week a decade ago. Now it's an eyebrow raising event which I seldom see.
I think following Backblaze's hard disk stats is enough at this point.
Backblaze reports an annual failure rate of 1.36% [0]. Since their cluster uses 2,400 drives, they would likely see ~32 failures a year (extra ~$4,000 annual capex, almost negligible).
[0] https://www.backblaze.com/cloud-storage/resources/hard-drive...
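The arithmetic behind that estimate, as a small sketch. The 1.36% AFR, 2,400 drives, and $125/disk are the figures quoted in this thread; the 3% row is purely illustrative, since nobody here has a measured rate for used drives.

```python
DRIVE_COUNT = 2400
PRICE_PER_DRIVE_USD = 125


def expected_annual_failures(drive_count: int, afr: float) -> float:
    """Expected drive failures per year for a given annualized failure rate."""
    return drive_count * afr


def annual_replacement_cost(drive_count: int, afr: float, price_usd: float) -> float:
    """Expected yearly spend on replacement drives (hardware only, no labor)."""
    return expected_annual_failures(drive_count, afr) * price_usd


for label, afr in [("quoted Backblaze AFR", 0.0136), ("illustrative 3% AFR", 0.03)]:
    failures = expected_annual_failures(DRIVE_COUNT, afr)
    cost = annual_replacement_cost(DRIVE_COUNT, afr, PRICE_PER_DRIVE_USD)
    print(f"{label}: ~{failures:.1f} failures/yr, ~${cost:,.0f}/yr")
```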
Their rate will probably be higher since they are utilizing used drives. From the spec:
2,400 drives. Mostly 12TB used enterprise drives (3/4 SATA, 1/4 SAS). The JBOD DS4246s work for either.
They mentioned the cluster being used enterprise drives, I can see the desire to save money but agree, that is going to be one expensive mistake down the road.
I should also note personally for home cluster use, I learned quickly that used drives didn’t seem to make sense. Too much performance variability.
If I remember correctly, most drives either:
1. Fail in the first X amount of time
2. Fail towards the end of their rated lifespan
So buying used drives doesn't seem like the worst idea to me. You've already filtered out the drives that would fail early.
Disclaimer: I have no idea what I'm talking about
we don't have perfect metrics here but this seems to match our experience; a lot of failures happened shortly after install before the bulk of the data download onto the heap, so actual data loss is lower than hardware failure rates
Over in hardware-land we call this "the bathtub curve".
in a datacenter context failure rates are just a remote-hands recurring cost so it's not too bad with front-loaders
e.g. have someone show up to the datacenter with a grocery list of slot indices and a cart of fresh drives every few months.
Used drives make sense if maintaining your home server is a hobby. It's fun to diagnose and solve problems in home servers, and failing drives give me a reason to work on the server. (I'm only half-joking, it's kind of fun)
good point
HDDs - are never one time costs. Do datacenters also offer ordering and replacing HDDs?
With 30PB it's likely they will simply let capacity fall as drives fail.
They apparently have zero need for redundancy in their use case, and the failure rate won't be high enough to take out a significant percentage of their capacity.
Where does one get “90 million hours of video data”?
I’m also curious about this. I don’t recall seeing that mentioned in the article
So how do they get this data to the GPUs now...? Just run it over the public internet to the datacenter?
yeah, exactly! we have a 100G uplink, and then we use nginx secure links that we then just curl from the machines using HTTP. (funnily HTTPS adds overhead so we just pre-sign URLs)
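For anyone unfamiliar with nginx secure links, here is a rough sketch of the client-side signing. The comment above doesn't say which expression they hash, so the nginx directives in the code comment, the secret, and the path are all assumptions; the token format (binary MD5, URL-safe base64, padding stripped) is what nginx's `secure_link_md5` expects.

```python
import base64
import hashlib
import time


def sign_url(uri: str, secret: str, expires: int) -> str:
    """Build a pre-signed path for nginx's secure_link module.

    The hashed string must match the nginx side exactly. Assumed config:
        secure_link     $arg_md5,$arg_expires;
        secure_link_md5 "$secure_link_expires$uri <secret>";
    """
    digest = hashlib.md5(f"{expires}{uri} {secret}".encode()).digest()
    # nginx expects URL-safe base64 with the trailing '=' padding removed.
    token = base64.urlsafe_b64encode(digest).decode().rstrip("=")
    return f"{uri}?md5={token}&expires={expires}"


# Hypothetical path and secret; the link is valid for one hour from now.
one_hour_from_now = int(time.time()) + 3600
print(sign_url("/data/shard-0001.tar", "hypothetical-secret", one_hour_from_now))
```

The resulting path can be fetched over plain HTTP with curl; the signature only gates access and does not encrypt the transfer.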
7.5k for zayo 100gig so that's like half of the MRC
They can rent a dark fiber for themselves for that distance, and it'll be cheap.
However, as they noted they use 100gbps capacity from their ISP.
Does San Francisco really still have dark fiber? That 90s bubble sure did overshoot demand.
DWDM tech improvements have outpaced nearly every other form of technology growth, so the same single pair of fiber that used to carry 10 Mbps can now carry 20 Tbps, which is a 2,000,000x multiplier. The same somewhat-fixed supply of fiber can go a very long way today, so the price pressure for access is less than you might expect.
I think these days folks say "dark fiber" for any kind of connection you buy. It bothers me too.
I meant a “single mode, non terminated fiber optic cable from point to point”. In other words, your own cable without any other traffic on it.
A shared one will be metro Ethernet in my parlance.
We want to get darkfiber from the datacenter to the office. I love 100Gbps
I'm now envisioning a poster with a strand of fiber wearing aviators with large font size Impact font reading Dark Fiber with literal laser beams coming out of the eyes.
Cool write-up.
I do feel sorry for the friends that got suckered into doing a bunch of grunt work for free though
yeah that's why we started paying people near the second half- not super clearly stated in the blogpost, but the novelty definitely wore off with plenty of drives left to stack, so we switched strategies to get it done in time.
I think everyone who showed up for a couple hours as part of the party had a good time tho, and the engraved hard drives we were giving out weren't cheap :p
It's quite cheap to just store data at rest, but I'm pretty confused by the training and networking set up here. It sounds like from other comments that you're not going to put the GPUs in the same location, so you'll be doing all training over X 100 Gbps lines between sites? Aren't you going to end up totally bottlenecked during pretraining here?
yeah we just have the 100gig link, atm that's about all the gpu clusters can pull but we'll prob expand bandwidth and storage as we scale.
I guess worth noting that we do have a bunch of 4090s in the colo and it's been super helpful for e.g. calculating embeddings and such for data splits.
How did you arrive at the decision of not putting the GPU machines in the colo? Were the power costs going to be too high? Or do you just expect to need more physical access to the GPU machines vs the storage ones?
When I was working at sfcompute prior to this we saw multiple datacenters literally catch on fire bc the industry was not experienced with the power density of h100s. Our training chips just aren't a standard package in the way JBODs are.
I wonder if they'll go with "toploaders" - like Backblaze Storage Pods - later. They have better density and faster setup, as they don't have to screw in every drive.
They got used drives. I wonder if they did any testing? I've gotten used drives that were DOA, which showed up in tests - SMART tests, short and long, then writing pseudorandom data to verify capacity.
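The "write pseudorandom data, then verify" step can be sketched roughly like this. It is a toy version that runs against an ordinary file; pointing it at a real drive would be destructive, and the function names are made up for illustration. It needs Python 3.9+ for `random.Random.randbytes`.

```python
import os
import random
import tempfile

BLOCK_SIZE = 1 << 20  # 1 MiB per write


def _blocks(seed: int, total_bytes: int):
    """Yield a reproducible pseudorandom byte stream in BLOCK_SIZE chunks."""
    rng = random.Random(seed)
    remaining = total_bytes
    while remaining > 0:
        n = min(BLOCK_SIZE, remaining)
        yield rng.randbytes(n)
        remaining -= n


def write_pattern(path: str, total_bytes: int, seed: int) -> None:
    """Fill `path` with the seeded stream and force it out to the device."""
    with open(path, "wb") as f:
        for block in _blocks(seed, total_bytes):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())


def verify_pattern(path: str, total_bytes: int, seed: int) -> bool:
    """Regenerate the same stream and compare it with what is on disk."""
    with open(path, "rb") as f:
        for block in _blocks(seed, total_bytes):
            if f.read(len(block)) != block:
                return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "pattern.bin")
    write_pattern(target, 3 * BLOCK_SIZE, seed=42)
    print("verified:", verify_pattern(target, 3 * BLOCK_SIZE, seed=42))
```

Because the data is regenerated from the seed, nothing has to be kept in memory, and a drive that misreports its capacity shows up as a mismatch once writes wrap around. On a real test the read-back should bypass the page cache, which this sketch does not attempt.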
yeah we're very interested in trying toploaders, we'll do a test rack next time we expand and switch to that if it goes well.
w.r.t. testing the main thing we did was try to buy a bit from each supplier a month or two ahead of time, so by the time we were doing the full build that rack was a known variable. We did find one drive lot which was super sketchy and just didn't include it in the bulk orders later. diversity in suppliers helps a lot with tail risk
"don't have to screw in every drive" is relative, but at least tool-less drive carriers are a thing now.
A lot of older toploaders from vendors like Dell are not tool-free. If you bought vendor drives and one fails, you RMA it and move on. However if you want to replace failed drives in the field, or want to go it alone from the start with refurbished drives... you'll be doing a lot of screwing. They're quite fragile and the plastic snaps easily. It's pretty tedious work.
Used Supermicro machines of this generation are very cheap (all things considered)
https://www.theserverstore.com/supermicro-superstorage-ssg-6...
Their electricity costs are $10K per month or about $120K per year. At an interest rate of 7% that's $1.7M of capital tied up in power bills.
At that rate I wonder if it makes sense to do a massive solar panel and battery installation. They're already hosting all of their compute and storage on prem, so why not bring electricity generation on prem as well?
At 120K per year over the three year accounting life of the hardware, that's 360k... how do you get to 1.7M?
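The two numbers seem to answer different questions. The $1.7M looks like a perpetuity valuation (the capital that would yield $120K/year forever at 7%), while $360K is three years of bills added up. That reading is an assumption about the parent comment's method; the sketch below just shows both calculations side by side.

```python
ANNUAL_POWER_COST_USD = 120_000
INTEREST_RATE = 0.07
HARDWARE_LIFE_YEARS = 3


def perpetuity_value(annual_payment: float, rate: float) -> float:
    """Present value of paying `annual_payment` every year forever."""
    return annual_payment / rate


def annuity_value(annual_payment: float, rate: float, years: int) -> float:
    """Present value of paying `annual_payment` at the end of each of `years` years."""
    return annual_payment * (1 - (1 + rate) ** -years) / rate


print(f"perpetuity at 7%:        ${perpetuity_value(ANNUAL_POWER_COST_USD, INTEREST_RATE):,.0f}")
print(f"3 years, undiscounted:   ${ANNUAL_POWER_COST_USD * HARDWARE_LIFE_YEARS:,.0f}")
print(f"3 years, discounted 7%:  ${annuity_value(ANNUAL_POWER_COST_USD, INTEREST_RATE, HARDWARE_LIFE_YEARS):,.0f}")
```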
Let's just say we're not seeing all of these sudden private nuclear reactor investments for no reason.
great write up, really appreciate the explanations / showing the process
But where do you get 90 million hours worth of video data?
And not just any video data, they specifically mentioned screen recordings for agentic computer uses. A very specific kind of video. My guess is they have a partnership with someone like Rewind.ai
Arrr matey
Shows how crazy cheap on prem can be. tips hat
Not included is overhead of dealing with maintenance. S3/R2 generally don’t require OPS type dedicated to care and feeding. This type of setup will likely require someone to spend 5 hours a week dealing with it.
a) 5hrs/week is negligible compared to that potential AWS bill.
b) They seem tolerant of failures so it's not going to be anything like 5hrs/week of physical maintenance. It will be bursty though (eg. box died, time to replace it...) but assuming they have spares of everything sitting around / already racked it shouldn't be a big deal.
True, this is a large reason why we chose to have the datacenter a couple blocks away from the office.
Why 5h a week? Just for hardware?
5h a week is basically 3 days a month. So if you have an issue that takes a couple of days per month to fix, which seems very fair, you're at that point.
I once had about three racks full of servers under my control, admittedly they weren't a ton of disks, but still the hardware maintenance effort was pretty much negligible over a few years (until it all went to the cloud).
The majority of server wrangling work I spent dealing with OS updates and, most annoyingly, OpenStack. But that's something you can't escape even if you run your stuff in the cloud...
With S3/R2 whatever, you do get away from it. You dump a bunch of files on them and then retrieve them. OS Updates, Disk Failures, OpenStack, additional hardware? Pssh, that's S3 company problem, not yours.
$LastJob we ran a ton of Azure Web App Containers, a lot of OS work no longer existed so it's possible with Cloud to remove a lot of OS toil.
And this is actually relatively expensive.
tips hat back
i am still confused what their software stack is, they dont use ceph but bought netapp, so they use nfs?
The NetApps are just disk shelves, can plug it into a SAS controller and use whatever software stack you please.
but they have multiple head nodes, so its some distributed setup or just active/passive type thing?
I'm guessing the client software (outside the dc) is responsible for enumerating all the nodes which all get their own IP.
The biggest part that is always missing in such comparisons is the employee salaries. In the calculation they give $354k/year of total cost per year. But now add the cost of staff in SF to operate that thing.
The biggest part missing from the opposing side is: Their view is very much rooted in the pre-Cloud hardware infrastructure world, where you'd pay sysadmins a full salary to sit in a dark room to monitor these servers.
The reality nowadays is: the on-prem staff is covered in the colo fees, which is split between everyone coloing in the location and reasonably affordable. The software-level work above that has massively simplified over the past 15 years, and effectively rivals the volume of work it would take to run workloads in the cloud (do you think managing IAM and Terraform is free?)
> do you think managing IAM and Terraform is free?
No, but I would argue that a SaaS offering, where the whole maintenance of the storage system is maintained for you actually requires less maintenance hours than hosting 30 PB in a colo.
In terraform you define the S3 bucket and run terraform apply. Afterwards the company's credit card is the limit. Setting up and operating 30 PB yourself is an entirely different story.
yeah colo help has been great, we had a power blip and without any hassle they covered the cost and installation of UPSes for every rack, without us needing to think abt it outside of some email coordination.
Small startup teams can sometimes get away with datacenter management being a side task that gets done on an as-needed basis at first. It will come with downtime and your stability won't be anywhere near as good as Cloudflare or AWS no matter how well you plan, though.
Every real-world colocation or self-hosting project I've ever been around has underestimated their downtime and rate of problems by at least an order of magnitude. The amount of time lost to driving to the datacenter, waiting for replacement parts to arrive, and scrambling to patch over unexpected failure modes is always much higher than expected.
There is a false sense of security that comes in the early days of the project when you think you've gotten past the big issues and developed a system that's reliable enough. The real test is always 1-2 years later when teams have churned, systems have grown, and the initial enthusiasm for playing with hardware has given way to deep groans whenever the team has to draw straws to see who gets to debug the self-hosted server setup this time or, worse, drive to the datacenter again.
fwiw our first test rack has been up for about a year now and the full cluster has been operational for training for the past ~6 months. having it right down the block from our office has been incredibly helpful, I am a bit worried abt what e.g. Fremont would look like if we expand there.
I think another big crux here is that there isn't really any notion of cluster-wide downtime, aside from e.g. a full datacenter power outage (which we've had ig, and now have UPSes in each rack kindly provided and installed by our datacenter). On the software/network level the storage isn't really coordinated in any manner, so failures of one machine only reflect as a degradation to the total theoretical bandwidth for training. This means that there's generally no scrambling and we can just schedule maintenance at our leisure. Last time I drew straws for maintenance I clocked a 30min round-trip to walk over and plug a crash cart into each of the 3 problematic machines to reboot and re-initialize and that was it.
Again having it right by the office is super nice, we'll need to really trust our kvm setup before considering anything offsite.
> The amount of time lost to driving to the datacenter, waiting for replacement parts to arrive, and scrambling to patch over unexpected failure modes is always much higher than expected.
I don't have this experience at all. Our colo handled almost all work. the only time i ever went to the server farm was to build out whole new racks. Even replacing servers the colo handled for us at good cost.
Our reliability came from software not hardware, though of course we had hundreds of spares sitting by, the defense in depth (multiple datacenters, each datacenter having 2 'brains' which could hotswap, each client multiply backed up on 3-4 machines)...
servers going down were fairly common place, servers dying were commonplace. i think once we had a whole rack outage when the switch died, and we flipped it to the backup.
Yes these things can be done and a lot cheaper than paying AWS.
> Our reliability came from software not hardware, though of course we had hundreds of spares sitting by, the defense in depth (multiple datacenters, each datacenter having 2 'brains' which could hotswap, each client multiply backed up on 3-4 machines)...
Of course, but building and managing the software stack, managing hundreds of spares across locations, spanning across datacenters, having a hotswap backup system is not a simple engineering endeavor.
The only way to reach this point is to invest a very large amount of time into it. It requires additional headcount or to put other work on pause.
I was trying to address the type of buildout in this article: Small team, single datacenter, gets the job done but comes with tradeoffs.
The other type of self buildout that you describe is ideal when you have a larger team and extra funds to allocate to putting it all together, managing it, and staffing it. However, once you do that it's not fair to exclude the cost of R&D and the ongoing headcount needs.
It's tempting to sweep it under the rug and call it part of the overall engineering R&D budget, but there is without question a large cost associated with what you described as opposed to spinning up an AWS or Cloudflare account and having access to your battle-tested storage system a few minutes later.
To be fair, what's described here is much more robust than what you get with a simple AWS setup. At a minimum that's a multi-region setup, but if the DCs have different owners I'd even compare it to a multi-cloud setup.
not caring about redundancy/reliability is really nice, each healthy HDD is just the same +20TB of pretraining data and every drive lost is the same marginal cost.
When you lose 20 TB of video, where do you get 20 TB of new video to replace it?
I've built and maintained similar setups (10PB range). Honestly, you just shove disks into it, and when they fail you replace them. You need folks around to handle things like controller / infrastructure failure, but hopefully you're paying them to do other stuff, too.
someone has to go and power-cycle the machines every couple months it's chill, that's the point of not using ceph
You are under the assumption that only Ceph (and similar complex software) requires staff, whereas plain 30 PB can be operated basically just by rebooting from time to time.
I think that anyone with actual experience of operating thousands of physical disks in datacenters would challenge this assumption.
we have 6 months of experience operating thousands of physical disks in datacenters now! it's about a couple hours a month of employee time in steady-state.
How about all the other infrastructure. Since you are obviously not using the cloud, you must have massive amounts of GPUs and operating systems. All of that has been working together, it's not just keep watching for the physical disks and all is set.
Don't get me wrong, I buy the actual numbers regarding hardware costs, but in addition to that presenting the rest as basically a one man show in terms of maintenance hours is the point where I'm very sceptical.
oh we use cloud gpus, infiniband h100s absolutely aren't something we want to self-host. not aws tho, they're crazy overpriced; mithril and sfcompute!
we also use cloudflare extensively for everything that isn't the core heap dataset, the convenience of buckets is totally worth it for most day-to-day usage.
the heap is really just the main pretraining corpus and nothing else.
How is it going to work when the GPU is in the cloud and the storage is miles away in a local colo in SF down the street? I was under the impression that the GPUs have to go multiple times over the training dataset, which means transferring 30 PB multiple times in and out of the clouds. Is the data link even fast enough? How much are you charged for data transfer fees?
Assuming that they end up hiring a full time ops person at 500k annually total costs (250k base for a data center wizard), then that's 42k extra a month, or ~$70k a month in total. Still 200k per month lower than their next best offering.
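One way to read the ~$70k figure is as the monthly ops hire plus the cluster's own running cost, taking the $354k/year total quoted elsewhere in this thread. That pairing is an assumption about the comment above, not something it states; a quick check:

```python
OPS_HIRE_ANNUAL_USD = 500_000   # fully loaded cost assumed in the comment above
CLUSTER_ANNUAL_USD = 354_000    # total yearly cost figure quoted in this thread
MONTHS_PER_YEAR = 12

ops_monthly = OPS_HIRE_ANNUAL_USD / MONTHS_PER_YEAR
cluster_monthly = CLUSTER_ANNUAL_USD / MONTHS_PER_YEAR
total_monthly = ops_monthly + cluster_monthly

print(f"ops hire:  ${ops_monthly:,.0f}/mo")
print(f"cluster:   ${cluster_monthly:,.0f}/mo")
print(f"combined:  ${total_monthly:,.0f}/mo")
```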
So the drives are never going to fail? PSUs are never going to burn out? You are never going to need to procure new parts? Negotiate with vendors?
They mention data loss is acceptable, so im guessing they're only fixing big outages.
Ignoring failed hdds will likely mean very little maintenance.
This concern troll that everyone trots out when anyone brings up running their own gear is just exhausting. The hyperscalers have melted people’s brains to a point where they can’t even fathom running shit for themselves.
Yes, drives are going to fail. Yes, power supplies are going to burn out. Yes, god, you’re going to get new parts. Yes, you will have to actually talk to vendors.
Big. Deal. This shit is -not- hard.
For the amount of money you save by doing it like that, you should be clamoring to do it yourself. The concern trolling doesn’t make any sort of argument against it, it just makes you look lazy.
Very good point. There was something on the HN front page like this about self-hosted email, too.
I point out to people that AWS is between ten and one hundred times more expensive than a normal server. The response is "but what if I only need it to handle peak load three hours a day?" Then you still come out ahead with your own server.
We have multiple colo cages. We handle enough traffic - terabytes per second - that we'll never move those to cloud. Yet management always wants more cloud. While simultaneously complaining about how we're not making enough money.
I don't think the answer is so black-and-white. IMO this only realistically applies to larger companies, or to ones that either push lots of traffic or need large amounts of compute/storage/etc.
But for smaller groups that don't have large/sustained workloads, I think they can absolutely save money compared to colo/dedicated servers using one of multiple different kinds of AWS services.
I have several customers that coast along just fine with a $50/mo EC2 instance or less, compared to hundreds per month for a dedicated server... I wouldn't call that "ten times" by any stretch.
So now you have all
- your storage in one place
- you own all backup,
-- off site backup (hot or cold)
- uptime worries
- maintenance drives
-- how many can fail before it is a problem
- maintenance machines
-- how many can fail before it is a problem
- maintenance misc/datacenter
- What to do if the electricity is cut off suddenly
-- do you have a backup provider?
-- diesel generators?
-- giant batteries?
-- Will the backup power also run cooling?
- natural disaster
-- earthquake
-- flooding
-- heatwave
- physical security
- employee training / (esp. if many quit)
- backup for networking (and power for it)
- employees on call 24/7
- protection against hacking
+++++
I agree that a lot of cloud providers overcharge by a lot, but doing it all yourself gives you a lot of headaches.
co-hosting would seem like a valuable partial mitigator.
Most of these come from your colo provider (including a good backup power and networking story), and you can pay remote hands for a lot of the rest.
Things like "protection from hacking" also don't come from AWS.
> We kept this obsessively simple instead of using MinIO or Ceph because we didn’t need any of the features they provided; it’s much, much simpler to debug a 200-line program than to debug Ceph, and we weren’t worried about redundancy or sharding. All our drives were formatted with XFS.
What do you plan to do if you start getting corruption and bitrot? The complexity of S3 comes with a lot of hard guarantees for data integrity.
our training stack doesn't make strong assumptions about data integrity, it's chill
Aren't those netapp shelves pretty old at this point? See a lot of people recommending against them even for homelab type uses. You can get those 60 drive SuperMicro JBODs for pretty cheap now, and those aren't too old, would have been my choice.
Plus, the TCO is already way under the cloud equiv. so might as well spend a little more to get something much newer and more reliable
yeah it's on the wishlist to try
damn this is cool as hell. estimate on the maintenance cost in person-hours/month?
Around 2-5 hours/month, mostly powercycling the servers and replacing hard drives
And how much did the training data cost?
IPMI is great and all, but I still prefer serial ports and remote PDUs. Never met a BMC I could trust.
Try Lenovo. Their BMCs Don't Suck (tm).
The cost difference is huge. Modern compute is just so much bigger than one would think. Hurricane Electric is incredibly cheap too. And Digital Realty in the city are pretty good. The funny thing is that the Monkeybrains guys will make room for you at $75/amp but that isn't competitive when a 9654 based system pulls 2+ amps at peak.
Still fun for someone wanting to stick a computer in a DC though.
Networking is surprisingly hard but we also settled for the cheapo life QSFP instead of the new Cisco switches that do 800 Gbps that are coming. Great writeup.
One that would be fun is about the mechanics of layout and cabling and that sort of thing. Learning all that manually was a pain in the ass. It's not just written down somewhere and I should have done it when I was doing it but now I no longer am doing it and so can't provide good photos.
Would have been much easier and probably cheaper to buy gear from 45drives.
My question isn't why do it yourself. Quick back-of-the-envelope math shows AWS being much more expensive. My question is why San Francisco? It's one of the most expensive real estate markets in the US (#2 residential, #1 commercial), and electricity is expensive. $0.71/kWh peak residential rate! A jaunt down 280 to San Jose's gonna be cheaper, at the expense of having to take that drive to get hands on. But I'm sure you can find someone who's capable of running a DC that lives in San Jose and needs a job so the SF team doesn't have to commute down to South Bay. Now obviously there's something to be said for having the rack in the office, I know of at least two (three, now) in San Francisco, it just seems like a weird decision if you're already worrying about money to the point of not using AWS.
Article says their recurring cost is $17.5k, they'll spend at least that amount in terms of human time tending to their cluster if they have to drive to it. It's also a question of magnitudes, going from $0.5m/mo to $0.05m/mo (hard costs plus the extra headaches of dealing with cluster) is an order of magnitude, even if you could cut another order of magnitude it wouldn't be as impactful.
it's not just in sf it's across the street from our office
this has been incredibly nice for our first hardware project, if we ever expand substantially then we'd def care more about the colo costs.
Problem when you self-roll this is that you inevitably make mistakes and the cycle time of going down and up ruins everything. Access trumps everything.
You can get a DC guy but then he doesn't have much to do post setup and if you contract that you're paying mondo dollars anyway to get it right and it's a market for lemons (lots of bullshitters out there who don't know anything).
Learned this lesson painfully.
the doodles are great
Thanks! Lots of hard work went into them.
Is it correct that you have zero data redundancy? This may work for you if you're just hoarding videos from YouTube, but not for most people who require an assurance that their data is safe. Even for you, it may hurt proper benchmarking, reproducibility, and multi-iteration training if the parent source disappears.
Definitely much less redundancy, this was definitely a tradeoff we made for pretraining data and cost.
Did you do any kind of redundancy at least (eg: putting every 10 disks in RAID 5 or RAID Z1)? Or I suppose your training application doesn't mind if you shed a few terabytes of data every so often?
atm we don't and we're a bit unsure whether it's a free lunch wrt adding complexity. there's a really nice property of having isolated hard drives where you can take any individual one and `sudo mount` it and you have a nice chunk of training data, and that's something anyone can feel comfortable touching without any onboarding to some software stack
how long do you think it'll be before you fill all of it and have to build another cluster LOL
Already filled up and looking to possibly copy and paste :)
So, others have asked, and I'm curious myself: are you sourcing the videos yourselves or from third parties?
My guess would be they are running some dummy app like quote of the day or something and it records the screen at 1fps or so.
Used Disks, No DR, not exactly a real shoot out.
True, though this is specifically for pretraining data (S3 wouldn't sell us used disk + no DR storage).
I do appreciate the scrappiness of your solution. Used drives for a storage cluster is like /r/homelab on steroids. And since it's pretraining data, I suppose data integrity isn't critical.
Most venture-backed startups would have just paid the AWS or Cloudflare tax. I certainly hope your VCs appreciate how efficient you are being with their capital :)
worth stressing that we literally could not afford pretraining without this, approx our entire seed round would go into cloud storage costs
You're in a seismically active part of the world. Will the venture last in a total loss scenario?
They spent $300,000 on drives, with AWS they would have spent 4x that PER MONTH. They're already ahead of the cloud.
AWS/cloud doesn't factor into my question whatsoever. Loss of equipment is one thing. Loss of all data is quite a different story.
We're currently 1/1 for the recent 4.3 magnitude earthquake (though if SF crumbles we might lose data)
4.3 is a baby quake. I'd hope that you'd be 1/1!
$125/disk, 12k/mo depreciation cost which i assume means disk failures, so ~100 disks/mo or 1200/yr, which is half of their disks a year - seems like a lot.
It's an accounting term. You need to report the value of your company's assets each reporting cycle. This allows you to report company profit more accurately, since the 2400 drives likely aren't worth what the company originally paid. It's stated as a tax write-off but people get confused by that term (they think X written off == X less tax paid). It's better to describe it as a way to more accurately report profit (which may end up with less company tax paid, but obviously not 1:1 since company tax is not 100%).
So anyway you basically pretend you resold the drives today. Here they are assuming in 3 years time no one will pay anything for the drives. Somewhat reasonable to be honest since the setup's bespoke and you'll only get a fraction of the value of 3 year old drives if you resold them.
oh i see, thanks! i might be too used to reading backblaze reports :p
no, we wanted to be conservative by depreciating somewhat more aggressively than that. we have much closer to 5% yearly disk failure rates.
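For comparison, a small sketch of the two readings using only numbers quoted in this thread ($125/disk, 12k/mo depreciation, 2400 drives, roughly 5% yearly failures). The function names are made up, and treating replacement cost as failures times disk price is a simplifying assumption:

```python
# Figures quoted in the thread; treat them as approximate.
DISK_PRICE = 125                 # dollars per disk
DEPRECIATION_PER_MONTH = 12_000  # dollars per month
TOTAL_DISKS = 2400
ANNUAL_FAILURE_PERCENT = 5       # "much closer to 5%" per year


def disks_implied_by_depreciation(depreciation_per_month, disk_price):
    """Disks per month if depreciation were read as pure disk replacement."""
    return depreciation_per_month / disk_price


def expected_failures_per_year(total_disks, annual_failure_percent):
    """Expected failed disks per year at a given annual failure percentage."""
    return total_disks * annual_failure_percent / 100


implied_per_month = disks_implied_by_depreciation(DEPRECIATION_PER_MONTH, DISK_PRICE)
failures_per_year = expected_failures_per_year(TOTAL_DISKS, ANNUAL_FAILURE_PERCENT)
replacement_cost_per_month = failures_per_year * DISK_PRICE / 12

print(implied_per_month)           # 96.0 disks/mo under the "failures" reading
print(failures_per_year)           # 120.0 disks/yr at 5%
print(replacement_cost_per_month)  # 1250.0 dollars/mo
```

So at a 5% failure rate, replacing dead drives would run around $1,250 a month, roughly a tenth of the 12k/mo depreciation figure.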