Zero Disk Architecture
This is a follow-up to my post: Disaggregated Storage - a brief introduction
State is pain
In my previous post, I explained how a disk attached to a machine makes things difficult. Vertical scaling has its limits, and when you hit that limit, you can’t do horizontal scaling right away because of the attached disk. Mainstream databases like Postgres or MySQL don’t scale horizontally. I recently learned that the BlueSky team switched from Postgres to a combination of Scylla and SQLite. One of the reasons was that (vanilla) Postgres is not horizontally scalable, while Scylla is.
State is pain. Since the machine is stateful, you lose elasticity and scalability. So, the solution was to separate state from compute, so that they become independently scalable.
Disaggregated Storage
Disaggregated Storage solves many problems associated with the traditional coupled architecture:
- Scalable and elastic. Limits of vertical scaling do not apply
- Databases are ‘serverless’ - instant startup and shutdown
- Instant failover without the need for a hot standby
But there was a big cliffhanger at the end of that post: the storage server. If I am writing a storage server, won’t I need to manage its state? It looks like we are back where we started. We need a storage server that is strongly consistent, elastic, horizontally scalable, and preferably has auto sharding.
So…what are my options?
1. In a large company, you can offload the storage server problem to another team and live peacefully. For example, Amazon has an internal transaction log service (the details aren’t public) which is used by Aurora and MemoryDB.
2. Use an existing open source storage engine. For Disaggregated Storage on SQLite, I went this route and used FoundationDB. One problem with this approach is that you have to run and manage the cluster yourself; I don’t know of any hosted FoundationDB providers.
3. Become a cracked engineer and build your own storage server. But this would take years! We want to ship fast and ship yesterday, so it’s not an option.
It seems most database companies roll their own storage server. However, there is one more option which is a mix of #1 and #2: Amazon S3.
Zero Disk Architecture
The idea is simple. Instead of writing to a storage server, we write to S3. That way we don’t manage any storage server at all; we offload that job to the smart folks at AWS. S3 meets all our requirements, and as a bonus you get practically infinite storage space. S3 came out in 2006 and has stood the test of time. It provides 99.999999999% (that’s eleven nines) durability and 99.99% availability guarantees. I believe the next generation of infrastructure systems will be built on the zero disk paradigm.
This idea is not new. In 2008, there was a research paper, ‘Building a Database on S3’ - a paper way ahead of its time, with lots of ideas that are still interesting for today’s cloud computing. The researchers experimented with storing a B-tree on S3, using SQS as a Write-Ahead Log (WAL). They also provided an analysis of the latency of writing to S3 and the associated costs. The paper had some flaws, such as dropping ACID properties. But we are in 2025, and we can do better.
So why has no one built such a system until now? My guess: latency and cost. However, S3 keeps getting better, and the price keeps coming down; both cost and latency improve as the technology does. Amazon S3 Express One Zone launched last year and is supposed to be 10x faster. Another reason, I think, is B-trees vs LSM trees. An LSM tree workload is better suited to S3, and as most newer databases adopt LSM trees, they move closer to S3. The 2008 paper, by contrast, mapped a B-tree onto S3.
Another reason, I suspect, is the lack of features like conditional writes. Without them, you need an external system to provide transactional and ACID properties. S3 recently added conditional writes, which give you CAS-style operations on S3 objects.
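To make that concrete, here is a minimal sketch of what CAS-style operations on S3 look like, assuming a boto3 release recent enough to expose the conditional-write parameters (IfNoneMatch for create-if-absent, IfMatch for compare-and-swap on an ETag). The bucket and key names are made up for illustration.

```python
# Sketch of conditional writes on S3 via boto3. Bucket/key names are hypothetical.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "my-zero-disk-db"  # hypothetical bucket


def create_if_absent(key: str, body: bytes) -> bool:
    """Write the object only if nothing exists at this key yet."""
    try:
        s3.put_object(Bucket=BUCKET, Key=key, Body=body, IfNoneMatch="*")
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "PreconditionFailed":
            return False  # someone else won the race
        raise


def compare_and_swap(key: str, body: bytes, expected_etag: str) -> bool:
    """Overwrite the object only if it still has the ETag we last read."""
    try:
        s3.put_object(Bucket=BUCKET, Key=key, Body=body, IfMatch=expected_etag)
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "PreconditionFailed":
            return False  # object changed underneath us
        raise
```

With primitives like these, a database can elect a writer or publish a new manifest version without any external coordination service.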
Databases typically operate on small fixed-size pages, often 4 KiB. Object stores, however, work best with much larger objects, and the request costs would be insanely high if we issued a write for every 4 KiB page. So we batch pages at the compute layer until we have, say, 512 KiB, and then write all of them as a single object. But suppose a transaction sends a commit request: when do you acknowledge it as committed? If the local batch is not full, do you make the client wait, or do you cache the writes at the compute layer and return success? If you do the latter, there is a risk of data loss; if you wait, latency shoots up. Like everything in engineering, there is a trade-off: latency vs durability.
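Here is a minimal sketch of that batching layer, using the 512 KiB batch size from above. The bucket name and key scheme are made up; a real engine would also need a manifest so readers know which object holds which page.

```python
# Sketch of batching 4 KiB pages into one larger S3 object. Names are illustrative.
import io
import time
import boto3

s3 = boto3.client("s3")
BUCKET = "my-zero-disk-db"   # hypothetical bucket
PAGE_SIZE = 4 * 1024
BATCH_BYTES = 512 * 1024     # flush once ~128 pages are buffered


class PageBatcher:
    def __init__(self):
        self.buf = io.BytesIO()
        self.page_ids = []

    def write_page(self, page_id: int, page: bytes):
        assert len(page) == PAGE_SIZE
        self.buf.write(page)
        self.page_ids.append(page_id)
        if self.buf.tell() >= BATCH_BYTES:
            self.flush()

    def commit(self):
        # Durability-first choice: force the batch out before acking the
        # commit, paying the latency of an S3 PUT. The alternative is to
        # ack immediately and accept a window of possible data loss.
        self.flush()

    def flush(self):
        if not self.page_ids:
            return
        key = f"pages/batch-{int(time.time() * 1e6)}"
        s3.put_object(Bucket=BUCKET, Key=key, Body=self.buf.getvalue())
        self.buf = io.BytesIO()
        self.page_ids = []
```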
Smaller payloads also mean more requests, which increases cost but buys better durability and latency, since data reaches S3 sooner. That adds one more parameter to the trade-off: cost vs latency vs durability.
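A quick back-of-envelope calculation shows how strongly object size drives the cost axis. This assumes the S3 Standard list price of roughly $0.005 per 1,000 PUT requests (us-east-1 at the time of writing; check current pricing).

```python
# Request-cost math for per-page writes vs 512 KiB batches (prices assumed, see above).
PUT_PRICE = 0.005 / 1000  # dollars per PUT request

def put_cost(total_bytes: int, object_size: int) -> float:
    requests = total_bytes // object_size
    return requests * PUT_PRICE

GIB = 1024 ** 3
# Writing 1 TiB as individual 4 KiB pages vs 512 KiB batches:
print(put_cost(1024 * GIB, 4 * 1024))    # ~ $1342 in request fees
print(put_cost(1024 * GIB, 512 * 1024))  # ~ $10.5 in request fees
```

Same data, 128x fewer requests, two orders of magnitude less in request fees; the price you pay is the latency and durability window of the batch.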
I stole this trade-off diagram from Jack Vanlightly’s excellent article. Chris Riccomini also explored this concept and coined the catchy ‘LCD model’ term.
If you want to optimize for latency, you can first write to S3 Express One Zone (which supposedly has single-digit millisecond latency) and then offload that data to standard S3 later. In this case, Express One Zone acts as an intermediate cache tier.
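A minimal sketch of that pattern, assuming made-up bucket names (Express One Zone uses directory buckets with the `--x-s3` naming suffix): acknowledge the write once the low-latency bucket has it, then move it to a standard bucket in the background.

```python
# Sketch of "write fast, offload later" between Express One Zone and standard S3.
import boto3

s3 = boto3.client("s3")
EXPRESS_BUCKET = "my-db-wal--use1-az4--x-s3"  # hypothetical directory bucket
STANDARD_BUCKET = "my-db-wal-archive"         # hypothetical standard bucket


def write_fast(key: str, body: bytes):
    # Ack the write as soon as the low-latency zone has it.
    s3.put_object(Bucket=EXPRESS_BUCKET, Key=key, Body=body)


def offload(key: str):
    # Later, from a background job: move the object to standard S3.
    s3.copy_object(
        Bucket=STANDARD_BUCKET,
        Key=key,
        CopySource={"Bucket": EXPRESS_BUCKET, "Key": key},
    )
    s3.delete_object(Bucket=EXPRESS_BUCKET, Key=key)
```

Note that Express One Zone lives in a single availability zone, so until the offload completes you are trading some durability for that latency.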
For OLTP databases, this can still be too slow. That’s why databases like Neon and TiDB put a Raft cluster in front to receive the writes, which are then flushed to S3. This also saves on cost, because instead of many small writes you make one large write to S3.
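The shape of that design, very roughly: acknowledge the client once the record is replicated, and let a background path batch the log into S3. In this sketch, replicate() is a stand-in for a real Raft (or other consensus) append; everything here is illustrative only.

```python
# Sketch of "replicate first, batch to S3 later". replicate() is a placeholder.
import boto3

s3 = boto3.client("s3")
BUCKET = "my-db-log-archive"   # hypothetical bucket
FLUSH_BYTES = 8 * 1024 * 1024  # one large S3 PUT instead of many small ones

log_buffer: list[bytes] = []
buffered = 0
next_segment = 0


def replicate(record: bytes) -> None:
    # Placeholder: append the record to a majority of replicas via Raft.
    pass


def append(record: bytes) -> None:
    global buffered
    replicate(record)  # durable on the replica set -> safe to ack the client
    log_buffer.append(record)
    buffered += len(record)
    if buffered >= FLUSH_BYTES:
        flush_to_s3()


def flush_to_s3() -> None:
    global buffered, next_segment
    s3.put_object(Bucket=BUCKET,
                  Key=f"segments/{next_segment:012d}",
                  Body=b"".join(log_buffer))
    log_buffer.clear()
    buffered = 0
    next_segment += 1
```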
So, depending on the trade-offs you want to make, you can write directly to S3 (Standard or Express One Zone) or go through a write-through cache server. Zero disk architecture is also very attractive for systems where you don’t care about latency, for example OLAP databases and data warehouses.
Here are some systems which use S3 (or similar) as a primary store: Snowflake, WarpStream, SlateDB, Turbo Puffer, Clickhouse, Quickwit.
Zero Disk Architecture is very compelling because you are not managing any storage server. You are not managing the state; that problem is AWS S3’s to deal with now. On top of that, you get all the benefits of disaggregated storage I highlighted earlier.
It’s time we use S3 as brother Bezos intended: the malloc of the web.
1. Any object store would work. But I like S3.
2. If any Amazon engineers would like to share more details about the Transaction Log, hit me up please.
3. Jack also wrote an excellent cost analysis: A Cost Analysis of Replication vs S3 Express One Zone in Transactional Data Systems
4. In S3, if you store 100 billion objects, you might lose one in a year. To put it another way: if you store 10 million objects, it would take 10,000 years to lose one. If a dinosaur had stored 1,000 objects, they would still be intact after 65 million years 🦖
Thanks to Mr. Bhat, and Rishi for reading an early draft of this post.