Google Drive/Dropbox System Design
Cloud file system
Requirements:
- Users can create and upload files
- Files can be shared across multiple different users
- Changes to files should be propagated to all devices with access
- We should be able to see older versions of files
The system is bottlenecked less by raw transfer speed than by how resource-intensive syncing is on the local device.
Capacity estimates:
- 1 billion users
- ~100 docs/user
- Each doc ~100 KB, some in the MB range
- 1 billion x 100 docs x 100 KB = ~10 PB of docs (see the quick check after this list)
- Most docs shared with <10 people, some shared with many thousands
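A quick sanity check of that storage figure, using nothing beyond the estimates above:

```python
users = 1_000_000_000
docs_per_user = 100
bytes_per_doc = 100 * 1_000          # ~100 KB

total_bytes = users * docs_per_user * bytes_per_doc
print(total_bytes / 1e15)            # 10.0 -> roughly 10 PB of documents
```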
Document permissions:
- schema: userId, docId
- Indexing and partitioning by userId, which allows us to quickly see which documents a given user has access to
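A minimal sketch of that table and the lookup it enables. sqlite3 stands in for the real store so it runs locally; table and column names are illustrative:

```python
# Hypothetical permissions table; sqlite3 stands in for the real store.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE doc_permissions (
        user_id INTEGER NOT NULL,
        doc_id  INTEGER NOT NULL,
        PRIMARY KEY (user_id, doc_id)  -- composite key leads with user_id
    )
""")
conn.executemany(
    "INSERT INTO doc_permissions VALUES (?, ?)",
    [(1, 100), (1, 101), (2, 100)],
)

# Because the index/partition key leads with user_id, this is the cheap query:
docs_for_user = [row[0] for row in conn.execute(
    "SELECT doc_id FROM doc_permissions WHERE user_id = ?", (1,)
)]
print(docs_for_user)  # [100, 101]
```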
Document permissions DB:
- Replication?
- A single leader may be best for adds and subsequent removes
- User has permission, then the user no longer has permission; recording that removal correctly requires tombstones
- A CRDT could fix this (see the OR-Set sketch after this list)
- Use MySQL
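A minimal sketch of the CRDT idea, assuming an observed-remove set (OR-Set): each grant carries a unique tag, and a remove only tombstones the tags it has actually seen, so add/remove ordering across replicas resolves cleanly. Names and types are illustrative:

```python
import uuid

class ORSet:
    """Observed-remove set: a remove only deletes the adds it has actually seen."""

    def __init__(self):
        self.adds = {}      # element -> set of unique add tags
        self.removes = {}   # element -> set of tags tombstoned by removes

    def add(self, element):
        self.adds.setdefault(element, set()).add(uuid.uuid4().hex)

    def remove(self, element):
        observed = self.adds.get(element, set())
        self.removes.setdefault(element, set()).update(observed)

    def contains(self, element):
        live = self.adds.get(element, set()) - self.removes.get(element, set())
        return bool(live)

    def merge(self, other):
        for element, tags in other.adds.items():
            self.adds.setdefault(element, set()).update(tags)
        for element, tags in other.removes.items():
            self.removes.setdefault(element, set()).update(tags)

# Two replicas diverge: one revokes a grant while the other re-grants it.
a, b = ORSet(), ORSet()
a.add(("user-1", "doc-100"))
b.merge(a)
a.remove(("user-1", "doc-100"))   # replica A revokes the grant
b.add(("user-1", "doc-100"))      # replica B re-grants concurrently
a.merge(b)
print(a.contains(("user-1", "doc-100")))  # True: the unseen re-add survives the remove
```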
File Uploads:
- Split files into smaller chunks
- Pros:
  - Parallel uploading
  - Only upload modified chunks when a file changes
- Cons:
  - Added complexity on the client to do file diffing; we have to keep track of chunks in the DB
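A minimal sketch of the client-side diffing this implies, assuming fixed-size chunks identified by a content hash; the chunk size is an arbitrary illustrative choice:

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB, illustrative

def chunk_hashes(data: bytes) -> list[str]:
    """Split a file into fixed-size chunks and hash each one."""
    return [
        hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest()
        for i in range(0, len(data), CHUNK_SIZE)
    ]

def modified_chunks(old_hashes: list[str], new_data: bytes) -> list[int]:
    """Return indices of chunks that changed and therefore must be re-uploaded."""
    new_hashes = chunk_hashes(new_data)
    return [
        i for i, h in enumerate(new_hashes)
        if i >= len(old_hashes) or old_hashes[i] != h
    ]
```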
Chunking Data schema:
- Chunks table: id, version, hash, link
- Indexing by id then version allows quick queries to determine all chunks for a given file version
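A minimal sketch of that schema and query, again with sqlite3 standing in for the real store; column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE chunks (
        file_id  INTEGER NOT NULL,
        version  INTEGER NOT NULL,
        hash     TEXT    NOT NULL,
        link     TEXT    NOT NULL,            -- e.g. an S3 object key
        PRIMARY KEY (file_id, version, hash)  -- index leads with (file_id, version)
    )
""")
conn.executemany("INSERT INTO chunks VALUES (?, ?, ?, ?)", [
    (1, 1, "aaa", "s3://bucket/aaa"),
    (1, 2, "aaa", "s3://bucket/aaa"),  # unchanged chunk reused across versions
    (1, 2, "bbb", "s3://bucket/bbb"),
])

# All chunks for file 1 at version 2, served by the (file_id, version) index prefix.
rows = conn.execute(
    "SELECT hash, link FROM chunks WHERE file_id = ? AND version = ?", (1, 2)
).fetchall()
print(rows)
```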
Chunks DB considerations:
- Within a single leader, we need to be able to use locking to keep the proper version number. We cannot allow two users to modify the same version of a document at the same time.
- In a multi-leader approach using last-write-wins, one file version gets clobbered and there is no history. However, if we needed the extra write throughput we could make it work.
Siblings:
- Idea: use version vectors to detect concurrent writes and store both copies as siblings (see the sketch after this list)
- It seems useful to use an ACID DB so that all rows are updated atomically
- Serializability ensures that one version is finished before another can publish
- If a write targets a version that already exists, the DB rejects it
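A minimal sketch of the version-vector comparison that decides whether two writes are concurrent (and should therefore be kept as siblings); client IDs are illustrative:

```python
def compare(vv_a: dict[str, int], vv_b: dict[str, int]) -> str:
    """Compare two version vectors; 'concurrent' means neither write saw the other."""
    keys = set(vv_a) | set(vv_b)
    a_le_b = all(vv_a.get(k, 0) <= vv_b.get(k, 0) for k in keys)
    b_le_a = all(vv_b.get(k, 0) <= vv_a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a happened before b"
    if b_le_a:
        return "b happened before a"
    return "concurrent"  # keep both versions as siblings

# Two clients edited from the same base version on different replicas.
print(compare({"clientA": 2, "clientB": 1}, {"clientA": 1, "clientB": 2}))  # concurrent
```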
Uploading the Actual File:
- We want to upload as few chunks as possible
- Upload file data to S3 and, once complete, write the metadata to the chunks DB (see the sketch after this list)
- Have a cleanup job which polls S3 and checks whether each chunk still lives somewhere in the DB, so orphaned uploads can be removed
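A minimal sketch of that ordering, assuming boto3 for the S3 write and a DB connection like the sqlite3 stand-in above; the bucket name and function are illustrative:

```python
import hashlib
import sqlite3

import boto3

s3 = boto3.client("s3")
BUCKET = "file-chunks"  # illustrative bucket name

def upload_chunk(db: sqlite3.Connection, file_id: int, version: int, chunk: bytes) -> None:
    chunk_hash = hashlib.sha256(chunk).hexdigest()
    key = f"{file_id}/{chunk_hash}"

    # 1. Write the chunk bytes to S3 first.
    s3.put_object(Bucket=BUCKET, Key=key, Body=chunk)

    # 2. Only after the object write succeeds, commit the metadata row.
    #    If this step fails, the orphaned S3 object is reclaimed later by the
    #    cleanup job that checks whether each object is still referenced in the DB.
    db.execute(
        "INSERT INTO chunks (file_id, version, hash, link) VALUES (?, ?, ?, ?)",
        (file_id, version, chunk_hash, f"s3://{BUCKET}/{key}"),
    )
    db.commit()
```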
Pushing File Changes to Users:
- We ideally want to keep the number of connections maintained by each client low. The best way to do this is to have a single node responsible for a given client.
Hybrid Approach:
- For most documents, fewer than 10 users have access
- We can push these changes to a per-user destination, sharded by userId
- For super popular documents, ~100k users may have access
- Push those changes to a popular-document service, sharded by documentId
- Three connections per client is worse than one, but better than having to send the popular files everywhere (see the sketch after this list)
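A minimal sketch of what this means for a single client's connection count; the topic naming, shard counts, and popular-document list are illustrative assumptions:

```python
def client_subscriptions(user_id: int, popular_doc_ids: list[int]) -> list[str]:
    """Topics (i.e. long-lived connections) a client device maintains."""
    topics = [f"user-changes-{user_id % 64}"]           # its own per-user shard (sharded by userId)
    topics += [f"popular-doc-changes-{doc_id % 64}"     # one shard per popular doc it follows
               for doc_id in popular_doc_ids]
    return topics

# A user who follows two very popular documents keeps 3 subscriptions instead of 1.
print(client_subscriptions(42, [7, 19]))
```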
File Routing (review):
- Upon making changes to a file, we need to propagate those changes
- The file is uploaded to S3 and its metadata written to the chunks DB. A Kafka queue then shards the change event; the permissions DB tells us who has access, and the change is routed to either the unpopular-files service or the popular-files service (see the consumer sketch after this list).
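A minimal sketch of that routing consumer, assuming kafka-python; topic names, the popularity threshold, and the permissions lookup are illustrative stand-ins for the real services:

```python
import json

from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "file-changes",                       # events emitted after the chunks-DB write
    bootstrap_servers="localhost:9092",
    group_id="change-router",
    value_deserializer=lambda v: json.loads(v),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode(),
)

POPULAR_THRESHOLD = 10_000  # illustrative cutoff between the two delivery paths

def lookup_permissions(doc_id: int) -> list[int]:
    """Stand-in for a query against the permissions DB."""
    return [1, 2, 3]

for message in consumer:
    change = message.value
    user_ids = lookup_permissions(change["doc_id"])
    if len(user_ids) < POPULAR_THRESHOLD:
        # Unpopular file: fan out to per-user deliveries, sharded by userId.
        for user_id in user_ids:
            producer.send("user-changes", key=str(user_id).encode(), value=change)
    else:
        # Popular file: publish once to the popular-document service, sharded by documentId.
        producer.send("popular-doc-changes", key=str(change["doc_id"]).encode(), value=change)
```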
Metadata Reading:
- Since we will have many consumers of each metadata change, something like Kafka makes sense here.
- Persistence ensures that clients that go down can re-read missed changes
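A minimal sketch of why that persistence matters, assuming kafka-python: a consumer group with committed offsets resumes from where it left off after a crash. Group and topic names are illustrative:

```python
import json

from kafka import KafkaConsumer

def apply_change_locally(change: dict) -> None:
    """Stand-in for updating the device's local copy of the file."""
    print("applying", change)

consumer = KafkaConsumer(
    "user-changes",
    bootstrap_servers="localhost:9092",
    group_id="device-42",             # each device tracks its own committed offset
    enable_auto_commit=False,         # commit only after the change is applied locally
    auto_offset_reset="earliest",     # a brand-new device replays retained history
    value_deserializer=lambda v: json.loads(v),
)

for message in consumer:
    apply_change_locally(message.value)
    consumer.commit()                 # if the device crashes before this, the change is re-read
```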