When a system grows large (too many users, too much data, high traffic), the database often becomes a bottleneck. One way to relieve it is to split the data into smaller chunks. This is where partitioning and sharding come in. Both divide data across multiple tables or machines to improve performance and scalability.
Partitioning is the process of dividing a large dataset into smaller parts, called partitions. Each partition holds a subset of the data. Partitioning is typically done within the same database system.
There are several strategies for partitioning data:
In horizontal partitioning, rows of a table are divided into multiple tables. Each table has the same schema but holds different data.
Example: Consider a table Users with millions of rows. You can horizontally partition it like this:
Users_India: All users from India
Users_US: All users from the US
Users_Others: All remaining users
This makes it faster to query region-specific data and reduces load on any one table.
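As a minimal sketch of how an application might pick the partition for a row, the function below maps a user's country to one of the three tables above. The country codes and the `partition_for` helper are assumptions made for illustration; they are not part of any particular database's API.

```python
# Hypothetical mapping from country code to partition table name.
PARTITION_BY_COUNTRY = {
    "IN": "Users_India",
    "US": "Users_US",
}
# Every country without its own partition lands here.
DEFAULT_PARTITION = "Users_Others"


def partition_for(country_code: str) -> str:
    """Return the table that should hold a user from this country."""
    return PARTITION_BY_COUNTRY.get(country_code.upper(), DEFAULT_PARTITION)


print(partition_for("IN"))  # Users_India
print(partition_for("fr"))  # Users_Others
```

A query for Indian users then only needs to touch Users_India instead of scanning every row.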
In vertical partitioning, we split a table by columns. Each partition stores a subset of columns.
Example: For the same Users table:
Users_Basic: UserID, Name, Email
Users_Profile: UserID, ProfilePic, Bio
This reduces the amount of data read when only some columns are queried frequently. Both partitions keep UserID so the two halves of a row can be joined back together when needed.
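The column split can be sketched in a few lines. This example treats a row as a plain dictionary, which is an assumption for illustration only; the sample values and the `split_user_row` helper are hypothetical.

```python
# Column sets for the two vertical partitions; UserID appears in both
# so the halves can be matched up again.
BASIC_COLUMNS = ("UserID", "Name", "Email")
PROFILE_COLUMNS = ("UserID", "ProfilePic", "Bio")


def split_user_row(row: dict) -> tuple[dict, dict]:
    """Split one full Users row into its Users_Basic and Users_Profile parts."""
    basic = {col: row[col] for col in BASIC_COLUMNS}
    profile = {col: row[col] for col in PROFILE_COLUMNS}
    return basic, profile


# Hypothetical sample row.
row = {
    "UserID": 1,
    "Name": "Asha",
    "Email": "asha@example.com",
    "ProfilePic": "asha.png",
    "Bio": "Hello",
}
basic, profile = split_user_row(row)
print(basic)    # only the frequently read columns
print(profile)  # the larger, rarely read columns
```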
Sharding is a form of horizontal partitioning carried out across multiple physical machines or database instances. Each shard is a separate database that stores part of the data.
Key difference: Partitioning can be within the same server; sharding is across multiple servers.
Let’s say we have a web application with 100 million users. Storing all of them in one database table makes it slow and hard to scale.
We can shard users based on the first letter of their username, assigning each range of letters to its own shard.
Each shard is hosted on a separate database server. When a new user signs up, the application checks their username and stores the record in the correct shard.
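A rough sketch of that routing step is shown below. The four letter ranges and the fallback rule for usernames that do not start with a letter a to z are assumptions chosen for illustration; a real system would pick its own split.

```python
# Hypothetical letter ranges, one per shard (shards numbered from 1).
LETTER_RANGES = [("a", "f"), ("g", "l"), ("m", "r"), ("s", "z")]


def shard_for_username(username: str) -> int:
    """Return the shard number for a username based on its first letter."""
    first = username[:1].lower()
    for shard_number, (low, high) in enumerate(LETTER_RANGES, start=1):
        if low <= first <= high:
            return shard_number
    # Assumption: anything that does not start with a-z (digits, symbols,
    # empty strings) is sent to the last shard.
    return len(LETTER_RANGES)


print(shard_for_username("Alice"))  # 1
print(shard_for_username("mike"))   # 3
```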
Finding the shard that holds a given user's data is handled using a shard key. A shard key is a field (like username or user ID) that helps route the request to the correct shard.
Choosing the right shard key is important. A bad key can cause uneven data distribution, leading to a “hot shard.”
If we shard based on country, and 90% of users are from India, then the shard for India will be overloaded.
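The skew is easy to see by counting rows per shard. The sample below is made-up data that mirrors the 90% figure; the other two percentages are assumptions.

```python
from collections import Counter

# Hypothetical sample of 100 users: 90 from India, the rest elsewhere.
user_countries = ["India"] * 90 + ["US"] * 6 + ["Others"] * 4

# With country as the shard key, each country is its own shard.
load_per_shard = Counter(user_countries)
for shard, count in load_per_shard.most_common():
    print(f"{shard}: {count} users")
```

One shard ends up holding almost all the data and serving almost all the traffic, while the others sit mostly idle.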
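One solution is to apply a hash function to the user ID or username. Because a good hash behaves pseudo-randomly, this spreads users roughly evenly across shards, while the same key always maps to the same shard.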
Example:
Let’s say we have 4 shards and use a simple hash function:
shard_number = hash(username) % 4
This distributes users more evenly, regardless of geographic or alphabetical distribution.
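One practical caveat when trying this in Python: the built-in `hash()` is salted per process for strings, so the same username can map to a different shard after a restart. The sketch below swaps in SHA-256 from `hashlib` to get a stable value. The `shard_for` helper and the choice of SHA-256 are assumptions for illustration, not a prescribed method.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4


def shard_for(username: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a username to a shard number in the range 0..num_shards-1."""
    digest = hashlib.sha256(username.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical usernames, used only to inspect the distribution.
counts = Counter(shard_for(f"user{i}") for i in range(10_000))
for shard in sorted(counts):
    print(f"shard {shard}: {counts[shard]} users")
```

Each shard should receive close to a quarter of the usernames, and a given username always resolves to the same shard across runs and machines.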
In range-based sharding, instead of hashing, we divide data based on ranges of the shard key.
Example: For transaction IDs:
ID 1 - 1,000,000: Shard 1
ID 1,000,001 - 2,000,000: Shard 2
This is simple but may lead to unbalanced shards if one range is accessed more frequently.
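The range lookup can be sketched with a sorted list of upper bounds and a binary search. The `shard_for_id` helper is hypothetical, and raising an error for IDs outside the configured ranges is an assumption about how such a gap would be handled.

```python
import bisect

# Inclusive upper bound of each shard's ID range, in ascending order.
# Index 0 is Shard 1 (IDs 1 to 1,000,000), index 1 is Shard 2, and so on.
RANGE_UPPER_BOUNDS = [1_000_000, 2_000_000]


def shard_for_id(transaction_id: int) -> int:
    """Return the 1-based shard number whose range contains transaction_id."""
    if transaction_id < 1:
        raise ValueError("transaction IDs start at 1")
    index = bisect.bisect_left(RANGE_UPPER_BOUNDS, transaction_id)
    if index == len(RANGE_UPPER_BOUNDS):
        raise ValueError(f"no shard configured for ID {transaction_id}")
    return index + 1


print(shard_for_id(1_000_000))  # 1
print(shard_for_id(1_000_001))  # 2
```

Adding a new range only means appending another upper bound, which is part of why this scheme is simple to operate.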
A single user's data is typically not split across shards: one user's complete data is stored in one shard to avoid complex joins across shards.
If a shard goes down, the data it holds becomes unavailable unless replication or backups are in place. That is why redundancy is important in production systems.
Shards can be added later, but doing so is a complex process called resharding. Systems like MongoDB, Cassandra, and DynamoDB provide tools to help with dynamic sharding.
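To get a feel for why resharding is costly, the sketch below counts how many keys would change shards when a simple hash-modulo scheme grows from 4 to 5 shards. The helper names and the SHA-256 based hash are assumptions for illustration; production systems generally use more sophisticated placement schemes.

```python
import hashlib


def shard_of(key: str, num_shards: int) -> int:
    """Stable hash-modulo placement of a key onto one of num_shards shards."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def fraction_moved(keys: list[str], old_count: int, new_count: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for key in keys if shard_of(key, old_count) != shard_of(key, new_count)
    )
    return moved / len(keys)


# Hypothetical keys, used only to measure movement.
keys = [f"user{i}" for i in range(10_000)]
moved = fraction_moved(keys, 4, 5)
print(f"{moved:.0%} of keys change shard when going from 4 to 5 shards")
```

With plain modulo placement, roughly four out of five keys land on a different shard after the change, so most of the data would have to be copied between servers.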
Sharding and partitioning are essential techniques in system design for handling large-scale data efficiently. While partitioning helps divide data within a database, sharding distributes it across multiple machines for better scalability and performance. By understanding the strategies and trade-offs, you can design systems that scale smoothly as traffic and data grow.