BlockByte
Database Sharding 101: Essential Guide to Scaling Your Data

Understanding Database Sharding: An Overview of Scaling Solutions and Partitioning Methods with Industry Examples

April 12, 2024

Today's Insights:

Introduction to Database Sharding
Database Scaling Techniques and Partitioning
Sharding Approaches and Performance Optimization
Industry Example from Notion.so

What is Database Sharding?

Database sharding is a method of dividing a large database into smaller, manageable pieces, known as "shards." Each shard can be hosted on a separate server, making it a powerful tool for dealing with large datasets.

Purpose of Database Sharding: The primary purpose of database sharding is to enhance performance by distributing the workload across multiple servers. This setup helps in managing large volumes of data more efficiently and ensures smoother operation of database systems.

Benefits of Database Sharding: Because each shard holds only a subset of the data, queries work against smaller tables and indexes, which typically improves response times. Sharding also lets a system scale out: capacity grows by adding shards as data demands increase.

Scaling Techniques in Databases

In database management, scaling techniques are essential for improving performance and managing larger data volumes. There are two main types of scaling: horizontal and vertical. Each type is selected based on specific performance needs and growth objectives. Often, vertical scaling is implemented initially to enhance a system's capacity before adopting more complex strategies like sharding, as it provides a straightforward way to boost processing power with existing infrastructure.

Horizontal Scaling

Horizontal scaling, or scaling out, involves adding more machines of similar specifications to your resource pool. This method boosts capacity by spreading the workload across several servers, enhancing system throughput and fault tolerance. It's especially useful for systems needing high availability or handling numerous simultaneous requests.

Figure: Horizontal Scaling

Vertical Scaling

Vertical scaling, or scaling up, involves upgrading existing hardware, such as adding more CPUs, RAM, or storage to a server. This method increases processing power without the need to manage more servers. However, there is a limit to how much a single server can be upgraded, so vertical scaling may need to be supplemented by horizontal scaling as demands increase.

Figure: Vertical Scaling

Partition Strategies in Database Sharding

In database sharding, partition strategies play a crucial role in data management. Here’s a concise overview:

Vertical Partitioning: This approach divides a table into distinct parts based on columns. For example, in the diagram, the customer_base table is split into VP1, which holds the columns id, first_name, and last_name (the customers' personal information), and VP2, which holds the columns id and country (the location data). Both partitions keep the id column so that rows can be matched up again. This separation lets systems read only the columns they need, which can make data processing and storage more efficient.

Figure: Vertical Partitioning
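
The column split described above can be sketched in a few lines of Python. This is only an illustration: the rows, last names, and country codes are made up, and the helper names vertical_partition and join_on_id are hypothetical, not part of any database API.

```python
# Hypothetical rows standing in for the customer_base table in the diagram.
customer_base = [
    {"id": 1, "first_name": "Phil", "last_name": "Adams", "country": "DE"},
    {"id": 2, "first_name": "Claire", "last_name": "Baker", "country": "FR"},
    {"id": 3, "first_name": "Nora", "last_name": "Clark", "country": "ES"},
]


def vertical_partition(rows, columns):
    """Return a copy of rows that keeps only the given columns."""
    return [{col: row[col] for col in columns} for row in rows]


def join_on_id(left, right):
    """Recombine two partitions by matching rows on the shared id column."""
    right_by_id = {row["id"]: row for row in right}
    return [{**row, **right_by_id[row["id"]]} for row in left]


# VP1 keeps the personal information, VP2 keeps the location data.
vp1 = vertical_partition(customer_base, ["id", "first_name", "last_name"])
vp2 = vertical_partition(customer_base, ["id", "country"])

# Joining the two partitions on id restores the original rows.
restored = join_on_id(vp1, vp2)
```

Keeping id in both partitions is what makes the join possible; a query that needs columns from both parts pays for that extra lookup.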

Horizontal Partitioning: This approach segments a database table by rows instead of columns. The diagram demonstrates horizontal partitioning where the original customer_base table is divided into two parts: HP1 contains rows for customers with IDs 1 and 2, and HP2 holds rows for customers with IDs 3 to 5. This type of partitioning is beneficial for distributing data across different servers or regions, enhancing query performance by localizing the data and reducing the load on any single server.

Figure: Horizontal Partitioning
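
As a rough sketch of the row split described above, the snippet below divides a five-row table into HP1 (IDs 1 and 2) and HP2 (IDs 3 to 5). The names in the rows and the helper horizontal_partition are assumptions made for illustration.

```python
# Hypothetical rows standing in for the customer_base table (IDs 1 to 5).
customer_base = [
    {"id": 1, "first_name": "Phil"},
    {"id": 2, "first_name": "Harry"},
    {"id": 3, "first_name": "Claire"},
    {"id": 4, "first_name": "Nora"},
    {"id": 5, "first_name": "Alex"},
]


def horizontal_partition(rows, belongs_to_first):
    """Split rows into two partitions using a predicate on each row."""
    first, second = [], []
    for row in rows:
        (first if belongs_to_first(row) else second).append(row)
    return first, second


# HP1 holds IDs 1 and 2; HP2 holds the remaining rows (IDs 3 to 5).
hp1, hp2 = horizontal_partition(customer_base, lambda row: row["id"] <= 2)
```

Every partition has the same columns as the original table; only the set of rows differs.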

Sharding Approaches

Sharding is a form of data partitioning designed to improve scalability and performance. Two common approaches are key-based sharding and range-based sharding.

Key-based Sharding

Key-based sharding uses a shard key, which is passed through a hash function to assign each data entry to a shard. The hash function's output determines which shard a particular piece of data resides on, with the goal of distributing data evenly across shards.

Figure: Key-based Sharding

Key-based Sharding Process:
- The customer_base table's column_1 serves as the shard key.
- A hash function is applied to the values in column_1, assigning a hash value to each row.

Allocation of Data:
- Rows with hash values of 1 (A and C) are grouped into one shard.
- Rows with hash values of 2 (B and D) are placed into a separate shard.
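
The process above can be sketched as follows. This is a minimal illustration, not the hash function used in the diagram: it assumes SHA-256 as the hash and a modulo over the shard count, and the names shard_for and assign are hypothetical. Shards are numbered from 0 here, so the shard numbers and the grouping of A, B, C, and D will not necessarily match the diagram.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a shard key to a shard number using a stable hash."""
    # hashlib is used instead of the built-in hash(), which is randomized
    # per process for strings and so would not give a stable placement.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def assign(keys, num_shards):
    """Group shard keys by the shard they hash to."""
    shards = {shard: [] for shard in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


# Values of column_1 from the example, spread over two shards.
placement = assign(["A", "B", "C", "D"], num_shards=2)
```

One caveat of the plain modulo used here: changing num_shards changes the result for most keys, so adding a shard means moving a large share of the data.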

Range-based Sharding

Range-based sharding is a database partitioning technique that organizes records into different shards based on a defined range of a key attribute, such as revenue. In this method, one shard might contain all records with revenues below a certain amount, while another shard would include records exceeding that amount.

Figure: Range-based Sharding

Range-based Sharding Process:
- The customer_base table is segmented into shards according to the revenue.

Allocation of Data:
- One shard contains customers with revenue less than 300€ (Phil and Harry).
- Another shard holds customers with revenue greater than 300€ (Claire and Nora).
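
A minimal sketch of this allocation, assuming a sorted list of boundaries and Python's bisect module for the lookup. The revenue figures are invented for illustration, and placing a revenue of exactly 300€ in the upper shard is an assumption, since the example above leaves that case open.

```python
from bisect import bisect_right

# Boundaries between shards, in euros. One boundary gives two shards.
BOUNDARIES = [300]


def shard_for_revenue(revenue: float) -> int:
    """Return 0 for revenue below 300, 1 for revenue of 300 or more."""
    return bisect_right(BOUNDARIES, revenue)


# Hypothetical revenue values consistent with the example above.
customers = {"Phil": 100, "Harry": 250, "Claire": 400, "Nora": 550}

shards = {0: [], 1: []}
for name, revenue in customers.items():
    shards[shard_for_revenue(revenue)].append(name)
```

Adding a boundary to BOUNDARIES adds a shard without changing the lookup code. If most customers fall into one revenue range, that shard ends up holding most of the data, which is the usual trade-off against key-based sharding.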

Scaling Reads

Reads can be scaled through replication. In this setup, a master database handles all write operations, while multiple replica databases serve read operations. Distributing read requests across several replicas reduces the load on the master, which improves performance and query response times. This method is particularly advantageous in read-heavy environments, where the system must handle a large number of concurrent reads without degrading performance.

Figure: Scaling Reads
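
The read/write split can be sketched as a small router. This is a simplified illustration under stated assumptions: the class ReplicatedRouter and the server names are hypothetical, a statement counts as a read only if it starts with SELECT, and replicas are picked round-robin. A real setup would also need to account for replication lag, since with asynchronous replication a replica may briefly return stale data.

```python
import itertools


class ReplicatedRouter:
    """Send writes to the master and rotate reads across replicas."""

    def __init__(self, master, replicas):
        self.master = master
        # cycle() yields the replicas in order, forever (round-robin).
        self._replicas = itertools.cycle(replicas)

    def route(self, sql: str) -> str:
        """Return the name of the server that should run this statement."""
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._replicas) if is_read else self.master


router = ReplicatedRouter("master", ["replica-1", "replica-2"])
read_target = router.route("SELECT * FROM customer_base")
write_target = router.route("INSERT INTO customer_base VALUES (6, 'Ada')")
```

Treating every SELECT as safe for a replica is a simplification; reads that must see the latest write are often sent to the master instead.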

Industry Insight: How Notion Puts the Theory into Practice

In early 2023, Notion upgraded its live database cluster to a larger setup without any downtime to handle increased traffic. Notion originally ran a single large Postgres database on Amazon RDS and, as it grew, moved to horizontal sharding to spread the load across multiple databases. Before the upgrade, the system consisted of 32 databases partitioned by workspace ID, but this setup was struggling with high CPU and disk bandwidth utilization, and it was running into PgBouncer connection limits when scaling.

To resolve these issues, Notion implemented horizontal resharding, increasing the number of database instances from 32 to 96. This expansion was managed using Terraform for provisioning and involved dividing existing logical schemas across more machines. Data synchronization was achieved through Postgres logical replication, ensuring historical data was copied and new changes continuously applied. Verification involved dark reads, comparing outputs from both old and new databases to confirm consistency.
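
To illustrate why going from 32 to 96 instances can be done by moving whole logical schemas rather than individual rows, here is a rough sketch. The figure of 480 logical schemas and the contiguous assignment of schemas to databases are assumptions made for this example and are not stated in this article; the function physical_db is hypothetical and does not represent Notion's actual routing code.

```python
NUM_LOGICAL_SCHEMAS = 480  # assumed for illustration, not from this article


def physical_db(logical_schema: int, num_databases: int) -> int:
    """Map a logical schema to a database, assuming contiguous blocks."""
    assert NUM_LOGICAL_SCHEMAS % num_databases == 0
    schemas_per_db = NUM_LOGICAL_SCHEMAS // num_databases
    return logical_schema // schemas_per_db


# Before: 32 databases with 15 schemas each. After: 96 with 5 each.
before = [physical_db(s, 32) for s in range(NUM_LOGICAL_SCHEMAS)]
after = [physical_db(s, 96) for s in range(NUM_LOGICAL_SCHEMAS)]

# Under these assumptions, each old database splits into exactly three
# new ones, so no schema has to be merged with one from another database.
splits_cleanly = all(new // 3 == old for old, new in zip(before, after))
```

Because the workspace-to-schema assignment stays fixed in this sketch, only the schema-to-database mapping changes during the reshard.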

Notion also restructured its PgBouncer clusters to manage the increased connection loads. The transition to the new shards was carefully executed to prevent data loss and ensure ongoing data synchronization. The added database capacity reduced CPU and IOPS utilization to about 20% during peak times, a notable improvement from previous levels. Overall, careful planning and execution of the resharding process let Notion expand its database capacity and boost performance while maintaining an uninterrupted user experience.

Figure: Notion.so - Database shards

Source: The Great Re-shard: adding Postgres capacity (again) with zero downtime