Indexing is a technique used in databases to speed up the retrieval of rows from a table. It works similarly to an index in a book — instead of scanning every page (or every row), the database can quickly jump to the right location using the index.
In large systems, tables can grow to millions of rows. Without indexing, searching or filtering data using WHERE clauses would require a full scan of the table, making queries very slow. Indexes help improve the performance of:
- SELECT queries that filter rows with WHERE clauses
- JOIN operations on the joined columns
- ORDER BY and GROUP BY operations on the indexed columns
Imagine we have a users table with 10 million records. We often search for users by their email address:
SELECT * FROM users WHERE email = 'alice@example.com';
Without an index on the email column, the database scans all 10 million rows — a slow and costly operation.
By creating an index:
CREATE INDEX idx_users_email ON users(email);
Now, the database can quickly jump to the row matching the email, reducing search time drastically.
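To confirm that the index is actually being used, you can ask the database for its query plan. Here is a minimal sketch assuming PostgreSQL; the exact EXPLAIN syntax and output differ between databases:

-- Before creating idx_users_email this reports a sequential scan over all rows;
-- afterwards it should report an index scan on idx_users_email.
EXPLAIN ANALYZE
SELECT * FROM users WHERE email = 'alice@example.com';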
Question: Does adding an index always improve performance?
Answer: No. While indexes speed up read operations, they add overhead during insert, update, and delete operations because the index also needs to be updated. You should only index columns that are frequently searched or sorted.
Don’t add indexes to every column. Avoid indexing:
- columns that are rarely used in WHERE, JOIN, or ORDER BY clauses
- columns that are updated very frequently, since every write must also update the index
- columns with very few distinct values (such as boolean flags), where an index rarely helps
Denormalization is the process of introducing redundancy into a database by combining tables or storing duplicated data to improve read performance.
In normalized databases, data is split across multiple tables to remove redundancy (e.g., separate tables for users and orders).
In a normalized model:
Table: users
- user_id
- name
Table: orders
- order_id
- user_id
- order_amount
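As a rough sketch, the normalized model above could be created like this (the column types and constraints are assumptions; only the table and column names come from the example):

CREATE TABLE users (
    user_id      BIGINT PRIMARY KEY,
    name         TEXT NOT NULL
);

CREATE TABLE orders (
    order_id     BIGINT PRIMARY KEY,
    user_id      BIGINT NOT NULL REFERENCES users(user_id),  -- each order points back to its user
    order_amount NUMERIC(10, 2) NOT NULL
);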
To get a user's name and their orders:
SELECT users.name, orders.order_amount
FROM users
JOIN orders ON users.user_id = orders.user_id;
This join operation is efficient with small datasets, but in high-traffic systems, frequent joins become a performance bottleneck.
In a denormalized model, we may store user names directly in the orders table:
Table: orders
- order_id
- user_id
- user_name
- order_amount
Now we don’t need to perform a join to get the user name for an order — we trade storage space for speed.
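As a sketch, the denormalized orders table and a join-free read could look like this (the types and the example user_id value are assumptions):

CREATE TABLE orders (
    order_id     BIGINT PRIMARY KEY,
    user_id      BIGINT NOT NULL,
    user_name    TEXT NOT NULL,              -- duplicated copy of users.name
    order_amount NUMERIC(10, 2) NOT NULL
);

-- The user's name comes straight from the orders table; no join required.
SELECT user_name, order_amount
FROM orders
WHERE user_id = 42;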
Question: Does denormalization introduce any trade-offs?
Answer: Yes. Denormalization leads to data duplication. If a user's name changes, we have to update it in multiple places. However, for read-heavy systems, the performance benefits often outweigh the complexity of maintaining consistency.
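For example, renaming a user now means updating both the source of truth and every duplicated copy. A sketch, using arbitrary example values and wrapping both updates in one transaction to keep the copies consistent:

BEGIN;
-- Update the authoritative record...
UPDATE users  SET name = 'Alice Smith' WHERE user_id = 42;
-- ...and the duplicated copy stored with each of the user's orders.
UPDATE orders SET user_name = 'Alice Smith' WHERE user_id = 42;
COMMIT;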
Think of a library database: a normalized design keeps books and members in separate tables and joins them whenever you want to see who borrowed which book, while a denormalized design might store the book title directly in each loan record so a member's borrowing history can be shown without a join.
Question: Should every system use denormalization?
Answer: No. Denormalization should be a deliberate choice after identifying performance issues in normalized models. Not all systems benefit from denormalization.