Monitoring and Logging in System Design

Introduction to Monitoring and Logging

Monitoring and logging are essential components of system design. They help engineers understand how a system behaves in real time, detect issues early, and troubleshoot problems effectively. For any large-scale or production-level application, observability is critical to ensuring reliability and maintaining user satisfaction.

What is Monitoring?

Monitoring is the process of collecting, analyzing, and using metrics to track the performance, health, and availability of a system. It helps answer questions like:

Is the server running?
Is the database response time normal?
Are error rates increasing?

What is Logging?

Logging refers to the recording of events and messages that occur within an application or system. Logs contain detailed information such as error traces, user actions, system warnings, and other operational data that help in debugging and analysis.

Why Do We Need Monitoring and Logging?

Consider a web application that starts responding slowly or returns error pages. Without logs and monitoring data, it would be very difficult to understand what went wrong. But with proper logs and a monitoring dashboard, you can quickly identify:

Which component failed
When the failure started
How often it is occurring

Example: Monitoring a Food Delivery App

Let’s say you’ve built a food delivery app like Zomato or Swiggy. Your backend has multiple components like:

API Gateway
Order Service
Payment Service
Restaurant Notification Service

To monitor the health of this system, you might want to track metrics like:

Number of orders placed per minute
Payment success rate
Latency of notification delivery

These metrics would typically be collected by a tool such as Prometheus and visualized in Grafana dashboards, or handled end to end by a hosted service such as Datadog.
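As a rough illustration of how a service might record the three metrics listed above before handing them to a monitoring tool, here is a minimal in-memory sketch. The class `DeliveryMetrics` and its method names are hypothetical, and a real system would normally use a metrics client library instead of plain Python fields.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class DeliveryMetrics:
    """Hypothetical in-memory store for the example metrics (not a real library)."""
    orders_placed: int = 0
    payments_ok: int = 0
    payments_failed: int = 0
    notification_latencies_ms: list = field(default_factory=list)

    def record_order(self) -> None:
        # A raw count; "orders per minute" is derived later as a rate over time.
        self.orders_placed += 1

    def record_payment(self, success: bool) -> None:
        if success:
            self.payments_ok += 1
        else:
            self.payments_failed += 1

    def record_notification(self, latency_ms: float) -> None:
        self.notification_latencies_ms.append(latency_ms)

    def payment_success_rate(self) -> float:
        total = self.payments_ok + self.payments_failed
        # With no payments yet, report 1.0 so an idle system does not look unhealthy.
        return self.payments_ok / total if total else 1.0

    def avg_notification_latency_ms(self) -> float:
        latencies = self.notification_latencies_ms
        return mean(latencies) if latencies else 0.0


metrics = DeliveryMetrics()
metrics.record_order()
metrics.record_payment(success=True)
metrics.record_payment(success=False)
metrics.record_notification(latency_ms=120.0)
print(metrics.payment_success_rate(), metrics.avg_notification_latency_ms())
```

Keeping raw counts rather than precomputed rates is a common design choice, because the monitoring backend can then compute rates over any time window.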

Question: What happens if the order success rate drops suddenly?

Answer: A monitoring alert would fire once the success rate drops below a set threshold (say 95%), and the operations team would investigate. They might check the logs of the Payment Service and find that a third-party API used for card payments is down. This kind of quick diagnosis is only possible when monitoring and logging work together.
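The threshold check described in this answer can be sketched in a few lines. This is only an illustration: the function name `check_success_rate` is hypothetical, and real alerting tools evaluate rules like this over time windows rather than on a single pair of numbers.

```python
from typing import Optional

ALERT_THRESHOLD = 0.95  # the 95% figure used in the answer above


def check_success_rate(succeeded: int, attempted: int,
                       threshold: float = ALERT_THRESHOLD) -> Optional[str]:
    """Return an alert message if the success rate is below threshold, else None."""
    if attempted == 0:
        return None  # no traffic in this window, so there is nothing to evaluate
    rate = succeeded / attempted
    if rate < threshold:
        return f"ALERT: success rate {rate:.1%} is below {threshold:.0%}"
    return None


print(check_success_rate(980, 1000))  # healthy window, no alert
print(check_success_rate(900, 1000))  # degraded window, alert message
```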

Types of Metrics in Monitoring

System Metrics: CPU usage, memory usage, disk I/O
Application Metrics: Number of logins, signups, payment attempts
Custom Metrics: Business KPIs like number of active orders, average order value

What Tools Can We Use for Monitoring?

Prometheus: Time-series monitoring system that pulls (scrapes) metrics from HTTP endpoints exposed by your services.
Grafana: Visualization tool used with Prometheus to show dashboards and alerts.
Datadog / New Relic: SaaS-based solutions that offer APM and infrastructure monitoring.
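Prometheus scrapes metrics as plain text from an HTTP endpoint. To make that pull model more concrete, the sketch below renders a couple of counters in the Prometheus text exposition format using only the standard library. The helper `render_counters` and the metric names are hypothetical; in practice an official client library would generate this output for you.

```python
def render_counters(counters: dict, help_text: dict) -> str:
    """Render counter values in the Prometheus text exposition format."""
    lines = []
    for name in sorted(counters):
        # Each metric gets a HELP line, a TYPE line, and a sample line.
        lines.append(f"# HELP {name} {help_text.get(name, '')}".rstrip())
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {counters[name]}")
    return "\n".join(lines) + "\n"


# Hypothetical application metrics for illustration.
counters = {"orders_placed_total": 42, "payment_attempts_total": 40}
help_text = {
    "orders_placed_total": "Orders placed since process start.",
    "payment_attempts_total": "Payment attempts since process start.",
}
print(render_counters(counters, help_text))
```

A service would return this text from an endpoint (conventionally `/metrics`), and Prometheus would fetch it on a fixed interval.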

What Should Be Logged?

For beginners, it’s helpful to understand what types of events should be logged. Examples:

When a user logs in
When an exception occurs
When an order fails to save to the database
When an API takes more than 2 seconds to respond
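Two of the events above, exceptions and slow responses, can be captured in one place with a decorator. The following is a minimal sketch using Python's standard `logging` module; the decorator `log_if_slow` and the handler `get_menu` are hypothetical names, and the demo uses a very small threshold only so that it triggers quickly.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("api")
SLOW_THRESHOLD_SECONDS = 2.0  # the 2-second figure from the list above


def log_if_slow(threshold: float = SLOW_THRESHOLD_SECONDS):
    """Decorator that logs exceptions and calls slower than threshold."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                # logger.exception records the message plus the stack trace.
                logger.exception("%s raised an exception", func.__name__)
                raise
            finally:
                elapsed = time.perf_counter() - start
                if elapsed > threshold:
                    logger.warning("%s took %.2fs (threshold %.2fs)",
                                   func.__name__, elapsed, threshold)
        return wrapper
    return decorator


@log_if_slow(threshold=0.01)  # tiny threshold so the demo logs a warning
def get_menu():  # hypothetical API handler
    time.sleep(0.05)
    return ["pizza", "biryani"]


get_menu()
```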

Example: Logging in an E-Commerce App

Let’s say a customer places an order and the system crashes. The logs should contain entries like:

[INFO] Order placed for userID=123 at 10:30:15
[ERROR] PaymentService timeout after 5000ms
[WARN] Retrying payment process
[ERROR] OrderService failed to confirm order for userID=123

These logs help engineers pinpoint where the issue occurred, whether in the PaymentService or the OrderService, and take the necessary action.
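To show how such entries can be inspected programmatically, the sketch below parses the four example lines and returns the earliest error, which is often (though not always) the best starting point for an investigation. The regular expression assumes the simple `[LEVEL] message` layout used in this example, and `first_error` is a hypothetical helper.

```python
import re

LOG_LINES = [
    "[INFO] Order placed for userID=123 at 10:30:15",
    "[ERROR] PaymentService timeout after 5000ms",
    "[WARN] Retrying payment process",
    "[ERROR] OrderService failed to confirm order for userID=123",
]

# Assumes the "[LEVEL] message" layout shown in the example above.
LINE_PATTERN = re.compile(r"^\[(?P<level>[A-Z]+)\]\s+(?P<message>.*)$")


def first_error(lines):
    """Return the message of the earliest ERROR entry, or None if there is none."""
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match and match.group("level") == "ERROR":
            return match.group("message")
    return None


print(first_error(LOG_LINES))
```

Here the earliest error comes from the PaymentService, which suggests the OrderService failure is a downstream effect.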

Question: Should we log everything?

Answer: No. Logging everything can lead to performance issues and unmanageable log storage. Instead, log:

Errors and exceptions
Important state transitions
Unusual or slow behavior
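One common way to avoid logging everything is to use log levels and set a minimum level per environment. The sketch below uses Python's standard `logging` module; the logger name and messages are made up for illustration.

```python
import logging

logger = logging.getLogger("orders")
logger.addHandler(logging.StreamHandler())
# Keep INFO and above; DEBUG messages are dropped before any formatting happens.
logger.setLevel(logging.INFO)

logger.debug("cart contents: %s", ["pizza"])        # dropped (too verbose)
logger.info("order state changed: PLACED -> PAID")  # kept (state transition)
logger.warning("order lookup took 2.4s")            # kept (slow behavior)
logger.error("failed to save order")                # kept (error)
```

Passing arguments separately (`"%s", value`) instead of pre-formatting the string means the formatting work is skipped entirely for messages below the configured level.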

What Tools Can We Use for Logging?

Logstash: Used to process and ship logs
Elasticsearch: Stores and indexes logs for fast searching
Kibana: Visualizes logs and trends (used with Elasticsearch)
Fluentd: Log collection and forwarding tool

Elasticsearch, Logstash, and Kibana together are often called the ELK stack. When Fluentd replaces Logstash, the combination is called the EFK stack (Elasticsearch, Fluentd, Kibana).

Example: Building a Log Dashboard

Imagine you're debugging why users are facing random logouts. You search the logs in Kibana using a query like:

level: "ERROR" AND message: "user session expired"

The search shows that a Redis instance (used to store sessions) had restarted, causing session loss. This root cause would be hard to find without centralized, searchable logs.
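To make the query's meaning concrete, here is a rough Python equivalent applied to log records that have already been parsed into dictionaries. It is only an approximation of what the search engine does, and the sample records are hypothetical data invented for this illustration.

```python
def matches(record: dict) -> bool:
    """Rough equivalent of: level: "ERROR" AND message: "user session expired"."""
    return (record.get("level") == "ERROR"
            and "user session expired" in record.get("message", "").lower())


# Hypothetical, already-parsed log records used only to exercise the filter.
records = [
    {"level": "INFO", "service": "auth", "message": "User logged in"},
    {"level": "ERROR", "service": "auth", "message": "User session expired"},
    {"level": "WARN", "service": "redis", "message": "Instance restarted"},
    {"level": "ERROR", "service": "orders", "message": "Failed to save order"},
]

hits = [r for r in records if matches(r)]
print(hits)
```

Both conditions must hold, so the unrelated error from the orders service is excluded even though its level matches.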

Best Practices for Monitoring and Logging

Set up alerts for critical metrics (error rate, latency)
Use structured logging (JSON format)
Redact sensitive data (passwords, tokens)
Use log rotation and retention policies
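Two of these practices, redaction and rotation, can be sketched with Python's standard `logging` module. This is an illustration under stated assumptions: the regular expression only covers `password=` and `token=` pairs, the size and backup limits are arbitrary, and `RedactFilter` and `build_logger` are hypothetical names.

```python
import logging
import re
from logging.handlers import RotatingFileHandler

# Matches key=value pairs for a few sensitive keys (illustrative, not exhaustive).
SENSITIVE = re.compile(r"(password|token)=\S+", re.IGNORECASE)


class RedactFilter(logging.Filter):
    """Mask sensitive values in records logged through the owning logger."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # merge %-style args into the message first
        record.msg = SENSITIVE.sub(r"\1=[REDACTED]", message)
        record.args = ()               # args are already merged into msg
        return True                    # keep the record


def build_logger(path: str) -> logging.Logger:
    """Create a logger that redacts secrets and rotates its log file."""
    logger = logging.getLogger("secure-app")
    logger.setLevel(logging.INFO)
    # Rotate at roughly 1 MB and keep 5 old files as a simple retention policy.
    handler = RotatingFileHandler(path, maxBytes=1_000_000, backupCount=5)
    handler.setFormatter(
        logging.Formatter("%(asctime)s [%(levelname)s] %(message)s"))
    logger.addFilter(RedactFilter())
    logger.addHandler(handler)
    return logger
```

With this setup, a call such as `build_logger("app.log").info("login user=%s password=%s", user, pw)` would write the user name but replace the password value with `[REDACTED]`.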

Question: What is structured logging and why is it useful?

Answer: Structured logging stores log data in a consistent format (like JSON), making it easier to parse, filter, and search logs. For example:

{
  "timestamp": "2025-05-03T10:15:00Z",
  "level": "ERROR",
  "service": "PaymentService",
  "message": "Failed to process payment",
  "order_id": "ORD1007",
  "user_id": "U512"
}
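A log entry shaped like the one above can be produced with a custom formatter for Python's standard `logging` module. The sketch below is one possible approach, not the only one; the class name `JsonFormatter` is hypothetical, and the logger name stands in for the service name.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": record.name,  # assumes the logger is named after the service
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        for key in ("order_id", "user_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("PaymentService")
logger.addHandler(handler)

logger.error("Failed to process payment",
             extra={"order_id": "ORD1007", "user_id": "U512"})
```

Because every entry has the same keys, a log store can index fields such as `order_id` directly instead of searching free text.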

Monitoring vs Logging: What’s the Difference?

Aspect	Monitoring	Logging
Purpose	Track system health and performance	Record detailed event history
Data Type	Numerical metrics	Textual logs
Examples	CPU usage, error rate	Stack trace, request logs
Tools	Prometheus, Datadog	ELK Stack, Fluentd

Conclusion

Monitoring and logging are foundational to maintaining healthy and reliable systems. For a beginner, understanding these tools early makes it easier to build systems that are simple to debug and scale. Always start by defining what you want to measure and what kinds of problems you want to catch, then build your logging and monitoring strategy around that.

Key Questions to Ask Yourself

What metrics define the health of my system?
What events should I log to make debugging easier?
How quickly can I detect and respond to an issue?
