Monitoring and Logging in System Design



Introduction to Monitoring and Logging

Monitoring and logging are essential components of system design that help engineers understand how a system behaves in real time, detect issues early, and troubleshoot problems effectively. For any large-scale or production application, this observability is critical for reliability and user satisfaction.

What is Monitoring?

Monitoring is the process of collecting, analyzing, and using metrics to track the performance, health, and availability of a system. It helps answer questions like:

  • Is the server running?
  • Is the database response time normal?
  • Are error rates increasing?

What is Logging?

Logging refers to the recording of events and messages that occur within an application or system. Logs contain detailed information such as error traces, user actions, system warnings, and other operational data that help in debugging and analysis.

Why Do We Need Monitoring and Logging?

Consider a web application that starts responding slowly or returns error pages. Without logs and monitoring data, it would be very difficult to understand what went wrong. But with proper logs and a monitoring dashboard, you can quickly identify:

  • Which component failed
  • When the failure started
  • How often it is occurring

Example: Monitoring a Food Delivery App

Let’s say you’ve built a food delivery app like Zomato or Swiggy. Your backend has multiple components like:

  • API Gateway
  • Order Service
  • Payment Service
  • Restaurant Notification Service

To monitor the health of this system, you might want to track metrics like:

  • Number of orders placed per minute
  • Payment success rate
  • Latency of notification delivery

These metrics would be visualized in dashboards using tools like Prometheus + Grafana or Datadog.
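As a rough sketch, the three metrics above could be recorded in process like this, using only the Python standard library. The class `DeliveryMetrics` and its method names are hypothetical and exist only for illustration; a real service would more likely use a metrics client library whose output Prometheus or Datadog can read.

```python
import time
from collections import deque


class DeliveryMetrics:
    """Hypothetical in-memory metrics holder for the food delivery example."""

    def __init__(self):
        self.order_times = deque()        # timestamps of placed orders, oldest first
        self.payments_ok = 0
        self.payments_failed = 0
        self.notification_latencies = []  # seconds per delivered notification

    def record_order(self, now=None):
        self.order_times.append(time.time() if now is None else now)

    def record_payment(self, success):
        if success:
            self.payments_ok += 1
        else:
            self.payments_failed += 1

    def record_notification(self, latency_seconds):
        self.notification_latencies.append(latency_seconds)

    def orders_per_minute(self, now=None):
        """Count orders placed in the last 60 seconds."""
        now = time.time() if now is None else now
        # Drop timestamps that have fallen out of the 60-second window.
        while self.order_times and self.order_times[0] <= now - 60:
            self.order_times.popleft()
        return len(self.order_times)

    def payment_success_rate(self):
        total = self.payments_ok + self.payments_failed
        # With no payments yet, report 1.0 so an idle system does not look broken.
        return self.payments_ok / total if total else 1.0

    def avg_notification_latency(self):
        latencies = self.notification_latencies
        return sum(latencies) / len(latencies) if latencies else 0.0
```

Each service would call the `record_*` methods as events happen, and a dashboard would read the three summary methods at regular intervals.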

Question: What happens if the order success rate drops suddenly?

Answer: A monitoring alert would fire when the success rate drops below a configured threshold (say 95%). The operations team would then investigate. They might check the logs of the Payment Service and find that a third-party API used for card payments is down. This quick insight is only possible because monitoring shows that something is wrong and logging shows why.
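The threshold check behind such an alert can be sketched in a few lines. The function name `success_rate_alert` and the `min_samples` guard are assumptions added for illustration; real alerting tools usually evaluate rules like this over a time window.

```python
def success_rate_alert(successes, failures, threshold=0.95, min_samples=20):
    """Return an alert message if the success rate is below threshold, else None."""
    total = successes + failures
    # Assumption: skip very small samples so one early failure does not page anyone.
    if total < min_samples:
        return None
    rate = successes / total
    if rate < threshold:
        return f"ALERT: success rate {rate:.1%} is below {threshold:.0%}"
    return None
```

For example, 90 successes and 10 failures give a 90% success rate, which is below the 95% threshold and produces an alert message.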

Types of Metrics in Monitoring

  • System Metrics: CPU usage, memory usage, disk I/O
  • Application Metrics: Number of logins, signups, payment attempts
  • Custom Metrics: Business KPIs like number of active orders, average order value

What Tools Can We Use for Monitoring?

  • Prometheus: Time-series monitoring system that pulls (scrapes) metrics from HTTP endpoints exposed by each service.
  • Grafana: Visualization tool, commonly paired with Prometheus, for building dashboards and alerts.
  • Datadog / New Relic: SaaS-based solutions that offer APM and infrastructure monitoring.
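To make the pull model concrete, here is a minimal sketch of a service exposing counters at a `/metrics` endpoint in the Prometheus text exposition format, using only the standard library. The metric names, the `COUNTERS` dictionary, and the port are made up for illustration; in practice an official Prometheus client library would normally generate this output.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical counters: name -> (help text, current value).
COUNTERS = {
    "orders_placed_total": ("Orders placed since process start.", 0),
    "payments_failed_total": ("Failed payment attempts since process start.", 0),
}


def render_metrics(counters):
    """Render counters as Prometheus text exposition lines."""
    lines = []
    for name, (help_text, value) in sorted(counters.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrape requests on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint; a Prometheus server configured to scrape that address would then fetch the counters on its own schedule, which is what "pull" means here.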

What Should Be Logged?

For beginners, it’s helpful to understand what types of events should be logged. Examples:

  • When a user logs in
  • When an exception occurs
  • When an order fails to save to the database
  • When an API takes more than 2 seconds to respond
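Two of the events above, exceptions and slow responses, can be captured with a small decorator. This is a sketch using Python's standard `logging` module; the decorator name `log_slow_calls` and the logger name `api` are assumptions for illustration.

```python
import functools
import logging
import time

logger = logging.getLogger("api")


def log_slow_calls(threshold_seconds=2.0):
    """Log a warning when the wrapped call is slow, and an error when it raises."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                # logger.exception logs at ERROR level and includes the traceback.
                logger.exception("%s raised an exception", func.__name__)
                raise
            finally:
                elapsed = time.perf_counter() - start
                if elapsed > threshold_seconds:
                    logger.warning(
                        "%s took %.2fs (threshold %.2fs)",
                        func.__name__, elapsed, threshold_seconds,
                    )
        return wrapper
    return decorator
```

Applying `@log_slow_calls()` to a request handler leaves fast, successful calls silent and records only the cases worth investigating.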

Example: Logging in an E-Commerce App

Let’s say a customer places an order and the system crashes. The logs should contain entries like:

[INFO] Order placed for userID=123 at 10:30:15
[ERROR] PaymentService timeout after 5000ms
[WARN] Retrying payment process
[ERROR] OrderService failed to confirm order for userID=123

These logs help engineers pinpoint where the issue occurred, whether in the PaymentService or the OrderService, and take the necessary action.
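Entries in this style could be produced with Python's standard `logging` module, as in the sketch below. The `place_order` function, the `PaymentTimeout` exception, and the `charge` callable are hypothetical stand-ins for the real services. Python names the level `WARNING` by default, so the sketch renames it to `WARN` to match the sample output.

```python
import logging

logging.addLevelName(logging.WARNING, "WARN")  # match the [WARN] tag in the sample
logging.basicConfig(format="[%(levelname)s] %(message)s", level=logging.INFO)
log = logging.getLogger("orders")


class PaymentTimeout(Exception):
    """Hypothetical error raised when the payment provider does not answer in time."""

    def __init__(self, timeout_ms):
        super().__init__(f"timeout after {timeout_ms}ms")
        self.timeout_ms = timeout_ms


def place_order(user_id, charge, max_attempts=2):
    """Try to charge the user, logging each step; return True on success."""
    log.info("Order placed for userID=%s", user_id)
    for attempt in range(1, max_attempts + 1):
        try:
            charge(user_id)
            log.info("Payment confirmed for userID=%s", user_id)
            return True
        except PaymentTimeout as exc:
            log.error("PaymentService timeout after %dms", exc.timeout_ms)
            if attempt < max_attempts:
                log.warning("Retrying payment process")
    log.error("OrderService failed to confirm order for userID=%s", user_id)
    return False
```

Passing a `charge` function that always raises `PaymentTimeout(5000)` yields an INFO line, a timeout error, a retry warning, a second timeout error, and the final OrderService error.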

Question: Should we log everything?

Answer: No. Logging everything can lead to performance issues and unmanageable log storage. Instead, log:

  • Errors and exceptions
  • Important state transitions
  • Unusual or slow behavior

What Tools Can We Use for Logging?

  • Logstash: Used to process and ship logs
  • Elasticsearch: Stores and indexes logs for fast searching
  • Kibana: Visualizes logs and trends (used with Elasticsearch)
  • Fluentd: Log collection and forwarding tool

These tools are usually combined into a stack: ELK (Elasticsearch, Logstash, Kibana), or EFK (Elasticsearch, Fluentd, Kibana) when Fluentd replaces Logstash.

Example: Building a Log Dashboard

Imagine you're debugging why users are facing random logouts. You search the logs in Kibana using a query like:

level: "ERROR" AND message: "user session expired"

The search shows that a Redis instance (used to store sessions) had restarted, causing session loss. This root cause would be hard to find without centralized, searchable logs.
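To clarify what such a query does, the same filter can be approximated over structured log records in plain Python. This sketch assumes each entry is a dict with `level` and `message` fields; the sample records and the `service` field are made up, and the case-insensitive substring match is only a rough stand-in for Kibana's phrase matching.

```python
# Made-up sample records, for illustration only.
SAMPLE_LOGS = [
    {"level": "ERROR", "service": "AuthService",
     "message": "User session expired for userID=123"},
    {"level": "INFO", "service": "AuthService",
     "message": "user session expired check skipped"},
    {"level": "ERROR", "service": "SessionStore",
     "message": "Redis connection refused"},
]


def search_logs(records, level, phrase):
    """Return records with the given level whose message contains the phrase."""
    phrase = phrase.lower()
    return [
        r for r in records
        if r.get("level") == level and phrase in r.get("message", "").lower()
    ]


matches = search_logs(SAMPLE_LOGS, "ERROR", "user session expired")
```

Only the first sample record satisfies both conditions. Scanning the other ERROR entries from around the same time is what would surface the Redis problem.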

Best Practices for Monitoring and Logging

  • Set up alerts for critical metrics (error rate, latency)
  • Use structured logging (JSON format)
  • Redact sensitive data (passwords, tokens)
  • Use log rotation and retention policies
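Redaction and rotation can both be handled with Python's standard `logging` module, as sketched below. The regular expression assumes secrets appear as `password=...` or `token=...` in messages, which will not hold for every application, and the size and backup limits are arbitrary example values.

```python
import logging
import re
from logging.handlers import RotatingFileHandler

# Assumption: secrets show up as key=value pairs in the message text.
SENSITIVE = re.compile(r"(password|token)=\S+", re.IGNORECASE)


class RedactFilter(logging.Filter):
    """Replace password and token values before the record is written."""

    def filter(self, record):
        message = record.getMessage()  # message with arguments already merged in
        record.msg = SENSITIVE.sub(r"\1=[REDACTED]", message)
        record.args = ()               # arguments are already merged into msg
        return True                    # keep the record, now redacted


def build_file_handler(path, max_bytes=10_000_000, backups=5):
    """Create a handler that rotates the file by size and redacts secrets."""
    handler = RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backups, delay=True
    )
    handler.addFilter(RedactFilter())
    return handler
```

With these settings the handler keeps the current file plus five older ones. Once a file approaches 10 MB it is renamed and a new one is started, so the oldest data is eventually discarded.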

Question: What is structured logging and why is it useful?

Answer: Structured logging stores log data in a consistent format (like JSON), making it easier to parse, filter, and search logs. For example:

{
    "timestamp": "2025-05-03T10:15:00Z",
    "level": "ERROR",
    "service": "PaymentService",
    "message": "Failed to process payment",
    "order_id": "ORD1007",
    "user_id": "U512"
}
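One way to emit entries of this shape from Python is a custom formatter for the standard `logging` module. This is a sketch: the class name `JsonFormatter` is an assumption, it uses the logger name as the `service` field, and it copies only the two extra fields shown in the example.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object."""

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": record.name,  # assumption: logger name doubles as service name
            "message": record.getMessage(),
        }
        # Values passed through `extra=` become attributes on the record.
        for key in ("order_id", "user_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)
```

After attaching the formatter to a handler with `handler.setFormatter(JsonFormatter())`, a call such as `logger.error("Failed to process payment", extra={"order_id": "ORD1007", "user_id": "U512"})` produces one JSON object per line that log tools can index by field.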

Monitoring vs Logging: What’s the Difference?

Aspect     | Monitoring                          | Logging
Purpose    | Track system health and performance | Record detailed event history
Data Type  | Numerical metrics                   | Textual logs
Examples   | CPU usage, error rate               | Stack trace, request logs
Tools      | Prometheus, Datadog                 | ELK Stack, Fluentd

Conclusion

Monitoring and logging are foundational to maintaining healthy and reliable systems. For a beginner, understanding these tools early helps build systems that are easier to debug and scale. Always start by defining what you want to measure and what kinds of problems you want to catch, then build your logging and monitoring strategy around that.

Key Questions to Ask Yourself

  • What metrics define the health of my system?
  • What events should I log to make debugging easier?
  • How quickly can I detect and respond to an issue?

