Monitoring and Logging in System Design



Introduction to Monitoring and Logging

Monitoring and logging are essential parts of system design. They help engineers understand how a system behaves in real time, detect issues early, and troubleshoot problems effectively. For any large-scale or production-level application, observability is critical to ensuring reliability and maintaining user satisfaction.

What is Monitoring?

Monitoring is the process of collecting, analyzing, and using metrics to track the performance, health, and availability of a system. It helps answer questions like:

What is Logging?

Logging refers to the recording of events and messages that occur within an application or system. Logs contain detailed information such as error traces, user actions, system warnings, and other operational data that help in debugging and analysis.

Why Do We Need Monitoring and Logging?

Consider a web application that starts responding slowly or returns error pages. Without logs and monitoring data, it would be very difficult to understand what went wrong. But with proper logs and a monitoring dashboard, you can quickly identify:

Example: Monitoring a Food Delivery App

Let’s say you’ve built a food delivery app like Zomato or Swiggy. Your backend has multiple components like:

To monitor the health of this system, you might want to track metrics like:

These metrics would be visualized in dashboards using tools like Prometheus + Grafana or Datadog.

Question: What happens if the order success rate drops suddenly?

Answer: A monitoring alert would fire once the success rate drops below a configured threshold (say 95%). The operations team would then investigate. They might check the logs of the Payment Service and find that a third-party API used for card payments is down. This quick insight is only possible because monitoring flags the symptom and logging exposes the cause.
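The threshold check described above can be sketched in a few lines of Python. This is only an illustration, not how Prometheus or Datadog evaluate alerts internally: the `OrderWindow` class, the function names, and the choice to treat a window with no orders as healthy are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class OrderWindow:
    """Order counts seen during one monitoring window (hypothetical shape)."""
    succeeded: int
    failed: int


def success_rate(window: OrderWindow) -> float:
    """Return the percentage of orders that succeeded in the window."""
    total = window.succeeded + window.failed
    if total == 0:
        # Assumption: no traffic is treated as healthy, which also
        # avoids dividing by zero.
        return 100.0
    return 100.0 * window.succeeded / total


def should_alert(window: OrderWindow, threshold_pct: float = 95.0) -> bool:
    """True when the success rate falls below the alert threshold."""
    return success_rate(window) < threshold_pct


if __name__ == "__main__":
    window = OrderWindow(succeeded=90, failed=10)
    print(success_rate(window), should_alert(window))
```

In a real setup the counts would come from the metrics backend, and the alert rule would usually require the condition to hold for some time before notifying anyone, so that a brief dip does not page the team.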

Types of Metrics in Monitoring

What Tools Can We Use for Monitoring?

What Should Be Logged?

For beginners, it’s helpful to understand what types of events should be logged. Examples:

Example: Logging in an E-Commerce App

Let’s say a customer places an order and the system crashes. The logs should contain entries like:

[INFO] Order placed for userID=123 at 10:30:15
[ERROR] PaymentService timeout after 5000ms
[WARN] Retrying payment process
[ERROR] OrderService failed to confirm order for userID=123

These logs help engineers pinpoint where the issue occurred, whether in the PaymentService or the OrderService, and take the necessary action.
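As a rough sketch, entries in this style could be produced with Python's standard `logging` module. The logger name `order_demo` and the functions `build_logger`, `place_order`, and `run_demo` are hypothetical names invented for this example. Two differences from the sample above are worth noting: Python prints the level as `WARNING` rather than `WARN`, and the timestamp is left out to keep the output predictable.

```python
import io
import logging


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes '[LEVEL] message' lines to the stream."""
    logger = logging.getLogger("order_demo")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("[%(levelname)s] %(message)s"))
    logger.addHandler(handler)
    return logger


def place_order(logger: logging.Logger, user_id: int, payment_ok: bool) -> bool:
    """Simulate an order whose payment step may time out."""
    logger.info("Order placed for userID=%s", user_id)
    if not payment_ok:
        logger.error("PaymentService timeout after 5000ms")
        logger.warning("Retrying payment process")
        logger.error("OrderService failed to confirm order for userID=%s", user_id)
        return False
    logger.info("Order confirmed for userID=%s", user_id)
    return True


def run_demo() -> list:
    """Run the failing scenario and return the emitted log lines."""
    buffer = io.StringIO()
    place_order(build_logger(buffer), user_id=123, payment_ok=False)
    return buffer.getvalue().splitlines()


if __name__ == "__main__":
    print("\n".join(run_demo()))
```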

Question: Should we log everything?

Answer: No. Logging everything can lead to performance issues and unmanageable log storage. Instead, log:

What Tools Can We Use for Logging?

A common choice is the ELK stack (Elasticsearch, Logstash, Kibana) or the EFK stack (Elasticsearch, Fluentd, Kibana), which uses Fluentd in place of Logstash.

Example: Building a Log Dashboard

Imagine you're debugging why users are facing random logouts. You search the logs in Kibana using a query like:

level: "ERROR" AND message: "user session expired"

The search shows that a Redis instance (used to store sessions) had restarted, causing session loss. This root cause would be hard to find without centralized, searchable logs.
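To make the idea of the query concrete, here is a small Python sketch that applies the same two conditions to log records held in memory. It does not reproduce how Elasticsearch actually matches text; it uses a plain case-insensitive substring check. The function name and the sample records are invented purely for illustration.

```python
from typing import Iterable


def find_session_errors(records: Iterable[dict]) -> list:
    """Return ERROR records whose message mentions an expired user session."""
    phrase = "user session expired"
    return [
        record for record in records
        if record.get("level") == "ERROR"
        and phrase in record.get("message", "").lower()
    ]


# Hypothetical sample records, made up for this example only.
SAMPLE_LOGS = [
    {"level": "INFO", "service": "AuthService", "message": "User login succeeded"},
    {"level": "ERROR", "service": "AuthService", "message": "User session expired"},
    {"level": "WARN", "service": "Redis", "message": "Instance restarted"},
    {"level": "ERROR", "service": "PaymentService", "message": "Failed to process payment"},
]

if __name__ == "__main__":
    for hit in find_session_errors(SAMPLE_LOGS):
        print(hit["service"], "-", hit["message"])
```

A centralized log store performs this kind of filtering across every service at once, which is what makes it possible to line up the session errors with the Redis restart.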

Best Practices for Monitoring and Logging

Question: What is structured logging and why is it useful?

Answer: Structured logging stores log data in a consistent format (like JSON), making it easier to parse, filter, and search logs. For example:

{
  "timestamp": "2025-05-03T10:15:00Z",
  "level": "ERROR",
  "service": "PaymentService",
  "message": "Failed to process payment",
  "order_id": "ORD1007",
  "user_id": "U512"
}
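One possible way to produce an entry like this in Python is a custom formatter for the standard `logging` module. The sketch below is a minimal version under stated assumptions: `JsonFormatter`, `log_payment_failure`, the logger name, and the list of extra fields are all hypothetical, and production systems often use a dedicated structured-logging library instead.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Extra fields copied from the record when present (hypothetical names).
    EXTRA_FIELDS = ("service", "order_id", "user_id")

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for field in self.EXTRA_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


def log_payment_failure() -> dict:
    """Emit one structured log line and return it parsed back into a dict."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("structured_demo")  # hypothetical logger name
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    # Fields passed via `extra` become attributes on the log record.
    logger.error(
        "Failed to process payment",
        extra={"service": "PaymentService", "order_id": "ORD1007", "user_id": "U512"},
    )
    return json.loads(stream.getvalue())


if __name__ == "__main__":
    print(log_payment_failure())
```

Because every entry carries the same keys, a log store can filter on `order_id` or `service` directly instead of pattern-matching free text.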

Monitoring vs Logging: What’s the Difference?

Aspect    | Monitoring                          | Logging
Purpose   | Track system health and performance | Record detailed event history
Data Type | Numerical metrics                   | Textual logs
Examples  | CPU usage, error rate               | Stack trace, request logs
Tools     | Prometheus, Datadog                 | ELK Stack, Fluentd

Conclusion

Monitoring and logging are foundational to maintaining healthy and reliable systems. For a beginner, understanding these tools early helps build better systems that are easier to debug and scale. Always start by defining what you want to measure and what kind of problems you want to catch — then build your logging and monitoring strategy around that.

Key Questions to Ask Yourself


