Review & E2E Debugging#
Unit 7: Review & E2E Debugging | Topic Code: MS-U7-T1 | Reading Time: ~40 minutes
Learning Objectives#
Review the complete microservices architecture built throughout the course.
Apply the three pillars of observability (Logs, Traces, Metrics) in a holistic debugging scenario.
Trace a request's lifecycle from the API Gateway to the database and back using Jaeger.
Correlate traces from Jaeger with detailed logs in Kibana using a shared trace_id.
Diagnose performance bottlenecks and errors by analyzing data from Grafana, Jaeger, and Kibana in concert.
Section 1: Concept & Overview#
1.1 Introduction#
In the world of monolithic architecture, debugging is relatively simple. You have a single codebase, a single log file (or a few files), and can use a debugger to follow every execution step of a request. Everything sits inside a single "box". However, when moving to a microservices architecture, we trade this simplicity for scalability, flexibility, and team independence. A user request is no longer handled by a single application, but travels through a chain of services, each with its own codebase, logs, and even database.
The problem arises immediately: "What happens when a request fails or runs slowly?" Is the error in the API Gateway, the authentication service, the product service, or the payment service? Or does the problem lie in the network connecting them? Searching through dozens of separate log files to piece together the story of a single request is like looking for a needle in a haystack. This is the problem that End-to-End (E2E) Debugging and the concept of Observability were born to solve. Instead of "guessing," we equip the system with the ability to "tell its story," allowing us to observe, ask questions, and understand its internal state from the data it emits.
1.2 Formal Definition#
End-to-End (E2E) Debugging in microservices is the process of monitoring, analyzing, and diagnosing the processing flow of a single request across its entire journey, from initialization by the client, passing through services, databases, message queues, until returning the final response.
This process cannot be performed effectively without Observability. Observability is not merely monitoring. Monitoring tells you when the system has a problem (e.g., CPU usage 90%), while Observability helps you answer why the system has a problem. It is built upon three main pillars (The Three Pillars of Observability):
Logs: Record discrete events with timestamps. Logs provide the most detailed information about what happened at a specific moment in a specific service. Example: "User 'alice' logged in successfully at 2023-10-27T10:00:00Z".
Metrics: Numerical data aggregated over time, providing an overview of system health and performance. Example: number of requests per second, error rate, average latency.
Traces: Record the entire journey of a request as it passes through multiple services. A trace is made up of multiple spans, each representing a unit of work within a service (e.g., an API call, a database query). Traces are the most powerful tool for E2E Debugging.
By combining all three pillars, we can build a complete picture, from detecting issues at a high level (Metrics), identifying the problematic service (Traces), and finally finding the root cause in the code (Logs).
1.3 Analogy#
Imagine you are investigating a lost package case in a complex logistics system.
Metrics are like the logistics company's summary report. It tells you: "Today, 98% of packages were delivered on time", "The loss rate at hub X is 0.5%", "Average processing time at warehouse Y is 2 hours". Metrics help you detect "hotspots" (e.g., why is warehouse Y processing unusually slowly?).
Traces are like a security camera system tracking the journey of a specific package. It shows you: "Package A was received at the post office at 9 AM, arrived at hub X at 11 AM, left warehouse Y at 3 PM, and was handed to the shipper at 4 PM". If the package was stuck at warehouse Y for 4 hours, the trace will show exactly that. This is E2E tracking.
Logs are the detailed notebooks of each employee at each point. When the trace shows the package stuck at warehouse Y, you look at warehouse Y's logs during that time. You might see notes like: "13:05 - Barcode scanning system error", "14:10 - Unexpected power outage", "14:30 - Employee Z reported package A's wrapper torn". Logs provide the detailed context that explains why the delay happened.
Only by combining all three can you go from "loss rate increased" (metric) to "package A stuck at warehouse Y" (trace) and finally "due to barcode system error" (log) to thoroughly resolve the problem.
Section 2: Core Components#
2.1 Architecture overview#
To perform E2E Debugging, we need an "Observability Stack". Below is the reference architecture we have built throughout the course.
+-------------+ +----------------+ +------------------+ +-----------------+
| Client |----->| API Gateway |----->| Order Service |----->| Payment Service |
+-------------+ +-------+--------+ +--------+---------+ +--------+--------+
| | |
| | |
+--------------------v---------------------v----------------------v--------------------+
| OpenTelemetry Collector / Agents |
+--------------------+---------------------+-----------------------------------------+
| | |
| (Traces) | (Metrics) | (Logs)
v v v
+--------------+ +--------------+ +-----------------+
| Jaeger | | Prometheus | | Elasticsearch |
+--------------+ +--------------+ +-----------------+
^ ^ ^
| | |
+--------------+ +--------------+ +-----------------+
| Jaeger UI |<---- | Grafana |----->| Kibana |
+--------------+ +--------------+ +-----------------+
^
|
+---------+---------+
| Developer/SRE |
+-------------------+
Workflow:
Request Flow: The Client sends a request to the API Gateway. The Gateway routes to microservices (e.g., Order Service, Payment Service).
Data Generation: Each service is equipped with the OpenTelemetry SDK. This SDK automatically "instruments" HTTP calls, DB queries, etc., to generate Traces, Metrics, and Logs.
Data Collection: OpenTelemetry Collector (or agent) collects data from services.
Data Storage & Processing:
Traces are sent to Jaeger for storage and querying.
Metrics are scraped and stored by Prometheus as time-series data.
Logs (usually JSON formatted) are pushed to Elasticsearch via a log forwarder like Fluentd/Logstash.
Data Visualization & Analysis:
Grafana connects to Prometheus to graph metrics and to Jaeger to display traces.
Kibana provides a powerful interface to search, analyze, and visualize logs from Elasticsearch.
Jaeger UI is specialized for deep analysis of distributed traces.
2.2 Key Components#
Component 1: Structured Logs & trace_id
Definition: Logs written in a machine-readable format (like JSON) instead of plain text. Each log line contains key-value pairs, making search and filtering extremely efficient. Most importantly, every log line related to a specific request must contain the trace_id of that request.
Role: The trace_id acts like a "foreign key," allowing us to link dozens or hundreds of log lines from multiple different services together, all belonging to the same request.
Syntax (Example JSON log in Kibana):
{
"@timestamp": "2023-10-27T10:15:30.123Z",
"log.level": "INFO",
"message": "Processing payment for order",
"service.name": "payment-service",
"trace_id": "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6",
"span_id": "f1e2d3c4b5a6f7e8",
"order_id": "ORD-98765",
"amount": 99.99,
"currency": "USD"
}
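As a sketch of how such a log line can be produced, the following stdlib-only formatter renders each record as one JSON object per line, pulling the trace_id from the record. The class name `JsonFormatter` and the `extra=` convention are illustrative choices, not part of the course code.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, carrying the trace_id."""
    def format(self, record):
        return json.dumps({
            "@timestamp": self.formatTime(record),
            "log.level": record.levelname,
            "message": record.getMessage(),
            "service.name": "payment-service",
            # trace_id/span_id are attached per record via logging's `extra=`
            "trace_id": getattr(record, "trace_id", "no-trace-id"),
            "span_id": getattr(record, "span_id", "no-span-id"),
        })

logger = logging.getLogger("payment-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Processing payment for order",
            extra={"trace_id": "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6",
                   "span_id": "f1e2d3c4b5a6f7e8"})
```

Because every field is a key-value pair, Kibana can filter on `trace_id` or `service.name` directly instead of grepping free text.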
Component 2: Distributed Traces (Jaeger)
Definition: A trace represents the entire journey of a request. It is a collection of spans.
Trace: The entire request flow, identified by a unique trace_id.
Span: A unit of work within a service (e.g., executing a function, calling another API, querying the DB). Each span has its own span_id, shares the common trace_id, and can carry a parent_span_id to form a parent-child tree.
Role: Helps visualize the "Gantt chart" of a request. We can see which service calls which service, how long each step takes, and where the performance bottlenecks are.
Syntax (Example query in Jaeger UI): In the UI, you can filter by service.name, operation.name, tags (e.g., http.status_code=500), and time range.
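To make the trace/span relationship concrete, here is a small illustrative sketch (the span data is made up) that rebuilds the parent-child tree purely from the span_id / parent_span_id links, which is essentially what Jaeger does when it renders a trace:

```python
# Each span: (span_id, parent_span_id, operation, duration_ms);
# all spans of one trace share the same trace_id (omitted here).
spans = [
    ("s1", None, "POST /charge", 3050),
    ("s2", "s1", "validate-request", 10),
    ("s3", "s1", "call-external-psp", 3000),
]

def render_trace(spans):
    """Return the spans of one trace as an indented parent/child tree."""
    children = {}
    for span_id, parent, op, ms in spans:
        children.setdefault(parent, []).append((span_id, op, ms))
    lines = []
    def walk(parent, depth):
        for span_id, op, ms in children.get(parent, []):
            lines.append("  " * depth + f"{op} ({ms}ms)")
            walk(span_id, depth + 1)
    walk(None, 0)  # the root span has no parent_span_id
    return lines

print("\n".join(render_trace(spans)))
```

The root span is simply the one with no parent_span_id; every other span hangs off its parent, giving the waterfall view you see in the Jaeger UI.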
Component 3: Application Metrics (Prometheus & Grafana)
Definition: Indicators measuring application performance and health. Important metrics often follow the RED standard:
Rate: Number of requests per second.
Errors: Rate of failed requests.
Duration: Request processing time (latency), usually measured in percentiles like p50, p95, p99.
Role: Provides a 30,000-foot overview. The Grafana dashboard is the first place you look to answer "Is something wrong?" (e.g., the error rate spiking).
Syntax (Example PromQL in Grafana):
# Calculate the 99th percentile latency for order-service over the last 5 minutes
histogram_quantile(0.99, sum(rate(http_server_requests_seconds_bucket{service="order-service"}[5m])) by (le))
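To see what that query computes, here is a simplified Python re-implementation of the linear interpolation that histogram_quantile performs over cumulative buckets. The bucket values are hypothetical, and the real PromQL function additionally handles the +Inf bucket and several edge cases this sketch skips:

```python
def histogram_quantile(q, buckets):
    """Approximate a quantile from cumulative histogram buckets,
    mirroring PromQL's histogram_quantile() in the common case.
    buckets: list of (le_upper_bound, cumulative_count), sorted by le."""
    total = buckets[-1][1]
    rank = q * total
    lower = 0.0
    prev_count = 0
    for le, count in buckets:
        if count >= rank:
            # Linear interpolation inside the bucket containing the rank.
            return lower + (le - lower) * (rank - prev_count) / (count - prev_count)
        lower, prev_count = le, count
    return buckets[-1][0]

# Hypothetical request-latency buckets (seconds):
# 90 requests finished within 0.1s, 96 within 0.25s, and so on.
buckets = [(0.1, 90), (0.25, 96), (0.5, 99), (1.0, 100)]
print(histogram_quantile(0.99, buckets))  # → 0.5
```

This is why p99 is so sensitive to bucket boundaries: the answer is interpolated within whichever bucket the 99th-percentile rank lands in.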
2.3 Comparison of Approaches#
| Approach | Pros | Cons | When to use |
|---|---|---|---|
| Logs only | Easy to implement initially; detailed and specific. | Hard to correlate across services; hard to visualize request flow; high storage cost. | Debugging specific issues within a service, after that service has been identified as the problem. |
| Metrics only | Good overview; storage-efficient; good for alerting. | Lacks detail; doesn't show root cause; hard to debug a single request. | Monitoring system health, detecting abnormal trends, establishing SLOs/SLIs. |
| Traces only | Perfect for understanding request flow; easy to identify bottlenecks. | Can be resource-intensive (overhead); sampling might miss rare errors. | Diagnosing latency issues and errors in distributed systems. |
| All three combined (Observability) | Provides the full picture, from overview to detail; accelerates diagnosis and resolution. | Requires more complex setup and integration; needs new culture and skills in the team. | Any microservices system operating in production. This is the gold standard. |
Section 3: Implementation#
Level 1 - Basic (Beginner)#
Goal: Add trace_id to a Python (Flask) service's logs to allow searching all logs for a request.
We will create a simple middleware to generate a trace_id if it doesnât exist and put it into the request context. Then, we configure the logger to automatically add this trace_id to every log line.
# app.py
import logging
import uuid
from flask import Flask, request, g, has_request_context
# 1. Custom logging formatter to include our trace_id
class RequestFormatter(logging.Formatter):
    def format(self, record):
        # 'g' is only usable inside a request; guard so that logs emitted
        # outside a request (e.g., at startup) don't raise a RuntimeError
        if has_request_context():
            record.trace_id = g.get('trace_id', 'no-trace-id')
        else:
            record.trace_id = 'no-trace-id'
        return super().format(record)
# 2. Setup Flask app and logging
app = Flask(__name__)
handler = logging.StreamHandler()
# Add trace_id to our log format
formatter = RequestFormatter(
'[%(asctime)s] [%(levelname)s] [%(service_name)s] [%(trace_id)s] %(message)s'
)
handler.setFormatter(formatter)
# Add a filter to inject service_name
handler.addFilter(lambda record: setattr(record, 'service_name', 'order-service') or True)
app.logger.addHandler(handler)
app.logger.setLevel(logging.INFO)
# 3. Create a middleware to inject trace_id for each request
@app.before_request
def before_request_func():
# Get trace_id from header or generate a new one
trace_id = request.headers.get('X-Request-ID') or str(uuid.uuid4())
# Store it in Flask's global context 'g' for this request
g.trace_id = trace_id
@app.route('/order')
def create_order():
user_id = request.args.get('user_id')
if not user_id:
app.logger.warning("Request received without user_id")
return "user_id is required", 400
app.logger.info(f"Creating order for user_id: {user_id}")
# ... business logic here ...
app.logger.info(f"Order for user_id: {user_id} created successfully")
return {"status": "success", "trace_id": g.trace_id}
if __name__ == '__main__':
app.run(port=5001)
How to run and check:
1. Run the file: python app.py
2. Send a request: curl "http://localhost:5001/order?user_id=123"
3. Send another request with a custom trace_id: curl -H "X-Request-ID: my-custom-trace-id" "http://localhost:5001/order?user_id=456"
Expected Output (in terminal):
[2023-10-27 11:30:05,123] [INFO] [order-service] [a1b2c3d4-e5f6-a7b8-c9d0-e1f2a3b4c5d6] Creating order for user_id: 123
[2023-10-27 11:30:05,124] [INFO] [order-service] [a1b2c3d4-e5f6-a7b8-c9d0-e1f2a3b4c5d6] Order for user_id: 123 created successfully
[2023-10-27 11:30:15,456] [INFO] [order-service] [my-custom-trace-id] Creating order for user_id: 456
[2023-10-27 11:30:15,457] [INFO] [order-service] [my-custom-trace-id] Order for user_id: 456 created successfully
You will see all logs of the same request sharing a common trace_id.
Common Errors:
Error: AttributeError: 'g' object has no attribute 'trace_id'
Reason: The before_request middleware hasn't run or is misconfigured, so g.trace_id is not set when the logger tries to access it.
Fix: Ensure the @app.before_request decorator is placed correctly and the function executes. Use a default value, g.get('trace_id', 'default-value'), to avoid the error.
Level 2 - Intermediate#
Goal: Propagate trace_id (context propagation) from Order Service to Payment Service.
We will extend the previous example. Order Service will call an API of Payment Service, and we need to ensure trace_id is passed via HTTP Header.
order_service.py (modified from above)
# order_service.py
# ... (logging setup is the same as Level 1) ...
import requests
@app.route('/order')
def create_order():
user_id = request.args.get('user_id')
app.logger.info(f"Creating order for user_id: {user_id}")
# Propagate the trace_id to the downstream service
headers = {'X-Request-ID': g.trace_id}
try:
app.logger.info("Calling Payment Service to process payment")
# Call payment service, passing the header
payment_response = requests.get(
'http://localhost:5002/pay',
headers=headers,
params={'order_id': 'ORD123'}
)
payment_response.raise_for_status() # Raise exception for 4xx/5xx status
app.logger.info("Payment successful")
except requests.exceptions.RequestException as e:
app.logger.error(f"Payment service call failed: {e}")
return {"status": "error", "message": "payment failed"}, 500
app.logger.info(f"Order for user_id: {user_id} created successfully")
return {"status": "success", "trace_id": g.trace_id}
payment_service.py (new service)
# payment_service.py
# ... (logging setup is the same as Level 1, but change service_name to 'payment-service') ...
# Note: The logging setup code needs to be copied here as well.
# Make sure to change 'order-service' to 'payment-service' in the filter.
# (For brevity, the setup code is omitted here)
@app.route('/pay')
def process_payment():
order_id = request.args.get('order_id')
# The middleware `before_request` will automatically extract 'X-Request-ID'
app.logger.info(f"Processing payment for order_id: {order_id}")
# ... payment logic ...
app.logger.info(f"Payment for order_id: {order_id} processed")
return {"status": "payment_processed"}
How to run and check:
1. Run payment_service.py in one terminal (its app.run should use port=5002): python payment_service.py
2. Run order_service.py in another terminal (app.run uses port=5001): python order_service.py
3. Send a request to the Order Service: curl "http://localhost:5001/order?user_id=123"
Expected Output (combined from both terminals):
# Terminal 1: order-service
[2023-10-27 12:00:10,100] [INFO] [order-service] [b1c2d3e4-f5a6-47b8-99c0-d1e2f3a4b5c6] Creating order for user_id: 123
[2023-10-27 12:00:10,101] [INFO] [order-service] [b1c2d3e4-f5a6-47b8-99c0-d1e2f3a4b5c6] Calling Payment Service to process payment
[2023-10-27 12:00:10,250] [INFO] [order-service] [b1c2d3e4-f5a6-47b8-99c0-d1e2f3a4b5c6] Payment successful
[2023-10-27 12:00:10,251] [INFO] [order-service] [b1c2d3e4-f5a6-47b8-99c0-d1e2f3a4b5c6] Order for user_id: 123 created successfully
# Terminal 2: payment-service
[2023-10-27 12:00:10,150] [INFO] [payment-service] [b1c2d3e4-f5a6-47b8-99c0-d1e2f3a4b5c6] Processing payment for order_id: ORD123
[2023-10-27 12:00:10,151] [INFO] [payment-service] [b1c2d3e4-f5a6-47b8-99c0-d1e2f3a4b5c6] Payment for order_id: ORD123 processed
Success! trace_id b1c2... appeared in logs of both services, allowing us to piece together the story perfectly.
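The propagation pattern used by both services can be factored into a small stdlib-only helper. This is a sketch with illustrative function names; the header name X-Request-ID is the one used in the examples above.

```python
import uuid

TRACE_HEADER = "X-Request-ID"  # the header name used in the services above

def ensure_trace_id(incoming_headers):
    """Reuse the caller's trace_id if present, otherwise start a new trace."""
    return incoming_headers.get(TRACE_HEADER) or str(uuid.uuid4())

def outgoing_headers(trace_id, extra=None):
    """Build headers for a downstream call, propagating the current trace_id."""
    headers = dict(extra or {})
    headers[TRACE_HEADER] = trace_id
    return headers

# Edge service: no incoming header, so a new trace id is generated.
tid = ensure_trace_id({})
# Downstream service: receives the header and continues the same trace.
downstream_tid = ensure_trace_id(outgoing_headers(tid))
```

Every service in the chain calls ensure_trace_id on the way in and outgoing_headers on the way out, which is exactly why the same id shows up in both terminals above.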
Level 3 - Advanced#
Goal: Use OpenTelemetry to automatically generate and propagate Traces, while integrating trace_id into Logs and creating custom Metrics.
This is the professional and production-ready way. We no longer manage trace_id manually but let the library handle it.
# advanced_app.py
import logging
from flask import Flask, request
import requests
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
# 1. --- OpenTelemetry Setup ---
# Setup provider and processor. We'll export to console for this example.
# In production, you would use OTLPSpanExporter to send to Jaeger/Collector.
provider = TracerProvider()
processor = BatchSpanProcessor(ConsoleSpanExporter())
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
# Get a tracer for this service
tracer = trace.get_tracer(__name__)
# 2. --- Flask App Setup ---
app = Flask(__name__)
# Instrument Flask to automatically create spans for each request
FlaskInstrumentor().instrument_app(app)
# Instrument requests library to propagate context and create spans for outgoing calls
RequestsInstrumentor().instrument()
# 3. --- Logging Setup with Trace/Span ID injection ---
def get_trace_id():
# Helper to get current trace_id from the context
span = trace.get_current_span()
if span and span.get_span_context().is_valid:
return format(span.get_span_context().trace_id, '032x')
return 'no-trace-id'
# Custom formatter to add trace_id to logs
# (This is similar to Level 1, but gets trace_id from OpenTelemetry)
# ... logging setup code using get_trace_id() ...
@app.route('/order')
def create_order_advanced():
# A new span will be automatically created for this request by the instrumentor
current_span = trace.get_current_span()
    user_id = request.args.get('user_id', 'unknown')  # default avoids a None attribute
    current_span.set_attribute("user.id", user_id)  # add custom attribute to the span
app.logger.info("Starting order creation")
# This HTTP call will also be automatically traced
# The trace context (trace_id) is propagated automatically!
requests.get('https://api.example.com/inventory/check', params={'item': 'book'})
# Create a custom inner span for a specific business logic
with tracer.start_as_current_span("process_internal_logic") as logic_span:
logic_span.add_event("Starting complex calculation")
# ... do some work ...
logic_span.add_event("Finished complex calculation")
app.logger.info("Order created")
return {"status": "created", "trace_id": get_trace_id()}
if __name__ == '__main__':
    # ... (logging setup omitted here for brevity) ...
app.run(port=5001)
Expected Output (in the terminal, from ConsoleSpanExporter): You will see JSON output describing the generated spans. Note that the exporter actually prints each finished span as a separate JSON object; the nested "children" view below is a condensed illustration of the parent/child relationships.
{
"name": "GET /order",
"context": {
"trace_id": "0x4a1db...e4c",
"span_id": "0x2b...f9",
...
},
"kind": "SpanKind.SERVER",
"parent_id": null,
"start_time": "...",
"end_time": "...",
"attributes": {
"http.method": "GET",
"http.route": "/order",
"user.id": "123"
},
...
"children": [
{
"name": "GET",
"context": { "trace_id": "0x4a1db...e4c", "span_id": "0x3c...a1", ... },
"parent_id": "0x2b...f9",
"attributes": { "http.url": "https://api.example.com/inventory/check", ... }
},
{
"name": "process_internal_logic",
"context": { "trace_id": "0x4a1db...e4c", "span_id": "0x4d...b2", ... },
"parent_id": "0x2b...f9",
"events": [ { "name": "Starting complex calculation" }, ... ]
}
]
}
This output, when sent to Jaeger, will be visualized as a waterfall chart, showing that the GET /order span is the parent of both the GET span (the inventory call) and the process_internal_logic span.
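The logging setup the Level 3 example elides ("... logging setup code using get_trace_id() ...") could look like the following stdlib-only sketch. The class name TraceIdFormatter is our own, and the stand-in provider lambda would be replaced by the get_trace_id helper from the example so the formatter stays decoupled from OpenTelemetry itself:

```python
import logging

class TraceIdFormatter(logging.Formatter):
    """Asks a callable for the current trace_id at format time."""
    def __init__(self, fmt, trace_id_provider):
        super().__init__(fmt)
        self._trace_id_provider = trace_id_provider

    def format(self, record):
        # Resolve the trace_id lazily, per log record
        record.trace_id = self._trace_id_provider()
        return super().format(record)

# Stand-in provider so this sketch runs without OpenTelemetry installed;
# in the real app, pass the get_trace_id function from the example above.
formatter = TraceIdFormatter(
    "[%(levelname)s] [%(trace_id)s] %(message)s",
    trace_id_provider=lambda: "deadbeef",
)
handler = logging.StreamHandler()
handler.setFormatter(formatter)
```

Wiring the provider in as a callable means the same formatter works in Level 1 (reading from Flask's g) and in Level 3 (reading from the OpenTelemetry span context).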
Section 4: Best Practices#
✅ DO's - Recommended#
| Practice | Why | Example |
|---|---|---|
| Use structured logging (JSON) | Logs can be queried, filtered, and aggregated automatically by tools like Kibana; structured data is easier to analyze than plain text. | A JSON log line with message, trace_id, and order_id fields, as in Section 2.2 |
| Use OpenTelemetry | An open standard with a large community; provides SDKs and automatic instrumentation for many languages/frameworks, reducing setup effort. | FlaskInstrumentor().instrument_app(app) |
| Add context to spans and logs | Business attributes (e.g., user_id, order_id) make telemetry far easier to search and to correlate with real incidents. | current_span.set_attribute("user.id", user_id) |
| Configure sampling in production | Tracing 100% of requests at high traffic creates large overhead; sampling (e.g., 1-10% of traffic) collects enough data for analysis without impacting performance. | A ratio-based sampler keeping ~10% of traces |
❌ DON'Ts - Avoid#
| Anti-pattern | Consequence | How to avoid |
|---|---|---|
| Log sensitive info (PII) | Leaks user personal data (passwords, credit cards, emails), violating security regulations like GDPR. | Filter and mask sensitive fields before logging; use log-scanning tools to detect PII. |
| Create metrics with high cardinality | Labels with unbounded values (like user_id or trace_id) explode the number of time series and overload the metrics store. | Only use labels with finite value sets (e.g., http.method, http.status_code, service name). |
| Rely on a single pillar | Logs alone won't show bottlenecks; metrics alone won't reveal causes; traces alone lack detail. | Build a culture of using Grafana, Jaeger, and Kibana together during troubleshooting. |
| Log too much or too little | Too little gives insufficient info to debug; too much (especially DEBUG level in production) wastes storage and buries important info. | Set appropriate log levels per environment; use INFO for key events, WARN/ERROR for issues needing attention. |
🔒 Security Considerations#
PII Leakage: Absolutely do not put sensitive info like passwords, API keys, tokens, personal info into logs or span attributes. Mask or remove them.
Access Control: Observability stack UIs (Grafana, Kibana, Jaeger) must be protected by authentication and authorization. Do not expose them to the Internet without protection.
Injection Attacks: Be careful when logging user input. Although less common, improperly formatted log strings can cause display issues or even vulnerabilities if processed by another tool.
⚡ Performance Tips#
Asynchronous Exporting: Always use asynchronous exporters (e.g., BatchSpanProcessor) so sending telemetry data doesn't block the application's main thread.
Sampling: In high-traffic environments, apply sampling. Common strategies are tail-based sampling (decide whether to keep a trace after it completes, e.g., based on whether it contains errors) and ratio-based sampling (keep a fixed percentage).
Collector Architecture: Deploy the OpenTelemetry Collector as an agent on each node or as a centralized gateway to offload the application and optimize data shipping.
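As a sketch of the ratio-based strategy, the keep/drop decision can be derived deterministically from the trace_id itself, so every service in the chain makes the same decision and traces are never half-collected. The function below is our own illustration; OpenTelemetry's TraceIdRatioBased sampler is built on the same principle.

```python
def should_sample(trace_id_hex, ratio):
    """Deterministic ratio-based sampling: the same trace_id always yields
    the same keep/drop decision across all services."""
    # Treat the low 64 bits of the trace_id as a uniform random value.
    value = int(trace_id_hex, 16) & ((1 << 64) - 1)
    return value < ratio * (1 << 64)

# With well-distributed trace ids, roughly `ratio` of traces are kept.
trace_ids = [format(i * 0x9E3779B97F4A7C15, "032x") for i in range(1, 10001)]
kept = sum(should_sample(t, 0.10) for t in trace_ids)
```

Because the decision is a pure function of the trace_id, no coordination between services is needed; this is what makes head-based ratio sampling so cheap compared to tail-based sampling.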
Section 5: Case Study#
5.1 Scenario#
Company/Project: "E-Shop Now", a growing e-commerce site.
Requirements: Payment system must successfully process 99.9% of requests with latency under 500ms (p99).
Constraints: System built on a microservices architecture including Frontend Gateway, Order Service, Payment Service, and Notification Service. Recently, the team received many user complaints about the checkout page "spinning" for a long time, occasionally reporting errors.
5.2 Problem Analysis#
Support team confirmed the issue but didnât know the root cause.
Hypothesis 1 (Dev A): "Maybe the Payment Service database is slow."
Hypothesis 2 (Dev B): "Maybe the Notification Service timed out sending confirmation emails."
Hypothesis 3 (SRE): "I see Order Service CPU spiking occasionally; could be that."
Instead of guessing, the team decided to use the observability stack to investigate systematically.
5.3 Solution Design#
3-step investigation plan, using 3 pillars:
Step 1: The "What" (Metrics - Grafana): Confirm the issue and its impact scope. View the overview dashboard to answer: "Is the problem real? Where and when does it happen?"
Step 2: The "Where" (Traces - Jaeger): Find a specific slow or failed request. Analyze its trace to identify exactly which service and operation is the bottleneck.
Step 3: The "Why" (Logs - Kibana): With the trace_id from Jaeger, search all related logs in Kibana to read the detailed story and find the root cause.
5.4 Implementation#
Step 1: Analyze Metrics on Grafana
The team opens the "Service Health" dashboard. They see two abnormal graphs:
The Payment Service endpoint /charge's "API Latency (p99)" graph skyrocketed from ~200ms to ~3000ms.
The same endpoint's "Error Rate" graph also climbed from 0.1% to 5%.
Conclusion: The problem is definitely in the Payment Service. The Dev B and SRE hypotheses are less likely.
Step 2: Tracing on Jaeger
Team goes to Jaeger UI, filters traces for payment-service with operation POST /charge, status=error, and duration > 2s. They immediately find many traces like this:
payment-service: POST /charge (3050ms)
├── validate-request (10ms)
├── get-user-data (50ms)
├── call-external-psp (3000ms)  <-- BOTTLENECK
│   └── ... (timeout error)
└── save-transaction-record (5ms)
The trace shows that almost the entire time (3000ms) is consumed by the call-external-psp span (the call to the third-party payment service provider). That span ends with the tag error=true.
Conclusion: The bottleneck is the API call to the payment partner. Dev A's hypothesis (DB) is eliminated.
Step 3: Analyze Logs on Kibana
The team copies the trace_id from the slow trace in Jaeger (e.g., e8a9f0b1c2d3e4f5...) and pastes it into the Kibana search bar with the query:
trace.id: "e8a9f0b1c2d3e4f5..."
The result is the sequence of logs from the Payment Service for that request:
{"@timestamp": "...", "service.name": "payment-service", "trace.id": "e8a9...", "message": "Starting payment charge for order ORD-456"}
{"@timestamp": "...", "service.name": "payment-service", "trace.id": "e8a9...", "message": "Calling external PSP API at https://psp.vendor.com/v2/charge"}
{"@timestamp": "...", "service.name": "payment-service", "trace.id": "e8a9...", "log.level": "ERROR", "message": "Failed to call PSP API", "error.details": "ConnectTimeout: HTTPSConnectionPool(host='psp.vendor.com', port=443): Max retries exceeded with url: /v2/charge (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x...>, 'Connection to psp.vendor.com timed out. (connect timeout=3)'))"}
The last log line points to the root cause: a ConnectTimeoutError. Their service could not connect to the partner's API within 3 seconds.
5.5 Results & Lessons Learned#
Root Cause: Payment partner (PSP) API performance degraded, causing timeout.
Immediate Action: Team temporarily switched to backup payment provider and contacted main partner to report incident.
Metrics Improvement:
The p99 latency of the /charge endpoint returned to < 200ms.
The error rate dropped to near 0%.
Lessons Learned:
Observability is mandatory: Without this stack, the team could have lost days or weeks finding the issue, severely impacting revenue and reputation.
Importance of trace_id: trace_id is the red thread connecting everything. It allows a seamless transition from "what" (metric) to "where" (trace) to "why" (log).
Set Reasonable Timeouts: External calls must have strict timeouts and fallback mechanisms so the system doesn't "hang" on a partner.
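The timeout-plus-fallback lesson can be sketched as follows. All names here are hypothetical; in the real services, primary and backup would wrap HTTP calls to the PSPs with strict timeouts (e.g., requests.post(url, json=payload, timeout=3)).

```python
def charge_with_fallback(primary, backup, order):
    """Try the primary PSP; on timeout or connection failure, fail over to
    the backup provider instead of letting the checkout hang."""
    try:
        return primary(order)
    except (TimeoutError, ConnectionError):
        # In production: log the failure with the trace_id and emit a
        # fallback metric so Grafana shows the switch immediately.
        return backup(order)

def flaky_primary(order):
    # Simulates the incident from the case study: the primary PSP times out.
    raise TimeoutError("Connection to psp.vendor.com timed out. (connect timeout=3)")

def backup_psp(order):
    return {"status": "charged", "provider": "backup"}

result = charge_with_fallback(flaky_primary, backup_psp, {"order_id": "ORD-456"})
```

With this pattern in place, a degraded partner turns into a provider switch and a metric spike rather than a checkout page that spins for three seconds and fails.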