# Assignment: LLM Observability with LangFuse & LangSmith

## Assignment Metadata
| Field | Description |
|---|---|
| Assignment Name | LLM Observability Implementation |
| Course | LLMOps and Evaluation |
| Project Name | |
| Estimated Time | 120 minutes |
| Framework | Python 3.10+, LangChain, LangFuse, LangSmith, OpenAI API |
## Learning Objectives

By completing this assignment, you will be able to:

- Configure LangFuse and LangSmith for LLM application tracing
- Implement callback handlers to capture execution flows
- Track token usage, latency, and costs per request
- Debug LLM chains using trace visualization and playgrounds
- Apply production best practices for sampling, PII handling, and alerting
## Problem Description

You are building a production-ready RAG chatbot application. Without observability, you face:

- **Black box execution**: no visibility into retrieval and generation steps
- **Cost overruns**: inability to track spending per user or feature
- **Performance issues**: difficulty identifying latency bottlenecks
- **Quality problems**: no systematic way to collect and analyze feedback

Your task is to instrument this application with comprehensive observability.
## Technical Requirements

### Environment Setup

- Python 3.10 or higher
- Required packages:
  - `langfuse>=2.0.0`
  - `langchain>=0.1.0`
  - `langchain-openai>=0.0.5`
  - `openai>=1.0.0`

### Accounts Required

- LangFuse Cloud account (free tier) OR a Docker setup for self-hosting
- LangSmith account (free tier available)
- OpenAI API key
## Tasks

### Task 1: LangFuse Integration (30 points)

**Set up the LangFuse environment:**

- Create a LangFuse Cloud account or deploy locally with Docker
- Configure API keys and environment variables
- Verify connectivity
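The configuration step above can be sketched like this (the key values are placeholders; copy the real ones from your LangFuse project settings):

```python
import os

# Placeholder credentials -- replace with the values from your
# LangFuse project settings (Settings -> API Keys).
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-your-public-key"
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-your-secret-key"
# Point this at http://localhost:3000 if you self-host with Docker.
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"
```

With the langfuse 2.x SDK you can then verify connectivity with `Langfuse().auth_check()`, which returns `True` when the keys and host are valid.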
**Implement tracing for a LangChain application:**

- Create a RAG chain with retrieval and generation steps
- Add a `CallbackHandler` to capture all traces
- Verify traces appear in the LangFuse dashboard
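One possible shape for the traced chain, assuming langfuse 2.x and a recent langchain; the prompt, model name, and inputs are illustrative:

```python
from langfuse.callback import CallbackHandler
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# The handler reads the LANGFUSE_* variables from the environment.
langfuse_handler = CallbackHandler()

prompt = ChatPromptTemplate.from_template(
    "Answer using only this context:\n{context}\n\nQuestion: {question}"
)
chain = prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()

# Passing the handler via config traces every step of the chain.
answer = chain.invoke(
    {
        "context": "LangFuse is an open-source LLM observability platform.",
        "question": "What is LangFuse?",
    },
    config={"callbacks": [langfuse_handler]},
)
```

In a real RAG chain the `context` input would come from your retriever; each retrieval and generation step then appears as a nested span in the LangFuse dashboard.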
**Implement cost tracking:**

- Capture token usage for each LLM call
- Calculate costs based on model pricing
- Display the cost breakdown per session
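LangFuse computes costs automatically for models it knows about, but the underlying arithmetic is worth implementing yourself for the per-session breakdown. A minimal sketch (the per-million-token prices are illustrative placeholders, not authoritative):

```python
# Illustrative prices in USD per one million tokens -- verify against
# current provider pricing before relying on these numbers.
PRICING = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one LLM call, from its token counts."""
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

def session_cost(calls: list[dict]) -> float:
    """Total cost across all calls in a session."""
    return sum(
        call_cost(c["model"], c["input_tokens"], c["output_tokens"]) for c in calls
    )
```

Token counts for each call are available in the model response metadata (and in the traces themselves), so the session breakdown can be assembled from either source.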
**Document:**

- Screenshot of trace visualization in LangFuse
- Cost breakdown for at least 10 queries
### Task 2: LangSmith Integration (30 points)

**Configure LangSmith auto-tracing:**

- Set environment variables for automatic instrumentation
- Create a project for your application
- Verify traces are captured
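LangSmith's auto-instrumentation is driven entirely by environment variables; a sketch (the API key is a placeholder, and the project name is your choice):

```python
import os

# Any LangChain code run after this is traced automatically.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "lsv2_your-api-key"  # placeholder
os.environ["LANGCHAIN_PROJECT"] = "rag-chatbot-observability"
```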
**Build a RAG pipeline with detailed tracing:**

- Implement the document retrieval step
- Implement the LLM generation step
- Capture intermediate states
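For code outside LangChain's runnables, the `langsmith` SDK's `@traceable` decorator captures each step as a run. A sketch with a toy in-memory retriever and a stubbed generation step (both are stand-ins for your real components):

```python
from langsmith import traceable

DOCS = [
    "LangSmith is a platform for tracing and evaluating LLM applications.",
    "RAG combines document retrieval with LLM generation.",
]

@traceable(name="retrieve")
def retrieve(question: str) -> list[str]:
    # Toy keyword match; a real pipeline would query a vector store.
    words = question.lower().split()
    return [d for d in DOCS if any(w in d.lower() for w in words)]

@traceable(name="generate")
def generate(question: str, context: list[str]) -> str:
    # Stub for the LLM call; swap in your model client here.
    return f"Answer based on {len(context)} retrieved documents."

@traceable(name="rag_pipeline")
def rag_pipeline(question: str) -> str:
    context = retrieve(question)  # appears as a child run in the trace
    return generate(question, context)
```

Because the decorators nest, calling `rag_pipeline(...)` produces one trace with the retrieval and generation steps, including their intermediate inputs and outputs, as child runs.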
**Use the Playground for debugging:**

- Identify a failed or low-quality response
- Open the trace in the Playground
- Modify the prompt and re-run
- Document the improvement
**Create a test dataset:**

- Export 5 production traces to a dataset
- Run an evaluation on the dataset
- Compare results across prompt versions
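Exporting traces into a dataset can be done from the LangSmith UI or programmatically; a hedged sketch with the `langsmith` client (project and dataset names are examples, and exact client method signatures may vary across SDK versions):

```python
import itertools

from langsmith import Client

client = Client()

# Create a dataset and copy five recent root runs into it.
dataset = client.create_dataset(dataset_name="rag-regression-set")
runs = itertools.islice(
    client.list_runs(project_name="rag-chatbot-observability", is_root=True), 5
)
for run in runs:
    client.create_example(
        inputs=run.inputs,
        outputs=run.outputs,
        dataset_id=dataset.id,
    )
```

Once the dataset exists, run each prompt version against it with LangSmith's evaluation tooling and compare the results side by side.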
### Task 3: Comparison Analysis (20 points)

Compare LangFuse vs. LangSmith based on your experience:
| Feature | LangFuse | LangSmith | Your Assessment |
|---|---|---|---|
| Setup complexity | | | |
| Trace visualization | | | |
| Cost tracking | | | |
| Debugging tools | | | |
| Self-hosting option | | | |
Write a recommendation (200-300 words):

- Which tool would you choose for different scenarios?
- What are the key trade-offs?
### Task 4: Production Best Practices (20 points)

**Implement sampling:**

- Configure 100% tracing for development
- Configure 5% sampling for production simulation
- Add a "High Importance" flag for error traces
**Implement PII handling:**

- Create a masking function for sensitive data
- Apply it to traces before sending them to observability tools
- Test with sample PII data
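A minimal masking sketch using regex patterns for the three PII types the hints mention (the patterns are deliberately simple; production systems usually combine regexes with dedicated PII-detection tooling):

```python
import re

# Ordered: card numbers first, so a 16-digit card is not partially
# consumed by the looser phone pattern.
PII_PATTERNS = [
    (re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"), "[CARD]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
    (re.compile(r"\+?\d{1,3}[ -]?\(?\d{3}\)?[ -]?\d{3}[ -]?\d{4}\b"), "[PHONE]"),
]

def mask_pii(text: str) -> str:
    """Replace emails, phone numbers, and card numbers with tokens
    before the text is attached to a trace."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text
```

Apply the function to inputs and outputs before they are handed to the tracing callbacks, so raw PII never leaves your application.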
**Design an alerting strategy:**

- Define thresholds for error rate, latency, and cost
- Document alert rules (pseudo-code or tool configuration)
- Create a runbook for each alert type
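The threshold definitions can be documented as executable pseudo-code; the numbers below are illustrative starting points, not recommendations:

```python
from dataclasses import dataclass

@dataclass
class AlertThresholds:
    """Illustrative thresholds -- tune to your actual traffic."""
    max_error_rate: float = 0.05       # alert above 5% failed requests
    max_p95_latency_s: float = 10.0    # alert above 10 s p95 latency
    max_hourly_cost_usd: float = 5.0   # alert above $5/hour LLM spend

def check_alerts(metrics: dict, t: AlertThresholds = AlertThresholds()) -> list[str]:
    """Return the names of the alerts that fire for one metrics window."""
    fired = []
    if metrics["error_rate"] > t.max_error_rate:
        fired.append("error_rate")
    if metrics["p95_latency_s"] > t.max_p95_latency_s:
        fired.append("latency")
    if metrics["hourly_cost_usd"] > t.max_hourly_cost_usd:
        fired.append("cost")
    return fired
```

Each name returned by `check_alerts` maps to one runbook entry: who is notified, what to check first, and how to mitigate.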
## Submission Requirements

### Required Deliverables

- Source code (Jupyter notebook or Python scripts)
- `README.md` with setup and configuration instructions
- Screenshots of LangFuse traces and dashboard
- Screenshots of LangSmith traces and Playground usage
- Comparison analysis document
- Production best practices implementation
### Submission Checklist

- [ ] LangFuse traces are captured and visible
- [ ] LangSmith auto-tracing is working
- [ ] Cost tracking is implemented
- [ ] Playground debugging is demonstrated
- [ ] Comparison analysis is complete
- [ ] Production best practices are documented
## Evaluation Criteria

| Criteria | Points |
|---|---|
| LangFuse integration & tracing | 30 |
| LangSmith integration & debugging | 30 |
| Comparison analysis quality | 20 |
| Production best practices | 15 |
| Code quality and documentation | 5 |
| **Total** | **100** |
## Hints

- Start with LangSmith, since it requires minimal code changes (just environment variables)
- Use LangFuse's prompt management for version control of prompts
- When comparing tools, focus on real usage scenarios from your experience
- For PII masking, consider regex patterns for emails, phone numbers, and credit cards
- Set up alerts using webhook integrations or existing monitoring tools