I want to tell you about the worst three days of my engineering career.
It started with a Slack message from our support team:
"Users are complaining that the AI is giving wrong answers."
That was it.
There were no error codes, stack traces, or even reproducible steps. Just... the AI was wrong.
I opened our logs and checked: the request arrived at 10:47 AM, the response went out at 10:49 AM, and the status was 200.
Everything looked perfect. The system had done exactly what it was supposed to do: accept a question, process it, and return an answer.
Except this time, the answer was completely wrong, and I had absolutely no idea why.
I spent the next three days in debugging hell. I tried reproducing the issue locally, but of course, the AI gave different answers each time. I added more logging, but I was logging the wrong things. I stared at the database queries that returned the right data. I reviewed prompts that looked fine. I checked the API responses that seemed correct. Every individual piece worked, but the whole system didn't.
On day three, I finally found it. The problem was in our chunking strategy.
We were splitting documents in the middle of sentences, so when users asked certain questions, the retrieved context was grammatically incomplete and semantically meaningless. The AI was doing its best with garbage input, producing garbage output.
That was how I spent three days on a chunking bug. Not because the bug was hard to fix, but because I couldn't see what was happening inside the system.
I was debugging blind.
That experience fundamentally changed how I think about AI observability. Traditional debugging doesn't work for AI systems. The tools and practices we've developed over decades of software engineering, the stack traces, error logs, and breakpoints, aren't enough. AI systems fail differently, and they need to be observed differently.
In this article, we'll dig into observability for AI systems: the fundamental problem, why your current logging strategy will fail you, how to build a proper observability stack, and finally, the debugging workflow that actually works.
The Fundamental Problem
Here's what makes AI debugging so uniquely frustrating:
Traditional applications have the decency to crash when something goes wrong. They throw exceptions. They return error codes. They give you a stack trace pointing to the exact line where things went sideways.
You might not immediately know how to fix the problem, but at least you know where it is.
AI systems are different; they don't do this. They keep running. They keep returning responses. They keep returning status 200. But the responses are wrong, and nothing in your standard monitoring tells you this is happening.
Think about what happens in a RAG pipeline when something goes wrong. Maybe your embedding model is poorly suited for your domain, so queries about technical concepts get matched with vaguely related but ultimately unhelpful documents.
The system will never throw an error, and your vector database will keep returning the wrong results each time.
The LLM receives this irrelevant context and does what LLMs do: it generates a plausible-sounding response based on what it was given. The response is confident. The response is articulate. The response is wrong.
From the outside, everything looks fine. Your request latency is normal. Your error rate is zero. Your uptime is 100%. Meanwhile, users are getting incorrect information and losing trust in your product, and you have no idea it's happening until someone complains.
This is the core challenge of AI observability: you're not looking for crashes, you're looking for degradation. You're not hunting for errors, you're hunting for quality problems. And quality problems are sneaky. They don't announce themselves. They hide in the space between "working" and "working well."
Why Your Current Logging Strategy Is Failing You
When I first started building AI systems, I logged what I always logged: inputs and outputs. Request came in with this question, response went out with this answer. Basic request/response logging, same as any API.
This is almost useless for AI debugging.
The problem is that AI systems aren't simple request/response pipelines. They're multi-step workflows where each step transforms data in ways that affect downstream steps.
A RAG query might involve embedding the question, searching a vector database, reranking the results, constructing a prompt, calling the LLM, and post-processing the response.
That's six distinct operations, each with its own potential failure modes, and if you're only logging the first input and final output, you're blind to everything in between.
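Sketched as code, the pipeline looks something like this. The helper implementations below are trivial stand-ins (the real services don't matter here); the point is the shape: six steps, and if you only log the question and the return value, steps one through five are invisible.

```javascript
// Stand-in helpers; each represents a real service call in production.
const embedQuery   = async (q) => q.split(/\s+/).map(w => w.length); // fake embedding
const vectorSearch = async (vec) => [{ text: 'chunk A', score: 0.82 }];
const rerank       = async (q, chunks) => chunks;
const buildPrompt  = (q, chunks) =>
  `Context: ${chunks.map(c => c.text).join('\n')}\nQuestion: ${q}`;
const callLLM      = async (prompt) => `Answer based on: ${prompt.slice(0, 40)}...`;
const postProcess  = (raw) => raw.trim();

// Six distinct operations, each a potential failure point.
async function answerQuestion(question) {
  const embedding = await embedQuery(question);      // 1. embed the question
  const chunks    = await vectorSearch(embedding);   // 2. search the vector DB
  const ranked    = await rerank(question, chunks);  // 3. rerank the results
  const prompt    = buildPrompt(question, ranked);   // 4. construct the prompt
  const raw       = await callLLM(prompt);           // 5. call the LLM
  return postProcess(raw);                           // 6. post-process the response
}
```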
When something goes wrong, you need to know:
Was the embedding generated correctly?
What chunks did the vector search return, and what were their similarity scores?
Did reranking change the order?
What context ended up in the prompt?
How many tokens were used?
What did the LLM actually see when it generated its response?
Without this information, debugging becomes guesswork. You're trying to figure out which of six potential failure points caused the problem, but you can only see the beginning and end. It's like trying to debug a function when you can only see the input parameters and return value, not any of the intermediate computations.
I learned this lesson the hard way during those three days of debugging. My logs showed the question and the answer. They didn't show that the retrieved chunks had unusually low similarity scores, which would have immediately pointed me toward a retrieval problem.
They didn't show that the context contained incomplete sentences, which would have pointed me toward chunking.
I had to manually add this logging, re-deploy, wait for similar queries, and then analyze the results. Three days of detective work that would have taken three minutes with proper observability.
Building an Observability Stack for AI Systems
After that experience, I completely redesigned how I approach AI observability. I now think of it in four layers, each building on the one below:
Structured Logging
The foundation is structured logging with AI-specific context. This isn't just "log more stuff." It's logging the right stuff, in the right format, at the right points in your pipeline. Every operation that touches AI, such as embedding, retrieval, reranking, prompt construction, and generation, needs its own log entry with all the relevant context.
For an embedding operation, you need to capture which model you used, how many tokens were in the input, how long it took, and ideally some identifier that lets you correlate this with other operations in the same request.
For retrieval, you need the query, the number of results, the similarity scores of what you retrieved, and the time taken. For generation, you need the model, input tokens, output tokens, cost, and latency.
The key insight is that each log entry should be self-contained enough that, if something goes wrong, you can look at that entry and understand what happened at that step. You shouldn't need to cross-reference five different log lines to piece together the story.
Here's what a properly structured log entry looks like for a retrieval operation:
{
  "trace_id": "abc-123-def-456",
  "timestamp": "2026-02-24T14:30:22.847Z",
  "operation": "vector_retrieval",
  "service": "rag-service",
  "query_text": "What is the refund policy for digital products?",
  "embedding_model": "text-embedding-3-small",
  "vector_db": "pinecone",
  "index": "knowledge-base-prod",
  "top_k_requested": 10,
  "results_returned": 10,
  "similarity_scores": [0.91, 0.87, 0.85, 0.82, 0.79, 0.76, 0.71, 0.68, 0.65, 0.61],
  "top_result_preview": "Digital products are eligible for refund within 14 days of purchase...",
  "latency_ms": 147,
  "metadata": {
    "user_id": "user-789",
    "session_id": "session-012"
  }
}
With this level of detail, when a user reports a wrong answer, I can query my logs for that trace ID and immediately see what happened. If the similarity scores are all below 0.7, I know retrieval failed to find relevant content.
If the top result preview doesn't match the query topic, I know there's a mismatch somewhere in my embeddings or indexing.
This time, I'm not guessing. I'm diagnosing.
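Assuming your logs are newline-delimited JSON in the shape shown above, the triage query is trivial. In production this would be a search in your log backend, but the filter logic is the same:

```javascript
// Pull every log entry for one request out of newline-delimited JSON logs.
function findTrace(logLines, traceId) {
  return logLines
    .map(line => JSON.parse(line))
    .filter(entry => entry.trace_id === traceId);
}

// Flag retrievals where even the best match scored below the threshold,
// i.e. the retriever never found genuinely relevant content.
function retrievalLooksWeak(entry, threshold = 0.7) {
  return entry.operation === 'vector_retrieval' &&
         entry.similarity_scores[0] < threshold;
}
```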
Distributed Tracing
The second layer is distributed tracing, which connects all these individual log entries into a coherent timeline. A trace shows you the entire journey of a request through your system:
“It started here, went through these operations in this order, took this long at each step, and ended here.”
Traces are invaluable because they show you not just what happened, but the sequence and timing.
I use OpenTelemetry for this, though there are other options. The key is that every operation creates a "span" with a start time, end time, and relevant attributes, and all spans in the same request share a trace ID.
When I'm debugging, I can pull up a trace and see a visual timeline of exactly how the request was processed.
Metrics
The third layer is metrics: aggregated measurements over time that tell you about system health. While logs and traces help you debug individual requests, metrics help you understand patterns.
They answer questions such as:
What's my average retrieval similarity score this hour?
How has latency changed over the past week?
What percentage of queries are hitting my cache?
Metrics turn individual observations into trends.
For AI systems, you need metrics that traditional APM tools don't provide out of the box. I track things like:
Average retrieval quality scores
Confidence score distributions
Token usage by model
Cost per request
Cache hit rates
Hallucination detection rates.
These metrics tell me when something is degrading before users start complaining.
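Most of these metrics reduce to the same mechanic: record observations, then aggregate them over a recent window. Prometheus does this properly, with histograms and labels; a deliberately minimal sketch of the mechanic looks like this:

```javascript
// Toy windowed-average metric: record observations, report the average
// over the most recent window. Good enough to show the mechanics that
// real metrics systems (Prometheus etc.) implement for you.
class WindowedAverage {
  constructor(windowMs = 15 * 60 * 1000) {
    this.windowMs = windowMs;
    this.points = []; // { t, value }
  }

  observe(value, t = Date.now()) {
    this.points.push({ t, value });
  }

  // Average of all observations inside the window ending at `now`,
  // or null if the window is empty.
  average(now = Date.now()) {
    const recent = this.points.filter(p => now - p.t <= this.windowMs);
    if (recent.length === 0) return null;
    return recent.reduce((sum, p) => sum + p.value, 0) / recent.length;
  }
}
```

Feed it every retrieval's top similarity score and you have "average retrieval quality this window" for free.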
Alerting
The fourth layer is alerting: automated notifications when metrics cross certain thresholds.
If my average retrieval similarity score drops below 0.7 for ten minutes, I want to know immediately. If my hourly AI spend exceeds my budget, I want an alert. If my error rate spikes, I want to be paged.
The goal of this four-layer stack is simple:
I never want to be surprised by an AI problem again. I want to catch degradation before users notice, and when users do report issues, I want to diagnose the root cause in minutes, not days.
The Debugging Workflow That Actually Works
Let me walk you through how I now debug AI issues, using the observability stack I just described.
A support ticket comes in:
"User asked about shipping times and got information about return policies instead."
This is a classic symptom. The AI answered confidently, but answered the wrong question.
Step 1:
First, I find the trace ID. Every response from my AI system includes a trace ID in the response metadata, so support can include it in tickets. If they didn't include it, I can search my logs by user ID and timestamp to find the relevant trace.
Step 2:
Once I have the trace ID, I pull up the full trace. I'm looking at a timeline that shows me:
Query received
Embedding generated
Vector search executed
Reranking applied
Prompt constructed
LLM called
Response returned.
Each step shows its duration and key attributes.
Step 3:
I start from the beginning. The query was "How long does shipping take?"
That's correct; it matches what the user asked. The embedding was generated in 45ms using text-embedding-3-small.
Then I look at vector retrieval:
10 results returned
Top score 0.73
Top result preview "Our return policy allows..."
The top score is 0.73, which is on the lower end. And the top result preview is about returns, not shipping.
That's the problem right there.
The retrieval step failed to find relevant content about shipping, and instead returned the most similar content it could find, which happened to be about returns.
Step 4:
Now I know where to look. Why didn't retrieval find shipping content?
I can think through a few possibilities:
Maybe there's no content about shipping in the knowledge base.
Maybe the content exists, but wasn't embedded correctly.
Maybe there's a mismatch between how the content was chunked and how the query was embedded.
I search my knowledge base index for "shipping" to make sure the content exists.
I look at the embedding for that content compared to the query embedding, and they should be similar. If they're not, I might have an embedding model mismatch or a chunking problem.
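"Similar" here means cosine similarity between the two vectors, a diagnostic worth keeping as a small utility. A score near 1.0 means the vectors point the same way; near 0 means they're unrelated:

```javascript
// Cosine similarity between two embedding vectors of the same length:
// dot product divided by the product of the vector magnitudes.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```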
In this case, I discovered that the shipping information was embedded as part of a larger document about order processing, and the chunk that contains shipping details also contains a lot of content about order tracking, payment processing, and other topics.
The embedding for that chunk reflects all those topics, diluting the "shipping" signal.
Four steps in, I've identified the root cause: a poor chunking strategy causing relevant content to be embedded with too much surrounding context.
The fix is straightforward: re-chunk the order processing document into more focused sections, including one specifically about shipping times.
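The principle behind the fix is easy to sketch: split on sentence boundaries and pack sentences into chunks up to a size budget, instead of cutting at a fixed character offset. This version is deliberately naive (real chunkers also respect headings and paragraphs), but it never cuts mid-sentence:

```javascript
// Split text on sentence boundaries, then pack whole sentences into
// chunks no larger than maxChars. Chunks stay grammatically complete.
function chunkBySentence(text, maxChars = 500) {
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
  const chunks = [];
  let current = '';
  for (const s of sentences) {
    // Start a new chunk if adding this sentence would exceed the budget.
    if (current && (current.length + s.length) > maxChars) {
      chunks.push(current.trim());
      current = '';
    }
    current += s;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```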
Total debugging time: about fifteen minutes. Compare that to three days of blind guessing.
Practical Implementation
I want to share some specific implementation patterns that have worked well for me, because the concepts are only useful if you can actually build them.
For logging, I created a simple class that wraps all my AI operations and automatically logs with the right structure.
Every time I make an embedding call, a retrieval call, or an LLM call, it goes through this wrapper, which captures timing, token counts, scores, and other relevant metadata. The wrapper also propagates trace IDs, so all operations in a request are connected.
const crypto = require('node:crypto'); // built-in, for randomUUID()

class AILogger {
  constructor(serviceName) {
    this.serviceName = serviceName;
  }

  createTrace(requestId) {
    return new AITrace(requestId, this.serviceName);
  }
}

class AITrace {
  constructor(requestId, serviceName) {
    this.traceId = requestId || crypto.randomUUID();
    this.serviceName = serviceName;
    this.spans = [];
    this.startTime = Date.now();
  }

  span(operation, data = {}) {
    const span = {
      traceId: this.traceId,
      operation,
      timestamp: new Date().toISOString(),
      elapsed_ms: Date.now() - this.startTime,
      service: this.serviceName,
      ...data
    };
    this.spans.push(span);
    // Send to your logging backend
    console.log(JSON.stringify(span));
    return span;
  }

  // Specific logging methods for common AI operations
  logEmbedding(model, inputTokens, latencyMs) {
    return this.span('embedding', {
      step: 'embedding',
      model,
      input_tokens: inputTokens,
      latency_ms: latencyMs
    });
  }

  logRetrieval(query, results, latencyMs) {
    return this.span('retrieval', {
      step: 'retrieval',
      query_length: query.length,
      chunks_retrieved: results.length,
      top_score: results[0]?.score,
      bottom_score: results[results.length - 1]?.score,
      scores: results.map(r => r.score),
      latency_ms: latencyMs
    });
  }

  logReranking(beforeCount, afterCount, latencyMs) {
    return this.span('reranking', {
      step: 'reranking',
      chunks_before: beforeCount,
      chunks_after: afterCount,
      latency_ms: latencyMs
    });
  }

  logGeneration(model, inputTokens, outputTokens, latencyMs) {
    return this.span('generation', {
      step: 'generation',
      model,
      input_tokens: inputTokens,
      output_tokens: outputTokens,
      latency_ms: latencyMs,
      cost_usd: this.estimateCost(model, inputTokens, outputTokens)
    });
  }

  logPrompt(systemPrompt, userPrompt, context) {
    return this.span('prompt_construction', {
      step: 'prompt',
      system_prompt_tokens: this.countTokens(systemPrompt),
      user_prompt_tokens: this.countTokens(userPrompt),
      context_tokens: this.countTokens(context),
      // Store first 500 chars for debugging (not full prompt for privacy)
      context_preview: context.substring(0, 500)
    });
  }

  logResponse(response, confidence) {
    return this.span('response', {
      step: 'response',
      response_tokens: this.countTokens(response),
      confidence_score: confidence,
      response_preview: response.substring(0, 200)
    });
  }

  estimateCost(model, inputTokens, outputTokens) {
    // Prices per 1K tokens; update as provider pricing changes
    const pricing = {
      'gpt-4': { input: 0.03, output: 0.06 },
      'gpt-4-turbo': { input: 0.01, output: 0.03 },
      'gpt-3.5-turbo': { input: 0.0005, output: 0.0015 },
      'claude-3-sonnet': { input: 0.003, output: 0.015 }
    };
    const p = pricing[model] || pricing['gpt-3.5-turbo'];
    return ((inputTokens / 1000) * p.input) + ((outputTokens / 1000) * p.output);
  }

  countTokens(text) {
    // Rough estimation: ~4 chars per token
    return Math.ceil((text?.length || 0) / 4);
  }

  complete() {
    return this.span('trace_complete', {
      total_duration_ms: Date.now() - this.startTime,
      span_count: this.spans.length
    });
  }
}
The critical insight is to log at decision points, not just at entry and exit. When my system decides to use a particular embedding model, I log that decision and why. When reranking changes the order of results, I log the before and after rankings. When I decide to fall back to a different model because the primary one is slow, I log that fallback decision. These decision points are often where bugs hide, and having visibility into them makes debugging much faster.
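For example, a fallback decision might be captured like this. generateWithFallback, the injected log function, and the callModel callback are illustrative names, not part of any library; the model choices mirror the pricing table earlier:

```javascript
// Try the primary model with a deadline; on failure or timeout, log the
// decision itself (which model, why we switched) before falling back.
async function generateWithFallback(prompt, log, callModel, primaryTimeoutMs = 5000) {
  try {
    return await Promise.race([
      callModel('gpt-4-turbo', prompt),
      new Promise((_, reject) => {
        const t = setTimeout(() => reject(new Error('primary timeout')), primaryTimeoutMs);
        if (t.unref) t.unref(); // don't keep the process alive for the timer
      })
    ]);
  } catch (err) {
    // The decision is the interesting log entry, not just the final answer.
    log({
      operation: 'model_fallback',
      from: 'gpt-4-turbo',
      to: 'gpt-3.5-turbo',
      reason: err.message
    });
    return callModel('gpt-3.5-turbo', prompt);
  }
}
```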
For metrics, I use Prometheus with Grafana, but the specific tools matter less than the metrics you choose to track.
My dashboard shows me six key AI metrics at a glance:
Average retrieval similarity score (quality)
P95 generation latency (performance)
Cache hit rate (efficiency)
Hourly token usage by model (cost)
Error rate by operation (reliability)
Requests per minute (volume).
These six numbers give me a quick health check of my AI system. If any of them look unusual, I dig deeper.
For alerting, I've learned to alert on trends, not just thresholds. It's not very useful to alert when the retrieval score drops below 0.7, because scores fluctuate query-by-query, and I'd get a lot of noise.
Instead, I alert when the average retrieval score over a 15-minute window drops below 0.75, which indicates a genuine degradation rather than a few unlucky queries.
Similarly, I alert on the rate of change. For example, if my costs are increasing 50% faster than my request volume, something is probably wrong even if the absolute numbers are still within budget.
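That rate-of-change check is only a couple of lines once you keep per-window totals. The field names here are assumptions; the windows could be consecutive hours:

```javascript
// Trend alert: fire when cost is growing meaningfully faster than traffic.
// `prev` and `curr` are totals for two consecutive windows.
function costGrowthAlert(prev, curr, maxRatio = 1.5) {
  const costGrowth = curr.costUsd / prev.costUsd;
  const volumeGrowth = curr.requests / prev.requests;
  // Alert if cost grew more than maxRatio times as fast as volume did.
  return costGrowth > volumeGrowth * maxRatio;
}
```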
The Tools
People often ask me which observability tools to use. Honestly, the specific tools matter less than the practice of instrumenting your code properly. That said, here are the tools I've had good experiences with:
OpenTelemetry with Jaeger or Datadog works well for general-purpose tracing and metrics. These are battle-tested tools with good ecosystem support.
Langfuse is excellent if you want an open-source option for LLM-specific observability, and LangSmith works well if you're using LangChain.
Helicone and Portkey are good choices if you want a proxy-based approach that adds observability without changing your code.
The most important thing is to start with something. Imperfect observability is infinitely better than no observability. You can always improve your tooling later, but if you're flying blind, you're accumulating debugging debt with every deployment.
I want to be honest:
Building proper AI observability takes time. Instrumenting your code, setting up dashboards, and configuring alerts is not a trivial amount of work. You might be tempted to skip it, especially when you're trying to ship features quickly.
Don't skip it.
The time you invest in observability pays off exponentially when something goes wrong. And in AI systems, something will go wrong.
Models behave unexpectedly. Embeddings drift over time. Retrieval quality degrades as your knowledge base grows. These aren't edge cases; they're inevitable parts of operating AI in production.
When these problems happen, you have a choice: spend days debugging blind, or spend minutes with good observability.
The three days I spent debugging that chunking issue? That was time I could have spent building features. That was time my team spent frustrated and unproductive. That was time users spent getting wrong answers and losing trust.
Good observability isn't just a technical practice. It's a business investment. It's the difference between AI systems you can confidently operate and AI systems that feel like ticking time bombs.