
Silent Failures: Why a “Successful” LLM Workflow Can Cost 40% More

Mrunmay · AI Engineer · 9 min read

Your agent returns the right answer. The status is 200 OK, and the user walks away satisfied. On the surface, everything looks fine. But when you check the API bill, it doesn’t line up with how simple the task actually was.

LLMs are unusually resilient. When a tool call fails, they don’t stop execution. They try again with small variations. When a response looks off, they adjust and keep going. That behavior is often helpful, but it can also hide broken execution paths. The user sees a successful result, while your token usage quietly absorbs retries, fallbacks, and extra reasoning that never needed to happen.

Silent failures

The Illusion of Success

When an agent returns the correct output and the logs are clean, we assume the logic is sound. However, LLM resilience introduces a new debugging challenge.

  • Standard Software: Invalid parameters trigger immediate exceptions. You see the stack trace and fix the bug.

  • LLMs: If a tool call fails, the workflow doesn't crash with an error. Most agent SDKs have a built-in retry mechanism that retries with new arguments, switches strategies, or forces a solution.

This resilience masks architectural issues. The agent produces the correct output while quietly absorbing retries and extra reasoning steps.

Standard observability tools catch crashes but often miss these silent performance leaks. A "successful" run looks identical to an optimized one on a dashboard, even if it performed three times the necessary work.
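To make the mechanism concrete, here is a minimal, hypothetical sketch of an agent loop that absorbs tool errors. The helpers propose_arguments and fallback_search stand in for model calls; none of this is a real SDK's implementation.

# Minimal, hypothetical sketch of an agent loop that absorbs tool errors.
# "propose_arguments" and "fallback_search" stand in for model calls; they
# are not part of any real SDK.

def call_tool_resiliently(tool, propose_arguments, fallback_search, max_attempts=3):
    args = propose_arguments(error=None)  # the model's first guess (one LLM call)
    for _ in range(max_attempts):
        try:
            return tool(**args)           # happy path: one call and we're done
        except Exception as err:          # e.g. a schema ValidationError
            # The failure is fed back to the model, which guesses new arguments.
            # Nothing surfaces as an error: the run still ends in "success",
            # but every loop adds an LLM call and more tokens to the context.
            args = propose_arguments(error=str(err))
    # After repeated failures, give up on the structured tool entirely and
    # fall back to a generic (slower, pricier) path.
    return fallback_search()

Every pass through that loop looks like normal agent behavior from the outside, which is exactly why the waste never shows up on a dashboard.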

The Suspect: A Slow Travel Agent

To make this concrete, consider a simple travel planning agent. It takes a destination, travel dates, and a few preferences, then generates a five-day itinerary.

From a functional perspective, the agent behaves as expected. Each run produces a reasonable itinerary that matches the user’s request, and there are no visible errors or user complaints.

The problem shows up when you look at the metrics:

  • Output: Correct itinerary
  • Status: 200 OK
  • Time Taken: 361 seconds (over 6 minutes)
  • Cost: $0.068 per run

For a task of this scope, both the time taken and the cost were unusually high. A closer look raised an obvious question: why did generating a straightforward itinerary require 49 separate LLM calls?

The Investigation (Using vLLora MCP)

At this point, the problem wasn’t correctness — it was understanding how the agent arrived at the result. Manually tracing through dozens of JSON logs would have been slow and error-prone, especially given the number of model calls involved.

Instead, we used the vLLora MCP server to inspect the most recent agent run. MCP exposes trace data as structured tools, which means a coding agent can reason about execution flow, tool calls, and model behavior directly — without parsing raw logs or switching to a separate dashboard.

We asked the coding agent:

Use vLLora MCP to inspect the most recent agent run and explain why it produced this result.

The agent inspected the latest traces and summarized what actually happened during the run. While the execution was marked successful, the trace revealed repeated failed attempts to call the same tool.

Specifically:

  • The agent retried the same tool call multiple times with adjusted parameters
  • Each failure was handled internally without surfacing an error
  • A fallback path eventually produced the correct result
  • The extra retries directly inflated both latency and cost

Because the run completed successfully, none of this appeared in error metrics. The inefficiency only becomes visible when you inspect the execution path itself rather than the final outcome.
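Mechanically, "inspecting the execution path" comes down to scanning trace spans for patterns like repeated failed calls to the same tool. Here is a rough, self-contained sketch; the span format is hypothetical and does not reflect vLLora's actual trace schema, which the MCP server reasons over for you.

# Illustrative only: a tiny scan over trace spans to surface silent retries.
# The span structure below is hypothetical, not vLLora's actual trace format.

from collections import Counter

def find_silent_retries(spans: list[dict]) -> Counter:
    """Count failed calls per tool in a run that still ended with 200 OK."""
    failures = Counter()
    for span in spans:
        if span.get("type") == "tool_call" and span.get("status") == "error":
            failures[span.get("tool_name", "unknown")] += 1
    return failures

# Example: three failed research_accommodations calls in an otherwise "green" run.
spans = [
    {"type": "tool_call", "tool_name": "research_accommodations", "status": "error"},
    {"type": "tool_call", "tool_name": "research_accommodations", "status": "error"},
    {"type": "tool_call", "tool_name": "research_accommodations", "status": "error"},
    {"type": "tool_call", "tool_name": "tavily_search", "status": "ok"},
]
print(find_silent_retries(spans))  # Counter({'research_accommodations': 3})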

The Reveal: The Parameter Mismatch

The MCP analysis pointed to a very specific failure pattern. This wasn’t a logic bug or a model hallucination. It was a naming mismatch between the parameter names the model assumed and the ones the tool schema actually required.

The agent was effectively stuck in a validation loop.

Attempt 1

The model called research_accommodations using camelCase parameters such as checkInDate.

  • Result: ValidationError
  • Reason: The tool schema expected snake_case parameter names.

Attempt 2

After observing the failure, the model retried with a lowercase variation: checkindate.

  • Result: ValidationError
  • Reason: The parameter name still did not match the schema.

Attempt 3

The model simplified further, removing part of the name and trying check_in.

  • Result: ValidationError
  • Reason: Still not a valid parameter.

After multiple failed attempts, the agent abandoned the structured tool entirely.

Fallback path

The model fell back to a generic search call:

tavily_search("hotels in Tokyo")

This fallback produced usable results, which is why the overall run completed successfully and returned a 200 OK. However, that success came at a cost. The trace showed 21 wasted tool calls and thousands of input tokens consumed by repeated retries, error messages, and recovery logic.

From the outside, the agent looked healthy. Under the hood, it was working much harder than it needed to.
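The failure mode is easy to reproduce outside the agent. Below is a small sketch with Pydantic v2, assuming a schema shaped like the model shown in the next section; extra fields are forbidden here so the wrong names are rejected explicitly (the original model rejects them anyway because the required check_in_date is missing).

# Standalone reproduction of the retry-inducing failure (Pydantic v2).
# The model mirrors the AccommodationSearch schema shown in the next section;
# extra="forbid" is added here so bad parameter names fail loudly.

from pydantic import BaseModel, ConfigDict, Field, ValidationError

class AccommodationSearch(BaseModel):
    model_config = ConfigDict(extra="forbid")
    check_in_date: str = Field(description="Check-in date in YYYY-MM-DD format")

for bad_args in ({"checkInDate": "2025-03-01"},
                 {"checkindate": "2025-03-01"},
                 {"check_in": "2025-03-01"}):
    try:
        AccommodationSearch.model_validate(bad_args)
    except ValidationError as err:
        print(f"{list(bad_args)}: {err.error_count()} validation error(s)")

AccommodationSearch.model_validate({"check_in_date": "2025-03-01"})  # succeeds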

The Fix: Delegating to the Agent

Once the MCP analysis identified the root cause (ambiguous docstrings), there was no need to manually search through the codebase or write the fix by hand. We delegated the change to the coding agent.

From Cursor, we asked:

Update the research_accommodations tool definition.
Make the check_in_date parameter explicitly require snake_case to prevent retry loops.

The agent located the relevant Pydantic model and updated the docstrings to remove any ambiguity for the model.

The Code Change

Before: Ambiguous

from pydantic import BaseModel, Field

class AccommodationSearch(BaseModel):
    """Search for hotels and accommodations."""
    check_in_date: str = Field(
        description="Check-in date in YYYY-MM-DD format"
    )

The description specified the value format, but left the parameter name open to interpretation.

After: Explicit

class AccommodationSearch(BaseModel):
    """
    Search for hotels.
    IMPORTANT: All parameters must be in snake_case.
    """
    check_in_date: str = Field(
        description="Check-in date (YYYY-MM-DD). Strictly use parameter name: 'check_in_date'."
    )

Explicitly stating the required parameter name removed the ambiguity that caused the retry loop.
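A quick way to sanity-check a change like this is to look at the JSON schema the model actually receives. The sketch below assumes the updated AccommodationSearch model above and Pydantic v2, where the class docstring and field descriptions flow into model_json_schema().

# Verify that the added guidance reaches the schema the model is shown.
# Assumes the updated AccommodationSearch model defined above (Pydantic v2).

import json

schema = AccommodationSearch.model_json_schema()
print(json.dumps(schema, indent=2))
# The top-level "description" now carries the class docstring (the snake_case
# warning), and properties["check_in_date"]["description"] carries the explicit
# parameter-name instruction that tool-calling frameworks typically pass to the model.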

With the fix applied, we cleared the agent context and ran the exact same travel planning task again to verify the results.

Measuring the Impact

To compare the two runs, we asked the coding agent to analyze both traces side by side and summarize the differences.

The Prompt

Compare the performance of the bad run 4ea18f79-4c4c-4d2c-b628-20d510af7181 against the fixed run a5cf084b-01b2-4288-acef-aa2bedc31426. Show me a table of Latency, Cost, and Token Usage differences.

The agent analyzed the telemetry from both traces and generated this comparison:

| Metric        | Bad Run (4ea18f79)        | Fixed Run (a5cf084b)      | Difference      | Improvement     |
|---------------|---------------------------|---------------------------|-----------------|-----------------|
| Latency       | 361.21 seconds (6.02 min) | 194.66 seconds (3.24 min) | -166.55 seconds | 46.1% faster    |
| Total Cost    | $0.0683                   | $0.0430                   | -$0.0254        | 37.1% cheaper   |
| LLM Calls     | 49 calls                  | 28 calls                  | -21 calls       | 42.9% fewer     |
| Input Tokens  | 114,162                   | 64,608                    | -49,554         | 43.4% reduction |
| Output Tokens | 14,916                    | 10,691                    | -4,225          | 28.3% reduction |
| Total Tokens  | 129,078                   | 75,299                    | -53,779         | 41.7% reduction |
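The percentages follow directly from the raw run metrics; here is a quick sketch using the (rounded) values from the table, so the last digit can differ slightly from the table, which was computed from unrounded telemetry.

# Recomputing the improvement percentages from the two runs' metrics.
# Inputs are the rounded values from the table above.

bad   = {"latency_s": 361.21, "cost_usd": 0.0683, "llm_calls": 49, "total_tokens": 129_078}
fixed = {"latency_s": 194.66, "cost_usd": 0.0430, "llm_calls": 28, "total_tokens": 75_299}

for metric in bad:
    saved = bad[metric] - fixed[metric]
    print(f"{metric}: saved {saved:g} ({saved / bad[metric]:.1%})")
# latency_s: saved 166.55 (46.1%)
# cost_usd: saved 0.0253 (37.0%)
# llm_calls: saved 21 (42.9%)
# total_tokens: saved 53779 (41.7%)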

Impact at Scale

For a single run, saving 2 cents might seem negligible. But at production scale, "silent failures" are a massive budget leak.

Based on these numbers, an agent running 1,000 times a day would see the following (the arithmetic is sketched after this list):

  • Annual Savings: ~$9,271/year

  • Processing Time Saved: ~46 hours per day

  • Token Reduction: ~54 million tokens/day
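The projection is simple arithmetic over the per-run deltas from the comparison above; a quick sketch of where the headline numbers come from:

# Projecting the per-run savings to production scale (1,000 runs/day).

RUNS_PER_DAY = 1_000

cost_saved_per_run = 0.0254      # USD
seconds_saved_per_run = 166.55
tokens_saved_per_run = 53_779

print(f"Annual savings:   ${cost_saved_per_run * RUNS_PER_DAY * 365:,.0f}")        # ~$9,271
print(f"Time saved/day:   {seconds_saved_per_run * RUNS_PER_DAY / 3600:.0f} hours") # ~46
print(f"Tokens saved/day: {tokens_saved_per_run * RUNS_PER_DAY / 1e6:.0f}M")        # ~54M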

Where did the waste go?

The comparison highlights exactly where the inefficiency was hiding. By fixing the parameter names, we eliminated:

  • Multiple Retry Loops: The agent no longer wastes rounds guessing the correct parameter syntax.

  • Context Pollution: We removed thousands of tokens of error messages and failed tool outputs from the context window.

  • Inefficient Fallbacks: The agent uses the specialized research_accommodations tool immediately, rather than falling back to a more expensive generic search.

The fix was a small documentation change. But we wouldn't have found it without seeing the actual execution pattern: the retry attempts that looked like normal agent behavior until we inspected the traces.

Why This Matters

Observability isn't just about catching errors; it's about catching inefficiencies. When agents "work" but cost too much, you need to see the execution flow, not just the final result.

Traditional debugging workflows require you to:

  1. Notice the performance issue
  2. Switch to a tracing UI
  3. Search for the relevant trace
  4. Manually parse JSON logs
  5. Connect the dots across multiple tool calls

The MCP workflow lets your coding agent do steps 2-5. You stay in your editor. The agent understands the trace structure and can explain what's happening—not just what failed, but what's inefficient.

Connecting the MCP Server

vLLora's MCP server runs alongside your vLLora instance. Configure your MCP client to connect to it:

{
  "mcpServers": {
    "vllora": {
      "url": "http://localhost:9090/mcp"
    }
  }
}

Alternatively, you can install the MCP server directly from your IDE; one-click install options are available for VS Code, Visual Studio, and Cursor.

Once connected, your coding agent automatically discovers the trace inspection tools and can start using them immediately.

Closing Thoughts

Silent failures are expensive. They don't break your application, but they inflate your costs and slow down your users. The challenge is visibility: you need to see the execution flow, not just the final result.

vLLora's MCP Server brings trace inspection into your coding workflow, so you can debug inefficiencies the same way you debug errors: in your editor, with your tools. Don't just check if your agent works. Check how it works.

For setup details and advanced configuration, see the vLLora MCP Server documentation.