Version
7.21.2
Steps to Reproduce
- Run a trace for an agent or LLM call that utilizes native tool calling (Function Calling).
- Ensure the tool returns a successful response and the LLM uses that data to formulate its final answer.
- Open the specific trace in the dashboard and observe that while the tool call is logged, the corresponding tool response is absent from the timeline/context.
- Optional: Run an evaluation (e.g., Hallucination or Helpfulness) on this trace. This step obviously is only failing because the tool response is missing before so this is not the issue itself.
Expected Result
I expect to see the tool outputs in the traces (and that those are correctly propagated to the evaluation suite.
Actual Result
There appears to be an issue with how LLM traces are captured and processed. Currently, tool call responses (tool outputs) are missing from the trace logs.
This data gap is propagating directly to the Evaluation suite. Because the evaluations (e.g. for Helpfulness and Hallucination) do not see the tool responses in the context history, they incorrectly flag the final LLM output as a hallucination.
For example, if the LLM provides specific figures or data points retrieved via a tool, the evaluator marks it as a hallucination because it cannot find the source of that data in the incomplete trace. This makes the automated evaluation metrics unreliable for any flows involving tool usage.
Version
7.21.2
Steps to Reproduce
Expected Result
I expect to see the tool outputs in the traces (and that those are correctly propagated to the evaluation suite.
Actual Result
There appears to be an issue with how LLM traces are captured and processed. Currently, tool call responses (tool outputs) are missing from the trace logs.
This data gap is propagating directly to the Evaluation suite. Because the evaluations (e.g. for Helpfulness and Hallucination) do not see the tool responses in the context history, they incorrectly flag the final LLM output as a hallucination.
For example, if the LLM provides specific figures or data points retrieved via a tool, the evaluator marks it as a hallucination because it cannot find the source of that data in the incomplete trace. This makes the automated evaluation metrics unreliable for any flows involving tool usage.