Production evals on Arize: what the docs don't tell you

The goal

MOBA is an AI nutrition chatbot for endurance athletes. It uses Claude (via OpenRouter) to answer questions about training load, recovery, and fueling strategy. We're sending real user questions in production, which means we need to know: is the chatbot actually giving good advice?

That's where Arize comes in. Arize lets you run LLM-as-judge evaluators against your production traces, automatically scoring every response for quality, safety, and correctness. The idea is simple. Getting it working is not.

This is a full account of what broke, what we tried, and what finally produced 900 successes with 0 errors and 0 skipped.

Architecture

Before we get into the failures, here's what the final system looks like.

Tracing: Every user turn runs inside a CHAIN span. The OpenAI-compatible OpenRouter client is traced automatically by OpenAIInstrumentor from openinference-instrumentation-openai. Tool calls (NLS score lookups, training load queries) each get a manual TOOL span nested inside the CHAIN span.

chat [CHAIN]
  ├── chat.completions.create [LLM]   ← auto via OpenAIInstrumentor
  ├── get_nls_score [TOOL]             ← manual span
  └── chat.completions.create [LLM]   ← final answer

Evaluators: Three LLM-as-judge evaluators running on every CHAIN span:

Nutrition Advice Quality — is the advice specific to the athlete's actual data?
Scope Adherence — does the response stay in the nutrition/training domain?
Safety — does it avoid pushing unsafe fueling strategies?

Session evals: A fourth evaluator at session granularity, scoring the full multi-turn conversation for coherence and helpfulness across all turns.

Mistake 1: the wrong instrumentor

OpenRouter provides an OpenAI-compatible API. The first instinct was to look for a dedicated OpenRouter instrumentor. That was wrong.

openinference-instrumentation-openrouter doesn't exist. Trying to install it fails with a pip 404. The right package is openinference-instrumentation-openai — because OpenRouter is OpenAI-compatible, OpenAIInstrumentor traces all calls through it automatically.

from openinference.instrumentation.openai import OpenAIInstrumentor
tracer_provider = register(
    space_id=ARIZE_SPACE_ID,
    api_key=ARIZE_API_KEY,
    project_name="moba-nutrition-demo",
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# OpenRouter client — fully compatible
client = openai.OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=OPENROUTER_API_KEY,
    default_headers={
        "HTTP-Referer": "https://moba-nutrition.app",
        "X-Title": "MOBA Chat",
    },
)

This traces every client.chat.completions.create() call as an LLM span with full input messages and output content, even though the traffic goes to OpenRouter and not OpenAI directly.

Mistake 2: 171 skipped — no column mappings

The first eval task ran and returned:

Status    : failed
Successes : 0
Errors    : 0
Skipped   : 171

171 skipped means the evaluator found spans but couldn't resolve the template variables. We had created the task without column_mappings, the field that tells Arize which span attribute corresponds to each {variable} in the prompt template.

Every skipped evaluation is a template variable that couldn't be filled in. Without column mappings, the evaluator sees {question} and {output} in the template and has no idea where to get those values from.

The fix is explicit mappings on every evaluator in the task:

evaluators_config = [
    {
        "evaluator_id": "RXZhbHVhdG9yOjM2ODA6NXpHVA==",
        "column_mappings": {
            "question": "attributes.input.value",
            "output": "attributes.output.value",
        },
    },
    ...
]

The path attributes.input.value is where OpenInference stores the user's question on a CHAIN span. attributes.output.value is the final answer. These are OpenInference semantic conventions and they apply regardless of which LLM provider you're using.

Mistake 3: immediate cancellation — bad model name

After fixing column mappings, runs started cancelling immediately (within about 1 second):

Status    : cancelled
Successes : 0
Errors    : 0
Skipped   : 0

Cancellation at ~1 second means the integration credentials are invalid or the model name doesn't exist. We had set the judge model to gpt-5.2, which doesn't exist on OpenRouter. Arize validates this on run start and cancels immediately.

The fix: create new evaluator versions pointing to gpt-4o-mini, which is valid and cheap.

Mistake 4: still cancelling — missing OpenRouter headers

Even with a valid model name, runs kept cancelling. This time the diagnosis was different: ~3 minutes to cancel instead of ~1 second. That means Arize found spans, made LLM calls, but the calls failed.

The issue: when you create an AI integration in Arize pointing to OpenRouter's base URL, OpenRouter requires two additional headers that Arize doesn't send by default:

HTTP-Referer — identifies your app
X-Title — a display name for the OpenRouter dashboard

Without these, OpenRouter rejects the requests. You have to pass them via provider_params when creating or updating the integration:

arize_client.ai_integrations.update(
    ai_integration="YOUR_INTEGRATION_ID",
    provider_params={
        "extra_headers": {
            "HTTP-Referer": "https://moba-nutrition.app",
            "X-Title": "MOBA Chat",
        }
    },
)

After this update, runs stopped cancelling at the 3-minute mark.

Mistake 5: 74 skipped — template variables that can't be mapped

The next run returned:

Status    : failed
Successes : 0
Errors    : 0
Skipped   : 74

74 skipped with correct column mappings meant the templates themselves referenced variables that didn't exist on the spans. Our original evaluator templates used variables like {failure_mode}, {expected_response_themes}, and {context} — fields that sounded useful but had no corresponding span attributes.

The Arize evaluator engine only fills variables that exist in column_mappings. Any variable in the template that doesn't appear in mappings means that span gets skipped. No partial scoring, no fallback.

The fix was to rewrite all three evaluator templates to use only two variables:

{question}  → attributes.input.value
{output}    → attributes.output.value

Here's the final Nutrition Advice Quality template:

You are an expert sports nutritionist evaluating an AI nutrition assistant.

Athlete question: {question}

MOBA response: {output}

Evaluate whether this response provides high-quality, actionable nutrition advice
for an endurance athlete. A high-quality response:
- Addresses the specific question asked
- Provides concrete numbers or recommendations where relevant
- Is grounded in sports nutrition principles
- Avoids vague platitudes ("eat well", "stay hydrated")

Respond with exactly one label:
- high_quality: The response is specific, actionable, and useful
- low_quality: The response is vague, generic, or unhelpful

After this rewrite: 900 successes, 0 errors, 0 skipped.

What finally worked: the complete setup

1. Tracing

from arize.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry.trace import get_tracer

tracer_provider = register(
    space_id=ARIZE_SPACE_ID,
    api_key=ARIZE_API_KEY,
    project_name="moba-nutrition-demo",
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
tracer = get_tracer("moba", "1.0.0")

2. Manual CHAIN + TOOL spans

The instrumentor only wraps the OpenAI client. Everything between LLM calls — tool execution, business logic — needs manual spans. Each user turn gets a CHAIN span; each tool call gets a TOOL span nested inside it.

with tracer.start_as_current_span("chat") as chain_span:
    chain_span.set_attribute("openinference.span.kind", "CHAIN")
    chain_span.set_attribute("input.value", question)
    chain_span.set_attribute("output.value", final_answer)   # set at end
    if session_id:
        chain_span.set_attribute("session.id", session_id)

    # Inside the tool loop:
    with tracer.start_as_current_span(tool_name) as tool_span:
        tool_span.set_attribute("openinference.span.kind", "TOOL")
        tool_span.set_attribute("input.value", json.dumps(tool_input))
        tool_span.set_attribute("output.value", json.dumps(result))

3. Evaluators (via Arize portal, not SDK)

The Arize SDK's prompt/evaluator creation endpoints are alpha and fragile. We created all three evaluators through the Arize portal UI and fetched their IDs programmatically:

resp = arize_client.evaluators.list(space=ARIZE_SPACE_ID)
for ev in resp.evaluators:   # .evaluators, not .data
    print(ev.id, ev.name)

4. Task with column mappings

task = arize_client.tasks.create(
    name="moba-chat-evaluations-v3",
    task_type="template_evaluation",
    project=PROJECT_ID,
    evaluators=[
        {
            "evaluator_id": EVAL_ID,
            "column_mappings": {
                "question": "attributes.input.value",
                "output":   "attributes.output.value",
            },
        },
        # ... repeat for each evaluator
    ],
    is_continuous=True,
)

5. Backfill trigger

from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)
run = arize_client.tasks.trigger_run(
    task=task.id,
    data_start_time=now - timedelta(days=7),
    data_end_time=now,
)
completed = arize_client.tasks.wait_for_run(
    run_id=run.id, poll_interval=5, timeout=300
)
print(completed.status, completed.num_successes, completed.num_skipped)

Session-level evals: scoring multi-turn conversations

Individual span evals tell you whether each response was good. But for a chatbot, the question that matters is: did the full conversation hold together? Did MOBA lose context, contradict itself, or go off-topic by turn 5?

Arize supports session-granularity evaluators that score all turns in a conversation as a unit. The key ingredient is a session.id attribute on every CHAIN span. All spans sharing the same session.id are grouped into one session.

import uuid

session_id = str(uuid.uuid4())   # one per conversation

# In chat():
chain_span.set_attribute("session.id", session_id)

The session evaluator template uses {conversation}, a special variable Arize builds automatically from the input.value / output.value pairs across all spans in the session, ordered by start time:

You are evaluating a multi-turn conversation between an athlete and MOBA.

Full conversation:
{conversation}

Score this conversation as:
- coherent_and_helpful: MOBA maintained context and gave useful, consistent advice
- degraded_or_incoherent: MOBA lost context, contradicted itself, or gave unhelpful responses

The task is created with data_granularity='session':

# Evaluator created via portal with data_granularity=session
# Task:
session_task = arize_client.tasks.create(
    name="moba-session-quality",
    task_type="template_evaluation",
    project=PROJECT_ID,
    evaluators=[
        {
            "evaluator_id": SESSION_EVAL_ID,
            "column_mappings": {},   # {conversation} is built automatically
        }
    ],
    is_continuous=True,
)

No column mappings needed for {conversation}. Arize handles that variable internally.

SDK gotchas worth knowing

These aren't in the main docs. We found them by reading error messages and inspecting response objects with model_fields.keys().

.evaluators not .data — EvaluatorsList200Response uses .evaluators as the list attribute, not .data. Using .data raises AttributeError.

provider_parameters={} is required — EvaluatorLlmConfig requires this field even when you have no provider-specific parameters. Omitting it raises a ValidationError.

No tasks.delete() — the TasksClient has no delete method. When a task is misconfigured, you create a new one with a versioned name (v2, v3, etc.) and use the new ID.

Run status guide:

Status	Time to cancel	Likely cause
`cancelled`	~1s	Invalid model name
`cancelled`	~3min	LLM call failed (bad key or missing headers)
`failed`, all skipped	—	Template variables not in column_mappings
`completed`, 0 spans	—	No spans in time window — widen range
`completed`, N successes	—	Working

Results

After working through all of the above:

900 CHAIN spans evaluated across 3 evaluators
0 errors, 0 skipped
Continuous evaluation running on new spans as they arrive
Session-level conversation quality scoring active

In the Arize UI, you can filter traces by eval.Nutrition Advice Quality.label = low_quality to find the responses worth investigating. The goal isn't to report a number — it's to find the specific turns where the chatbot isn't doing its job.

Takeaways

Column mappings are the most important field in a task. Every template variable that isn't explicitly mapped gets skipped. There's no error — spans just disappear silently from your success count.

Check your model names against the provider's actual model list. gpt-5.2 looks plausible. It isn't. Arize validates on run start and cancels immediately.

OpenRouter requires extra headers that Arize won't add automatically. Pass them via provider_params.extra_headers on the AI integration.

Keep evaluator templates to variables you can actually map. {failure_mode} sounds useful. If it's not a span attribute, it will silently skip every span.

Start with a 7-day backfill window, not 24 hours. If your recent spans are from yesterday, a 24-hour window catches nothing.

The path from 0 successes to 900 took about a day. Hopefully this saves you most of it.