Engineering

A/B testing system prompts on Arize: how we track MOBA's chatbot quality

April 9, 2026 · 13 min read

Arize · LLM Observability · A/B Testing · Prompt Engineering · OpenInference · OpenRouter · AI


The problem with prompt intuition

MOBA's nutrition chatbot is powered by Claude (via OpenRouter). The system prompt is the single biggest lever we have over how it behaves — the difference between a response that says "eat well and stay hydrated" and one that says "your NLS is 38 (Critical), you need 480g carbs today to start recovering that deficit."

There's no intuition-based way to know whether a shorter, more direct system prompt produces better advice than our longer, more detailed one. "It feels better" isn't a measurement. We needed a way to run controlled experiments where the same eval criteria score both variants in production, on real traces.


Architecture

MOBA traces every chatbot turn as a CHAIN span in Arize. The span contains:

  • input.value — the user's question
  • output.value — MOBA's final answer
  • session.id — groups turns into conversations
  • athlete.id, athlete.nls, athlete.tss — athlete-specific context
  • prompt.variant — the A/B tag ("A" or "B")

Three continuous LLM-as-judge evaluators score every CHAIN span:

  • Nutrition Advice Quality — is the advice specific and actionable?
  • Scope Adherence — does it stay in the nutrition/fitness domain?
  • Safety — does it avoid dangerous fueling recommendations?

With prompt.variant on every span, we can filter traces in Arize by variant and compare eval scores directly.


The two prompts

Prompt A is our current production prompt, retrieved from Arize Prompt Hub:

from arize import ArizeClient
from arize._generated.api_client.models import MessageRole

arize_client = ArizeClient(api_key=ARIZE_API_KEY)

fetched = arize_client.prompts.get(
    prompt="UHJvbXB0OjMxODUxOjN3SFQ=",       # "MOBA Nutrition Advisor - Optimized"
    version_id="UHJvbXB0VmVyc2lvbjo3NzM5NzpQcDJS",  # v1
)

PROMPT_A = next(
    m.content for m in fetched.version.messages
    if m.role == MessageRole.SYSTEM
)

Fetching from Prompt Hub means the version is tracked. We know exactly which prompt text was in production at any time. If we update the prompt in the Hub, we bump the version ID in code and Arize can compare scores across prompt versions, not just A/B variants.

Prompt B is a concise challenger we wrote directly:

PROMPT_B = """You are MOBA, a nutrition and fitness assistant for endurance athletes.
Be direct, specific, and grounded in the athlete's actual data.

## Scope
Answer questions about nutrition, fitness, and health only.
For off-topic questions: "I'm MOBA, your personal health assistant. I can only help with nutrition, fitness, and your health data."

## NLS - Nutrition Load Score
MOBA's proprietary nutrition fitness metric (42-day EMA, 0-100). Similar to TrainingPeaks CTL.
- NLS: Long-term nutrition fitness
- ANL: Acute load (7-day EMA)
- NRB: Readiness balance (NLS - ANL). Below -20 = red flag.
- DNS: Today's score

Status: Critical <40 * Suboptimal 40-60 * Adequate 60-75 * Optimal 75-85 * Enhanced 85-95 * Exceptional 95+

Rules:
- Always call get_nls_score before answering NLS questions.
- If available=false: "I don't have your NLS data yet. Log meals for a few days first."
- Never fabricate NLS values.

## Tools
- get_nls_score, get_recent_meals, get_training_load, get_daily_totals, get_nutrition_targets

## Response Style
- Under 150 words. Specific numbers over vague advice. No fabricated data.
"""

Prompt A is exhaustive — edge cases, failure modes, NLS explainers. Prompt B bets that shorter context produces more focused outputs. We didn't know which would perform better. That's what the test is for.


Deterministic variant assignment

For the test to be meaningful, the same athlete needs to always get the same variant. If an athlete gets Prompt A on one turn and Prompt B on the next, the conversation history mixes both prompts and the trace is noise.

We use a deterministic MD5 hash:

import hashlib

SALT = "moba-ab-v1"  # change to rotate the experiment

def get_variant(athlete_id: str) -> str:
    digest = hashlib.md5(f"{SALT}:{athlete_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

Same salt + same athlete_id always returns the same variant. To start a new experiment after updating Prompt B, increment the salt to moba-ab-v2 and all assignments rotate.

This isn't random-at-request-time. It's stable across sessions, restarts, and deploys. An athlete in group B today is in group B tomorrow.
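A quick sanity check makes the stability and balance properties concrete. This sketch reproduces get_variant with the salt as a parameter (a variation for illustration; the production version reads a module-level SALT) and checks the split on synthetic athlete IDs:

```python
import hashlib

def get_variant(athlete_id: str, salt: str = "moba-ab-v1") -> str:
    digest = hashlib.md5(f"{salt}:{athlete_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

# Stability: repeated calls for the same athlete always agree.
assert all(get_variant("jordan_rivera") == get_variant("jordan_rivera") for _ in range(100))

# Balance: across many synthetic IDs the split should land near 50/50.
ids = [f"athlete_{i}" for i in range(10_000)]
share_a = sum(get_variant(i) == "A" for i in ids) / len(ids)
print(f"share of A: {share_a:.3f}")  # should be close to 0.5

# Rotating the salt reshuffles assignments without touching any other code;
# roughly half the athletes switch groups.
moved = sum(get_variant(i) != get_variant(i, salt="moba-ab-v2") for i in ids) / len(ids)
print(f"fraction reassigned after salt rotation: {moved:.3f}")
```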


Tagging spans with the variant

The chat() function accepts a prompt_variant parameter and writes it onto the CHAIN span:

def chat(
    question: str,
    athlete_id: str,
    chat_history: list[dict] | None = None,
    session_id: str | None = None,
    prompt_variant: str | None = None,
) -> str:
    with tracer.start_as_current_span("chat") as chain_span:
        chain_span.set_attribute("openinference.span.kind", "CHAIN")
        chain_span.set_attribute("input.value", question)
        chain_span.set_attribute("athlete.id", athlete_id)
        if session_id:
            chain_span.set_attribute("session.id", session_id)
        if prompt_variant:
            chain_span.set_attribute("prompt.variant", prompt_variant)  # ← the A/B tag

        # ... rest of the agentic loop

The chat_ab() wrapper handles variant selection and prompt injection:

def chat_ab(
    question: str,
    athlete_id: str,
    chat_history: list[dict] | None = None,
    session_id: str | None = None,
) -> tuple[str, str]:
    variant = get_variant(athlete_id)
    prompt  = PROMPT_A if variant == "A" else PROMPT_B

    # Swap the module-level prompt for this call and restore it even if chat() raises.
    global MOBA_SYSTEM_PROMPT
    original = MOBA_SYSTEM_PROMPT
    MOBA_SYSTEM_PROMPT = prompt
    try:
        answer = chat(
            question,
            athlete_id=athlete_id,
            chat_history=chat_history,
            session_id=session_id,
            prompt_variant=variant,
        )
    finally:
        MOBA_SYSTEM_PROMPT = original
    return answer, variant

Every trace that flows through chat_ab() lands in Arize with attributes.prompt.variant = "A" or "B". Nothing else changes — same model, same tools, same eval tasks. The only difference between groups is the system prompt.


What it looks like in Arize

After running the test across our athlete personas, each CHAIN span in Arize shows:

Attributes:
  openinference.span.kind  : CHAIN
  input.value              : "What should I focus on nutritionally this week?"
  output.value             : "Your NLS is 38 (Critical). One good day barely..."
  athlete.id               : jordan_rivera
  athlete.nls              : 38
  prompt.variant           : B
  session.id               : 4f3a2b1c-...

To compare variants, open app.arize.com → moba-nutrition-demo → Traces, add filter attributes.prompt.variant = A, look at the Evaluations column, then switch to B.

The evaluators run continuously on all CHAIN spans. prompt.variant is just an attribute — Arize lets you filter and group by any attribute when viewing traces. For a more structured comparison, use Arize's Experiments feature to export both groups as datasets and compare eval distributions side by side.


Additional span attributes we track

Beyond the variant tag, each span carries context that makes specific failures easier to debug:

# Written by the tool execution loop inside chat():
chain_span.set_attribute("athlete.nls",           str(result.get("currentNLS", "")))
chain_span.set_attribute("athlete.tss",           str(result.get("tss", "")))
chain_span.set_attribute("athlete.data_freshness", str(result.get("dataFreshness", "")))

These let you cross-filter eval results by athlete state. For example:

  • Filter athlete.nls < 40 AND eval.Nutrition Advice Quality.label = low_quality — find cases where the model failed to surface a Critical NLS warning
  • Filter prompt.variant = B AND eval.Scope Adherence.label = out_of_scope — check whether Prompt B's shorter scope section causes more off-topic drift

Without explicit span attributes, these queries require post-hoc log parsing. With them, they're a few clicks in the trace explorer.
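The same cross-filters are easy to replicate offline on an export. This is a minimal sketch over a hand-built list of spans with hypothetical key names; the exact attribute and eval-label names depend on how you export from Arize:

```python
# Hypothetical export: one dict per CHAIN span, keyed by span attribute and
# eval output names (illustrative keys, not Arize's exact export schema).
spans = [
    {"athlete.nls": 38, "prompt.variant": "B", "quality": "low_quality",  "scope": "in_scope"},
    {"athlete.nls": 82, "prompt.variant": "A", "quality": "high_quality", "scope": "in_scope"},
    {"athlete.nls": 35, "prompt.variant": "A", "quality": "low_quality",  "scope": "out_of_scope"},
]

# athlete.nls < 40 AND quality = low_quality: Critical athletes the model failed.
missed_critical = [s for s in spans if s["athlete.nls"] < 40 and s["quality"] == "low_quality"]

# prompt.variant = B AND scope = out_of_scope: drift under the shorter prompt.
b_drift = [s for s in spans if s["prompt.variant"] == "B" and s["scope"] == "out_of_scope"]

print(len(missed_critical), len(b_drift))  # 2 0
```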


Session-level quality tracking

Individual span evals measure turn-level quality. But MOBA is a conversational chatbot — a response that makes sense in isolation might contradict earlier turns or lose the thread of the conversation by turn 4.

We track session quality separately. Every turn in a conversation shares a session.id:

import uuid

session_id = str(uuid.uuid4())  # one per conversation, passed to every chat() call

# In chat():
chain_span.set_attribute("session.id", session_id)

A session-granularity evaluator in Arize uses the {conversation} variable — Arize builds this automatically from input.value / output.value pairs across all spans sharing the same session.id:

You are evaluating a multi-turn conversation between an athlete and MOBA.

Full conversation:
{conversation}

Score this conversation as:
- coherent_and_helpful: MOBA maintained context and gave useful, consistent advice
- degraded_or_incoherent: MOBA lost context, contradicted itself, or gave unhelpful responses

The task is created with data_granularity='session' and no column mappings. {conversation} is Arize's internal variable, not a span attribute you map yourself.

Turn-level scores might be similar between Prompt A and Prompt B, but session-level coherence can differ if one prompt helps Claude maintain context better across turns. It's a different signal and worth tracking separately.
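To build intuition for what the session evaluator sees, here is a rough stdlib-only approximation of how a {conversation} transcript can be assembled from turn spans sharing a session.id. Arize builds this internally; the span tuples and the order-by-turn assumption here are purely illustrative:

```python
from collections import defaultdict

# Toy spans: (session.id, turn order, input.value, output.value). In practice
# these come from a trace export, ordered by span start time.
spans = [
    ("sess-1", 0, "What's my NLS?", "Your NLS is 38 (Critical)."),
    ("sess-2", 0, "Hi", "Hello! Ask me about nutrition."),
    ("sess-1", 1, "How do I fix it?", "Target 480g carbs today..."),
]

def build_conversations(spans):
    by_session = defaultdict(list)
    for sid, order, q, a in spans:
        by_session[sid].append((order, q, a))
    # Approximates Arize's {conversation} variable: input/output pairs,
    # in turn order, for every span sharing a session.id.
    return {
        sid: "\n".join(f"User: {q}\nAssistant: {a}" for _, q, a in sorted(turns))
        for sid, turns in by_session.items()
    }

convos = build_conversations(spans)
print(convos["sess-1"])
```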


Running the experiment

A quick validation pass across all four athlete personas:

TEST_ATHLETES = ["alex_chen", "new_user", "jordan_rivera", "sam_patel"]
TEST_QUESTION = "What should I focus on nutritionally this week?"

for aid in TEST_ATHLETES:
    answer, variant = chat_ab(TEST_QUESTION, athlete_id=aid)
    print(f"{aid:<25} variant={variant}  {answer[:60]}...")

Output:

alex_chen                 variant=A  Your nutrition is in great shape — NLS 82 (Optimal)...
new_user                  variant=B  I don't have your NLS data yet. Log meals for a few...
jordan_rivera             variant=A  Your NLS is 38 (Critical). Despite today's DNS of 88...
sam_patel                 variant=B  You finished a 90-min brick 5+ hours ago with no food...

Each trace lands in Arize tagged with prompt.variant. The continuous evaluation tasks score them immediately. Within minutes you can see whether Variant B's shorter prompt produces better or worse Nutrition Advice Quality scores.


What we're looking for

The evaluators give us three signals per span:

  • Nutrition Advice Quality — vague advice, failure to surface Critical NLS, generic responses
  • Scope Adherence — off-topic responses that slipped through the scope guardrail
  • Safety — dangerous fueling recommendations (e.g., extreme restriction)

We're watching for Prompt B's conciseness to either help (tighter instructions → more focused outputs) or hurt (missing edge case coverage → more failures on the hallucination and conflicting-signal scenarios).

If Prompt B matches or beats Prompt A on all three evaluators, the shorter prompt wins — less context to maintain, easier to version, cheaper in tokens. If it underperforms on Safety or Scope Adherence, we know the detail in Prompt A is load-bearing.
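Before declaring a winner, it's worth checking that the gap between variants isn't noise. A minimal two-proportion z-test over hypothetical eval-label counts (stdlib only, assuming spans are independent) looks like this:

```python
from math import sqrt, erf

def two_proportion_z(good_a, n_a, good_b, n_b):
    """Two-sided two-proportion z-test on 'good eval label' rates."""
    p_a, p_b = good_a / n_a, good_b / n_b
    pooled = (good_a + good_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the normal CDF: 2 * (1 - Phi(|z|)).
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p_a, p_b, z, p_value

# Hypothetical counts: spans labelled high_quality per variant.
p_a, p_b, z, p = two_proportion_z(good_a=130, n_a=200, good_b=152, n_b=200)
print(f"A={p_a:.2f}  B={p_b:.2f}  z={z:.2f}  p={p:.3f}")
```

In this hypothetical, a 65% vs 76% high_quality rate at 200 spans per side clears p < 0.05; smaller gaps need more traffic before they mean anything.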


The Prompt Hub connection

Prompt A is fetched from Arize Prompt Hub by version ID, not hardcoded. This matters for three reasons: the prompt text is version-controlled in Arize, not just in git; we can update the prompt in the Hub without a code deploy; and Arize can show us which exact prompt version was active for any given trace, which is useful when diagnosing regressions after a prompt update.

When we're ready to graduate Prompt B to production, we publish it as a new version in Prompt Hub and update the version ID in the fetch call. Old traces retain their original version ID, so the historical comparison stays intact.


Takeaways

prompt.variant is just a span attribute. There's no special A/B testing API — you tag spans with whatever attributes you want and filter on them in Arize. The evaluators run on all spans regardless; filtering is just how you look at the results.

Deterministic hashing keeps groups stable. Random assignment at request time creates noisy groups. A hash of SALT:athlete_id means group membership is consistent across sessions and deploys.

The salt is the experiment identifier. Changing it rotates all assignments, giving you a clean experiment start without changing any other code.

Session-level evals catch what turn-level evals miss. A prompt might produce good individual responses but fail to maintain context across a conversation. Track both.

Prompt Hub version IDs make the history auditable. When you look at a trace from three weeks ago, you can see exactly which prompt version produced that response — not just which variant label it was tagged with.


Where this falls short: the feedback loop we actually want

The setup above works. But using it for a week revealed a gap worth being honest about.

Here's what the failure-diagnosis loop actually looks like when something goes wrong:

Arize sees your agent failing
         ↓
Can't evaluate multi-turn context accurately
(evals score turns in isolation; tool call outputs are in child spans,
not directly visible to the session-level evaluator)
         ↓
Attributes failure to prompt (wrong diagnosis)
(Arize flags low Nutrition Advice Quality — but the real cause is that
get_nls_score returned available=false and the model ignored it,
not that the prompt was underspecified)
         ↓
Suggests prompt fix (wrong prescription)
(you tighten the "never fabricate NLS values" rule in the system prompt,
which was already there)
         ↓
Fix is stranded in the UI — no path to codebase
(you edit the prompt in Arize Prompt Hub, but now the code still has
the old version ID hardcoded — someone has to update it, re-deploy,
and remember to rotate the A/B salt)
         ↓
Even if the fix is applied — no way to verify it worked
(you can re-run the eval task against historical spans, but those
spans have the old prompt; the new prompt hasn't hit production
traffic yet, and you have no way to test it against real tool call
outputs without running live sessions)

Arize is an observability layer sitting outside the codebase. It can see what happened. It can score what happened. It can even suggest what to change. But it can't close the loop — there's no way to apply the change, run it against real data, and confirm the score improved without switching contexts.

What the end state should look like

Arize detects a failure pattern — say, 40% of spans with athlete.nls < 40 have eval.Nutrition Advice Quality = low_quality this week, up from 12% last week.

It surfaces the specific sessions where the failure occurred, with full multi-turn context. Not just the failing span, but the complete conversation: what tool calls returned, what the model was told, what it said.

A recommended prompt change is generated from that context — grounded in what the tool actually returned in those sessions, not a static reading of the prompt text in isolation. The diagnosis has to see the whole CHAIN: what the tool returned, what the model did with it, where it went wrong.

That recommended change goes directly to the codebase. Not just saved in Prompt Hub as a new version. Ideally it opens a PR or writes directly to the prompt file, with the A/B salt pre-rotated and the version ID updated.

The test runs against live traffic immediately — not against historical spans with the old prompt. The new variant gets real tool calls, real athlete data, real model outputs. The eval task scores those spans as they come in, and within hours you have a verdict: did the change actually fix the failure pattern, or did it trade one failure mode for another?

The whole cycle happens without leaving the development environment. Right now, detection and diagnosis happen in Arize, applying and verifying happen in your editor and terminal, and nothing connects them automatically. The context switch between reading traces in a browser and editing code in an IDE is where fixes get lost, half-applied, or never verified.

For a chatbot where the system prompt is the primary quality control mechanism, this matters more than it might for a simpler integration. The prompt interacts with tool call behavior, multi-turn context, and real athlete data in ways that are only visible in production traces — and the only way to know if a prompt change actually worked is to run it in production and score the results. That loop should be first-class, not something you wire together manually across three different tools.