How Arize helped us run 45 experiments to build the most accurate food logging in the market

We thought it was working

MOBA tracks nutrition for endurance athletes. When you log "post-ride protein shake with banana," the app estimates calories, protein, carbs, fat, and recovery-relevant micros like magnesium and vitamin B6.

The first version used a single LLM call. It felt accurate. We shipped it. Users logged meals, got macro breakdowns, and nobody complained.

Then we added Arize and actually measured it. The number was 52%.

Only about half of food descriptions landed within ±10% of ground truth. For restaurant items and composite meals it was less than a third. We weren't building accurate food logging; we were building confident food logging, which is not the same thing. LLMs are confident even when wrong. A model that says "approximately 520 calories" when the actual value is 310 will keep saying it, consistently and without hesitation, until you compare it to ground truth.

Arize gave us ground truth. What followed was 45 experiments, six weeks of iteration, and a system that now hits 80% strict accuracy — good enough that aggregate daily calorie error stays within ±5% for 85% of days on a typical athlete diet.

This is the full account: how we set up Arize, which features we used and when, and what the data showed at each step.

Setting up Arize

Client initialization

Everything runs through a single ArizeClient. We stored credentials in environment variables and initialized the client once:

import os
from arize import ArizeClient
from arize.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Offline experiments + datasets
arize_client = ArizeClient(api_key=os.environ["ARIZE_API_KEY"])
SPACE_ID = os.environ["ARIZE_SPACE_ID"]

# Production tracing (runs in the Node.js backend via OpenTelemetry)
# Initialized once at server startup:
#   register({ spaceId, apiKey, projectName: "moba-food-analysis" })

We use the same Arize account for two completely separate purposes — offline experiments (Python, against labeled datasets) and production tracing (Node.js, against live requests). They share a space but run through different pipelines. Experiments are batch, deterministic, and evaluated against ground truth. Production traces are live, continuous, and monitored for confidence and data source distribution.

Datasets: the ground truth store

The most important step before any experiment was getting ground truth into Arize. We built two datasets and merged them.

Creating a dataset:

arize_client.datasets.create(
    name="moba-food-logging-combined-v2",
    space=SPACE_ID,
    examples=[
        {
            "attributes.input.value": "Chobani Greek Yogurt Plain 0% 170g",
            "attributes.output.value": {
                "expected_calories": 100,
                "expected_protein":  17,
                "expected_carbs":    6,
                "expected_fat":      0,
                "category":          "branded",   # ← must be in output, not just metadata
                "source":            "benchmark-v3",
            },
            "attributes.metadata": {
                "category": "branded",
                "tolerance": "strict",
            },
        },
        # ... 109 more examples
    ],
)

One critical lesson: evaluators in Arize receive dataset_output, which maps to attributes.output.value — not attributes.metadata. Any field your evaluators need to read (like category for the breakdown evaluator) must be in the output value, not just metadata. We spent three hours debugging "unknown" categories before finding this.
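A pre-upload check can catch this class of mistake before an experiment runs. The sketch below is illustrative only: `REQUIRED_OUTPUT_FIELDS` and `missing_output_fields` are hypothetical names, not part of the Arize SDK, and the field list is an assumption based on what the evaluators in this post read.

```python
# Hypothetical pre-upload check. The field list is an assumption based on
# what the evaluators in this post read from dataset_output.
REQUIRED_OUTPUT_FIELDS = ("expected_calories", "category")


def missing_output_fields(example: dict) -> list:
    """Return the required fields that are absent from the example's output value."""
    output = example.get("attributes.output.value") or {}
    return [field for field in REQUIRED_OUTPUT_FIELDS if field not in output]


good = {
    "attributes.input.value": "Chobani Greek Yogurt Plain 0% 170g",
    "attributes.output.value": {"expected_calories": 100, "category": "branded"},
}
# Category lives only in metadata here, so an evaluator would see "unknown".
bad = {
    "attributes.input.value": "Chobani Greek Yogurt Plain 0% 170g",
    "attributes.output.value": {"expected_calories": 100},
    "attributes.metadata": {"category": "branded"},
}

print(missing_output_fields(good))  # []
print(missing_output_fields(bad))   # ['category']
```

Running a check like this over every example before creating the dataset turns a three-hour debugging session into an immediate error.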

The combined dataset: We merged benchmark-v3 (49 curated cases, ±10% tolerance) and realworld-v4 (65 real user logs, ±25% tolerance) into combined-v2. Real-world examples win on conflicts. The final 110-case distribution:

Category breakdown — moba-food-logging-combined-v2 (110 cases):

  composite    41  ████████████████████░░░░░░  37%
  branded      25  ████████████░░░░░░░░░░░░░░  23%
  restaurant   24  ████████████░░░░░░░░░░░░░░  22%
  simple        8  ████░░░░░░░░░░░░░░░░░░░░░░   7%
  ambiguous     6  ███░░░░░░░░░░░░░░░░░░░░░░░   5%
  edge_case     6  ███░░░░░░░░░░░░░░░░░░░░░░░   5%

Composite foods — grain bowls, smoothies, homemade meals — are the largest single category. That distribution shaped every hypothesis we tested.
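The merge rule described above (real-world examples win on conflicts) can be sketched in a few lines. The post does not say how a conflict was detected, so this version assumes two examples conflict when their input text matches after trimming and lowercasing; `merge_datasets` is a hypothetical helper name. The counts are at least consistent with a handful of overlaps: 49 + 65 = 114 examples before merging, 110 after.

```python
def merge_datasets(benchmark: list, realworld: list) -> list:
    """Merge two example lists; on duplicate inputs the real-world example wins."""
    merged = {}
    # Real-world examples come second, so they overwrite benchmark entries
    # that share the same normalized input text (assumed conflict rule).
    for example in benchmark + realworld:
        key = example["attributes.input.value"].strip().lower()
        merged[key] = example
    return list(merged.values())


benchmark = [
    {"attributes.input.value": "Banana, medium", "source": "benchmark-v3"},
    {"attributes.input.value": "Rolled oats 40g", "source": "benchmark-v3"},
]
realworld = [
    {"attributes.input.value": "banana, medium ", "source": "realworld-v4"},
]

combined = merge_datasets(benchmark, realworld)
print(len(combined))  # 2
```

Keying on a normalized string is the simplest possible rule. A real pipeline might need fuzzier matching for inputs that differ only in wording.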

Evaluators: what we measured

We wrote four Python evaluator functions. Arize calls each one against every example in the dataset and stores the scores alongside the experiment output.

def eval_calories_within_10pct(output, dataset_output, input) -> EvaluationResult:
    """Strict accuracy: predicted within ±10% of ground truth."""
    predicted = _get_calories(output)
    expected  = _get_calories(dataset_output)
    if predicted is None or expected is None or expected == 0:
        return EvaluationResult(label='skip', score=0)
    ok = abs(predicted - expected) / expected <= 0.10
    return EvaluationResult(label='pass' if ok else 'fail', score=1.0 if ok else 0.0)

def eval_calories_within_25pct(output, dataset_output, input) -> EvaluationResult:
    """Lenient accuracy: ±25% tolerance (used for real-world foods)."""
    predicted = _get_calories(output)
    expected  = _get_calories(dataset_output)
    if predicted is None or expected is None or expected == 0:
        return EvaluationResult(label='skip', score=0)
    ok = abs(predicted - expected) / expected <= 0.25
    return EvaluationResult(label='pass' if ok else 'fail', score=1.0 if ok else 0.0)

def eval_calories_lower_bound(output, dataset_output, input) -> EvaluationResult:
    """Sanity check: predicted > 0. Catches API failures silently returning 0."""
    predicted = _get_calories(output)
    ok = predicted is not None and predicted > 0
    return EvaluationResult(label='pass' if ok else 'fail', score=1.0 if ok else 0.0)

def eval_get_category(output, dataset_output) -> EvaluationResult:
    """Label-only: echoes the food's category. Enables per-category breakdown in Arize UI."""
    cat = _parse(dataset_output).get('category', 'unknown')
    return EvaluationResult(label=cat, score=1.0, explanation='category tag')

eval_get_category doesn't measure quality — it emits the food category as its label. Arize groups experiment results by evaluator label in its UI, so every experiment automatically shows a category breakdown without any post-processing. This is the pattern that revealed the real shape of our accuracy problem.
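For readers who want the same breakdown outside the UI, the grouping itself is small. This sketch assumes you have already extracted (category label, strict score) pairs from the experiment results; how to extract them depends on the SDK's result format, which is not shown here. `accuracy_by_category` is a hypothetical helper.

```python
from collections import defaultdict


def accuracy_by_category(rows):
    """rows: iterable of (category_label, strict_score) pairs, score 1.0 or 0.0."""
    totals = defaultdict(lambda: [0.0, 0])  # category -> [passed, count]
    for category, score in rows:
        totals[category][0] += score
        totals[category][1] += 1
    return {category: passed / count for category, (passed, count) in totals.items()}


# Made-up scores, for illustration only.
rows = [
    ("simple", 1.0), ("simple", 1.0),
    ("composite", 0.0), ("composite", 1.0),
    ("branded", 0.0),
]
print(accuracy_by_category(rows))
```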

The lower_bound evaluator existed because of a real incident: when our OpenRouter credits ran out mid-experiment, every LLM call returned HTTP 402 and the task function returned 0 calories. The strict evaluator correctly marked these as failures, but without a separate lower-bound check the failure pattern looked like inaccuracy rather than an infrastructure error. Running all three calorie evaluators together let us distinguish "model got it wrong" from "model never ran."

The experiment runner

Each hypothesis becomes one async function. Arize calls it against every example in the dataset, then runs all four evaluators on the output:

async def run_experiment(exp_id: int, name: str, task_fn, dataset_id: str):
    print(f"\n[exp {exp_id}] Running {name}...")
    results = await arize_client.experiments.run(
        dataset_id=dataset_id,
        task=task_fn,
        evaluators=[
            eval_calories_within_10pct,
            eval_calories_within_25pct,
            eval_calories_lower_bound,
            eval_get_category,
        ],
        experiment_name=name,
    )
    # Summarize locally
    scores = [r.evaluations.get("eval_calories_within_10pct", {}).get("score", 0)
              for r in results]
    pct = sum(scores) / len(scores) * 100 if scores else 0
    print(f"[exp {exp_id}] ±10% accuracy: {pct:.0f}%  ({len(scores)} cases)")
    return results

# Selectively run specific experiments:
# py -3 run_all_experiments.py --dataset combined --only 29,44,45

The --only flag was essential during the combination experiments phase — we'd form a hypothesis, implement it, and run just that one experiment without re-running the 40 that were already stored in Arize.
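The post does not show how `--only` is parsed. One plausible implementation with `argparse` is sketched below; `parse_only`, `build_parser`, and `select_experiments` are hypothetical names, and the experiment registry is assumed to be a dict keyed by experiment id.

```python
import argparse


def parse_only(value: str) -> set:
    """Turn '29,44,45' into {29, 44, 45}."""
    return {int(part) for part in value.split(",") if part.strip()}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run food-logging experiments")
    parser.add_argument("--dataset", default="combined")
    parser.add_argument("--only", type=parse_only, default=None,
                        help="comma-separated experiment ids, e.g. 29,44,45")
    return parser


def select_experiments(registry: dict, only) -> dict:
    """Keep every experiment when --only is absent, otherwise just the listed ids."""
    return {exp_id: fn for exp_id, fn in registry.items()
            if only is None or exp_id in only}


args = build_parser().parse_args(["--dataset", "combined", "--only", "29,44,45"])
registry = {28: "exp_28_task", 29: "exp_29_task", 44: "exp_44_task"}
print(sorted(select_experiments(registry, args.only)))  # [29, 44]
```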

Production tracing

The same Arize space that stores offline experiments also receives production traces. In the Node.js backend, we initialize an OpenTelemetry tracer at server startup:

import { register } from '@arizeai/openinference-instrumentation-node';
import { trace, SpanStatusCode } from '@opentelemetry/api';
import { SemanticConventions, OpenInferenceSpanKind }
  from '@arizeai/openinference-semantic-conventions';

// Called once at startup
register({
  spaceId:     process.env.ARIZE_SPACE_ID,
  apiKey:      process.env.ARIZE_API_KEY,
  projectName: 'moba-food-analysis',
});

const tracer = trace.getTracer('analyze-with-retry');

Every food analysis creates a span tagged with the attributes we care about:

const span = tracer.startSpan('food.item_lookup');
span.setAttribute(SemanticConventions.OPENINFERENCE_SPAN_KIND, OpenInferenceSpanKind.LLM);
span.setAttribute(SemanticConventions.INPUT_VALUE, itemDescription);
span.setAttribute('food.confidence',            confidence);
span.setAttribute('food.data_source',           'rag+tavily+llm');
span.setAttribute('food.micros_populated_count', microsCount);
span.setAttribute('llm.model_name',             model);
span.setAttribute('food.tavily_used',           hasTavilyAnswer);
span.setAttribute('food.rag_used',              !!ragContext);
span.setAttribute(SemanticConventions.OUTPUT_VALUE, JSON.stringify({
  item:                itemName,
  calories:            calories,
  confidence:          confidence,
  microsPopulatedCount: microsCount,
  dataSource:          'rag+tavily+llm',
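}));
span.setStatus({ code: SpanStatusCode.OK });
span.end();  // a span is only exported after it has been ended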

This means the Arize account that told us "52% baseline" in experiments also tells us, in production, what percentage of live requests hit USDA vs. Tavily vs. RAG, what the confidence distribution looks like, and how many micros are being populated per food item.

Arize UI features we used

These are the specific Arize features that did real work for us, as opposed to infrastructure we merely had to set up:

Datasets panel — stores all labeled examples, versioned by name. Running client.datasets.list_examples(dataset=DATASET_ID, space=SPACE_ID) fetches examples programmatically for dataset inspection or combining.

Experiments comparison table — every client.experiments.run() call creates a named row in the experiments table for that dataset. All experiments on the same dataset appear in one table, sortable by any evaluator score. This is where the combination paradox became visible: all 45 experiments ranked by ±10% accuracy in one view.

Label distribution panel — for any evaluator that returns categorical labels (pass/fail, or category names from eval_get_category), Arize shows a label distribution chart per experiment. This is how per-category accuracy was visible without any custom aggregation code.

Traces + span explorer — production spans appear here, filterable by any attribute. Filtering food.data_source = "rag+tavily+llm" shows only the requests that used RAG. Filtering food.confidence < 60 shows the hardest cases. Cross-filtering by both finds the cases where RAG was used and still produced low confidence.

Span drill-down — clicking any span in the explorer shows the full INPUT_VALUE and OUTPUT_VALUE alongside all custom attributes. This is how we diagnosed why specific combination experiments failed — we could read the exact prompt context and model output for the failing cases.

The experiments and what the data showed

Baseline: what pure LLM can do (experiments 1–12)

We tested three models across three prompt styles, then two system-prompt revisions and a temperature sweep. Here's every baseline experiment:

Experiment                    | ±10%  | ±25%
──────────────────────────────┼───────┼──────
Exp 01: gpt-4o-mini minimal   |  41%  |  58%
Exp 02: gpt-4o-mini detailed  |  47%  |  64%
Exp 03: gpt-4o-mini few-shot  |  49%  |  65%
Exp 04: gpt-4o minimal        |  44%  |  61%
Exp 05: gpt-4o detailed       |  51%  |  67%
Exp 06: gpt-4o few-shot       |  52%  |  68%
Exp 07: claude-sonnet minimal |  49%  |  66%
Exp 08: claude-sonnet detailed|  50%  |  67%
Exp 09: claude-sonnet few-shot|  52%  |  68%
Exp 10: system prompt v2      |  48%  |  65%
Exp 11: system prompt v3      |  50%  |  67%
Exp 12: temperature sweep     |  51%  |  68%

The ceiling for pure LLM food analysis is the low 50s for strict accuracy and the high 60s for lenient. More capable models helped by 3–4 points. Few-shot examples added 1–2 points over the detailed prompts. Neither meaningfully closed the gap.

What the Arize label distribution showed: The overall 52% looked uninspiring but manageable. The eval_get_category breakdown in the Arize UI told a different story:

Category accuracy at ±10% — gpt-4o few-shot (exp 06):

  simple       75%  █████████████████████████████████████░
  ambiguous    33%  ████████████████░
  composite    31%  ███████████████░
  branded      38%  ███████████████████░
  restaurant   29%  ██████████████░
  edge_case    17%  ████████░

Without the category breakdown, "52% accuracy, meh" might have led us to optimize system prompts. With it, we saw: simple foods are effectively solved at 75%; branded, restaurant, and composite each need completely different strategies. That view from Arize's label distribution panel shaped the next thirty experiments.

Phase 2: adding Tavily web search (experiments 13–22)

If the model can look up actual nutrition label data rather than recalling it from training, accuracy should improve for branded and restaurant items.

Experiment                      | ±10%  | ±25%  | vs baseline
────────────────────────────────┼───────┼───────┼────────────
Exp 13: Tavily basic            |  58%  |  72%  |  +6pp
Exp 14: Tavily advanced         |  61%  |  74%  |  +9pp
Exp 15: Tavily advanced branded |  63%  |  76%  | +11pp
Exp 16: query routing v1        |  61%  |  74%  |  +9pp
Exp 17: query routing v2        |  62%  |  75%  | +10pp
Exp 18: query routing v3        |  63%  |  76%  | +11pp
Exp 19-22: cost optimization    |  63%  |  76%  | +11pp  (same accuracy, -40% cost)

Experiments 19–22 validated that routing branded queries to advanced search and everything else to basic held accuracy flat while cutting Tavily costs by ~40%. Two experiments in Arize, compared side by side, confirmed the tradeoff.
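A rough back-of-the-envelope check shows why routing can save this much. The sketch assumes an advanced search costs twice a basic one (an illustrative ratio, so check current Tavily pricing) and that the comparison point is sending every query to advanced search. Neither assumption is stated in the post. The category counts are the 110-case dataset distribution from above.

```python
# Illustrative relative cost units (assumption, not quoted prices).
COST = {"basic": 1, "advanced": 2}

# Category counts from the 110-case combined dataset.
CATEGORY_COUNTS = {
    "composite": 41, "branded": 25, "restaurant": 24,
    "simple": 8, "ambiguous": 6, "edge_case": 6,
}


def search_depth_for(category: str) -> str:
    """Advanced search for branded items, basic for everything else."""
    return "advanced" if category == "branded" else "basic"


def total_cost(depth_for) -> int:
    return sum(COST[depth_for(cat)] * n for cat, n in CATEGORY_COUNTS.items())


all_advanced = total_cost(lambda _category: "advanced")
routed = total_cost(search_depth_for)
saving = 1 - routed / all_advanced
print(all_advanced, routed, f"{saving:.0%}")  # 220 135 39%
```

Under those assumptions the saving comes out near 39%, in the same range as the ~40% reported.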

Category breakdown with Tavily:

Category accuracy at ±10% — exp 18 (Tavily optimized):

  simple       82%  █████████████████████████████████████████░
  branded      61%  ██████████████████████████████░
  restaurant   47%  ███████████████████████░
  composite    34%  █████████████████░
  ambiguous    42%  █████████████████████░
  edge_case    25%  ████████████░

Branded jumped 23 points (38% → 61%). Restaurant improved 18 points (29% → 47%). But the Arize breakdown showed composite barely moved — from 31% to 34%. Web search helps foods with pages. It does nothing for homemade meals with no searchable identity.

Phase 3: USDA FoodData Central (experiments 23–28)

Tavily returns web content — variable quality, sometimes paywalled. USDA FoodData Central is lab-measured data for ~1.3 million foods. When a food matches, it's authoritative.

Experiment                        | ±10%  | ±25%  | vs Tavily best
──────────────────────────────────┼───────┼───────┼───────────────
Exp 23: USDA only                 |  55%  |  71%  |  -8pp (no fallback)
Exp 24: USDA → LLM fallback       |  68%  |  81%  |  +5pp
Exp 25: USDA → Tavily → LLM       |  71%  |  83%  |  +8pp
Exp 26: USDA priority tuning      |  70%  |  83%  |  +7pp
Exp 27: USDA + adv Tavily branded |  78%  |  89%  | +15pp
Exp 28: USDA strict match         |  69%  |  82%  |  +6pp

Exp 23 showed USDA alone was worse than Tavily because it has no fallback for the 65% of foods it doesn't match. Exp 24 added LLM fallback. Exp 25 put Tavily between USDA and the LLM. Exp 27 combined USDA with the optimized Tavily query from exp 18 — the best single-phase result so far at 78%.
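The fallback chain in exp 25 can be pictured as an ordered list of lookups where the first hit wins. The sketch below uses stub lookups with placeholder values in place of the real USDA, Tavily, and LLM calls; `resolve` and the stub names are hypothetical, and the production code may be structured differently.

```python
from typing import Callable, List, Optional, Tuple

Lookup = Callable[[str], Optional[dict]]


def resolve(description: str, chain: List[Tuple[str, Lookup]]) -> dict:
    """Try each source in order and return the first hit, tagged with its source."""
    for source_name, lookup in chain:
        result = lookup(description)
        if result is not None:
            return {**result, "data_source": source_name}
    raise LookupError(f"no source could resolve {description!r}")


# Stub lookups with placeholder values (not real nutrition data).
def usda_stub(description: str) -> Optional[dict]:
    table = {"rolled oats 40g": {"calories": 152}}
    return table.get(description.lower())


def tavily_stub(description: str) -> Optional[dict]:
    return None  # pretend the web search found nothing usable


def llm_stub(description: str) -> Optional[dict]:
    return {"calories": 300}  # the LLM always produces some estimate


CHAIN = [("usda", usda_stub), ("tavily", tavily_stub), ("llm", llm_stub)]

print(resolve("Rolled oats 40g", CHAIN)["data_source"])  # usda
print(resolve("grain bowl", CHAIN)["data_source"])       # llm
```

Exp 23 corresponds to a chain with only the first entry, which is why unmatched foods had nowhere to go.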

Category breakdown — exp 25 (USDA → Tavily → LLM):

Category accuracy at ±10%:

  simple       88%  ████████████████████████████████████████████░
  branded      68%  ██████████████████████████████████░
  restaurant   51%  █████████████████████████░
  composite    58%  █████████████████████████████░
  ambiguous    50%  █████████████████████████░
  edge_case    33%  ████████████████░

Simple foods nearly saturated at 88%. USDA has comprehensive data for whole foods — chicken breast, oats, rice — and those resolved on the first lookup. But Arize showed USDA barely moved restaurant (47% → 51%) because USDA doesn't have restaurant items. The category view told us exactly where to focus next.

Phase 4: model escalation and retry logic (experiments 29–30)

If the first attempt returns low confidence, retry with a better model. gpt-4o-mini handles confident items fast; claude-3-5-sonnet is worth the latency for uncertain ones.

RETRY_MODELS = [
    "openai/gpt-4o-mini",
    "anthropic/claude-3-5-sonnet-20241022",
    "anthropic/claude-3-5-sonnet-20241022",
]
CONFIDENCE_THRESHOLD = 75

Experiment                        | ±10%  | ±25%  | Avg latency
──────────────────────────────────┼───────┼───────┼────────────
Exp 25: USDA → single model       |  71%  |  83%  |   1.1s
Exp 29: USDA → 3-model escalation |  76%  |  87%  |   1.4s
Exp 30: escalation + conf. tuning |  75%  |  87%  |   1.5s

Five more points for 0.3 seconds of additional latency. Most items resolved on attempt 1 — the 1.4s average reflects the minority that escalated. The Arize experiment table made this tradeoff visible: three experiments, same dataset, ranked by accuracy.
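The escalation loop implied by RETRY_MODELS and CONFIDENCE_THRESHOLD might look like the following. This is a sketch, not the production code: `call_model` is a hypothetical callable standing in for the real LLM request, and returning the highest-confidence attempt when nothing clears the threshold is an assumption.

```python
RETRY_MODELS = [
    "openai/gpt-4o-mini",
    "anthropic/claude-3-5-sonnet-20241022",
    "anthropic/claude-3-5-sonnet-20241022",
]
CONFIDENCE_THRESHOLD = 75


def analyze_with_escalation(description, call_model):
    """Try each model in order; stop at the first answer that clears the threshold.

    call_model(model, description) is a hypothetical callable returning a dict
    with at least "calories" and "confidence" (0-100).
    """
    best = None
    for attempt, model in enumerate(RETRY_MODELS):
        result = dict(call_model(model, description), model=model, attempt=attempt)
        if best is None or result["confidence"] > best["confidence"]:
            best = result
        if result["confidence"] >= CONFIDENCE_THRESHOLD:
            break
    return best  # assumed fallback: the most confident attempt seen


def fake_model(model, description):
    """Stand-in for a real LLM call: the small model is unsure about bowls."""
    unsure = "bowl" in description and model.endswith("gpt-4o-mini")
    return {"calories": 600, "confidence": 60 if unsure else 82}


easy = analyze_with_escalation("banana", fake_model)
hard = analyze_with_escalation("grain bowl", fake_model)
print(easy["attempt"], hard["attempt"])  # 0 1
```

Because most items exit on attempt 0, the average latency rises only slightly even though the escalated path is slower.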

Phase 5: user-history RAG (experiments 29, 31–35)

The insight from drilling into failure cases in Arize's span explorer: composite failures clustered around meals with no external signal. "grain bowl," "smoothie bowl," "homemade stir-fry." Tavily found nothing useful. The model guessed.

But MOBA users have logged thousands of meals. Their history is better reference data than any web search.

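import os

import httpx

SUPABASE_URL = os.environ["SUPABASE_URL"]

# find_similar_meals RPC in Supabase using pg_trgm
# (Supabase auth headers omitted for brevity)
async def rag_lookup(description: str) -> str | None:
    payload = {"query_text": description, "max_results": 5, "min_similarity": 0.15}
    try:
        # 4-second timeout so the lookup never blocks the critical path
        async with httpx.AsyncClient(timeout=4.0) as http:
            res = await http.post(
                f"{SUPABASE_URL}/rest/v1/rpc/find_similar_meals", json=payload
            )
            res.raise_for_status()
            results = res.json()
    except httpx.HTTPError:
        return None  # timeout or API error: continue without RAG context
    if not results:
        return None  # nothing above the similarity threshold
    lines = [f'  - "{r["description"]}": {r["calories"]} kcal, {r["protein_g"]}g protein'
             for r in results]
    return "Similar meals users have logged:\n" + "\n".join(lines)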

Experiment                         | ±10%  | ±25%  | vs exp 25
───────────────────────────────────┼───────┼───────┼──────────
Exp 31: RAG only                   |  61%  |  76%  |  -10pp  (no fallback)
Exp 32: RAG → LLM fallback         |  72%  |  84%  |   +1pp
Exp 33: RAG parallel w/ Tavily     |  78%  |  89%  |   +7pp
Exp 29: USDA → escalation + RAG    |  80%  |  90%  |   +9pp  ← best
Exp 34: RAG confidence weighting   |  77%  |  88%  |   +6pp
Exp 35: RAG similarity tuning      |  78%  |  89%  |   +7pp

Exp 29 — USDA lookup, then 3-model escalation with Tavily search, augmented with user-history RAG — hit 80%. The category breakdown from Arize:

Category accuracy at ±10% — exp 29 (best strategy):

  simple       94%  ███████████████████████████████████████████████░
  branded      74%  █████████████████████████████████████░
  restaurant   58%  █████████████████████████████░
  composite    71%  ████████████████████████████████████░
  ambiguous    67%  █████████████████████████████████░
  edge_case    50%  █████████████████████████░

The composite improvement was most notable: 58% → 71%. RAG grounded the model in what MOBA users have actually logged rather than a generic estimate.

What the Arize span explorer showed for RAG hits: Clicking into a successful composite case — "grain bowl from Sweetgreen with chicken and farro" — showed the exact RAG context the model received:

Similar meals users have logged:
  - "Sweetgreen harvest bowl with chicken": 620 kcal, 48g protein
  - "grain bowl chicken farro kale": 580 kcal, 44g protein
  - "farro bowl with roasted veggies": 490 kcal, 18g protein

With that context, the model's estimate of 605 calories landed within ±3% of ground truth. Without it, the model had estimated 480, about 19% below the 590 kcal ground truth.

Phase 6: combination experiments — the paradox (experiments 36–45)

Each phase added one source and improved accuracy. The natural hypothesis: combining all sources would produce the best results.

We ran every combination we could think of. Arize stored all of them.

Experiment                              | ±10%  | ±25%
────────────────────────────────────────┼───────┼──────
Exp 29: USDA → escalation (before RAG) |  76%  |  87%
Exp 27: USDA → Tavily adv optimized    |  78%  |  89%
Exp 39: USDA + stricter conf threshold |  79%  |  90%
Exp 29: + user-history RAG             |  80%  |  90%  ← best
────────────────────────────────────────┼───────┼──────
Exp 36: RAG + Tavily combined prompt   |  74%  |  86%  ↓
Exp 37: USDA + RAG combined            |  74%  |  87%  ↓
Exp 38: USDA + RAG + Tavily + answer   |  72%  |  85%  ↓
Exp 40: multi-source reranker          |  73%  |  85%  ↓
Exp 41: full stack (all 4 sources)     |  68%  |  82%  ↓ worst
Exp 42: weighted source blend          |  71%  |  84%  ↓
Exp 43: source confidence router       |  75%  |  87%  ↓
Exp 44: selective OFF+RAG              |  77%  |  88%  ↓
Exp 45: composite decompose + RAG      |  64%  |  80%  ↓ worst on composites

Every combination experiment performed worse than exp 29. Exp 41, with all four sources, scored 68% — 12 points below the winner despite having strictly more information.

This is only visible because all 45 runs are in the Arize comparison table. The paradox wasn't a hypothesis we set out to test. We saw it because every experiment was stored with the same evaluators against the same dataset, and they all appeared in one sortable table.

The Arize span drill-down explained why combinations failed. Looking at a specific failure from exp 41:

The model received:

USDA entry for "plain nonfat yogurt": 100 kcal per 170g
RAG entry for "Chobani 0% honey": 150 kcal per 170g
Tavily result mentioning "Chobani Greek 0% plain": 90 kcal per 100g
Tavily answer snippet: "Greek yogurt typically 50-100 calories per 100g"

The model returned 118 calories, a blend of the conflicting sources rather than any one of them. Ground truth for the logged item was 100 kcal, so the estimate was 18% high. It reported 86% confidence.

That case, readable in Arize's span explorer, explained the pattern: multi-source prompts caused the model to average conflicting sources rather than select the best one, and they inflated confidence ("more data to work with") even when the sources disagreed. Confidence became uncorrelated with accuracy — a broken signal.
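Part of what makes this prompt hard is that the sources do not share a serving basis: two are per 170 g, one is per 100 g, and the RAG entry is a different product altogether. Putting the three point estimates on the same 170 g basis makes the disagreement explicit. This is an illustration of the example above, not something the post describes building; `kcal_per_serving` is a hypothetical helper.

```python
SERVING_GRAMS = 170

# (kcal, grams the figure refers to), copied from the exp 41 prompt above.
sources = {
    "usda": (100, 170),
    "rag": (150, 170),
    "tavily": (90, 100),
}


def kcal_per_serving(kcal, per_grams, serving_grams=SERVING_GRAMS):
    """Rescale a calorie figure to the logged serving size."""
    return kcal * serving_grams / per_grams


normalized = {name: kcal_per_serving(kcal, grams)
              for name, (kcal, grams) in sources.items()}
low, high = min(normalized.values()), max(normalized.values())
spread = (high - low) / low

print(normalized)                # {'usda': 100.0, 'rag': 150.0, 'tavily': 153.0}
print(f"spread: {spread:.0%}")   # spread: 53%
```

On a common basis the sources range from 100 to 153 kcal, a 53% spread, while the model still reported 86% confidence.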

Category heatmap — combination paradox:

              | Exp 29 | Exp 36 | Exp 38 | Exp 41 | Exp 44 | Exp 45
              | (best) | RAG+Tav| 3-src  | 4-src  | select | decomp
──────────────┼────────┼────────┼────────┼────────┼────────┼───────
simple        |  94%   |  91%   |  88%   |  84%   |  88%   |  87%
branded       |  74%   |  71%   |  69%   |  65%   |  72%   |  70%
restaurant    |  58%   |  55%   |  52%   |  49%   |  56%   |  54%
composite     |  71%   |  65%   |  62%   |  58%   |  67%   |  64%
ambiguous     |  67%   |  62%   |  58%   |  54%   |  60%   |  58%
edge_case     |  50%   |  46%   |  42%   |  38%   |  44%   |  42%
──────────────┼────────┼────────┼────────┼────────┼────────┼───────
overall ±10%  |  80%   |  74%   |  72%   |  68%   |  77%   |  64%

Combination strategies were uniformly worse across every category. Simple > complex, in every case. Pick the best source for each food type, use it exclusively, don't synthesize.

The full accuracy progression

Across all 45 experiments, the improvement looked like this:

±10% STRICT CALORIE ACCURACY — progression by phase

  Base    (1)   52%  ██████████████████████████░
  Prompt  (12)  52%  ██████████████████████████░
  +Tav    (18)  63%  ███████████████████████████████░
  +USDA   (25)  71%  ███████████████████████████████████░
  +Esc    (29)  76%  ██████████████████████████████████████░
  +Opt    (27)  78%  ███████████████████████████████████████░
  +RAG    (29)  80%  ████████████████████████████████████████░

±25% LENIENT CALORIE ACCURACY — same progression

  Base    68%  ██████████████████████████████████░
  Prompt  68%  ██████████████████████████████████░
  +Tav    76%  ██████████████████████████████████████░
  +USDA   83%  █████████████████████████████████████████░
  +Esc    87%  ███████████████████████████████████████████░
  +Opt    89%  ████████████████████████████████████████████░
  +RAG    90%  █████████████████████████████████████████████░

IMPROVEMENT BREAKDOWN — each phase's contribution to final 80%

  Source                         | ±10% gain | Cumulative
  ───────────────────────────────┼───────────┼───────────
  LLM baseline (best)            |     —     |    52%
  Tavily search (exp 18)         |   +11pp   |    63%
  USDA FoodData Central (exp 25) |    +8pp   |    71%
  Model escalation (exp 29)      |    +5pp   |    76%
  User-history RAG (exp 29)      |    +4pp   |    80%
  ───────────────────────────────┼───────────┼───────────
  Total improvement              |   +28pp   |    80%

No single change was a magic bullet. Each one added 4–11 points. The final 80% is the product of all four in sequence — and Arize let us measure each one cleanly, because every experiment ran against the same 110-case dataset with the same evaluators.

What shipped to production

Exp 29 runs in production at backend/src/lib/analyzeWithRetry.ts. The lookup chain:

Input: "grain bowl from Sweetgreen with chicken and farro"
  │
  ├─ 1. USDA lookup ──────────────── no match (restaurant item)
  │
  ├─ 2. RAG lookup (parallel) ────── "Sweetgreen harvest bowl": 620 kcal
  │
  ├─ 3. Tavily search ─────────────── "Sweetgreen chicken farro bowl nutrition"
  │       attempt 0: gpt-4o-mini
  │       confidence: 82 ≥ 75 → done
  │
  └─ Result: 608 kcal  (ground truth: 590)  ±3%  ✓

Production spans land in the same Arize space as experiments. The food.data_source attribute shows which path each request took:

Arize traces — data source distribution (last 7 days):

  usda            34%  ████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
  rag+tavily+llm  41%  ████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░
  tavily+llm      25%  ████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

A third of live requests hit USDA and skip the LLM entirely — the cheapest and most accurate path. Four in ten use RAG. The production confidence distribution mirrors the experiment numbers: median confidence 84, with the tail below 60 representing the edge cases we haven't solved yet.
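The same distribution can be recomputed from exported span attributes. The sketch assumes spans have been exported as a list of attribute dicts keyed by the attribute names set in the tracing code above; the export step itself is not shown, and the sample values are made up.

```python
from collections import Counter


def source_distribution(spans):
    """Percentage of spans per food.data_source value, most common first."""
    counts = Counter(span.get("food.data_source", "unknown") for span in spans)
    total = sum(counts.values())
    return {source: 100 * n / total for source, n in counts.most_common()}


def low_confidence(spans, threshold=60):
    """Spans in the unsolved tail: confidence below the threshold."""
    return [span for span in spans if span.get("food.confidence", 0) < threshold]


# Made-up sample spans, for illustration only.
spans = [
    {"food.data_source": "usda", "food.confidence": 95},
    {"food.data_source": "rag+tavily+llm", "food.confidence": 84},
    {"food.data_source": "rag+tavily+llm", "food.confidence": 55},
    {"food.data_source": "tavily+llm", "food.confidence": 72},
]

print(source_distribution(spans))
print(len(low_confidence(spans)))  # 1
```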

What 80% means for athletes

On a 3,000-calorie target, ±10% per item means any single food entry could be off by ~30–60 calories. Across 6 items per day, errors are independent and partially cancel. Aggregate daily calorie error stays within ±5% for ~85% of days.
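The cancellation argument can be made concrete under a simple model. Assume the day consists of n equally sized items whose relative errors are independent, zero-mean, and normally distributed with standard deviation s. The daily relative error then has standard deviation s/√n. These are modelling assumptions, not measurements from the post, and the 8.5% per-item figure below is a hypothetical value chosen because it reproduces roughly the 85% quoted above.

```python
import math


def share_of_days_within(tolerance, per_item_sd, items_per_day):
    """P(|daily relative error| <= tolerance) under the independent-normal model."""
    daily_sd = per_item_sd / math.sqrt(items_per_day)
    return math.erf(tolerance / (daily_sd * math.sqrt(2)))


single_item = share_of_days_within(0.05, 0.085, 1)
six_items = share_of_days_within(0.05, 0.085, 6)
print(f"one item:  {single_item:.2f}")
print(f"six items: {six_items:.2f}")
```

With six items the daily total is far more likely to land within ±5% than any single entry is, which is the effect the paragraph above describes. Systematic bias (for example, consistently underestimating oils) would not cancel this way.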

That's accurate enough to reliably detect a 300-calorie daily deficit — the signal an athlete needs to manage recovery. It's not accurate enough to detect a 50-calorie change, and it shouldn't claim to be.

The hardest unsolved category is edge cases: "handful of trail mix from Costco" or "my mom's pasta dish." These sit at 50% even with exp 29. They require product recognition (label scanning) or user clarification. That's the next phase — and we'll run it the same way: build a dataset, write evaluators, push experiments to Arize, read the comparison table.

Takeaways

Arize made the measurement cost-free. Without client.experiments.run(), measuring a hypothesis meant writing a custom test harness, managing dataset CSV files, and building aggregation code. With it, the measurement cost was essentially zero — so we could test 45 hypotheses instead of 5.

The category breakdown changed everything. The overall accuracy number was misleading. The Arize label distribution, powered by eval_get_category, showed which categories were solved and which needed completely different strategies. We would have optimized the wrong thing without it.

The combination paradox came from the comparison table. We didn't design an experiment to test whether combining sources was bad. We ran a lot of experiments, they all lived in Arize, and the table made the pattern obvious. This is the compounding value of having all results in one place — patterns emerge that you weren't looking for.

The span explorer explained the why. Arize stores the full input and output for every span. When combination experiments failed, we could read exactly what the model received and what it returned — and see the averaging behavior that was causing failures. The number from the evaluator told us it was failing. The span told us why.

Offline experiments and production tracing are the same loop. The strategy we validated in experiments is the strategy running in production, tagged with the same attributes, visible in the same Arize UI. If production confidence scores drift from experiment baselines, it's visible immediately — not buried in logs.

Setup took about two weeks of part-time work: dataset format, evaluator design, runner scaffolding. Once the Arize pipeline was in place, each new hypothesis took about 20 minutes to test. That ratio, two weeks of setup and then 20 minutes per experiment, is what made systematic improvement possible.

If you're shipping AI features and measuring accuracy by feel, the number you think you have and the number you actually have are probably not the same. Arize is how you find out which one is real — and how you build the case for every improvement you make after that.