Search/Eval dashboard

Eval and tracing

The eval set has 30 labeled queries across 5 categories (filter accuracy, semantic match, geo, Fair Housing guard, researcher quality). The runner executes them sequentially through the production agents and scores each result with deterministic rules. The stats below come from traces-v1 in OpenSearch.
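As a rough illustration of the run-and-score loop described above, here is a minimal sketch. It is not the actual runner: `EvalCase`, `run_eval`, `fake_agent`, and the sample queries and rules are all hypothetical names and data chosen for the example, and the stand-in agent simply echoes the query instead of calling a production agent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One labeled query (hypothetical shape, not the real schema)."""
    query: str
    category: str
    check: Callable[[str], bool]  # deterministic rule: same output, same verdict


def run_eval(cases: List[EvalCase], agent: Callable[[str], str]) -> Dict[str, Dict[str, int]]:
    """Run cases one at a time and tally pass counts per category."""
    results: Dict[str, Dict[str, int]] = {}
    for case in cases:  # sequential: one case finishes before the next starts
        output = agent(case.query)
        passed = case.check(output)
        bucket = results.setdefault(case.category, {"passed": 0, "total": 0})
        bucket["total"] += 1
        bucket["passed"] += int(passed)
    return results


def fake_agent(query: str) -> str:
    """Stand-in for a production agent (assumption: agents return text)."""
    return f"results for: {query.lower()}"


# Three illustrative cases; the real set has 30 across 5 categories.
CASES = [
    EvalCase("2 bed under 500k", "filter accuracy", lambda out: "2 bed" in out),
    EvalCase("Homes near good schools", "geo", lambda out: "schools" in out),
    EvalCase("Quiet street", "semantic match", lambda out: "garden" in out),
]

summary = run_eval(CASES, fake_agent)
for category, tally in summary.items():
    print(f"{category}: {tally['passed']}/{tally['total']}")
```

Because each rule is a pure function of the agent output, re-running the same case against the same output always gives the same score, which is what makes the totals comparable between runs.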

Total traces: n/a
Total spans: n/a
Agents seen: 0
Eval cases: 30
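The figures above could be derived from a single aggregation request against traces-v1. The sketch below only builds the request body and summarizes a response; it does not talk to a cluster. Several things are assumptions: that the index stores one document per span, that the fields are named `trace_id` and `agent`, and that "n/a" is shown when no response is available. `build_stats_query` and `summarize` are hypothetical helper names.

```python
from typing import Any, Dict, Optional


def build_stats_query() -> Dict[str, Any]:
    """Search body for the stats (field names are assumptions)."""
    return {
        "size": 0,                  # only counts are needed, not documents
        "track_total_hits": True,   # ask for an exact hit count
        "aggs": {
            "traces": {"cardinality": {"field": "trace_id"}},
            "agents": {"cardinality": {"field": "agent"}},
        },
    }


def summarize(response: Optional[Dict[str, Any]], eval_cases: int) -> Dict[str, Any]:
    """Map a search response to the four dashboard figures."""
    if response is None:
        # Assumed fallback when the index cannot be queried.
        return {
            "Total traces": "n/a",
            "Total spans": "n/a",
            "Agents seen": 0,
            "Eval cases": eval_cases,
        }
    aggs = response.get("aggregations", {})
    return {
        "Total traces": aggs["traces"]["value"],
        # Assumes one document per span, so the hit count is the span count.
        "Total spans": response["hits"]["total"]["value"],
        "Agents seen": aggs["agents"]["value"],
        "Eval cases": eval_cases,
    }


# Illustrative response in the standard search-response shape (made-up numbers).
sample = {
    "hits": {"total": {"value": 12, "relation": "eq"}},
    "aggregations": {"traces": {"value": 3}, "agents": {"value": 2}},
}

empty_stats = summarize(None, eval_cases=30)
sample_stats = summarize(sample, eval_cases=30)
```

Note that cardinality aggregations return approximate counts, so the trace and agent figures would be estimates on large indices.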