Benchmark — Kirha vs Web Search
December 16, 2025 - Pierre Hay

We compared Kirha and Web Search across 100 real-time queries and let Gemini 2.5 score the outputs. Here's what we found.

Many people assume that the data providers sell must be private at its core. The truth is, across industries, most of it originates in the public domain: price feeds, SEC filings, company earnings, court decisions. Blockchains are an even starker example: every transaction lives on a public ledger, yet good luck turning that raw data into actionable insights.

So what are customers actually paying for? Real-time synchronization, domain-specific aggregation, and the accountability that comes from professional data providers.

For AI agents needing external context, the options have been limited:

  • Web search is broad but messy. It's not accurate enough for agents working with real-time, specialized data: usage-based web indexers prioritize coverage over depth, so staying current on domain-specific data is tough.
  • Premium data sources deliver the depth and reliability agents need, but they come with subscriptions, enterprise contracts, weeks of integration work, and ongoing maintenance.

Kirha gives AI agents the best of both worlds: usage-based access to premium data sources, fully traceable and auditable, through a single /search endpoint. We route queries to a network of premium providers first and fall back to web search when needed—so coverage is always at least as broad, but depth is significantly better where it matters.
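
To make that concrete, here's a minimal sketch of what a call could look like. The base URL, auth scheme, and response shape below are assumptions for illustration; the real contract lives in the Kirha API docs.

  // Minimal sketch of calling the /search endpoint (TypeScript).
  // KIRHA_API, the bearer-token auth, and SearchResult are hypothetical.
  const KIRHA_API = "https://api.kirha.example/v1";

  interface SearchResult {
    answer: string;    // assumed: synthesized answer text
    sources: string[]; // assumed: provenance, for traceability
  }

  async function search(query: string, apiKey: string): Promise<SearchResult> {
    const res = await fetch(`${KIRHA_API}/search`, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`, // assumed auth scheme
      },
      body: JSON.stringify({ query }),
    });
    if (!res.ok) throw new Error(`search failed: ${res.status}`);
    return (await res.json()) as SearchResult;
  }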

Today, we’re releasing the first public benchmark measuring that difference.


Results

Across 100 domain-specific queries spanning company data, insurance, and crypto:

                   Kirha       Web Search
  Score            87/100      61/100
  Tokens injected  233,920     4,604,853

Kirha scores 43% higher while using 95% fewer tokens. That’s not just better answers. It’s dramatically more efficient context for your agents.
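
Those headline figures follow directly from the table: (87 - 61) / 61 ≈ 0.43, or about 43% higher on score, and 1 - 233,920 / 4,604,853 ≈ 0.95, or about 95% fewer tokens injected.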


Methodology

We use the LLM-as-a-Judge pattern to score results across five criteria: relevance, accuracy, completeness, freshness, and actionability. Each query runs against both Kirha and Web Search in parallel. Results are summarized by Gemini 2.5 Flash to normalize output format, then scored by the same model with extended thinking enabled.
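
As a rough illustration, one judging round could look like the sketch below. The kirha, webSearch, and gemini helpers are hypothetical stand-ins for the actual harness; the real prompts and scoring rubric are in the full report.

  // Sketch of one LLM-as-a-Judge round. All helpers are placeholders
  // supplied by the caller; only the flow mirrors the methodology above.
  type Retriever = (query: string) => Promise<string>;
  type Judge = (req: { prompt: string; thinking?: boolean }) => Promise<string>;

  const CRITERIA = ["relevance", "accuracy", "completeness", "freshness", "actionability"];

  async function judgeQuery(
    query: string,
    kirha: Retriever,
    webSearch: Retriever,
    gemini: Judge,
  ): Promise<string> {
    // 1. Run the query against both systems in parallel.
    const [kirhaRaw, webRaw] = await Promise.all([kirha(query), webSearch(query)]);

    // 2. Summarize both outputs with the same model to normalize format.
    const [kirhaSummary, webSummary] = await Promise.all([
      gemini({ prompt: `Summarize this search output:\n${kirhaRaw}` }),
      gemini({ prompt: `Summarize this search output:\n${webRaw}` }),
    ]);

    // 3. Score both summaries on the five criteria, extended thinking on.
    return gemini({
      thinking: true,
      prompt: [
        `Score each answer from 0 to 100 on: ${CRITERIA.join(", ")}.`,
        `Query: ${query}`,
        `Answer A (Kirha): ${kirhaSummary}`,
        `Answer B (Web Search): ${webSummary}`,
      ].join("\n"),
    });
  }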

A common best practice with LLM-as-a-Judge is to cross-reference scores against human evaluation and aim for a high correlation. For this v1 we took a lighter approach: we asked Claude to review all results alongside their judge scores and flag inconsistencies. Read the full report.

“The benchmark is reliable. The judge makes correct evaluations in ~96% of cases. The 4 inconsistencies identified are edge cases that warrant discussion but don’t invalidate the overall methodology.”

A note on our baseline: we used Exa as the Web Search provider, not to single them out, but because they're a strong representative of the category. We may add other providers in future iterations. The point of this benchmark isn't to claim Web Search is broken; it's to demonstrate that for real-time, domain-specific queries, tailored integrations will always outperform generic web search.

Want to learn more?

Let’s chat
