Strata MCP vs Official MCPs: A Real‑World Benchmark on Notion and GitHub



Key takeaways

  • Strata MCP achieved higher success rates than the official GitHub and Notion MCP servers on Mcpmark’s real‑world tasks.
  • On Notion (28 tasks), Strata MCP improved pass@1 by +13.4 pts (34.8% vs 21.4%) and cut cost per run by 32.6%.
  • On GitHub (23 tasks), Strata MCP improved pass@1 by +15.2 pts (31.5% vs 16.3%) at 20.3% lower cost per run.
  • Reliability gains were strongest on “all-four-runs succeed” (Pass^4): 3.5× higher on Notion and 2.5× higher on GitHub.
  • Strata MCP consistently used fewer tokens (−24% to −35%), trading more agentic steps for better final accuracy.

Executive summary

We evaluated Strata MCP against the official GitHub and Notion MCP servers using the public Mcpmark benchmark. The benchmark comprises hand-designed, end‑to‑end tasks that require the model to reason, call MCP tools, modify external systems (GitHub/Notion), and pass automated verification.

Using the same model (claude-sonnet-4) and identical prompts, Strata MCP delivered higher success rates at lower token usage and lower cost across both task families:

  • GitHub (23 tasks): pass@1 31.5% vs 16.3% (+93% relative), pass@4 39.10% vs 30.40%, Pass^4 21.74% vs 8.70% (2.5×); −24% tokens; −20% cost.
  • Notion (28 tasks): pass@1 34.8% vs 21.4% (+63% relative), pass@4 50.00% vs 39.30%, Pass^4 25.00% vs 7.14% (3.5×); −35% tokens; −33% cost.

These gains come from Strata MCP’s design, which greatly reduces tool context while increasing tool coverage.

Benchmark design and task examples

We used the Mcpmark benchmark suite across two real‑world integrations:

GitHub tasks (23 tasks)

These tasks focus on configuration‑as‑code, repository hygiene, and delivery workflows. They commonly require:

  • Authoring or modifying GitHub Actions workflows (YAML)
  • Interacting with commit history and tags
  • Implementing policies for linting, testing, versioning, and releases
  • Producing human‑readable artifacts (e.g., changelogs) from repository metadata

Representative skills measured:

  • YAML correctness, job orchestration, and event scoping
  • Safe use of marketplace actions (pinning versions, least privilege)
  • Semantic versioning (SemVer) and release discipline
  • Scripting for repository queries (e.g., commit metadata)
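
To make the last item concrete, here is a minimal sketch of the kind of repository-query scripting these tasks require: pulling commit metadata via the GitHub REST API and drafting changelog entries. The owner/repo names and the GITHUB_TOKEN environment variable are illustrative placeholders, not part of the Mcpmark harness.

```python
# Hypothetical sketch: pull recent commit metadata and draft changelog entries.
# OWNER/REPO and GITHUB_TOKEN are placeholders, not benchmark repositories.
import os
import requests

OWNER, REPO = "example-org", "example-repo"

def recent_commits(limit: int = 20) -> list[dict]:
    """Fetch the most recent commits from the default branch via the GitHub REST API."""
    resp = requests.get(
        f"https://api.github.com/repos/{OWNER}/{REPO}/commits",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        params={"per_page": limit},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

def changelog_lines(commits: list[dict]) -> list[str]:
    """Render one-line changelog entries: date, short SHA, first line of the message."""
    lines = []
    for c in commits:
        sha = c["sha"][:7]
        date = c["commit"]["author"]["date"][:10]
        subject = c["commit"]["message"].splitlines()[0]
        lines.append(f"- {date} {sha} {subject}")
    return lines

if __name__ == "__main__":
    print("\n".join(changelog_lines(recent_commits())))
```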

Notion tasks (28 tasks)

These tasks assess information design and workspace operations in Notion, including:

  • Editing and structuring content
  • Designing and refactoring databases (properties, relations, rollups)
  • Using views, filters, grouping, and formula logic
  • Summarization and planning workflows for everyday productivity

Representative skills measured:

  • Translating goals into workable page/database structures
  • Choosing appropriate property types and formulas (see the sketch after this list)
  • Building understandable, maintainable views and summaries
  • Applying consistent styles and conventions
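
For a flavor of the database-design work these tasks involve, here is a minimal sketch that creates a small task-tracker database through the public Notion REST API. The token, parent page ID, and property choices are illustrative placeholders, not an actual benchmark task.

```python
# Hypothetical sketch: create a small task-tracker database via the Notion REST API.
# NOTION_TOKEN and PARENT_PAGE_ID are placeholders; the properties mirror the skills above.
import os
import requests

PARENT_PAGE_ID = "00000000-0000-0000-0000-000000000000"  # placeholder parent page

payload = {
    "parent": {"type": "page_id", "page_id": PARENT_PAGE_ID},
    "title": [{"type": "text", "text": {"content": "Weekly Tasks"}}],
    "properties": {
        "Name": {"title": {}},  # every database needs one title property
        "Status": {"select": {"options": [
            {"name": "Todo"}, {"name": "In progress"}, {"name": "Done"},
        ]}},
        "Due": {"date": {}},
        "Effort (h)": {"number": {"format": "number"}},
    },
}

resp = requests.post(
    "https://api.notion.com/v1/databases",
    headers={
        "Authorization": f"Bearer {os.environ['NOTION_TOKEN']}",
        "Notion-Version": "2022-06-28",
        "Content-Type": "application/json",
    },
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print("Created database:", resp.json()["id"])
```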

Evaluation protocol and metrics

How success is determined

  • The model receives the task prompt and calls the MCP server’s tools (GitHub or Notion).
  • The MCP server executes the requested modifications against the external system.
  • An automated checker validates the final state (page structure, counts, links, commit file contents, and repository history); a toy example of this pattern follows the list below.
  • A task counts as “success” if it passes verification.
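
Mcpmark ships its own verifiers; purely to illustrate the pattern, a toy checker for a GitHub-style task might look like the following. The repository, file path, and assertions are hypothetical.

```python
# Illustrative only: a toy verifier in the spirit of Mcpmark's automated checks
# (the real benchmark ships its own). Repo, path, and assertions are hypothetical.
import base64
import os
import requests

def fetch_file(owner: str, repo: str, path: str) -> str:
    """Read a file's contents from the default branch via the GitHub REST API."""
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/contents/{path}",
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
        timeout=30,
    )
    resp.raise_for_status()
    return base64.b64decode(resp.json()["content"]).decode()

def verify(owner: str, repo: str) -> bool:
    """Pass only if the final repository state matches the (hypothetical) task spec."""
    try:
        workflow = fetch_file(owner, repo, ".github/workflows/ci.yml")
    except requests.HTTPError:
        return False  # required artifact missing
    return "pull_request" in workflow and "permissions:" in workflow

print("success" if verify("example-org", "example-repo") else "failure")
```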

Metrics reported

  • pass@1 (avg ± std): Average single‑run success rate across tasks, with per‑task standard deviation.
  • pass@4: Fraction of tasks with at least one success across four independent runs (empirically measured).
  • Pass^4: Fraction of tasks for which all four runs succeed (empirically measured); a small sketch after this list shows how the three pass metrics are computed.
  • Efficiency: average tokens, turns, wall‑clock time per task, and estimated cost per run.
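
Concretely, assuming each task’s four run outcomes are recorded as booleans, the three pass metrics reduce to the following. This is a minimal sketch mirroring the definitions above, not Mcpmark’s own code.

```python
# Minimal sketch of the headline metrics, assuming `runs` maps each task
# to its four boolean outcomes (illustrative data, not benchmark results).
from statistics import mean

runs: dict[str, list[bool]] = {
    "task_a": [True, False, True, True],
    "task_b": [False, False, False, True],
    # ... one entry per benchmark task
}

pass_at_1 = mean(sum(r) / len(r) for r in runs.values())        # average single-run success rate
pass_at_4 = sum(any(r) for r in runs.values()) / len(runs)      # at least one of four runs succeeds
pass_pow_4 = sum(all(r) for r in runs.values()) / len(runs)     # all four runs succeed

print(f"pass@1={pass_at_1:.1%}  pass@4={pass_at_4:.1%}  Pass^4={pass_pow_4:.1%}")
```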

Experimental setup

  • Model: claude-sonnet-4
  • MCP servers:
    • Strata MCP (Klavis AI)
    • GitHub Official MCP Server
    • Notion Official MCP Server
  • Tasks: 23 GitHub tasks, 28 Notion tasks (hand‑designed by Mcpmark)
  • Repetitions: 4 runs per task, per MCP
  • Success criteria: Automated verification of final GitHub/Notion state

Results

GitHub tasks (23)

| MCP | Model | pass@1 (avg ± std) | pass@4 | Pass^4 | Avg Tokens | Turns | Avg Time | Cost/run |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Klavis AI Strata MCP Server | claude-sonnet-4-20250514 | 31.5 ± 3.6% | 39.10% | 21.74% | 533,385 | 21.7 | 358.3s | $39.55 |
| GitHub Official MCP Server | claude-sonnet-4-20250514 | 16.3 ± 5.7% | 30.40% | 8.70% | 701,252 | 11.2 | 196.5s | $49.61 |

Highlights

  • Accuracy: +15.2 pts pass@1 (+93% relative); pass@4 +8.7 pts; Pass^4 2.5×.
  • Efficiency: −24% tokens; −20% cost.
  • Latency: Strata MCP takes longer (+82%), reflecting more agentic steps to ensure correctness.

Notion tasks (28)

| MCP | Model | pass@1 (avg ± std) | pass@4 | Pass^4 | Avg Tokens | Turns | Avg Time | Cost/run |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Klavis AI Strata MCP Server | claude-sonnet-4-20250514 | 34.8 ± 6.4% | 50.00% | 25.00% | 424,474 | 24.3 | 147.6s | $37.83 |
| Notion Official MCP Server | claude-sonnet-4-20250514 | 21.4 ± 5.1% | 39.30% | 7.14% | 650,879 | 19.7 | 193.2s | $56.10 |

Highlights

  • Accuracy: +13.4 pts pass@1 (+63% relative); pass@4 +10.7 pts; Pass^4 3.5×.
  • Efficiency: −35% tokens; −33% cost.
  • Latency: Strata MCP is faster here (−24% time), indicating less rework and better first‑try formatting compliance.

Efficiency and cost

Across both task families, Strata MCP used fewer tokens and cost less per run:

  • GitHub: −167,867 tokens (−23.9%), −$10.06 (−20.3%) per run
  • Notion: −226,405 tokens (−34.8%), −$18.27 (−32.6%) per run
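
These per-run deltas follow directly from the token and cost columns in the tables above; a quick arithmetic check:

```python
# Quick check of the per-run deltas quoted above, using the figures from the tables.
strata   = {"github": {"tokens": 533_385, "cost": 39.55}, "notion": {"tokens": 424_474, "cost": 37.83}}
official = {"github": {"tokens": 701_252, "cost": 49.61}, "notion": {"tokens": 650_879, "cost": 56.10}}

for suite in ("github", "notion"):
    dt = official[suite]["tokens"] - strata[suite]["tokens"]
    dc = official[suite]["cost"] - strata[suite]["cost"]
    print(f"{suite}: -{dt:,} tokens ({dt / official[suite]['tokens']:.1%}), "
          f"-${dc:.2f} ({dc / official[suite]['cost']:.1%}) per run")
```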

Interpretation: Strata MCP’s orchestration encourages deliberate reasoning and structured tool use. Although it often increases conversational turns, it reduces retries, over‑generation, and failed verifications—lowering token consumption and cost.

Reliability across retries

Pass^4 (all four attempts succeed) is a strong indicator of reliability in production:

  • Notion: 25.00% vs 7.14% (3.5×)
  • GitHub: 21.74% vs 8.70% (2.5×)

Higher Pass^4 means fewer flaky runs and more predictable automation when tasks must succeed consistently.

Why Strata MCP performs better

  • Lean context, fewer tokens
    • We never dump full tool descriptions into the prompt. Strata reveals only what’s needed at each step: service → category → action name/summary → full schema at execution.
    • Integration-aware preloading limits discovery to tools a user actually has enabled, removing irrelevant descriptions from the context.
    • Error-handling prompts are designed to keep the model from getting stuck in loops of repeated, failing tool calls.
  • Precise tool targeting
    • Structured narrowing—service selection, then category shortlisting, then action choice—shrinks the decision surface progressively, making it easier for the model to lock onto the exact tool (see the sketch after this list).
    • Strata shows concise action descriptions first, then provides the full parameter schema only for the chosen action, reducing confusion between similarly named APIs.
    • Just‑in‑time recovery via search_documentation (BM25, pre‑cached indices) supplies missing details when needed, helping the model resolve ambiguity instead of guessing.
  • Full coverage at scale
    • No artificial 40–50 tool cap: Strata scales to thousands by gating discovery and schema exposure, preserving accuracy as coverage grows.
    • A consistent, discovery‑driven interface normalizes disparate MCP servers (official or custom), enabling multi‑app workflows without overwhelming the model.
    • handle_auth_failure automates OAuth/API key flows so more integrations are actually usable, turning “available” tools into reliable actions.

Net effect: fewer tokens sent, fewer misfires in tool choice, and higher first‑try success—especially on multi‑step GitHub and Notion tasks where official servers rely on flat, schema‑heavy tool lists.

Conclusion and next steps

Strata MCP delivers higher success at lower cost on realistic GitHub and Notion automations. If you’re looking to ship dependable, schema‑correct tool use with predictable spend, Strata MCP is a strong default.
