Strata MCP vs Official MCPs: A Real‑World Benchmark on Notion and GitHub



Key takeaways

  • Strata MCP achieved higher success rates than the official GitHub and Notion MCP servers on Mcpmark’s real‑world tasks.
  • On Notion (28 tasks), Strata MCP improved pass@1 by +13.4 pts (34.8% vs 21.4%) and cut cost per run by 32.6%.
  • On GitHub (23 tasks), Strata MCP improved pass@1 by +15.2 pts (31.5% vs 16.3%) at 20.3% lower cost per run.
  • Reliability gains were strongest on “all-four-runs succeed” (Pass^4): 3.5× higher on Notion and 2.5× higher on GitHub.
  • Strata MCP consistently used fewer tokens (−24% to −35%), trading more agentic steps for better final accuracy.

Executive summary

We evaluated Strata MCP against the official GitHub and Notion MCP servers using the public Mcpmark benchmark. The benchmark comprises hand-designed, end‑to‑end tasks that require the model to reason, call MCP tools, modify external systems (GitHub/Notion), and pass automated verification.

Using the same model (claude-sonnet-4) and identical prompts, Strata MCP delivered higher success rates at lower token usage and lower cost across both task families:

  • GitHub (23 tasks): pass@1 31.5% vs 16.3% (+93% relative), pass@4 39.10% vs 30.40%, Pass^4 21.74% vs 8.70% (2.5×); −24% tokens; −20% cost.
  • Notion (28 tasks): pass@1 34.8% vs 21.4% (+63% relative), pass@4 50.00% vs 39.30%, Pass^4 25.00% vs 7.14% (3.5×); −35% tokens; −33% cost.

These gains come from Strata MCP’s design, which greatly reduces tool context while increasing tool coverage.

Benchmark design and task examples

We used the Mcpmark benchmark suite across two real‑world integrations:

GitHub tasks (23 tasks)

These tasks focus on configuration‑as‑code, repository hygiene, and delivery workflows. They commonly require:

  • Authoring or modifying GitHub Actions workflows (YAML)
  • Interacting with commit history and tags
  • Implementing policies for linting, testing, versioning, and releases
  • Producing human‑readable artifacts (e.g., changelogs) from repository metadata

Representative skills measured:

  • YAML correctness, job orchestration, and event scoping
  • Safe use of marketplace actions (pinning versions, least privilege)
  • Semantic versioning (SemVer) and release discipline
  • Scripting for repository queries (e.g., commit metadata)
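
To make the last item concrete, here is a minimal sketch of the kind of repository-query scripting these tasks require: pulling commit metadata via the GitHub REST API and drafting changelog entries. The owner/repo names and the GITHUB_TOKEN environment variable are illustrative placeholders, not part of the Mcpmark harness.

```python
# Hypothetical sketch: pull recent commit metadata and draft changelog entries.
# OWNER/REPO and GITHUB_TOKEN are placeholders, not benchmark repositories.
import os
import requests

OWNER, REPO = "example-org", "example-repo"

def recent_commits(limit: int = 20) -> list[dict]:
    """Fetch the most recent commits from the default branch via the GitHub REST API."""
    resp = requests.get(
        f"https://api.github.com/repos/{OWNER}/{REPO}/commits",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        params={"per_page": limit},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

def changelog_lines(commits: list[dict]) -> list[str]:
    """Render one-line changelog entries: date, short SHA, first line of the message."""
    lines = []
    for c in commits:
        sha = c["sha"][:7]
        date = c["commit"]["author"]["date"][:10]
        subject = c["commit"]["message"].splitlines()[0]
        lines.append(f"- {date} {sha} {subject}")
    return lines

if __name__ == "__main__":
    print("\n".join(changelog_lines(recent_commits())))
```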

Notion tasks (28 tasks)

These tasks assess information design and workspace operations in Notion, including:

  • Editing and structuring content
  • Designing and refactoring databases (properties, relations, rollups)
  • Using views, filters, grouping, and formula logic
  • Summarization and planning workflows for everyday productivity

Representative skills measured:

  • Translating goals into workable page/database structures
  • Choosing appropriate property types and formulas (see the sketch after this list)
  • Building understandable, maintainable views and summaries
  • Applying consistent styles and conventions
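
For a flavor of the database-design work these tasks involve, here is a minimal sketch that creates a small task-tracker database through the public Notion REST API. The token, parent page ID, and property choices are illustrative placeholders, not an actual benchmark task.

```python
# Hypothetical sketch: create a small task-tracker database via the Notion REST API.
# NOTION_TOKEN and PARENT_PAGE_ID are placeholders; the properties mirror the skills above.
import os
import requests

PARENT_PAGE_ID = "00000000-0000-0000-0000-000000000000"  # placeholder parent page

payload = {
    "parent": {"type": "page_id", "page_id": PARENT_PAGE_ID},
    "title": [{"type": "text", "text": {"content": "Weekly Tasks"}}],
    "properties": {
        "Name": {"title": {}},  # every database needs one title property
        "Status": {"select": {"options": [
            {"name": "Todo"}, {"name": "In progress"}, {"name": "Done"},
        ]}},
        "Due": {"date": {}},
        "Effort (h)": {"number": {"format": "number"}},
    },
}

resp = requests.post(
    "https://api.notion.com/v1/databases",
    headers={
        "Authorization": f"Bearer {os.environ['NOTION_TOKEN']}",
        "Notion-Version": "2022-06-28",
        "Content-Type": "application/json",
    },
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print("Created database:", resp.json()["id"])
```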

Evaluation protocol and metrics

How success is determined

  • The model receives the task prompt and calls the MCP server’s tools (GitHub or Notion).
  • The MCP server executes the requested modifications against the external system.
  • An automated checker validates the final state (page structure, counts, links, commit file contents, and repository history); a toy example of this pattern follows the list below.
  • A task counts as “success” if it passes verification.
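
Mcpmark ships its own verifiers; purely to illustrate the pattern, a toy checker for a GitHub-style task might look like the following. The repository, file path, and assertions are hypothetical.

```python
# Illustrative only: a toy verifier in the spirit of Mcpmark's automated checks
# (the real benchmark ships its own). Repo, path, and assertions are hypothetical.
import base64
import os
import requests

def fetch_file(owner: str, repo: str, path: str) -> str:
    """Read a file's contents from the default branch via the GitHub REST API."""
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/contents/{path}",
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
        timeout=30,
    )
    resp.raise_for_status()
    return base64.b64decode(resp.json()["content"]).decode()

def verify(owner: str, repo: str) -> bool:
    """Pass only if the final repository state matches the (hypothetical) task spec."""
    try:
        workflow = fetch_file(owner, repo, ".github/workflows/ci.yml")
    except requests.HTTPError:
        return False  # required artifact missing
    return "pull_request" in workflow and "permissions:" in workflow

print("success" if verify("example-org", "example-repo") else "failure")
```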

Metrics reported

  • pass@1 (avg ± std): Average single‑run success rate across tasks, with per‑task standard deviation.
  • pass@4: Fraction of tasks with at least one success across four independent runs (empirically measured).
  • Pass^4: Fraction of tasks for which all four runs succeed (empirically measured); a small sketch after this list shows how the three pass metrics are computed.
  • Efficiency: average tokens, turns, wall‑clock time per task, and estimated cost per run.
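
Concretely, assuming each task’s four run outcomes are recorded as booleans, the three pass metrics reduce to the following. This is a minimal sketch mirroring the definitions above, not Mcpmark’s own code.

```python
# Minimal sketch of the headline metrics, assuming `runs` maps each task
# to its four boolean outcomes (illustrative data, not benchmark results).
from statistics import mean

runs: dict[str, list[bool]] = {
    "task_a": [True, False, True, True],
    "task_b": [False, False, False, True],
    # ... one entry per benchmark task
}

pass_at_1 = mean(sum(r) / len(r) for r in runs.values())        # average single-run success rate
pass_at_4 = sum(any(r) for r in runs.values()) / len(runs)      # at least one of four runs succeeds
pass_pow_4 = sum(all(r) for r in runs.values()) / len(runs)     # all four runs succeed

print(f"pass@1={pass_at_1:.1%}  pass@4={pass_at_4:.1%}  Pass^4={pass_pow_4:.1%}")
```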

Experimental setup

  • Model: claude-sonnet-4
  • MCP servers:
    • Strata MCP (Klavis AI)
    • GitHub Official MCP Server
    • Notion Official MCP Server
  • Tasks: 23 GitHub tasks, 28 Notion tasks (hand‑designed by Mcpmark)
  • Repetitions: 4 runs per task, per MCP
  • Success criteria: Automated verification of final GitHub/Notion state

Results

GitHub tasks (23)

| MCP | Model | pass@1 (avg ± std) | pass@4 | Pass^4 | Avg Tokens | Turns | Avg Time | Cost/run |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Klavis AI Strata MCP Server | claude-sonnet-4-20250514 | 31.5 ± 3.6% | 39.10% | 21.74% | 533,385 | 21.7 | 358.3s | $39.55 |
| GitHub Official MCP Server | claude-sonnet-4-20250514 | 16.3 ± 5.7% | 30.40% | 8.70% | 701,252 | 11.2 | 196.5s | $49.61 |

Highlights

  • Accuracy: +15.2 pts pass@1 (+93% relative); pass@4 +8.7 pts; Pass^4 2.5×.
  • Efficiency: −24% tokens; −20% cost.
  • Latency: Strata MCP takes longer (+82%), reflecting more agentic steps to ensure correctness.

Notion tasks (28)

| MCP | Model | pass@1 (avg ± std) | pass@4 | Pass^4 | Avg Tokens | Turns | Avg Time | Cost/run |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Klavis AI Strata MCP Server | claude-sonnet-4-20250514 | 34.8 ± 6.4% | 50.00% | 25.00% | 424,474 | 24.3 | 147.6s | $37.83 |
| Notion Official MCP Server | claude-sonnet-4-20250514 | 21.4 ± 5.1% | 39.30% | 7.14% | 650,879 | 19.7 | 193.2s | $56.10 |

Highlights

  • Accuracy: +13.4 pts pass@1 (+63% relative); pass@4 +10.7 pts; Pass^4 3.5×.
  • Efficiency: −35% tokens; −33% cost.
  • Latency: Strata MCP is faster here (−24% time), indicating less rework and better first‑try formatting compliance.

Efficiency and cost

Across both task families, Strata MCP used fewer tokens and cost less per run:

  • GitHub: −167,867 tokens (−23.9%), −$10.06 (−20.3%) per run
  • Notion: −226,405 tokens (−34.8%), −$18.27 (−32.6%) per run
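
These per-run deltas follow directly from the token and cost columns in the tables above; a quick arithmetic check:

```python
# Quick check of the per-run deltas quoted above, using the figures from the tables.
strata   = {"github": {"tokens": 533_385, "cost": 39.55}, "notion": {"tokens": 424_474, "cost": 37.83}}
official = {"github": {"tokens": 701_252, "cost": 49.61}, "notion": {"tokens": 650_879, "cost": 56.10}}

for suite in ("github", "notion"):
    dt = official[suite]["tokens"] - strata[suite]["tokens"]
    dc = official[suite]["cost"] - strata[suite]["cost"]
    print(f"{suite}: -{dt:,} tokens ({dt / official[suite]['tokens']:.1%}), "
          f"-${dc:.2f} ({dc / official[suite]['cost']:.1%}) per run")
```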

Interpretation: Strata MCP’s orchestration encourages deliberate reasoning and structured tool use. Although it often increases conversational turns, it reduces retries, over‑generation, and failed verifications—lowering token consumption and cost.

Reliability across retries

Pass^4 (all four attempts succeed) is a strong indicator of reliability in production:

  • Notion: 25.00% vs 7.14% (3.5×)
  • GitHub: 21.74% vs 8.70% (2.5×)

Higher Pass^4 means fewer flaky runs and more predictable automation when tasks must succeed consistently.

Why Strata MCP performs better

  • Lean context, fewer tokens
    • We never dump full tool descriptions into the prompt. Strata reveals only what’s needed at each step: service → category → action name/summary → full schema at execution.
    • Integration-aware preloading limits discovery to tools a user actually has enabled, removing irrelevant descriptions from the context.
    • Error-handling prompts are designed to keep the model from getting stuck in loops of repeated, failing tool calls.
  • Precise tool targeting
    • Structured narrowing—service selection, then category shortlisting, then action choice—shrinks the decision surface progressively, making it easier for the model to lock onto the exact tool (see the sketch after this list).
    • Strata shows concise action descriptions first, then provides the full parameter schema only for the chosen action, reducing confusion between similarly named APIs.
    • Just‑in‑time recovery via search_documentation (BM25, pre‑cached indices) supplies missing details when needed, helping the model resolve ambiguity instead of guessing.
  • Full coverage at scale
    • No artificial 40–50 tool cap: Strata scales to thousands by gating discovery and schema exposure, preserving accuracy as coverage grows.
    • A consistent, discovery‑driven interface normalizes disparate MCP servers (official or custom), enabling multi‑app workflows without overwhelming the model.
    • handle_auth_failure automates OAuth/API key flows so more integrations are actually usable, turning “available” tools into reliable actions.

Net effect: fewer tokens sent, fewer misfires in tool choice, and higher first‑try success—especially on multi‑step GitHub and Notion tasks where official servers rely on flat, schema‑heavy tool lists.

Conclusion and next steps

Strata MCP delivers higher success at lower cost on realistic GitHub and Notion automations. If you’re looking to ship dependable, schema‑correct tool use with predictable spend, Strata MCP is a strong default.
