Key takeaways
- Strata MCP achieved higher success rates than the official GitHub and Notion MCP servers on Mcpmark’s real‑world tasks.
- On Notion (28 tasks), Strata MCP improved pass@1 by +13.4 pts (34.8% vs 21.4%) and cut cost per run by 32.6%.
- On GitHub (23 tasks), Strata MCP improved pass@1 by +15.2 pts (31.5% vs 16.3%) at 20.3% lower cost per run.
- Reliability gains were strongest on “all-four-runs succeed” (Pass^4): 3.5× higher on Notion and 2.5× higher on GitHub.
- Strata MCP consistently used fewer tokens (−24% to −35%), trading more agentic steps for better final accuracy.
Executive summary
We evaluated Strata MCP against the official GitHub and Notion MCP servers using the public Mcpmark benchmark. The benchmark comprises hand-designed, end‑to‑end tasks that require the model to reason, call MCP tools, modify external systems (GitHub/Notion), and pass automated verification.
Using the same model (claude-sonnet-4) and identical prompts, Strata MCP delivered higher success rates with lower token usage and lower cost across both task families:
- GitHub (23 tasks): pass@1 31.5% vs 16.3% (+93% relative), pass@4 39.10% vs 30.40%, Pass^4 21.74% vs 8.70% (2.5×); −24% tokens; −20% cost.
- Notion (28 tasks): pass@1 34.8% vs 21.4% (+63% relative), pass@4 50.00% vs 39.30%, Pass^4 25.00% vs 7.14% (3.5×); −35% tokens; −33% cost.
These gains come from Strata MCP’s design, which greatly reduces tool context while increasing tool coverage.
Benchmark design and task examples
We used the Mcpmark benchmark suite across two real‑world integrations:
GitHub tasks (23 tasks)
These tasks focus on configuration‑as‑code, repository hygiene, and delivery workflows. They commonly require:
- Authoring or modifying GitHub Actions workflows (YAML)
- Interacting with commit history and tags
- Implementing policies for linting, testing, versioning, and releases
- Producing human‑readable artifacts (e.g., changelogs) from repository metadata
Representative skills measured:
- YAML correctness, job orchestration, and event scoping
- Safe use of marketplace actions (pinning versions, least privilege)
- Semantic versioning (SemVer) and release discipline
- Scripting for repository queries (e.g., commit metadata)
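For instance, the release-discipline tasks require correct SemVer ordering, which might be checked as below (a minimal sketch that handles only the MAJOR.MINOR.PATCH core, not pre-release or build metadata; the tags are invented):

```python
def semver_key(tag):
    # "v1.2.10" -> (1, 2, 10); numeric comparison avoids the string-sort
    # trap where "1.10.0" sorts before "1.9.0" lexicographically.
    return tuple(int(part) for part in tag.lstrip("v").split("."))

tags = ["v1.9.0", "v1.10.0", "v0.3.2"]
latest = max(tags, key=semver_key)  # "v1.10.0", not "v1.9.0"
```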
Notion tasks (28 tasks)
These tasks assess information design and workspace operations in Notion, including:
- Editing and structuring content
- Designing and refactoring databases (properties, relations, rollups)
- Using views, filters, grouping, and formula logic
- Summarization and planning workflows for everyday productivity
Representative skills measured:
- Translating goals into workable page/database structures
- Choosing appropriate property types and formulas
- Building understandable, maintainable views and summaries
- Applying consistent styles and conventions
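Choosing property types, for example, amounts to assembling a schema payload in the shape of the Notion API’s database `properties` object (a sketch; the property names here are invented for illustration):

```python
# Illustrative database schema: each property maps its name to a
# type-keyed configuration object, as the Notion API expects.
task_db_properties = {
    "Name": {"title": {}},  # every Notion database needs exactly one title property
    "Status": {"select": {"options": [
        {"name": "Todo"}, {"name": "Doing"}, {"name": "Done"},
    ]}},
    "Due": {"date": {}},
    "Effort (h)": {"number": {"format": "number"}},
}

def property_types(properties):
    """Map each property name to its configured type key."""
    return {name: next(iter(cfg)) for name, cfg in properties.items()}
```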
Evaluation protocol and metrics
How success is determined
- The model receives the task prompt and calls the MCP server’s tools (GitHub or Notion).
- The MCP server performs the requested modifications.
- An automated checker validates the final state (page structure, counts, links, commit file contents, and repository history).
- A task counts as “success” if it passes verification.
Metrics reported
- pass@1 (avg ± std): Average single‑run success rate across tasks, with per‑task standard deviation.
- pass@4: Fraction of tasks with at least one success across four independent runs (empirically measured).
- Pass^4: Fraction of tasks for which all four runs succeed (empirically measured).
- Efficiency: average tokens, turns, wall‑clock time per task, and estimated cost per run.
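As a concrete illustration, the three accuracy metrics can be computed from a per-task run matrix like this (a sketch; the task data below is made up):

```python
# Each task maps to its four run outcomes (True = passed verification).
runs = {
    "task_a": [True, False, True, True],
    "task_b": [False, False, False, False],
    "task_c": [True, True, True, True],
}

def pass_at_1(runs):
    # Average single-run success rate: mean over all individual runs.
    outcomes = [r for task in runs.values() for r in task]
    return sum(outcomes) / len(outcomes)

def pass_at_4(runs):
    # Fraction of tasks with at least one success across the four runs.
    return sum(any(task) for task in runs.values()) / len(runs)

def pass_hat_4(runs):
    # Fraction of tasks where all four runs succeed.
    return sum(all(task) for task in runs.values()) / len(runs)
```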
Experimental setup
- Model: claude-sonnet-4
- MCP servers:
- Strata MCP (Klavis AI)
- GitHub Official MCP Server
- Notion Official MCP Server
- Tasks: 23 GitHub tasks, 28 Notion tasks (hand‑designed by Mcpmark)
- Repetitions: 4 runs per task, per MCP
- Success criteria: Automated verification of final GitHub/Notion state
Results
GitHub tasks (23)
| MCP | Model | pass@1 (avg ± std) | pass@4 | Pass^4 | Avg Tokens | Turns | Avg Time | Cost/run |
|---|---|---|---|---|---|---|---|---|
| Klavis AI Strata MCP Server | claude-sonnet-4-20250514 | 31.5 ± 3.6% | 39.10% | 21.74% | 533,385 | 21.7 | 358.3s | $39.55 |
| GitHub Official MCP Server | claude-sonnet-4-20250514 | 16.3 ± 5.7% | 30.40% | 8.70% | 701,252 | 11.2 | 196.5s | $49.61 |
Highlights
- Accuracy: +15.2 pts pass@1 (+93% relative); pass@4 +8.7 pts; Pass^4 2.5×.
- Efficiency: −24% tokens; −20% cost.
- Latency: Strata MCP takes longer (+82%), reflecting more agentic steps to ensure correctness.
Notion tasks (28)
| MCP | Model | pass@1 (avg ± std) | pass@4 | Pass^4 | Avg Tokens | Turns | Avg Time | Cost/run |
|---|---|---|---|---|---|---|---|---|
| Klavis AI Strata MCP Server | claude-sonnet-4-20250514 | 34.8 ± 6.4% | 50.00% | 25.00% | 424,474 | 24.3 | 147.6s | $37.83 |
| Notion Official MCP Server | claude-sonnet-4-20250514 | 21.4 ± 5.1% | 39.30% | 7.14% | 650,879 | 19.7 | 193.2s | $56.10 |
Highlights
- Accuracy: +13.4 pts pass@1 (+63% relative); pass@4 +10.7 pts; Pass^4 3.5×.
- Efficiency: −35% tokens; −33% cost.
- Latency: Strata MCP is faster here (−24% time), indicating less rework and better first‑try formatting compliance.
Efficiency and cost
Across both task families, Strata MCP used fewer tokens and cost less per run:
- GitHub: −167,867 tokens (−23.9%), −$10.06 (−20.3%) per run
- Notion: −226,405 tokens (−34.8%), −$18.27 (−32.6%) per run
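These per-run savings fall directly out of the result tables; a quick arithmetic check:

```python
def savings(official, strata):
    """Absolute and relative reduction of Strata vs the official server."""
    return official - strata, (official - strata) / official

github_tokens = savings(701_252, 533_385)  # -> (167867, ~23.9%)
github_cost = savings(49.61, 39.55)        # -> (~$10.06, ~20.3%)
notion_tokens = savings(650_879, 424_474)  # -> (226405, ~34.8%)
notion_cost = savings(56.10, 37.83)        # -> (~$18.27, ~32.6%)
```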
Interpretation: Strata MCP’s orchestration encourages deliberate reasoning and structured tool use. Although it often increases conversational turns, it reduces retries, over‑generation, and failed verifications—lowering token consumption and cost.
Reliability across retries
Pass^4 (all four attempts succeed) is a strong indicator of reliability in production:
- Notion: 25.00% vs 7.14% (3.5×)
- GitHub: 21.74% vs 8.70% (2.5×)
Higher Pass^4 means fewer flaky runs and more predictable automation when tasks must succeed consistently.
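A useful way to read these figures: if each task’s four runs were independent at the overall pass@1 rate p, all four would succeed with probability p⁴, which is far below what was observed. A quick check (numbers taken from the tables above):

```python
# If runs were independent at rate p, Pass^4 would be p**4.
observed = {"Notion": (0.348, 0.2500), "GitHub": (0.315, 0.2174)}  # (pass@1, Pass^4)
iid_prediction = {name: p1 ** 4 for name, (p1, _) in observed.items()}
# Notion: ~1.5% predicted vs 25.0% observed; GitHub: ~1.0% vs 21.7%.
# The large gap shows success concentrates on tasks that pass consistently,
# rather than being a fresh coin flip on every run.
```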
Why Strata MCP performs better
- Lean context, fewer tokens
- We never dump full tool descriptions into the prompt. Strata reveals only what’s needed at each step: service → category → action name/summary → full schema at execution.
- Integration-aware preloading limits discovery to tools a user actually has enabled, removing irrelevant descriptions from the context.
- We design the error-handling prompts to keep the model from getting stuck in loops of repeated tool calls.
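The progressive-disclosure idea can be sketched as follows (all names here are hypothetical illustrations, not Strata’s actual API):

```python
# Hypothetical catalog: full parameter schemas stay out of the prompt
# until a single action has been chosen.
CATALOG = {
    "github": {
        "actions": {
            "create_workflow": {
                "summary": "Create a GitHub Actions workflow file",
                "schema": {"path": "string", "content": "string"},
            },
            "list_tags": {
                "summary": "List repository tags",
                "schema": {"repo": "string"},
            },
        },
    },
}

def discover(service=None, category=None, action=None):
    """Reveal one level of detail per call instead of dumping everything."""
    if service is None:
        return list(CATALOG)                       # step 1: services
    if category is None:
        return list(CATALOG[service])              # step 2: categories
    if action is None:                             # step 3: names + summaries
        return {n: a["summary"] for n, a in CATALOG[service][category].items()}
    return CATALOG[service][category][action]["schema"]  # step 4: full schema
```

Each call adds only one layer of context, so the model never sees a schema for a tool it did not select.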
- Precise tool targeting
- Structured narrowing—service selection, then category shortlisting, then action choice—shrinks the decision surface progressively, making it easier for the model to lock onto the exact tool.
- Strata shows concise action descriptions first, then provides the full parameter schema only for the chosen action, reducing confusion between similarly named APIs.
- Just‑in‑time recovery via search_documentation (BM25, pre‑cached indices) supplies missing details when needed, helping the model resolve ambiguity instead of guessing.
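The BM25 retrieval step can be sketched in a few lines (a toy scorer over an invented corpus; Strata’s actual index is pre-cached and more complete):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc (a list of tokens) against the query tokens."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    "create pull request with title and body".split(),
    "list repository tags and releases".split(),
]
scores = bm25_scores("list tags".split(), docs)
# The second doc, which contains both query terms, scores highest.
```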
- Full coverage at scale
- No artificial 40–50 tool cap: Strata scales to thousands by gating discovery and schema exposure, preserving accuracy as coverage grows.
- A consistent, discovery‑driven interface normalizes disparate MCP servers (official or custom), enabling multi‑app workflows without overwhelming the model.
- handle_auth_failure automates OAuth/API key flows so more integrations are actually usable, turning “available” tools into reliable actions.
Net effect: fewer tokens sent, fewer misfires in tool choice, and higher first‑try success—especially on multi‑step GitHub and Notion tasks where official servers rely on flat, schema‑heavy tool lists.
Reproducibility
- Benchmark: Mcpmark task suite (GitHub and Notion families)
- Model: claude-sonnet-4-20250514
- Runs: 4 per task, per MCP server
- Success criteria: Automated verifiers provided by the benchmark
- How to reproduce:
- Set up Strata MCP: http://docs.klavis.ai/documentation/quickstart#multi-app-integration
- Set up the Mcpmark benchmark: https://mcpmark.ai/
- Replace the official MCP server with Strata MCP following the Mcpmark guide
Conclusion and next steps
Strata MCP delivers higher success at lower cost on realistic GitHub and Notion automations. If you’re looking to ship dependable, schema‑correct tool use with predictable spend, Strata MCP is a strong default.
- Try Strata MCP now: sign up
- Talk to us about your workflows: contact us
References
- MCP (Model Context Protocol) specification — Model Context Protocol docs: https://modelcontextprotocol.io/
- Mcpmark benchmark: https://mcpmark.ai/
- Notion Official MCP: https://developers.notion.com/docs/mcp
- GitHub Official MCP: https://github.com/github/github-mcp-server
- Claude model overview: https://docs.claude.com/en/docs/about-claude/models/overview