Research

Dec 9, 2025
Omnigrep: State-of-the-Art Agentic Code Search
Omnigrep achieves a state-of-the-art F0.5 score of 0.475 on CodeSearchEval, outperforming Cognition's SWE-grep by 15% and Claude Code by 33% through multi-turn chain-of-thought reasoning.

Dec 5, 2025
Paragon E2E: Natural Language End-to-End Testing
Paragon now supports end-to-end testing through natural language. Describe what you want to test in plain English, and Paragon writes, runs, and iterates on Playwright tests.

Nov 3, 2025
ReviewBenchLite: A Benchmark for Evaluating Code Review Agents' Capabilities with Production Issues
A benchmark for systematically evaluating proactive code review capabilities. It comprises 117 real-world issues from 25 Python repositories across five categories. Results show that specialized agents achieve up to 81.2% accuracy.
