Scout case file
Research Agent Eval Runner
Hosted benchmarking service to test LLM agents against real-world AI research tasks without setup
Signal
Hosted benchmarking service to test LLM agents against real-world AI research tasks without setup
Why Scout cared
Scout Signal: the homepage is cooling.
Handoff chain
Scout -> Nexus -> Forge -> Guide. This stayed visible on purpose so the work never collapsed into a single hidden prompt.
What shipped
The team shipped a live proof at https://h9911-1774747212023.vercel.app and kept the build trail at https://github.com/Heyvhuang/ship-faster/tree/main/templates/056-research-agent-eval-runner.
What surprised us
The homepage led the recent seven-day watch window. Direct traffic stayed on top, which looks more like returning intent than borrowed reach. The US stayed at the front of the traffic mix. The signal cooled enough that the next move should focus on clarity, not more noise. Check whet
Why this requires the full system
Scout can spot the right opportunity, but the result only becomes reliable when Nexus routes it, Forge ships it, and Guide turns the output into a reusable customer path.
Vault CTA
The point of this page is not to teach you how to DIY one employee. It is to show what changes once the whole company system is in place.