Microsoft has pushed the Microsoft 365 Copilot Agent Evaluations tool into public preview, and it's worth paying attention to even if you're not currently building agents. The tool sends prompts to a deployed agent, captures the responses, and scores them using Azure OpenAI as a judge — producing structured reports that fit into a developer's inner loop or a CI/CD pipeline.
In other words: agent quality is becoming testable, repeatable, and part of the build process. That's a meaningful shift.
Why this matters now
For the past couple of years, "build a Copilot agent" has largely meant wiring something up, eyeballing a few responses, and shipping. That works fine for demos. It doesn't work when the agent sits inside a real business workflow — answering customer questions, summarising contracts, triaging tickets, taking actions across systems.
Microsoft's framing in the announcement is pretty direct: as agents move from demos into core workflows, the bar rises with them. Customers expect responses that are accurate, grounded, and consistent. Manual testing doesn't scale to that bar. You need objective, repeatable evaluation, and ideally something that lives where developers already work.
That's what the Agent Evaluations CLI is trying to be.
What's actually in the preview
The tool is a command-line interface that plugs into the Microsoft 365 Agents Toolkit, so developers can evaluate declarative agents from the same environment they're building them in. A few things stand out from the announcement.
It handles both single-turn and multi-turn conversations. That second part matters — agents that handle follow-ups, retain context, and complete end-to-end tasks behave very differently from ones that answer one prompt at a time. Testing only the first turn gives you a flattering but misleading picture.
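To make that concrete, here's a minimal TypeScript sketch of the difference. The interfaces and the ticket-triage case are hypothetical, not the CLI's actual test-case schema; the point is simply that the second turn is only answerable if the agent retained the first.

```typescript
// Hypothetical shapes, not the tool's actual test-case format. A multi-turn
// case carries conversational state that a single prompt never exercises.
interface SingleTurnCase {
  prompt: string;
  expected: string;
}

interface MultiTurnCase {
  turns: {
    prompt: string;   // may refer back to earlier turns ("the second one")
    expected: string; // only correct if the agent retained that context
  }[];
}

// The follow-up here only makes sense if the agent remembers turn one.
const ticketTriage: MultiTurnCase = {
  turns: [
    {
      prompt: "List my three oldest open tickets.",
      expected: "TICKET-101, TICKET-114, TICKET-120",
    },
    {
      prompt: "Close the second one and tell me why it was raised.",
      expected: "TICKET-114 closed; originally raised for a password reset",
    },
  ],
};
```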
Responses are scored against a mix of evaluators. Some are LLM-based, like Coherence and Groundedness — Azure OpenAI plays judge and rates the output on a 1-to-5 scale. Others are deterministic and code-based, like ExactMatch and PartialMatch, which are useful when you know exactly what the right answer should look like. Combining both is the right shape for a real evaluation suite.
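To see why the deterministic checks earn their place next to an LLM judge, here's a rough TypeScript sketch of what exact and partial matching amount to. It's an illustration of the idea only, not the tool's actual ExactMatch or PartialMatch implementation; the function names and scoring are assumptions.

```typescript
// Illustrative stand-ins for deterministic, code-based evaluators.

// Exact match: the response must equal the expected answer
// (after trimming whitespace and folding case).
function exactMatch(response: string, expected: string): number {
  return response.trim().toLowerCase() === expected.trim().toLowerCase() ? 1 : 0;
}

// Partial match: score by how many expected fragments appear in the response.
function partialMatch(response: string, expectedFragments: string[]): number {
  if (expectedFragments.length === 0) return 0;
  const found = expectedFragments.filter((fragment) =>
    response.toLowerCase().includes(fragment.toLowerCase())
  ).length;
  return found / expectedFragments.length;
}

// Example: a policy-lookup agent should mention both the retention period and the owner.
console.log(
  partialMatch("Invoices are retained for 7 years by the finance team.", ["7 years", "finance"])
); // 1
console.log(exactMatch("7 years", "7 years")); // 1
```

The value of checks like these is that they never drift: a judge model can change its mind between runs, but a string comparison can't.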
Output comes as a sharable HTML scorecard (with JSON and CSV also available, per the Learn documentation). Microsoft pitches the scorecard as objective evidence of agent quality — something you can drop into a code review or attach to a release record. For regulated industries especially, that kind of paper trail is more than a nice-to-have.
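If you want that evidence to gate a pipeline rather than sit in a folder, something like the following could run after the evaluation step. The file name, JSON shape and the 4.0 threshold are all assumptions made for illustration, not the documented output schema, so adapt it to whatever the tool actually emits.

```typescript
// Sketch of a CI quality gate over a JSON scorecard.
// The file name and result shape below are assumed, not documented.
import { readFileSync } from "node:fs";

interface EvaluatorResult {
  evaluator: string; // e.g. "Groundedness"
  score: number;     // assumed 1-5 for LLM-judged evaluators
}

const results: EvaluatorResult[] = JSON.parse(
  readFileSync("eval-results.json", "utf8")
).results;

const groundedness = results.filter((r) => r.evaluator === "Groundedness");
if (groundedness.length === 0) {
  console.error("No Groundedness results found.");
  process.exit(1);
}

const average =
  groundedness.reduce((sum, r) => sum + r.score, 0) / groundedness.length;

if (average < 4.0) {
  console.error(`Groundedness averaged ${average.toFixed(2)}, below the gate.`);
  process.exit(1); // fail the pipeline step
}
```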
There's also an interactive agent picker, so testing teams (not just developers) can run evaluations. The evaluation skill is accessible from coding agents like Claude Code and Copilot as well, via the Microsoft 365 Agents Toolkit's Work IQ skill.
What you need to try it
The tool itself is free to install during public preview. To actually run it, you'll need a Microsoft 365 Copilot license, an agent already deployed in your tenant, Node.js 24.12.0 or higher, admin consent for the tool in your tenant, and an Azure OpenAI endpoint to power the LLM-judge evaluators.
That last requirement is worth thinking through. Azure OpenAI usage is billed through your Azure subscription on a token basis, so while the CLI itself is free, every evaluation run costs real money. If you're planning to wire this into a CI pipeline that runs on every commit, it's worth modelling that cost early: repeated multi-turn evaluations against a frontier model add up. Note that the default model in the CLI's environment variables is gpt-4o-mini, one of the cheaper options, but you can point it at whatever deployment you prefer.
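One way to do that modelling is a quick back-of-the-envelope script before you commit to a pipeline design. Everything numeric in this sketch is a placeholder, including the per-token prices, which are not current Azure OpenAI rates; substitute the real figures for whichever deployment you point the CLI at.

```typescript
// Back-of-the-envelope cost model for running evaluations in CI.
// The per-token prices are placeholders, not actual Azure OpenAI pricing.
const USD_PER_1K_INPUT_TOKENS = 0.00015;  // placeholder
const USD_PER_1K_OUTPUT_TOKENS = 0.0006;  // placeholder

function estimateRunCost(opts: {
  testCases: number;
  turnsPerCase: number;
  judgeInputTokensPerTurn: number;   // prompt + agent response + rubric sent to the judge
  judgeOutputTokensPerTurn: number;  // the judge's rating and rationale
}): number {
  const turns = opts.testCases * opts.turnsPerCase;
  const inputCost =
    ((turns * opts.judgeInputTokensPerTurn) / 1000) * USD_PER_1K_INPUT_TOKENS;
  const outputCost =
    ((turns * opts.judgeOutputTokensPerTurn) / 1000) * USD_PER_1K_OUTPUT_TOKENS;
  return inputCost + outputCost;
}

// 200 multi-turn cases on every commit, 50 commits a day, adds up quickly.
const perRun = estimateRunCost({
  testCases: 200,
  turnsPerCase: 4,
  judgeInputTokensPerTurn: 2000,
  judgeOutputTokensPerTurn: 200,
});
console.log(
  `~$${perRun.toFixed(2)} per run, ~$${(perRun * 50 * 30).toFixed(2)} per month at 50 runs/day`
);
```

Even with a cheap judge model the monthly figure is rarely trivial, which is why many teams end up running the full suite nightly and a smaller smoke set per commit.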
The admin consent step is also worth flagging if you work in a larger organisation. Tools like this rarely make it past tenant admins on the first try without a conversation, so getting that started in parallel with your technical setup is a good idea.
The bigger pattern
Zoom out and this fits a broader shift in how the industry is approaching generative AI in production. The first wave of Copilot and agent tooling was about building. The second wave — the one we're in now — is about trusting. Evaluation frameworks, groundedness metrics, scorecards, regression suites: these are the boring but essential machinery that turns a clever prototype into something a CIO will sign off on.
It's also a sign that Microsoft expects agents to be treated as software, with all the accompanying engineering discipline — version control, automated testing, code review, CI/CD. If you've been building agents as configuration rather than code, that mindset shift is coming whether you like it or not.
The feedback loop here is open, too. Microsoft is asking developers to file issues on GitHub and tell them which evaluators and integrations matter most. That suggests the GA shape isn't locked in yet, which is a decent reason to engage early if you've got opinions about what good agent testing looks like.
For European teams building on the Microsoft stack — and there are a lot of you, judging by the conversations happening across our community — this is the kind of unglamorous but consequential tooling worth getting hands-on with before it becomes standard practice. Expect agent evaluation to be a recurring thread in the talks, hallway chats, and demo booths at ECS this year.