How Create Uses Evals to Build Better AI Agents

At Create, we're building an agentic system that turns natural language into real, production-ready software. But the problem with text-to-app is inherently open-ended. There’s no single “right” output, no perfect benchmark, and no standard metric that reliably answers the question: is the agent getting better?

So we built our own.

Over the past two months, we’ve developed an internal evaluation framework to continuously measure the quality of Create’s builder agent. These evaluations now run daily on curated sets of high-signal prompts, giving us actionable metrics on how the agent performs across key user workflows.

The system has already helped us catch regressions, validate improvements, and surface failure modes, all while tightening our iteration loop.

This post outlines what we’ve shipped, how it works, and why it matters.

The Problem with Measuring Agent Quality

Traditional software testing is clean: define the expected output, run your code, and assert correctness. But agentic systems, especially those turning vague natural language into end-to-end software, don’t work like that.

Take a prompt like:

“Build a site where people can schedule mentoring calls with calendar support and Stripe payments.”

There’s no one correct implementation. Output quality is subjective, and user expectations vary. What matters is whether the final product actually works, feels coherent, looks good, and maps closely to the request.

Existing eval approaches often fall short:

  • LLM-as-a-judge often misses logic bugs and hallucinations; it reads the code but doesn’t run it.
  • Golden outputs are brittle; you either match an expected string or fail.
  • DOM diffs detect visual changes but miss functional regressions.

In other words, we could improve the agent... and still have no idea if the user experience actually got better.

What We Shipped

To fix this, we built Create Evals, a growing set of automated evaluations that simulate user workflows, score the results, and track progress over time.

An evaluation works like this: we input a prompt (or a sequence of prompts) as if we were a user in Create, then score the agent’s output on a few metrics.

Each eval has (see the sketch after this list):

  • An input (e.g., a natural language request),
  • An expected behavior or output, and
  • A scorer (human, LLM, or a functional agent that uses the app like a user would).
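
To make that structure concrete, here is a minimal sketch of what a single eval case might look like. The field names and example values are illustrative, not Create’s actual internal schema.

```python
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class EvalCase:
    # The natural-language prompt, or an ordered list of prompts for
    # multi-turn sessions, fed to the builder agent.
    prompts: list[str]
    # A short description of the behavior we expect in the result,
    # e.g. "a deployed site with a working Stripe checkout".
    expected_behavior: str
    # How the result gets scored: an expert reviewer, an LLM judge, or a
    # browser-driving agent (CUA) that uses the app like a real user.
    scorer: Literal["human", "llm_judge", "cua"]
    # Optional context, e.g. which project snapshot to load before prompting.
    metadata: dict = field(default_factory=dict)

example = EvalCase(
    prompts=[
        "Build a site where people can schedule mentoring calls "
        "with calendar support and Stripe payments."
    ],
    expected_behavior="Booking flow works end-to-end; payments succeed in test mode.",
    scorer="cua",
)
```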

We run these daily and log both scores and qualitative feedback into our internal dashboard.

Here’s a CUA (Computer Use Agent) in action scoring a generated site.

Here’s what day-over-day eval tracking looks like in Braintrust:

Braintrust Evals

Above is the experiment analysis dashboard, which compares the performance of different LLMs across multiple tasks. Its graph and timeline track trends, errors, and effectiveness over time, making it easier to see which models perform best for specific use cases.
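
For a sense of how a run like this gets wired up, here’s a minimal sketch using Braintrust’s Python SDK with a custom function scorer. The project name, task function, and scorer are placeholders; the real harness drives Create’s builder agent and scores the deployed app, not a string.

```python
from braintrust import Eval

def run_builder_agent(prompt: str) -> str:
    # Placeholder: in reality this drives the builder agent end to end
    # and returns a handle to the generated, deployed app.
    return "deployed: https://example.test/app"

def works_end_to_end(input, output, expected=None, **kwargs):
    # Custom scorer stub returning a value in [0, 1]; a real scorer would
    # wrap an LLM judge or a CUA run against the deployed app.
    return 1.0 if "deployed" in output else 0.0

Eval(
    "create-builder-agent",  # hypothetical project name
    data=lambda: [
        {"input": "Build a personal finance tracker with charts and login."},
    ],
    task=run_builder_agent,
    scores=[works_end_to_end],
)
```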

Inside Create’s Agent Architecture

When you prompt Create, the agent reads from multiple sources: your codebase, your chat history, and the tools available to it. It plans its next steps, executes changes, updates the UI, connects services, and deploys environments, all in one flow.

Failure can happen at any point. Is the file indexer accurate? Did the code compile? Did the app deploy? Did the logic actually work?

The combinatorics here explode quickly. We needed a way for engineers to move fast and know if they broke something or made something better.

So Create Evals is structured around a few key layers:

1. Prompt Set

A mix of real user sessions, hand-written edge cases, and synthetic prompts. Each set typically contains ~50 examples. Prompts range from one-shot requests to long conversational flows. Some examples:

  • Snapshot evals: Load a specific project state, then apply a change. E.g., "Add auth to this static site."
  • Design evals: Rate how 'human' or coherent the layout and styling choices are.
  • Full app builds: Go from nothing to something working, e.g., "Build a personal finance tracker with charts and login."

We often extract prompt sets directly from anonymized real user sessions.

2. Scoring Pipeline

We use a multi-pronged approach:

  • LLM-as-a-judge: Pass the result to another model and score its relevance/quality (a minimal sketch follows this list).
  • CUA (Computer Use Agent): A browser-based agent (like OpenAI’s Operator) that uses the app to test live functionality.
  • Human-in-the-loop: Expert reviewers for selected flows. Fast UI for side-by-side inspection.
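
As an illustration of the LLM-as-a-judge path, the sketch below uses the OpenAI Python SDK. The rubric, model name, and parsing are simplified placeholders rather than our production scorer.

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are reviewing an app generated from a user request.
Request: {request}
Summary of the generated code: {summary}
Score from 1 to 5 how closely the result matches the request.
Reply exactly as: SCORE: <n> REASON: <one sentence>"""

def llm_judge(request: str, summary: str) -> int:
    # Model name is a placeholder; any capable chat model works here.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(request=request, summary=summary),
        }],
    )
    text = response.choices[0].message.content
    # Naive parse; a production scorer validates and retries on malformed output.
    return int(text.split("SCORE:")[1].split()[0])
```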

We also instrument the generated apps with data-testids and other selectors so scoring agents can query internal state directly.
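
To show why that instrumentation matters, here’s a hedged sketch of a scripted functional check using Playwright and data-testid selectors. The URL, test IDs, and flow are hypothetical, and our actual CUA scorers are model-driven rather than hard-coded like this.

```python
from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeout

def check_booking_flow(app_url: str) -> bool:
    """Return True if the core booking flow is reachable and confirms."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        try:
            page.goto(app_url)
            # These selectors exist because the generation prompt asks the
            # agent to emit stable data-testid attributes.
            page.get_by_test_id("book-call-button").click()
            page.get_by_test_id("booking-confirmation").wait_for(state="visible")
            return True
        except PlaywrightTimeout:
            return False
        finally:
            browser.close()
```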

3. Daily Tracking

Eval scores, logs, and visual diffs are recorded in our observability system. This gives engineers a fast, high-signal feedback loop.

We’ve designed these evals to be small enough that they’re understandable; engineers can review a failing case in minutes and trace why it broke.

Why This Matters

Evals have changed how we ship.

Because our sets are curated and contextual, the feedback is fast and meaningful. Some wins:

  • Caught regressions that wouldn’t surface in general QA.
  • Flagged prompt formats that consistently degrade output.
  • Drove improvements in long-session coherence.

Example: We extended a long-conversation eval from 8 to 165 prompts. Stability improved measurably; errors that used to crop up after 5–10 turns were mitigated or eliminated.

Evals also help us reason about tradeoffs. Say we add a new reasoning step: "Agent should think about the design before coding." Did that improve UX? How much did latency increase? Is that a good trade?
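
Here’s a toy sketch of how that comparison might look, assuming eval results are exported as rows with a score and a latency field. The numbers and field names are made up for illustration.

```python
from statistics import mean

# Made-up example rows; real runs export these from the eval dashboard.
baseline = [{"score": 0.72, "latency_s": 41}, {"score": 0.68, "latency_s": 38}]
with_design_step = [{"score": 0.81, "latency_s": 55}, {"score": 0.79, "latency_s": 52}]

def summarize(rows):
    return mean(r["score"] for r in rows), mean(r["latency_s"] for r in rows)

old_score, old_latency = summarize(baseline)
new_score, new_latency = summarize(with_design_step)
print(f"score {old_score:.2f} -> {new_score:.2f}, "
      f"latency {old_latency:.0f}s -> {new_latency:.0f}s")
# Whether the quality gain is worth the extra latency is a product call;
# the evals just make the trade visible.
```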

This kind of product decision-making depends on high-signal evals.

Design Tradeoffs

This hasn’t been plug-and-play.

Automated scoring in this domain remains noisy. CUA-based scorers (e.g., OpenAI Operator) top out at roughly 70%–90% accuracy on certain tasks, which isn’t enough for production-quality regression detection. LLM-as-a-judge scorers often fail to detect logic bugs or missing integrations because they don’t run the app; they read the code and can hallucinate.

To address this, we’ve been intentional about eval design:

  • Instrumentation: Prompts include hardcoded selectors (e.g., data-testid tags) to make app state machine-checkable.
  • Minimalist sets: We focus on evals that can be reviewed by humans in minutes. This helps us ground automated scores with real understanding.
  • Naturalistic prompts: We avoid overly stylized instructions that wouldn’t reflect a real user’s behavior, prioritizing in-the-wild fidelity over artificial precision.

The goal isn’t to maximize score coverage; it’s to identify a small number of evals that correlate well with user experience and to iterate on those over time.

Industry Context

To our knowledge, Create is the first product shipping structured, daily evaluations for a text-to-app agent.

Academic efforts like SWEBench, WebArena, and AppBench are valuable, and we’re following them closely. But many fall short for real-world agent use, focusing on reasoning over code or single-turn tasks over multi-step workflows.

What matters to our users isn’t academic accuracy. It’s:

Did the agent build something useful? Did it work end-to-end? Did it save time?

So we design our evals around what our customers care about: shipping software that works.

In other words: we’re not just building agents; we’re building evaluation systems that make agent development possible.

Long-Term Vision

As Create evolves, we want to make it possible for anyone to go from idea to business using natural language. That vision encompasses not just code generation but real-world software operations: authentication, payment setup, database configuration, CI/CD, API integrations, and more.

To support that complexity, our agent needs to operate with real reliability. That requires not just better models but better metrics, better feedback loops, and better ways to debug agentic systems.

Evals are a foundational part of that effort.

TL;DR

Evals aren’t an add-on to the product. They are the product, in the sense that they inform everything we build. Every model iteration, every UI improvement, every upgrade to our agent’s reasoning abilities: they all flow through this system.

This is foundational work. And it’s still early.

If you care about shaping the behavior of powerful AI systems in rigorous, measurable ways, we’re hiring.