How to Evaluate AI-Generated Results

Project Introduction

The frontend class I took this semester is coming to an end. For my final project, I built StylePrint, an interface that extracts visual style elements from reference images and generates new UI designs.

In StylePrint, you upload multiple UI reference images, extract design elements from each, and combine them to generate new React/Tailwind UI code. Each week, we received group mentoring from professional developers. Through this process, I realized that what matters is not that AI produces results, but that those results are comprehensible enough for me to evaluate.

The Problem with Convincing Results

When AI creates UI, the results often look quite plausible. The colors fit reasonably well, the layout doesn’t look bad, and the Tailwind code works at first glance. But from the code alone, it’s hard to tell whether the colors are the ones I intended, which part of which reference each element was taken from, whether accessibility standards were met, or whether the code is reusable.

This is especially true when mixing multiple UI references in one project. For example, if colors come from reference 1, typography from reference 2, and layout from reference 3, it’s difficult to verify just by looking at the output whether the final combination was applied as intended.

A Connecting Link Called IntentSpec

So instead of going directly from image to code, this project uses an intermediate structure called IntentSpec. It divides the design elements extracted from the references into facets such as color, typography, layout, spacing, and component style, and records which reference each element came from.

It looks roughly like this.

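// Intermediate structure between the reference images and the generated code.
// Evidence, ConflictCard, and RepairPlan are project types not shown in this post.
type IntentSpec = {
  // Which reference each facet was taken from.
  chosen: {
    colorRefId?: string
    typographyRefId?: string
    layoutRefId?: string
    spacingRefId?: string
    componentStyleRefId?: string
  }
  // Facet values extracted from the references.
  normalized: {
    palette?: Record<string, string>
    typography?: object
    layout?: object
    spacing?: object
    componentStyle?: object
  }
  provenance: Record<string, Evidence>
  conflicts: ConflictCard[]
  repairs: RepairPlan[]
  coherenceScore?: number
}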

At first, I felt this structure was a bit cumbersome. Wouldn’t it be faster to just input an image and generate code right away? But as the project progressed, I realized this intermediate structure was the core of the project.

With IntentSpec, a vague question like "Does this result align with my intent?" can be broken down into more specific ones:

Did it use the intended color palette?
Is the typography large enough to read?
For a compact layout, is the spacing too wide?
Does the generated code actually reflect the intended facets?

It turns AI outputs from something assessed by intuition into something that can be compared and reviewed concretely.
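
As a minimal sketch of how the first of these questions could become a concrete check, the snippet below lists the hex colors in a piece of generated code that are not part of the intended palette. The helper names `normalizeHex` and `findOffPaletteColors` are hypothetical, and the sketch assumes colors appear as hex values such as Tailwind arbitrary values (`bg-[#1a2b3c]`); it is not StylePrint's actual implementation.

```typescript
// Hypothetical helper names; assumes colors are written as hex values,
// e.g. Tailwind arbitrary values like bg-[#1a2b3c].
type Palette = Record<string, string>;

// Lowercase a hex color and expand shorthand (#abc) to full form (#aabbcc).
function normalizeHex(hex: string): string {
  const h = hex.toLowerCase();
  if (h.length === 4) {
    return "#" + h.charAt(1).repeat(2) + h.charAt(2).repeat(2) + h.charAt(3).repeat(2);
  }
  return h;
}

// Return every distinct hex color used in `code` that is not in `palette`.
function findOffPaletteColors(code: string, palette: Palette): string[] {
  const allowed = new Set(Object.values(palette).map(normalizeHex));
  // Match 6-digit or 3-digit hex colors only.
  const used = code.match(/#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b/g) ?? [];
  const offPalette = new Set<string>();
  for (const color of used) {
    const hex = normalizeHex(color);
    if (!allowed.has(hex)) offPalette.add(hex);
  }
  return Array.from(offPalette);
}
```

An empty result would mean every hex color in the generated code came from the intended palette. Named Tailwind colors such as `bg-blue-500` are not covered by this sketch and would need a separate lookup.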

What AI Should Do vs. What Code Should Do

One piece of feedback from mentoring was not to delegate everything to the LLM: wherever a process is deterministic, it’s better to write it as code. Initially, I had the LLM API extract all features, but to save tokens and improve stability, I split the system by feature. Things that can be calculated pixel by pixel, such as color extraction, are handled by code. AI is used to interpret layout mood or component style from images and to audit whether the generated code reflects the intent. The final React/Tailwind code generation is handled by v0.

Building this structure made clear what AI excels at and what requires human-set criteria. The separation also made it possible to trace where in the process an issue came from, which helped improve the quality of the results.
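
The pixel-wise color extraction mentioned above could look roughly like the following sketch. It assumes the image has already been decoded into RGBA bytes (for example, the `data` array of a canvas `ImageData`). The function name `extractDominantColors`, the bucket size, and the alpha cutoff are illustrative assumptions, and the project's real extraction may work differently.

```typescript
// Hypothetical sketch: find the most frequent colors in RGBA pixel data
// by grouping similar colors into coarse buckets and averaging each bucket.
type Bucket = { r: number; g: number; b: number; count: number };

function toHex(value: number): string {
  return Math.round(value).toString(16).padStart(2, "0");
}

function extractDominantColors(
  rgba: Uint8ClampedArray,
  topN = 5,
  bucketSize = 32, // assumed quantization step per channel
): string[] {
  const buckets = new Map<number, Bucket>();
  const quantize = (c: number) => Math.floor(c / bucketSize);

  for (let i = 0; i + 3 < rgba.length; i += 4) {
    const r = rgba[i];
    const g = rgba[i + 1];
    const b = rgba[i + 2];
    const a = rgba[i + 3];
    if (a < 128) continue; // skip mostly transparent pixels (assumed cutoff)

    // Pack the three quantized channels into a single bucket key.
    const key = (quantize(r) << 16) | (quantize(g) << 8) | quantize(b);
    const bucket = buckets.get(key) ?? { r: 0, g: 0, b: 0, count: 0 };
    bucket.r += r;
    bucket.g += g;
    bucket.b += b;
    bucket.count += 1;
    buckets.set(key, bucket);
  }

  // Most populated buckets first; report each bucket's average color.
  return Array.from(buckets.values())
    .sort((x, y) => y.count - x.count)
    .slice(0, topN)
    .map((k) => "#" + toHex(k.r / k.count) + toHex(k.g / k.count) + toHex(k.b / k.count));
}
```

Because the same pixels always produce the same palette, a step like this costs no tokens and can be tested in isolation, which is the point of keeping it out of the LLM.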

Ultimately, My Personal Preference Matters

A key point emphasized by my mentor was "My taste, my thoughts, my decisions." The more I used AI, the more this statement resonated. AI won’t necessarily ask the right questions about what I want, and because the results look plausible, I might overlook details. Ultimately, if I don’t know what I want, I have no standard for judging the AI’s output. So I had to ask myself:

What is the core purpose of this service?
Is the target user a designer, a frontend developer, or someone who wants to prototype quickly?
Should the output be a creative mockup, or code that can be copied, pasted, and run immediately?

Answering these questions helped sharpen the direction of the service.

Scope of AI Utilization

The project is now in its final stages. Next, I need to verify that each feature works as intended and that the generated UI actually reflects the references, and then work out how to improve the design’s coherence and aesthetics.

I believe AI can assist in this process. If I establish criteria for checking whether "what I intended" is working correctly, AI can evaluate the results against tests, suggest improvements iteratively, and review the code I’ve built so far.
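
One way to make such criteria repeatable is to write each one as a small named check that returns pass or fail along with a reason. The sketch below is only an assumption about how that could be organized: the `Criterion` and `DesignSummary` types, `runCriteria`, and the 14px threshold are hypothetical. The contrast calculation follows the WCAG 2.x definitions of relative luminance and contrast ratio, with 4.5:1 as the AA minimum for normal text.

```typescript
// Hypothetical sketch: each criterion is a named check that explains its verdict.
type Verdict = { passed: boolean; detail: string };
type Criterion<T> = { name: string; check: (input: T) => Verdict };
type CheckResult = Verdict & { name: string };

// Assumed summary of a generated design; field names are illustrative.
type DesignSummary = {
  textColor: string; // "#rrggbb"
  backgroundColor: string; // "#rrggbb"
  baseFontSizePx: number;
};

// Relative luminance of a "#rrggbb" color, per the WCAG 2.x definition.
function relativeLuminance(hex: string): number {
  const [r, g, b] = [1, 3, 5].map((start) => {
    const c = parseInt(hex.slice(start, start + 2), 16) / 255;
    return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
  });
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

// WCAG contrast ratio: (lighter + 0.05) / (darker + 0.05).
function contrastRatio(foreground: string, background: string): number {
  const a = relativeLuminance(foreground);
  const b = relativeLuminance(background);
  return (Math.max(a, b) + 0.05) / (Math.min(a, b) + 0.05);
}

const criteria: Criterion<DesignSummary>[] = [
  {
    name: "text contrast is at least 4.5:1 (WCAG AA, normal text)",
    check: (d) => {
      const ratio = contrastRatio(d.textColor, d.backgroundColor);
      return { passed: ratio >= 4.5, detail: `contrast ${ratio.toFixed(2)}:1` };
    },
  },
  {
    name: "base font size is at least 14px", // assumed threshold, not a standard
    check: (d) => ({ passed: d.baseFontSizePx >= 14, detail: `${d.baseFontSizePx}px` }),
  },
];

// Run every criterion against the same input and collect the verdicts.
function runCriteria<T>(input: T, checks: Criterion<T>[]): CheckResult[] {
  return checks.map((c) => ({ name: c.name, ...c.check(input) }));
}
```

Failed results, together with their `detail` strings, could then be handed back to the AI as concrete feedback for the next iteration, rather than a general request to "make it better."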

Conclusion

What I learned through this project is not just how to use AI more, but how to make AI-generated results something I can understand and evaluate myself.

AI produces results quickly. But whether those results are what I wanted, what standards define a good result, and what should be fixed are decisions I have to make. So in the AI era, the ability to structure what I want, turn it into evaluation criteria, and build repeatable validation processes matters more than prompt-writing skill.

Ultimately, good AI utilization is less about blindly delegating work and more about building frameworks that help me make clearer decisions.
