LLM Structured Output Benchmark
Testing how well different language models adhere to specific JSON response formats across various prompting strategies.
Models Tested
OpenAI
- GPT-5
- GPT-4o
Anthropic
- Claude Sonnet 4.5
- Claude Opus 4.5
Google
- Gemini 2.5 Flash
- Gemini 3 Pro
Groq
- GPT-OSS 120B
- Kimi K2
- Llama 3.3 70B
OpenRouter
- Qwen3 235B
The Task
The LLM is given a team conversation discussing a technical problem. It must analyze the conversation and recommend a new team member (an AI actor defined by a title, skills, and system prompt) who could help solve the problem, returning a structured JSON response.
The Conversation
Hey team, we've got a problem. Three enterprise customers are complaining about slow load times on the dashboard. One of them is threatening to churn if we don't fix it by end of month.
I've been looking into it. The main dashboard query is taking 8-12 seconds on accounts with more than 50k records. It's definitely a database issue.
I added some basic indexes last week but it didn't help much. The query is joining across 4 tables and aggregating a lot of data.
From the frontend side, I can add loading skeletons and pagination, but that's just masking the problem. Users are going to notice the wait regardless.
I checked the database server metrics. CPU and memory look fine, but I'm seeing a lot of disk I/O. Not sure what that means for query performance though.
I tried rewriting the query to use subqueries instead of joins, but it actually made it slower. I'm kind of out of ideas here.
Should we look at caching? We could cache the dashboard data in Redis and refresh it every few minutes.
The customers want real-time data, or at least near real-time. A few minutes delay isn't going to work for their use case.
What about lazy loading sections of the dashboard? We could load the critical metrics first and the rest async.
That helps with perceived performance, but the underlying query is still slow. And some customers have dashboards with all sections visible - they'd still see the delay.
I could spin up a read replica to offload the dashboard queries from the primary database. Would that help?
It might reduce load on the primary, but the query itself would still be slow. We need to optimise the actual query execution.
What about the table structure itself? Maybe we need to redesign how we're storing this data?
That's crossed my mind. But honestly, I'm not confident about making schema changes without knowing exactly what's causing the bottleneck. We could make it worse.
I looked at EXPLAIN ANALYZE on the query. There's a sequential scan on the events table that takes most of the time. But I'm not sure how to fix it without breaking other queries that depend on that table.
Should we consider moving to a different database? I've heard TimescaleDB is good for time-series data, and a lot of our data is event-based.
That's a huge migration. We'd need someone who really knows what they're doing to evaluate whether it's worth it and plan the migration properly.
It feels like we're all guessing at this point. None of us are database experts. We know enough to be dangerous but not enough to fix this properly.
I agree. We've been circling on this for two weeks now. Maybe we need to bring in someone who specialises in this stuff?
Yeah, I think that's the right call. We need someone who can analyse the query plans, optimise the schema, set up proper indexing strategies, and maybe advise on whether we need a different database architecture altogether.
Expected Output Schema
z.object({
  recommendation: z.string()
    .min(20)
    .describe('Explanation of the hiring recommendation'),
  action: z.object({
    type: z.literal('create_actor'),
    actor: z.object({
      title: z.string().min(2),             // e.g., "Database Administrator"
      reason: z.string().min(20),           // Why this role is needed
      skills: z.array(z.string()).min(3),   // Required technical skills
      prompt: z.string().min(30),           // System prompt for AI assistant
      model: z.enum(['reasoning', 'semantic']),
    }),
  }).nullable(),
})

Test Scenarios
Each model is tested across 4 different scenarios to compare prompting strategies:
One-shot, Non-strict
Single request with JSON schema embedded in the prompt. Uses generateText() and parses the response manually.
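A minimal sketch of this flow, assuming the Vercel AI SDK's generateText() named above; the model choice and the conversation, schemaDescription, and responseSchema variables are placeholders, not the benchmark's actual identifiers.

import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

declare const conversation: string;          // the team conversation above
declare const schemaDescription: string;     // the JSON schema rendered as text for the prompt
declare const responseSchema: z.ZodTypeAny;  // the Zod schema from "Expected Output Schema"

const { text } = await generateText({
  model: openai('gpt-4o'),                   // any of the benchmarked models
  system: 'Respond with a single JSON object matching the schema below. No prose.',
  prompt: `${conversation}\n\nSchema:\n${schemaDescription}`,
});

// Non-strict: the API returns plain text, so JSON syntax errors and schema
// violations only surface during manual parsing and validation.
let parsed: unknown;
try {
  parsed = JSON.parse(text);                 // JSON syntax errors surface here
} catch {
  // malformed JSON is handled by the retry logic described later
}
const check = responseSchema.safeParse(parsed); // schema violations surface here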
One-shot, Strict
Single request using generateObject() with the Zod schema. The API enforces the schema structure.
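A corresponding sketch of the strict variant; generateObject() is from the Vercel AI SDK, and the model and prompt wiring are illustrative.

import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

declare const conversation: string;          // the team conversation above
declare const responseSchema: z.ZodTypeAny;  // the Zod schema from "Expected Output Schema"

const { object } = await generateObject({
  model: openai('gpt-4o'),
  schema: responseSchema,
  prompt: conversation,
});
// Strict: the SDK asks the provider to enforce the schema and validates the
// result, so `object` already conforms to responseSchema (or the call throws).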
Sequential, Non-strict
Three sequential requests, each building on the previous. Uses generateText() with each step's JSON schema embedded in the prompt.
Sequential, Strict
Three sequential requests, each using generateObject() with a smaller per-step schema, as sketched below.
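A hedged sketch of the sequential, strict flow. The three per-step schemas are inferred from the example step outputs in "Sequential Mode Flow" further down and may not match the benchmark's exact definitions; the non-strict variant replaces each generateObject() call with generateText() plus manual parsing, as in the one-shot sketch above.

import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

declare const conversation: string;   // the team conversation above

// Illustrative per-step schemas (inferred, not the benchmark's exact ones).
const step1Schema = z.object({
  recommendation: z.string().min(20),
  action: z.literal('create_actor'),  // assumption: the real schema may also allow a "no action" value
});
const step2Schema = z.object({
  title: z.string().min(2),
  reason: z.string().min(20),
  skills: z.array(z.string()).min(3),
});
const step3Schema = z.object({
  prompt: z.string().min(30),
  model: z.enum(['reasoning', 'semantic']),
});

const model = openai('gpt-4o');

// Each call builds on the previous step's validated output.
const step1 = await generateObject({ model, schema: step1Schema, prompt: conversation });
const step2 = await generateObject({
  model,
  schema: step2Schema,
  prompt: `Conversation:\n${conversation}\n\nPrevious step:\n${JSON.stringify(step1.object)}\n\nDescribe the role to create.`,
});
const step3 = await generateObject({
  model,
  schema: step3Schema,
  prompt: `Role so far:\n${JSON.stringify(step2.object)}\n\nWrite the assistant's system prompt and choose a model type.`,
});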
Retry Logic
When a response fails validation, the system retries with context about what went wrong. Each run allows up to 3 retries (4 attempts in total); a sketch of the full loop follows the four steps below.
Initial Request
Send the conversation and prompt to the LLM, requesting a structured JSON response.
Validation
Parse the response and validate against the Zod schema. Check for JSON syntax errors and schema violations.
Retry with Context
If validation fails, send a retry prompt that includes the previous response and specific error messages.
Record Result
Track success/failure, attempt count, duration, and token usage for analysis.
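A hedged sketch of that loop for a non-strict run. callModel() and recordResult() are placeholder helpers (not the benchmark's actual functions), and buildRetryPrompt() mirrors the retry prompt example shown next.

import { z } from 'zod';

declare function callModel(prompt: string): Promise<string>;  // one LLM call, e.g. via generateText()
declare function recordResult(r: { success: boolean; attempts: number; durationMs: number }): void;

const MAX_ATTEMPTS = 4; // 1 initial attempt + up to 3 retries

function buildRetryPrompt(previous: string, errors: string[]): string {
  return [
    'Your previous response failed JSON validation:',
    '<previous_response>', previous, '</previous_response>',
    '<validation_errors>',
    ...errors.map((e) => `• ${e}`),
    '</validation_errors>',
    'Please provide a corrected JSON response.',
  ].join('\n');
}

async function runWithRetries(basePrompt: string, schema: z.ZodTypeAny): Promise<unknown> {
  const started = Date.now();
  let prompt = basePrompt;

  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    const text = await callModel(prompt);                    // initial request or retry
    let errors: string[];

    try {
      const check = schema.safeParse(JSON.parse(text));      // validation
      if (check.success) {
        recordResult({ success: true, attempts: attempt, durationMs: Date.now() - started });
        return check.data;
      }
      errors = check.error.issues.map((i) => `${i.path.join('.')}: ${i.message}`);
    } catch (err) {
      errors = [`Invalid JSON: ${(err as Error).message}`];
    }

    prompt = buildRetryPrompt(text, errors);                 // retry with context
  }

  recordResult({ success: false, attempts: MAX_ATTEMPTS, durationMs: Date.now() - started });
  return null;
}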
Retry Prompt Example
Your previous response failed JSON validation:
<previous_response>
{"recommendation": "I think you need...", "action": {...}}
</previous_response>
<validation_errors>
• action.actor.skills: Array must contain at least 3 items
• action.actor.prompt: String must contain at least 30 characters
</validation_errors>
Please provide a corrected JSON response.

Sequential Mode Flow
In sequential mode, the response is built across 3 separate LLM calls. Each step has its own smaller schema and can retry independently.
Step 1:

{
  "recommendation": "I think you need to hire...",
  "action": "create_actor"
}

Step 2:

{
  "title": "Senior DBA",
  "reason": "The team needs...",
  "skills": ["PostgreSQL", "Query Optimization", ...]
}

Step 3:

{
  "prompt": "You are an expert database...",
  "model": "reasoning"
}

The three parts are merged into the final response schema after all steps complete successfully.
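A hedged sketch of that merge, continuing the sequential sketch above; the step variables and responseSchema are the illustrative names used there, not the benchmark's actual identifiers.

import { z } from 'zod';

declare const responseSchema: z.ZodTypeAny;  // the full schema from "Expected Output Schema"
// step1/step2/step3 are the validated results from the sequential sketch above.
declare const step1: { object: { recommendation: string; action: 'create_actor' } };
declare const step2: { object: { title: string; reason: string; skills: string[] } };
declare const step3: { object: { prompt: string; model: 'reasoning' | 'semantic' } };

const merged = {
  recommendation: step1.object.recommendation,
  action: {
    type: 'create_actor' as const,
    actor: {
      title: step2.object.title,
      reason: step2.object.reason,
      skills: step2.object.skills,
      prompt: step3.object.prompt,
      model: step3.object.model,
    },
  },
};

// Final check: the merged object must still satisfy the full response schema.
const finalResult = responseSchema.parse(merged);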