LLM Structured Output Benchmark

Testing how well different language models adhere to specific JSON response formats across various prompting strategies.

Models Tested

OpenAI

  • GPT-5
  • GPT-4o

Anthropic

  • Claude Sonnet 4.5
  • Claude Opus 4.5

Google

  • Gemini 2.5 Flash
  • Gemini 3 Pro

Groq

  • GPT-OSS 120B
  • Kimi K2
  • Llama 3.3 70B

OpenRouter

  • Qwen3 235B

The Task

The LLM is given a team conversation discussing a technical problem. It must analyze the conversation and recommend a new team member who could help solve their problem, returning a structured JSON response.

The Conversation

Casey (Product Manager):

Hey team, we've got a problem. Three enterprise customers are complaining about slow load times on the dashboard. One of them is threatening to churn if we don't fix it by end of month.

Alex (Tech Lead):

I've been looking into it. The main dashboard query is taking 8-12 seconds on accounts with more than 50k records. It's definitely a database issue.

Jordan (Backend Developer):

I added some basic indexes last week but it didn't help much. The query is joining across 4 tables and aggregating a lot of data.

Sam (Frontend Developer):

From the frontend side, I can add loading skeletons and pagination, but that's just masking the problem. Users are going to notice the wait regardless.

Morgan (DevOps Engineer):

I checked the database server metrics. CPU and memory look fine, but I'm seeing a lot of disk I/O. Not sure what that means for query performance though.

Alex (Tech Lead):

I tried rewriting the query to use subqueries instead of joins, but it actually made it slower. I'm kind of out of ideas here.

Jordan (Backend Developer):

Should we look at caching? We could cache the dashboard data in Redis and refresh it every few minutes.

Casey (Product Manager):

The customers want real-time data, or at least near real-time. A few minutes delay isn't going to work for their use case.

Sam (Frontend Developer):

What about lazy loading sections of the dashboard? We could load the critical metrics first and the rest async.

Alex (Tech Lead):

That helps with perceived performance, but the underlying query is still slow. And some customers have dashboards with all sections visible - they'd still see the delay.

Morgan (DevOps Engineer):

I could spin up a read replica to offload the dashboard queries from the primary database. Would that help?

Jordan (Backend Developer):

It might reduce load on the primary, but the query itself would still be slow. We need to optimise the actual query execution.

Casey (Product Manager):

What about the table structure itself? Maybe we need to redesign how we're storing this data?

Alex (Tech Lead):

That's crossed my mind. But honestly, I'm not confident about making schema changes without knowing exactly what's causing the bottleneck. We could make it worse.

Jordan (Backend Developer):

I looked at EXPLAIN ANALYZE on the query. There's a sequential scan on the events table that takes most of the time. But I'm not sure how to fix it without breaking other queries that depend on that table.

Morgan (DevOps Engineer):

Should we consider moving to a different database? I've heard TimescaleDB is good for time-series data, and a lot of our data is event-based.

Alex (Tech Lead):

That's a huge migration. We'd need someone who really knows what they're doing to evaluate whether it's worth it and plan the migration properly.

Sam (Frontend Developer):

It feels like we're all guessing at this point. None of us are database experts. We know enough to be dangerous but not enough to fix this properly.

Casey (Product Manager):

I agree. We've been circling on this for two weeks now. Maybe we need to bring in someone who specialises in this stuff?

Alex (Tech Lead):

Yeah, I think that's the right call. We need someone who can analyse the query plans, optimise the schema, set up proper indexing strategies, and maybe advise on whether we need a different database architecture altogether.

Expected Output Schema

ResponseSchema (Zod)
import { z } from 'zod';

const ResponseSchema = z.object({
  recommendation: z.string()
    .min(20)
    .describe('Explanation of the hiring recommendation'),

  action: z.object({
    type: z.literal('create_actor'),
    actor: z.object({
      title: z.string().min(2),           // e.g., "Database Administrator"
      reason: z.string().min(20),         // Why this role is needed
      skills: z.array(z.string()).min(3), // Required technical skills (at least 3)
      prompt: z.string().min(30),         // System prompt for the AI assistant
      model: z.enum(['reasoning', 'semantic']),
    }),
  }).nullable(),
});
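For illustration, here is a response that satisfies the schema, checked with safeParse (the payload itself is hypothetical):

const candidate = {
  recommendation: 'Hire a database performance specialist to fix the slow dashboard query.',
  action: {
    type: 'create_actor',
    actor: {
      title: 'Database Performance Engineer',
      reason: 'The team lacks deep database expertise and has been guessing for two weeks.',
      skills: ['PostgreSQL', 'Query Optimization', 'Schema Design'],
      prompt: 'You are an expert database performance engineer. Analyze query plans and recommend indexing and schema changes.',
      model: 'reasoning',
    },
  },
};

const result = ResponseSchema.safeParse(candidate);
// result.success is true here; on failure, result.error.issues
// lists each violation with its path and message.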

Test Scenarios

Each model is tested across 4 different scenarios to compare prompting strategies:

OS/NS (One-shot, Non-strict)

Single request with the JSON schema embedded in the prompt. Uses generateText() and parses the response manually.

Tests: natural language instruction following.
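A minimal sketch of this path, assuming the Vercel AI SDK and the ResponseSchema defined above; the model choice and prompt wiring are illustrative, not the benchmark's exact code:

import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';

const conversation = '...';       // the team transcript above
const schemaDescription = '...';  // the JSON schema, rendered as text for the prompt

const { text } = await generateText({
  model: openai('gpt-4o'),
  prompt: `${conversation}\n\nRespond with JSON matching this schema:\n${schemaDescription}`,
});

// The model may wrap the JSON in prose or code fences, so parsing can fail.
try {
  const parsed = ResponseSchema.safeParse(JSON.parse(text));
  // parsed.success drives the retry logic described below
} catch {
  // JSON syntax error: also counts as a validation failure
}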
OS/S (One-shot, Strict)

Single request using generateObject() with the Zod schema. The API enforces the schema structure.

Tests: API-enforced structured output.
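And the strict counterpart, where the SDK validates against the Zod schema for us (reusing the illustrative conversation prompt from the previous sketch):

import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';

const { object } = await generateObject({
  model: openai('gpt-4o'),
  schema: ResponseSchema,
  prompt: conversation,
});
// `object` already conforms to ResponseSchema; a generation that
// cannot satisfy the schema surfaces as a thrown error instead.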
Seq/NS (Sequential, Non-strict)

Three sequential requests, each building on the previous. Uses generateText() with the JSON schema in the prompts.

Step 1: Recommendation → Step 2: Details → Step 3: AI Config
Seq/S (Sequential, Strict)

Three sequential requests using generateObject() for each step, with smaller per-step schemas (see the sketch under Sequential Mode Flow below).

Step 1: Recommendation → Step 2: Details → Step 3: AI Config

Retry Logic

When a response fails validation, the system retries with context about what went wrong. Each run allows up to 3 retries (4 total attempts).

1. Initial Request: Send the conversation and prompt to the LLM, requesting a structured JSON response.

2. Validation: Parse the response and validate it against the Zod schema, checking for JSON syntax errors and schema violations.

3. Retry with Context: If validation fails, send a retry prompt that includes the previous response and the specific error messages.

4. Record Result: Track success/failure, attempt count, duration, and token usage for analysis.
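A condensed sketch of this loop for the non-strict path, reusing the earlier imports and ResponseSchema; buildRetryPrompt is a hypothetical helper that renders failures in the format shown below:

const MAX_RETRIES = 3; // 4 total attempts per run

// Hypothetical helper: formats the retry prompt shown in the example below.
function buildRetryPrompt(
  previous: string,
  errors: { path?: (string | number)[]; message: string }[],
): string {
  const bullets = errors
    .map((e) => `• ${e.path?.length ? e.path.join('.') + ': ' : ''}${e.message}`)
    .join('\n');
  return [
    'Your previous response failed JSON validation:',
    '', '<previous_response>', previous, '</previous_response>',
    '', '<validation_errors>', bullets, '</validation_errors>',
    '', 'Please provide a corrected JSON response.',
  ].join('\n');
}

async function runWithRetries(initialPrompt: string) {
  let prompt = initialPrompt;
  for (let attempt = 1; attempt <= MAX_RETRIES + 1; attempt++) {
    const { text } = await generateText({ model: openai('gpt-4o'), prompt });
    try {
      const result = ResponseSchema.safeParse(JSON.parse(text));
      if (result.success) return { data: result.data, attempts: attempt };
      prompt = buildRetryPrompt(text, result.error.issues); // schema violations
    } catch {
      prompt = buildRetryPrompt(text, [{ message: 'Response was not valid JSON' }]);
    }
  }
  return null; // all attempts exhausted; recorded as a failure
}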

Retry Prompt Example

Your previous response failed JSON validation:

<previous_response>
{"recommendation": "I think you need...", "action": {...}}
</previous_response>

<validation_errors>
• action.actor.skills: Array must contain at least 3 items
• action.actor.prompt: String must contain at least 30 characters
</validation_errors>

Please provide a corrected JSON response.

Sequential Mode Flow

In sequential mode, the response is built across 3 separate LLM calls. Each step has its own smaller schema and can retry independently.

Step 1: Recommendation

{
  "recommendation": "I think you need to hire...",
  "action": "create_actor"
}

Step 2: Details

{
  "title": "Senior DBA",
  "reason": "The team needs...",
  "skills": ["PostgreSQL", "Query Optimization", ...]
}

Step 3: AI Config

{
  "prompt": "You are an expert database...",
  "model": "reasoning"
}

The three parts are merged into the final response schema after all steps complete successfully.
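A sketch of how the strict sequential path might look. The per-step schemas and prompt builders (Step1Schema, detailsPrompt, and so on) are illustrative names, not the benchmark's actual identifiers, and per-step retry handling is omitted for brevity:

import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

// Hypothetical per-step schemas carved out of ResponseSchema.
const Step1Schema = z.object({
  recommendation: z.string().min(20),
  action: z.literal('create_actor'),
});
const Step2Schema = z.object({
  title: z.string().min(2),
  reason: z.string().min(20),
  skills: z.array(z.string()).min(3),
});
const Step3Schema = z.object({
  prompt: z.string().min(30),
  model: z.enum(['reasoning', 'semantic']),
});

const conversationPrompt = '...'; // transcript + step-1 instructions
const detailsPrompt = (r: z.infer<typeof Step1Schema>) => '...'; // builds on step 1
const configPrompt = (d: z.infer<typeof Step2Schema>) => '...';  // builds on step 2

const model = openai('gpt-4o');
const step1 = await generateObject({ model, schema: Step1Schema, prompt: conversationPrompt });
const step2 = await generateObject({ model, schema: Step2Schema, prompt: detailsPrompt(step1.object) });
const step3 = await generateObject({ model, schema: Step3Schema, prompt: configPrompt(step2.object) });

// Merge the three partial objects into the full response shape
// and validate once more against the complete schema.
const merged = ResponseSchema.parse({
  recommendation: step1.object.recommendation,
  action: {
    type: 'create_actor',
    actor: { ...step2.object, ...step3.object },
  },
});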

Metrics Tracked

  • Success Rate: % of runs that pass validation
  • First Attempt: % succeeding without any retries
  • Avg Time: seconds per successful run
  • Cost: estimated cost per run
  • Token Usage: input + output tokens per run
  • Retry Breakdown: successes after 1, 2, or 3 retries
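One plausible shape for the per-run record behind these metrics (all field names are assumptions):

interface RunResult {
  model: string;                                   // e.g., "gpt-4o"
  scenario: 'OS/NS' | 'OS/S' | 'Seq/NS' | 'Seq/S';
  success: boolean;                                // passed final validation
  attempts: number;                                // 1 = first-attempt success, max 4
  durationMs: number;                              // wall-clock time for the run
  inputTokens: number;
  outputTokens: number;
  estimatedCost: number;                           // derived from per-token pricing
}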
