LLM Structured Output Benchmark
Testing how well different language models adhere to specific JSON response formats across various prompting strategies.
Models Tested
OpenAI
- GPT-5
- GPT-4o
Anthropic
- Claude Sonnet 4.5
- Claude Opus 4.5
Google
- Gemini 2.5 Flash
- Gemini 3 Pro
Groq
- GPT-OSS 120B
- Kimi K2
- Llama 3.3 70B
OpenRouter
- Qwen3 235B
The Task
The LLM is given a team conversation discussing a technical problem. It must analyze the conversation and recommend a new team member (an AI actor defined by a title, skills, and system prompt) who could help solve the problem, returning a structured JSON response.
The Conversation
Hey team, we've got a problem. Three enterprise customers are complaining about slow load times on the dashboard. One of them is threatening to churn if we don't fix it by end of month.
I've been looking into it. The main dashboard query is taking 8-12 seconds on accounts with more than 50k records. It's definitely a database issue.
I added some basic indexes last week but it didn't help much. The query is joining across 4 tables and aggregating a lot of data.
From the frontend side, I can add loading skeletons and pagination, but that's just masking the problem. Users are going to notice the wait regardless.
I checked the database server metrics. CPU and memory look fine, but I'm seeing a lot of disk I/O. Not sure what that means for query performance though.
I tried rewriting the query to use subqueries instead of joins, but it actually made it slower. I'm kind of out of ideas here.
Should we look at caching? We could cache the dashboard data in Redis and refresh it every few minutes.
The customers want real-time data, or at least near real-time. A few minutes delay isn't going to work for their use case.
What about lazy loading sections of the dashboard? We could load the critical metrics first and the rest async.
That helps with perceived performance, but the underlying query is still slow. And some customers have dashboards with all sections visible - they'd still see the delay.
I could spin up a read replica to offload the dashboard queries from the primary database. Would that help?
It might reduce load on the primary, but the query itself would still be slow. We need to optimise the actual query execution.
What about the table structure itself? Maybe we need to redesign how we're storing this data?
That's crossed my mind. But honestly, I'm not confident about making schema changes without knowing exactly what's causing the bottleneck. We could make it worse.
I looked at EXPLAIN ANALYZE on the query. There's a sequential scan on the events table that takes most of the time. But I'm not sure how to fix it without breaking other queries that depend on that table.
Should we consider moving to a different database? I've heard TimescaleDB is good for time-series data, and a lot of our data is event-based.
That's a huge migration. We'd need someone who really knows what they're doing to evaluate whether it's worth it and plan the migration properly.
It feels like we're all guessing at this point. None of us are database experts. We know enough to be dangerous but not enough to fix this properly.
I agree. We've been circling on this for two weeks now. Maybe we need to bring in someone who specialises in this stuff?
Yeah, I think that's the right call. We need someone who can analyse the query plans, optimise the schema, set up proper indexing strategies, and maybe advise on whether we need a different database architecture altogether.
Expected Output Schema
z.object({
  recommendation: z.string()
    .min(20)
    .describe('Explanation of the hiring recommendation'),
  action: z.object({
    type: z.literal('create_actor'),
    actor: z.object({
      title: z.string().min(2),             // e.g., "Database Administrator"
      reason: z.string().min(20),           // Why this role is needed
      skills: z.array(z.string()).min(3),   // Required technical skills
      prompt: z.string().min(30),           // System prompt for AI assistant
      model: z.enum(['reasoning', 'semantic']),
    }),
  }).nullable(),
})

Test Scenarios
Each model is tested across 4 different scenarios to compare prompting strategies:
One-shot, Non-strict
Single request with JSON schema embedded in the prompt. Uses generateText() and parses the response manually.
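A minimal sketch of this flow, assuming the Vercel AI SDK's generateText() named above; the model choice and the conversation, schemaDescription, and responseSchema variables are placeholders, not the benchmark's actual identifiers.

import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

declare const conversation: string;          // the team conversation above
declare const schemaDescription: string;     // the JSON schema rendered as text for the prompt
declare const responseSchema: z.ZodTypeAny;  // the Zod schema from "Expected Output Schema"

const { text } = await generateText({
  model: openai('gpt-4o'),                   // any of the benchmarked models
  system: 'Respond with a single JSON object matching the schema below. No prose.',
  prompt: `${conversation}\n\nSchema:\n${schemaDescription}`,
});

// Non-strict: the API returns plain text, so JSON syntax errors and schema
// violations only surface during manual parsing and validation.
let parsed: unknown;
try {
  parsed = JSON.parse(text);                 // JSON syntax errors surface here
} catch {
  // malformed JSON is handled by the retry logic described later
}
const check = responseSchema.safeParse(parsed); // schema violations surface here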
One-shot, Strict
Single request using generateObject() with the Zod schema. The API enforces the schema structure.
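A corresponding sketch of the strict variant; generateObject() is from the Vercel AI SDK, and the model and prompt wiring are illustrative.

import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

declare const conversation: string;          // the team conversation above
declare const responseSchema: z.ZodTypeAny;  // the Zod schema from "Expected Output Schema"

const { object } = await generateObject({
  model: openai('gpt-4o'),
  schema: responseSchema,
  prompt: conversation,
});
// Strict: the SDK asks the provider to enforce the schema and validates the
// result, so `object` already conforms to responseSchema (or the call throws).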
Sequential, Non-strict
Three sequential requests, each building on the previous. Uses generateText() with each step's JSON schema embedded in the prompt.
Sequential, Strict
Three sequential requests, each using generateObject() with a smaller per-step schema, as sketched below.
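A hedged sketch of the sequential, strict flow. The three per-step schemas are inferred from the example step outputs in "Sequential Mode Flow" further down and may not match the benchmark's exact definitions; the non-strict variant replaces each generateObject() call with generateText() plus manual parsing, as in the one-shot sketch above.

import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

declare const conversation: string;   // the team conversation above

// Illustrative per-step schemas (inferred, not the benchmark's exact ones).
const step1Schema = z.object({
  recommendation: z.string().min(20),
  action: z.literal('create_actor'),  // assumption: the real schema may also allow a "no action" value
});
const step2Schema = z.object({
  title: z.string().min(2),
  reason: z.string().min(20),
  skills: z.array(z.string()).min(3),
});
const step3Schema = z.object({
  prompt: z.string().min(30),
  model: z.enum(['reasoning', 'semantic']),
});

const model = openai('gpt-4o');

// Each call builds on the previous step's validated output.
const step1 = await generateObject({ model, schema: step1Schema, prompt: conversation });
const step2 = await generateObject({
  model,
  schema: step2Schema,
  prompt: `Conversation:\n${conversation}\n\nPrevious step:\n${JSON.stringify(step1.object)}\n\nDescribe the role to create.`,
});
const step3 = await generateObject({
  model,
  schema: step3Schema,
  prompt: `Role so far:\n${JSON.stringify(step2.object)}\n\nWrite the assistant's system prompt and choose a model type.`,
});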
Retry Logic
When a response fails validation, the system retries with context about what went wrong. Each run allows up to 3 retries (4 attempts in total); a sketch of the full loop follows the four steps below.
Initial Request
Send the conversation and prompt to the LLM, requesting a structured JSON response.
Validation
Parse the response and validate against the Zod schema. Check for JSON syntax errors and schema violations.
Retry with Context
If validation fails, send a retry prompt that includes the previous response and specific error messages.
Record Result
Track success/failure, attempt count, duration, and token usage for analysis.
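A hedged sketch of that loop for a non-strict run. callModel() and recordResult() are placeholder helpers (not the benchmark's actual functions), and buildRetryPrompt() mirrors the retry prompt example shown next.

import { z } from 'zod';

declare function callModel(prompt: string): Promise<string>;  // one LLM call, e.g. via generateText()
declare function recordResult(r: { success: boolean; attempts: number; durationMs: number }): void;

const MAX_ATTEMPTS = 4; // 1 initial attempt + up to 3 retries

function buildRetryPrompt(previous: string, errors: string[]): string {
  return [
    'Your previous response failed JSON validation:',
    '<previous_response>', previous, '</previous_response>',
    '<validation_errors>',
    ...errors.map((e) => `• ${e}`),
    '</validation_errors>',
    'Please provide a corrected JSON response.',
  ].join('\n');
}

async function runWithRetries(basePrompt: string, schema: z.ZodTypeAny): Promise<unknown> {
  const started = Date.now();
  let prompt = basePrompt;

  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    const text = await callModel(prompt);                    // initial request or retry
    let errors: string[];

    try {
      const check = schema.safeParse(JSON.parse(text));      // validation
      if (check.success) {
        recordResult({ success: true, attempts: attempt, durationMs: Date.now() - started });
        return check.data;
      }
      errors = check.error.issues.map((i) => `${i.path.join('.')}: ${i.message}`);
    } catch (err) {
      errors = [`Invalid JSON: ${(err as Error).message}`];
    }

    prompt = buildRetryPrompt(text, errors);                 // retry with context
  }

  recordResult({ success: false, attempts: MAX_ATTEMPTS, durationMs: Date.now() - started });
  return null;
}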
Retry Prompt Example
Your previous response failed JSON validation:
<previous_response>
{"recommendation": "I think you need...", "action": {...}}
</previous_response>
<validation_errors>
• action.actor.skills: Array must contain at least 3 items
• action.actor.prompt: String must contain at least 30 characters
</validation_errors>
Please provide a corrected JSON response.

Sequential Mode Flow
In sequential mode, the response is built across 3 separate LLM calls. Each step has its own smaller schema and can retry independently.
Step 1:

{
  "recommendation": "I think you need to hire...",
  "action": "create_actor"
}

Step 2:

{
  "title": "Senior DBA",
  "reason": "The team needs...",
  "skills": ["PostgreSQL", "Query Optimization", ...]
}

Step 3:

{
  "prompt": "You are an expert database...",
  "model": "reasoning"
}

The three parts are merged into the final response schema after all steps complete successfully.
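A hedged sketch of that merge, continuing the sequential sketch above; the step variables and responseSchema are the illustrative names used there, not the benchmark's actual identifiers.

import { z } from 'zod';

declare const responseSchema: z.ZodTypeAny;  // the full schema from "Expected Output Schema"
// step1/step2/step3 are the validated results from the sequential sketch above.
declare const step1: { object: { recommendation: string; action: 'create_actor' } };
declare const step2: { object: { title: string; reason: string; skills: string[] } };
declare const step3: { object: { prompt: string; model: 'reasoning' | 'semantic' } };

const merged = {
  recommendation: step1.object.recommendation,
  action: {
    type: 'create_actor' as const,
    actor: {
      title: step2.object.title,
      reason: step2.object.reason,
      skills: step2.object.skills,
      prompt: step3.object.prompt,
      model: step3.object.model,
    },
  },
};

// Final check: the merged object must still satisfy the full response schema.
const finalResult = responseSchema.parse(merged);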