Summary
Total Tests: 96
Passed: 96
Failed: 0
Success Rate: 100%
Success Rate by Model & Scenario
Attempt: 1st request, 1st retry, 2nd retry, 3rd retry
Scenario: One-Shot Non-Strict, One-Shot Strict, Sequential Non-Strict, Sequential Strict
Cost vs Time
Average efficiency ($0.0011/s)
Bottom-left = fast & cheap (best) • Top-right = slow & expensive
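The chart's average-efficiency line can be read as a cost-per-second rate. The sketch below assumes efficiency means total cost divided by elapsed time, which the $/s unit suggests but the report does not state. The function names and the sample cost and time values are hypothetical, since this report lists times but not per-run costs.

```python
# Average rate taken from the chart label; everything else here is illustrative.
AVERAGE_RATE_USD_PER_S = 0.0011


def cost_per_second(cost_usd: float, seconds: float) -> float:
    """Return cost divided by elapsed time (assumed definition of efficiency)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return cost_usd / seconds


def cheaper_than_average(cost_usd: float, seconds: float,
                         average_rate: float = AVERAGE_RATE_USD_PER_S) -> bool:
    """True when a run's cost-per-second falls below the chart's average rate."""
    return cost_per_second(cost_usd, seconds) < average_rate


# Hypothetical run: $0.01 spent over 15 seconds.
example_rate = cost_per_second(0.01, 15.0)
print(f"{example_rate:.6f} $/s, below average: {cheaper_than_average(0.01, 15.0)}")
```

A point below the average line costs less per second than a typical run, but it can still sit far to the right of the chart if the run is slow, so the rate alone does not identify the "fast & cheap" corner.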
Results by Model
Scenario key: OS/NS = One-Shot Non-Strict, OS/S = One-Shot Strict, Seq/NS = Sequential Non-Strict, Seq/S = Sequential Strict
Status legend: Success, Failed, Skipped
Model              Scenario  Passed  1st try  Time    Tokens
GPT-5              OS/NS     3/3     100%     30.5s   2,321
GPT-5              OS/S      3/3     100%     30.7s   2,362
GPT-5              Seq/NS    3/3     100%     67.3s   5,590
GPT-5              Seq/S     3/3     100%     72.0s   7,518
GPT-4o             OS/NS     3/3     100%     2.4s    1,226
GPT-4o             OS/S      3/3     100%     2.8s    1,371
GPT-4o             Seq/NS    3/3     100%     7.0s    3,018
GPT-4o             Seq/S     3/3     100%     7.2s    3,291
Claude Sonnet 4.5  OS/NS     3/3     100%     15.8s   1,700
Claude Sonnet 4.5  OS/S      3/3     100%     13.9s   2,112
Claude Sonnet 4.5  Seq/NS    3/3     100%     21.8s   4,226
Claude Sonnet 4.5  Seq/S     3/3     100%     21.4s   4,755
Claude Opus 4.5    OS/NS     3/3     100%     15.2s   1,683
Claude Opus 4.5    OS/S      3/3     100%     14.1s   2,140
Claude Opus 4.5    Seq/NS    3/3     0%       25.4s   6,503
Claude Opus 4.5    Seq/S     3/3     100%     24.4s   4,998
Gemini 2.5 Flash   OS/NS     3/3     100%     5.0s    1,507
Gemini 2.5 Flash   OS/S      3/3     100%     5.1s    1,331
Gemini 2.5 Flash   Seq/NS    3/3     100%     14.4s   3,942
Gemini 2.5 Flash   Seq/S     3/3     100%     9.2s    3,396
Gemini 3 Pro       OS/NS     3/3     100%     66.1s   1,400
Gemini 3 Pro       OS/S      3/3     100%     48.4s   1,306
Gemini 3 Pro       Seq/NS    3/3     100%     233.6s  3,595
Gemini 3 Pro       Seq/S     3/3     100%     75.4s   3,706
GPT-OSS 120B       OS/NS     3/3     100%     6.4s    1,577
GPT-OSS 120B       Seq/NS    3/3     100%     17.3s   3,478
Kimi K2            OS/NS     3/3     100%     6.5s    1,343
Kimi K2            Seq/NS    3/3     100%     17.2s   3,172
Llama 3.3 70B      OS/NS     3/3     100%     5.8s    1,254
Llama 3.3 70B      Seq/NS    3/3     100%     17.1s   3,320
Qwen3 235B         OS/NS     3/3     100%     13.8s   1,573
Qwen3 235B         Seq/NS    3/3     0%       57.5s   5,958
Runs 1, 2, and 3 passed for every model and scenario. Each Sequential (Seq) card breaks its runs down into Step 1, Step 2, and Step 3.
Configuration
Models: 10
Scenarios: 1, 2, 3, 4
Runs/scenario: 3
Temperature: 0.1
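The total of 96 tests follows from the configuration and the result cards: six models appear under all four scenarios, four models appear only under the two non-strict scenarios, and every combination is run three times. A minimal sketch of that arithmetic, with the model-to-scenario mapping read off the result cards (the variable and function names are illustrative, not part of the benchmark tooling):

```python
# Scenario abbreviations as they appear on the result cards.
ALL_FOUR = ["OS/NS", "OS/S", "Seq/NS", "Seq/S"]
NON_STRICT_ONLY = ["OS/NS", "Seq/NS"]

# Which scenarios each model has a card for in this report.
SCENARIOS_BY_MODEL = {
    "GPT-5": ALL_FOUR,
    "GPT-4o": ALL_FOUR,
    "Claude Sonnet 4.5": ALL_FOUR,
    "Claude Opus 4.5": ALL_FOUR,
    "Gemini 2.5 Flash": ALL_FOUR,
    "Gemini 3 Pro": ALL_FOUR,
    "GPT-OSS 120B": NON_STRICT_ONLY,
    "Kimi K2": NON_STRICT_ONLY,
    "Llama 3.3 70B": NON_STRICT_ONLY,
    "Qwen3 235B": NON_STRICT_ONLY,
}
RUNS_PER_SCENARIO = 3


def total_tests(scenarios_by_model: dict, runs_per_scenario: int) -> int:
    """Count model/scenario combinations and multiply by runs per combination."""
    combos = sum(len(scenarios) for scenarios in scenarios_by_model.values())
    return combos * runs_per_scenario


def success_rate(passed: int, total: int) -> float:
    """Return the pass percentage."""
    return 100.0 * passed / total


total = total_tests(SCENARIOS_BY_MODEL, RUNS_PER_SCENARIO)
print(total)                    # 96
print(success_rate(96, total))  # 100.0
```

Note that the 100% success rate counts a test as passed even when it needed a retry, which is why a card can show 3/3 passed alongside a 1st-try rate of 0%.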