Test Results

11/28/2025, 11:58:56 PM, 1907.4s total

Summary

Total Tests: 96
Passed: 96
Failed: 0
Success Rate: 100%
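
The success rate in the Summary follows from the passed and total counts. A minimal sketch of that arithmetic, assuming the rate is passed divided by total, shown as a whole-number percentage (the function name `success_rate` is hypothetical and not part of the test harness):

```python
def success_rate(passed: int, total: int) -> float:
    """Return the pass percentage, or 0.0 when no tests ran."""
    if total == 0:
        return 0.0
    return 100.0 * passed / total


# Counts taken from the Summary section.
summary = {"total": 96, "passed": 96, "failed": 0}

rate = success_rate(summary["passed"], summary["total"])
print(f"{rate:.0f}%")  # prints "100%"
```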

Success Rate by Model & Scenario

Attempt: 1st request, 1st retry, 2nd retry, 3rd retry
Scenario: One-Shot Non-Strict (OS/NS), One-Shot Strict (OS/S), Sequential Non-Strict (Seq/NS), Sequential Strict (Seq/S)

Cost vs Time

Average efficiency: $0.0012/s
Bottom-left = fast and cheap (best); top-right = slow and expensive

Results by Model

Model              Scenario  Passed  1st try  Time    Tokens  Run 1  Run 2  Run 3
GPT-5              OS/NS     3/3     33%      32.4s   4,921   Pass   Pass   Pass
GPT-5              OS/S      3/3     100%     22.6s   2,205   Pass   Pass   Pass
GPT-5              Seq/NS    3/3     100%     55.4s   6,729   Pass   Pass   Pass
GPT-5              Seq/S     3/3     100%     54.3s   7,398   Pass   Pass   Pass
GPT-4o             OS/NS     3/3     0%       6.0s    4,731   Pass   Pass   Pass
GPT-4o             OS/S      3/3     100%     4.6s    1,379   Pass   Pass   Pass
GPT-4o             Seq/NS    3/3     0%       14.4s   8,648   Pass   Pass   Pass
GPT-4o             Seq/S     3/3     100%     7.4s    3,276   Pass   Pass   Pass
Claude Sonnet 4.5  OS/NS     3/3     100%     15.7s   2,189   Pass   Pass   Pass
Claude Sonnet 4.5  OS/S      3/3     100%     15.6s   2,141   Pass   Pass   Pass
Claude Sonnet 4.5  Seq/NS    3/3     100%     22.4s   4,548   Pass   Pass   Pass
Claude Sonnet 4.5  Seq/S     3/3     100%     24.0s   4,821   Pass   Pass   Pass
Claude Opus 4.5    OS/NS     3/3     100%     15.0s   2,188   Pass   Pass   Pass
Claude Opus 4.5    OS/S      3/3     100%     13.0s   2,141   Pass   Pass   Pass
Claude Opus 4.5    Seq/NS    3/3     33%      26.4s   6,720   Pass   Pass   Pass
Claude Opus 4.5    Seq/S     3/3     100%     22.5s   4,909   Pass   Pass   Pass
Gemini 2.5 Flash   OS/NS     3/3     100%     4.9s    1,943   Pass   Pass   Pass
Gemini 2.5 Flash   OS/S      3/3     100%     4.7s    1,352   Pass   Pass   Pass
Gemini 2.5 Flash   Seq/NS    3/3     100%     13.6s   4,478   Pass   Pass   Pass
Gemini 2.5 Flash   Seq/S     3/3     100%     8.9s    3,421   Pass   Pass   Pass
Gemini 3 Pro       OS/NS     3/3     100%     17.2s   2,105   Pass   Pass   Pass
Gemini 3 Pro       OS/S      3/3     100%     16.4s   1,314   Pass   Pass   Pass
Gemini 3 Pro       Seq/NS    3/3     100%     40.7s   4,661   Pass   Pass   Pass
Gemini 3 Pro       Seq/S     3/3     100%     40.1s   3,831   Pass   Pass   Pass
GPT-OSS 120B       OS/NS     3/3     100%     6.5s    2,058   Pass   Pass   Pass
GPT-OSS 120B       Seq/NS    3/3     100%     16.8s   4,038   Pass   Pass   Pass
Kimi K2            OS/NS     3/3     100%     6.7s    1,807   Pass   Pass   Pass
Kimi K2            Seq/NS    3/3     100%     17.2s   3,839   Pass   Pass   Pass
Llama 3.3 70B      OS/NS     3/3     0%       12.4s   5,535   Pass   Pass   Pass
Llama 3.3 70B      Seq/NS    3/3     100%     16.8s   3,906   Pass   Pass   Pass
Qwen3 235B         OS/NS     3/3     100%     14.3s   1,998   Pass   Pass   Pass
Qwen3 235B         Seq/NS    3/3     0%       47.1s   5,860   Pass   Pass   Pass

Sequential (Seq) scenarios show Step 1, Step 2, and Step 3 for each run.
Status legend: Success, Failed, Skipped

Configuration

Models: 10
Scenarios: 1, 2, 3, 4
Runs/scenario: 3
Temperature: 0.1
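
The total of 96 tests in the Summary can be cross-checked against Results by Model: six models appear under all four scenarios and four models appear under the two non-strict scenarios only, each with 3 runs. A small sketch of that arithmetic (the dictionary layout and variable names are illustrative assumptions, not the harness's actual configuration format):

```python
RUNS_PER_SCENARIO = 3  # "Runs/scenario" from the Configuration section

# Scenario labels as they appear in Results by Model.
ALL_FOUR = ["OS/NS", "OS/S", "Seq/NS", "Seq/S"]
NON_STRICT_ONLY = ["OS/NS", "Seq/NS"]

# Which scenarios each model has results for in this report.
scenarios_by_model = {
    "GPT-5": ALL_FOUR,
    "GPT-4o": ALL_FOUR,
    "Claude Sonnet 4.5": ALL_FOUR,
    "Claude Opus 4.5": ALL_FOUR,
    "Gemini 2.5 Flash": ALL_FOUR,
    "Gemini 3 Pro": ALL_FOUR,
    "GPT-OSS 120B": NON_STRICT_ONLY,
    "Kimi K2": NON_STRICT_ONLY,
    "Llama 3.3 70B": NON_STRICT_ONLY,
    "Qwen3 235B": NON_STRICT_ONLY,
}

# One configuration = one (model, scenario) pair.
configurations = sum(len(s) for s in scenarios_by_model.values())
total_tests = configurations * RUNS_PER_SCENARIO

print(len(scenarios_by_model), configurations, total_tests)  # prints "10 32 96"
```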

Activity Log