
i ran my own llm benchmarks (52 cycles, 5 models)


i don't trust public benchmarks. so i ran my own.

(image: science vibes)

context

i have an autonomous coding loop. the short version:

a task comes in from a backlog. a formulation model reads it and writes a detailed instruction. what to build, which files, what success looks like. then claude code executes that instruction. same executor every time.
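as a minimal sketch (names are illustrative, not the real loop code; the formulate/execute callables stand in for the openrouter call and the claude code executor):

```python
import random

# hedged sketch of one cycle of the loop; run_cycle, formulate, and
# execute are hypothetical names, not the actual implementation
def run_cycle(task, formulation_models, formulate, execute):
    model = random.choice(formulation_models)  # random assignment per cycle
    instruction = formulate(model, task)       # formulation model writes the spec
    succeeded = execute(instruction)           # same executor every time
    return model, succeeded
```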

the question: which formulation model actually produces instructions that lead to working code?

the test

5 models on openrouter, randomly assigned per cycle:

minimax/minimax-m2.5
z-ai/glm-5
qwen/qwen3-coder-next
qwen/qwen3.5-plus-02-15
moonshotai/kimi-k2.5

success = claude code finishes the task, tests pass, files created. failure = incomplete work or errors.
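that success check can be sketched like this (a hypothetical helper, assuming a pytest-based suite; the real criteria live in my loop):

```python
import subprocess
from pathlib import Path

# hypothetical check: the cycle counts as a success only if the expected
# files exist and the test suite exits cleanly
def cycle_succeeded(expected_files, repo_dir="."):
    if not all(Path(repo_dir, f).exists() for f in expected_files):
        return False  # incomplete work
    result = subprocess.run(["pytest", "-q"], cwd=repo_dir)
    return result.returncode == 0  # nonzero exit (errors) counts as failure
```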

52 active cycles.

raw results

  model                     cycles  successes    rate
  ------------------------  ------  ---------  ------
  minimax/minimax-m2.5           6          5   83.3%
  z-ai/glm-5                     9          7   77.8%
  qwen/qwen3-coder-next         13         10   76.9%
  qwen/qwen3.5-plus-02-15        6          3   50.0%
  moonshotai/kimi-k2.5           7          3   42.9%

from the validation report i built to track this:

============================================================
  formulation model performance
============================================================

  model                                    attempts  success     rate
  ---------------------------------------- -------- -------- --------
  minimax/minimax-m2.5                            6        5    83.3%
  z-ai/glm-5                                      9        7    77.8%
  qwen/qwen3-coder-next                          13       10    76.9%
  qwen/qwen3.5-plus-02-15                         6        3    50.0%
  moonshotai/kimi-k2.5                             7        3    42.9%
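the report itself is just an aggregation over the cycle log. a sketch, assuming a log of (model, succeeded) pairs (the real log format may differ):

```python
from collections import Counter

# aggregate per-model attempts and wins from (model, succeeded) pairs
def success_rates(cycles):
    attempts, wins = Counter(), Counter()
    for model, ok in cycles:
        attempts[model] += 1
        wins[model] += int(ok)
    # best rate first, like the report
    return sorted(
        ((m, attempts[m], wins[m], 100 * wins[m] / attempts[m]) for m in attempts),
        key=lambda row: row[3],
        reverse=True,
    )
```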

top 3 clustered at 77-83%. then a gap. then the bottom half.

why kimi failed

this is the interesting part. kimi isn't dumb. it's too ambitious.

here's what kimi would formulate:

"implement a fisher's exact test for statistical significance, build a comprehensive a/b validation framework with multiple dimensions of analysis, add cosine similarity for behavioral drift detection..."

meanwhile minimax would write:

"add these fields to metrics.py. update the report. write tests. run pytest."

done in 60 seconds.

over-specification kills. claude code is smart enough to figure out the details; it needs direction and a realistic scope, not a research paper. the bigger thinking models seem to need room to breathe: kimi's formulations were genuinely thoughtful, but the scope was too large for a single execution cycle.

qwen3.5-plus vs qwen3-coder-next

same model family, a 27-point gap. the "plus" variant adds verbosity that doesn't help; the coder variant stays focused and writes tighter instructions.

what i did with this

trimmed to the top 3:

# tested via a/b (2026-02-17, 52 cycles).
# top 3: minimax 83%, glm-5 78%, qwen3-coder 77%.
# dropped: kimi (43%), qwen3.5-plus (50%).
DEFAULT_FORMULATION_MODELS = [
    "minimax/minimax-m2.5",
    "z-ai/glm-5",
    "qwen/qwen3-coder-next",
]

added a --model flag so i can a/b test new models later without editing code:

# pin to one model
my-loop --autonomy --model minimax/minimax-m2.5

# test a subset
my-loop --autonomy --model z-ai/glm-5 --model qwen/qwen3-coder-next
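the flag is trivial to wire up. a sketch assuming argparse (the real cli may be built differently):

```python
import argparse

DEFAULT_FORMULATION_MODELS = [
    "minimax/minimax-m2.5",
    "z-ai/glm-5",
    "qwen/qwen3-coder-next",
]

# hypothetical helper: returns the model pool for this run
def model_pool(argv=None):
    parser = argparse.ArgumentParser(prog="my-loop")
    parser.add_argument("--autonomy", action="store_true")
    # repeatable flag: each --model appends; no flag means the default pool
    parser.add_argument("--model", action="append", default=[])
    args = parser.parse_args(argv)
    return args.model or DEFAULT_FORMULATION_MODELS
```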

limitations

  • sample sizes are small. 6-13 cycles per model. directional, not statistically significant.
  • all cycles ran against the same task pool. different tasks might shuffle the rankings.
  • the bigger thinking models might perform better with longer execution windows or multi-step task decomposition. this test measured single-shot formulation quality only.
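to put a number on "not statistically significant": a two-sided fisher's exact test on best vs worst (minimax 5/6 vs kimi 3/7), written from scratch so nothing beyond the stdlib is assumed:

```python
from math import comb

# two-sided fisher's exact test for a 2x2 table [[a, b], [c, d]]
def fisher_exact_two_sided(a, b, c, d):
    n = a + b + c + d
    row1, col1 = a + b, a + c
    def p(x):  # hypergeometric probability of x successes in row 1
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)
    p_obs = p(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    # sum over all tables at least as extreme as the observed one
    return sum(p(x) for x in range(lo, hi + 1) if p(x) <= p_obs * (1 + 1e-9))

# minimax (5 of 6) vs kimi (3 of 7): p ≈ 0.27, so the gap could be noise
```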

the point

run your own benchmarks on your own workload. takes a few hours. the results will contradict the leaderboards because leaderboards measure capability and you care about reliability in context.

for instruction-writing, concise wins. brevity is an asset. could be different for your use case.