A benchmark of 13 large language models across 31 verified Qlik Set Analysis tasks from three domains: Sports, HR, Sales. A two-phase methodology with two independent LLM judges. Up to 68% of solutions return the correct number; up to 47% at the leader strictly replicate the reference logic, and in roughly another 18% of cases the model writes a formula that is even more correct than the reference formula.
The tasks come from the QATA training platform. They're open: anyone can solve them and check themselves against the reference with automated grading. Real cases — no tasks invented by the researcher.
31 verified Set Analysis tasks from three domains: Sports, HR, Sales. We used the QATA platform to automatically check results against references. Access platform: OpenRouter (a single API to 300+ models), budget $20.
Screening. Each of the 13 models solves all 31 tasks with one standard prompt. The output is a leaderboard across two checks and a shortlist of the top 5 models.
Top 5 models × 31 tasks × 3 prompt levels (minimal / standard / enriched). The goal is to measure the effect of prompt engineering.
Every model answer was run through two LLM judges. One looked at what came out, the other at how it was written. When they diverge, a "logic gap" appears.
The judge runs the model's expression in Qlik and compares the resulting number with the reference KPI from the training platform. If the number matches, it counts — the expression logic is not analyzed.
Top models: up to 68%The judge compares the Set Analysis expression with the reference one from qata.datanomix.pro. It only counts if the expressions are semantically equivalent. A number that matches "by accident" through different logic does not count.
Top models: up to 47%We didn't include outdated versions (Llama 2, GPT-3.5), variant fine-tunes (for roleplay/medicine), or small models (≤8B parameters).
| Category | Models | Rationale |
|---|---|---|
| Top premium | Claude Opus 4.7 · GPT-5 · Gemini 2.5 Pro | Flagships. Check whether the price is justified. |
| Mid-tier | Sonnet 4.6 · GPT-5 mini · Gemini 2.5 Flash · Mistral Large · Grok 3 | The sweet spot for production. |
| Budget | Haiku 4.5 · Llama 3.3 70B · Qwen 2.5 72B | Savings while keeping quality. |
| Code-specialized | DeepSeek Coder V3 · Qwen 2.5 Coder 32B | Whether code specialization gives an edge. |
One standard prompt × 31 tasks. The Coincidental column shows how many times the model "guessed" the number with an expression that differs from the reference.
| # | Model | Provider | Number OK | Logic OK | Better | Coinc. | Tier |
|---|---|---|---|---|---|---|---|
| 01 | Gemini 2.5 Pro | 21/31 (68%) | 14/30 (47%) | 4 | 2 | Top | |
| 02 | Claude Opus 4.7 | Anthropic | 17/31 (55%) | 8/30 (27%) | 4 | 4 | Top |
| 03 | Claude Sonnet 4.6 | Anthropic | 16/31 (52%) | 6/30 (20%) | 3 | 6 | Top |
| 04 | Mistral Large | Mistral | 14/31 (45%) | 7/30 (23%) | 3 | 4 | Mid |
| 05 | Grok 3 | xAI | 14/31 (45%) | 8/30 (27%) | 3 | 2 | Mid |
| 06 | GPT-5 | OpenAI | 12/31 (39%) | 6/30 (20%) | 2 | 4 | Mid |
| 07 | DeepSeek V3 LOCAL | DeepSeek | 10/31 (32%) | 5/30 (17%) | 2 | 2 | Mid |
| 08 | Gemini 2.5 Flash | 8/31 (26%) | 3/30 (10%) | 2 | 2 | Mid | |
| 09 | Claude Haiku 4.5 | Anthropic | 8/31 (26%) | 6/30 (20%) | 1 | 1 | Mid |
| 10 | Qwen 2.5 72B LOCAL | Alibaba | 6/31 (19%) | 5/30 (17%) | 1 | 0 | Low |
| 11 | GPT-5 mini | OpenAI | 6/31 (19%) | 5/30 (17%) | 1 | 0 | Low |
| 12 | Llama 3.3 70B LOCAL | Meta | 2/31 (6%) | 1/30 (3%) | 0 | 1 | Low |
| 13 | Qwen 2.5 Coder 32B LOCAL | Alibaba | 2/31 (6%) | 1/30 (3%) | 1 | 0 | Low |
* DeepSeek Coder V3 excluded — API broken (0/31).
Top 5 models × 31 tasks × 3 prompt levels = 93 answers per model. Ranked by logic match.
| Model | Logic OK | Number OK | Better | Comment |
|---|---|---|---|---|
| Claude Opus 4.7 | 23/90 (26%) | 40/93 (43%) | 8 | Top tier |
| GPT-5 | 24/90 (27%) | 39/93 (42%) | 4 | Reasoning leader |
| Gemini 2.5 Pro | 22/90 (24%) | 32/93 (34%) | 5 | Strong on logic |
| Claude Sonnet 4.6 | 16/90 (18%) | 29/93 (31%) | 5 | Sweet spot |
| DeepSeek V3 | 13/90 (14%) | 24/93 (26%) | 5 | Budget |
On the first run GPT-5 = 0/31, Gemini 2.5 Pro = 2/31. These reasoning models spend tokens on hidden thinking that isn't returned to the user but burns through the same token limit.
At max_tokens=500 the entire budget goes to reasoning, and the models
returned either an empty answer (GPT-5) or a truncated expression (Gemini Pro).
The fix: max_tokens=4000 + reasoning_effort=low. After the fix:
Gemini 2.5 Pro → 21/31 (68%),
GPT-5 → 12/31 (39%).
Out of 868 answers, 115 produced the correct number with an expression different from the reference. But this isn't just "accidental correctness": in 54 of them the model's formula is semantically more correct than the reference (the model corrects the human), and only 61 were a fragile coincidence. Two frequent patterns:
Pattern A · ID instead of Name (Sports task #2):
count(distinct {<Sex={"M"}>} Name) / count(distinct Name)
Count({<Sex={'M'}>} DISTINCT ID) / Count(DISTINCT ID)
ID is the unique entity key. Counting by the key is standard modeling practice; on a complete dataset it's more reliable than the reference by Name, where namesakes collapse into one. The model acted like an experienced data architect — the real question is about the reference formula.
Pattern B · Games instead of Year+Season (Sports task #1):
{<Year = {'1996'},
Season = {'Summer'}>}
{<Games = {'1996 Summer'}>}
Games is a concatenation of Year+Season in this data model; filtering by it is equivalent to the reference by construction.
Of the 115 "different formula" cases, 54 (≈18% of all correct answers) are a formula that is more correct than the reference. On a complete dataset, counting by the ID key doesn't just match Name — it's more robust: the reference by Name errs on namesakes, while counting by the key does not. In other words, the model sometimes writes a more correct formula than the human reference.
A realistic accuracy estimate sits between the "by the number" and "by the logic" interpretations.
In Phase 2 we tested 3 prompt levels: minimal (just the question), standard (schema + role), enriched (plus examples + best practices + chain-of-thought).
The enriched prompt worsened 3 of 5 models: Sonnet, Gemini Pro, DeepSeek V3. Only the premium reasoning models (Opus, GPT-5) benefited from enrichment.
Mid-tier models "blindly copy" the structure from the few-shot examples and lose flexibility on non-standard tasks.
DeepSeek V3 with the enriched prompt showed a lower result than with the standard one: V1 45% → 36%, V2 15%.
The hypothesis "cheap model + smart prompt = expensive model" was not confirmed. Prompt engineering does not close the gap between budget and premium models.
On the same tasks with temperature=0:
Sources of noise: models aren't strictly deterministic at temperature=0, plus the LLM judge also gives different verdicts. Claims like "X beats Y by 3–5 pp" aren't supported by our data — that's within the noise.
~4,300 requests, ~2.7M tokens. 70% of the budget was eaten by the LLM-as-judge (Claude Opus in Phase 1) — on a repeat with Sonnet the cost was 14× lower for the same number of answers.
| Model · Role | Spend | Requests | Tokens |
|---|---|---|---|
| Claude Opus 4.7 · judge V1 | $12.30 | 1,980 | 1.81M |
| Gemini 2.5 Pro · candidate | $1.91 | 253 | 247K |
| GPT-5 · candidate | $1.46 | 253 | 199K |
| Sonnet 4.6 · candidate + judge V2 | $0.85 | 870 | ~150K |
| The other 9 models | $0.83 | 950 | 320K |
| Total | $17.35 | ~4,300 | ~2.7M |
The hypothesis "use Sonnet/Haiku as the judge" was confirmed — savings of 5–14× with no loss in grading quality.
Three integration scenarios with realistic accuracy (with mandatory human review) and cost per 1,000 requests.
| Scenario | Model | Prompt | Accuracy* | $/1000 |
|---|---|---|---|---|
| Basic assistant | Claude Sonnet 4.6 | standard | ~30–50% | ~$2 |
| Premium · critical tasks | GPT-5 | standard | ~35–55% | ~$20 |
| Prototyping | DeepSeek V3 | standard | ~15–30% | ~$0.30 |
* With mandatory human review.
max_tokens=4000 + reasoning_effort=low. Otherwise, systematically understated results.A separate question: if a cloud LLM is off-limits under your security policy — what to run on-prem.
Of the local models we tested, the best is DeepSeek V3 at ~17% strict logic. Qwen 2.5 72B — about 17%. Qwen 2.5 Coder 32B is weak — 3%: for the long CALCULATE/SUMX chains in Set Analysis, 32B parameters isn't enough. We did not test GLM.
One important caveat: even for the leader the expression logic is correct only 1 in 5 times. So in production any open-source model must be used with validation. Without it, it's still too raw.
The research confirms: LLMs can generate correct Qlik Set Analysis — but with a serious caveat about the strictness of the evaluation. By the number — up to 68% for the top models; by strict equivalence to the reference — up to 47%. A separate takeaway: in roughly 18% of correct answers the model's formula is more correct than the reference — on complete datasets, counting by the key is more reliable than counting by the displayed field.
The main recommendation is to use it only in "assistant for a human" mode, not in automatic generation mode without validation. The main technical insight — about configuring reasoning models — is critically important for any team that's going to integrate GPT-5 / Gemini Pro / o1 / o3 into production.
The main methodological insight — about the dual check (number + logic) — should become the standard for any future LLM benchmarks on the team.
| Criterion | Model | Insight |
|---|---|---|
| Best by number and logic | Gemini 2.5 Pro | 68% by the number, 47% strict logic. |
| Basic assistant | Claude Sonnet 4.6 | Sweet spot, ~30–50% (with review). |
| Sonnet 4.6 cost / 1,000 requests | ~$2 | Savings of up to 14× versus Opus. |
| Why Sonnet was chosen | Balance of accuracy and cost | Acceptable accuracy at low cost. |