TL;DR

The summary in four points.

If you're short on time, read this.

We tested 13 LLMs on 31 Qlik Set Analysis tasks from 3 different domains (Sports, HR, Sales). The tasks are real, with reference answers and automated checking.
We used a two-phase methodology + a stability check + a dual correctness check (by the numeric answer and by the expression logic).
The gap between "by the number" and "by the logic" is noticeable: up to 68% by the number and up to 47% by strict equivalence to the reference. And in ~18% of correct answers the model's formula is more correct than the reference — the model corrects the human.
Production takeaway: use an LLM only with mandatory validation of the result by a human or by the Qlik runtime. The best model — Gemini 2.5 Pro — 68% by the number, 47% strict logic, plus ~18% of cases where the formula is more correct than the reference. Budget $17.35 of $20.

Key numbers

Models tested

13

OpenAI · Anthropic · Google · …

Tasks

31

verified set analysis

Number-match

up to 68%

top tier

Logic-match

up to 47%

top tier · strict

Research goals

Four goals.

Understand which LLMs can actually handle generating Qlik Set Analysis.
Compare models on accuracy, cost, speed, and stability.
Test the hypothesis: can prompt engineering bring a cheap model up to the level of an expensive one?
Form data-driven recommendations for a possible integration of LLMs into the product.

Methodology

A two-phase scheme, a dual judge.

The tasks come from the QATA training platform. They're open: anyone can solve them and check themselves against the reference with automated grading. Real cases — no tasks invented by the researcher.

Source of the tasks

31 verified Set Analysis tasks from three domains: Sports, HR, Sales. We used the QATA platform to automatically check results against references. Access platform: OpenRouter (a single API to 300+ models), budget $20.

Phase 1 + Phase 2

Phase 1

13 models × 31 tasks × 1 prompt

Screening. Each of the 13 models solves all 31 tasks with one standard prompt. The output is a leaderboard across two checks and a shortlist of the top 5 models.

Phase 2

5 finalists × 3 prompts

Top 5 models × 31 tasks × 3 prompt levels (minimal / standard / enriched). The goal is to measure the effect of prompt engineering.

Two independent judges

Every model answer was run through two LLM judges. One looked at what came out, the other at how it was written. When they diverge, a "logic gap" appears.

Check #1 · Claude Opus 4.7

"Did the final number match the reference KPI?"

The judge runs the model's expression in Qlik and compares the resulting number with the reference KPI from the training platform. If the number matches, it counts — the expression logic is not analyzed.

Top models: up to 68%

Check #2 · Claude Sonnet 4.6

"Is the expression equivalent to the reference formula?"

The judge compares the Set Analysis expression with the reference one from qata.datanomix.pro. It only counts if the expressions are semantically equivalent. A number that matches "by accident" through different logic does not count.

Top models: up to 47%

Candidates

13 models · 4 categories.

We didn't include outdated versions (Llama 2, GPT-3.5), variant fine-tunes (for roleplay/medicine), or small models (≤8B parameters).

Category	Models	Rationale
Top premium	Claude Opus 4.7 · GPT-5 · Gemini 2.5 Pro	Flagships. Check whether the price is justified.
Mid-tier	Sonnet 4.6 · GPT-5 mini · Gemini 2.5 Flash · Mistral Large · Grok 3	The sweet spot for production.
Budget	Haiku 4.5 · Llama 3.3 70B · Qwen 2.5 72B	Savings while keeping quality.
Code-specialized	DeepSeek Coder V3 · Qwen 2.5 Coder 32B	Whether code specialization gives an edge.

Phase 1 · Leaderboard

13 models, ranked by number match.

One standard prompt × 31 tasks. The Coincidental column shows how many times the model "guessed" the number with an expression that differs from the reference.

#	Model	Provider	Number OK	Logic OK	Better	Coinc.	Tier
01	Gemini 2.5 Pro	Google	21/31 (68%)	14/30 (47%)	4	2	Top
02	Claude Opus 4.7	Anthropic	17/31 (55%)	8/30 (27%)	4	4	Top
03	Claude Sonnet 4.6	Anthropic	16/31 (52%)	6/30 (20%)	3	6	Top
04	Mistral Large	Mistral	14/31 (45%)	7/30 (23%)	3	4	Mid
05	Grok 3	xAI	14/31 (45%)	8/30 (27%)	3	2	Mid
06	GPT-5	OpenAI	12/31 (39%)	6/30 (20%)	2	4	Mid
07	DeepSeek V3 LOCAL	DeepSeek	10/31 (32%)	5/30 (17%)	2	2	Mid
08	Gemini 2.5 Flash	Google	8/31 (26%)	3/30 (10%)	2	2	Mid
09	Claude Haiku 4.5	Anthropic	8/31 (26%)	6/30 (20%)	1	1	Mid
10	Qwen 2.5 72B LOCAL	Alibaba	6/31 (19%)	5/30 (17%)	1	0	Low
11	GPT-5 mini	OpenAI	6/31 (19%)	5/30 (17%)	1	0	Low
12	Llama 3.3 70B LOCAL	Meta	2/31 (6%)	1/30 (3%)	0	1	Low
13	Qwen 2.5 Coder 32B LOCAL	Alibaba	2/31 (6%)	1/30 (3%)	1	0	Low

* DeepSeek Coder V3 excluded — API broken (0/31).

Phase 2 · 5 finalists × 3 prompts

Who holds up when the prompt varies.

Top 5 models × 31 tasks × 3 prompt levels = 93 answers per model. Ranked by logic match.

Model	Logic OK	Number OK	Better	Comment
Claude Opus 4.7	23/90 (26%)	40/93 (43%)	8	Top tier
GPT-5	24/90 (27%)	39/93 (42%)	4	Reasoning leader
Gemini 2.5 Pro	22/90 (24%)	32/93 (34%)	5	Strong on logic
Claude Sonnet 4.6	16/90 (18%)	29/93 (31%)	5	Sweet spot
DeepSeek V3	13/90 (14%)	24/93 (26%)	5	Budget

Findings

Six technical findings.

⚠ 4.1 Reasoning trap

Reasoning models need to be configured differently.

On the first run GPT-5 = 0/31, Gemini 2.5 Pro = 2/31. These reasoning models spend tokens on hidden thinking that isn't returned to the user but burns through the same token limit.

At max_tokens=500 the entire budget goes to reasoning, and the models returned either an empty answer (GPT-5) or a truncated expression (Gemini Pro). The fix: max_tokens=4000 + reasoning_effort=low. After the fix: Gemini 2.5 Pro → 21/31 (68%), GPT-5 → 12/31 (39%).

★ 4.2 Coincidental correctness — the main finding

The right number from an expression that doesn't match the reference — and often more correct.

Out of 868 answers, 115 produced the correct number with an expression different from the reference. But this isn't just "accidental correctness": in 54 of them the model's formula is semantically more correct than the reference (the model corrects the human), and only 61 were a fragile coincidence. Two frequent patterns:

Pattern A · ID instead of Name (Sports task #2):

Reference

count(distinct {<Sex={"M"}>} Name)
/ count(distinct Name)

LLM (by the ID key — more correct)

Count({<Sex={'M'}>} DISTINCT ID)
/ Count(DISTINCT ID)

ID is the unique entity key. Counting by the key is standard modeling practice; on a complete dataset it's more reliable than the reference by Name, where namesakes collapse into one. The model acted like an experienced data architect — the real question is about the reference formula.

Pattern B · Games instead of Year+Season (Sports task #1):

Reference

{<Year = {'1996'},
   Season = {'Summer'}>}

LLM (a different field)

{<Games = {'1996 Summer'}>}

Games is a concatenation of Year+Season in this data model; filtering by it is equivalent to the reference by construction.

◆ 4.3 Nuance

In 54 cases the model's formula is better than the reference.

Of the 115 "different formula" cases, 54 (≈18% of all correct answers) are a formula that is more correct than the reference. On a complete dataset, counting by the ID key doesn't just match Name — it's more robust: the reference by Name errs on namesakes, while counting by the key does not. In other words, the model sometimes writes a more correct formula than the human reference.

A realistic accuracy estimate sits between the "by the number" and "by the logic" interpretations.

⚠ 4.4 Prompt effect · counter-intuitive

The enriched prompt worsens results for mid-tier models.

In Phase 2 we tested 3 prompt levels: minimal (just the question), standard (schema + role), enriched (plus examples + best practices + chain-of-thought).

The enriched prompt worsened 3 of 5 models: Sonnet, Gemini Pro, DeepSeek V3. Only the premium reasoning models (Opus, GPT-5) benefited from enrichment.

Mid-tier models "blindly copy" the structure from the few-shot examples and lose flexibility on non-standard tasks.

✗ 4.5 Hypothesis not confirmed

A smart prompt doesn't turn a cheap model into an expensive one.

DeepSeek V3 with the enriched prompt showed a lower result than with the standard one: V1 45% → 36%, V2 15%.

The hypothesis "cheap model + smart prompt = expensive model" was not confirmed. Prompt engineering does not close the gap between budget and premium models.

∿ 4.6 Stability noise ±5–15 pp

A repeat run gives different numbers.

On the same tasks with temperature=0:

GPT-523 → 24+1
Claude Opus 4.719 → 23+4
Gemini 2.5 Pro19 → 22+3
Claude Sonnet 4.620 → 20±0 · the only stable one
DeepSeek V314 → 12−2

Sources of noise: models aren't strictly deterministic at temperature=0, plus the LLM judge also gives different verdicts. Claims like "X beats Y by 3–5 pp" aren't supported by our data — that's within the noise.

Cost breakdown

$17.35 for the whole benchmark.

~4,300 requests, ~2.7M tokens. 70% of the budget was eaten by the LLM-as-judge (Claude Opus in Phase 1) — on a repeat with Sonnet the cost was 14× lower for the same number of answers.

Model · Role	Spend	Requests	Tokens
Claude Opus 4.7 · judge V1	$12.30	1,980	1.81M
Gemini 2.5 Pro · candidate	$1.91	253	247K
GPT-5 · candidate	$1.46	253	199K
Sonnet 4.6 · candidate + judge V2	$0.85	870	~150K
The other 9 models	$0.83	950	320K
Total	$17.35	~4,300	~2.7M

The hypothesis "use Sonnet/Haiku as the judge" was confirmed — savings of 5–14× with no loss in grading quality.

Production guidance

If an LLM goes into the product.

Three integration scenarios with realistic accuracy (with mandatory human review) and cost per 1,000 requests.

Scenario	Model	Prompt	Accuracy*	$/1000
Basic assistant	Claude Sonnet 4.6	standard	~30–50%	~$2
Premium · critical tasks	GPT-5	standard	~35–55%	~$20
Prototyping	DeepSeek V3	standard	~15–30%	~$0.30

* With mandatory human review.

Production requirements

Four rules you don't go to production without.

Never without review. Never use it without human review or Qlik runtime validation. The best model (Gemini 2.5 Pro) — 47% strict logic; roughly every other answer needs checking.
Configure the reasoning models. GPT-5, Gemini 2.5 Pro require max_tokens=4000 + reasoning_effort=low. Otherwise, systematically understated results.
Don't overload few-shot. For most models the enriched prompt lowers accuracy. A simple prompt + strict validation works better.
Sonnet/Haiku as the judge. Not Opus. Savings of 5–14× with no loss in grading quality — verified on 868 answers.

On-prem deployment

Which open-source model to deploy locally?

A separate question: if a cloud LLM is off-limits under your security policy — what to run on-prem.

★ Local deployment recommendation

Of the local models we tested, the best is DeepSeek V3 at ~17% strict logic. Qwen 2.5 72B — about 17%. Qwen 2.5 Coder 32B is weak — 3%: for the long CALCULATE/SUMX chains in Set Analysis, 32B parameters isn't enough. We did not test GLM.

One important caveat: even for the leader the expression logic is correct only 1 in 5 times. So in production any open-source model must be used with validation. Without it, it's still too raw.

Conclusion

What we learned.

The research confirms: LLMs can generate correct Qlik Set Analysis — but with a serious caveat about the strictness of the evaluation. By the number — up to 68% for the top models; by strict equivalence to the reference — up to 47%. A separate takeaway: in roughly 18% of correct answers the model's formula is more correct than the reference — on complete datasets, counting by the key is more reliable than counting by the displayed field.

The main recommendation is to use it only in "assistant for a human" mode, not in automatic generation mode without validation. The main technical insight — about configuring reasoning models — is critically important for any team that's going to integrate GPT-5 / Gemini Pro / o1 / o3 into production.

The main methodological insight — about the dual check (number + logic) — should become the standard for any future LLM benchmarks on the team.

Quick model summary

Criterion	Model	Insight
Best by number and logic	Gemini 2.5 Pro	68% by the number, 47% strict logic.
Basic assistant	Claude Sonnet 4.6	Sweet spot, ~30–50% (with review).
Sonnet 4.6 cost / 1,000 requests	~$2	Savings of up to 14× versus Opus.
Why Sonnet was chosen	Balance of accuracy and cost	Acceptable accuracy at low cost.