QSABench· v1.0
Datanomix · Open Research · Updated June 2026

Which LLM writes Qlik Set Analysis best?

A benchmark of 13 large language models across 31 verified Qlik Set Analysis tasks from three domains: Sports, HR, Sales. A two-phase methodology with two independent LLM judges. Up to 68% of solutions return the correct number; up to 47% at the leader strictly replicate the reference logic, and in roughly another 18% of cases the model writes a formula that is even more correct than the reference formula.

Models13 Tasks31 Domains3 Budget$17.35 ByDatanomix
TL;DR

The summary in four points.

If you're short on time, read this.

Key numbers

Models tested
13
OpenAI · Anthropic · Google · …
Tasks
31
verified set analysis
Number-match
up to 68%
top tier
Logic-match
up to 47%
top tier · strict
Research goals

Four goals.

  1. Understand which LLMs can actually handle generating Qlik Set Analysis.
  2. Compare models on accuracy, cost, speed, and stability.
  3. Test the hypothesis: can prompt engineering bring a cheap model up to the level of an expensive one?
  4. Form data-driven recommendations for a possible integration of LLMs into the product.
Methodology

A two-phase scheme, a dual judge.

The tasks come from the QATA training platform. They're open: anyone can solve them and check themselves against the reference with automated grading. Real cases — no tasks invented by the researcher.

Source of the tasks

31 verified Set Analysis tasks from three domains: Sports, HR, Sales. We used the QATA platform to automatically check results against references. Access platform: OpenRouter (a single API to 300+ models), budget $20.

Phase 1 + Phase 2

Phase 1

13 models × 31 tasks × 1 prompt

Screening. Each of the 13 models solves all 31 tasks with one standard prompt. The output is a leaderboard across two checks and a shortlist of the top 5 models.

Phase 2

5 finalists × 3 prompts

Top 5 models × 31 tasks × 3 prompt levels (minimal / standard / enriched). The goal is to measure the effect of prompt engineering.

Two independent judges

Every model answer was run through two LLM judges. One looked at what came out, the other at how it was written. When they diverge, a "logic gap" appears.

Check #1 · Claude Opus 4.7

"Did the final number match the reference KPI?"

The judge runs the model's expression in Qlik and compares the resulting number with the reference KPI from the training platform. If the number matches, it counts — the expression logic is not analyzed.

Top models: up to 68%
Check #2 · Claude Sonnet 4.6

"Is the expression equivalent to the reference formula?"

The judge compares the Set Analysis expression with the reference one from qata.datanomix.pro. It only counts if the expressions are semantically equivalent. A number that matches "by accident" through different logic does not count.

Top models: up to 47%
Candidates

13 models · 4 categories.

We didn't include outdated versions (Llama 2, GPT-3.5), variant fine-tunes (for roleplay/medicine), or small models (≤8B parameters).

Category Models Rationale
Top premiumClaude Opus 4.7 · GPT-5 · Gemini 2.5 ProFlagships. Check whether the price is justified.
Mid-tierSonnet 4.6 · GPT-5 mini · Gemini 2.5 Flash · Mistral Large · Grok 3The sweet spot for production.
BudgetHaiku 4.5 · Llama 3.3 70B · Qwen 2.5 72BSavings while keeping quality.
Code-specializedDeepSeek Coder V3 · Qwen 2.5 Coder 32BWhether code specialization gives an edge.
Phase 1 · Leaderboard

13 models, ranked by number match.

One standard prompt × 31 tasks. The Coincidental column shows how many times the model "guessed" the number with an expression that differs from the reference.

# Model Provider Number OK Logic OK Better Coinc. Tier
01Gemini 2.5 ProGoogle21/31 (68%)14/30 (47%)42Top
02Claude Opus 4.7Anthropic17/31 (55%)8/30 (27%)44Top
03Claude Sonnet 4.6Anthropic16/31 (52%)6/30 (20%)36Top
04Mistral LargeMistral14/31 (45%)7/30 (23%)34Mid
05Grok 3xAI14/31 (45%)8/30 (27%)32Mid
06GPT-5OpenAI12/31 (39%)6/30 (20%)24Mid
07DeepSeek V3 LOCALDeepSeek10/31 (32%)5/30 (17%)22Mid
08Gemini 2.5 FlashGoogle8/31 (26%)3/30 (10%)22Mid
09Claude Haiku 4.5Anthropic8/31 (26%)6/30 (20%)11Mid
10Qwen 2.5 72B LOCALAlibaba6/31 (19%)5/30 (17%)10Low
11GPT-5 miniOpenAI6/31 (19%)5/30 (17%)10Low
12Llama 3.3 70B LOCALMeta2/31 (6%)1/30 (3%)01Low
13Qwen 2.5 Coder 32B LOCALAlibaba2/31 (6%)1/30 (3%)10Low

* DeepSeek Coder V3 excluded — API broken (0/31).

Phase 2 · 5 finalists × 3 prompts

Who holds up when the prompt varies.

Top 5 models × 31 tasks × 3 prompt levels = 93 answers per model. Ranked by logic match.

Model Logic OK Number OK Better Comment
Claude Opus 4.723/90 (26%)40/93 (43%)8Top tier
GPT-524/90 (27%)39/93 (42%)4Reasoning leader
Gemini 2.5 Pro22/90 (24%)32/93 (34%)5Strong on logic
Claude Sonnet 4.616/90 (18%)29/93 (31%)5Sweet spot
DeepSeek V313/90 (14%)24/93 (26%)5Budget
Findings

Six technical findings.

⚠ 4.1 Reasoning trap

Reasoning models need to be configured differently.

On the first run GPT-5 = 0/31, Gemini 2.5 Pro = 2/31. These reasoning models spend tokens on hidden thinking that isn't returned to the user but burns through the same token limit.

At max_tokens=500 the entire budget goes to reasoning, and the models returned either an empty answer (GPT-5) or a truncated expression (Gemini Pro). The fix: max_tokens=4000 + reasoning_effort=low. After the fix: Gemini 2.5 Pro → 21/31 (68%), GPT-5 → 12/31 (39%).

★ 4.2 Coincidental correctness — the main finding

The right number from an expression that doesn't match the reference — and often more correct.

Out of 868 answers, 115 produced the correct number with an expression different from the reference. But this isn't just "accidental correctness": in 54 of them the model's formula is semantically more correct than the reference (the model corrects the human), and only 61 were a fragile coincidence. Two frequent patterns:

Pattern A · ID instead of Name (Sports task #2):

Reference
count(distinct {<Sex={"M"}>} Name)
/ count(distinct Name)
LLM (by the ID key — more correct)
Count({<Sex={'M'}>} DISTINCT ID)
/ Count(DISTINCT ID)

ID is the unique entity key. Counting by the key is standard modeling practice; on a complete dataset it's more reliable than the reference by Name, where namesakes collapse into one. The model acted like an experienced data architect — the real question is about the reference formula.

Pattern B · Games instead of Year+Season (Sports task #1):

Reference
{<Year = {'1996'},
   Season = {'Summer'}>}
LLM (a different field)
{<Games = {'1996 Summer'}>}

Games is a concatenation of Year+Season in this data model; filtering by it is equivalent to the reference by construction.

◆ 4.3 Nuance

In 54 cases the model's formula is better than the reference.

Of the 115 "different formula" cases, 54 (≈18% of all correct answers) are a formula that is more correct than the reference. On a complete dataset, counting by the ID key doesn't just match Name — it's more robust: the reference by Name errs on namesakes, while counting by the key does not. In other words, the model sometimes writes a more correct formula than the human reference.

A realistic accuracy estimate sits between the "by the number" and "by the logic" interpretations.

⚠ 4.4 Prompt effect · counter-intuitive

The enriched prompt worsens results for mid-tier models.

In Phase 2 we tested 3 prompt levels: minimal (just the question), standard (schema + role), enriched (plus examples + best practices + chain-of-thought).

The enriched prompt worsened 3 of 5 models: Sonnet, Gemini Pro, DeepSeek V3. Only the premium reasoning models (Opus, GPT-5) benefited from enrichment.

Mid-tier models "blindly copy" the structure from the few-shot examples and lose flexibility on non-standard tasks.

✗ 4.5 Hypothesis not confirmed

A smart prompt doesn't turn a cheap model into an expensive one.

DeepSeek V3 with the enriched prompt showed a lower result than with the standard one: V1 45% → 36%, V2 15%.

The hypothesis "cheap model + smart prompt = expensive model" was not confirmed. Prompt engineering does not close the gap between budget and premium models.

∿ 4.6 Stability noise ±5–15 pp

A repeat run gives different numbers.

On the same tasks with temperature=0:

Sources of noise: models aren't strictly deterministic at temperature=0, plus the LLM judge also gives different verdicts. Claims like "X beats Y by 3–5 pp" aren't supported by our data — that's within the noise.

Cost breakdown

$17.35 for the whole benchmark.

~4,300 requests, ~2.7M tokens. 70% of the budget was eaten by the LLM-as-judge (Claude Opus in Phase 1) — on a repeat with Sonnet the cost was 14× lower for the same number of answers.

Model · Role Spend Requests Tokens
Claude Opus 4.7 · judge V1$12.301,9801.81M
Gemini 2.5 Pro · candidate$1.91253247K
GPT-5 · candidate$1.46253199K
Sonnet 4.6 · candidate + judge V2$0.85870~150K
The other 9 models$0.83950320K
Total$17.35~4,300~2.7M

The hypothesis "use Sonnet/Haiku as the judge" was confirmed — savings of 5–14× with no loss in grading quality.

Production guidance

If an LLM goes into the product.

Three integration scenarios with realistic accuracy (with mandatory human review) and cost per 1,000 requests.

Scenario Model Prompt Accuracy* $/1000
Basic assistantClaude Sonnet 4.6standard~30–50%~$2
Premium · critical tasksGPT-5standard~35–55%~$20
PrototypingDeepSeek V3standard~15–30%~$0.30

* With mandatory human review.

Production requirements

Four rules you don't go to production without.

  1. Never without review. Never use it without human review or Qlik runtime validation. The best model (Gemini 2.5 Pro) — 47% strict logic; roughly every other answer needs checking.
  2. Configure the reasoning models. GPT-5, Gemini 2.5 Pro require max_tokens=4000 + reasoning_effort=low. Otherwise, systematically understated results.
  3. Don't overload few-shot. For most models the enriched prompt lowers accuracy. A simple prompt + strict validation works better.
  4. Sonnet/Haiku as the judge. Not Opus. Savings of 5–14× with no loss in grading quality — verified on 868 answers.
On-prem deployment

Which open-source model to deploy locally?

A separate question: if a cloud LLM is off-limits under your security policy — what to run on-prem.

★ Local deployment recommendation

Of the local models we tested, the best is DeepSeek V3 at ~17% strict logic. Qwen 2.5 72B — about 17%. Qwen 2.5 Coder 32B is weak — 3%: for the long CALCULATE/SUMX chains in Set Analysis, 32B parameters isn't enough. We did not test GLM.

One important caveat: even for the leader the expression logic is correct only 1 in 5 times. So in production any open-source model must be used with validation. Without it, it's still too raw.

Conclusion

What we learned.

The research confirms: LLMs can generate correct Qlik Set Analysis — but with a serious caveat about the strictness of the evaluation. By the number — up to 68% for the top models; by strict equivalence to the reference — up to 47%. A separate takeaway: in roughly 18% of correct answers the model's formula is more correct than the reference — on complete datasets, counting by the key is more reliable than counting by the displayed field.

The main recommendation is to use it only in "assistant for a human" mode, not in automatic generation mode without validation. The main technical insight — about configuring reasoning models — is critically important for any team that's going to integrate GPT-5 / Gemini Pro / o1 / o3 into production.

The main methodological insight — about the dual check (number + logic) — should become the standard for any future LLM benchmarks on the team.

Quick model summary

Criterion Model Insight
Best by number and logicGemini 2.5 Pro68% by the number, 47% strict logic.
Basic assistantClaude Sonnet 4.6Sweet spot, ~30–50% (with review).
Sonnet 4.6 cost / 1,000 requests~$2Savings of up to 14× versus Opus.
Why Sonnet was chosenBalance of accuracy and costAcceptable accuracy at low cost.