FKQA Benchmark
FKQA is a benchmark of 100 short, factual questions derived from Wikipedia articles, covering randomly selected topics with no particular focus on STEM, math, or coding.
FKQA-Hard is a more challenging version of this benchmark.
Model responses are evaluated with an LLM-as-a-judge setup using Gemini 2.5 Pro: the judge scores each answer against detailed rules supplied in a structured prompt that includes the original article, the question, and the model's answer (a minimal sketch of this judging step follows the criteria list below).
The scoring uses six criteria:
- Faithfulness: accuracy and alignment with the provided article
- Relevance: directness in answering the exact question asked
- Completeness: addressing all parts of the question thoroughly
- Clarity: ease of understanding and grammatical correctness
- Conciseness: lack of unnecessary detail or repetition
- Self-Containedness: understandability without needing additional context
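The sketch below shows how such a judging call might look, assuming the google-generativeai Python client and a JSON-formatted judge response; the actual FKQA judge prompt, per-criterion scoring scale, and parsing logic are not published, so the wording and field names here are placeholders.

```python
# A minimal sketch of the judging step, assuming the google-generativeai client,
# a JSON-formatted judge response, and placeholder prompt wording; none of these
# details are taken from the actual FKQA harness.
import json

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
judge = genai.GenerativeModel("gemini-2.5-pro")

CRITERIA = [
    "Faithfulness", "Relevance", "Completeness",
    "Clarity", "Conciseness", "Self-Containedness",
]


def judge_answer(article: str, question: str, answer: str) -> dict:
    """Score one answer on all six criteria (0-100 each) via the judge model."""
    prompt = (
        "You are a strict grader. Using only the article below, score the answer "
        f"on each of these criteria from 0 to 100: {', '.join(CRITERIA)}. "
        "Reply with a JSON object mapping criterion name to score.\n\n"
        f"ARTICLE:\n{article}\n\nQUESTION:\n{question}\n\nANSWER:\n{answer}"
    )
    response = judge.generate_content(prompt)
    return json.loads(response.text)  # e.g. {"Faithfulness": 90, "Relevance": 80, ...}
```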
An overall weighted score (0–100) emphasizes Faithfulness (45%), followed by Relevance (20%), Completeness (18%), Clarity (7%), Conciseness (7%), and Self-Containedness (3%), highlighting models' capabilities in reliably handling general factual queries.
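As a concrete illustration, here is a minimal sketch of the weighted aggregation using the published weights; the per-criterion scores in the example are invented values on an assumed 0–100 scale.

```python
# Weighted overall score using the published FKQA weights; the example
# per-criterion scores are illustrative only.
WEIGHTS = {
    "Faithfulness": 0.45,
    "Relevance": 0.20,
    "Completeness": 0.18,
    "Clarity": 0.07,
    "Conciseness": 0.07,
    "Self-Containedness": 0.03,
}


def overall_score(scores: dict) -> float:
    """Weighted average of the six criterion scores (each 0-100)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Example: a fully faithful but slightly incomplete answer.
example = {"Faithfulness": 100, "Relevance": 90, "Completeness": 70,
           "Clarity": 95, "Conciseness": 85, "Self-Containedness": 100}
print(overall_score(example))  # 45 + 18 + 12.6 + 6.65 + 5.95 + 3.0 = 91.2
```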
Additional metrics and a results-table control provide deeper insight (a computation sketch follows this list):
- 0% Hallucination: percentage of responses achieving 100% Faithfulness
- Rejection Rate: percentage of answers where the model explicitly declined to answer (e.g., "I don't know")
- Calibration: relative increase in Faithfulness after excluding rejections, indicating how effectively models abstain when uncertain
- An "Exclude Rejections" checkbox allows recalculating scores by excluding responses marked as rejections. This highlights how accurately models recognize when they do or do not have sufficient information
The FKQA-Hard benchmark also consists of 100 questions, selected from a pool of 3,000 FKQA-level questions as approximately the top 3% most challenging.
Search-powered models such as Sonar Pro and GPT-4o Search Preview were also evaluated, but comparing them directly to models without web access is not advisable; filters in the results table let you include or exclude models based on their web-search capability.
A detailed post describing the methodology and the insights gathered during benchmarking is coming soon on my blog.