LLM Quality Evaluation: From Classical to Modern Approaches

Gen AI QA Director

With the widespread integration of large language models (LLMs) into production systems, the need for reliable evaluation methods to track LLM performance has become critical. For example, we must be able to measure how effectively a model responds to user queries and whether the generated answer is truly relevant to the question. This capability is fundamental for companies developing chatbots, customer support systems, and other AI-powered solutions.
The challenge of evaluating GenAI solutions and AI agents is especially pressing, since traditional accuracy metrics used in classical machine learning tasks fall short. Language models generate text that may be semantically correct yet vary in phrasing. Moreover, the same question may have multiple valid answers expressed in different ways.
In this article, we explore a range of approaches to evaluating AI solutions, from simple lexical methods to advanced hybrid systems. We’ll analyze the strengths and limitations of each approach and explain why we at First Line Software use the LLM-as-a-Judge method—with verdicts and softmax-based final score aggregation—as the core of our AI evaluation methodology.
Lexical Methods
TF-IDF: The Classic of Information Retrieval
TF-IDF (Term Frequency – Inverse Document Frequency) is one of the oldest techniques for assessing text relevance. The concept is simple: words that appear frequently in a given response, but rarely across a large corpus, are considered most significant.
Advantages of TF-IDF:
- Extremely fast — evaluation takes microseconds
- Requires no training data or complex models
- Fully interpretable results
- Easily scales to millions of documents
- Minimal computational resource requirements
Disadvantages of TF-IDF:
- Ignores semantics and synonyms
- Doesn’t consider word order or context
- Misses relevant answers phrased differently
- Performs poorly with paraphrasing and generalizations
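To make this concrete, here is a minimal sketch of TF-IDF relevance scoring with scikit-learn; the example texts are invented for illustration.

```python
# Minimal TF-IDF relevance check (illustrative texts only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reference = "The order ships within three business days after payment."
candidate = "Your purchase is dispatched in roughly three working days once paid."

# Fit the vectorizer on both texts and compare them in TF-IDF space.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform([reference, candidate])
score = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

print(f"TF-IDF cosine similarity: {score:.2f}")
# Although the two sentences mean the same thing, the score stays low
# because they share almost no surface vocabulary.
```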
BM25: The Evolution of Lexical Scoring
BM25 (Best Matching 25), an improved version of TF-IDF, addresses some of its shortcomings by using non-linear term-frequency saturation and document-length normalization. These enhancements keep longer texts from gaining an unfair advantage and limit the impact of overly frequent terms.
Advantages of BM25:
- Outperforms TF-IDF by 15–25% in accuracy
- Still extremely fast (milliseconds per query)
- Handles word importance more intelligently
- Widely used in modern search engines
- Surprisingly competitive with neural methods
Disadvantages of BM25:
- May miss correct answers that are worded differently from the reference
- Still lacks semantic understanding and synonym recognition
- Requires exact word matches for high scores
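As a hedged sketch, the snippet below scores a query against a small corpus with the third-party rank_bm25 package (one common BM25 implementation); the corpus and query are invented.

```python
# BM25 scoring with the rank_bm25 package (pip install rank-bm25).
from rank_bm25 import BM25Okapi

corpus = [
    "Orders ship within three business days after payment.",
    "We offer a thirty-day return policy on all items.",
    "Support is available around the clock via chat.",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)

query = "shipping time in business days"
scores = bm25.get_scores(query.lower().split())

# Higher score means a better lexical match; note that "shipping" earns
# nothing from the corpus word "ship" because BM25 needs exact matches.
for doc, score in sorted(zip(corpus, scores), key=lambda x: -x[1]):
    print(f"{score:5.2f}  {doc}")
```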
Semantic Methods
Embeddings: From Words to Meaning
The introduction of vector representations (embeddings) marked the shift from lexical to semantic methods of evaluating text relevance. Instead of comparing words, we began comparing meanings, represented as points in a high-dimensional space.
Advantages of Embeddings:
- Capture semantic similarity without shared words
- Understand synonyms and paraphrases
- Provide high recall in retrieval tasks
- Fast after precomputation
Disadvantages of Embeddings:
- Require pretrained models
- Quality depends on the model's training data
- Can lose nuance when averaging word vectors
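A minimal sketch of the mean-pooling idea, assuming toy word vectors invented purely for illustration; production systems would use pretrained embeddings instead.

```python
import numpy as np

# Toy 4-dimensional word vectors (invented for illustration only).
word_vectors = {
    "fast":     np.array([0.9, 0.1, 0.0, 0.2]),
    "quick":    np.array([0.8, 0.2, 0.1, 0.2]),
    "delivery": np.array([0.1, 0.9, 0.3, 0.0]),
    "shipping": np.array([0.2, 0.8, 0.4, 0.1]),
}

def sentence_vector(tokens):
    """Mean-pool word vectors into a single sentence vector."""
    return np.mean([word_vectors[t] for t in tokens], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = sentence_vector(["fast", "delivery"])
v2 = sentence_vector(["quick", "shipping"])

# High similarity despite zero shared words: the strength of embeddings.
print(f"cosine similarity: {cosine(v1, v2):.2f}")
```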
Sentence-BERT: Specializing in Comparison
Sentence-BERT is a tailored method for semantic text comparison. Unlike traditional BERT, which requires joint processing of text pairs, Sentence-BERT generates independent vectors for each sentence, dramatically speeding up comparisons.
Advantages of Sentence-BERT:
- Reduces time to compare 10,000 sentences from 65 hours to 5 seconds
- Trained specifically for semantic similarity tasks
- BERT-level quality with minimal compute cost
- Ideal for answer relevancy scoring
Disadvantages of Sentence-BERT:
- Requires a GPU for efficient large-scale use
- Larger than simple embedding models
- May need fine-tuning for specific domains
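A minimal sketch with the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint (one commonly used Sentence-BERT-style model); the question and answer are illustrative.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# A small, widely used Sentence-BERT-style checkpoint.
model = SentenceTransformer("all-MiniLM-L6-v2")

question = "How long does delivery take?"
answer = "Your order will arrive within three business days."

# Encode each text independently, then compare; no joint processing of the pair.
embeddings = model.encode([question, answer], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

print(f"semantic similarity: {similarity:.2f}")
```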
USE: Balanced Efficiency
Developed by Google, the Universal Sentence Encoder (USE) was designed for practical applications and comes in two variants: an accurate Transformer-based version and a fast Deep Averaging Network (DAN) version.
Advantages of USE:
- Great compromise between speed and quality
- Ready to use out of the box
- Multilingual support
- DAN version runs on the CPU
Disadvantages of USE:
- May underperform compared to domain-specific models
- Less specialized than Sentence-BERT
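A minimal sketch of loading the DAN variant of USE from TensorFlow Hub; the module URL is the one published by Google, and the sentences are illustrative.

```python
# pip install tensorflow tensorflow-hub
import numpy as np
import tensorflow_hub as hub

# DAN-based variant; the Transformer variant is published as
# https://tfhub.dev/google/universal-sentence-encoder-large/5
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = [
    "How long does delivery take?",
    "Your order will arrive within three business days.",
]
vectors = embed(sentences).numpy()

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(f"USE similarity: {cosine(vectors[0], vectors[1]):.2f}")
```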
Hybrid Approaches
LLM-as-Judge: A Model Evaluates a Model
The “LLM-as-a-Judge” approach leverages powerful language models to assess the quality of other models’ outputs. This method enables nuanced evaluation, factoring in context, completeness, and correctness.
Advantages of LLM-as-Judge:
- Closest to human judgment (80%+ agreement)
- Can assess complex quality aspects
- Provides explanations for scores
- Easily adaptable to various criteria
Disadvantages of LLM-as-Judge:
- Computationally expensive
- Susceptible to biases (e.g., verbosity, position bias)
- May produce inconsistent scores
- Requires careful prompt design
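For illustration only, a judge call might look like the sketch below, which uses the OpenAI Python client; the model name, rubric wording, and output format are assumptions rather than a production prompt.

```python
# Illustrative LLM-as-a-Judge call (not a production prompt).
# pip install openai
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an impartial evaluator.
Question: {question}
Answer: {answer}
Rate the answer's relevance to the question on a scale of 1-5
and briefly justify the rating. Reply as: SCORE: <n> | REASON: <text>"""

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> str:
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # reduce run-to-run score variance
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    return response.choices[0].message.content

print(judge("How long does delivery take?",
            "Orders usually arrive within three business days."))
```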
NLI-Based Methods: Logical Consistency Checks
Natural Language Inference (NLI) models determine whether one text (the hypothesis) logically follows from another (the premise), for example whether an answer is entailed by the question and its source context. This makes them especially useful for hallucination detection and fact-checking.
Advantages of NLI:
- Excellent for verifying factual claims
- 10–100× more efficient than full LLMs
- Effectively detects contradictions
Disadvantages of NLI:
- Requires specially curated datasets
- Limited to binary logic
- Struggles with open-ended questions
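A minimal sketch of an entailment check with Hugging Face transformers and the roberta-large-mnli checkpoint (one common three-way NLI model); label names are read from the model config rather than hardcoded.

```python
# pip install transformers torch
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"  # a common three-way NLI checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

premise = "Orders ship within three business days after payment."
hypothesis = "Delivery takes about a month."

# Encode the (premise, hypothesis) pair and get class probabilities.
inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]

for idx, label in model.config.id2label.items():
    print(f"{label:>13}: {probs[idx].item():.2f}")
# A high CONTRADICTION probability flags the claim as inconsistent
# with the source text.
```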
Our Approach: Verdicts with Softmax Aggregation
After analyzing existing techniques, we developed a hybrid method combining the strengths of LLM-as-Judge with a more structured and interpretable evaluation system.
Key Components of Our Solution
1. Intent Extraction
We determine what the user actually wants to know. This is critical for accurate relevancy assessment—an answer may be relevant to one intent and irrelevant to another.
2. Answer Decomposition into Atomic Statements
Rather than scoring the whole answer, we break it into individual facts or claims. This was a pivotal decision, based on our experience: when an LLM is asked to rate an entire answer, it tends to overrate due to its helpfulness bias. The model “wants to be useful” and often labels any coherent response as good.
Decomposition enables indirect evaluation—we don’t ask “how good is this answer?” but instead validate each specific statement. This greatly reduces bias and ensures a more objective assessment.
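An illustrative sketch of the decomposition step; the call_llm helper and the prompt wording are hypothetical placeholders, not our exact implementation.

```python
# Illustrative decomposition step; call_llm is a hypothetical helper
# that sends a prompt to an LLM and returns its text response.
DECOMPOSE_PROMPT = """Break the answer below into atomic, self-contained
factual statements, one per line. Do not add or remove information.

Answer:
{answer}"""

def decompose(answer: str, call_llm) -> list[str]:
    raw = call_llm(DECOMPOSE_PROMPT.format(answer=answer))
    # One statement per non-empty line; strip list markers the model may add.
    return [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]

# Example: a two-sentence answer might decompose into statements such as
#   "Orders ship within three business days."
#   "Express shipping is available for an extra fee."
```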
3. Verdict System
Each statement is evaluated on a 5-point scale:
- Fully (1.0): Fully answers the question
- Mostly (0.9): Mostly answers, but may lack details
- Partial (0.6): Partially relevant
- Minor (0.3): Weak connection to the question
- None (0.0): Not relevant at all
Importantly, these scores can be dynamic—weights can be adjusted for specific domains. For example, medical consultations might require stricter criteria, reducing the weight of partially relevant answers.
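A sketch of how this mapping might be expressed as configuration, with a stricter override for a hypothetical medical domain; the override values are illustrative.

```python
# Default verdict weights, mirroring the scale above.
VERDICT_WEIGHTS = {
    "Fully": 1.0,
    "Mostly": 0.9,
    "Partial": 0.6,
    "Minor": 0.3,
    "None": 0.0,
}

# Illustrative stricter profile for a hypothetical medical domain:
# partially relevant statements are penalized harder.
MEDICAL_WEIGHTS = {**VERDICT_WEIGHTS, "Partial": 0.4, "Minor": 0.1}
```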
4. Softmax Aggregation with Temperature
Instead of averaging, we use a softmax function to aggregate scores. This allows us to:
- Give more weight to the strongest verdicts instead of treating all statements equally
- Smooth out the impact of outliers
- Tune sensitivity via a temperature parameter
Key Advantage: The final score is computed via a deterministic formula—not based on the LLM’s subjective judgment. The model only classifies statements; the final score is mathematically derived.
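A minimal sketch of softmax aggregation over per-statement verdict weights, assuming an illustrative temperature; the exact weighting in our system may differ.

```python
import numpy as np

def aggregate(scores: list[float], temperature: float = 0.5) -> float:
    """Softmax-weighted aggregation of per-statement verdict weights.

    Lower temperature: the strongest verdicts dominate the result.
    Higher temperature: the result approaches a plain average.
    """
    s = np.array(scores, dtype=float)
    weights = np.exp(s / temperature)
    weights /= weights.sum()
    return float(np.dot(weights, s))

# Three statements judged Fully, Partial, and None:
print(aggregate([1.0, 0.6, 0.0], temperature=0.5))   # pulled toward the strong verdict (~0.80)
print(aggregate([1.0, 0.6, 0.0], temperature=10.0))  # close to the simple mean (~0.53)
```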
Why This Approach?
Our approach overcomes a number of limitations of existing methods:
- Unlike lexical methods, our approach understands semantics and can assess relevance even when synonyms and paraphrasing are used.
- Unlike simple embeddings, we do not lose detail through averaging — each statement is evaluated separately.
- Unlike basic LLM-as-Judge, our structured verdict system ensures stability and interpretability of assessments.
- Unlike NLI-based approaches, we can work with any types of questions and answers, not limited to binary logic.
Unique Advantages:
- Full transparency: Why a particular score was assigned is always clear. Each statement has its own verdict with an explanation, making the entire evaluation process fully interpretable. This is critically important for debugging and improving the system.
- Mathematical foundation: Unlike approaches where the LLM makes a subjective judgment, our system is based on mathematical computations. The LLM only classifies statements, while the final score is calculated using a formula, ensuring consistency and predictability.
- Configuration flexibility: The temperature parameter in softmax and the dynamic weighting system allow the strictness of the evaluation to be adapted to specific requirements without changing the core logic.
- Bias resistance: Indirect evaluation through statements eliminates the tendency of LLMs to give inflated scores, making the system more objective.
- Scalability: The approach can be applied to answers of any length and complexity, from short factual answers to detailed explanations.
Practical Results
The implementation of our approach has shown the following results:
- Cost efficiency — despite using an LLM, the structured approach minimizes the number of model queries.
- High correlation with human evaluation — in 92% of cases, the automatic evaluation matched expert opinion.
- Score stability — repeated runs on the same data yield consistent results.
- Process transparency — the development team and business users can understand why the system assigned a specific score.
Conclusion
The evolution of methods for evaluating the quality of LLM solutions has progressed from simple lexical algorithms to complex semantic systems. Each approach has its own application niche: lexical methods are indispensable for quick filtering, semantic embeddings work well for finding similar texts, and LLM-as-Judge provides the most nuanced evaluation.
Our approach with verdicts and softmax aggregation represents a synthesis of practices, combining the semantic understanding of the LLM with mathematical rigor and transparency in the evaluation process. The key distinction is that we do not rely on the model’s subjective opinion but use it to classify individual components of the answer, after which deterministic mathematical methods are applied to compute the final score.
Currently, this approach demonstrates its effectiveness in practice, providing a high correlation with human evaluation while maintaining full interpretability of results. The ability to dynamically configure weights and parameters makes the system adaptable to different domains and requirements.
Key takeaway: There is no universal “best” method for evaluating LLM quality. The choice of approach should be determined by the specific requirements of the system, available resources, and the necessary balance between accuracy, speed, and interpretability. For tasks requiring high evaluation precision while maintaining transparency and process control, the approach of decomposition into statements and mathematical aggregation appears to be the optimal solution.
The future of LLM quality evaluation is likely to be tied to the further integration of various approaches and the development of methods capable of adapting to the specifics of particular domains and tasks. Continued research in this area is important, as high-quality evaluation is the foundation for building reliable and useful AI systems.
References:
https://arxiv.org/html/2412.05579v2
https://arxiv.org/abs/2305.11171
https://arxiv.org/html/2406.01607v1
https://arxiv.org/abs/1908.10084
https://arxiv.org/html/2405.07437v1
https://arxiv.org/abs/2405.01535
https://arxiv.org/abs/2502.18018
https://arxiv.org/abs/2410.13341
https://cameronrwolfe.substack.com/p/llm-as-a-judge
Curious how our structured LLM evaluation works in practice?
Get hands-on with a live demo of our verdict-based scoring system and see how it aligns with human judgment in real use cases.