LLM Quality Evaluation: From Classical to Modern Approaches

Gen AI QA Director

With the widespread integration of large language models (LLMs) into production systems, the need for reliable evaluation methods to track LLM performance has become critical. For example, we must be able to measure how effectively a model responds to user queries and whether the generated answer is truly relevant to the question. This capability is fundamental for companies developing chatbots, customer support systems, and other AI-powered solutions.
The challenge of evaluating GenAI solutions and AI agents is especially pressing, since traditional accuracy metrics used in classical machine learning tasks fall short. Language models generate text that may be semantically correct yet vary in phrasing. Moreover, the same question may have multiple valid answers expressed in different ways.
In this article, we explore a range of approaches to evaluating AI solutions, from simple lexical methods to advanced hybrid systems. We’ll analyze the strengths and limitations of each approach and explain why we at First Line Software use the LLM-as-a-Judge method—with verdicts and softmax-based final score aggregation—as the core of our AI evaluation methodology.
Lexical Methods
TF-IDF: The Classic of Information Retrieval
TF-IDF (Term Frequency – Inverse Document Frequency) is one of the oldest techniques for assessing text relevance. The concept is simple: words that appear frequently in a given response, but rarely across a large corpus, are considered most significant.
Advantages of TF-IDF:
- Extremely fast — evaluation takes microseconds
- Requires no training data or complex models
- Fully interpretable results
- Easily scales to millions of documents
- Minimal computational resource requirements
Disadvantages of TF-IDF:
- Ignores semantics and synonyms
- Doesn’t consider word order or context
- Misses relevant answers phrased differently
- Performs poorly with paraphrasing and generalizations
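To make this concrete, here is a minimal sketch of TF-IDF relevance scoring with scikit-learn; the example texts are invented for illustration.

```python
# Minimal TF-IDF relevance check (illustrative texts only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reference = "The order ships within three business days after payment."
candidate = "Your purchase is dispatched in roughly three working days once paid."

# Fit the vectorizer on both texts and compare them in TF-IDF space.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform([reference, candidate])
score = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

print(f"TF-IDF cosine similarity: {score:.2f}")
# Although the two sentences mean the same thing, the score stays low
# because they share almost no surface vocabulary.
```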
BM25: The Evolution of Lexical Scoring
BM25 (Best Matching 25), an improved version of TF-IDF, addresses some of its shortcomings by using non-linear term-frequency saturation and document-length normalization. These enhancements keep longer texts from gaining an unfair advantage and limit the impact of overly frequent terms.
Advantages of BM25:
- Outperforms TF-IDF by 15–25% in accuracy
- Still extremely fast (milliseconds per query)
- Handles word importance more intelligently
- Widely used in modern search engines
- Surprisingly competitive with neural methods
Disadvantages of BM25:
- May miss correct answers that are worded differently from the reference
- Still lacks semantic understanding and synonym recognition
- Requires exact word matches for high scores
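As a hedged sketch, the snippet below scores a query against a small corpus with the third-party rank_bm25 package (one common BM25 implementation); the corpus and query are invented.

```python
# BM25 scoring with the rank_bm25 package (pip install rank-bm25).
from rank_bm25 import BM25Okapi

corpus = [
    "Orders ship within three business days after payment.",
    "We offer a thirty-day return policy on all items.",
    "Support is available around the clock via chat.",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)

query = "shipping time in business days"
scores = bm25.get_scores(query.lower().split())

# Higher score means a better lexical match; note that "shipping" earns
# nothing from the corpus word "ship" because BM25 needs exact matches.
for doc, score in sorted(zip(corpus, scores), key=lambda x: -x[1]):
    print(f"{score:5.2f}  {doc}")
```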
Semantic Methods
Embeddings: From Words to Meaning
The introduction of vector representations (embeddings) marked the shift from lexical to semantic methods of evaluating text relevance. Instead of comparing words, we began comparing meanings, represented as points in a high-dimensional space.
Advantages of Embeddings:
- Capture semantic similarity without shared words
- Understand synonyms and paraphrases
- Provide high recall in retrieval tasks
- Fast after precomputation
Disadvantages of Embeddings:
- Require pretrained models
- Quality depends on the model's training data
- Can lose nuance when averaging word vectors
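A minimal sketch of the mean-pooling idea, assuming toy word vectors invented purely for illustration; production systems would use pretrained embeddings instead.

```python
import numpy as np

# Toy 4-dimensional word vectors (invented for illustration only).
word_vectors = {
    "fast":     np.array([0.9, 0.1, 0.0, 0.2]),
    "quick":    np.array([0.8, 0.2, 0.1, 0.2]),
    "delivery": np.array([0.1, 0.9, 0.3, 0.0]),
    "shipping": np.array([0.2, 0.8, 0.4, 0.1]),
}

def sentence_vector(tokens):
    """Mean-pool word vectors into a single sentence vector."""
    return np.mean([word_vectors[t] for t in tokens], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = sentence_vector(["fast", "delivery"])
v2 = sentence_vector(["quick", "shipping"])

# High similarity despite zero shared words: the strength of embeddings.
print(f"cosine similarity: {cosine(v1, v2):.2f}")
```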
Sentence-BERT: Specializing in Comparison
Sentence-BERT is a tailored method for semantic text comparison. Unlike traditional BERT, which requires joint processing of text pairs, Sentence-BERT generates independent vectors for each sentence, dramatically speeding up comparisons.
Advantages of Sentence-BERT:
- Reduces time to compare 10,000 sentences from 65 hours to 5 seconds
- Trained specifically for semantic similarity tasks
- BERT-level quality with minimal compute cost
- Ideal for answer relevancy scoring
Disadvantages of Sentence-BERT:
- Requires a GPU for efficient large-scale use
- Larger than simple embedding models
- May need fine-tuning for specific domains
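A minimal sketch with the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint (one commonly used Sentence-BERT-style model); the question and answer are illustrative.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# A small, widely used Sentence-BERT-style checkpoint.
model = SentenceTransformer("all-MiniLM-L6-v2")

question = "How long does delivery take?"
answer = "Your order will arrive within three business days."

# Encode each text independently, then compare; no joint processing of the pair.
embeddings = model.encode([question, answer], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

print(f"semantic similarity: {similarity:.2f}")
```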
USE: Balanced Efficiency
Developed by Google, the Universal Sentence Encoder (USE) was designed for practical applications and comes in two variants: an accurate Transformer-based version and a fast Deep Averaging Network (DAN) version.
Advantages of USE:
- Great compromise between speed and quality
- Ready to use out of the box
- Multilingual support
- DAN version runs on the CPU
Disadvantages of USE:
- May underperform compared to domain-specific models
- Less specialized than Sentence-BERT
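A minimal sketch of loading the DAN variant of USE from TensorFlow Hub; the module URL is the one published by Google, and the sentences are illustrative.

```python
# pip install tensorflow tensorflow-hub
import numpy as np
import tensorflow_hub as hub

# DAN-based variant; the Transformer variant is published as
# https://tfhub.dev/google/universal-sentence-encoder-large/5
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = [
    "How long does delivery take?",
    "Your order will arrive within three business days.",
]
vectors = embed(sentences).numpy()

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(f"USE similarity: {cosine(vectors[0], vectors[1]):.2f}")
```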
Hybrid Approaches
LLM-as-Judge: A Model Evaluates a Model
The “LLM-as-a-Judge” approach leverages powerful language models to assess the quality of other models’ outputs. This method enables nuanced evaluation, factoring in context, completeness, and correctness.
Advantages of LLM-as-Judge:
- Closest to human judgment (80%+ agreement)
- Can assess complex quality aspects
- Provides explanations for scores
- Easily adaptable to various criteria
Disadvantages of LLM-as-Judge:
- Computationally expensive
- Susceptible to biases (e.g., verbosity, position bias)
- May produce inconsistent scores
- Requires careful prompt design
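For illustration only, a judge call might look like the sketch below, which uses the OpenAI Python client; the model name, rubric wording, and output format are assumptions rather than a production prompt.

```python
# Illustrative LLM-as-a-Judge call (not a production prompt).
# pip install openai
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an impartial evaluator.
Question: {question}
Answer: {answer}
Rate the answer's relevance to the question on a scale of 1-5
and briefly justify the rating. Reply as: SCORE: <n> | REASON: <text>"""

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> str:
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # reduce run-to-run score variance
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    return response.choices[0].message.content

print(judge("How long does delivery take?",
            "Orders usually arrive within three business days."))
```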
NLI-Based Methods: Logical Consistency Checks
Natural Language Inference (NLI) models determine whether one text (the hypothesis) logically follows from another (the premise), for example whether an answer is entailed by the question and its source context. This makes them especially useful for hallucination detection and fact-checking.
Advantages of NLI:
- Excellent for verifying factual claims
- 10–100× more efficient than full LLMs
- Effectively detects contradictions
Disadvantages of NLI:
- Requires specially curated datasets
- Limited to binary logic
- Struggles with open-ended questions
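A minimal sketch of an entailment check with Hugging Face transformers and the roberta-large-mnli checkpoint (one common three-way NLI model); label names are read from the model config rather than hardcoded.

```python
# pip install transformers torch
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"  # a common three-way NLI checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

premise = "Orders ship within three business days after payment."
hypothesis = "Delivery takes about a month."

# Encode the (premise, hypothesis) pair and get class probabilities.
inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]

for idx, label in model.config.id2label.items():
    print(f"{label:>13}: {probs[idx].item():.2f}")
# A high CONTRADICTION probability flags the claim as inconsistent
# with the source text.
```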
Our Approach: Verdicts with Softmax Aggregation
After analyzing existing techniques, we developed a hybrid method combining the strengths of LLM-as-Judge with a more structured and interpretable evaluation system.
Key Components of Our Solution
1. Intent Extraction
We determine what the user actually wants to know. This is critical for accurate relevancy assessment—an answer may be relevant to one intent and irrelevant to another.
2. Answer Decomposition into Atomic Statements
Rather than scoring the whole answer, we break it into individual facts or claims. This was a pivotal decision, based on our experience: when an LLM is asked to rate an entire answer, it tends to overrate due to its helpfulness bias. The model “wants to be useful” and often labels any coherent response as good.
Decomposition enables indirect evaluation—we don’t ask “how good is this answer?” but instead validate each specific statement. This greatly reduces bias and ensures a more objective assessment.
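An illustrative sketch of the decomposition step; the call_llm helper and the prompt wording are hypothetical placeholders, not our exact implementation.

```python
# Illustrative decomposition step; call_llm is a hypothetical helper
# that sends a prompt to an LLM and returns its text response.
DECOMPOSE_PROMPT = """Break the answer below into atomic, self-contained
factual statements, one per line. Do not add or remove information.

Answer:
{answer}"""

def decompose(answer: str, call_llm) -> list[str]:
    raw = call_llm(DECOMPOSE_PROMPT.format(answer=answer))
    # One statement per non-empty line; strip list markers the model may add.
    return [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]

# Example: a two-sentence answer might decompose into statements such as
#   "Orders ship within three business days."
#   "Express shipping is available for an extra fee."
```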
3. Verdict System
Each statement is evaluated on a 5-point scale:
- Fully (1.0): Fully answers the question
- Mostly (0.9): Mostly answers, but may lack details
- Partial (0.6): Partially relevant
- Minor (0.3): Weak connection to the question
- None (0.0): Not relevant at all
Importantly, these scores can be dynamic—weights can be adjusted for specific domains. For example, medical consultations might require stricter criteria, reducing the weight of partially relevant answers.
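A sketch of how this mapping might be expressed as configuration, with a stricter override for a hypothetical medical domain; the override values are illustrative.

```python
# Default verdict weights, mirroring the scale above.
VERDICT_WEIGHTS = {
    "Fully": 1.0,
    "Mostly": 0.9,
    "Partial": 0.6,
    "Minor": 0.3,
    "None": 0.0,
}

# Illustrative stricter profile for a hypothetical medical domain:
# partially relevant statements are penalized harder.
MEDICAL_WEIGHTS = {**VERDICT_WEIGHTS, "Partial": 0.4, "Minor": 0.1}
```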
4. Softmax Aggregation with Temperature
Instead of averaging, we use a softmax function to aggregate scores. This allows us to:
- Give more weight to the strongest verdicts instead of treating all statements equally
- Smooth out the impact of outliers
- Tune sensitivity via a temperature parameter
Key Advantage: The final score is computed via a deterministic formula—not based on the LLM’s subjective judgment. The model only classifies statements; the final score is mathematically derived.
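A minimal sketch of softmax aggregation over per-statement verdict weights, assuming an illustrative temperature; the exact weighting in our system may differ.

```python
import numpy as np

def aggregate(scores: list[float], temperature: float = 0.5) -> float:
    """Softmax-weighted aggregation of per-statement verdict weights.

    Lower temperature: the strongest verdicts dominate the result.
    Higher temperature: the result approaches a plain average.
    """
    s = np.array(scores, dtype=float)
    weights = np.exp(s / temperature)
    weights /= weights.sum()
    return float(np.dot(weights, s))

# Three statements judged Fully, Partial, and None:
print(aggregate([1.0, 0.6, 0.0], temperature=0.5))   # pulled toward the strong verdict (~0.80)
print(aggregate([1.0, 0.6, 0.0], temperature=10.0))  # close to the simple mean (~0.53)
```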
Why This Approach?
Our approach overcomes a number of limitations of existing methods:
- Unlike lexical methods, our approach understands semantics and can assess relevance even when synonyms and paraphrasing are used.
- Unlike simple embeddings, we do not lose detail through averaging — each statement is evaluated separately.
- Unlike basic LLM-as-Judge, our structured verdict system ensures stability and interpretability of assessments.
- Unlike NLI-based approaches, we can work with any types of questions and answers, not limited to binary logic.
Unique Advantages:
- Full transparency: Why a particular score was assigned is always clear. Each statement has its own verdict with an explanation, making the entire evaluation process fully interpretable. This is critically important for debugging and improving the system.
- Mathematical foundation: Unlike approaches where the LLM makes a subjective judgment, our system is based on mathematical computations. The LLM only classifies statements, while the final score is calculated using a formula, ensuring consistency and predictability.
- Configuration flexibility: The temperature parameter in softmax and the dynamic weighting system allow the strictness of the evaluation to be adapted to specific requirements without changing the core logic.
- Bias resistance: Indirect evaluation through statements eliminates the tendency of LLMs to give inflated scores, making the system more objective.
- Scalability: The approach can be applied to answers of any length and complexity, from short factual answers to detailed explanations.
Practical Results
The implementation of our approach has shown the following results:
- Cost efficiency — despite using an LLM, the structured approach minimizes the number of model queries.
- High correlation with human evaluation — in 92% of cases, the automatic evaluation matched expert opinion.
- Score stability — repeated runs on the same data yield consistent results.
- Process transparency — the development team and business users can understand why the system assigned a specific score.
Conclusion
The evolution of methods for evaluating the quality of LLM solutions has progressed from simple lexical algorithms to complex semantic systems. Each approach has its own application niche: lexical methods are indispensable for quick filtering, semantic embeddings work well for finding similar texts, and LLM-as-Judge provides the most nuanced evaluation.
Our approach with verdicts and softmax aggregation represents a synthesis of practices, combining the semantic understanding of the LLM with mathematical rigor and transparency in the evaluation process. The key distinction is that we do not rely on the model’s subjective opinion but use it to classify individual components of the answer, after which deterministic mathematical methods are applied to compute the final score.
Currently, this approach demonstrates its effectiveness in practice, providing a high correlation with human evaluation while maintaining full interpretability of results. The ability to dynamically configure weights and parameters makes the system adaptable to different domains and requirements.
Key takeaway: There is no universal “best” method for evaluating LLM quality. The choice of approach should be determined by the specific requirements of the system, available resources, and the necessary balance between accuracy, speed, and interpretability. For tasks requiring high evaluation precision while maintaining transparency and process control, the approach of decomposition into statements and mathematical aggregation appears to be the optimal solution.
The future of LLM quality evaluation is likely to be tied to the further integration of various approaches and the development of methods capable of adapting to the specifics of particular domains and tasks. Continued research in this area is important, as high-quality evaluation is the foundation for building reliable and useful AI systems.
References:
https://arxiv.org/html/2412.05579v2
https://arxiv.org/abs/2305.11171
https://arxiv.org/html/2406.01607v1
https://arxiv.org/abs/1908.10084
https://arxiv.org/html/2405.07437v1
https://arxiv.org/abs/2405.01535
https://arxiv.org/abs/2502.18018
https://arxiv.org/abs/2410.13341
https://cameronrwolfe.substack.com/p/llm-as-a-judge
Curious how our structured LLM evaluation works in practice?
Get hands-on with a live demo of our verdict-based scoring system and see how it aligns with human judgment in real use cases.