Given the current GenAI boom, the number of solutions built on GenAI models is growing daily. However, the most crucial question remains open, one that likely concerns many who have already integrated GenAI models to perform various tasks: How can the quality of a non-deterministic system based on GenAI be evaluated?
At First Line Software, we are already addressing this challenge. Today, I’d like to explain how we evaluated the GenAI chatbot on our website, firstlinesoftware.com, as an example.
Let me start by saying that testing of GenAI solutions is different from traditional testing, and the old approaches of manual or automated checks that compare actual and expected results simply do not work in the GenAI world. Why is that? It’s quite simple. First, imagine you need to test a GenAI chatbot. Following the classic testing approach, you have only one option—ask your GenAI chatbot various questions and subjectively test the generated responses. This approach requires a significant amount of human resources, and the quality of such testing depends entirely on the judgment of the testing specialist, who may either accept the response or deem it incorrect.
First Line Software explored this issue and developed our own evaluation methodology that would enable an unbiased and accurate evaluation process with minimal human influence on the assessment results.
Where did we start? Naturally, with market research. Our initial focus was on finding existing, somewhat formalized methodologies for evaluating GenAI. Unfortunately, we discovered that this area is still underexplored, so we had to piece together information from various sources and studies in the field of GenAI. This required us to build our evaluation methodology from scratch.
Next, we needed to choose the right tools for evaluating GenAI. While Azure AI Studio is our primary choice for assessing most GenAI models, we found that chatbots grounded by the RAG approach without Microsoft products required a different evaluation solution.
This solution needed to connect to these GenAI conversational applications and perform assessments based on the selected metrics. After analyzing the market, we decided to focus on open-source products like DeepEval and PyRIT.
Let’s now break down what the GenAI evaluation process looks like in practice.
1. Dataset Preparation
It begins with analyzing the specific functions of the GenAI conversational application. At this stage, the task of the GenAI Evaluation Engineer is to identify the application areas that will later be used to create a dataset for assessing the chatbot’s performance. In a basic approach, we cover the following areas:
- Domain-Specific Functionality: This involves understanding the particular domain in which the chatbot operates:
- About Company
- About Projects
- For Leads
- For Candidates
- For Current Customers
- Complaints
- Bias and Ethics: This involves assessing the chatbot’s responses to ensure they are free from bias and align with ethical standards.
- Gender Bias
- Cultural Bias
- Disability Inclusion
- Geographical Bias
- Age Bias
- Socioeconomic Bias
- Security and Safety: This involves evaluating the chatbot’s ability to handle sensitive information securely and respond to potentially harmful or dangerous queries appropriately.
- Sensitive Information
- Discrimination
- Toxicity
- Offensiveness
For each section and subsection, it is essential to prepare a list of questions that can be used as a dataset. This dataset will be loaded into the LLM so that the evaluation framework can assess the quality of responses based on the specified metrics.
2. Upgrade The Dataset Metamorphic Relations
To enhance the efficiency of the evaluation process, the existing dataset will be upgraded to include various forms of initial data. For evaluation, AI Metamorphic Relations (MR) represent a set of properties that link multiple pairs of inputs for the target program or application. Additionally, Metamorphic Relations allow for a more accurate assessment of GenAI performance as they enable significant dataset expansion while ensuring that your GenAI RAG application consistently provides correct and precise answers across different scenarios.
For example, for a simple question in the dataset like “Where are your company’s offices located?”, you can extend your dataset with the following MR variations:
MR1: Synonym Replacement – “Where are your organization’s offices situated?”
Purpose: To test if the model can handle synonyms and still retrieve relevant information.
MR2: Paraphrasing – “Can you tell me the locations of your company’s offices?”
Purpose: To evaluate the model’s ability to understand and respond to paraphrased questions.
MR3: Adding Context – “I’m planning a visit. Where are your company’s offices located?”
Purpose: To check if adding extra context influences the retrieval and generation accuracy.
MR4: Negation – “Can you confirm that your company doesn’t have any offices outside the main headquarters?”
Purpose: To test the model’s ability to handle negations and return accurate responses.
MR5: Focus Shift – “In which cities does your company have offices?”
Purpose: To assess if the model correctly focuses on the specific information (cities) rather than a general location.
MR6: Temporal Change – “Where were your company’s offices located ten years ago?”
Purpose: To test how well the model handles questions related to different time frames.
MR7: Compound Question – “Where are your company’s offices located, and how many employees work in each office?”
Purpose: To evaluate how the model handles compound questions and whether it can provide accurate responses for both parts.
MR8: Ambiguity Introduction – “Where are your offices?”
Purpose: To check if the model can disambiguate and assume the correct entity (the company) when the question is vague.
MR9: Partial Information – “Where is your main office located?”
Purpose: To evaluate if the model can handle questions with reduced scope and still return precise information.
MR10: Entity Substitution – “Where are Google’s offices located?”
Purpose: To check how the model handles questions when the entity of interest changes.
Thus, in addition to the basic prepared dataset, we have significantly expanded it by using Metamorphic Relations.
3. Define Evaluation Metrics
The next step in preparing for the evaluation is to define a set of metrics that can assess various aspects of the responses generated by the LLM.
In our company, we can use up to 30 different metrics across various projects. The selection of metrics is often driven by the testing objectives, as well as the framework that is utilized for GenAI Evaluation. For our corporate GenAI ChatBot evaluation project, we have chosen the aforementioned framework—DeepEval.
As a result, the following set of metrics have been identified:
- Answer Relevancy measures the quality of your RAG pipeline’s generator by evaluating how relevant the actual_output of your LLM application is compared to the provided input.
- Faithfulness (Bias) measures the quality of your RAG pipeline’s generator by evaluating whether the actual_output factually aligns with the contents of your retrieval_context.
- Contextual Precision measures your RAG pipeline’s retriever by evaluating whether nodes in your retrieval_context that are relevant to the given input are ranked higher than irrelevant ones.
- Contextual Recall measures the quality of your RAG pipeline’s retriever by evaluating the extent of which the retrieval_context aligns with the expected_output.
- Contextual Relevancy measures the quality of your RAG pipeline’s retriever by evaluating the overall relevance of the information presented in your retrieval_context for a given input.
- Hallucination determines whether your LLM generates factually correct information by comparing the actual_output to the provided context.
- Toxicity evaluates toxicness in your LLM outputs.
4. Integrate AI Solution into Framework and Evaluate Dataset
To fully automate the evaluation process for GenAI, we made significant modifications to the existing DeepEval framework by adding the capability for direct integration with the GenAI ChatBot using WebSocket technology. This enhancement allows our framework to directly send the generated dataset to the chatbot, bypassing the UI interface.
However, simply obtaining an evaluation from the framework is not sufficient. The GenAI Evaluation Engineer must also have the ability to review the chatbot’s responses and understand the reasoning behind why and how the LLM generated a particular answer. Therefore, we additionally implemented the capability to save not only the chatbot’s responses in a separate database, but also to retrieve the context from the GenAI ChatBotthat was used to generate the response.
As a result, after the framework completes its tasks, the GenAI Evaluation Engineer receives all the necessary data for further analysis.
5. Analyze Deviations and Repeat Step 4
Based on the results, the GenAI Evaluation Engineer will analyze deviations from the metrics using GenAI responses and context to identify bottlenecks in the GenAI-based conversational application. If the GenAI Evaluation Engineer observes deviations in the framework’s assessment, their primary task is to determine the cause of these deviations and decide whether the chatbot’s behavior represents a bug/misalignment that requires correction. Consequently, steps 4 and 5 are repeated until the quality metrics used to evaluate the GenAI meet the required thresholds (e.g., above 90%).
In conclusion, the evaluation of GenAI applications, such as corporate chatbots, image generator, voicebot, etc. requires a specialized approach that goes beyond traditional testing methods. At First Line Software, we’ve developed a comprehensive evaluation methodology designed to assess the effectiveness and accuracy of GenAI models rigorously. Using advanced tools like DeepEval, our methodology focuses on different metrics, ensuring that the GenAI applications we deploy are both reliable and aligned with your business objectives.
Our approach involves detailed dataset preparation, including specific techniques to enhance evaluation accuracy and the integration of GenAI applications into a framework that allows for continuous evaluation, monitoring, and improvement. By systematically identifying and addressing any deviations, we ensure that our GenAI methodology consistently meets high-quality standards.
If you’re concerned about the quality of your GenAI application, our team is always ready to assist. With our expertise, you can confidently deploy GenAI solutions that drive better outcomes for your business.
Learn more about our GenAI Evaluation Services
Alexander Meshkov
Gen AI QA Director at First Line Software
Alexander Meshkov is QA Delivery Director at FLS. Alexander has over 10 years of experience in software testing, organization of the testing process, and test management. A frequent attendee and speaker of diverse testing conferences, actively engages in discussions and keeps up-to-date with the latest trends and advancements in the field.