Hi everyone,
I’ve just completed a project focused on quantifying RAG pipeline performance, specifically comparing different Llama models hosted on Groq. Given the speed of Groq, it’s an ideal platform for running the multiple iterations required for rigorous evaluation.
The Technical Stack
• Orchestration: LangChain.
• Inference: Groq API (testing Llama 3.1, 3.3, and 4).
• Vector Store: Chroma DB with Google embeddings.
• Structured Output: Pydantic was used to ensure the “LLM-as-a-Judge” returned evaluation metrics in a clean JSON format.
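To illustrate the structured-output point, here is a minimal sketch of what a judge schema can look like with Pydantic. The field names mirror the metrics discussed in this post (accuracy, completeness, conciseness); the exact schema, score scale, and model name are assumptions for illustration, not necessarily what the repo uses.

```python
from pydantic import BaseModel, Field

class JudgeScores(BaseModel):
    """Structured output schema for the LLM-as-a-Judge (illustrative)."""
    accuracy: int = Field(ge=0, le=100, description="Factual correctness of the answer (0-100)")
    completeness: int = Field(ge=0, le=100, description="How fully the answer addresses the question (0-100)")
    conciseness: int = Field(ge=0, le=100, description="Absence of unnecessary verbosity (0-100)")

# With LangChain, a schema like this can be bound to the judge model, e.g.:
# judge = ChatGroq(model="llama-3.3-70b-versatile").with_structured_output(JudgeScores)
# scores = judge.invoke(judge_prompt)  # returns a validated JudgeScores instance
```

Because Pydantic validates the ranges, an out-of-range or malformed judge response fails loudly instead of silently corrupting your averages.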
Evaluation Framework
I divided the evaluation into two distinct components:
1. Retrieval Metrics: Measured using Mean Reciprocal Rank (MRR) for single-answer queries, Mean Average Precision (MAP) for multi-document retrieval, and NDCG to evaluate the quality of the ranking order itself.
2. Generation Metrics: Used an “LLM-as-a-Judge” to score responses based on Accuracy, Completeness, and Conciseness.
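For readers who want the retrieval metrics concretely, the three formulas above can be sketched in plain Python as per-query functions (averaging them across queries gives MRR and MAP). Function names and the `doc_id` representation are illustrative, not the repo's exact API:

```python
import math

def reciprocal_rank(ranked_ids, relevant_id):
    """RR for a single-answer query: 1/rank of the first relevant document."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def average_precision(ranked_ids, relevant_ids):
    """AP for multi-document retrieval: precision averaged at each relevant hit."""
    hits, score = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            score += hits / rank
    return score / len(relevant_ids) if relevant_ids else 0.0

def ndcg(ranked_ids, relevance):
    """NDCG: discounted gain of the actual order divided by the ideal order.
    `relevance` maps doc_id -> graded relevance (0 if absent)."""
    dcg = sum(relevance.get(d, 0) / math.log2(i + 1)
              for i, d in enumerate(ranked_ids, start=1))
    ideal = sum(g / math.log2(i + 1)
                for i, g in enumerate(sorted(relevance.values(), reverse=True), start=1))
    return dcg / ideal if ideal else 0.0
```

A perfect retriever (correct context always ranked first) scores 1.0 on all three, which is what the test set below produced.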
Key Findings & Model Comparison
Using a custom dataset of company work policies, I found significant variations in generation performance:
• Llama 3.3: The top performer for this dataset, achieving 94% accuracy and 91% completeness.
• Llama 3.1: Reached 85% accuracy and 75% completeness.
• Llama 4: While accuracy remained high at 90%, there was a notable drop in conciseness (62%) compared to the other models.
• Retrieval: All models achieved a perfect 1.0 score for MRR, MAP, and NDCG on this specific test set, meaning the correct context was consistently ranked first.
Video Tutorial & Code
I’ve put together a full video walkthrough showing how to implement these formulas in Python and how to structure your evaluation pipeline for production.
YouTube Video: https://youtu.be/5syl6THrTGI
GitHub Repo: pratikskarnik/rag_evaluation
I’d love to hear how others are handling RAG evaluation—are you finding similar performance gaps between Llama 3.3 and 4 in your specific use cases?