Hi everyone,
I’ve just completed a project focused on quantifying RAG pipeline performance, specifically comparing different Llama models hosted on Groq. Given the speed of Groq, it’s an ideal platform for running the multiple iterations required for rigorous evaluation.
The Technical Stack
• Orchestration: LangChain.
• Inference: Groq API (testing Llama 3.1, 3.3, and 4).
• Vector Store: Chroma DB with Google embeddings.
• Structured Output: Pydantic was used to ensure the “LLM-as-a-Judge” returned evaluation metrics in a clean JSON format.
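To illustrate the structured-output point, here is a minimal sketch of what a judge schema can look like with Pydantic. The field names mirror the metrics discussed in this post (accuracy, completeness, conciseness); the exact schema, score scale, and model name are assumptions for illustration, not necessarily what the repo uses.

```python
from pydantic import BaseModel, Field

class JudgeScores(BaseModel):
    """Structured output schema for the LLM-as-a-Judge (illustrative)."""
    accuracy: int = Field(ge=0, le=100, description="Factual correctness of the answer (0-100)")
    completeness: int = Field(ge=0, le=100, description="How fully the answer addresses the question (0-100)")
    conciseness: int = Field(ge=0, le=100, description="Absence of unnecessary verbosity (0-100)")

# With LangChain, a schema like this can be bound to the judge model, e.g.:
# judge = ChatGroq(model="llama-3.3-70b-versatile").with_structured_output(JudgeScores)
# scores = judge.invoke(judge_prompt)  # returns a validated JudgeScores instance
```

Because Pydantic validates the ranges, an out-of-range or malformed judge response fails loudly instead of silently corrupting your averages.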
Evaluation Framework
I divided the evaluation into two distinct components:
1. Retrieval Metrics: Measured using Mean Reciprocal Rank (MRR) for single-answer queries, Mean Average Precision (MAP) for multi-document retrieval, and NDCG to evaluate the quality of the ranking order itself.
2. Generation Metrics: Used an “LLM-as-a-Judge” to score responses based on Accuracy, Completeness, and Conciseness.
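For readers who want the retrieval metrics concretely, the three formulas above can be sketched in plain Python as per-query functions (averaging them across queries gives MRR and MAP). Function names and the `doc_id` representation are illustrative, not the repo's exact API:

```python
import math

def reciprocal_rank(ranked_ids, relevant_id):
    """RR for a single-answer query: 1/rank of the first relevant document."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def average_precision(ranked_ids, relevant_ids):
    """AP for multi-document retrieval: precision averaged at each relevant hit."""
    hits, score = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            score += hits / rank
    return score / len(relevant_ids) if relevant_ids else 0.0

def ndcg(ranked_ids, relevance):
    """NDCG: discounted gain of the actual order divided by the ideal order.
    `relevance` maps doc_id -> graded relevance (0 if absent)."""
    dcg = sum(relevance.get(d, 0) / math.log2(i + 1)
              for i, d in enumerate(ranked_ids, start=1))
    ideal = sum(g / math.log2(i + 1)
                for i, g in enumerate(sorted(relevance.values(), reverse=True), start=1))
    return dcg / ideal if ideal else 0.0
```

A perfect retriever (correct context always ranked first) scores 1.0 on all three, which is what the test set below produced.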
Key Findings & Model Comparison
Using a custom dataset of company work policies, I found significant variations in generation performance:
• Llama 3.3: The top performer for this dataset, achieving 94% accuracy and 91% completeness.
• Llama 3.1: Reached 85% accuracy and 75% completeness.
• Llama 4: While accuracy remained high at 90%, there was a notable drop in conciseness (62%) compared to the other models.
• Retrieval: All models achieved a perfect 1.0 score for MRR, MAP, and NDCG on this specific test set, meaning the correct context was consistently ranked first.
Video Tutorial & Code
I’ve put together a full video walkthrough showing how to implement these formulas in Python and how to structure your evaluation pipeline for production.
YouTube Video: https://youtu.be/5syl6THrTGI
GitHub Repo: pratikskarnik/rag_evaluation
I’d love to hear how others are handling RAG evaluation—are you finding similar performance gaps between Llama 3.3 and 4 in your specific use cases?