RelevanceIQ: Grounded Product Search over 568K Reviews

Posted Jun 23, 2026 Updated Jul 5, 2026

By Abhijit More 5 min read

RelevanceIQ: Grounded Product Search

Most RAG demos are a vector store and a prompt. RelevanceIQ is a production-grade hybrid retrieval and citation-grounded RAG system over the Amazon Fine Food Reviews corpus (568K reviews / 74K products), where every design choice is measured, benchmarked, and served through a live FastAPI stack.

BM25 · dense retrieval · cross-encoder reranking · query reformulation · evidence-grounded generation, with a stage-by-stage visual UI.

Every claim is cited to a retrieved review (100% citation precision). The Live Pipeline view streams each retrieval stage as it runs.

Why This Exists

This is a deliberately engineered retrieval system: every design choice is measured against a labeled evaluation set, the serving path is benchmarked, and failure modes are visible rather than hidden.

It answers three questions, with numbers:

Does hybrid retrieval + reranking beat a strong lexical baseline, and by how much?
Does citation grounding reduce hallucination in RAG, measurably?
Can it be served at production latency, and what is the honest p95?

The stack combines OpenSearch, FAISS, sentence-transformers, Ollama, FastAPI, and MLflow into a fully local, end-to-end search and RAG platform.

Key Results

All numbers are measured, reproducible (make eval | rag | bench), and tracked in MLflow. Evaluation uses 100 queries with 3,625 LLM-judged graded relevance labels.

Hybrid retrieval + reranking: +17.7% NDCG@10 over BM25

Pipeline	NDCG@10	Recall@100	MAP	vs BM25
BM25 (lexical baseline)	0.805	0.661	0.622	—
Dense (bge-base)	0.941	0.697	0.689	+16.8%
Hybrid (RRF)	0.920	0.994	0.963	+14.2%
Hybrid + cross-encoder rerank	0.948	0.994	0.978	+17.7%

The headline is not just NDCG. Hybrid lifts Recall@100 from 0.66 to 0.99. BM25 alone misses a third of relevant reviews that dense retrieval catches. Fusion plus reranking wins on precision and recall at the same time.

Citation-grounded RAG: 80% fewer hallucinations

Metric	Naive RAG	Grounded RAG
Faithfulness (LLM-judged)	87.5%	97.5%
Citation precision	—	100%

Grounding uses query reformulation, mandatory [doc_id] citations, and post-hoc validation that drops uncited claims. That cuts the unfaithful-answer rate from 12.5% to 2.5%, an 80% relative reduction, while every citation points to a genuinely retrieved document.

Serving latency: sub-80ms p95 hybrid retrieval

Warm, measured under load (concurrency 4, 300 requests/mode):

Mode	p95 latency	sustained QPS
BM25 (lexical)	4.6ms	1,181
Dense (bge-base + FAISS)	35.6ms	123
Hybrid (BM25 + dense + RRF)	46ms	99
+ cross-encoder rerank	~670ms	—
End-to-end RAG (retrieve → rerank → generate)	~5.7s	—

Honest latency framing: FAISS HNSW vector search alone is 0.63ms. The index is never the bottleneck. Hybrid throughput plateaus at ~100 QPS because query embedding serializes on a single laptop GPU (MPS), while BM25 sustains over 1,000 QPS with no GPU step. The full end-to-end RAG path is ~5.7s, almost entirely local LLM generation; retrieval is ~56ms of it.

Architecture

Layer	Choice	Why
Lexical	OpenSearch 2.17 BM25	Battle-tested inverted index, exact term matching
Dense	`bge-base-en-v1.5` (768-d) + FAISS HNSW	Strong semantic recall; sub-ms ANN search
Fusion	Reciprocal Rank Fusion	Scale-free, robust, no tuning needed
Rerank	`bge-reranker-base` cross-encoder	Joint (query, doc) scoring for top-k precision
LLM	Ollama `qwen3:8b`	Local; reasoning off for eval (14× faster), on for answers
Eval	`pytrec_eval` + LLM-judged graded relevance	Rigorous IR metrics without human labels
Serving	FastAPI + Uvicorn · MLflow	One-process serving; full experiment provenance

Query → reformulation → parallel BM25 + dense retrieval → RRF fusion
      → cross-encoder rerank → grounded generation → citation validation → cited answer

Interactive UI

A dependency-free web app (no build step) that makes the pipeline legible. Run make serve and open http://127.0.0.1:8000/.

Live progress — each stage shows real latency as it runs
Side-by-side retrieval — see which reviews lexical found that semantic did not, and vice versa
Rank-change deltas — watch the cross-encoder pull a review from rank #9 to #1
Grounded answer — types out with clickable citations; click any hit to read the full review

Three tabs: Live Pipeline · Grounded RAG · Results dashboard

Engineering That Holds Up

Built to survive failure, not just demo once:

Resumable jobs — embedding, query generation, LLM judging, and RAG eval checkpoint to append-only JSONL and skip completed work on restart
Retry with backoff — external calls never silently corrupt a result; MLflow degrades to a no-op if the server is down
Honest evaluation — NDCG is scored over the judged pool; limitations are documented, not omitted

28 modules · ~2,800 lines of Python · 11 tests · zero absolute paths · fully self-contained.

Corpus vs. index counts: the raw corpus is 568,454 reviews. After removing exact duplicate (user, text) reviews, 394,718 unique reviews are indexed. Both numbers are reported honestly throughout.

Getting Started

Prerequisites: Python 3.11, Docker, Ollama (ollama pull qwen3:8b). Everything runs locally.

  
git clone https://github.com/AbhijitMore/relevanceiq.git
cd relevanceiq

uv venv --python 3.11 .venv && uv sync

make up           # OpenSearch + MLflow
make verify       # preflight checks
make download     # fetch dataset (568,454 rows)
make ingest       # clean + dedup → parquet
make index        # BM25 bulk index + FAISS HNSW
make eval-set     # 100 queries + LLM-judged qrels
make eval         # NDCG@10 / Recall@100 / MRR / MAP
make rag          # naive vs grounded faithfulness eval
make serve        # FastAPI + UI
make bench        # p50/p95/p99 latency + QPS
make test         # 11 unit tests

Limitations

LLM-judged relevance approximates human labels; relative gaps between methods are the trustworthy signal
Single laptop setup; production would use managed OpenSearch, GPU embedding/reranking, and sharded vector stores
Local LLM generation (qwen3:8b); the grounding mechanism and its measured 80% hallucination reduction are model-agnostic