
RelevanceIQ: Grounded Product Search over 568K Reviews


RelevanceIQ: Grounded Product Search

Most RAG demos are a vector store and a prompt. RelevanceIQ is a production-grade hybrid retrieval and citation-grounded RAG system over the Amazon Fine Food Reviews corpus (568K reviews / 74K products), where every design choice is measured, benchmarked, and served through a live FastAPI stack.

BM25 · dense retrieval · cross-encoder reranking · query reformulation · evidence-grounded generation, with a stage-by-stage visual UI.

Grounded RAG output

Every claim is cited to a retrieved review (100% citation precision). The Live Pipeline view streams each retrieval stage as it runs.


Why This Exists

This is a deliberately engineered retrieval system: every design choice is measured against a labeled evaluation set, the serving path is benchmarked, and failure modes are visible rather than hidden.

It answers three questions, with numbers:

  1. Does hybrid retrieval + reranking beat a strong lexical baseline, and by how much?
  2. Does citation grounding reduce hallucination in RAG, measurably?
  3. Can it be served at production latency, and what is the honest p95?

The stack combines OpenSearch, FAISS, sentence-transformers, Ollama, FastAPI, and MLflow into a fully local, end-to-end search and RAG platform.


Key Results

All numbers are measured, reproducible (make eval, make rag, make bench), and tracked in MLflow. Evaluation uses 100 queries with 3,625 LLM-judged graded relevance labels.

Results dashboard

Hybrid retrieval + reranking: +17.7% NDCG@10 over BM25

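| Pipeline | NDCG@10 | Recall@100 | MAP | NDCG@10 vs BM25 |
| --- | --- | --- | --- | --- |
| BM25 (lexical baseline) | 0.805 | 0.661 | 0.622 | - |
| Dense (bge-base) | 0.941 | 0.697 | 0.689 | +16.8% |
| Hybrid (RRF) | 0.920 | 0.994 | 0.963 | +14.2% |
| Hybrid + cross-encoder rerank | 0.948 | 0.994 | 0.978 | +17.7% |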

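The headline is not just NDCG. Hybrid lifts Recall@100 from 0.66 (BM25) to 0.99. BM25 alone misses about a third of the relevant reviews, and dense retrieval alone (0.70) misses nearly as many; fused, the two recover almost all of them. Adding the cross-encoder reranker then gives the best NDCG@10 and MAP without giving up any of that recall.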
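To make the two headline metrics concrete, here is a minimal pure-Python sketch of NDCG@k and Recall@k for a single query. It is an illustration, not the project's evaluation code (the post says that runs through pytrec_eval). The function names, the linear gain, and the rule that unjudged documents count as relevance 0 are all assumptions made for this example.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k graded labels (linear gain)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """NDCG@k for one query.

    ranked_doc_ids: system ranking, best first.
    qrels: dict mapping doc_id -> graded relevance; unjudged docs count as 0 (assumption).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(qrels.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


def recall_at_k(ranked_doc_ids, qrels, k=100):
    """Fraction of relevant (grade > 0) docs that appear in the top k."""
    relevant = {doc_id for doc_id, grade in qrels.items() if grade > 0}
    if not relevant:
        return 0.0
    return len(relevant.intersection(ranked_doc_ids[:k])) / len(relevant)
```

A ranking that places every relevant review in ideal order scores 1.0 on NDCG@k; Recall@k ignores order and only asks whether the relevant reviews made the cut, which is why the two metrics can move independently in the table above.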

Citation-grounded RAG: 80% fewer hallucinations

| Metric | Naive RAG | Grounded RAG |
| --- | --- | --- |
| Faithfulness (LLM-judged) | 87.5% | 97.5% |
| Citation precision | - | 100% |

Grounding uses query reformulation, mandatory [doc_id] citations, and post-hoc validation that drops uncited claims. That cuts the unfaithful-answer rate from 12.5% to 2.5%, an 80% relative reduction, while every citation points to a genuinely retrieved document.
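The post-hoc validation step can be sketched roughly as follows. This is a hedged illustration rather than the project's implementation: it assumes citations appear inline as `[doc_id]` before the sentence's closing punctuation, uses a naive regex sentence splitter, and treats a sentence as valid only if it cites at least one document and every cited ID was actually retrieved. The names `validate_citations` and `citation_precision` are hypothetical, and the precision definition is an assumption.

```python
import re

# Matches an inline citation such as [r12]; the ID format is an assumption.
CITATION = re.compile(r"\[([^\[\]]+)\]")


def validate_citations(answer, retrieved_ids):
    """Split an answer into sentences and drop any that are uncited
    or that cite a document outside the retrieved set.

    Returns (validated_answer, dropped_sentences).
    """
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())
    kept, dropped = [], []
    for sentence in sentences:
        if not sentence:
            continue
        cited = CITATION.findall(sentence)
        if cited and all(doc_id in retrieved_ids for doc_id in cited):
            kept.append(sentence)
        else:
            dropped.append(sentence)
    return " ".join(kept), dropped


def citation_precision(answer, retrieved_ids):
    """Share of citations in the answer that point at a retrieved document."""
    cited = CITATION.findall(answer)
    if not cited:
        return 0.0
    return sum(doc_id in retrieved_ids for doc_id in cited) / len(cited)
```

Under this scheme an answer that survives validation has a citation precision of 1.0 by construction, because any sentence citing an unknown ID is removed before the answer is returned.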

Serving latency: sub-80ms p95 hybrid retrieval

Warm, measured under load (concurrency 4, 300 requests/mode):

| Mode | p95 latency | Sustained QPS |
| --- | --- | --- |
| BM25 (lexical) | 4.6ms | 1,181 |
| Dense (bge-base + FAISS) | 35.6ms | 123 |
| Hybrid (BM25 + dense + RRF) | 46ms | 99 |
| + cross-encoder rerank | ~670ms | - |
| End-to-end RAG (retrieve → rerank → generate) | ~5.7s | - |

Honest latency framing: FAISS HNSW vector search alone is 0.63ms. The index is never the bottleneck. Hybrid throughput plateaus at ~100 QPS because query embedding serializes on a single laptop GPU (MPS), while BM25 sustains over 1,000 QPS with no GPU step. The full end-to-end RAG path is ~5.7s, almost entirely local LLM generation; retrieval is ~56ms of it.
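For readers who want to reproduce this kind of measurement, the sketch below shows one way to collect p50/p95/p99 latency and sustained QPS at a fixed concurrency. It is not the project's `make bench` code: the `benchmark` and `percentile` helpers are hypothetical, the percentile uses the nearest-rank method (an assumption), and `fn` stands in for whatever issues one search request.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty sequence of numbers."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("percentile() needs at least one sample")
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def timed_call(fn):
    """Run fn once and return its wall-clock latency in milliseconds."""
    start = time.perf_counter()
    fn()
    return (time.perf_counter() - start) * 1000.0


def benchmark(fn, n_requests=300, concurrency=4):
    """Issue n_requests calls from a thread pool and summarize the results."""
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies_ms = list(pool.map(lambda _: timed_call(fn), range(n_requests)))
    elapsed_s = time.perf_counter() - wall_start
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "qps": n_requests / elapsed_s,
    }
```

Measuring QPS from total wall-clock time, rather than from the mean latency, is what exposes a serialized step such as the query-embedding bottleneck described above: per-request latency can look fine while throughput stops scaling with concurrency.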


Architecture

RelevanceIQ architecture

| Layer | Choice | Why |
| --- | --- | --- |
| Lexical | OpenSearch 2.17 BM25 | Battle-tested inverted index, exact term matching |
| Dense | bge-base-en-v1.5 (768-d) + FAISS HNSW | Strong semantic recall; sub-ms ANN search |
| Fusion | Reciprocal Rank Fusion | Scale-free, robust, no tuning needed |
| Rerank | bge-reranker-base cross-encoder | Joint (query, doc) scoring for top-k precision |
| LLM | Ollama qwen3:8b | Local; reasoning off for eval (14× faster), on for answers |
| Eval | pytrec_eval + LLM-judged graded relevance | Rigorous IR metrics without human labels |
| Serving | FastAPI + Uvicorn · MLflow | One-process serving; full experiment provenance |
Query → reformulation → parallel BM25 + dense retrieval → RRF fusion
      → cross-encoder rerank → grounded generation → citation validation → cited answer
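
The fusion step in the flow above is small enough to sketch in full. Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over the ranked lists it appears in, so it only needs ranks and never has to reconcile BM25 scores with cosine similarities. The function below is an illustrative sketch, not the project's code; `k=60` is the constant commonly used with RRF and is an assumption here, since the post does not state its value.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=None):
    """Fuse several ranked lists of doc IDs into one list, best first.

    rankings: iterable of lists, each ordered best-first (e.g. BM25 hits, dense hits).
    k: smoothing constant; 60 is a common default (assumption, not from the post).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc ID so the output is deterministic.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused[:top_n] if top_n else fused


bm25_hits = ["a", "b", "c"]
dense_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Documents found by both retrievers accumulate two terms and rise to the top, while documents found by only one still make the fused list, which is how fusion can raise recall above either retriever alone.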

Interactive UI

A dependency-free web app (no build step) that makes the pipeline legible. Run make serve and open http://127.0.0.1:8000/.

  • Live progress — each stage shows real latency as it runs
  • Side-by-side retrieval — see which reviews lexical found that semantic did not, and vice versa
  • Rank-change deltas — watch the cross-encoder pull a review from rank #9 to #1
  • Grounded answer — types out with clickable citations; click any hit to read the full review
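
The rank-change deltas shown in the UI boil down to comparing each document's position before and after reranking. A minimal sketch, assuming both stages expose an ordered list of document IDs (the `rank_deltas` helper and the ID values are hypothetical):

```python
def rank_deltas(before, after):
    """Compare two rankings of doc IDs (best first).

    Returns one tuple per reranked doc: (doc_id, old_rank, new_rank, delta),
    where a positive delta means the doc moved up. Docs absent from the
    original ranking get old_rank and delta of None.
    """
    old_rank = {doc_id: rank for rank, doc_id in enumerate(before, start=1)}
    rows = []
    for new_rank, doc_id in enumerate(after, start=1):
        previous = old_rank.get(doc_id)
        delta = None if previous is None else previous - new_rank
        rows.append((doc_id, previous, new_rank, delta))
    return rows


fused = ["r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8", "r9"]
reranked = ["r9", "r1", "r2"]
rows = rank_deltas(fused, reranked)
```

Here the first row records the review that the reranker promoted from rank 9 to rank 1, a delta of +8.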

Three tabs: Live Pipeline · Grounded RAG · Results dashboard

Live pipeline view


Engineering That Holds Up

Built to survive failure, not just demo once:

  • Resumable jobs — embedding, query generation, LLM judging, and RAG eval checkpoint to append-only JSONL and skip completed work on restart
  • Retry with backoff — external calls never silently corrupt a result; MLflow degrades to a no-op if the server is down
  • Honest evaluation — NDCG is scored over the judged pool; limitations are documented, not omitted
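
The first two points above follow a common pattern that can be sketched in a few lines: append one JSON record per finished item, read the file back on startup to learn what is already done, and wrap each external call in a retry with exponential backoff. This is an illustrative sketch under stated assumptions, not the project's code. The record shape (`id`, `result`), the function names, and the retry parameters are all hypothetical.

```python
import json
import os
import time


def load_done_ids(path):
    """Return the IDs already recorded in an append-only JSONL checkpoint."""
    done = set()
    if not os.path.exists(path):
        return done
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["id"])
            except (json.JSONDecodeError, KeyError):
                # A crash mid-write can leave a truncated line; skip it so the item is redone.
                continue
    return done


def with_retry(fn, attempts=4, base_delay=0.5):
    """Call fn, retrying on any exception with exponential backoff; re-raise on the last attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)


def run_resumable(items, process, path):
    """Process items not yet in the checkpoint, appending each result as it completes."""
    done = load_done_ids(path)
    # If a previous run died mid-line, start on a fresh line so records never merge.
    needs_newline = False
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path, "rb") as f:
            f.seek(-1, os.SEEK_END)
            needs_newline = f.read(1) != b"\n"
    with open(path, "a", encoding="utf-8") as out:
        if needs_newline:
            out.write("\n")
        for item in items:
            if item["id"] in done:
                continue
            result = with_retry(lambda: process(item))
            out.write(json.dumps({"id": item["id"], "result": result}) + "\n")
            out.flush()
```

Because a record is written only after `process` returns successfully, a failed call can never leave a partial result in the checkpoint; the worst case after a crash is that one item is processed again.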

28 modules · ~2,800 lines of Python · 11 tests · zero absolute paths · fully self-contained.

Corpus vs. index counts: the raw corpus is 568,454 reviews. After removing exact duplicate (user, text) reviews, 394,718 unique reviews are indexed. Both numbers are reported honestly throughout.
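
The deduplication rule can be sketched as a single pass that keeps the first review seen for each (user, text) pair. This is a hedged illustration: the field names `UserId` and `Text` are assumed to match the dataset's columns, keeping the first occurrence is an assumption, and the project may well do this with a dataframe operation instead.

```python
def dedup_reviews(rows):
    """Drop exact (user, text) duplicates, keeping the first occurrence.

    rows: iterable of dicts with "UserId" and "Text" keys (assumed field names).
    """
    seen = set()
    unique = []
    for row in rows:
        key = (row["UserId"], row["Text"])
        if key in seen:
            continue
        seen.add(key)
        unique.append(row)
    return unique


reviews = [
    {"UserId": "u1", "ProductId": "p1", "Text": "Great tea."},
    {"UserId": "u1", "ProductId": "p2", "Text": "Great tea."},  # same user, same text
    {"UserId": "u2", "ProductId": "p1", "Text": "Great tea."},  # different user, kept
]
unique_reviews = dedup_reviews(reviews)
```

By the post's own counts, this step removes 568,454 - 394,718 = 173,736 rows, roughly 30.6% of the raw corpus.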


Getting Started

Prerequisites: Python 3.11, Docker, Ollama (ollama pull qwen3:8b). Everything runs locally.

git clone https://github.com/AbhijitMore/relevanceiq.git
cd relevanceiq

uv venv --python 3.11 .venv && uv sync

make up           # OpenSearch + MLflow
make verify       # preflight checks
make download     # fetch dataset (568,454 rows)
make ingest       # clean + dedup → parquet
make index        # BM25 bulk index + FAISS HNSW
make eval-set     # 100 queries + LLM-judged qrels
make eval         # NDCG@10 / Recall@100 / MRR / MAP
make rag          # naive vs grounded faithfulness eval
make serve        # FastAPI + UI
make bench        # p50/p95/p99 latency + QPS
make test         # 11 unit tests

Limitations

  • LLM-judged relevance approximates human labels; relative gaps between methods are the trustworthy signal
  • Single laptop setup; production would use managed OpenSearch, GPU embedding/reranking, and sharded vector stores
  • Local LLM generation (qwen3:8b); the grounding mechanism and its measured 80% hallucination reduction are model-agnostic

Contributions and feedback are welcome.


This post is licensed under CC BY 4.0 by the author.