Frequently Asked Questions

Should we filter permissions before or after retrieving vector search results?

Filter before, not after. Query-time metadata filtering (pre-filtering) on tenant ID, department, or classification level works best with vector databases that support it, because post-filtering after an approximate nearest-neighbor search can silently return fewer results than the requested top-k once unauthorized chunks are removed. Test this explicitly, since a query expected to return five chunks may return only two after permission checks strip out the rest.

How often should we re-index our knowledge base for RAG?

Full re-embedding on every change is too slow and expensive at scale, so incremental re-indexing based on content hashes or version identifiers is essential. High-churn corpora like internal wikis or ticketing systems benefit from a faster cadence for frequently accessed documents and a slower cadence for the long tail. Deletions should propagate promptly, since stale chunks retrieved after their source is removed are a common cause of hallucinated citations.

Is it safe to switch embedding models after a RAG system is already in production?

Not without a full migration, since every vector in the store becomes incompatible with the new model and must be re-embedded from scratch. The safest approach is a blue-green rollout: build the new index in parallel, run shadow evaluation against production traffic, and cut over only once the new index outperforms the old one on your golden dataset.

Can automated LLM-judge metrics replace human review in RAG evaluation?

No. LLM judges are useful for quickly catching regressions but have blind spots and can be gamed by verbose, hedge-everything answers that score well on faithfulness while being useless to users. Pair automated scoring with a monthly human audit of a random sample and track agreement between the two so you know when the automated judge needs recalibration.

Building Production RAG Systems: Lessons from Multiple Deployments

Retrieval-augmented generation (RAG) looks simple in a notebook: embed documents, store vectors, attach a prompt, call a model. Production is a different beast—latency budgets, stale content, authorization boundaries, and evaluation loops dominate the engineering calendar.

Chunking is the first lever. Semantic chunks usually outperform arbitrary token windows for factual recall, but they require investment in cleaning HTML/PDF noise and preserving headings and tables. Hybrid retrieval (BM25 + vectors) still wins on many enterprise corpora, where keyword overlap matters as much as semantic similarity.
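
One way to combine a BM25 ranking with a vector ranking is reciprocal rank fusion (RRF). The article does not say which fusion method to use, so treat this as one possible sketch: the function name, the document IDs, and the constant k=60 (a commonly used default, not a tuned value) are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["doc_a", "doc_b", "doc_c"]    # hypothetical keyword results
vector_hits = ["doc_c", "doc_a", "doc_d"]  # hypothetical embedding results
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

RRF only needs ranks, not comparable scores, which is convenient because BM25 scores and cosine similarities live on different scales.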

Evaluation cannot be an afterthought. You need labeled question-answer pairs from real users, automatic regression suites on golden datasets, and online checks for toxicity, PII leakage, and citation faithfulness. Without these, teams chase anecdotal bugs while the model silently drifts as the knowledge base changes.

Latency and cost follow from architecture: cache embeddings for stable corpora, stream tokens to the UI, batch where possible, and cap context windows deliberately. Observability should include retrieval traces—what chunks were selected, with what scores—so incidents are debuggable without reproducing user sessions by hand.

Human-in-the-loop remains essential for regulated domains or high-stakes answers. Design explicit escalation paths, review-queue tooling, and feedback capture that feeds into chunk metadata and evaluation sets.

Finally, treat RAG as a data product. Owners, SLAs, and change management for the knowledge base matter more than the embedding model du jour. If your content pipeline is messy, RAG will amplify the mess—fix ingestion and metadata before chasing marginal recall gains.


Access control is a first-class retrieval concern

Most teams bolt on document-level permissions after the first security review flags that RAG happily surfaces content a user shouldn't see. Retrofitting this is expensive because it touches indexing, retrieval, and caching all at once. Design the permission model before you pick a vector store: decide whether you filter at query time with metadata predicates, maintain per-tenant indexes, or post-filter results and accept the recall loss.

Query-time filtering on metadata (tenant ID, department, classification level) is the most common approach and works well with vector databases that support pre-filtering rather than post-filtering, since post-filtering after an approximate nearest-neighbor search can return fewer results than the requested top-k once permission checks remove candidates. Test this explicitly—teams are frequently surprised when a query that should return five chunks returns two because three were filtered out after retrieval.
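
The top-k shortfall is easy to reproduce on a toy corpus. The sketch below uses exact brute-force cosine search rather than a real approximate nearest-neighbor index, and the tenant names, vectors, and function names are made up for illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, chunks, k):
    return sorted(chunks, key=lambda c: cosine(query, c["vec"]), reverse=True)[:k]


def post_filter_search(query, chunks, tenant, k):
    # Retrieve top-k first, then drop chunks the caller may not see.
    return [c for c in top_k(query, chunks, k) if c["tenant"] == tenant]


def pre_filter_search(query, chunks, tenant, k):
    # Restrict the candidate set first, then take top-k among allowed chunks.
    allowed = [c for c in chunks if c["tenant"] == tenant]
    return top_k(query, allowed, k)


chunks = [
    {"id": 1, "tenant": "acme", "vec": [1.0, 0.0]},
    {"id": 2, "tenant": "globex", "vec": [0.9, 0.1]},
    {"id": 3, "tenant": "globex", "vec": [0.8, 0.2]},
    {"id": 4, "tenant": "acme", "vec": [0.5, 0.5]},
    {"id": 5, "tenant": "acme", "vec": [0.0, 1.0]},
]
query = [1.0, 0.0]

post_ids = [c["id"] for c in post_filter_search(query, chunks, "acme", k=3)]
pre_ids = [c["id"] for c in pre_filter_search(query, chunks, "acme", k=3)]
print(post_ids, pre_ids)
```

With k=3, the post-filtered call returns a single chunk because two of the three nearest neighbors belong to another tenant, while the pre-filtered call still returns three.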

For multi-tenant SaaS products, per-tenant indexes trade operational complexity for a cleaner security boundary and simpler debugging. Shared indexes with metadata filtering scale better operationally but require rigorous testing to prove there is no cross-tenant leakage, especially as you add new document types or change embedding models. Whichever you choose, write an automated test suite that attempts unauthorized retrieval and fails the build if it succeeds.
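
A leakage test can be as simple as running probe queries as every tenant and asserting that nothing foreign comes back. In this sketch, `search` is a hypothetical stand-in for your real retrieval call, and the index contents and probe terms are invented; in practice the test would run against a staging index with seeded documents.

```python
def search(index, query_terms, tenant_id, k=5):
    """Hypothetical stand-in for the retrieval call under test.

    Filters by tenant first, then ranks by naive keyword overlap.
    """
    allowed = [c for c in index if c["tenant_id"] == tenant_id]
    wanted = set(query_terms)
    ranked = sorted(
        allowed,
        key=lambda c: len(wanted & set(c["text"].lower().split())),
        reverse=True,
    )
    return ranked[:k]


def find_cross_tenant_leaks(index, probes, search_fn, k=5):
    """Run every probe as every tenant; return (tenant, chunk_id) for foreign chunks."""
    leaks = []
    for tenant_id in sorted({c["tenant_id"] for c in index}):
        for terms in probes:
            for chunk in search_fn(index, terms, tenant_id, k):
                if chunk["tenant_id"] != tenant_id:
                    leaks.append((tenant_id, chunk["id"]))
    return leaks


INDEX = [
    {"id": "a1", "tenant_id": "acme", "text": "Acme salary bands for 2024"},
    {"id": "g1", "tenant_id": "globex", "text": "Globex salary bands and bonus plan"},
    {"id": "g2", "tenant_id": "globex", "text": "Globex product roadmap"},
]
PROBES = [["salary", "bands"], ["roadmap"]]


def test_no_cross_tenant_leakage():
    # Fails the build if any probe returns a chunk owned by another tenant.
    assert find_cross_tenant_leaks(INDEX, PROBES, search) == []
```

Written in this shape, the test can be picked up by a runner such as pytest. Passing the search function in as a parameter makes it easy to point the same check at each retrieval path, including cached ones.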

Audit logging deserves the same rigor as the retrieval path itself. Log which chunks were returned to which user for which query, and retain enough context to reconstruct an answer's provenance during an incident review. Regulated customers will ask for this during procurement, and retrofitting audit logging into a system that wasn't built for it usually means re-architecting the retrieval layer.
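
As a rough illustration, an audit entry might capture the user, the query, the index version, and the returned chunks with their scores. Every field name below is an assumption rather than a standard schema, and whether to store raw queries or a redacted form depends on your own data-handling rules.

```python
import json
from datetime import datetime, timezone


def build_audit_record(user_id, query, retrieved, index_version, now=None):
    """Build one JSON-serializable audit entry for a single retrieval call.

    retrieved: list of dicts with hypothetical keys chunk_id, doc_id, score.
    """
    now = now or datetime.now(timezone.utc)
    return {
        "timestamp": now.isoformat(),
        "user_id": user_id,
        "query": query,
        "index_version": index_version,
        "chunks": [
            {"chunk_id": c["chunk_id"], "doc_id": c["doc_id"], "score": c["score"]}
            for c in retrieved
        ],
    }


record = build_audit_record(
    user_id="u-123",
    query="What is the parental leave policy?",
    retrieved=[{"chunk_id": "c-9", "doc_id": "hr-policy-7", "score": 0.83}],
    index_version="2024-06-v3",
)
# One line per retrieval call, suitable for an append-only log.
audit_line = json.dumps(record, sort_keys=True)
print(audit_line)
```

Recording the index version alongside the chunk IDs is what lets you reconstruct provenance later, even after the underlying documents have been re-indexed.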

Managing knowledge base drift and freshness

A RAG system's accuracy degrades the moment the underlying documents change and the index doesn't catch up. Source documents get edited, deprecated, or deleted, and the vector store has no inherent awareness of any of this unless you build a synchronization pipeline. Treat ingestion as a continuous pipeline with the same rigor as a CI/CD system, not a one-time batch job run during the proof-of-concept phase.

Incremental re-indexing is non-negotiable at any real scale—full re-embedding of a large corpus on every content change is slow and expensive, and teams that skip incremental updates end up running batch jobs so infrequently that the index is stale for days. Track content hashes or version identifiers per source document so you only re-embed what actually changed, and propagate deletions promptly; a stale chunk that still gets retrieved after its source was removed is one of the most common causes of hallucinated citations.
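
Hash-based change detection can be sketched in a few lines. This assumes you store one SHA-256 hash per source document alongside the index; the function names and the dictionary shapes are illustrative only.

```python
import hashlib


def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(source_docs, indexed_hashes):
    """Compare current sources against the hashes recorded at last indexing.

    source_docs: {doc_id: text} as currently published.
    indexed_hashes: {doc_id: sha256 hex digest} stored with the index.
    Returns (to_embed, to_delete) as sorted lists of doc IDs.
    """
    # New documents have no stored hash; edited documents have a different one.
    to_embed = sorted(
        doc_id
        for doc_id, text in source_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    # Anything indexed but no longer published must be removed from the store.
    to_delete = sorted(set(indexed_hashes) - set(source_docs))
    return to_embed, to_delete


indexed = {
    "a": content_hash("version 1"),
    "b": content_hash("unchanged"),
    "c": content_hash("since deleted"),
}
current = {"a": "version 2", "b": "unchanged", "d": "brand new"}
to_embed, to_delete = plan_reindex(current, indexed)
print(to_embed, to_delete)
```

Here only the edited document and the new one are re-embedded, the unchanged one is skipped, and the deleted one is scheduled for removal.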

Freshness also matters for the retrieval ranking itself. Recency-weighted scoring, or simply surfacing document timestamps to the generation step, prevents the model from confidently citing a policy that was superseded six months ago. For high-churn corpora like internal wikis or ticketing systems, consider a shorter re-index cadence for a hot subset of frequently accessed documents and a slower cadence for the long tail.
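
Recency weighting can take many forms. One simple option, shown here purely as a sketch, blends the similarity score with an exponential freshness decay. The 180-day half-life and the 0.3 blend weight are placeholder values to tune against your own evaluation set, not recommendations.

```python
from datetime import datetime, timedelta, timezone


def recency_weighted_score(similarity, updated_at, now, half_life_days=180.0, weight=0.3):
    """Blend similarity with a freshness term that halves every half_life_days."""
    age_days = max((now - updated_at).total_seconds() / 86400.0, 0.0)
    freshness = 0.5 ** (age_days / half_life_days)
    return (1.0 - weight) * similarity + weight * freshness


now = datetime(2024, 7, 1, tzinfo=timezone.utc)
# A year-old policy with slightly higher similarity...
superseded = recency_weighted_score(0.82, now - timedelta(days=365), now)
# ...versus its month-old replacement.
current = recency_weighted_score(0.80, now - timedelta(days=30), now)
print(round(superseded, 3), round(current, 3))
```

In this example the newer document outranks the older one despite its lower raw similarity, which is the behavior you want when a policy has been superseded.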

Finally, instrument drift detection. Track retrieval quality metrics over time against a fixed evaluation set, and alert when scores degrade—this is often the first signal that upstream content has changed in a way that broke assumptions your chunking or embedding strategy depended on, well before users start filing complaints.
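
A minimal version of this check computes recall@k on the fixed evaluation set and compares it with a rolling baseline. The metric choice, the four-run window, and the 0.05 drop threshold are assumptions for illustration; `retrieve` stands in for whatever function returns ranked chunk IDs in your system.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant IDs that appear in the top-k retrieved IDs."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def mean_recall(eval_set, retrieve, k=5):
    """Average recall@k over items shaped like {"question": ..., "relevant_ids": [...]}."""
    scores = [recall_at_k(retrieve(q["question"]), q["relevant_ids"], k) for q in eval_set]
    return sum(scores) / len(scores)


def drift_alert(history, current, window=4, max_drop=0.05):
    """True when current falls more than max_drop below the recent average."""
    recent = history[-window:]
    if not recent:
        return False
    baseline = sum(recent) / len(recent)
    return (baseline - current) > max_drop


history = [0.82, 0.81, 0.83, 0.82]  # hypothetical weekly recall@5 values
print(drift_alert(history, 0.74), drift_alert(history, 0.80))
```

Comparing against a rolling average rather than the last run alone keeps a single noisy evaluation from paging anyone.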

Choosing and tuning the embedding model

Teams often treat the embedding model as a fixed decision made in week one and never revisited, but it is one of the highest-leverage choices in the whole system. Off-the-shelf general-purpose embeddings work reasonably well for broad web-style content, but domain-specific corpora—legal contracts, medical records, industrial manuals—frequently benefit from a fine-tuned or domain-adapted embedding model. Benchmark this against your own evaluation set rather than trusting public leaderboards, since public benchmarks rarely resemble your actual document mix.

Changing the embedding model later is a bigger operation than it looks, because every vector in the store becomes incompatible with the new model and needs full re-embedding. Version your embedding model alongside your index schema, and plan migrations as a blue-green rollout: build the new index in parallel, run shadow evaluation against production traffic, and cut over only once the new index outperforms the old one on your golden dataset.
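
The versioning and cutover rule can be made explicit in code. This sketch assumes you record the embedding model name, a schema version, and a golden-dataset score for each index; the class, field names, and model identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexVersion:
    name: str
    embedding_model: str
    schema_version: int
    golden_score: float  # e.g. mean recall@k on the golden dataset


def check_query_model(index, query_model):
    """Refuse to query an index with vectors from a different embedding model."""
    if query_model != index.embedding_model:
        raise ValueError(
            f"index {index.name!r} was built with {index.embedding_model!r}, "
            f"not {query_model!r}"
        )


def should_cut_over(live, candidate, min_gain=0.0):
    """Promote the candidate only if it beats the live index on the golden dataset."""
    return (candidate.golden_score - live.golden_score) > min_gain


blue = IndexVersion("blue", "embed-v1", schema_version=3, golden_score=0.78)
green = IndexVersion("green", "embed-v2", schema_version=4, golden_score=0.81)
print(should_cut_over(blue, green))
```

Setting `min_gain` above zero is one way to avoid cutting over on an improvement that is within evaluation noise.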

Dimensionality and cost also matter more at scale than people expect. Higher-dimensional embeddings can improve recall marginally but increase storage and query latency substantially once you're indexing millions of chunks. Quantization can cut storage sharply: relative to float32, int8 embeddings use one byte per dimension instead of four (about 4x smaller), and binary embeddings use one bit per dimension (about 32x smaller). The price is a measurable recall penalty, usually small for int8, so test this trade-off explicitly against your evaluation set before committing to it in production.
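
To make the storage arithmetic concrete, here is a sketch of symmetric int8 scalar quantization using only the standard library. Real vector stores implement quantization internally and often differently; the function names and the four-dimensional vector are illustrative.

```python
from array import array


def quantize_int8(vector):
    """Symmetric scalar quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(x) for x in vector) or 1.0  # avoid dividing by zero
    scale = max_abs / 127.0
    codes = array("b", (round(x / scale) for x in vector))
    return codes, scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


vec = [0.12, -0.5, 0.33, 0.0]
codes, scale = quantize_int8(vec)

float32_bytes = array("f", vec).itemsize * len(vec)  # 4 bytes per dimension
int8_bytes = codes.itemsize * len(codes)             # 1 byte per dimension
max_error = max(abs(a - b) for a, b in zip(vec, dequantize(codes, scale)))
print(float32_bytes, int8_bytes, max_error)
```

The reconstruction error per dimension is bounded by half the scale step, which is why the recall penalty tends to be small but still worth measuring.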

Evaluation loops that survive contact with real users

Golden datasets built once during the pilot phase go stale fast, because they reflect the questions your team anticipated rather than the questions users actually ask. Set up a pipeline that samples real production queries on a rolling basis, routes a subset to human reviewers for labeling, and folds the results back into your evaluation suite every sprint. This keeps the benchmark honest instead of measuring performance against a snapshot from six months ago.

Automated metrics like answer relevance and faithfulness scores from an LLM judge are useful for catching regressions quickly, but they are not a substitute for human review on a sampled basis. LLM judges have their own blind spots and can be gamed by verbose, hedge-everything answers that score well on faithfulness while being useless to the user. Pair automated scoring with a monthly human audit of a random sample, and track agreement between the two so you know when the automated judge needs recalibration.
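
Agreement can be tracked in several ways. Cohen's kappa is one common choice because it corrects for agreement that would occur by chance; the article does not prescribe it, so treat this as one option. The labels below are invented.

```python
def cohens_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between two raters over the same items."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    categories = set(judge_labels) | set(human_labels)
    # Agreement expected if both raters labeled at random with their own base rates.
    expected = sum(
        (judge_labels.count(c) / n) * (human_labels.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


judge = ["pass", "pass", "pass", "fail", "pass", "fail", "pass", "pass"]
human = ["pass", "fail", "pass", "fail", "fail", "fail", "pass", "pass"]
kappa = cohens_kappa(judge, human)
print(kappa)
```

In this sample the raters agree on 6 of 8 items, but because chance agreement is 0.5 the kappa is only 0.5. A falling kappa from one audit to the next is a reasonable trigger for recalibrating the judge prompt.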

Segment your evaluation by query type and user cohort rather than reporting a single aggregate score. A system that performs well on simple factual lookups but poorly on multi-hop reasoning questions will hide that weakness behind a healthy average unless you break the numbers down. The same applies to different content sources within the corpus—one poorly chunked document type can drag down overall quality metrics for the whole system without anyone noticing which source is responsible.
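
The breakdown itself is a small group-by. The result rows and the `query_type` and `source` keys below are hypothetical; use whatever segment labels your evaluation records already carry.

```python
from collections import defaultdict


def scores_by_segment(results, key):
    """Mean score per value of the given segment key."""
    buckets = defaultdict(list)
    for row in results:
        buckets[row[key]].append(row["score"])
    return {segment: sum(vals) / len(vals) for segment, vals in buckets.items()}


results = [
    {"query_type": "lookup", "source": "wiki", "score": 1.0},
    {"query_type": "lookup", "source": "wiki", "score": 1.0},
    {"query_type": "lookup", "source": "pdf", "score": 1.0},
    {"query_type": "lookup", "source": "wiki", "score": 0.8},
    {"query_type": "multi_hop", "source": "pdf", "score": 0.5},
    {"query_type": "multi_hop", "source": "pdf", "score": 0.3},
]

overall = sum(r["score"] for r in results) / len(results)
by_type = scores_by_segment(results, "query_type")
by_source = scores_by_segment(results, "source")
print(round(overall, 2), by_type, by_source)
```

The aggregate here is about 0.77, which looks acceptable until the per-type view shows multi-hop questions scoring 0.4.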
