Building RAG Systems That Scale

Production RAG systems often hit two walls: context limits and citation quality. In a recent project we tackled both.

First, we moved from naive chunk retrieval to a hybrid approach that combines semantic search with metadata filters and a re-ranking step. That change alone improved relevance. We then added a post-processing step that normalizes and deduplicates citations, which made citation accuracy consistently high where it had previously been erratic.
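To make the two steps concrete, here is a minimal sketch in Python. It is not our production code: the Chunk type, the function names, the 50/50 score blend, and the keyword-overlap re-ranker (a cheap stand-in for a real re-ranking model) are all assumptions chosen for illustration. The citation normalizer lower-cases sources and strips URL fragments and trailing slashes, which is lossy for case-sensitive paths, so treat that rule as a placeholder too.

```python
import math
import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    # Hypothetical record type; field names are illustrative.
    text: str
    embedding: list
    metadata: dict = field(default_factory=dict)
    source: str = ""


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(query_text, query_vec, chunks, filters, top_k=3, candidates=10):
    # 1. Metadata filter: keep only chunks whose metadata matches every filter.
    pool = [c for c in chunks
            if all(c.metadata.get(k) == v for k, v in filters.items())]

    # 2. Semantic search: take the closest candidates by cosine similarity.
    pool.sort(key=lambda c: cosine(query_vec, c.embedding), reverse=True)
    pool = pool[:candidates]

    # 3. Re-rank: blend similarity with keyword overlap.
    #    Assumption: a real system would call a re-ranking model here.
    terms = set(re.findall(r"\w+", query_text.lower()))

    def rerank_score(c):
        words = set(re.findall(r"\w+", c.text.lower()))
        overlap = len(terms & words) / len(terms) if terms else 0.0
        return 0.5 * cosine(query_vec, c.embedding) + 0.5 * overlap

    return sorted(pool, key=rerank_score, reverse=True)[:top_k]


def normalize_citation(source):
    # Illustrative rule: trim, lower-case, drop "#fragment" and trailing "/".
    s = source.strip().lower().split("#", 1)[0]
    return s.rstrip("/")


def dedupe_citations(sources):
    """Return normalized citations in first-seen order, without duplicates or blanks."""
    seen, out = set(), []
    for s in sources:
        key = normalize_citation(s)
        if key and key not in seen:
            seen.add(key)
            out.append(key)
    return out
```

Filtering before scoring keeps the candidate pool small, and re-ranking only the top candidates limits how often the more expensive scorer runs.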

The bigger win was pagination in our MCP (Model Context Protocol) tool layer. Our agents were hitting context ceilings when pulling large result sets from external tools. We implemented Redis-backed cursor pagination so the agent could fetch results one page at a time instead of loading everything at once. Client-side accuracy improved because the model could focus on the relevant slice of data.
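A rough sketch of the cursor pattern, in Python, looks like the following. The paginate function, the key prefix, the page size, and the TTL are hypothetical names and values, not what we shipped. To keep the example runnable without a server, it uses a small in-memory stand-in that exposes the same get, set(..., ex=...), and delete calls as a redis-py client, so a redis.Redis instance should be usable in its place. The nextCursor key follows the field name used by MCP's pagination convention.

```python
import json
import uuid


class InMemoryStore:
    """Stand-in for the subset of the redis-py client used below.

    Assumption: in a real deployment this would be a redis.Redis instance,
    and the `ex` argument would expire abandoned cursors automatically.
    """

    def __init__(self):
        self._data = {}

    def set(self, key, value, ex=None):
        self._data[key] = value  # TTL is ignored in this stand-in

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)


def paginate(store, fetch_all, cursor=None, page_size=2, ttl=300):
    """Return one page of results plus an opaque cursor for the next page."""
    if cursor is None:
        # First call: run the (possibly expensive) query once.
        items, offset = fetch_all(), 0
    else:
        raw = store.get(f"cursor:{cursor}")
        if raw is None:
            raise ValueError("unknown or expired cursor")
        state = json.loads(raw)  # json.loads accepts str or bytes
        items, offset = state["items"], state["offset"]
        store.delete(f"cursor:{cursor}")  # cursors are single-use

    page = items[offset:offset + page_size]
    next_offset = offset + page_size

    next_cursor = None
    if next_offset < len(items):
        # Save the remaining state under a fresh opaque token with a TTL.
        next_cursor = uuid.uuid4().hex
        payload = json.dumps({"items": items, "offset": next_offset})
        store.set(f"cursor:{next_cursor}", payload, ex=ttl)

    return {"items": page, "nextCursor": next_cursor}
```

The agent keeps calling the tool with the returned cursor until nextCursor is None. Because the cursor is an opaque token, the server can change how it stores state without breaking clients, and the TTL keeps abandoned result sets from accumulating.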

Key takeaways: invest in retrieval quality before scaling, and design your tool layer (MCP, function calling) with pagination and limits from day one.