Why do so many AI projects feel like déjà vu?
You start with bold ambitions, tackle a proof of concept, and… it stalls. Again.
At ML Vanguards, we know this story all too well. It’s the cycle we’ve broken countless times in our work at Cube Digital.
The truth? Building production-grade AI isn’t about chasing buzzwords — it’s about combining engineering knowledge with practical AI to deliver systems that actually work, scale, and drive real-world impact.
In this article, we’re going beyond the hype to give you the tools, insights, and strategies to escape PoC purgatory and find your way to production paradise.
Everything starts with a PoC, right? A client approaches you with basic requirements and a vision to create something groundbreaking. That’s when the excitement begins: turning an idea into a proof of concept feels like the first step toward innovation.
Over the past twelve months, I’ve gone through five different attempts to launch a fully functional Retrieval-Augmented Generation (RAG) system in production. Every single one ended up on the scrap heap for different reasons. Some projects died early in the prototyping phase, while others crashed and burned when scaling issues reared their ugly heads.
The journey taught me one critical lesson: choosing the right focus areas during the PoC phase can make or break the project.
As shown in the graphic above, the RAG pipeline consists of multiple moving parts—from preprocessing documents to integrating with vector databases and large language models. Each layer comes with its own engineering challenges; not all are worth solving during a PoC.
The key to a successful PoC is identifying which parts of the RAG pipeline truly matter and warrant deeper engineering effort.
Focusing too broadly or tackling production-scale issues prematurely is a recipe for wasted time, blown budgets, and, ultimately, failed projects.
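To make those moving parts concrete, here is a deliberately tiny, framework-free sketch of the shape of a RAG flow: chunk the documents, turn text into vectors, retrieve the most similar chunks, and build a grounded prompt. Everything in it is a toy stand-in (bag-of-words "embeddings", brute-force search, a stubbed LLM call); a real pipeline swaps in an embedding model, a vector database, and an actual LLM.

```python
import math
import re
from collections import Counter

# Toy stand-ins for the real components: fixed-size chunking, bag-of-words
# "embeddings", brute-force similarity search, and a stubbed LLM call.
def chunk(text: str, size: int = 300) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text: str) -> Counter:
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, chunks: list[str], k: int = 3) -> list[str]:
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def answer(question: str, documents: list[str]) -> str:
    chunks = [c for doc in documents for c in chunk(doc)]
    context = "\n---\n".join(retrieve(question, chunks))
    # In a real system this prompt goes to an LLM API; here we just return it.
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

Every function in that sketch hides a layer that can eat weeks of engineering time; the PoC question is which layers actually deserve it.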
In the following sections, I’ll share the lessons I learned across five different attempts, highlighting what worked and what didn’t and how careful selection during the PoC phase could have saved me a lot of headaches.
Project #1: Let’s "LangChain" everything
Generative AI was everywhere.
It seemed like everyone was talking about the next generation of chatbots, proclaiming that classical machine learning was outdated.
With all that noise in my head, I decided to take what looked like the easiest route: using an open-source LLM orchestrator like LangChain.
I dove into the documentation, binge-watched YouTube tutorials, and for a moment I felt invincible, like everything was finally falling into place, as if a divine hand was guiding me.
Armed with an open-source framework, I figured hooking up a vector database to a large language model was no big deal. After all, I had worked with AI APIs and text embeddings before.
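For context, this is roughly what that "no big deal" wiring looked like, in the classic LangChain style. Treat it as a sketch: the exact import paths and class names have moved around between releases (which, as you’ll see, became part of the problem), and the query is just an illustrative example.

```python
# Quick PoC wiring in the classic LangChain style. Note: import paths and
# class names have shifted across LangChain releases, so treat this as a
# sketch of the idea rather than a pinned, reproducible recipe.
from langchain.chains import RetrievalQA
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.vectorstores import FAISS

docs = ["...chunked document text...", "...more chunks..."]

vectorstore = FAISS.from_texts(docs, OpenAIEmbeddings())
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
)

print(qa.run("What does the onboarding guide say about access requests?"))
```

A dozen lines, a working demo, and a very convincing illusion of progress.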
But I couldn’t have been more wrong.
What went wrong?
Dependency hell: LangChain and its associated libraries were frequently updated, and with every update came compatibility issues. The vector database APIs and LLM integrations would often break, requiring constant troubleshooting and rework.
Loss of control: Using an external framework meant I had little control over its internal workings. Changes in the framework’s imports or logic disrupted my implementation, forcing me to rewrite parts of my code every time the framework evolved.
Scalability issues: While LangChain worked well for a single-user PoC, scaling it to multiple concurrent users introduced latency and resource allocation issues that the framework was not equipped to handle.
Security gaps: Sensitive information, such as user data, leaked through generated responses because there was no built-in mechanism to manage private or confidential data securely. These leaks led to compliance concerns and blocked progress.
Takeaway:
LangChain and similar frameworks are fantastic for building quick proofs of concept, offering a way to validate ideas and experiment with LLMs.
However, transitioning to production requires an entirely different approach.
For production, you need complete control over your pipeline, robust scalability strategies, and a security-first mindset. The flexibility and speed that make frameworks like LangChain appealing for PoCs can become liabilities when faced with real-world demands.
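To make the security point concrete: even a crude scrubbing step that strips obvious PII from retrieved context before it ever reaches the LLM would have caught some of the leaks we hit. Here is a minimal, regex-based sketch; real deployments should use dedicated PII-detection tooling, since this only catches the obvious patterns.

```python
import re

# Minimal PII scrubbing applied to retrieved context before it is placed in
# the prompt. Only catches obvious patterns (emails, phone numbers); real
# systems should rely on dedicated PII-detection services.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label.upper()}]", text)
    return text

print(redact("Contact Jane at jane.doe@example.com or +1 (555) 123-4567."))
# Contact Jane at [REDACTED EMAIL] or [REDACTED PHONE].
```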
Project #2: The "It can’t be that hard" prototype → no frameworks, 100% control over data
In my mind, data is the most crucial part of any AI system. So, in one of our projects, I decided to build the data ingestion and indexing components entirely from scratch. My thinking was simple: if we could ensure 100% control over the data pipeline, we’d avoid the issues that come with off-the-shelf frameworks and guarantee long-term flexibility.
To make this approach even more robust, we decided to build custom data connectors for various sources like Google Drive, Microsoft Outlook, PDFs, and wikis.
On top of that, we added the Ray framework for distributed processing and used the Qdrant SDK directly for low-level control over vector indexing. This would give us unparalleled control. Or so I thought.
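To give a flavor of how low-level that choice gets, here is a stripped-down version of the kind of indexing code it implies, assuming a local Qdrant instance. The collection name, vector size, pseudo-embedding helper, and sample data are all illustrative placeholders, not our actual pipeline.

```python
import random

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

def embed(text: str) -> list[float]:
    # Placeholder: a deterministic pseudo-embedding so the sketch runs end to
    # end; a real pipeline calls an embedding model here.
    rng = random.Random(text)
    return [rng.uniform(-1.0, 1.0) for _ in range(128)]

chunks_with_sources = [
    ("First illustrative chunk of text.", "drive/handbook.pdf"),
    ("Second illustrative chunk of text.", "wiki/onboarding"),
]

client = QdrantClient(url="http://localhost:6333")  # assumes a local Qdrant

client.recreate_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=128, distance=Distance.COSINE),
)

client.upsert(
    collection_name="documents",
    points=[
        PointStruct(id=i, vector=embed(chunk), payload={"text": chunk, "source": source})
        for i, (chunk, source) in enumerate(chunks_with_sources)
    ],
)

hits = client.search(
    collection_name="documents",
    query_vector=embed("how do I request access?"),
    limit=5,
)
print([hit.payload["source"] for hit in hits])
```

None of this is hard in isolation; the pain comes from owning every one of these decisions (distance metric, payload schema, collection lifecycle, batching) at the same time as everything else.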
What went wrong?
Document ingestion nightmares: Parsing files turned into a complete fiasco. Hidden metadata in PDFs caused chunking logic to break. Microsoft Outlook attachments came in unpredictable formats, and wikis were riddled with inconsistent structure. Each source introduced unique quirks that required constant debugging (see the parsing sketch after this list).
Hallucinations: Despite the focus on data quality, the LLM still generated references to nonexistent documents. Adjusting prompt parameters helped marginally, but hallucinations were far from eliminated.
Complexity overload: Developing custom connectors and indexing logic during the PoC phase created a flood of bugs. Prematurely adding production-level features—like distributed processing with Ray—complicated the system far beyond what was necessary for a proof of concept.
Qdrant SDK challenges: While Qdrant is powerful, using its low-level SDK demands a deeper understanding of how vector databases work. This introduced a steep learning curve, and bugs in query performance and indexing logic delayed progress.
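To illustrate the kind of defensive parsing this forced on us, here is a simplified ingestion sketch using pypdf (one of several libraries available): extract text page by page, normalize whitespace and layout artifacts, skip pages that yield nothing usable, and only then chunk with overlap. The thresholds, chunk sizes, and file path are illustrative.

```python
from pypdf import PdfReader

# Defensive PDF ingestion: extract page text, normalize whitespace and layout
# artifacts, and drop pages with no usable text instead of letting them break
# downstream chunking. Real connectors need far more special-casing than this.
def extract_pages(path: str) -> list[str]:
    pages = []
    for page in PdfReader(path).pages:
        text = page.extract_text() or ""   # guard against pages with no text layer
        text = " ".join(text.split())      # collapse odd whitespace from PDF layout
        if len(text) > 40:                 # skip near-empty or scanned pages
            pages.append(text)
    return pages

def chunk(pages: list[str], size: int = 1000, overlap: int = 200) -> list[str]:
    full_text = " ".join(pages)
    step = size - overlap
    return [full_text[i:i + size] for i in range(0, len(full_text), step)]

# Illustrative usage: chunks = chunk(extract_pages("quarterly_report.pdf"))
```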
Takeaway:
Preprocessing and data consistency are critical to AI success, but trying to build everything from scratch for a PoC is overkill.
Building custom data connectors is hard enough without the added complexity of integrating distributed frameworks like Ray or low-level vector database tools like Qdrant SDK. For a PoC, simplicity should be the priority—production-level features can (and should) wait for later.
I learned that while data is king, focusing solely on data quality during a PoC can derail the entire project if it comes at the expense of speed and simplicity.
Project #3: The grand re-architecture
After two challenging attempts, I had learned a lot. I was confident that this time I could build the perfect RAG system. Armed with my previous lessons, I designed a complex microservice architecture that would address every issue I had encountered before.
Everything was meticulously decoupled, giving me 100% control over each step of the process—from embeddings to retrieval, chunking, generation, and security checks.
This time, I ensured everything was developed on time and followed all the "best practices" I had collected along the way. It felt like a dream architecture—a seamless blend of modularity, control, and scalability.
The cost?
Reality hit hard when we went into production. The system had all the same issues I thought I’d eliminated—hallucinations, speed problems, incorrect data indexing in the vector database, search paths that didn’t align with expectations, and more. The complexity I introduced to avoid problems only created new ones.
What went wrong?
Too many moving parts: The microservice architecture, while elegant on paper, introduced extreme complexity. Each service (e.g., embeddings, chunking, retrieval, generation, and security checks) depended on others, making debugging a nightmare. When something broke, tracing the issue through multiple services consumed hours, if not days.
Shifting business priorities: As business needs evolved, our overly complicated architecture became a bottleneck. Iterating quickly was nearly impossible because every change impacted multiple interconnected services. The gaps between business requirements and the tech implementation widened.
Technical debt accumulation: Debugging complexity and the need to maintain intricate microservices led to a mountain of technical debt. Over time, the system became harder to maintain and adapt, slowing us down even further.
Production fragility: Despite our focus on control, issues like hallucinations, incorrect data ingestion, and vector database mismatches persisted. These problems undermined confidence in the system and made users hesitant to rely on it.
Takeaway:
A perfect architecture doesn’t guarantee a perfect system.
While decoupling and modularity are important, overengineering can be just as damaging as underengineering. For a RAG system, simplicity is often the key to stability.
In hindsight, I realized that building a highly complex system too early introduces rigidity and fragility, making it difficult to adapt to changing business needs. A simpler, monolithic architecture might have been a better choice for iterating quickly and addressing issues as they arose.
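For comparison, the "simpler monolith" I wish I had started with fits in a single service. Here is a hedged sketch using FastAPI, with retrieval and generation stubbed out behind plain functions; the names and the endpoint are illustrative, not our production code.

```python
from fastapi import FastAPI
from pydantic import BaseModel

# A single-service RAG API: one process, one deployable, one place to debug.
# retrieve() and generate() are illustrative stubs standing in for the vector
# search and the LLM call; in the microservice version each was its own service.
app = FastAPI()

class Query(BaseModel):
    question: str

def retrieve(question: str) -> list[str]:
    return ["(retrieved context chunks would go here)"]

def generate(question: str, context: list[str]) -> str:
    return f"(answer to {question!r}, grounded in {len(context)} chunks)"

@app.post("/query")
def query(q: Query) -> dict:
    context = retrieve(q.question)
    return {"answer": generate(q.question, context), "sources": context}
```

Chunking, retrieval, generation, and security checks can still live behind clean module boundaries inside that one process; splitting them into separate services can wait until the system has earned that complexity.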
Production environments are messy, and no amount of architectural brilliance can replace the need for iterative, incremental improvements.
Conclusion
PoC focus: Building a RAG PoC is about validating the business idea or proving the potential of the product. The goal is to demonstrate value, not perfection.
Transition challenges: Moving from PoC to production requires an entirely different approach, including robust engineering, scalability, and integration.
Critical requirements for production:
Clear use case: A strong, well-defined business problem that the RAG system solves effectively.
Time: Adequate time to design, build, and iterate without rushing into production.
Specialized team: Experts in data engineering, vector databases, ML pipelines, security, and scalability are essential.
Budget: Prepare for hidden costs, such as infrastructure, monitoring, compliance, and ongoing maintenance.
Simplicity and focus: Overcomplicating the PoC with production-level features or overly ambitious architectures often leads to unnecessary pain and wasted resources. Keep the PoC focused and streamlined.
The bottom line: Focus on validating the business idea during the PoC phase and leave heavy engineering for production. This ensures you avoid PoC purgatory and build a scalable, impactful system when the business case is solid.
By focusing on solving real-world problems at scale, you can transition from bold ambitions to delivering production-ready RAG systems.
👇👇👇
Within our newsletter, we keep things short and sweet.
If you enjoyed reading this article, consider subscribing to our FREE newsletter.