Smarter Techniques for Optimizing User Queries in RAG Systems
How I went from basic vector search to intelligent filter prediction with LLMs and a sidecar metadata DB
When I first built my Retrieval-Augmented Generation (RAG) pipeline, the flow was pretty standard: users could upload notes, textbooks, PDFs, or exam material, and I’d chunk it, embed it, and store the embeddings in Pinecone. On the query side, when a user asked a question, I’d embed it too, run a semantic vector search, and send the top-k chunks to the LLM for answering.
Simple, effective, and surprisingly decent.
But it didn’t feel intelligent.
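For reference, the baseline flow described above can be sketched in a few lines. This is a toy, in-memory stand-in rather than the actual Pinecone setup: the three-dimensional vectors, the chunk ids, and the function names are all assumptions made for illustration, and a real pipeline would get its vectors from an embedding model.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec: Vector, store: Dict[str, Vector], k: int = 3) -> List[Tuple[str, float]]:
    """Score every stored chunk against the query and keep the k best."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec)) for chunk_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" (hypothetical values); note that nothing here looks at metadata.
store = {
    "doppler-1": [0.9, 0.1, 0.0],
    "photosynthesis-1": [0.0, 0.2, 0.9],
    "echo-1": [0.8, 0.3, 0.1],
}
results = top_k_chunks([1.0, 0.0, 0.0], store, k=2)
print(results)
```

Every chunk is a candidate here, which is exactly the limitation the rest of this post works around.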
The Problem: Generic Questions, Overloaded Vector Search
Real users don’t ask specific, filtered questions.
They just ask “What’s the Doppler effect?” or “Why do dogs hear better than humans?”
At this point, I was running the query across the entire namespace or file, across all subjects, all topics, all difficulties, and just hoping the top semantic match was good enough. And while vector search is powerful, it’s not immune to:
Noisy matches
Overlapping concepts
Chunks from unrelated topics being semantically similar (especially in educational content)
So, I started thinking:
“I already have metadata on each chunk… can I use that to tighten my vector search?”
Step 1: Metadata Tagging at Chunk Time
I used an LLM (GPT-4o-mini) to extract structured metadata for each chunk during ingestion:
subject → Physics, Biology, Chemistry, etc.
topic and subtopic
difficulty (easy/medium/hard)
keywords (up to 5 terms)
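One way to hold that metadata in code is a small typed record plus a validator for whatever JSON the extraction model returns. This is only a sketch: the field names mirror the list above, but the `ChunkMetadata` class, the `parse_chunk_metadata` helper, and the sample model output are hypothetical.

```python
import json
from dataclasses import dataclass, field
from typing import List

ALLOWED_DIFFICULTIES = {"easy", "medium", "hard"}
MAX_KEYWORDS = 5


@dataclass
class ChunkMetadata:
    subject: str
    topic: str
    subtopic: str
    difficulty: str
    keywords: List[str] = field(default_factory=list)


def parse_chunk_metadata(raw_json: str) -> ChunkMetadata:
    """Validate the JSON an extraction model returned for one chunk."""
    data = json.loads(raw_json)
    difficulty = str(data.get("difficulty", "")).strip().lower()
    if difficulty not in ALLOWED_DIFFICULTIES:
        raise ValueError(f"unexpected difficulty: {difficulty!r}")
    keywords = [str(k).strip() for k in data.get("keywords", []) if str(k).strip()]
    return ChunkMetadata(
        subject=str(data["subject"]).strip(),
        topic=str(data["topic"]).strip(),
        subtopic=str(data.get("subtopic", "")).strip(),
        difficulty=difficulty,
        keywords=keywords[:MAX_KEYWORDS],  # keep at most 5 terms
    )


# Hypothetical model output for a chunk about echoes (six keywords on purpose).
raw = json.dumps({
    "subject": "Physics",
    "topic": "Sound",
    "subtopic": "Echo",
    "difficulty": "Easy",
    "keywords": ["echo", "reflection", "sound wave", "distance", "time delay", "extra"],
})
meta = parse_chunk_metadata(raw)
print(meta)
```

Normalizing values at ingestion (trimming, lowercasing the difficulty) matters later, because filters compare these strings exactly.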
This gave me control. I could now store this metadata alongside each vector in Pinecone and apply structured filters like:
filter: { fileId: ..., subject: "Physics", topic: "Sound" }
This was a big upgrade. Suddenly, vector search became more focused and more accurate. But that was only part of the equation.
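Here is a rough sketch of building such a filter. Pinecone's metadata filters accept MongoDB-style operators such as `$eq`, so the dictionary below is the kind of value that would be passed as the `filter` argument of a query. To keep the example runnable without an account, the matching is emulated locally; `build_filter`, `matches`, and the ids `file-123` / `file-999` are made up for illustration.

```python
from typing import Any, Dict, Optional


def build_filter(file_id: str, subject: Optional[str] = None, topic: Optional[str] = None) -> Dict[str, Any]:
    """Build an exact-match metadata filter, skipping fields that are unset."""
    conditions = {"fileId": file_id, "subject": subject, "topic": topic}
    return {key: {"$eq": value} for key, value in conditions.items() if value is not None}


def matches(metadata: Dict[str, Any], flt: Dict[str, Any]) -> bool:
    """Local stand-in for the database's filter: every $eq condition must hold."""
    return all(metadata.get(key) == cond["$eq"] for key, cond in flt.items())


# Hypothetical stored chunks with their metadata.
chunks = [
    {"id": "c1", "metadata": {"fileId": "file-123", "subject": "Physics", "topic": "Sound"}},
    {"id": "c2", "metadata": {"fileId": "file-123", "subject": "Physics", "topic": "Light"}},
    {"id": "c3", "metadata": {"fileId": "file-999", "subject": "Physics", "topic": "Sound"}},
]

flt = build_filter("file-123", subject="Physics", topic="Sound")
candidates = [c["id"] for c in chunks if matches(c["metadata"], flt)]
print(flt)
print(candidates)
```

Only the chunks that pass the filter are then ranked by vector similarity, which is where the extra focus comes from.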
Step 2: Reverse Filter Prediction from User Query
Most users don’t select subjects or topics from a dropdown. They just ask the question.
So I thought, why not flip the idea?
What if I use an LLM to guess the subject and topic of the user query and use that to filter the vector search?
So I built a lightweight LLM wrapper that looks at the incoming user query and infers possible metadata like:
{ subject: "Physics", topic: "Sound" }
This worked surprisingly well.
Questions like:
“How does echo differ from reverberation?”
would get tagged as:
{ subject: "Physics", topic: "Sound" }
and boom, only Physics + Sound chunks are searched.
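A wrapper like this can be quite small. The sketch below assumes the model is reachable through a plain `llm(prompt) -> str` callable, which is a placeholder for whichever client you use; the prompt wording and the `predict_filters` name are likewise illustrative, not the exact wrapper described here.

```python
import json
from typing import Callable, Dict


def build_prompt(question: str) -> str:
    """Ask for subject and topic as a JSON object and nothing else."""
    return (
        "Classify the student's question by subject and topic.\n"
        'Reply with JSON only, using the keys "subject" and "topic".\n'
        f"Question: {question}"
    )


def predict_filters(question: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Ask the model for metadata; return {} if the reply is unusable."""
    reply = llm(build_prompt(question))
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return {}
    if not isinstance(data, dict):
        return {}
    return {key: str(data[key]).strip() for key in ("subject", "topic") if data.get(key)}


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call so the sketch runs offline.
    return '{"subject": "Physics", "topic": "Sound"}'


filters = predict_filters("How does echo differ from reverberation?", fake_llm)
print(filters)
```

Returning an empty dict on a malformed reply means a bad prediction degrades to an unfiltered search instead of an exception.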
The Roadblock: Hallucinated Filters = 0 Results
Then came the issue.
Sometimes the LLM would guess metadata that didn’t actually exist in my chunked data.
For example, for a question about acoustics, it guessed:
{ subject: "Physics", topic: "Music" }
Problem? I had no vector chunks tagged with topic: "Music", so Pinecone’s strict filtering returned 0 results.
And here’s the kicker:
Metadata filters in vector DBs (Pinecone, Qdrant, Weaviate…) are exact-match by default.
They won’t “fuzzy match” or try semantically close alternatives on their own. No match = no results.
So now I had a clever guess that completely stopped my retrieval process.
Step 3: Building a Fallback System
To handle these edge cases, I added graceful filter fallback logic in the app layer:
Try full filter (fileId + subject + topic)
If no match, try relaxed filter (fileId + subject)
If still nothing, fall back to pure vector search (just fileId or even the entire namespace)
This gave me:
⚡ Speed when filters worked
🎯 Precision when filters were accurate
🔄 Resilience when filters failed
Now even if the LLM guessed something too niche, I didn’t lose the result; I just searched wider.
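The three tiers above can be sketched as a loop over progressively looser filters. The `search` callable stands in for a filtered vector query and is an assumption, as are the helper names and the sample chunks; the in-memory version below only checks metadata so that the example runs on its own.

```python
from typing import Any, Callable, Dict, List, Tuple

Filter = Dict[str, Any]
SearchFn = Callable[[Filter], List[str]]


def fallback_filters(file_id: str, subject: str = "", topic: str = "") -> List[Filter]:
    """Order filters from strictest to loosest, leaving out tiers with no value."""
    tiers: List[Filter] = []
    if subject and topic:
        tiers.append({"fileId": file_id, "subject": subject, "topic": topic})
    if subject:
        tiers.append({"fileId": file_id, "subject": subject})
    tiers.append({"fileId": file_id})
    return tiers


def search_with_fallback(search: SearchFn, tiers: List[Filter]) -> Tuple[List[str], Filter]:
    """Return the first non-empty result and the filter that produced it."""
    for flt in tiers:
        hits = search(flt)
        if hits:
            return hits, flt
    return [], tiers[-1]


# Hypothetical chunk metadata; note that no chunk has topic "Music".
CHUNKS = [
    {"id": "c1", "fileId": "file-123", "subject": "Physics", "topic": "Sound"},
    {"id": "c2", "fileId": "file-123", "subject": "Biology", "topic": "Photosynthesis"},
]


def search(flt: Filter) -> List[str]:
    # Exact-match stand-in for a filtered vector query.
    return [c["id"] for c in CHUNKS if all(c.get(k) == v for k, v in flt.items())]


hits, used = search_with_fallback(search, fallback_filters("file-123", "Physics", "Music"))
print(hits, used)
```

Each relaxed tier costs one more query, so the worst case is three round trips. Returning the filter that finally matched also makes it easy to log how often the predicted topic misses.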
Step 4: Smarter Guesses Using a Sidecar Metadata DB
But I wasn’t done.
The LLM still occasionally hallucinated weird or off-topic guesses — like "Music" or "Hearing", which weren’t present in my dataset.
So I took it further:
What if the LLM could only choose from actual metadata I already have in the DB?
So I built a small sidecar metadata index (basically a MongoDB collection, or even an in-memory cache) that stores all known:
Subjects
Topics
Keywords (per file or per namespace)
Now, before guessing filters, I preload those known values into the LLM prompt like:
Available subjects: [Physics, Chemistry, Biology]
Available topics: [Sound, Light, Motion, Laws of Motion, Photosynthesis]
And the LLM only guesses from this known list.
No more hallucinated filters. No more Music. Just sharp, valid filter predictions.
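A minimal in-memory version of the sidecar and the constrained prompt might look like the sketch below. The `MetadataIndex` class and the helper names are hypothetical. The final `keep_known` check is an extra guard that is not part of the setup described above: putting a list in the prompt strongly steers the model but does not strictly guarantee it stays inside that list, so validating the reply is cheap insurance.

```python
from typing import Dict, List, Optional, Set


class MetadataIndex:
    """In-memory sidecar of metadata values seen at ingestion, per file."""

    def __init__(self) -> None:
        self._subjects: Dict[str, Set[str]] = {}
        self._topics: Dict[str, Set[str]] = {}

    def add(self, file_id: str, subject: str, topic: str) -> None:
        self._subjects.setdefault(file_id, set()).add(subject)
        self._topics.setdefault(file_id, set()).add(topic)

    def subjects(self, file_id: str) -> List[str]:
        return sorted(self._subjects.get(file_id, set()))

    def topics(self, file_id: str) -> List[str]:
        return sorted(self._topics.get(file_id, set()))


def build_constrained_prompt(question: str, subjects: List[str], topics: List[str]) -> str:
    """Put the known values in front of the model before it guesses."""
    return (
        f"Available subjects: [{', '.join(subjects)}]\n"
        f"Available topics: [{', '.join(topics)}]\n"
        "Pick one subject and one topic from the lists above, or null if none fit.\n"
        f"Question: {question}"
    )


def keep_known(guess: Dict[str, Optional[str]], subjects: List[str], topics: List[str]) -> Dict[str, str]:
    """Drop any guessed value that is not in the known lists."""
    cleaned: Dict[str, str] = {}
    if guess.get("subject") in subjects:
        cleaned["subject"] = str(guess["subject"])
    if guess.get("topic") in topics:
        cleaned["topic"] = str(guess["topic"])
    return cleaned


index = MetadataIndex()
index.add("file-123", "Physics", "Sound")
index.add("file-123", "Physics", "Light")
index.add("file-123", "Biology", "Photosynthesis")

subjects, topics = index.subjects("file-123"), index.topics("file-123")
prompt = build_constrained_prompt("Why do dogs hear better than humans?", subjects, topics)
cleaned = keep_known({"subject": "Physics", "topic": "Music"}, subjects, topics)
print(prompt)
print(cleaned)
```

The same interface could sit on top of a MongoDB collection; the only requirement is that the index is updated whenever new chunks are ingested.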
Final Setup (and why I love it)
Now my query pipeline looks like this:
User asks anything: simple, free-form
LLM guesses subject/topic from a list of known values
Filter-based vector search runs (fileId + guessed filters)
Fallback if no results
Top chunks reranked
Answer generated via LLM
It’s smart. It’s resilient. It’s clean.
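To show how the steps fit together, here is a compact orchestration sketch. Every stage is passed in as a callable so the wiring stays visible; the stub implementations at the bottom (a fixed prediction, a metadata-only search, a word-overlap reranker, and a fake answer generator) are placeholders invented for this example and say nothing about which reranker or model is actually used.

```python
import re
from typing import Callable, Dict, List

Chunk = Dict[str, str]


def answer_question(
    question: str,
    file_id: str,
    predict: Callable[[str], Dict[str, str]],
    search: Callable[[Dict[str, str]], List[Chunk]],
    rerank: Callable[[str, List[Chunk]], List[Chunk]],
    generate: Callable[[str, List[Chunk]], str],
    top_n: int = 3,
) -> str:
    guess = predict(question)  # subject/topic guessed from known values
    tiers = [
        {"fileId": file_id, **guess},  # full filter
        {"fileId": file_id, **{k: v for k, v in guess.items() if k == "subject"}},  # relaxed
        {"fileId": file_id},  # file only
    ]
    chunks: List[Chunk] = []
    for flt in tiers:  # fall back until something matches
        chunks = search(flt)
        if chunks:
            break
    best = rerank(question, chunks)[:top_n]
    return generate(question, best)


# --- Stubs so the sketch runs offline (all hypothetical) ---
DATA: List[Chunk] = [
    {"id": "c1", "fileId": "file-123", "subject": "Physics", "topic": "Sound",
     "text": "An echo is a reflected sound heard after a delay."},
    {"id": "c2", "fileId": "file-123", "subject": "Physics", "topic": "Light",
     "text": "Light refracts when it changes medium."},
]


def words(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def stub_search(flt: Dict[str, str]) -> List[Chunk]:
    return [c for c in DATA if all(c.get(k) == v for k, v in flt.items())]


def stub_rerank(question: str, chunks: List[Chunk]) -> List[Chunk]:
    # Word overlap as a crude stand-in for a real reranker.
    return sorted(chunks, key=lambda c: len(words(question) & words(c["text"])), reverse=True)


def stub_generate(question: str, chunks: List[Chunk]) -> str:
    return "Answer based on: " + ", ".join(c["id"] for c in chunks)


result = answer_question(
    "What is an echo?", "file-123",
    predict=lambda q: {"subject": "Physics", "topic": "Music"},
    search=stub_search, rerank=stub_rerank, generate=stub_generate, top_n=1,
)
print(result)
```

In this run the guessed topic "Music" matches nothing, the relaxed Physics filter returns both chunks, and the reranker puts the echo chunk first.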
🤝 Final Thoughts
This whole process turned out to be one of the most valuable architectural upgrades I’ve made.
It added intelligence to the system without sacrificing robustness, and it cost me no extra infra, just smarter orchestration.
If you’re building a RAG-based system with user-uploaded content, definitely consider:
Storing chunk-level metadata
Letting LLMs predict filters
Building a fallback tree
Feeding known metadata into the LLM prompt
You’ll end up with a system that feels 10x smarter — and your users won’t even know why.
⚠️ A quick word of caution
This kind of intelligent metadata-driven filtering is best suited for high-accuracy domains (like education, legal, healthcare) or apps operating at production scale, where precision, performance, and cost-efficiency truly matter.
If you're building a simple MVP or experimenting with RAG for the first time, you likely don't need this level of orchestration.
In those cases, a clean vector search with a good reranker is more than enough to get you started. Add this layer only when your app (or your users) demand it.
