<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Not Just Code]]></title><description><![CDATA[Not Just Code]]></description><link>https://notjustcode.dev.karthiksai.com</link><generator>RSS for Node</generator><lastBuildDate>Fri, 05 Jun 2026 22:58:11 GMT</lastBuildDate><atom:link href="https://notjustcode.dev.karthiksai.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Making LLMs Efficient for Survey Cleaning: My Journey from Arrays to Choice Maps]]></title><description><![CDATA[Problem Statement
I was building an AI pipeline to clean survey responses. The data structure was like this:
Sample Question:
{
  "id": 3271,
  "text": "How satisfied are you with our service?",
  "choices": [
    { "id": 1, "label": "Very Satisfied"...]]></description><link>https://notjustcode.dev.karthiksai.com/token-limit-reasoning-optimization</link><guid isPermaLink="true">https://notjustcode.dev.karthiksai.com/token-limit-reasoning-optimization</guid><category><![CDATA[llm]]></category><category><![CDATA[AI]]></category><category><![CDATA[scaling]]></category><category><![CDATA[backend]]></category><dc:creator><![CDATA[Karthik Sai]]></dc:creator><pubDate>Mon, 28 Apr 2025 14:20:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1745849215827/04da40f9-0d5b-46f6-be48-2e4877e73f53.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-problem-statement">Problem Statement</h2>
<p>I was building an AI pipeline to clean survey responses. The data structure was like this:</p>
<p><strong>Sample Question:</strong></p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"id"</span>: <span class="hljs-number">3271</span>,
  <span class="hljs-attr">"text"</span>: <span class="hljs-string">"How satisfied are you with our service?"</span>,
  <span class="hljs-attr">"choices"</span>: [
    { <span class="hljs-attr">"id"</span>: <span class="hljs-number">1</span>, <span class="hljs-attr">"label"</span>: <span class="hljs-string">"Very Satisfied"</span> },
    { <span class="hljs-attr">"id"</span>: <span class="hljs-number">2</span>, <span class="hljs-attr">"label"</span>: <span class="hljs-string">"Neutral"</span> },
    { <span class="hljs-attr">"id"</span>: <span class="hljs-number">3</span>, <span class="hljs-attr">"label"</span>: <span class="hljs-string">"Dissatisfied"</span> }
  ]
}
</code></pre>
<p><strong>Sample Response:</strong></p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"responseId"</span>: <span class="hljs-number">1001</span>,
  <span class="hljs-attr">"responses"</span>: [
    { <span class="hljs-attr">"questionId"</span>: <span class="hljs-number">3271</span>, <span class="hljs-attr">"responses"</span>: <span class="hljs-string">"2"</span> }
  ]
}
</code></pre>
<p>Simple, right? The user selected <code>2</code>, meaning "Neutral".<br />Now, when sending batches of survey responses to an LLM for cleaning and fraud detection, I had one big question in mind: <strong>How do I send questions and responses efficiently, without wasting tokens or slowing the model down?</strong></p>
<hr />
<h2 id="heading-my-thought-process">My Thought Process</h2>
<p>Initially, I thought: <strong><em>"Come on, just send the full questions array and the responses array. Simple."</em></strong><br />So I was packing:</p>
<ul>
<li><p>Full questions (with choices array)</p>
</li>
<li><p>Full responses (with choiceIds)</p>
</li>
</ul>
<h3 id="heading-but-slowly-i-realised">But slowly I realised...</h3>
<p>Every batch was sending the same choices again and again. For every user response, the LLM had to read the question's choices, scan the array, and match the choiceId.<br />Even a small survey was easily eating 2k-3k tokens just for system context!<br />Then I thought:</p>
<blockquote>
<p>"What if instead of sending same data again and again, I somehow make the choice lookup easier for the model?"</p>
</blockquote>
<hr />
<h2 id="heading-i-had-explored-three-options">I Explored Three Options</h2>
<h3 id="heading-option-1-keep-choices-as-array-default">Option 1: Keep Choices as Array (Default)</h3>
<ul>
<li><p>Each question has <code>choices: [{ id, label }]</code> array.</p>
</li>
<li><p>Each response carries only the choiceId.</p>
</li>
<li><p>The LLM scans the array to find the matching label.</p>
</li>
</ul>
<p><strong>Pros:</strong> Tiny initial payload.</p>
<p><strong>Cons:</strong></p>
<ul>
<li><p>The model has to scan the array for every lookup (O(n) in the number of choices).</p>
</li>
<li><p>Slow reasoning.</p>
</li>
<li><p>Wastes attention and tokens as the survey grows.</p>
</li>
</ul>
<p><em>(Imagine scanning 10 choices manually every time. Ugh.)</em></p>
<hr />
<h3 id="heading-option-2-expand-label-inside-every-response">Option 2: Expand Label Inside Every Response</h3>
<ul>
<li><p>Instead of sending <code>choiceId</code>, I replace it with "Neutral", "Dissatisfied", etc.</p>
</li>
<li><p>Responses directly readable by model.</p>
</li>
</ul>
<p><strong>Pros:</strong> Fast LLM understanding.</p>
<p><strong>Cons:</strong></p>
<ul>
<li><p>Response size grows with every label; it can double or triple when labels are long.</p>
</li>
<li><p>Huge token waste.</p>
</li>
<li><p>Not practical for batches of 10k+ responses.</p>
</li>
</ul>
<p><em>(Fine at small scale, but at large scale:</em> 🪦 <em>RIP tokens!)</em></p>
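<p>To get a rough feel for that growth, here is a small Python sketch that compares the two payload shapes. It counts characters, not tokens (real tokenizer counts will differ), and the 1,000 answers are made-up data that just cycle through the three sample choices.</p>
<pre><code class="lang-python">import json

def compact(value):
    # Compact separators so whitespace does not skew the comparison.
    return json.dumps(value, separators=(",", ":"))

labels = {"1": "Very Satisfied", "2": "Neutral", "3": "Dissatisfied"}

# 1,000 made-up answers cycling through the three choices.
choice_ids = [str(i % 3 + 1) for i in range(1000)]

# Same responses, once with choiceIds and once with labels expanded inline.
id_responses = [{"questionId": 3271, "responses": cid} for cid in choice_ids]
label_responses = [
    {"questionId": 3271, "responses": labels[cid]} for cid in choice_ids
]

id_chars = len(compact(id_responses))
label_chars = len(compact(label_responses))
print("choiceId payload:", id_chars, "characters")
print("label payload   :", label_chars, "characters")
print("growth factor   :", round(label_chars / id_chars, 2))
</code></pre>
<p>How big the gap is depends on the data: with short labels and a wrapper object around each answer the growth is modest, and it climbs as labels get longer or the wrapper gets leaner.</p>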
<hr />
<h3 id="heading-option-3-prebuilt-choice-map-per-question">Option 3: Prebuilt Choice Map per Question</h3>
<ul>
<li>Build a map like:</li>
</ul>
<pre><code class="lang-json">{
  <span class="hljs-attr">"3271"</span>: {
    <span class="hljs-attr">"1"</span>: <span class="hljs-string">"Very Satisfied"</span>,
    <span class="hljs-attr">"2"</span>: <span class="hljs-string">"Neutral"</span>,
    <span class="hljs-attr">"3"</span>: <span class="hljs-string">"Dissatisfied"</span>
  }
}
</code></pre>
<ul>
<li><p>Response stays as choiceId ("2").</p>
</li>
<li><p>The LLM resolves a label with a direct key lookup in the map (conceptually O(1)) instead of scanning an array.</p>
</li>
</ul>
<p><strong>Pros:</strong></p>
<ul>
<li><p>One-time small cost.</p>
</li>
<li><p>Fastest reasoning.</p>
</li>
<li><p>Smallest token usage long term.</p>
</li>
<li><p>Holds up well at 100k or even 1M responses.</p>
</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>Slightly more backend work to generate the map.</li>
</ul>
<p><em>(But honestly, once it's done, it's clean and scalable!)</em></p>
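<p>The backend work is small. Here is a minimal Python sketch of the preprocessing step, assuming questions arrive in the sample shape shown earlier. The name <code>build_choice_map</code> is just a placeholder of mine, not from any library.</p>
<pre><code class="lang-python">import json

def build_choice_map(questions):
    """Turn each question's choices array into {questionId: {choiceId: label}}."""
    choice_map = {}
    for question in questions:
        # JSON object keys are strings, so ids are stringified up front.
        choice_map[str(question["id"])] = {
            str(choice["id"]): choice["label"] for choice in question["choices"]
        }
    return choice_map

questions = [
    {
        "id": 3271,
        "text": "How satisfied are you with our service?",
        "choices": [
            {"id": 1, "label": "Very Satisfied"},
            {"id": 2, "label": "Neutral"},
            {"id": 3, "label": "Dissatisfied"},
        ],
    }
]

choice_map = build_choice_map(questions)
print(json.dumps(choice_map, indent=2))
</code></pre>
<p>Because responses store the choiceId as a string (<code>"2"</code>), keeping the map keys as strings means the ids in a response match the keys directly.</p>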
<hr />
<h2 id="heading-final-flow">Final Flow</h2>
<pre><code class="lang-plaintext">Survey Questions (choices array)
            ↓
Preprocess into Choice Map (one time)
            ↓
Store Choice Map in System Context
            ↓
Send Responses with choiceId only
            ↓
LLM does O(1) lookup from Map
            ↓
Efficient fraud detection and response validation
</code></pre>
<p><strong>✅ So, what are the advantages?</strong></p>
<ul>
<li><p>No duplicate choices in every batch.</p>
</li>
<li><p>No ballooning of response size.</p>
</li>
<li><p>No array scanning overhead for LLM.</p>
</li>
</ul>
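<p>As a rough sketch, the flow above could be wired up like this in Python. The helper names, the prompt wording, and the batch size are all assumptions for illustration; the actual LLM call is left out because it depends on your provider.</p>
<pre><code class="lang-python">import json

def build_system_context(choice_map):
    # Sent once as system context; compact separators avoid spending
    # tokens on whitespace.
    return (
        "Choice map (questionId to choiceId to label):\n"
        + json.dumps(choice_map, separators=(",", ":"))
    )

def batch_responses(responses, batch_size):
    """Yield lists of at most batch_size responses, choiceIds left untouched."""
    for start in range(0, len(responses), batch_size):
        yield responses[start:start + batch_size]

choice_map = {"3271": {"1": "Very Satisfied", "2": "Neutral", "3": "Dissatisfied"}}
responses = [
    {"responseId": 1001, "responses": [{"questionId": 3271, "responses": "2"}]},
    {"responseId": 1002, "responses": [{"questionId": 3271, "responses": "1"}]},
    {"responseId": 1003, "responses": [{"questionId": 3271, "responses": "3"}]},
]

system_context = build_system_context(choice_map)
batches = list(batch_responses(responses, batch_size=2))

# One user message per batch; labels never appear here, only choiceIds.
user_messages = [json.dumps(batch, separators=(",", ":")) for batch in batches]
</code></pre>
<p>Each entry in <code>user_messages</code> would go to the model together with the same <code>system_context</code>, so the labels are written out once instead of once per response.</p>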
<hr />
<h2 id="heading-key-benefits">Key Benefits</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Approach</strong></td><td><strong>Token Usage</strong></td><td><strong>LLM Speed</strong></td><td><strong>Scale Readiness</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Choices as Array</td><td>Medium</td><td>Medium</td><td>OK only for small surveys</td></tr>
<tr>
<td>Expanded Labels</td><td>High</td><td>Fast</td><td>Very costly at scale</td></tr>
<tr>
<td>Prebuilt Choice Map</td><td>Low</td><td>Fastest</td><td>Best for 100k+ responses</td></tr>
</tbody>
</table>
</div><hr />
<h2 id="heading-final-thought">💡 Final Thought</h2>
<p>Sometimes small design decisions, like whether to send a list or a map, matter A LOT when you want to scale cleanly.<br />I learned this by thinking deeply from the angle of:</p>
<ul>
<li><p>Token cost</p>
</li>
<li><p>LLM cognitive load</p>
</li>
<li><p>Real-world scaling to hundreds of thousands of survey responses</p>
</li>
</ul>
<hr />
<h2 id="heading-tldr">TL;DR</h2>
<p>This idea is <strong>not only for surveys</strong>! It can be applied wherever structured choices are involved.<br /><strong>Some real examples:</strong></p>
<ul>
<li><p>Auto-grading MCQ exams at scale (education apps).</p>
</li>
<li><p>Screening candidate forms in HRTech startups.</p>
</li>
<li><p>Cleaning healthcare intake forms efficiently.</p>
</li>
<li><p>Processing ecommerce customer feedback forms cheaply.</p>
</li>
<li><p>Analyzing product satisfaction surveys in SaaS platforms.</p>
</li>
</ul>
<p><strong>Main benefits of using Maps in AI pipelines:</strong></p>
<ul>
<li><p>✅ Save massive tokens.</p>
</li>
<li><p>✅ Make the LLM reason faster.</p>
</li>
<li><p>✅ Scale to millions of records easily.</p>
</li>
<li><p>✅ Keep backend and API payloads clean and simple.</p>
</li>
</ul>
<p><strong>Thanks for reading!</strong> 🙏<br />If you're building AI pipelines like this, comment your thoughts and approaches.</p>
]]></content:encoded></item><item><title><![CDATA[Smarter Techniques for Optimizing User Queries in RAG Systems]]></title><description><![CDATA[When I first built my Retrieval-Augmented Generation (RAG) pipeline, the flow was pretty standard, users could upload notes, textbooks, PDFs, or exam material, and I’d chunk it, embed it, and store the embeddings in Pinecone. On the user side, they’d...]]></description><link>https://notjustcode.dev.karthiksai.com/intelligent-query-handling-rag</link><guid isPermaLink="true">https://notjustcode.dev.karthiksai.com/intelligent-query-handling-rag</guid><dc:creator><![CDATA[Karthik Sai]]></dc:creator><pubDate>Fri, 11 Apr 2025 10:52:36 GMT</pubDate><content:encoded><![CDATA[<p>When I first built my Retrieval-Augmented Generation (RAG) pipeline, the flow was pretty standard, users could upload notes, textbooks, PDFs, or exam material, and I’d chunk it, embed it, and store the embeddings in Pinecone. On the user side, they’d ask a question, I’d embed it too, search vectors semantically, and send the top-k chunks to the LLM for answering.<br /><strong>Simple, effective, and surprisingly decent.</strong></p>
<p>But it didn’t feel intelligent.</p>
<hr />
<h3 id="heading-the-problem-generic-questions-overloaded-vector-search">The Problem: Generic Questions, Overloaded Vector Search</h3>
<p>Real users don’t ask specific, filtered questions.<br />They just ask <em>“What’s the Doppler effect?”</em> or <em>“Why do dogs hear better than humans?”</em></p>
<p>At this point, I was running the query across the entire namespace or file, across all subjects, all topics, all difficulties, and just hoping the top semantic match was good enough. And while vector search is powerful, it’s not immune to:</p>
<ul>
<li><p>Noisy matches</p>
</li>
<li><p>Overlapping concepts</p>
</li>
<li><p>Chunks from unrelated topics being semantically similar (especially in educational content)</p>
</li>
</ul>
<p>So, I started thinking:</p>
<blockquote>
<p>“I already have metadata on each chunk… can I use that to tighten my vector search?”</p>
</blockquote>
<hr />
<h3 id="heading-step-1-metadata-tagging-at-chunk-time">Step 1: Metadata Tagging at Chunk Time</h3>
<p>I used LLMs (GPT-4o-mini) to extract structured metadata for each chunk during ingestion:</p>
<ul>
<li><p><code>subject</code> → Physics, Biology, Chemistry, etc.</p>
</li>
<li><p><code>topic</code> and <code>subtopic</code></p>
</li>
<li><p><code>difficulty</code> (easy/medium/hard)</p>
</li>
<li><p><code>keywords</code> (up to 5 terms)</p>
</li>
</ul>
<p>This gave me control. I could now store this metadata alongside each vector in Pinecone and apply structured filters like:</p>
<pre><code class="lang-plaintext">filter: { fileId: ..., subject: "Physics", topic: "Sound" }
</code></pre>
<p>This was a big upgrade. Suddenly, vector search became more focused, and more accurate. But that was only part of the equation.</p>
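<p>Here is a minimal Python sketch of how such a filter could be built. It uses the <code>$eq</code> operator from Pinecone’s metadata filter syntax; the function name <code>build_filter</code> and the file id <code>"file-123"</code> are placeholders I made up for the example.</p>
<pre><code class="lang-python">def build_filter(file_id, subject=None, topic=None):
    """Build a metadata filter; fields left as None are not constrained."""
    conditions = {"fileId": {"$eq": file_id}}
    if subject is not None:
        conditions["subject"] = {"$eq": subject}
    if topic is not None:
        conditions["topic"] = {"$eq": topic}
    return conditions

# "file-123" is a hypothetical id used only for this example.
full_filter = build_filter("file-123", subject="Physics", topic="Sound")
subject_only = build_filter("file-123", subject="Physics")
print(full_filter)
</code></pre>
<p>The resulting dict is what would be passed as the filter on a query; all conditions in it must hold for a chunk to be returned.</p>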
<hr />
<h3 id="heading-step-2-reverse-filter-prediction-from-user-query">Step 2: Reverse Filter Prediction from User Query</h3>
<p>Most users don’t select subjects or topics from a dropdown. They just ask the question.<br />So I thought, why not flip the idea?</p>
<blockquote>
<p>What if I use an LLM to guess the subject and topic of the user query and use that to filter the vector search?</p>
</blockquote>
<p>So I built a lightweight LLM wrapper that looks at the incoming user query and infers possible metadata like:</p>
<pre><code class="lang-plaintext">{ subject: "Physics", topic: "Sound" }
</code></pre>
<p>This worked surprisingly well.<br />Questions like:</p>
<blockquote>
<p>“How does echo differ from reverberation?”<br />would get tagged as:<br /><code>{ subject: "Physics", topic: "Sound" }</code> → and boom, only Physics+Sound chunks are searched.</p>
</blockquote>
<hr />
<h3 id="heading-the-roadblock-hallucinated-filters-0-results">The Roadblock: Hallucinated Filters = 0 Results</h3>
<p>Then came the issue.<br />Sometimes the LLM would guess metadata that didn’t actually exist in my chunked data.<br />For example, for a question about acoustics, it guessed:</p>
<pre><code class="lang-plaintext">{ subject: "Physics", topic: "Music" }
</code></pre>
<p>Problem? I had <strong>no</strong> vector chunks tagged with <code>topic: "Music"</code>, so Pinecone’s strict filtering returned <strong>0 results.</strong><br />And here’s the kicker:<br />Equality filters on metadata in vector DBs such as Pinecone, Qdrant, and Weaviate are <strong>exact</strong>.<br />They won’t “fuzzy match” a value or try alternatives on their own. No match = no results.<br />So now I had a clever guess that completely stopped my retrieval process.</p>
<hr />
<h3 id="heading-step-3-building-a-fallback-system">Step 3: Building a Fallback System</h3>
<p>To handle these edge cases, I added <strong>graceful filter fallback logic</strong> in the app layer:</p>
<ol>
<li><p>Try full filter (fileId + subject + topic)</p>
</li>
<li><p>If no match, try relaxed filter (fileId + subject)</p>
</li>
<li><p>If still nothing, fall back to pure vector search (just fileId, or even the entire namespace)</p>
</li>
</ol>
<p>This gave me:</p>
<ul>
<li><p>⚡ Speed when filters worked</p>
</li>
<li><p>🎯 Precision when filters were accurate</p>
</li>
<li><p>🔄 Resilience when filters failed</p>
</li>
</ul>
<p>Now even if the LLM guessed something too niche, I didn’t lose the result; I just searched wider.</p>
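<p>The fallback steps above can be sketched in Python like this. The <code>search</code> argument is an assumption: any callable that takes a query vector and a filter and returns a list of matches. <code>fake_search</code> is a toy stand-in for the vector DB so the example runs on its own.</p>
<pre><code class="lang-python">def search_with_fallback(search, query_vector, file_id, subject=None, topic=None):
    """Try progressively looser filters; return (matches, filter_used)."""
    candidates = []
    if subject and topic:
        candidates.append({"fileId": file_id, "subject": subject, "topic": topic})
    if subject:
        candidates.append({"fileId": file_id, "subject": subject})
    candidates.append({"fileId": file_id})  # widest step: file only

    for metadata_filter in candidates:
        matches = search(query_vector, metadata_filter)
        if matches:
            return matches, metadata_filter
    return [], candidates[-1]

# Toy stand-in for a vector DB: two chunks, exact-match filtering, no ranking.
CHUNKS = [
    {"id": "c1", "fileId": "f1", "subject": "Physics", "topic": "Sound"},
    {"id": "c2", "fileId": "f1", "subject": "Biology", "topic": "Photosynthesis"},
]

def fake_search(query_vector, metadata_filter):
    return [
        chunk for chunk in CHUNKS
        if all(chunk.get(key) == value for key, value in metadata_filter.items())
    ]

# The guessed topic "Music" does not exist, so the full filter finds nothing
# and the search relaxes to fileId + subject.
matches, used = search_with_fallback(
    fake_search, [0.1, 0.2], "f1", subject="Physics", topic="Music"
)
print(used, [chunk["id"] for chunk in matches])
</code></pre>
<p>Returning the filter that finally matched is optional, but it makes it easy to log how often the guessed filters had to be relaxed.</p>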
<hr />
<h3 id="heading-step-4-smarter-guesses-using-a-sidecar-metadata-db">Step 4: Smarter Guesses Using a Sidecar Metadata DB</h3>
<p>But I wasn’t done.<br />The LLM still occasionally hallucinated weird or off-topic guesses — like <code>"Music"</code> or <code>"Hearing"</code>, which weren’t present in my dataset.<br />So I took it further:</p>
<blockquote>
<p>What if the LLM could only choose from actual metadata I already have in the DB?</p>
</blockquote>
<p>So I built a small sidecar metadata index (basically a MongoDB collection, or even an in-memory cache) that stores all known:</p>
<ul>
<li><p>Subjects</p>
</li>
<li><p>Topics</p>
</li>
<li><p>Keywords (per file or per namespace)</p>
</li>
</ul>
<p>Now, before guessing filters, I preload those known values into the LLM prompt like:</p>
<pre><code class="lang-plaintext">Available subjects: [Physics, Chemistry, Biology]
Available topics: [Sound, Light, Motion, Laws of Motion, Photosynthesis]
</code></pre>
<p>And the LLM only guesses from this known list.<br />No more hallucinated filters. No more Music. Just sharp, valid filter predictions.</p>
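<p>A minimal Python sketch of this step is below. The prompt wording, the function names, and the JSON reply format are my own assumptions, and the LLM call itself is left out. The validation step is an extra safety net: even with the list in the prompt, a model can occasionally answer with something off-list, so anything unknown is dropped before it reaches the filter.</p>
<pre><code class="lang-python">import json

# Known values, as they might be loaded from the sidecar metadata index.
KNOWN = {
    "subject": ["Physics", "Chemistry", "Biology"],
    "topic": ["Sound", "Light", "Motion", "Laws of Motion", "Photosynthesis"],
}

def build_filter_prompt(query, known):
    return (
        "Pick the subject and topic that best match the question.\n"
        "Available subjects: [" + ", ".join(known["subject"]) + "]\n"
        "Available topics: [" + ", ".join(known["topic"]) + "]\n"
        'Reply with JSON like {"subject": "...", "topic": "..."}; '
        "use null when nothing fits.\n"
        "Question: " + query
    )

def validate_guess(raw_reply, known):
    """Keep only fields whose value is in the known list; drop the rest."""
    try:
        guess = json.loads(raw_reply)
    except json.JSONDecodeError:
        return {}
    if not isinstance(guess, dict):
        return {}
    return {
        field: guess[field]
        for field in ("subject", "topic")
        if guess.get(field) in known[field]
    }

prompt = build_filter_prompt("How does echo differ from reverberation?", KNOWN)

# Example reply containing an off-list topic: only the subject survives.
cleaned = validate_guess('{"subject": "Physics", "topic": "Music"}', KNOWN)
print(cleaned)
</code></pre>
<p>A guess that loses its topic this way simply becomes a subject-only filter, which lines up with the relaxed step of the fallback logic.</p>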
<hr />
<h3 id="heading-final-setup-and-why-i-love-it">Final Setup (and why I love it)</h3>
<p>Now my query pipeline looks like this:</p>
<ol>
<li><p><strong>User asks anything</strong>: simple, free-form</p>
</li>
<li><p><strong>LLM guesses</strong> <code>subject/topic</code> from a list of known values</p>
</li>
<li><p><strong>Filter-based vector search</strong> runs (fileId + guessed filters)</p>
</li>
<li><p><strong>Fallback if no results</strong></p>
</li>
<li><p><strong>Top chunks reranked</strong></p>
</li>
<li><p><strong>Answer generated via LLM</strong></p>
</li>
</ol>
<p>It’s smart. It’s resilient. It’s clean.</p>
<hr />
<h3 id="heading-final-thoughts">🤝 Final Thoughts</h3>
<p>This whole process turned out to be one of the most <strong>valuable architectural upgrades</strong> I’ve made.<br />It added <strong>intelligence to the system without sacrificing robustness</strong>, and it cost me no extra infra, just smarter orchestration.<br />If you’re building a RAG-based system with user-uploaded content, definitely consider:</p>
<ul>
<li><p>Storing chunk-level metadata</p>
</li>
<li><p>Letting LLMs predict filters</p>
</li>
<li><p>Building a fallback tree</p>
</li>
<li><p>Feeding known metadata into the LLM prompt</p>
</li>
</ul>
<p>You’ll end up with a system that feels 10x smarter — and your users won’t even know why.</p>
<h3 id="heading-ia">⚠️ A quick word of caution</h3>
<p>This kind of intelligent metadata-driven filtering is best suited for <strong>high-accuracy domains</strong> (like education, legal, healthcare) or apps operating at <strong>production scale</strong>, where precision, performance, and cost-efficiency truly matter.<br />If you're building a simple MVP or experimenting with RAG for the first time, you likely don't need this level of orchestration.<br />In those cases, a clean vector search with a good reranker is more than enough to get you started. Add this layer only when your app (or your users) demand it.</p>
]]></content:encoded></item></channel></rss>