Skip to main content

Command Palette

Search for a command to run...

Making LLMs Efficient for Survey Cleaning: My Journey from Arrays to Choice Maps

Published
β€’4 min read
Making LLMs Efficient for Survey Cleaning: My Journey from Arrays to Choice Maps

Problem Statement

I was building an AI pipeline to clean survey responses. The data structure was like this:

Sample Question:

{
  "id": 3271,
  "text": "How satisfied are you with our service?",
  "choices": [
    { "id": 1, "label": "Very Satisfied" },
    { "id": 2, "label": "Neutral" },
    { "id": 3, "label": "Dissatisfied" }
  ]
}

Sample Response:

{
  "responseId": 1001,
  "responses": [
    { "questionId": 3271, "responses": "2" }
  ]
}

Simple na? The user selected 2, meaning "Neutral".
Now, when sending batches of survey responses to LLM for cleaning and fraud detection, I had a big question in mind: How to send questions and responses efficiently without wasting tokens and making model slow?


My Thought Process

Initially, I thought - "Aree yaar, just send the full questions array and responses array. Simple."
So I was packing:

  • Full questions (with choices array)

  • Full responses (with choiceIds)

But slowly I realised...

Every batch was sending the same choices again and again. Every user response needed LLM to read question choices, scan array, match choiceId.
Even a small survey was eating 2k-3k tokens easily just for system context!
Then I thought:

"What if instead of sending same data again and again, I somehow make the choice lookup easier for the model?"


I had explored three Options

Option 1: Keep Choices as Array (Default)

  • Each question has choices: [{ id, label }] array.

  • Response uses choiceId.

  • LLM scans array to match.

Pros: Tiny initial payload.

Cons:

  • Model has to do O(n) array scanning.

  • Slow reasoning.

  • Wastes attention and tokens if survey grows.

(Imagine scanning 10 choices manually every time β€” uff..)


Option 2: Expand Label Inside Every Response

  • Instead of sending choiceId, I replace it with "Neutral", "Dissatisfied", etc.

  • Responses directly readable by model.

Pros: Fast LLM understanding.

Cons:

  • Response size doubles or triples.

  • Huge token waste.

  • Not good for 10k+ responses batch.

(At small scale ok, but at big scale β€” πŸͺ¦RIP tokens!)


Option 3: Prebuilt Choice Map per Question

  • Build a map like:
{
  "3271": {
    "1": "Very Satisfied",
    "2": "Neutral",
    "3": "Dissatisfied"
  }
}
  • Response stays as choiceId ("2").

  • LLM just does O(1) lookup using map.

Pros:

  • One-time small cost.

  • Fastest reasoning.

  • Smallest token usage long term.

  • Bulletproof at 100k, 1M responses scale.

Cons:

  • Slightly more work backend-side to generate map.

(But haan yaar... once done, clean and scalable!)


Final Flow

Survey Questions (choices array)
            ↓
Preprocess into Choice Map (one time)
            ↓
Store Choice Map in System Context
            ↓
Send Responses with choiceId only
            ↓
LLM does O(1) lookup from Map
            ↓
Efficient fraud detection and response validation

βœ… Pucho advantages kya hai ?

  • No duplicate choices in every batch.

  • No ballooning of response size.

  • No array scanning overhead for LLM.


Key Benefits

ApproachToken UsageLLM SpeedScale Readiness
Choices as ArrayMediumMediumOk only for small surveys
Expanded LabelsHighFastVery costly at scale
Prebuilt Choice MapLowFastestBest for 100k+ responses

πŸ’‘ Final Thought

Sometimes, small design decisions, like, whether to send a list vs a map, matter A LOT when you want to scale cleanly.
I learned this by thinking deeply from the angle of:

  • Token cost

  • LLM cognitive load

  • Real-world scaling for lakhs of survey responses


TL;DR

This idea is not only for surveys! It can be applied wherever structured choices are involved.
Some real examples:

  • Auto-grading MCQ exams at scale (education apps).

  • Screening candidate forms in HRTech startups.

  • Cleaning healthcare intake forms efficiently.

  • Processing ecommerce customer feedback forms cheaply.

  • Analyzing product satisfaction surveys in SaaS platforms.

Main benefits of using Maps in AI pipelines:

  • βœ… Save massive tokens.

  • βœ… Make LLM think faster.

  • βœ… Scale to millions of records easily.

  • βœ… Keep backend and API payloads clean and simple.

Thanks for reading! πŸ™
If you're building AI pipelines like this, comment your thoughts and approaches.

Efficient LLMs for Survey Cleaning