RAG, embeddings, and vector databases enable LLMs to query external knowledge without fine-tuning

Retrieval-Augmented Generation (RAG) is the most practical technique for giving a local LLM access to a specific knowledge base without the cost and complexity of fine-tuning.


Core Components

Embeddings: dense vector representations of text that capture semantic meaning. Similar texts produce similar vectors. Common local models: nomic-embed-text (Ollama), all-MiniLM-L6-v2 (sentence-transformers). OpenAI equivalent: text-embedding-3-small.
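To make "similar texts produce similar vectors" concrete, here is a minimal sketch of cosine similarity, the measure most commonly used to compare embeddings. The three vectors are hand-made 3-dimensional stand-ins, not output from nomic-embed-text or any other model; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Illustrative values only (assumption): two related issues and one unrelated one.
login_bug = [0.9, 0.1, 0.0]
auth_failure = [0.8, 0.2, 0.1]
invoice_export = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(login_bug, auth_failure)
sim_unrelated = cosine_similarity(login_bug, invoice_export)
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

The related pair scores close to 1 and the unrelated pair close to 0, which is the property similarity search relies on.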

Vector Databases: store embeddings for fast similarity search. The query “what is the estimated effort for this issue type?” is itself embedded and matched against stored embeddings of past issues. Local options: ChromaDB (in-memory, simple), Qdrant (production-grade), Weaviate.
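What a vector database does at query time can be sketched as a brute-force in-memory store. This is not how ChromaDB, Qdrant, or Weaviate are implemented (they use approximate nearest-neighbour indexes); the class name, issue IDs, vectors, and effort values below are all hypothetical.

```python
import math
from dataclasses import dataclass, field


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


@dataclass
class TinyVectorStore:
    """Keeps (id, vector, payload) rows and scans all of them on every query."""
    rows: list = field(default_factory=list)

    def add(self, item_id, vector, payload):
        self.rows.append((item_id, vector, payload))

    def query(self, vector, k=2):
        # Score every stored row against the query, then keep the k best.
        scored = [(_cosine(vector, v), item_id, payload) for item_id, v, payload in self.rows]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:k]


store = TinyVectorStore()
# Hand-made vectors standing in for embeddings of past issues (assumption).
store.add("ISSUE-1", [0.9, 0.1, 0.0], {"summary": "Login fails after password reset", "effort_days": 2})
store.add("ISSUE-2", [0.8, 0.3, 0.1], {"summary": "SSO token expires too early", "effort_days": 3})
store.add("ISSUE-3", [0.0, 0.2, 0.9], {"summary": "Invoice export produces empty CSV", "effort_days": 1})

query_vector = [0.85, 0.2, 0.05]  # stand-in for the embedded query
hits = store.query(query_vector, k=2)
for score, item_id, payload in hits:
    print(f"{item_id}  score={score:.3f}  effort={payload['effort_days']}d")
```

A linear scan like this is fine for a few thousand issues; dedicated vector stores matter once the collection or the query rate grows.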

RAG pipeline:

Document → Chunking → Embedding → Vector DB
              ↓
Query → Embedding → Similarity Search → Top-k Chunks
              ↓
Chunks + Query → LLM Prompt → Answer
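The three rows of the diagram can be sketched end to end in plain Python. Two simplifications are assumptions of this sketch: chunking splits on sentence boundaries, and the "embedding" is a bag-of-words count rather than a learned model. The final LLM call is omitted; the sketch stops at the assembled prompt.

```python
import math
import re
from collections import Counter


def chunk(text):
    """Split a document into one chunk per sentence."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def embed(text):
    """Toy embedding: lowercase word counts. A real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, index, k=2):
    """Return the k chunk texts most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query, chunks):
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Use only the context below to answer.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"


# Row 1: Document -> Chunking -> Embedding -> (in-memory) index
document = (
    "Password reset emails are sent by the auth service. "
    "Invoices are exported nightly as CSV files by the billing job. "
    "The billing job retries failed exports three times before alerting."
)
index = [(c, embed(c)) for c in chunk(document)]

# Row 2: Query -> Embedding -> Similarity Search -> Top-k Chunks
query = "How are invoices exported?"
top_chunks = retrieve(query, index, k=2)

# Row 3: Chunks + Query -> LLM Prompt
prompt = build_prompt(query, top_chunks)
print(prompt)
```

Note that the third sentence scores zero here because "exports" and "exported" are different tokens to a bag-of-words model; a learned embedding model is meant to close exactly this kind of gap.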

PUMA Relevance

Stage 4 (Optional): RAG-enhanced triage agent. The agent retrieves the k most similar historical issues (from Jira SR) before classifying a new one. This tests whether historical context improves F1-macro beyond zero/few-shot prompting.

Research question: Does RAG retrieval from historical issues outperform few-shot prompting when historical labels are available?
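Answering this question comes down to comparing F1-macro for two sets of predictions on the same test issues. A minimal sketch of the metric follows; the label lists are invented toy data used only to exercise the function, not PUMA results, and `run_a` / `run_b` are placeholder names for any two configurations.

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (assumption): gold labels and predictions from two hypothetical runs.
gold = ["bug", "bug", "feature", "task", "bug", "feature"]
run_a = ["bug", "feature", "feature", "bug", "bug", "feature"]
run_b = ["bug", "bug", "feature", "task", "bug", "bug"]

score_a = f1_macro(gold, run_a)
score_b = f1_macro(gold, run_b)
print(f"run_a F1-macro={score_a:.3f}  run_b F1-macro={score_b:.3f}")
```

Because macro averaging weights every class equally, a run that never predicts a rare class (as `run_a` does with "task") is penalised heavily, which is usually the desired behaviour for imbalanced triage labels.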

Expected complexity: high; risk: medium. Scheduled for Stage 4, and only if Stages 1–3 are complete.

Local stack for PUMA RAG:

  • Embeddings: nomic-embed-text via Ollama
  • Vector store: ChromaDB (simple) or Qdrant (production)
  • Retrieval: LlamaIndex or direct ChromaDB client
  • LLM: Llama 3.1 8B via Ollama (Llama 3.2 text models ship only in 1B and 3B sizes)

Key Distinction: RAG vs Fine-Tuning

Approach        When                                          Cost                           PUMA Use
Zero/Few-shot   No training data in context                   $0 / token-only                Stages 1–3
RAG             Knowledge base available                      Storage + retrieval overhead   Stage 4
Fine-tuning     Large labelled dataset, stable distribution   GPU training, weeks            Out of scope

Cognitive Offloading Risk

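RAG reduces cognitive load on the LLM by externalising knowledge, but this makes answer quality depend on retrieval quality. Poor chunking or a mismatch between the embedding model and the domain vocabulary returns irrelevant context, which can degrade performance below the no-retrieval baseline. PUMA Stage 4 must therefore measure whether retrieval actually improves F1-macro over zero/few-shot prompting rather than assume it does.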
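One possible way to inspect retrieval quality separately from the LLM is to check how often the retrieved historical issues carry the same label as the new issue. The sketch below is an assumption about how such a diagnostic could look, not a defined part of PUMA; the function name and the toy labels are hypothetical.

```python
def label_agreement_at_k(gold_labels, retrieved_labels, k):
    """Mean fraction of the top-k retrieved issues whose label equals the query's gold label."""
    per_query = []
    for gold, neighbours in zip(gold_labels, retrieved_labels):
        top = neighbours[:k]
        if top:
            per_query.append(sum(label == gold for label in top) / len(top))
    return sum(per_query) / len(per_query) if per_query else 0.0


# Toy data (assumption): gold label per query, and the labels of its retrieved
# neighbours ordered from most to least similar.
gold_labels = ["bug", "feature"]
retrieved_labels = [
    ["bug", "bug", "task"],
    ["task", "feature", "bug"],
]

agreement_at_3 = label_agreement_at_k(gold_labels, retrieved_labels, k=3)
agreement_at_1 = label_agreement_at_k(gold_labels, retrieved_labels, k=1)
print(f"agreement@3={agreement_at_3:.2f}  agreement@1={agreement_at_1:.2f}")
```

A low agreement score would suggest that chunking or the embedding model is the bottleneck, independently of how the LLM uses the retrieved context.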


References

  • Lewis, P., et al. (2020). Retrieval-Augmented Generation for knowledge-intensive NLP tasks. NeurIPS 2020. arXiv:2005.11401
  • LlamaIndex documentation: https://docs.llamaindex.ai
