RAG, embeddings, and vector databases enable LLMs to query external knowledge without fine-tuning
Retrieval-Augmented Generation (RAG) is the most practical technique for giving a local LLM access to a specific knowledge base without the cost and complexity of fine-tuning.
Core Components
Embeddings: dense vector representations of text that capture semantic meaning. Similar texts produce similar vectors. Common local models: nomic-embed-text (Ollama), all-MiniLM-L6-v2 (sentence-transformers). OpenAI equivalent: text-embedding-3-small.
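To make "similar texts produce similar vectors" concrete, the sketch below computes cosine similarity, a common closeness measure for embeddings. The three-dimensional vectors are made up for illustration; real embedding models emit hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", not output of any real model.
login_bug = [0.9, 0.1, 0.0]
auth_failure = [0.8, 0.2, 0.1]
invoice_export = [0.0, 0.2, 0.9]

# The two authentication-related texts should score closer to each other
# than either does to the unrelated invoice text.
print(cosine_similarity(login_bug, auth_failure))
print(cosine_similarity(login_bug, invoice_export))
```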
Vector Databases: store embeddings for fast similarity search. The query “what is the estimated effort for this issue type?” is itself embedded and matched against stored embeddings of past issues. Local options: ChromaDB (in-memory, simple), Qdrant (production-grade), Weaviate.
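As a rough illustration of what a vector database does, here is a brute-force in-memory store. It is a teaching sketch, not how ChromaDB or Qdrant are implemented (those use approximate-nearest-neighbour indexes and persistence). The class name `TinyVectorStore`, the issue IDs, and the payloads are all hypothetical.

```python
import heapq
import math


class TinyVectorStore:
    """Stores unit-normalised vectors and answers top-k cosine-similarity queries."""

    def __init__(self):
        self._rows = []  # list of (item_id, unit_vector, payload)

    @staticmethod
    def _normalise(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        if norm == 0:
            raise ValueError("zero vector cannot be normalised")
        return [x / norm for x in vector]

    def add(self, item_id, vector, payload=None):
        self._rows.append((item_id, self._normalise(vector), payload))

    def query(self, vector, k=3):
        """Return up to k (score, item_id, payload) tuples, best match first."""
        q = self._normalise(vector)
        scored = (
            (sum(a * b for a, b in zip(q, v)), item_id, payload)
            for item_id, v, payload in self._rows
        )
        # On unit vectors the dot product equals cosine similarity.
        return heapq.nlargest(k, scored, key=lambda row: row[0])


store = TinyVectorStore()
store.add("ISSUE-1", [1.0, 0.0], {"effort": "2d"})  # hypothetical past issues
store.add("ISSUE-2", [0.0, 1.0], {"effort": "5d"})
store.add("ISSUE-3", [0.9, 0.1], {"effort": "3d"})

for score, item_id, payload in store.query([1.0, 0.0], k=2):
    print(item_id, round(score, 3), payload)
```

The linear scan is acceptable for a few thousand items; beyond that, a dedicated vector database is the usual choice.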
RAG pipeline:
Document → Chunking → Embedding → Vector DB
↓
Query → Embedding → Similarity Search → Top-k Chunks
↓
Chunks + Query → LLM Prompt → Answer
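The three rows of the pipeline can be sketched end to end in a few functions. To keep the example runnable without a model server, `toy_embed` uses plain word counts as a stand-in for a real embedding model. That substitution, the chunk sizes, the prompt wording, and the sample text are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def chunk_text(text, size=12, overlap=4):
    """Split text into windows of `size` words that share `overlap` words with the previous window."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), size - overlap):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text):
    """Stand-in embedding: sparse word counts. A real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query, index, k=2):
    """Return the k chunks whose embeddings are most similar to the query embedding."""
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query, chunks):
    """Combine retrieved chunks and the query into a single LLM prompt."""
    context = "\n".join(f"- {c}" for c in chunks)
    return (
        "Use only the context below to answer.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


# Row 1: Document -> Chunking -> Embedding -> (in-memory) index
document = (
    "Password reset email is never delivered to users on the legacy domain. "
    "Invoice export produces an empty PDF when the date range spans two years."
)
index = [(chunk, toy_embed(chunk)) for chunk in chunk_text(document)]

# Rows 2 and 3: Query -> Embedding -> Top-k chunks -> Prompt
question = "Why is the password reset email missing?"
print(build_prompt(question, retrieve(question, index, k=1)))
```

The resulting prompt string is what would be sent to the LLM; the generation step itself is omitted here.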
PUMA Relevance
Stage 4 (Optional): RAG-enhanced triage agent. The agent retrieves the k most similar historical issues (from Jira SR) before classifying a new one. This tests whether historical context improves F1-macro beyond zero/few-shot prompting.
Research question: Does RAG retrieval from historical issues outperform few-shot prompting when historical labels are available?
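Answering this question means scoring both approaches with the same metric. The sketch below computes F1-macro by hand: per-class F1 averaged without class weighting, over every label that appears in either the true or the predicted labels. The label names and predictions are invented for illustration and are not PUMA results.

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical labels and predictions, for illustration only.
y_true = ["bug", "bug", "feature", "task", "task"]
few_shot_pred = ["bug", "feature", "feature", "task", "bug"]
rag_pred = ["bug", "bug", "feature", "task", "bug"]

print("few-shot F1-macro:", round(f1_macro(y_true, few_shot_pred), 3))
print("RAG F1-macro:     ", round(f1_macro(y_true, rag_pred), 3))
```

Because every class counts equally, a rare issue type that is always misclassified pulls the score down as much as a common one would.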
Expected complexity: high; risk: medium. Scheduled for Stage 4, provided Stages 1–3 are complete.
Local stack for PUMA RAG:
- Embeddings: nomic-embed-text via Ollama
- Vector store: ChromaDB (simple) or Qdrant (production)
- Retrieval: LlamaIndex or direct ChromaDB client
- LLM: Llama 3.1 8B via Ollama
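As a hedged sketch of the embedding step in this stack, the function below posts text to a locally running Ollama server using only the standard library. It assumes Ollama's default port 11434 and its `/api/embeddings` endpoint, which takes a `prompt` field and returns an `embedding` list; newer Ollama releases also offer `/api/embed`, so check the API reference for the installed version. The `urlopen` parameter is an illustrative addition so the function can be tested without a server.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/embeddings"


def embed(text, model="nomic-embed-text", urlopen=urllib.request.urlopen):
    """Return the embedding vector for `text` from a local Ollama server."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(request) as response:
        return json.loads(response.read())["embedding"]
```

With the server running and the model pulled, `embed("Login fails after password reset")` would return a list of floats that can be stored in ChromaDB or Qdrant alongside the issue ID.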
Key Distinction: RAG vs Fine-Tuning
| Approach | When | Cost | PUMA Use |
|---|---|---|---|
| Zero/Few-shot | No training data; at most a few in-context examples | $0 / token-only | Stages 1–3 |
| RAG | Knowledge base available | Storage + retrieval overhead | Stage 4 |
| Fine-tuning | Large labelled dataset, stable distribution | GPU training, weeks | Out of scope |
Cognitive Offloading Risk
RAG reduces cognitive load on the LLM by externalising knowledge — but this creates a dependency on retrieval quality. Poor chunking or embedding mismatch produces irrelevant context that degrades performance. PUMA Stage 4 must evaluate whether retrieval actually helps.
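One way to check retrieval quality separately from the LLM is to measure how often the retrieved neighbours carry the same label as the query issue. The sketch below computes precision@k for that purpose. Using label agreement as the relevance criterion is an assumption made for this example, and the labels are invented.

```python
def precision_at_k(retrieved_labels, true_label, k):
    """Fraction of the top-k retrieved issues whose label equals the query's true label."""
    top = retrieved_labels[:k]
    if not top:
        return 0.0
    return sum(1 for label in top if label == true_label) / len(top)


def mean_precision_at_k(runs, k):
    """Average precision@k over a list of (retrieved_labels, true_label) pairs."""
    if not runs:
        return 0.0
    return sum(precision_at_k(r, t, k) for r, t in runs) / len(runs)


# Hypothetical retrieval results: labels of retrieved issues, ranked best first.
runs = [
    (["bug", "bug", "task"], "bug"),
    (["feature", "bug", "feature"], "feature"),
    (["task", "bug", "bug"], "bug"),
]
print("mean precision@3:", round(mean_precision_at_k(runs, k=3), 3))
```

A low score here points to a chunking or embedding problem before any prompt changes are worth trying.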
References
- Lewis, P., et al. (2020). Retrieval-Augmented Generation for knowledge-intensive NLP tasks. NeurIPS 2020. arXiv:2005.11401
- LlamaIndex documentation: https://docs.llamaindex.ai
Related Notes
- PN-KeyConcepts-Agents-Reproducibility-RedTeam — Reproducibility risk with RAG
- PN-CoT-FewShot-Prompting — Few-shot vs RAG comparison
- PN-ReAct-AgentPattern — ReAct loop + RAG (Stage 4)
- PN-LLM-Local-vs-Cloud — Local RAG stack
- EX-Stages-Overview — Stage 4 RAG plan
- SP-Architecture — Layer 2 memory