PN: Context Engineering — Designing the LLM Context Window as a System

Core Idea

Context Engineering (CE) is the discipline of constructing the entire content that enters an LLM’s context window — including system prompts, retrieved knowledge, tool outputs, conversation history, and structured data — to maximise task performance. It is the 2026 successor to Prompt Engineering: where PE focuses on phrasing a single prompt, CE designs the information system that fills the context at inference time.


From Prompt Engineering to Context Engineering

DimensionPrompt EngineeringContext Engineering
ScopeSingle instruction or queryEntire context window pipeline
Unit of designPrompt textInformation architecture
Dynamic contentRare (few-shot examples)Central (RAG, memory, tool calls)
Primary leverWording and structureWhat to include, what to exclude
Failure modeBad phrasingContext pollution, retrieval noise
Year of maturity20232025–2026

The key shift: the model is fixed; the context is the engineering artefact.


The Six Slots of a Context Window

A production agent context typically fills six conceptual slots:

┌─────────────────────────────────────────────┐
│ 1. SYSTEM PROMPT      (role, rules, persona) │
│ 2. RETRIEVED KNOWLEDGE (RAG / vector search) │
│ 3. TOOL DEFINITIONS   (function schemas)     │
│ 4. CONVERSATION HISTORY (prior turns)        │
│ 5. WORKING MEMORY     (scratch-pad, state)   │
│ 6. USER QUERY         (current task)         │
└─────────────────────────────────────────────┘

CE decisions:

  • What to include in each slot at this inference step
  • How much (token budget per slot)
  • In what order (recency bias; position affects attention)
  • When to flush (conversation compaction, memory eviction)

Context Engineering Strategies

1. System Prompt Design

  • Encode role, constraints, output format, and persona here
  • Use CO-STAR structure: Context → Objective → Style → Tone → Audience → Response format
  • Keep stable between turns; avoid redundancy with retrieved content

2. Retrieval-Augmented Generation (RAG)

  • Retrieve top-k documents from vector store at query time
  • Filter by relevance score threshold (cosine similarity > 0.75)
  • Summarise or truncate retrieved chunks to fit token budget
  • Rerank using cross-encoder or LLM-as-judge before injection

3. Working Memory (Scratch-Pad)

  • Maintains agent state across multi-step tasks
  • Implemented as a JSON block in the context updated after each action
  • Eviction policy: keep last N entries; summarise older entries via LLM

4. Conversation History Management

  • Sliding window: keep last k turns; discard oldest
  • Summarisation: compress old turns into a summary paragraph
  • Selective retention: keep turns flagged as “important” regardless of age
  • Anthropic’s recommended pattern: summarise once context hits 80% of window

5. Tool Output Integration

  • Tool outputs replace the tool call in the context (assistant turn format)
  • Truncate large outputs (e.g., database results) to relevant fields only
  • Error outputs from tools must be included — they inform retry logic

6. Token Budget Allocation

Recommended allocation for a 128k-token window:

SlotBudgetNotes
System prompt2–5kStable; amortised by caching
Retrieved knowledge30–60kLargest variable component
Tool definitions1–3kCompressed JSON schemas
Conversation history20–40kManaged by compaction
Working memory2–5kCompact JSON state
User query + response5–10kCurrent turn

Context Pollution — The Central Failure Mode

Context Pollution

Injecting irrelevant or noisy content into the context degrades LLM performance more than slightly incorrect wording. A retrieval pipeline that returns off-topic chunks causes more errors than a poorly-phrased prompt.

Common sources of pollution:

  1. Retrieval failures: Top-k returning thematically similar but semantically irrelevant chunks
  2. Tool verbosity: Full API responses injected without filtering
  3. History accumulation: Old conversation turns contradicting current state
  4. Prompt redundancy: Same instruction stated multiple times in different slots

Mitigation strategies:

  • MMR (Maximal Marginal Relevance) retrieval: balance relevance + diversity
  • Structured summarisation pipelines before injection
  • Context audits: log token usage per slot per inference step

PUMA Application

Stage 1–4 (Triage & Prioritisation)

Each issue triage call fills the context as:

context = {
    "system": PUMA_SYSTEM_PROMPT,           # Role + output schema
    "few_shot_examples": retrieve_similar(issue, k=3),  # RAG
    "working_memory": agent_state,           # Prior episode summary
    "task": format_issue(issue)              # Current Jira issue
}

Key CE decisions in PUMA:

  • k=3 few-shot examples retrieved by similarity to current issue (not random)
  • Issue fields filtered: title, description, labels, priority; exclude changelog, comments
  • Reflexion memory appended as a working-memory slot (last 3 critiques)
  • Output format prescribed in system prompt (JSON schema)

Stage 5 (Smart PMO — MCP Integration)

MCP tools are injected as tool definitions. Context budget splits:

  • 40% for tool outputs (Jira API results, GitHub data)
  • 30% for conversation history (multi-turn PM queries)
  • 20% for retrieved domain knowledge (PUMA policies, project history)
  • 10% for system + query

Relation to Other Techniques

TechniqueRelationship to CE
RAGOne slot in the CE pipeline (retrieved knowledge)
CO-STARTemplate for the system prompt slot
ReflexionCE manages episodic memory slot (prior critiques)
CoT / ToTReasoning patterns for the response generation phase
MCPProtocol for tool definition + tool output slots
LoRA fine-tuningModifies the model, not the context; complementary to CE

Metrics for CE Quality

MetricMeasurement
Context utilisation rateRelevant tokens / total tokens injected
Retrieval precision@kRelevant chunks in top-k / k
Token budget adherenceActual token usage vs. budget per slot
Cache hit rateStable-slot cache hits (system prompt amortisation)
Task performance deltaF1 / MAE improvement from CE design vs. naive context

MOCs