PN: Context Engineering — Designing the LLM Context Window as a System

Core Idea

Context Engineering (CE) is the discipline of constructing the entire content that enters an LLM’s context window — including system prompts, retrieved knowledge, tool outputs, conversation history, and structured data — to maximise task performance. It is the 2026 successor to Prompt Engineering: where PE focuses on phrasing a single prompt, CE designs the information system that fills the context at inference time.

From Prompt Engineering to Context Engineering

Dimension	Prompt Engineering	Context Engineering
Scope	Single instruction or query	Entire context window pipeline
Unit of design	Prompt text	Information architecture
Dynamic content	Rare (few-shot examples)	Central (RAG, memory, tool calls)
Primary lever	Wording and structure	What to include, what to exclude
Failure mode	Bad phrasing	Context pollution, retrieval noise
Year of maturity	2023	2025–2026

The key shift: the model is fixed; the context is the engineering artefact.

The Six Slots of a Context Window

A production agent context typically fills six conceptual slots:

┌─────────────────────────────────────────────┐
│ 1. SYSTEM PROMPT      (role, rules, persona) │
│ 2. RETRIEVED KNOWLEDGE (RAG / vector search) │
│ 3. TOOL DEFINITIONS   (function schemas)     │
│ 4. CONVERSATION HISTORY (prior turns)        │
│ 5. WORKING MEMORY     (scratch-pad, state)   │
│ 6. USER QUERY         (current task)         │
└─────────────────────────────────────────────┘

CE decisions:

What to include in each slot at this inference step
How much (token budget per slot)
In what order (recency bias; position affects attention)
When to flush (conversation compaction, memory eviction)

Context Engineering Strategies

1. System Prompt Design

Encode role, constraints, output format, and persona here
Use CO-STAR structure: Context → Objective → Style → Tone → Audience → Response format
Keep stable between turns; avoid redundancy with retrieved content

2. Retrieval-Augmented Generation (RAG)

Retrieve top-k documents from vector store at query time
Filter by relevance score threshold (cosine similarity > 0.75)
Summarise or truncate retrieved chunks to fit token budget
Rerank using cross-encoder or LLM-as-judge before injection

3. Working Memory (Scratch-Pad)

Maintains agent state across multi-step tasks
Implemented as a JSON block in the context updated after each action
Eviction policy: keep last N entries; summarise older entries via LLM

4. Conversation History Management

Sliding window: keep last k turns; discard oldest
Summarisation: compress old turns into a summary paragraph
Selective retention: keep turns flagged as “important” regardless of age
Anthropic’s recommended pattern: summarise once context hits 80% of window

5. Tool Output Integration

Tool outputs replace the tool call in the context (assistant turn format)
Truncate large outputs (e.g., database results) to relevant fields only
Error outputs from tools must be included — they inform retry logic

6. Token Budget Allocation

Recommended allocation for a 128k-token window:

Slot	Budget	Notes
System prompt	2–5k	Stable; amortised by caching
Retrieved knowledge	30–60k	Largest variable component
Tool definitions	1–3k	Compressed JSON schemas
Conversation history	20–40k	Managed by compaction
Working memory	2–5k	Compact JSON state
User query + response	5–10k	Current turn

Context Pollution — The Central Failure Mode

Context Pollution

Injecting irrelevant or noisy content into the context degrades LLM performance more than slightly incorrect wording. A retrieval pipeline that returns off-topic chunks causes more errors than a poorly-phrased prompt.

Common sources of pollution:

Retrieval failures: Top-k returning thematically similar but semantically irrelevant chunks
Tool verbosity: Full API responses injected without filtering
History accumulation: Old conversation turns contradicting current state
Prompt redundancy: Same instruction stated multiple times in different slots

Mitigation strategies:

MMR (Maximal Marginal Relevance) retrieval: balance relevance + diversity
Structured summarisation pipelines before injection
Context audits: log token usage per slot per inference step

PUMA Application

Stage 1–4 (Triage & Prioritisation)

Each issue triage call fills the context as:

context = {
    "system": PUMA_SYSTEM_PROMPT,           # Role + output schema
    "few_shot_examples": retrieve_similar(issue, k=3),  # RAG
    "working_memory": agent_state,           # Prior episode summary
    "task": format_issue(issue)              # Current Jira issue
}

Key CE decisions in PUMA:

k=3 few-shot examples retrieved by similarity to current issue (not random)
Issue fields filtered: title, description, labels, priority; exclude changelog, comments
Reflexion memory appended as a working-memory slot (last 3 critiques)
Output format prescribed in system prompt (JSON schema)

Stage 5 (Smart PMO — MCP Integration)

MCP tools are injected as tool definitions. Context budget splits:

40% for tool outputs (Jira API results, GitHub data)
30% for conversation history (multi-turn PM queries)
20% for retrieved domain knowledge (PUMA policies, project history)
10% for system + query

Relation to Other Techniques

Technique	Relationship to CE
RAG	One slot in the CE pipeline (retrieved knowledge)
CO-STAR	Template for the system prompt slot
Reflexion	CE manages episodic memory slot (prior critiques)
CoT / ToT	Reasoning patterns for the response generation phase
MCP	Protocol for tool definition + tool output slots
LoRA fine-tuning	Modifies the model, not the context; complementary to CE

Metrics for CE Quality

Metric	Measurement
Context utilisation rate	Relevant tokens / total tokens injected
Retrieval precision@k	Relevant chunks in top-k / k
Token budget adherence	Actual token usage vs. budget per slot
Cache hit rate	Stable-slot cache hits (system prompt amortisation)
Task performance delta	F1 / MAE improvement from CE design vs. naive context

PN-COSTAR-SelfConsistency — system prompt design framework
PN-RAG-Embeddings-VectorDB — retrieval slot implementation
PN-Reflexion-SelfCritique — episodic memory slot
PN-MCP-ModelContextProtocol — tool definition + output slots
LN-Videos-Context-Engineering — video references

PUMA Vault

Explorador

Context Engineering — Designing the LLM Context Window as a System

PN: Context Engineering — Designing the LLM Context Window as a System

From Prompt Engineering to Context Engineering

The Six Slots of a Context Window

Context Engineering Strategies

1. System Prompt Design

2. Retrieval-Augmented Generation (RAG)

3. Working Memory (Scratch-Pad)

4. Conversation History Management

5. Tool Output Integration

6. Token Budget Allocation

Context Pollution — The Central Failure Mode

PUMA Application

Stage 1–4 (Triage & Prioritisation)

Stage 5 (Smart PMO — MCP Integration)

Relation to Other Techniques

Metrics for CE Quality

MOCs

Vista Gráfica

Tabla de Contenidos

Retroenlaces

PUMA Vault

Explorador

Context Engineering — Designing the LLM Context Window as a System

PN: Context Engineering — Designing the LLM Context Window as a System

From Prompt Engineering to Context Engineering

The Six Slots of a Context Window

Context Engineering Strategies

1. System Prompt Design

2. Retrieval-Augmented Generation (RAG)

3. Working Memory (Scratch-Pad)

4. Conversation History Management

5. Tool Output Integration

6. Token Budget Allocation

Context Pollution — The Central Failure Mode

PUMA Application

Stage 1–4 (Triage & Prioritisation)

Stage 5 (Smart PMO — MCP Integration)

Relation to Other Techniques

Metrics for CE Quality

Related Notes

MOCs

Vista Gráfica

Tabla de Contenidos

Retroenlaces