PN: Prompting Frameworks — CO-STAR, Self-Consistency, and Structured Prompting

## Core Idea

Prompting is PUMA’s primary method for controlling LLM behavior without fine-tuning. This note synthesizes the key prompting frameworks used in PUMA experiments: CO-STAR for prompt structure, Chain-of-Thought for reasoning, Few-Shot for reference class injection, Self-Consistency for reliability, and structured output strategies for JSON compliance.

## CO-STAR Framework

CO-STAR is a structured prompt design pattern that organizes a prompt into six components:

| Component | Abbreviation | Purpose | PUMA Example |
|---|---|---|---|
| Context | C | Background information and task framing | "You are a senior project manager analyzing Jira issues from a software project." |
| Objective | O | The specific task to perform | "Classify the following issue into one of: Bug, Feature, Task, Improvement, Sub-task." |
| Style | S | Writing style, voice, persona | "Respond in precise, technical language. Use engineering judgment." |
| Tone | T | Emotional register | "Be concise and confident. Avoid hedging unless genuinely uncertain." |
| Audience | A | Who will use the output | "Output will be parsed programmatically. Use only the specified JSON schema." |
| Response format | R | Exact output specification | `{"type": "Bug", "priority": "High", "confidence": 0.87, "rationale": "..."}` |
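The six components can also be assembled programmatically. The following is a minimal sketch, not PUMA's actual prompt builder: the `CoStarPrompt` class and its `render` method are hypothetical names introduced here, and the bold section labels are just one possible layout. The component texts are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoStarPrompt:
    """The six CO-STAR components (hypothetical container for this sketch)."""
    context: str
    objective: str
    style: str
    tone: str
    audience: str
    response_format: str

    def render(self) -> str:
        """Join the components into one prompt with labeled sections."""
        sections = [
            ("Context", self.context),
            ("Objective", self.objective),
            ("Style", self.style),
            ("Tone", self.tone),
            ("Audience", self.audience),
            ("Response format", self.response_format),
        ]
        return "\n\n".join(f"**{label}**: {text}" for label, text in sections)


prompt = CoStarPrompt(
    context="You are a senior project manager analyzing Jira issues from a software project.",
    objective="Classify the following issue into one of: Bug, Feature, Task, Improvement, Sub-task.",
    style="Respond in precise, technical language. Use engineering judgment.",
    tone="Be concise and confident. Avoid hedging unless genuinely uncertain.",
    audience="Output will be parsed programmatically. Use only the specified JSON schema.",
    response_format='{"type": "Bug", "priority": "High", "confidence": 0.87, "rationale": "..."}',
).render()
```

Keeping the components as separate fields makes it straightforward to vary one of them (for example, Tone) between experimental conditions while holding the rest fixed.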

### CO-STAR Prompt Template for PUMA H1

**Context**: You are an automated project management agent analyzing software issues from a Jira repository. The repository uses standard Agile issue types: Bug, Feature, Task, Improvement, Sub-task, Epic.

**Objective**: Classify the provided issue into one of the above types. Also assign a priority (Critical, High, Medium, Low) and identify the most relevant component (API, UI, Backend, Database, Infrastructure, Testing, Documentation).

**Style**: Use precise software engineering terminology. Base decisions on the issue title, description, labels, and resolution fields.

**Tone**: Analytical. If uncertain, state your confidence level and the competing classifications.

**Audience**: Your output will be parsed by a Python script. Non-conforming JSON will cause a pipeline failure.

**Response format**:
```json
{
  "type": "Bug",
  "priority": "High",
  "component": "API",
  "confidence": 0.87,
  "rationale": "Issue describes a null pointer exception in the payment processing API..."
}
```

---

## Chain-of-Thought (CoT) Prompting

**Wei et al., 2022** — *Chain-of-thought prompting elicits reasoning in large language models* (NeurIPS 2022)

CoT prompts the model to "think step by step" before producing a final answer, improving accuracy on multi-step reasoning tasks.

### Zero-Shot CoT

Add "Let's think step by step." to the prompt:

Classify this issue and estimate its story points. Let's think step by step.

Issue: "Authentication service returns 500 error when OAuth token expires during a batch job"

Step 1 (Issue type): This is a failure in production functionality → Bug
Step 2 (Priority): Affects running batch jobs → High
Step 3 (Component): Authentication service → Backend/API
Step 4 (Effort): Token refresh logic fix, need to trace the 500 path, write test → ~5 SP


### Few-Shot CoT

Provide complete worked examples including reasoning traces:

Example 1:
Issue: "Add CSV export to the reporting dashboard"
Reasoning: New user-requested functionality → Feature. No urgency signals → Medium priority. Similar past issues: "Add PDF export" (5 SP), "Add Excel export" (3 SP). This adds CSV with custom columns → 5 SP.
Output: {"type": "Feature", "priority": "Medium", "sp": 5}

Now classify:
Issue: "Add XLSX export with pivot table support to reporting dashboard"


**PUMA**: Few-Shot CoT is the primary experimental condition for H1/H2; expected to outperform zero-shot by 8–15% Macro-F1.
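Both CoT variants can be produced by one small helper. This is an illustrative sketch only: `build_cot_prompt`, `COT_TRIGGER`, and the example dictionary layout (`issue`, `reasoning`, `output`) are assumptions made for this note, not an existing PUMA interface.

```python
COT_TRIGGER = "Let's think step by step."


def build_cot_prompt(issue, examples=()):
    """Build a CoT prompt; with no examples this is the zero-shot variant.

    Each example is a dict with "issue", "reasoning", and "output" keys
    (a hypothetical layout chosen for this sketch).
    """
    parts = []
    for i, ex in enumerate(examples, start=1):
        parts.append(
            f"Example {i}:\n"
            f'Issue: "{ex["issue"]}"\n'
            f"Reasoning: {ex['reasoning']}\n"
            f"Output: {ex['output']}"
        )
    if examples:
        # Few-shot: the worked reasoning traces replace the trigger phrase
        parts.append(f'Now classify:\nIssue: "{issue}"')
    else:
        # Zero-shot: append the trigger phrase to the task statement
        parts.append(
            f"Classify this issue and estimate its story points. {COT_TRIGGER}\n\n"
            f'Issue: "{issue}"'
        )
    return "\n\n".join(parts)


examples = [{
    "issue": "Add CSV export to the reporting dashboard",
    "reasoning": "New user-requested functionality -> Feature. No urgency signals -> Medium priority.",
    "output": '{"type": "Feature", "priority": "Medium", "sp": 5}',
}]
few_shot = build_cot_prompt(
    "Add XLSX export with pivot table support to reporting dashboard", examples
)
zero_shot = build_cot_prompt(
    "Authentication service returns 500 error when OAuth token expires during a batch job"
)
```

Passing the examples as data rather than hard-coding them in the prompt text makes it easy to switch between the 3-example and 6-example conditions.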

---

## Self-Consistency

**Wang et al., 2023** — *Self-consistency improves chain of thought reasoning in language models* (ICLR 2023)

Instead of greedy decoding (single deterministic output), sample k independent reasoning paths and aggregate by majority vote.

### Algorithm

1. Set temperature > 0 (e.g., 0.7) to enable sampling diversity
2. Generate k = 5–10 independent responses to the same prompt
3. Extract the final answer from each response
4. Return the majority-vote answer

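```python
import json
from collections import Counter


def parse_json(text):
    """Return the decoded JSON value, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return None


def self_consistent_classify(issue, model, k=5):
    """Sample k responses and return the majority-vote issue type.

    `model.generate` and `build_prompt` are assumed to be defined
    elsewhere in the pipeline.
    """
    votes = []
    for _ in range(k):
        result = model.generate(
            prompt=build_prompt(issue),
            temperature=0.7,  # > 0 so the k samples can differ
            max_tokens=512,
        )
        parsed = parse_json(result)
        # Skip responses that fail to parse or lack a "type" field
        if isinstance(parsed, dict) and "type" in parsed:
            votes.append(parsed["type"])

    if votes:
        return Counter(votes).most_common(1)[0][0]
    return None  # no parseable response
```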

### When to Use Self-Consistency in PUMA

| Condition | Use Self-Consistency? |
|---|---|
| Standard triage experiment | No: use temperature=0 for reproducibility |
| Ambiguous issues (confidence < 0.75) | Yes: k=5 sampling reduces variance |
| Human-in-the-loop review queue | Yes: use confidence of majority vote as escalation signal |
| Production deployment | Yes: trades latency for reliability |

Tradeoff: inference cost scales with k (5× at k=5), so self-consistency is not suitable for latency-sensitive, real-time pipelines.
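Using the majority vote as an escalation signal can be sketched as follows. The function name `majority_vote` and the 0.8 default threshold are illustrative assumptions. "Agreement" here means the winning label's share of the sampled votes, which is one plausible reading of "confidence of majority vote".

```python
from collections import Counter


def majority_vote(labels, escalate_below=0.8):
    """Return (winner, agreement, escalate) for a list of sampled labels.

    agreement is the winner's share of the votes; escalate is True when
    that share falls below the threshold (0.8 is an illustrative default).
    """
    if not labels:
        return None, 0.0, True  # nothing parseable: always escalate
    winner, count = Counter(labels).most_common(1)[0]
    agreement = count / len(labels)
    return winner, agreement, agreement < escalate_below


clear = majority_vote(["Bug", "Bug", "Bug", "Improvement", "Bug"])
split = majority_vote(["Bug", "Task", "Bug", "Improvement", "Task"])
```

In the second call the vote is tied between two labels. `Counter.most_common` breaks ties by first occurrence, but the low agreement score routes the issue to human review either way, so the arbitrary tie-break does not decide the outcome.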

## Structured Output Strategies

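### JSON Mode (OpenAI)

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    # JSON mode requires the word "JSON" to appear somewhere in the messages
    messages=[{"role": "user", "content": prompt}],
)
```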

Forces model output to be syntactically valid JSON, but does not check it against a schema. Reduces parsing failures, as measured by the Successful Parsing Rate (SPR), by ~85%.
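Because JSON mode only constrains syntax, a schema check is still needed before the result enters the pipeline. The sketch below uses only the standard library. `parse_triage`, `successful_parsing_rate`, and the two `ALLOWED_*` sets are hypothetical names, and SPR is assumed here to mean the fraction of responses that parse into schema-conforming JSON (the authoritative definition lives in the evaluation-metrics note).

```python
import json

ALLOWED_TYPES = {"Bug", "Feature", "Task", "Improvement", "Sub-task"}
ALLOWED_PRIORITIES = {"Critical", "High", "Medium", "Low"}


def parse_triage(raw):
    """Return the parsed dict if raw is schema-conforming JSON, else None."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(data, dict):
        return None
    issue_type = data.get("type")
    if not isinstance(issue_type, str) or issue_type not in ALLOWED_TYPES:
        return None
    priority = data.get("priority")
    if not isinstance(priority, str) or priority not in ALLOWED_PRIORITIES:
        return None
    confidence = data.get("confidence")
    # bool is a subclass of int, so reject it explicitly
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return data


def successful_parsing_rate(raw_responses):
    """Fraction of responses that pass parse_triage (0.0 for an empty batch)."""
    if not raw_responses:
        return 0.0
    ok = sum(parse_triage(r) is not None for r in raw_responses)
    return ok / len(raw_responses)


batch = [
    '{"type": "Bug", "priority": "High", "confidence": 0.87, "rationale": "..."}',
    '{"type": "Defect", "priority": "High", "confidence": 0.9}',  # valid JSON, unknown label
    'Sure! Here is the JSON: {"type": "Bug"}',                    # not JSON at all
    '{"type": "Task", "priority": "Low", "confidence": 1.4}',     # confidence out of range
]
spr = successful_parsing_rate(batch)
```

Only the first response passes. The second and fourth are valid JSON that JSON mode would accept, which is why syntax enforcement alone does not close the gap.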

### Pydantic / Instructor (Python)

```python
from typing import Literal

import openai
from instructor import patch
from pydantic import BaseModel

# patch() adds the response_model argument to chat.completions.create
client = patch(openai.OpenAI())


class TriageResult(BaseModel):
    type: Literal["Bug", "Feature", "Task", "Improvement", "Sub-task"]
    priority: Literal["Critical", "High", "Medium", "Low"]
    component: str
    confidence: float
    rationale: str


result = client.chat.completions.create(
    model="gpt-4o",
    response_model=TriageResult,
    messages=[{"role": "user", "content": prompt}],
)
# result is a typed Python object, not raw text
```

### Grammar Sampling (llama.cpp / Ollama)

For local models, enforce JSON grammar at the sampling level:

```python
import ollama

response = ollama.generate(
    model="llama3.1:8b",
    prompt=prompt,
    format="json",  # Forces JSON output at sampling layer
)
```

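This prevents malformed JSON by constraining token sampling to valid JSON sequences. It does not by itself guarantee that the object matches the expected schema, so field-level validation is still required.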

## PUMA Prompt Strategy Matrix

| Condition | Strategy | Temperature | Notes |
|---|---|---|---|
| H1 Zero-Shot baseline | Zero-Shot CoT | 0 | Minimal prompt, no examples |
| H1 Few-Shot primary | Few-Shot CoT (3 examples) | 0 | Reference class encoding |
| H1 Few-Shot+ | Few-Shot CoT (6 examples) | 0 | More reference class coverage |
| H2 Estimation baseline | Zero-Shot + SP range | 0 | Ask for range [low, likely, high] |
| H2 Few-Shot primary | Few-Shot CoT with SP examples | 0 | Historical issues as anchors |
| H2 Self-Consistent | Few-Shot + k=5 sampling | 0.7 | Confidence robustness check |
| Production HITL | Few-Shot + confidence gate | 0 | Escalate if conf < 0.80 |
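One way to keep experiment runs consistent with this matrix is to encode it as data. The sketch below is an assumption about how that could look, not the project's actual configuration. `Condition`, `CONDITIONS`, and `needs_review` are hypothetical names, and `n_examples` is left as `None` wherever the matrix does not state a count.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Condition:
    strategy: str
    temperature: float
    n_examples: Optional[int] = None        # None: not stated in the matrix
    k_samples: int = 1                      # > 1 means self-consistency sampling
    escalate_below: Optional[float] = None  # confidence gate, if any


CONDITIONS = {
    "H1 Zero-Shot baseline": Condition("Zero-Shot CoT", 0.0, n_examples=0),
    "H1 Few-Shot primary": Condition("Few-Shot CoT", 0.0, n_examples=3),
    "H1 Few-Shot+": Condition("Few-Shot CoT", 0.0, n_examples=6),
    "H2 Estimation baseline": Condition("Zero-Shot + SP range", 0.0, n_examples=0),
    "H2 Few-Shot primary": Condition("Few-Shot CoT with SP examples", 0.0),
    "H2 Self-Consistent": Condition("Few-Shot + k=5 sampling", 0.7, k_samples=5),
    "Production HITL": Condition("Few-Shot + confidence gate", 0.0, escalate_below=0.80),
}


def needs_review(condition, confidence):
    """True when the condition has a gate and confidence falls below it."""
    gate = condition.escalate_below
    return gate is not None and confidence < gate


escalated = needs_review(CONDITIONS["Production HITL"], 0.75)
```

With the matrix held in one place, a runner can read `temperature` and `k_samples` from the selected condition instead of repeating literals across scripts.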

## Related Notes

- PN-KeyConcepts-Agents-Reproducibility-RedTeam: temperature=0 reproducibility rationale
- PN-Reflexion-SelfCritique: iterative prompting complement
- PN-Evaluation-Metrics-Comprehensive: SPR metric measures prompt format compliance
- EX-Hypotheses-H1-H2: experimental prompt design
