Video: Let’s build the GPT Tokenizer — Andrej Karpathy

One-sentence summary: Deep-dive into how text is tokenised by LLMs (byte-pair encoding), explaining why specific word choices in prompts have non-obvious effects on model behaviour.

🎯 Key Insights for PUMA

Insight 1: Capitalisation matters. The byte-pair tokenisers used by GPT, LLaMA, and Mistral models are case-sensitive, so “Critical” and “critical” generally map to different token sequences. Using one consistent casing for priority labels in prompts (for example, capital-first) may improve parsing reliability.

Insight 2: Token boundary effects. Labels like “High priority” may be tokenised differently from “High” alone, and a leading space usually changes the token as well. The prompt template should end with just the label word to avoid token ambiguity.

Insight 3: Rare tokens. Priority labels like “Critical” are probably frequent enough in training data to be single tokens in most LLaMA/Mistral vocabularies, whereas “Story Points” may be split across multiple tokens. Both assumptions should be checked against each model’s tokeniser.
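The case and boundary effects above can be reproduced with a toy byte-level BPE in the spirit of the video. This is a minimal sketch, not the tokeniser of any real model: the corpus, the number of merges, and the function names are all assumptions made for illustration. For PUMA itself the same check would be run with each model's actual tokeniser.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges[best] = 256 + step          # new token id above the byte range
        ids = merge(ids, best, merges[best])
    return merges


def encode(text, merges):
    """Tokenise `text` by applying learned merges, earliest-learned first."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        best = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if best not in merges:
            break                          # no learned merge applies any more
        ids = merge(ids, best, merges[best])
    return ids


# Hypothetical toy corpus: capitalised labels, each preceded by a space.
corpus = "Priority: Critical\nPriority: High\n" * 50
merges = train_bpe(corpus, num_merges=40)

for text in ["Critical", "critical", " Critical", "High", "High priority"]:
    print(repr(text), encode(text, merges))
```

Because the toy vocabulary only ever saw the capitalised labels after a space, the lower-case and space-less variants fall back to different token sequences, which is the effect Insights 1 and 2 describe.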

📝 Cornell Notes

Question	Notes
Do tokenisers vary across models?	Yes — Llama 3.2 and Mistral 7B have different tokeniser vocabularies
Should PUMA verify tokenisation?	Log token counts per prompt as part of experiment metadata
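Impact on reproducibility?	Tokenisation is deterministic for a given tokeniser version; temperature=0 + seed=42 control sampling variance, not tokenisation. Document the tokeniser version alongside them.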

Summary: Tokenisation is a hidden source of variance between models. Document which tokeniser each model uses and log prompt token counts in experiment metadata.
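One way to log prompt token counts in experiment metadata is sketched below. The field names, the `build_prompt_metadata` helper, and the whitespace-based counter are assumptions for illustration; the counter is only a placeholder and should be replaced with the real tokeniser of the model under test.

```python
import hashlib
import json
from typing import Callable


def build_prompt_metadata(
    prompt: str,
    model: str,
    tokenizer: str,
    count_tokens: Callable[[str], int],
    temperature: float = 0.0,
    seed: int = 42,
) -> dict:
    """Collect the fields needed to reproduce or audit one prompt run."""
    return {
        "model": model,
        "tokenizer": tokenizer,
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt_tokens": count_tokens(prompt),
        "temperature": temperature,
        "seed": seed,
    }


def whitespace_count(text: str) -> int:
    """Placeholder counter (assumption); swap in the model's real tokeniser."""
    return len(text.split())


record = build_prompt_metadata(
    prompt="Classify the priority of this ticket: login page is down",
    model="example-model",          # hypothetical name
    tokenizer="example-tokenizer",  # hypothetical name
    count_tokens=whitespace_count,
)
print(json.dumps(record, indent=2))
```

Passing the counter in as a function keeps the metadata code independent of any one tokeniser library, so the same record format can be used for every model in the comparison.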

🔗 Connections

Threat to validity: PR-PUMA-Ch3-Methods (threats section)
Related concept: PN-KeyConcepts-Agents-Reproducibility-RedTeam (LLM Agents section)

id: LN-Video-Codina-SLR-AI
title: “Video: Cómo hacer una revisión bibliográfica con IA — Lluis Codina”
type: literature-note
subtype: video
tags: [literature, video, youtube, codina, slr, ai-research, marco-veritas]
author: “Lluis Codina”
channel: “Lluis Codina”
url: “https://www.youtube.com/watch?v=pg0AJNvK_Bk”
year: 2024
duration: “~45m”
puma_relevance: “Source of the ‘Marco Veritas’ framework used in PUMA’s AI use declaration (Section 1.8). Core principles: no delegation of judgement, traceability, cross-validation, substantial rewriting.”
watched_status: complete
created: 2026-03-01

Video: How to do a Literature Review with AI — Lluis Codina

One-sentence summary: Codina proposes “Marco Veritas” — a practical ethical framework for using AI in academic research that preserves intellectual integrity while leveraging AI efficiency gains.

🎯 Marco Veritas Principles (Used in PUMA Section 1.8)

No delegation of judgement — AI generates options; researcher decides
Traceability — Log every significant AI use with tool, purpose, output, action
Cross-validation — Verify all AI-sourced facts in primary sources
Substantial rewriting — No AI text incorporated verbatim
Proactive declaration — Declare AI use upfront, not only when asked
Tutor validation — Discuss AI use with academic supervisor
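The traceability principle (tool, purpose, output, action) maps naturally onto a structured log entry. The sketch below is one possible shape for such a record, not the format of PUMA's actual AI-Use-Log; the class name, field names, and example values are assumptions.

```python
import io
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass
class AIUseEntry:
    """One traceability record: tool, purpose, output, action."""
    day: str
    tool: str
    purpose: str
    output_summary: str
    action: str                       # what the researcher did with the output
    verified_in_primary_source: bool  # cross-validation flag


def append_entry(stream, entry: AIUseEntry) -> None:
    """Append one entry as a single JSON line."""
    stream.write(json.dumps(asdict(entry), ensure_ascii=False) + "\n")


log = io.StringIO()  # stand-in for a file opened in append mode
append_entry(
    log,
    AIUseEntry(
        day=date(2026, 3, 1).isoformat(),
        tool="example-assistant",  # hypothetical tool name
        purpose="Suggest search terms for the literature review",
        output_summary="List of candidate keywords",
        action="Reviewed manually; kept a subset and rewrote the query",
        verified_in_primary_source=True,
    ),
)
print(log.getvalue(), end="")
```

One JSON object per line keeps the log append-only and easy to diff, which suits a record that is meant to be declared and audited later.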

🔗 Connections

Implemented in: PR-PUMA-Ch1-Introduction (Section 1.8)
Logged via: AI-Use-Log

id: LN-Reddit-LocalLlama-Benchmarks
title: “Reddit r/LocalLlama — Benchmark Discussions”
type: literature-note
subtype: reddit
tags: [literature, reddit, local-llm, benchmarks, informal, community]
url: “https://reddit.com/r/LocalLlama”
puma_relevance: “Community knowledge about local model performance, hardware requirements, and practical considerations not covered in academic papers. Useful for model selection and hardware calibration.”
created: 2026-03-01

Reddit r/LocalLlama — Benchmark Discussions

Type: Informal community source. Treat as exploration/hypothesis generation only. Never cite directly in an academic project without verifying the claim in a primary source.

🎯 Key Community Observations for PUMA

Thread type 1 — Phi-3.5 Mini performance: Community reports suggest Phi-3.5 Mini (3.8B) outperforms Mistral 7B on some classification tasks while using 40% less RAM. This motivates including it as a fallback/bonus model in PUMA.

Thread type 2 — Quantization effects: Q4_K_M quantization (used in Ollama default) reportedly loses ~2-5% accuracy vs Q8 for classification tasks. PUMA should document quantization level used.

Thread type 3 — CPU vs GPU inference: CPU-only inference (PUMA’s constraint) introduces non-determinism in some edge cases even with temperature=0. Community workarounds: Docker isolation + pinned CPU affinity.
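The quantization claim in Thread type 2 can only be checked by scoring the same items under both quantization levels. The helper below is a minimal sketch of that comparison; the label lists are placeholders for illustration only and are not measurements of any model.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold labels."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def disagreements(preds_a, preds_b):
    """Indices where two runs predicted different labels."""
    return [i for i, (a, b) in enumerate(zip(preds_a, preds_b)) if a != b]


# Placeholder labels for illustration only; these are NOT measurements.
gold = ["High", "Low", "Critical", "Medium", "High"]
preds_q8 = ["High", "Low", "Critical", "Medium", "Low"]
preds_q4 = ["High", "Low", "High", "Medium", "Low"]

delta = accuracy(preds_q8, gold) - accuracy(preds_q4, gold)
print(f"accuracy gap (Q8 - Q4): {delta:.2f}")
print("items where the two runs disagree:", disagreements(preds_q8, preds_q4))
```

Listing the disagreeing items alongside the aggregate gap shows whether a difference comes from a few borderline tickets or from a systematic shift.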

⚠️ Validation Requirements

Everything from Reddit is a fleeting note / hypothesis until verified:

Phi-3.5 vs Mistral comparison → find academic paper or run own test
Quantization accuracy loss → verify with Ollama/LM Studio comparison
CPU non-determinism → test empirically with seed=42
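The empirical determinism test in the last item can be sketched as a small harness that calls the model several times with identical settings and counts distinct outputs. The `generate` callable is an assumption: in PUMA it would wrap the real inference call with temperature=0 and seed=42, while here a stub stands in so the sketch runs on its own.

```python
import hashlib
from collections import Counter
from typing import Callable


def determinism_report(
    generate: Callable[[str], str], prompt: str, runs: int = 5
) -> dict:
    """Call `generate` repeatedly and count distinct outputs by hash."""
    digests = Counter(
        hashlib.sha256(generate(prompt).encode("utf-8")).hexdigest()
        for _ in range(runs)
    )
    return {
        "runs": runs,
        "distinct_outputs": len(digests),
        "deterministic": len(digests) == 1,
    }


def fake_generate(prompt: str) -> str:
    """Stub for a real model call (assumption); always returns the same label."""
    return "High"


report = determinism_report(fake_generate, "Classify: login page is down")
print(report)
```

Hashing the full output rather than comparing only the parsed label also catches cases where the label is stable but the surrounding text drifts.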

🔗 Connections

PN-LLM-Local-vs-Cloud (quantization section) | SP-Architecture

id: LN-Repo-Langgraph-Template
title: “Repo: fastapi-langgraph-agent-production-ready-template”
type: literature-note
subtype: repo
tags: [literature, repo, langgraph, fastapi, agent, template]
author: “wassim249”
github: “https://github.com/wassim249/fastapi-langgraph-agent-production-ready-template”
stars: “~200”
puma_relevance: “Reference architecture for production-grade LLM agent systems. Relevant for PUMA Stage 4-5 (RAG + multi-agent). Shows how to structure FastAPI + LangGraph + Docker for reproducible agent deployment.”
relevance: medium
created: 2026-03-01

Repo: fastapi-langgraph-agent-production-ready-template

Architecture pattern: FastAPI (REST API) → LangGraph (agent workflow) → Ollama/OpenAI (inference) → Docker (deployment)

Key Patterns Relevant to PUMA

1. State management in multi-turn agent workflows: LangGraph’s StateGraph pattern is useful if PUMA’s orchestrator needs to track conversation history or multi-step reasoning chains.

2. Tool integration pattern: Shows how to register Python functions as agent tools. This is directly applicable to PUMA’s architecture, where the triage agent could call dataset lookup, priority classification, and reporting tools.

3. Async + streaming: Production pattern for handling multiple concurrent inference requests. Relevant if PUMA is extended to a web service.
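The tool integration idea in pattern 2 can be illustrated without any framework. The sketch below is library-free and does not use LangGraph's API or the template's code; the registry, the decorator, and both tools (including the keyword rule standing in for a classifier) are hypothetical.

```python
from typing import Callable, Dict

# Registry mapping a tool name to the plain Python function that implements it.
TOOLS: Dict[str, Callable[..., object]] = {}


def tool(func):
    """Register a plain Python function under its own name."""
    TOOLS[func.__name__] = func
    return func


@tool
def lookup_ticket(ticket_id: str) -> dict:
    """Hypothetical dataset lookup; a real version would query the dataset."""
    tickets = {"T-1": {"title": "Login page is down", "reporter": "ops"}}
    return tickets.get(ticket_id, {})


@tool
def classify_priority(title: str) -> str:
    """Hypothetical keyword rule standing in for the LLM classifier."""
    return "Critical" if "down" in title.lower() else "Low"


def call_tool(name: str, **kwargs):
    """Dispatch a tool call by name, as an agent loop would."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)


ticket = call_tool("lookup_ticket", ticket_id="T-1")
priority = call_tool("classify_priority", title=ticket["title"])
print(priority)
```

An agent framework adds schema generation and model-driven tool selection on top of this, but the core contract is the same: named functions with explicit arguments that the orchestrator can call and log.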

PUMA Use

Stage 1-3: Not needed (simple pipeline, not agentic)
Stage 4+ (RAG): Reference for retriever integration
Stage 5 (Smart PMO): Reference for multi-agent coordination

🔗 Connections

LN-Tools-Dev-Stack | SP-Architecture
