LN: Tunstall, von Werra & Wolf (2022) — NLP with Transformers

Bibliographic Reference

Citation: Tunstall, L., von Werra, L., & Wolf, T. (2022). Natural language processing with Transformers: Building language applications with Hugging Face. O’Reilly Media. https://www.oreilly.com/library/view/natural-language-processing/9781098136780/


Pass 1 — Bird’s Eye View (5 Cs)

Category: Technical book (implementation guide)
Context: Written by members of Hugging Face (the organisation behind the transformers library). Combines theoretical foundations with hands-on Python code.
Correctness: Code examples tested against Hugging Face library versions; covers published models with accurate capability claims.
Contributions: (1) End-to-end guide for fine-tuning and deploying Transformer models; (2) Quantization techniques (4-bit, 8-bit, GGUF); (3) Local inference with open-weight models; (4) Text classification, summarisation, generation, and Q&A pipelines.
Clarity: Excellent; progressive difficulty, extensive code examples.

Relevance: ⭐⭐⭐⭐

Provides the technical basis for PUMA’s model selection, fine-tuning rationale (LoRA/QLoRA), and the local execution pipeline (Ollama + open-weight models). The classification pipeline chapters directly map to PUMA’s H1 implementation.


Pass 2 — Key Concepts

Text Classification Pipeline (PUMA H1 Mapping)

The book’s classification chapter maps directly to PUMA’s triage task:

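from transformers import pipeline

# candidate_labels is only accepted by the zero-shot-classification task,
# which needs an NLI-trained model. facebook/bart-large-mnli is a common
# choice, used here as an example rather than as PUMA's model.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

result = classifier(
    "Bug: User cannot login after password reset",
    candidate_labels=["Bug", "Story", "Task", "Improvement"]
)
print(result["labels"][0])  # labels are sorted by score, highest first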

For PUMA, the equivalent is a prompted LLM (via Ollama) rather than a dedicated classification model, but the conceptual pipeline is the same: issue text goes in, and one label from a fixed set comes out.
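The prompted-LLM variant can be sketched as below. This is a minimal illustration, not PUMA's actual implementation: the prompt wording, the fallback label, the helper names (build_prompt, parse_label, classify) and the model tag llama3.1:8b are all assumptions. It assumes an Ollama server on its default local port and uses the /api/generate endpoint with streaming disabled.

```python
import json
import urllib.request

LABELS = ["Bug", "Story", "Task", "Improvement"]


def build_prompt(text, labels=LABELS):
    """Ask the model to answer with exactly one label from the list."""
    return (
        "Classify the following issue into exactly one of these categories: "
        + ", ".join(labels)
        + ".\nAnswer with the category name only.\n\nIssue: "
        + text
    )


def parse_label(reply, labels=LABELS, default="Task"):
    """Map a free-text reply onto a known label; fall back to `default`."""
    cleaned = reply.strip().lower()
    # Prefer a reply that starts with a label, then accept a label anywhere.
    for label in labels:
        if cleaned.startswith(label.lower()):
            return label
    for label in labels:
        if label.lower() in cleaned:
            return label
    return default


def classify(text, model="llama3.1:8b", host="http://localhost:11434"):
    """Send one prompt to a local Ollama server (model tag is an assumption)."""
    payload = json.dumps(
        {"model": model, "prompt": build_prompt(text), "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        host + "/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # With stream=False the generated text is in the "response" field.
    return parse_label(body["response"])
```

Calling classify("Bug: User cannot login after password reset") needs a running Ollama instance with the model pulled. Keeping parse_label separate means the mapping from free-text replies to labels can be tested without a model.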

Fine-Tuning with LoRA

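The book’s fine-tuning chapters cover full fine-tuning with the Trainer API; parameter-efficient fine-tuning with LoRA, provided by the separate peft library, follows the same workflow while training far fewer weights: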

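from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder checkpoint: substitute the open-weight model actually being tuned.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # update is scaled by lora_alpha / r
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    bias="none",                          # leave bias terms frozen
    task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()  # reports how many weights are trainable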

This is directly applicable to PUMA’s optional fine-tuning stage (PUMA Stage 3+).
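To make the "parameter-efficient" claim concrete, the following sketch counts the weights that a LoRA configuration with r=16 on q_proj and v_proj would train. The layer shapes (hidden size 4096, key/value projection width 1024, 32 layers) are assumptions chosen to resemble an 8B Llama-style model, not figures taken from the book or from PUMA.

```python
def lora_params(d_in, d_out, r):
    """LoRA adds A (r x d_in) and B (d_out x r) beside a frozen d_out x d_in weight."""
    return r * (d_in + d_out)


# Assumed shapes, for illustration only (roughly an 8B Llama-style model).
HIDDEN = 4096   # model width; q_proj maps HIDDEN -> HIDDEN
KV_DIM = 1024   # v_proj output width under grouped-query attention
LAYERS = 32
R, ALPHA = 16, 32

per_layer = lora_params(HIDDEN, HIDDEN, R) + lora_params(HIDDEN, KV_DIM, R)
total = per_layer * LAYERS
scaling = ALPHA / R  # factor applied to the low-rank update

print(total, scaling)
```

Under these assumptions the adapter has 6,815,744 trainable weights, well under 0.1% of an 8-billion-parameter model, and the update is scaled by 2.0.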

Quantization and Local Deployment

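Key quantization concepts (the book introduces quantization in general; bitsandbytes, QLoRA, and GGUF are later tooling built on the same idea):

  • 8-bit quantization (bitsandbytes): roughly 2× less weight memory than FP16, with quality loss usually reported as small (<1%)
  • 4-bit NF4 quantization (QLoRA): roughly 4× less weight memory than FP16, which lets 8B-parameter models fit on a consumer GPU
  • GGUF (llama.cpp): file format for quantized weights that runs on CPU as well as GPU; the format Ollama uses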

These directly justify PUMA’s choice to use Ollama for local model execution.
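The 2× and 4× figures above follow from simple arithmetic on bits per weight. The sketch below estimates weight memory only, assuming exactly 8 billion parameters and decimal gigabytes; it ignores quantization constants, activations, and the KV cache, so real usage is somewhat higher.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory needed for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 8e9  # assumed parameter count for an "8B" model

for name, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name}: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

This prints 16.0 GB, 8.0 GB, and 4.0 GB respectively, which is why a 4-bit 8B model fits on consumer hardware where the FP16 version does not.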


PUMA Integration

  • Ch.3 Methods: Cite Tunstall et al. for the technical justification of local inference, quantization, and the Python classification pipeline
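  • Model selection: Mistral 7B, Gemma 2, and Phi-3.5 were released after the book (2022), so it supports PUMA’s model choices only indirectly, through its treatment of the Transformer architecture these models share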

MOCs