Fleeting Note — {{date}}

Capture now, process later. This note should be converted to a Literature Note or Permanent Note within 48 hours, then deleted or archived.


Raw capture:

{{Write exactly what you want to remember. No formatting needed. Examples:

  • “Angermeir 2025 says only 5/18 papers with artefacts are reproducible — need to check exact quote”
  • “Idea: Could Contextual Anchoring help when using long few-shot prompts for triage? The model might forget the priority definitions”
  • “YouTube: Andrej Karpathy video on tokenisation — relevant for understanding why ‘Critical’ might be tokenised differently across models”
  • “Reddit r/LocalLlama: someone mentioned phi-3.5-mini outperforms mistral-7b on classification tasks at 1/3 the RAM — worth testing as fallback” }}

Action required:

  • Verify in primary source
  • Create literature note:
  • Create permanent note:
  • Add to experiment plan
  • Add to backlog
  • Just delete — not useful

Process by: {{date+2days}} — if not processed, move to 71 Old-Notes


📥 Example Fleeting Notes (Educational)

Example 1 — Paper idea

FL-2026-03-12-CoT-small-models

Source: Reading Calikli & Alhamed (2025)

Raw capture:
The paper says request format has "non-monotonic" effect on effort estimation.
This is surprising — more prompt structure doesn't always help.
Specifically: few-shot-6 sometimes WORSE than few-shot-3.
Question: Is this because 6 examples increases context length and the model
"forgets" the task definition? Or is it example selection bias?

Need to:
- Read the paper carefully (currently only read abstract)
- Check if they tested this with local models or only GPT
- Design PUMA experiments to test few-shot-3 vs few-shot-6 for this reason

Action: Create permanent note on "non-monotonic prompting effects"

Example 2 — Tool observation

FL-2026-03-15-Ollama-seed-issue

Source: Own testing

Raw capture:
Discovered that Ollama doesn't guarantee determinism across different
hardware even with seed=42. Tested same prompt on M1 Mac and Intel CPU —
different responses. This might affect reproducibility claims.

Need to:
- Research if this is known behaviour
- Check Ollama GitHub issues
- Consider documenting this as a limitation
- Maybe: always use Docker to standardise hardware layer?

Action: Update SP-Architecture-v1 with reproducibility caveat
        Log in experiment design under "Threats to Validity"

Example 3 — YouTube / Podcast

FL-2026-03-18-Karpathy-tokenisation

Source: YouTube - Andrej Karpathy "Let's build the GPT Tokenizer" 
URL: https://youtu.be/zduSFxRajkE

Raw capture:
Karpathy explains that tokenisation matters enormously for model behaviour.
The word "Critical" might be a single token, while "critical" might be split.
This could explain why the triage agent sometimes ignores case in the prompt.
Also: priority labels might tokenise differently in different model families.

Implication for PUMA: document model tokenisers used,
test with multiple capitalisation variants in labels.

Action: Add to "Threats to Validity" in experiment design.
        Maybe add to Glossary under "tokenization"
        Note type: Fleeting → maybe Permanent if confirmed