bahree · Jun 7, 2026
diff --git a/README.md b/README.md
@@ -27,7 +27,7 @@ Every chapter ships with runnable code. The hands-on chapters (4 through 9) repr
 | **[Chapter 1: Why Model Adaptation?](code/chapter01/README.md)** | A reproducibility script for the §1.6 sidebar. Runs the same prompt through base Qwen3-4B, the Chapter 5 LoRA adapter, and the Chapter 6 SFT model side by side; degrades gracefully if the later-chapter artifacts are not yet built. |
 | **[Chapter 2: How Do I Do Model Adaptation?](code/chapter02/README.md)** | A five-step LoRA fine-tuning quickstart on Qwen3-4B-Instruct-2507 using a 40-example Dolly subset (TRL's `SFTTrainer` plus PEFT): dataset prep, LoRA training, generation, and adapter save. Runs in under 10 minutes on a 12 GB GPU, and on Apple Silicon via MPS. |
 | **[Chapter 3: What Data Do I Need?](code/chapter03/README.md)** | Data-quality experiment that trains the same model on four versions of Financial PhraseBank and compares results on a held-out test set; a six-step synthetic data generation pipeline (load → prompt → generate → quality-gate → distribution-check → mix-and-save) using a frontier teacher; and a standalone `DatasetManifest` module for content hashing, lineage tracking, and retention scheduling. |
-| **[Chapter 4: In-Context Learning and Few-Shot Adaptation](code/chapter04/README.md)** | Few-shot ticket classifier, prompt validator with run-to-run variability measurement, minimal RAG pipeline (50 lines), and a Precision@k / Recall@k / Hit@1 retrieval evaluator. CPU-friendly; GPU optional. |
+| **[Chapter 4: In-Context Learning, Few-Shot, and RAG](code/chapter04/README.md)** | Few-shot ticket classifier, prompt validator with run-to-run variability measurement, minimal RAG pipeline (50 lines), and a Precision@k / Recall@k / Hit@1 retrieval evaluator. CPU-friendly; GPU optional. |
 | **[Chapter 5: Parameter-Efficient Fine-Tuning (LoRA and QLoRA)](code/chapter05/README.md)** | LoRA and QLoRA adapters trained on a 400-example Dolly subset of Qwen3-4B-Instruct-2507, evaluated against the base model with per-category Token-F1 and a safety regression suite. |
 | **[Chapter 6: Supervised Fine-Tuning (SFT)](code/chapter06/README.md)** | A full-parameter SFT of Qwen3-4B-Instruct-2507 on a technical-support Dolly subset, with overfit monitoring, three-way base-vs-LoRA-vs-SFT comparison, behavioral tests, and a separate safety regression suite. |
 | **[Chapter 7: Knowledge Distillation](code/chapter07/README.md)** | Black-box distillation from the chapter 6 SFT teacher into a chapter 5-style LoRA student, with quality filtering, three-way base-vs-teacher-vs-student evaluation, safety robustness check, and an optional OpenRouter-backed SFT-vs-frontier-API comparison. |

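The Chapter 4 row above names a Precision@k / Recall@k / Hit@1 retrieval evaluator. For readers who want those metrics spelled out, here is a minimal sketch. It is not the chapter's evaluator: the function names and the document IDs are hypothetical, and it assumes each query has a ranked list of unique retrieved IDs and a set of relevant IDs.

```python
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that are relevant (divides by k)."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant IDs that appear in the top-k retrieved IDs."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if not relevant:
        # Assumption: a query with no relevant documents scores 0.0.
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def hit_at_1(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1.0 if the top-ranked ID is relevant, else 0.0."""
    return 1.0 if retrieved and retrieved[0] in relevant else 0.0


# Hypothetical ranked results for one query.
retrieved_ids = ["d3", "d1", "d7", "d2"]
relevant_ids = {"d1", "d2"}

p_at_2 = precision_at_k(retrieved_ids, relevant_ids, k=2)
r_at_2 = recall_at_k(retrieved_ids, relevant_ids, k=2)
r_at_4 = recall_at_k(retrieved_ids, relevant_ids, k=4)
top_hit = hit_at_1(retrieved_ids, relevant_ids)
```

Precision here divides by `k` even when fewer than `k` results come back, which penalizes short result lists. Dividing by `len(top_k)` is a common alternative.
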
diff --git a/code/chapter02/quickstart.py b/code/chapter02/quickstart.py
@@ -70,7 +70,7 @@
 def step1_prepare_dataset() -> tuple[HFDataset, HFDataset, List[Dict[str, Any]]]:
     """Step 1: download Dolly 15K and keep 40 train + 5 valid + 3 demo examples.
 
-    Same filter and seed as chapter 5's listing_5_2_prepare_dataset.py, just a
+    Same filter and seed as chapter 5's listing_5_1_prepare_dataset.py, just a
     smaller slice so the run finishes in minutes.
     """
     print("Step 1: prepare dataset")

diff --git a/code/chapter02/run_chapter5_adapter.py b/code/chapter02/run_chapter5_adapter.py
@@ -111,7 +111,7 @@ def print_no_adapter_instructions(args: argparse.Namespace) -> None:
     print("Two ways to fix this:")
     print()
     print("Option A. Train the chapter 5 adapter locally:")
-    print("  python -m chapter05.scripts.listing_5_2_prepare_dataset \\")
+    print("  python -m chapter05.scripts.listing_5_1_prepare_dataset \\")
     print("    --out chapter05/data/dolly_subset --seed 42")
     print("  python -m chapter05.train_lora \\")
     print("    --train chapter05/data/dolly_subset/train.jsonl \\")

diff --git a/code/chapter04/README.md b/code/chapter04/README.md
@@ -1,4 +1,4 @@
-# Chapter 4 -- In-Context Learning and Few-Shot Adaptation
+# Chapter 4 -- In-Context Learning, Few-Shot, and RAG
 
 This chapter covers how to get useful work out of a model without training it: few-shot prompting, many-shot prompting on long-context models, prompt validation against held-out test sets, and a minimal retrieval-augmented generation (RAG) pipeline. The code in this folder backs the four numbered listings in the chapter.
 

diff --git a/code/chapter05/README.md b/code/chapter05/README.md
diff --git a/code/chapter05/eval.py b/code/chapter05/eval.py
@@ -6,8 +6,8 @@
     3. **Toy golden set** - Simple Q&A pairs to sanity-check model behavior.
 
 Also includes loss/perplexity computation on held-out JSONL data, and report
-generation (JSON + Markdown). Used by ``scripts/listing_5_4_evaluate.py``
-(Listing 5.4) to compare base model vs. adapter variants.
+generation (JSON + Markdown). Used by ``scripts/listing_5_3_evaluate.py``
+(Listing 5.3) to compare base model vs. adapter variants.
 """
 from __future__ import annotations
 
@@ -473,7 +473,7 @@ def write_report(path: str | Path, obj: Dict[str, Any]) -> None:
     """Write an evaluation results dict as a JSON file.
 
     The JSON report is the machine-readable counterpart to the human-readable
-    Markdown summary generated by ``listing_5_4_evaluate.py``. Both are saved
+    Markdown summary generated by ``listing_5_3_evaluate.py``. Both are saved
     to the same output directory (e.g., ``chapter05/runs/eval_report/``).
 
     Args:

diff --git a/code/chapter05/examples/README_INTERPRETING_RESULTS.md b/code/chapter05/examples/README_INTERPRETING_RESULTS.md
@@ -1,6 +1,6 @@
 # Understanding Your Evaluation Results
 
-This guide helps you interpret the evaluation report from `listing_5_4_evaluate.py`.
+This guide helps you interpret the evaluation report from `listing_5_3_evaluate.py`.
 
 ---
 

diff --git a/code/chapter05/examples/example_data_prep_outcome_types.md b/code/chapter05/examples/example_data_prep_outcome_types.md
@@ -3,7 +3,7 @@
 These illustrate the response types discussed in the chapter's "Data quality
 iterations" section, using the Contoso IT-support assistant. Each is a single
 training row in the same `messages` format produced by
-`scripts/listing_5_2_prepare_dataset.py` (see `dolly_to_messages`).
+`scripts/listing_5_1_prepare_dataset.py` (see `dolly_to_messages`).
 
 > **These rows are illustrative.** The Dolly 15K subset used in this chapter
 > contains no refusals and no tone tags, so these are examples of what you would

diff --git a/code/chapter05/examples/example_qlora_evaluation_output.md b/code/chapter05/examples/example_qlora_evaluation_output.md
@@ -5,7 +5,7 @@ This file captures a typical run of the evaluation script when comparing the **b
 ## Command
 
 ```bash
-python chapter05/scripts/listing_5_4_evaluate.py \
+python chapter05/scripts/listing_5_3_evaluate.py \
   --base Qwen/Qwen3-4B-Instruct-2507 \
   --adapter chapter05/runs/dolly_lora \
   --adapter_alt chapter05/runs/dolly_qlora \

diff --git a/code/chapter05/generate.py b/code/chapter05/generate.py
@@ -1,110 +1,110 @@
-"""Inference script: generate text with the base model and an optional LoRA/QLoRA adapter (Listing 5.5).
-
-Loads the base model, optionally attaches a LoRA or QLoRA adapter, and generates
-a response for a single user prompt. Supports adapter merging for deployment.
-
-Usage (base model only):
-    python -m chapter05.generate --base Qwen/Qwen3-4B-Instruct-2507 \\
-        --prompt "Explain how photosynthesis works in simple terms."
-
-Usage (with LoRA adapter):
-    python -m chapter05.generate --base Qwen/Qwen3-4B-Instruct-2507 \\
-        --adapter chapter05/runs/dolly_lora \\
-        --prompt "Explain how photosynthesis works in simple terms."
-
-Usage (with QLoRA adapter -- must use --quantized_4bit):
-    python -m chapter05.generate --base Qwen/Qwen3-4B-Instruct-2507 \\
-        --adapter chapter05/runs/dolly_qlora --quantized_4bit \\
-        --prompt "Explain how photosynthesis works in simple terms."
-
-See Chapter 5, Section 5.1 (Step 4) and the README for full details.
-"""
-from __future__ import annotations
-
-import argparse
-from pathlib import Path
-
-import torch
-from peft import PeftModel
-from transformers import AutoModelForCausalLM
-
-from chapter05 import DEFAULT_MODEL_NAME
-from chapter05.chat_template import DEFAULT_SYSTEM_PROMPT, build_prompt_text
-from chapter05.modeling import load_base_model_lora, load_base_model_qlora, load_tokenizer
-
-
-def parse_args() -> argparse.Namespace:
-    """Parse command-line arguments for inference.
-
-    Returns:
-        Namespace with base model, adapter path, prompt, generation settings,
-        and optional merge/quantization flags.
-    """
-    ap = argparse.ArgumentParser()
-    ap.add_argument("--base", default=DEFAULT_MODEL_NAME)
-    ap.add_argument("--adapter", default=None, help="Path to LoRA/QLoRA adapter folder")
-    ap.add_argument("--prompt", required=True, help="User prompt")
-    ap.add_argument("--system_prompt", default=DEFAULT_SYSTEM_PROMPT)
-    ap.add_argument("--max_new_tokens", type=int, default=128)
-    ap.add_argument("--do_sample", action="store_true")
-    ap.add_argument("--temperature", type=float, default=0.7)
-    ap.add_argument("--quantized_4bit", action="store_true", help="Load base in 4-bit (requires bitsandbytes)")
-    ap.add_argument("--merge_adapter", action="store_true", help="Merge adapter into base before generation")
-    ap.add_argument("--save_merged", default=None, help="If set, save merged model to this folder")
-    return ap.parse_args()
-
-
-def main() -> None:
-    """Load model, optionally attach adapter, and generate a response."""
-    args = parse_args()
-    tokenizer = load_tokenizer(args.base)
-
-    # Use --quantized_4bit when running a QLoRA-trained adapter so the base
-    # model is loaded in 4-bit (matching the precision used during training).
-    if args.quantized_4bit:
-        model = load_base_model_qlora(args.base, gradient_checkpointing=False)
-    else:
-        model = load_base_model_lora(args.base, gradient_checkpointing=False)
-
-    if args.adapter:
-        model = PeftModel.from_pretrained(model, args.adapter)
-        if args.merge_adapter:
-            # merge_and_unload() permanently folds LoRA weights into the base.
-            # This loses modularity (can't swap adapters) but can be faster
-            # for high-throughput serving. See Section 5.11 deployment options.
-            model = model.merge_and_unload()
-            if args.save_merged:
-                Path(args.save_merged).mkdir(parents=True, exist_ok=True)
-                model.save_pretrained(args.save_merged)
-                tokenizer.save_pretrained(args.save_merged)
-
-    model.eval()
-
-    messages = [
-        {"role": "system", "content": args.system_prompt},
-        {"role": "user", "content": args.prompt},
-    ]
-    text = build_prompt_text(tokenizer, messages)
-    inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
-
-    with torch.no_grad():
-        out = model.generate(
-            **inputs,
-            max_new_tokens=args.max_new_tokens,
-            do_sample=args.do_sample,
-            # Pass temperature=None when not sampling to avoid HF warnings
-            # about unused generation parameters.
-            temperature=args.temperature if args.do_sample else None,
-            pad_token_id=tokenizer.pad_token_id,
-            eos_token_id=tokenizer.eos_token_id,
-        )
-
-    # skip_special_tokens=False to show the full chat template (system/user/assistant
-    # markers). This is useful for debugging and demonstrating the template structure.
-    decoded = tokenizer.decode(out[0], skip_special_tokens=False)
-    print(decoded)
-
-
-if __name__ == "__main__":
-    main()
-
+"""Inference script: generate text with the base model and an optional LoRA/QLoRA adapter (Listing 5.4).
+
+Loads the base model, optionally attaches a LoRA or QLoRA adapter, and generates
+a response for a single user prompt. Supports adapter merging for deployment.
+
+Usage (base model only):
+    python -m chapter05.generate --base Qwen/Qwen3-4B-Instruct-2507 \\
+        --prompt "Explain how photosynthesis works in simple terms."
+
+Usage (with LoRA adapter):
+    python -m chapter05.generate --base Qwen/Qwen3-4B-Instruct-2507 \\
+        --adapter chapter05/runs/dolly_lora \\
+        --prompt "Explain how photosynthesis works in simple terms."
+
+Usage (with QLoRA adapter -- must use --quantized_4bit):
+    python -m chapter05.generate --base Qwen/Qwen3-4B-Instruct-2507 \\
+        --adapter chapter05/runs/dolly_qlora --quantized_4bit \\
+        --prompt "Explain how photosynthesis works in simple terms."
+
+See Chapter 5, Section 5.1 (Step 4) and the README for full details.
+"""
+from __future__ import annotations
+
+import argparse
+from pathlib import Path
+
+import torch
+from peft import PeftModel
+from transformers import AutoModelForCausalLM
+
+from chapter05 import DEFAULT_MODEL_NAME
+from chapter05.chat_template import DEFAULT_SYSTEM_PROMPT, build_prompt_text
+from chapter05.modeling import load_base_model_lora, load_base_model_qlora, load_tokenizer
+
+
+def parse_args() -> argparse.Namespace:
+    """Parse command-line arguments for inference.
+
+    Returns:
+        Namespace with base model, adapter path, prompt, generation settings,
+        and optional merge/quantization flags.
+    """
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--base", default=DEFAULT_MODEL_NAME)
+    ap.add_argument("--adapter", default=None, help="Path to LoRA/QLoRA adapter folder")
+    ap.add_argument("--prompt", required=True, help="User prompt")
+    ap.add_argument("--system_prompt", default=DEFAULT_SYSTEM_PROMPT)
+    ap.add_argument("--max_new_tokens", type=int, default=128)
+    ap.add_argument("--do_sample", action="store_true")
+    ap.add_argument("--temperature", type=float, default=0.7)
+    ap.add_argument("--quantized_4bit", action="store_true", help="Load base in 4-bit (requires bitsandbytes)")
+    ap.add_argument("--merge_adapter", action="store_true", help="Merge adapter into base before generation")
+    ap.add_argument("--save_merged", default=None, help="If set, save merged model to this folder")
+    return ap.parse_args()
+
+
+def main() -> None:
+    """Load model, optionally attach adapter, and generate a response."""
+    args = parse_args()
+    tokenizer = load_tokenizer(args.base)
+
+    # Use --quantized_4bit when running a QLoRA-trained adapter so the base
+    # model is loaded in 4-bit (matching the precision used during training).
+    if args.quantized_4bit:
+        model = load_base_model_qlora(args.base, gradient_checkpointing=False)
+    else:
+        model = load_base_model_lora(args.base, gradient_checkpointing=False)
+
+    if args.adapter:
+        model = PeftModel.from_pretrained(model, args.adapter)
+        if args.merge_adapter:
+            # merge_and_unload() permanently folds LoRA weights into the base.
+            # This loses modularity (can't swap adapters) but can be faster
+            # for high-throughput serving. See Section 5.11 deployment options.
+            model = model.merge_and_unload()
+            if args.save_merged:
+                Path(args.save_merged).mkdir(parents=True, exist_ok=True)
+                model.save_pretrained(args.save_merged)
+                tokenizer.save_pretrained(args.save_merged)
+
+    model.eval()
+
+    messages = [
+        {"role": "system", "content": args.system_prompt},
+        {"role": "user", "content": args.prompt},
+    ]
+    text = build_prompt_text(tokenizer, messages)
+    inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
+
+    with torch.no_grad():
+        out = model.generate(
+            **inputs,
+            max_new_tokens=args.max_new_tokens,
+            do_sample=args.do_sample,
+            # Pass temperature=None when not sampling to avoid HF warnings
+            # about unused generation parameters.
+            temperature=args.temperature if args.do_sample else None,
+            pad_token_id=tokenizer.pad_token_id,
+            eos_token_id=tokenizer.eos_token_id,
+        )
+
+    # skip_special_tokens=False to show the full chat template (system/user/assistant
+    # markers). This is useful for debugging and demonstrating the template structure.
+    decoded = tokenizer.decode(out[0], skip_special_tokens=False)
+    print(decoded)
+
+
+if __name__ == "__main__":
+    main()
+
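
This change renumbers several listing references by hand (`listing_5_2_prepare_dataset` to `listing_5_1_prepare_dataset`, `listing_5_4_evaluate` to `listing_5_3_evaluate`). A small checker along the following lines could catch any references that were missed. It is a sketch, not part of the repository: the `find_stale_listing_refs` helper and the `RENAMED_LISTINGS` mapping are hypothetical names, and the mapping contains only the two renames visible in this diff.

```python
import re
from typing import Dict, List, Tuple

# Old -> new script names, taken from this diff. Hypothetical mapping;
# extend it if other listings were renumbered as well.
RENAMED_LISTINGS: Dict[str, str] = {
    "listing_5_2_prepare_dataset": "listing_5_1_prepare_dataset",
    "listing_5_4_evaluate": "listing_5_3_evaluate",
}


def find_stale_listing_refs(
    text: str, renamed: Dict[str, str] = RENAMED_LISTINGS
) -> List[Tuple[int, str, str]]:
    """Return (line_number, old_name, new_name) for each stale reference."""
    if not renamed:
        return []
    # Match full old script names only, so a different script that now
    # carries the old listing number is not flagged.
    pattern = re.compile("|".join(re.escape(old) for old in renamed))
    hits: List[Tuple[int, str, str]] = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in pattern.finditer(line):
            old = match.group(0)
            hits.append((lineno, old, renamed[old]))
    return hits


# Example input mixing stale and already-updated references.
sample = (
    "python -m chapter05.scripts.listing_5_2_prepare_dataset \\\n"
    "Used by scripts/listing_5_3_evaluate.py\n"
    "Markdown summary generated by listing_5_4_evaluate.py\n"
)
stale = find_stale_listing_refs(sample)
for lineno, old, new in stale:
    print(f"line {lineno}: {old} -> {new}")
```

To scan a checkout, read each `.py` and `.md` file and pass its text to the function. A non-empty result means a reference still points at an old script name.
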