2 changes: 1 addition & 1 deletion README.md
@@ -27,7 +27,7 @@ Every chapter ships with runnable code. The hands-on chapters (4 through 9) repr
| **[Chapter 1: Why Model Adaptation?](code/chapter01/README.md)** | A reproducibility script for the §1.6 sidebar. Runs the same prompt through base Qwen3-4B, the Chapter 5 LoRA adapter, and the Chapter 6 SFT model side by side; degrades gracefully if the later-chapter artifacts are not yet built. |
| **[Chapter 2: How Do I Do Model Adaptation?](code/chapter02/README.md)** | A five-step LoRA fine-tuning quickstart on Qwen3-4B-Instruct-2507 using a 40-example Dolly subset (TRL's `SFTTrainer` plus PEFT): dataset prep, LoRA training, generation, and adapter save. Runs in under 10 minutes on a 12 GB GPU, and on Apple Silicon via MPS. |
| **[Chapter 3: What Data Do I Need?](code/chapter03/README.md)** | Data-quality experiment that trains the same model on four versions of Financial PhraseBank and compares results on a held-out test set; a six-step synthetic data generation pipeline (load → prompt → generate → quality-gate → distribution-check → mix-and-save) using a frontier teacher; and a standalone `DatasetManifest` module for content hashing, lineage tracking, and retention scheduling. |
-| **[Chapter 4: In-Context Learning and Few-Shot Adaptation](code/chapter04/README.md)** | Few-shot ticket classifier, prompt validator with run-to-run variability measurement, minimal RAG pipeline (50 lines), and a Precision@k / Recall@k / Hit@1 retrieval evaluator. CPU-friendly; GPU optional. |
+| **[Chapter 4: In-Context Learning, Few-Shot, and RAG](code/chapter04/README.md)** | Few-shot ticket classifier, prompt validator with run-to-run variability measurement, minimal RAG pipeline (50 lines), and a Precision@k / Recall@k / Hit@1 retrieval evaluator. CPU-friendly; GPU optional. |
| **[Chapter 5: Parameter-Efficient Fine-Tuning (LoRA and QLoRA)](code/chapter05/README.md)** | LoRA and QLoRA adapters trained on a 400-example Dolly subset of Qwen3-4B-Instruct-2507, evaluated against the base model with per-category Token-F1 and a safety regression suite. |
| **[Chapter 6: Supervised Fine-Tuning (SFT)](code/chapter06/README.md)** | A full-parameter SFT of Qwen3-4B-Instruct-2507 on a technical-support Dolly subset, with overfit monitoring, three-way base-vs-LoRA-vs-SFT comparison, behavioral tests, and a separate safety regression suite. |
| **[Chapter 7: Knowledge Distillation](code/chapter07/README.md)** | Black-box distillation from the chapter 6 SFT teacher into a chapter 5-style LoRA student, with quality filtering, three-way base-vs-teacher-vs-student evaluation, safety robustness check, and an optional OpenRouter-backed SFT-vs-frontier-API comparison. |
2 changes: 1 addition & 1 deletion code/chapter02/quickstart.py
@@ -70,7 +70,7 @@
def step1_prepare_dataset() -> tuple[HFDataset, HFDataset, List[Dict[str, Any]]]:
    """Step 1: download Dolly 15K and keep 40 train + 5 valid + 3 demo examples.

-    Same filter and seed as chapter 5's listing_5_2_prepare_dataset.py, just a
+    Same filter and seed as chapter 5's listing_5_1_prepare_dataset.py, just a
    smaller slice so the run finishes in minutes.
    """
    print("Step 1: prepare dataset")
2 changes: 1 addition & 1 deletion code/chapter02/run_chapter5_adapter.py
@@ -111,7 +111,7 @@ def print_no_adapter_instructions(args: argparse.Namespace) -> None:
    print("Two ways to fix this:")
    print()
    print("Option A. Train the chapter 5 adapter locally:")
-    print(" python -m chapter05.scripts.listing_5_2_prepare_dataset \\")
+    print(" python -m chapter05.scripts.listing_5_1_prepare_dataset \\")
    print(" --out chapter05/data/dolly_subset --seed 42")
    print(" python -m chapter05.train_lora \\")
    print(" --train chapter05/data/dolly_subset/train.jsonl \\")
2 changes: 1 addition & 1 deletion code/chapter04/README.md
@@ -1,4 +1,4 @@
-# Chapter 4 -- In-Context Learning and Few-Shot Adaptation
+# Chapter 4 -- In-Context Learning, Few-Shot, and RAG

This chapter covers how to get useful work out of a model without training it: few-shot prompting, many-shot prompting on long-context models, prompt validation against held-out test sets, and a minimal retrieval-augmented generation (RAG) pipeline. The code in this folder backs the four numbered listings in the chapter.
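
As a rough illustration of the retrieval metrics named for this chapter (Precision@k, Recall@k, Hit@1), the sketch below computes them for a single query. It is not the chapter's listing: the function names, the document IDs, and the choice to divide Precision@k by `k` even when fewer than `k` results come back are all assumptions made for this example.

```python
from typing import List, Set


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that are relevant (assumed convention: divide by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def hit_at_1(retrieved: List[str], relevant: Set[str]) -> float:
    """1.0 if the top-ranked result is relevant, else 0.0."""
    return 1.0 if retrieved and retrieved[0] in relevant else 0.0


# Hypothetical ranking and gold labels for one query.
retrieved = ["doc3", "doc7", "doc1", "doc9"]
relevant = {"doc1", "doc3", "doc5"}

print(f"{precision_at_k(retrieved, relevant, 3):.3f}")  # 0.667
print(f"{recall_at_k(retrieved, relevant, 3):.3f}")     # 0.667
print(hit_at_1(retrieved, relevant))                    # 1.0
```

Averaging each metric over a set of queries gives the corpus-level numbers an evaluator would typically report.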

1,132 changes: 566 additions & 566 deletions code/chapter05/README.md

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions code/chapter05/eval.py
@@ -6,8 +6,8 @@
3. **Toy golden set** - Simple Q&A pairs to sanity-check model behavior.

Also includes loss/perplexity computation on held-out JSONL data, and report
-generation (JSON + Markdown). Used by ``scripts/listing_5_4_evaluate.py``
-(Listing 5.4) to compare base model vs. adapter variants.
+generation (JSON + Markdown). Used by ``scripts/listing_5_3_evaluate.py``
+(Listing 5.3) to compare base model vs. adapter variants.
"""
from __future__ import annotations

@@ -473,7 +473,7 @@ def write_report(path: str | Path, obj: Dict[str, Any]) -> None:
    """Write an evaluation results dict as a JSON file.

    The JSON report is the machine-readable counterpart to the human-readable
-    Markdown summary generated by ``listing_5_4_evaluate.py``. Both are saved
+    Markdown summary generated by ``listing_5_3_evaluate.py``. Both are saved
    to the same output directory (e.g., ``chapter05/runs/eval_report/``).

    Args:
2 changes: 1 addition & 1 deletion code/chapter05/examples/README_INTERPRETING_RESULTS.md
@@ -1,6 +1,6 @@
# Understanding Your Evaluation Results

-This guide helps you interpret the evaluation report from `listing_5_4_evaluate.py`.
+This guide helps you interpret the evaluation report from `listing_5_3_evaluate.py`.

---

2 changes: 1 addition & 1 deletion code/chapter05/examples/example_data_prep_outcome_types.md
@@ -3,7 +3,7 @@
These illustrate the response types discussed in the chapter's "Data quality
iterations" section, using the Contoso IT-support assistant. Each is a single
training row in the same `messages` format produced by
-`scripts/listing_5_2_prepare_dataset.py` (see `dolly_to_messages`).
+`scripts/listing_5_1_prepare_dataset.py` (see `dolly_to_messages`).

> **These rows are illustrative.** The Dolly 15K subset used in this chapter
> contains no refusals and no tone tags, so these are examples of what you would
2 changes: 1 addition & 1 deletion code/chapter05/examples/example_qlora_evaluation_output.md
@@ -5,7 +5,7 @@ This file captures a typical run of the evaluation script when comparing the **b
## Command

```bash
-python chapter05/scripts/listing_5_4_evaluate.py \
+python chapter05/scripts/listing_5_3_evaluate.py \
--base Qwen/Qwen3-4B-Instruct-2507 \
--adapter chapter05/runs/dolly_lora \
--adapter_alt chapter05/runs/dolly_qlora \
220 changes: 110 additions & 110 deletions code/chapter05/generate.py
@@ -1,110 +1,110 @@
-"""Inference script: generate text with the base model and an optional LoRA/QLoRA adapter (Listing 5.5).
+"""Inference script: generate text with the base model and an optional LoRA/QLoRA adapter (Listing 5.4).

Loads the base model, optionally attaches a LoRA or QLoRA adapter, and generates
a response for a single user prompt. Supports adapter merging for deployment.

Usage (base model only):
    python -m chapter05.generate --base Qwen/Qwen3-4B-Instruct-2507 \\
        --prompt "Explain how photosynthesis works in simple terms."

Usage (with LoRA adapter):
    python -m chapter05.generate --base Qwen/Qwen3-4B-Instruct-2507 \\
        --adapter chapter05/runs/dolly_lora \\
        --prompt "Explain how photosynthesis works in simple terms."

Usage (with QLoRA adapter -- must use --quantized_4bit):
    python -m chapter05.generate --base Qwen/Qwen3-4B-Instruct-2507 \\
        --adapter chapter05/runs/dolly_qlora --quantized_4bit \\
        --prompt "Explain how photosynthesis works in simple terms."

See Chapter 5, Section 5.1 (Step 4) and the README for full details.
"""
from __future__ import annotations

import argparse
from pathlib import Path

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

from chapter05 import DEFAULT_MODEL_NAME
from chapter05.chat_template import DEFAULT_SYSTEM_PROMPT, build_prompt_text
from chapter05.modeling import load_base_model_lora, load_base_model_qlora, load_tokenizer


def parse_args() -> argparse.Namespace:
    """Parse command-line arguments for inference.

    Returns:
        Namespace with base model, adapter path, prompt, generation settings,
        and optional merge/quantization flags.
    """
    ap = argparse.ArgumentParser()
    ap.add_argument("--base", default=DEFAULT_MODEL_NAME)
    ap.add_argument("--adapter", default=None, help="Path to LoRA/QLoRA adapter folder")
    ap.add_argument("--prompt", required=True, help="User prompt")
    ap.add_argument("--system_prompt", default=DEFAULT_SYSTEM_PROMPT)
    ap.add_argument("--max_new_tokens", type=int, default=128)
    ap.add_argument("--do_sample", action="store_true")
    ap.add_argument("--temperature", type=float, default=0.7)
    ap.add_argument("--quantized_4bit", action="store_true", help="Load base in 4-bit (requires bitsandbytes)")
    ap.add_argument("--merge_adapter", action="store_true", help="Merge adapter into base before generation")
    ap.add_argument("--save_merged", default=None, help="If set, save merged model to this folder")
    return ap.parse_args()


def main() -> None:
    """Load model, optionally attach adapter, and generate a response."""
    args = parse_args()
    tokenizer = load_tokenizer(args.base)

    # Use --quantized_4bit when running a QLoRA-trained adapter so the base
    # model is loaded in 4-bit (matching the precision used during training).
    if args.quantized_4bit:
        model = load_base_model_qlora(args.base, gradient_checkpointing=False)
    else:
        model = load_base_model_lora(args.base, gradient_checkpointing=False)

    if args.adapter:
        model = PeftModel.from_pretrained(model, args.adapter)
        if args.merge_adapter:
            # merge_and_unload() permanently folds LoRA weights into the base.
            # This loses modularity (can't swap adapters) but can be faster
            # for high-throughput serving. See Section 5.11 deployment options.
            model = model.merge_and_unload()
            if args.save_merged:
                Path(args.save_merged).mkdir(parents=True, exist_ok=True)
                model.save_pretrained(args.save_merged)
                tokenizer.save_pretrained(args.save_merged)

    model.eval()

    messages = [
        {"role": "system", "content": args.system_prompt},
        {"role": "user", "content": args.prompt},
    ]
    text = build_prompt_text(tokenizer, messages)
    inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)

    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=args.max_new_tokens,
            do_sample=args.do_sample,
            # Pass temperature=None when not sampling to avoid HF warnings
            # about unused generation parameters.
            temperature=args.temperature if args.do_sample else None,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    # skip_special_tokens=False to show the full chat template (system/user/assistant
    # markers). This is useful for debugging and demonstrating the template structure.
    decoded = tokenizer.decode(out[0], skip_special_tokens=False)
    print(decoded)


if __name__ == "__main__":
    main()
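
The `--merge_adapter` comment in the script says that merging "folds LoRA weights into the base". A minimal sketch of that idea on toy matrices follows, assuming the standard LoRA update `W' = W + (alpha / r) * B @ A`. This is not PEFT's implementation, which also handles many layers, dtypes, and quantized weights; the helper names (`matmul`, `add_scaled`, `merge_lora`) and all the numbers are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product: (n x m) @ (m x p) -> (n x p)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def add_scaled(w: Matrix, delta: Matrix, scale: float) -> Matrix:
    """Element-wise w + scale * delta."""
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def merge_lora(w: Matrix, lora_a: Matrix, lora_b: Matrix, alpha: float, r: int) -> Matrix:
    """Fold the low-rank update into the base weight: W + (alpha / r) * B @ A."""
    return add_scaled(w, matmul(lora_b, lora_a), alpha / r)


# Toy layer: 3 inputs, 2 outputs, rank-1 adapter (all values are made up).
w = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # base weight, shape (out=2, in=3)
lora_a = [[1.0, 2.0, 3.0]]               # A, shape (r=1, in=3)
lora_b = [[0.5], [-1.0]]                 # B, shape (out=2, r=1)
alpha, r = 2.0, 1
x = [[1.0], [1.0], [1.0]]                # one input column vector

# Unmerged path: base output plus the scaled adapter branch.
unmerged = add_scaled(matmul(w, x), matmul(lora_b, matmul(lora_a, x)), alpha / r)
# Merged path: a single matmul against the folded weight.
merged = matmul(merge_lora(w, lora_a, lora_b, alpha, r), x)

print(unmerged)  # [[9.0], [-12.0]]
print(merged)    # [[9.0], [-12.0]]
```

Both paths give the same output, which is why a merged model needs no adapter branch at inference time. Once `A` and `B` are folded into `W`, the adapter can no longer be detached or swapped without reloading the base weights.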
