Skip to content

Latest commit

 

History

History

README.md

Chapter 1 — The Python Data Model

Status Concepts

"The Python data model formalizes the interfaces of the building blocks of the language itself." — Luciano Ramalho, Fluent Python


🧠 What This Chapter Is Really About

Chapter 1 doesn't just show you __len__ and __getitem__. It introduces the most important design principle in Python: the Data Model as a framework of protocols and hooks.

When you implement special methods (dunders), you're not just adding functionality — you're making your objects first-class citizens of the Python ecosystem. They gain access to:

  • Built-in functions (len, abs, bool, repr, str)
  • Operators (+, *, [], in, <, >)
  • Language constructs (for, with, if, unpacking)
  • Standard library functions (random.choice, sorted, reversed)

The fundamental insight: Python doesn't check what you are — it checks what you can do. This is structural typing (protocols) rather than nominal typing (inheritance).


📐 Core Concepts

1. The Protocol-Based Design Model

Python built-in function / operator
            │
            ▼
    Checks for special method
    (e.g., __len__, __add__)
            │
      ┌─────┴──────┐
      │            │
  Built-in      User class
  C shortcut    Python dispatch
   (~50ns)        (~90ns)

Python's special methods are interceptors — they let your classes integrate into the language's core machinery without inheritance.

2. The Truthiness Protocol Chain

bool(obj) ──► __bool__() exists? ──YES──► return __bool__()
                    │
                   NO
                    │
              __len__() exists? ──YES──► return len(obj) != 0
                    │
                   NO
                    │
              return True  (all objects are truthy by default)

This fallback chain means: if you define __len__ but not __bool__, Python uses length to determine truthiness. An empty custom collection is automatically falsy.

3. The Representation Protocol

Method When Called Audience
__repr__ repr(obj), REPL, debugger Developer
__str__ str(obj), print(obj) End user
__format__ f"{obj:spec}", format(obj, spec) Custom formatting

Rule: Always implement __repr__. Implement __str__ only when the user-facing representation should differ from the developer-facing one. If __str__ is missing, Python falls back to __repr__.


🔬 CPython Internals

How len() Really Works

// Objects/abstract.c (simplified)
Py_ssize_t
PyObject_Size(PyObject *o)
{
    PySequenceMethods *m;
    if (o == NULL) { ... }

    m = Py_TYPE(o)->tp_as_sequence;
    if (m && m->sq_length) {
        Py_ssize_t len = m->sq_length(o);  // C-level direct call!
        ...
    }
    // Falls back to mp_length for mappings
    return PyMapping_Size(o);
}

For user-defined classes, Python calls __len__ through the type's tp_as_sequence slot populated during class creation. The key point: len() has a fast path for C built-ins that bypasses Python attribute lookup entirely.

This is why len([1,2,3]) is faster than [1,2,3].__len__() on lists — the former hits C directly, the latter goes through Python's attribute lookup machinery.


📊 Performance Analysis

Operation Time (ns) Notes
len(list) ~50 C-level direct call
list.__len__() ~80 Python attribute lookup overhead
len(custom_class) ~90 Python dispatch via type
repr(obj) ~100–500 Depends on __repr__ complexity

Measured on Python 3.12, results vary by hardware. See benchmarks.py for full methodology.


🏗️ Architecture Notes

Why Python Chose Protocols Over Abstract Base Classes

Python could have required from collections.abc import Sequence; class MySeq(Sequence) — and ABCs do exist. But the data model was designed for structural subtyping first. This means:

  1. Third-party code remains compatible — old code that predates ABCs works because Python checks for methods, not inheritance
  2. Less ceremony — you implement only what you need; no abstract method stubs required
  3. More Pythonic — "if it walks like a duck and quacks like a duck, it's a duck"

The ABCs in collections.abc complement this: they let you register existing classes as virtual subclasses and provide mixin implementations.


🎯 Real-World Applications

  1. Django ORM: QuerySet.__len__ and QuerySet.__iter__ make query results work seamlessly with len() and for loops
  2. NumPy arrays: Full data model implementation gives NumPy arrays operator overloading (+, *), slicing, and boolean evaluation
  3. Pandas DataFrames: __getitem__ enables df['column'] syntax; __len__ gives len(df) as row count
  4. SQLAlchemy: Result.__iter__ allows for row in query_result:

📁 Files in This Chapter

File Description
examples.py Annotated implementations: FrenchDeck++, Vector2D, custom protocol demos
exercises.py Original exercises extending data model concepts
mini_project.py Card game engine using the full data model
benchmarks.py Timing dunder dispatch, repr overhead, truthiness chain
notes.md Structured reference notes
pitfalls.md Common dunder method mistakes + production fixes
interview_questions.md Senior-level Q&A for this topic
architecture_notes.md Why Python made these design decisions

🔗 Related Chapters & Repos

  • Chapter 9 (this repo): Decorators — another protocol-based system built on __call__
  • Chapter 11: Building custom sequences — extends __len__ and __getitem__ further
  • Chapter 16: Operator overloading — __add__, __mul__, reflected operators
  • python-internals-playground: Deep dives into each dunder method family