Skip to content

krahd/modelito

Repository files navigation

modelito

Modelito is a compact, dependency-light Python library that provides provider- agnostic abstractions and connectors for large language models (LLMs). It offers lightweight shims for OpenAI, Claude, Gemini, oMLX, local Ollama deployments, and local OpenAI-compatible servers (llama.cpp, vLLM, LM Studio), plus utilities for token counting, timeout estimation, and small helpers to manage Ollama servers when needed. The library is designed for easy integration into applications and CI pipelines.

Quick start

Install

To install the latest released version from PyPI:

pip install modelito

pip install modelito does not install FastAPI/Uvicorn. Those are optional and only needed for modelito-serve.

Release publishing uses PyPI trusted publishing through .github/workflows/publish.yml. The PyPI project must have a matching trusted publisher configured for repository krahd/modelito, workflow publish.yml, and environment pypi. The workflow also checks that the tag version matches pyproject.toml before building or publishing.

For development / contributor setup (editable install and dev dependencies):

pip install -e .[dev]

# Optional: add runtime extras for full functionality
pip install -e .[ollama,tokenization,openai,anthropic,gemini,grok]

Run tests (for contributors):

pytest -q

Install from TestPyPI (preview builds)

If you need to test a preview build published to TestPyPI, use the TestPyPI index. TestPyPI packages are for testing only and may not be stable.

python -m pip install --index-url https://test.pypi.org/simple/ \
    --extra-index-url https://pypi.org/simple modelito==<version>

If installation from the index fails, download the wheel from the TestPyPI "Files" page and install it directly.

Build and install

To build a source distribution and wheel locally:

python -m pip install --upgrade build
python -m build

Install from the built wheel:

pip install dist/*.whl

See the docs/ folder for more details:

Architecture snapshot

The diagrams below are intentionally compact and current-state oriented. Detailed architecture policy lives in docs/ARCHITECTURE.md.

modelito architecture

modelito provider request flow

Providers

This package provides compatibility shims and small, dependency-light implementations for common provider interfaces. When optional extras are installed the package will attempt to use real SDK clients; otherwise the shims provide safe offline-friendly fallbacks suitable for testing.

Recommended entry point:

from modelito import Client
from modelito.messages import Message

client = Client(provider="auto", prefer=["omlx", "ollama"], model="omlx")
response = client.chat([Message(role="user", content="Hello")])
print(response.text)

Provided shims and utilities:

  • OpenAIProvider — hosted OpenAI / SDK-backed provider that can also target OpenAI-compatible APIs via base_url.
  • OpenAICompatibleHTTPProvider — shared HTTP base class for local or OpenAI-compatible runtimes.
  • RawChatProvider — protocol for preserving raw OpenAI chat completion payloads without collapsing them into text.
  • ClaudeProvider — will use the official Anthropic SDK when installed, falling back to deterministic behavior otherwise.
  • GeminiProvider, GrokProvider — lightweight shims.
  • OMLXProvider — thin oMLX preset built on OpenAICompatibleHTTPProvider.
  • OllamaProvider — HTTP-aware provider that can call a local Ollama HTTP API through stdlib helpers and can fall back to the local Ollama CLI or deterministic test behavior when needed. The modelito[ollama] extra installs optional support dependencies used by the broader Ollama service-management helpers.

The client layer recognises the same provider stack through ChatProvider, MessageInput, and structured response helpers such as Client.chat() and Client.chat_json().

OpenAICompatibleHTTPProvider, OMLXProvider, and OpenAIProvider also expose raw_complete() and raw_stream() for OpenAI-compatible passthrough. Client.chat_parsed() remains the structured JSON convenience path for Python applications.

For quick diagnostics, use the provider readiness API or CLI:

from modelito import check_provider_ready

status = check_provider_ready("omlx", model="omlx")
print(status.ready, status.reason)

The shared message normaliser is exported as flatten_message_inputs from both modelito and modelito.messages for callers that need OpenAI-style dict conversion.

python -m modelito doctor --provider omlx --model omlx

Server mode for non-Python clients:

pip install "modelito[serve]"
modelito-serve --provider omlx --port 11436 --host 127.0.0.1 --strict

--profile and --profile-path are currently treated as profile file paths; --profile-path takes precedence when both are provided.

modelito-serve exposes OpenAI-compatible /v1/models, /v1/chat/completions, and /v1/embeddings endpoints.

Pi integration uses HTTP only. Pi is a TypeScript/Node harness and does not import Modelito directly; point Pi at the modelito-serve base URL via its OpenAI-compatible custom provider configuration.

Example ~/.pi/agent/models.json provider entry:

{
  "providers": {
    "modelito": {
      "baseUrl": "http://localhost:11436/v1",
      "api": "openai-completions",
      "apiKey": "modelito",
      "authHeader": false,
      "compat": {
        "supportsDeveloperRole": false,
        "supportsReasoningEffort": false
      },
      "models": [
        {
          "id": "omlx",
          "name": "oMLX via Modelito",
          "reasoning": false,
          "input": ["text"],
          "contextWindow": 8192,
          "maxTokens": 4096,
          "cost": {
            "input": 0,
            "output": 0,
            "cacheRead": 0,
            "cacheWrite": 0
          }
        }
      ]
    }
  }
}

Tool-calling workflows require raw passthrough support. Modelito currently implements that on OpenAICompatibleHTTPProvider and the hosted OpenAIProvider; OMLXProvider inherits it automatically. OllamaProvider raw passthrough is deferred. For tool-calling integrations today, prefer oMLX/OpenAI-compatible raw providers or hosted OpenAI.

The package also exposes a small Ollama administration layer for local model operations, including install backend detection, remote catalog metadata, download lifecycle tracking, and explicit model readiness confirmation through helpers such as detect_install_method, list_remote_model_catalog, download_model_progress, and ensure_model_ready.

License / AS IS

This software is provided "AS IS" and without warranties of any kind. See the included LICENSE file for the full MIT license text.

CI / Integration Tests

This repository includes a consolidated GitHub Actions workflow at .github/workflows/ci.yml. It runs linting/type checks and unit tests for pull requests and pushes to main, and builds docs on non-PR runs.

Release publishing uses PyPI trusted publishing through .github/workflows/publish.yml. The PyPI project must have a matching trusted publisher configured for repository krahd/modelito, workflow publish.yml, and environment pypi. The workflow also checks that the tag version matches pyproject.toml before building or publishing.

Ollama integration tests are intentionally gated and will only run when you explicitly enable them. To run integration tests locally or in CI set the environment variable RUN_OLLAMA_INTEGRATION=1. Additional optional flags:

  • ALLOW_OLLAMA_INSTALL=1 — permit the integration tests to attempt installing Ollama when missing.
  • ALLOW_OLLAMA_DOWNLOAD=1 — permit downloading remote models during integration tests.
  • ALLOW_OLLAMA_UPDATE=1 — permit running update flows during integration tests.

Example (local):

RUN_OLLAMA_INTEGRATION=1 pytest tests/test_ollama_integration.py -q

Provider integration tests for external services (OpenAI, Anthropic, etc.) are intentionally not part of default hosted CI to keep pull requests fast and low noise. Use local/manual execution for those checks when needed.

There is a dedicated self-hosted Ollama workflow at .github/workflows/integration-ollama.yml for maintainers who want broader integration checks on controlled infrastructure.

Provider interface

modelito exposes a minimal structural Provider Protocol for the legacy synchronous surface, but the recommended application entry point is Client and its richer chat API. The Protocol is intentionally small to remain compatible with existing duck-typed providers — it requires only:

  • list_models() -> list[str]
  • summarize(messages, settings=None) -> str

All built-in providers shipped with the package (OpenAIProvider, ClaudeProvider, GeminiProvider, OMLXProvider, OllamaProvider, GrokProvider) satisfy the Provider protocol structurally. The Provider Protocol is decorated with @runtime_checkable, so you can use isinstance() checks at runtime when you need to enforce the contract in application code.

Example usage:

from modelito import Provider, OllamaProvider

p: Provider = OllamaProvider()
if isinstance(p, Provider):
    from modelito.messages import Message
    resp = p.summarize([Message(role="user", content="hello")])
    print(resp)

The package provides typed Message/Response dataclasses and exposes a small set of optional Protocols for provider surfaces:

  • SyncProvider (alias: Provider) — existing synchronous summarize()/list_models() surface.
  • AsyncProvider — async acomplete() surface for providers that support awaitable calls.
  • StreamingProvider — streaming stream() generator surface.
  • EmbeddingProviderembed() surface for vector embeddings.

Embeddings can also be selected at runtime through the dedicated Embedder wrapper when you only need the embedding surface instead of the full text generation client:

from modelito import Embedder

embedder = Embedder(provider="openai")
vectors = embedder.embed(["hello", "world"])
print(len(vectors), len(vectors[0]))
print(Embedder.available_embedders())

modelito exposes Message and Response dataclasses; client and provider helpers accept Message, plain strings, and OpenAI-style dict inputs.

Using bare Provider (recommended for most cases)

from modelito import Provider, Message, OllamaProvider

# Create a provider
provider: Provider = OllamaProvider()

# Single request
resp = provider.summarize([Message(role="user", content="hello")])
print(resp)

# Streaming
for chunk in provider.stream([Message(role="user", content="tell me a story")]):
    print(chunk, end="", flush=True)

Use a bare Provider when:

  • You manage conversation state yourself
  • You're doing single-shot or stateless inference
  • You need minimal abstraction
  • You're building a custom application architecture

Using OllamaConnector (for conversation management)

from modelito import Message, OllamaConnector, OllamaProvider

# Create a connector
conn = OllamaConnector(provider=OllamaProvider())

# Multi-turn conversation (state is tracked automatically)
res = conn.complete(
    conv_id="chat_session_1", 
    new_messages=[Message(role="user", content="what's 2+2?")]
)
print(res.text)

# Second turn (history is remembered)
res = conn.complete(
    conv_id="chat_session_1", 
    new_messages=[Message(role="user", content="and 3+3?")]
)
print(res.text)

Use OllamaConnector when:

  • You need automatic conversation history tracking
  • You're building a multi-turn chatbot
  • You want to manage per-conversation state without writing it yourself

For more details, see ARCHITECTURE.md

Streaming semantics

Modelito normalizes provider streaming into a simple incremental text stream. Providers may emit data at different granularities; the connector/streaming helpers attempt to normalize these into a sequence of text chunks that are safe to concatenate to form the final output. Common shapes you will encounter:

  • Token-level: Backends (e.g., OpenAI SDK) may stream individual token deltas. These are emitted as short text fragments; consumers should append fragments in order to reconstruct the full output.
  • Chunk-level: Some providers deliver logical chunks or events (for example, chunked JSON payloads). Modelito extracts the textual portion and yields it as incremental chunks.
  • Line-delimited / SSE: HTTP services (like Ollama's /api/generate) may send newline-delimited JSON or SSE frames. Modelito reads and normalizes the frames and yields textual content as it becomes available.

Behavioral notes:

  • The stream() generator yields str pieces; each yielded item is intended to be appended to reconstruct the response incrementally.
  • When you need token-level control (e.g., streaming token-by-token), prefer providers that expose token deltas (OpenAI SDK). Modelito will still yield those token deltas as text fragments.
  • Offline/deterministic fallbacks yield the full text in a single chunk.

Notes on additional modules

Modelito includes internal/helper modules such as local model management, API key helpers, mock providers, cache helpers, and batching utilities. These modules are not currently presented as stable top-level package exports in this README. Prefer the documented Client, provider adapters, connector, and server entrypoints for application integrations.

See the tests/ directory for comprehensive coverage and usage examples.

About

Lightweight Python abstractions and connectors for LLMs (OpenAI, Claude, Gemini, Ollama)

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors