44 commits
9afa246
feat: add memory ingest pipeline and remember command
May 13, 2026
5ce8636
feat: add recall, clean, conflict resolution and QA eval for memory
May 13, 2026
c9d61b2
fix: harden pdf insert pipeline
tulongshaonian771 May 13, 2026
bc9f5cf
feat: process insert sources sequentially
tulongshaonian771 May 13, 2026
455b104
fix: deduplicate wiki vector sync changes
tulongshaonian771 May 13, 2026
acc27d5
feat: add insert merge progress bar
tulongshaonian771 May 14, 2026
0bc9376
fix: start insert progress at one percent
tulongshaonian771 May 14, 2026
e1ca69f
feat: add insert planning toggle
tulongshaonian771 May 14, 2026
1bacb30
feat: support image insert captions
tulongshaonian771 May 14, 2026
7908935
feat: support audio insert transcripts
tulongshaonian771 May 14, 2026
0e93963
fix: tighten query source attribution
tulongshaonian771 May 14, 2026
55ed99a
fix: validate query used sources
tulongshaonian771 May 14, 2026
1eae861
feat: add ask command with outer agent loop over memory and KB
May 14, 2026
7b6fe52
feat: deduplicate kb insights via semantic search before write-back
May 15, 2026
2afb9b1
Revert "feat: deduplicate kb insights via semantic search before writ…
May 15, 2026
e5068b3
feat: merge feature/kb-insight-dedup into main
May 15, 2026
13bb2d9
feat: support code file insert
tulongshaonian771 May 15, 2026
8904673
fix: preserve code raw source links
tulongshaonian771 May 15, 2026
fe81e0d
fix: preserve code symbol signatures
tulongshaonian771 May 15, 2026
61bda57
fix: retry invalid query json answers
tulongshaonian771 May 15, 2026
b848a32
feat: merge feature/invalidate-kb-memory into main
May 15, 2026
dc26256
feat: merge feature/mem-show-insights into main
May 15, 2026
06860bf
feat: support html insert parsing
tulongshaonian771 May 15, 2026
e3ab278
fix: clean html wiki page extraction
tulongshaonian771 May 15, 2026
2e4d463
fix: improve html article extraction
tulongshaonian771 May 15, 2026
0cff8da
fix: improve html documentation extraction
tulongshaonian771 May 15, 2026
56e3b11
feat: merge feature/insight-distill-then-answer into main
May 15, 2026
cf1682e
feat: support mineru office documents
tulongshaonian771 May 15, 2026
cfc15c7
feat: unify little heta CLI styling on the Heta blue theme
tulongshaonian771 May 18, 2026
89cdc87
docs: reword CLI command help in plain language
tulongshaonian771 May 18, 2026
da20fc5
chore: prepare pypi release
tulongshaonian771 May 18, 2026
aef9d3b
docs: update workspace paths and mineru links
tulongshaonian771 May 18, 2026
b7fe002
docs: invite stars and issues
tulongshaonian771 May 18, 2026
ff3e16d
docs: clarify wiki foundation wording
tulongshaonian771 May 18, 2026
5f9320c
docs: update memory speed benchmark
tulongshaonian771 May 18, 2026
7d41934
docs: use static python version badge
tulongshaonian771 May 18, 2026
fab2440
docs: clarify KB/memory separation and four memory types in README
May 18, 2026
0acf38c
docs: use static pypi badge
tulongshaonian771 May 18, 2026
cde57c8
chore: remove uv lockfile
tulongshaonian771 May 18, 2026
6a40580
feat: track mineru image sources
tulongshaonian771 May 22, 2026
6b21956
Add hybrid wiki retrieval and provider clients
tulongshaonian771 Jun 1, 2026
8a398f3
chore: use english insert result labels
tulongshaonian771 Jun 2, 2026
2480bd2
feat: add static insert mode
tulongshaonian771 Jun 7, 2026
9b5ead8
chore: bump version to 0.2.1
tulongshaonian771 Jun 7, 2026
36 changes: 36 additions & 0 deletions .github/workflows/pypi-publish.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
name: publish

on:
  release:
    types: [published]

permissions:
  contents: read
  id-token: write

jobs:
  pypi:
    name: build and publish to PyPI
    runs-on: ubuntu-latest
    environment: pypi

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install build tooling
        run: python -m pip install --upgrade build twine

      - name: Build distribution
        run: python -m build

      - name: Check distribution
        run: python -m twine check dist/*

      - name: Publish to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1
34 changes: 34 additions & 0 deletions .gitignore
@@ -3,6 +3,9 @@ __pycache__/
.pytest_cache/
.mypy_cache/
.ruff_cache/
.coverage
coverage.xml
htmlcov/
.venv/
dist/
build/
@@ -11,3 +14,34 @@ workspace/
output/
.codex/
.agents/

# Local secrets/configuration
.env
.env.*
!.env.example

# Runtime data and generated indexes
*.db
*.db-*
*.sqlite
*.sqlite3
*.sqlite3-*

# Logs and temporary run output
*.log
*.out
*.tmp
*.bak

# Large local artifacts
*.tar
*.tar.gz
*.tgz
*.zip
*.7z

# Local staging and evaluation artifacts
staging/

# Local generated previews
docs/*preview*.html
249 changes: 195 additions & 54 deletions README.md
@@ -1,106 +1,240 @@
# Little Heta

Little Heta is a lightweight command line tool for personal knowledge, memory,
and document intelligence workflows. It converts local documents into a
Markdown wiki, keeps wiki page identity stable, and can maintain a SQLite
vector index for faster semantic retrieval.
<p align="center">
<img src="docs/assets/little-heta-banner.png" alt="Little Heta banner">
</p>

<p align="center">
<a href="README.md">English</a> ·
<a href="docs/i18n/README.zh-CN.md">简体中文</a> ·
<a href="docs/i18n/README.zh-TW.md">繁體中文</a> ·
<a href="docs/i18n/README.ja.md">日本語</a> ·
<a href="docs/i18n/README.ko.md">한국어</a> ·
<a href="docs/i18n/README.es.md">Español</a> ·
<a href="docs/i18n/README.pt.md">Português</a> ·
<a href="docs/i18n/README.fr.md">Français</a> ·
<a href="docs/i18n/README.de.md">Deutsch</a>
</p>

<p align="center">
<a href="https://pypi.org/project/little-heta/"><img src="https://img.shields.io/badge/pypi-v0.1.0-3775A9?style=for-the-badge&logo=pypi&logoColor=white" alt="PyPI v0.1.0"></a>
<img src="https://img.shields.io/badge/python-3.10%2B-2B6CB0?style=for-the-badge&logo=python&logoColor=white" alt="Python 3.10+">
<a href="LICENSE"><img src="https://img.shields.io/badge/license-MIT-2EA44F?style=for-the-badge" alt="License: MIT"></a>
<a href="https://knowledgexlab.github.io/"><img src="https://img.shields.io/badge/KnowledgeXLab-Little%20Heta-111827?style=for-the-badge&logo=github&logoColor=white" alt="KnowledgeXLab"></a>
</p>

Little Heta is a local CLI knowledge infrastructure for personal documents,
agent memory, and document intelligence. It turns PDFs, Office files, images,
audio, code, HTML, Markdown, and notes into a stable Markdown wiki, adds
semantic vector retrieval, and lets agents reuse distilled knowledge through a
memory layer.

## Status

This repository is an early `v0.1.0` implementation. The current focus is a
fast local workflow for initialization, document insertion, wiki maintenance,
and optional vector indexing.

## Features
## Install

- Interactive first-time setup with `heta init`
- Provider configuration for Qwen, ChatGPT, or Gemini
- Optional MinerU integration for PDF parsing
- Markdown wiki generation under the Little Heta workspace
- Stable numeric wiki page ids in page filenames
- Optional SQLite + sqlite-vec wiki chunk index
- CLI status view with provider, MinerU, KB, wiki, and space usage summaries
Install from PyPI:

## Install
```bash
pip install little-heta
```

From a local checkout:

```bash
pip install -e .
```

For development dependencies:
For development:

```bash
pip install -e ".[dev]"
```

## Quick Start

Initialize Little Heta:
The package installs the `heta` command:

```bash
heta init
heta --help
```

The wizard writes configuration to:

```text
~/.heta/heta.yaml
```
## Initialize

Check the current workspace and provider status:
Run the first-time setup:

```bash
heta status
heta init
```

Insert one file or a directory:
You need to prepare:

```bash
heta insert ./docs
- An LLM API key for one provider: Qwen, ChatGPT, or Gemini.
- Optional MinerU access for PDF and Office parsing. Apply or learn more at
[MinerU](https://mineru.net/apiManage/docs).

`heta init` writes config and workspace data under:

```text
~/.heta/
```

Large PDFs are profiled and split before parsing by default. Little Heta gives a
lightweight PDF profile to a planning agent, validates the returned page ranges,
and falls back to fixed page windows when planning is unavailable. Disable this
behavior when you want to parse a PDF as one source file:
It also installs the Little Heta agent skill automatically into:

```bash
heta insert --no-pdf-planning ./large.pdf
```text
~/.codex/skills/heta
~/.claude/skills/heta
```

Ask a read-only question against the wiki:
## Use with Codex and Claude Code

After `heta init`, Codex and Claude Code can discover the Little Heta skill
globally. The skill tells the agent when to use:

```bash
heta query "What is HetaGen?"
heta ask "..."
heta query "..."
heta recall "..."
heta remember "..."
```

Clean wiki pages and the vector database while keeping raw files:
You can refresh or reinstall the skill at any time:

```bash
heta clean
heta skill
```

Manage vector indexing:
For other agent frameworks, copy these two files:

```bash
heta vector status
heta vector on
heta vector off
```text
~/.heta/skills/heta/SKILL.md
~/.heta/skills/heta/COMMANDS.md
```

## What You Get

Most personal knowledge bases eventually become a `/raw` folder: papers,
slides, screenshots, audio clips, code files, notes, and half-finished drafts
all pile up together. A normal agent can read those files directly, but every
question pays the same cost again: open the index, guess which page matters,
read long pages, and spend tokens rediscovering context it already found before.

Little Heta separates the external knowledge base from the agent's internal
memory. The KB remains the source of truth: a structured, versioned wiki built
from the user's files. Memory, by contrast, is the agent's persistent working
layer, storing reusable information that helps the agent reason, route, and
avoid repeated deep retrieval. This creates a memory-first, KB-grounded
retrieval loop.
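
The memory-first, KB-grounded loop described above can be sketched roughly as follows. This is a minimal illustration, not Little Heta's implementation: the `answer` function, its callback parameters, and the toy in-memory stores are all hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple


def answer(
    question: str,
    recall: Callable[[str], Optional[str]],
    query_kb: Callable[[str], str],
    remember: Callable[[str, str], None],
) -> Tuple[str, str]:
    """Return (answer, source), where source is "memory" or "kb"."""
    cached = recall(question)
    if cached is not None:
        # Memory hit: reuse the distilled insight and skip the wiki traversal.
        return cached, "memory"
    # Memory miss: fall back to the KB, which stays the source of truth.
    insight = query_kb(question)
    # Write the insight back so later questions can reuse it.
    remember(question, insight)
    return insight, "kb"


# Toy stand-ins for the real stores (assumptions for illustration only).
memory: Dict[str, str] = {}
kb_calls: List[str] = []


def toy_recall(question: str) -> Optional[str]:
    return memory.get(question)


def toy_query_kb(question: str) -> str:
    kb_calls.append(question)
    return f"wiki answer for: {question}"


def toy_remember(question: str, insight: str) -> None:
    memory[question] = insight


first = answer("What is HetaGen?", toy_recall, toy_query_kb, toy_remember)
second = answer("What is HetaGen?", toy_recall, toy_query_kb, toy_remember)
print(first[1], second[1], len(kb_calls))
```

The first call pays for the KB lookup and stores the result; the second call with the same question is served from memory without touching the KB again.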

Little Heta turns that pile into a persistent agent workspace:

- **Wiki foundation**: raw files are compiled into stable Markdown pages with
numeric page ids, clean `[[Wiki Links]]`, and Git history.
- **Vector Wiki**: each page is chunked by Markdown structure, so `heta query`
can jump to the right section instead of relying only on sparse `index.md`
summaries.
- **Memory-first retrieval**: `heta ask` stores distilled KB insights after
expensive lookups, allowing later questions to reuse prior KB understanding
instead of repeating the same deep wiki traversal.
- **Synchronized memory + KB management**: memory stays tied to the evolving
wiki. When KB content changes, related memories can be invalidated to prevent
stale cached insights from drifting away from the source of truth.
- **Agent reuse**: larger teams and multi-agent workflows benefit because useful
KB discoveries can be reused across later questions, sessions, and agents.
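
As a rough illustration of the "chunked by Markdown structure" idea in the list above, the sketch below splits a page at its headings. It is an assumption about how such chunking could work, not the project's actual chunker; `chunk_markdown` and the chunk dictionary shape are hypothetical.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # fenced-code marker, built this way to keep the example simple


def chunk_markdown(text: str) -> List[Dict[str, str]]:
    """Split a Markdown page into one chunk per heading section."""
    sections = []
    heading, lines = "", []
    in_fence = False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines inside fenced code are never treated as headings.
        match = None if in_fence else HEADING.match(line)
        if match:
            if heading or any(item.strip() for item in lines):
                sections.append((heading, lines))
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    if heading or any(item.strip() for item in lines):
        sections.append((heading, lines))
    return [
        {"heading": name, "text": "\n".join(body).strip()}
        for name, body in sections
    ]


page = "# Intro\nLittle Heta overview.\n\n## Install\npip install little-heta\n"
chunks = chunk_markdown(page)
for chunk in chunks:
    print(chunk["heading"], "->", chunk["text"])
```

Each chunk keeps its heading, so a vector hit can point to a specific section of a page rather than to the page as a whole.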

Heta's memory architecture stores four complementary types of information:

- **Raw dialogue memory**: original user-agent interaction history, preserving
full context and wording.
- **Atomic fact memory**: compact factual statements extracted from
interactions, useful for precise attribute or preference recall.
- **Episodic memory**: event-level summaries that capture tasks, decisions,
temporal context, and multi-step work sessions.
- **KB insight memory**: distilled insights produced after KB retrieval,
storing what the agent learned from external documents so future questions
can reuse that understanding without repeating the same expensive traversal.
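
One way to picture the four memory types, and how KB insight memory could be invalidated when wiki pages change, is the sketch below. The `MemoryKind`, `MemoryRecord`, and `invalidate_for_pages` names and fields are hypothetical and do not reflect the actual schema of `mem.sqlite3`.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Set


class MemoryKind(Enum):
    RAW_DIALOGUE = "raw_dialogue"
    ATOMIC_FACT = "atomic_fact"
    EPISODIC = "episodic"
    KB_INSIGHT = "kb_insight"


@dataclass
class MemoryRecord:
    kind: MemoryKind
    text: str
    # Wiki page ids the record was distilled from (assumed field, KB insights only).
    source_page_ids: Set[int] = field(default_factory=set)
    valid: bool = True


def invalidate_for_pages(records: List[MemoryRecord], changed: Set[int]) -> int:
    """Mark KB insights stale when any of their source pages changed."""
    count = 0
    for record in records:
        if (
            record.kind is MemoryKind.KB_INSIGHT
            and record.valid
            and record.source_page_ids & changed
        ):
            record.valid = False
            count += 1
    return count


records = [
    MemoryRecord(MemoryKind.ATOMIC_FACT, "User prefers short answers."),
    MemoryRecord(MemoryKind.KB_INSIGHT, "Insight from page 12.", {12}),
    MemoryRecord(MemoryKind.KB_INSIGHT, "Insight from page 40.", {40}),
]
stale = invalidate_for_pages(records, {12})
print(stale, [record.valid for record in records])
```

Only the insight tied to the changed page is marked stale; dialogue, fact, and episodic records are untouched because they do not derive from wiki pages.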

Retrieval quality depends heavily on corpus structure. In corpora where
important details are buried deep inside long wiki pages and poorly represented
by summaries, index-only wiki navigation can suffer severe retrieval collapse.
In our initial stress scenarios, Vector Wiki and memory-backed retrieval
improved answer accuracy by roughly **1.25x-5x+**, with some cases recovering
from **0% to 100%** accuracy.

Memory-backed reuse used **82.1% fewer tokens** than index-only wiki query and
answered **2.58x faster** in a multi-page comparison setting. This gap is expected to
grow in larger or messier workspaces, because index-only wiki navigation scales
with the number and length of pages an agent may need to inspect, while
memory-backed reuse resolves repeated questions from previously distilled
insights. The main extra cost is the first pass that creates the reusable
insight.
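
To make the two figures above easier to compare, the short sketch below converts a "fraction saved" into a cost ratio and a speedup into a fraction of time saved. It only restates the reported numbers in another form; the helper names are made up for this example.

```python
def reduction_to_ratio(fraction_saved: float) -> float:
    """Convert 'X fewer tokens' (as a fraction) into an 'N times fewer' ratio."""
    return 1.0 / (1.0 - fraction_saved)


def speedup_to_time_saved(speedup: float) -> float:
    """Convert an 'N times faster' speedup into the fraction of time saved."""
    return 1.0 - 1.0 / speedup


token_ratio = round(reduction_to_ratio(0.821), 2)
time_saved = round(speedup_to_time_saved(2.58), 3)
print(token_ratio)  # about 5.59x fewer tokens
print(time_saved)   # about 0.612, i.e. roughly 61% less time
```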

## Core CLI

The main commands are:

- `heta init`: set up providers, workspace, and agent skills.
- `heta status`: show provider, MinerU, wiki, memory, and space status.
- `heta insert`: add files or folders to the knowledge base.
- `heta query`: ask a read-only question against inserted documents.
- `heta ask`: answer using memory and the document KB together.
- `heta remember`: save a fact, decision, or preference.
- `heta recall`: retrieve saved memory.
- `heta clean`: remove generated wiki pages and vector DB while keeping raw files.
- `heta vector`: turn document vector indexing on, off, or show status.
- `heta insert-planning`: turn smart insert planning on, off, or show status.
- `heta mem-show`: inspect stored KB memories.
- `heta mem-clean`: erase memory data.
- `heta skill`: install or refresh agent skills.

Detailed command docs:

- [init](docs/cli/init.md)
- [status](docs/cli/status.md)
- [insert](docs/cli/insert.md)
- [query](docs/cli/query.md)
- [ask](docs/cli/ask.md)
- [remember](docs/cli/remember.md)
- [recall](docs/cli/recall.md)
- [clean](docs/cli/clean.md)
- [vector](docs/cli/vector.md)
- [insert-planning](docs/cli/insert-planning.md)
- [mem-show](docs/cli/mem-show.md)
- [mem-clean](docs/cli/mem-clean.md)
- [skill](docs/cli/skill.md)

## Supported Files

Little Heta can insert:

- Markdown and text: `.md`, `.markdown`, `.txt`
- PDF and Office: `.pdf`, `.doc`, `.docx`, `.ppt`, `.pptx`, `.xls`, `.xlsx`
- Images: `.png`, `.jpg`, `.jpeg`, `.webp`, `.gif`, `.bmp`
- Audio and video transcripts: `.mp3`, `.wav`, `.m4a`, `.flac`, `.ogg`, `.mp4`
- Code and config files: `.py`, `.js`, `.ts`, `.tsx`, `.jsx`, `.java`, `.go`,
`.rs`, `.cpp`, `.c`, `.h`, `.hpp`, `.sh`, `.sql`, `.yaml`, `.yml`, `.json`,
`.toml`
- HTML: `.html`, `.htm`

PDF and Office parsing require MinerU. Images and audio/video require a
multimodal or transcription-capable LLM provider.
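
The extension lists above can be read as a simple lookup table. The sketch below is illustrative only: the category names and the `classify` and `requirement` helpers are assumptions, not part of the `heta` package.

```python
from pathlib import Path
from typing import Optional

CATEGORIES = {
    "markdown_text": {".md", ".markdown", ".txt"},
    "pdf_office": {".pdf", ".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx"},
    "image": {".png", ".jpg", ".jpeg", ".webp", ".gif", ".bmp"},
    "audio_video": {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".mp4"},
    "code_config": {
        ".py", ".js", ".ts", ".tsx", ".jsx", ".java", ".go", ".rs", ".cpp",
        ".c", ".h", ".hpp", ".sh", ".sql", ".yaml", ".yml", ".json", ".toml",
    },
    "html": {".html", ".htm"},
}


def classify(path: str) -> Optional[str]:
    """Return the category for a file path, or None if it is unsupported."""
    suffix = Path(path).suffix.lower()
    for category, extensions in CATEGORIES.items():
        if suffix in extensions:
            return category
    return None


def requirement(category: Optional[str]) -> Optional[str]:
    """Name the extra service a category needs, per the note above."""
    if category == "pdf_office":
        return "MinerU"
    if category in {"image", "audio_video"}:
        return "multimodal or transcription-capable LLM provider"
    return None


print(classify("report.PDF"), requirement(classify("report.PDF")))
print(classify("archive.tar.gz"))
```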

## Workspace

Little Heta stores local runtime data under:
Runtime data lives under:

```text
~/.heta/
```

The workspace contains raw source files, generated wiki pages, worktrees, and
the local database used by the vector index. Runtime workspace data is not
intended to be committed to this repository.
Important paths:

```text
~/.heta/heta.yaml                              config
~/.heta/workspace/kb/raw                       archived source files
~/.heta/workspace/kb/wiki/index.md             wiki entry index
~/.heta/workspace/kb/wiki/pages/               generated Markdown wiki pages
~/.heta/workspace/kb/wiki/log.md               wiki operation log
~/.heta/workspace/kb/db/wiki_vectors.sqlite3   local wiki vector database
~/.heta/workspace/mem/mem.sqlite3              local memory database
~/.heta/skills/heta/                           portable Little Heta agent skill
```
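
For scripts that need to locate these files, a small helper along the following lines may be handy. It only mirrors the paths listed above; `workspace_paths` and `missing` are hypothetical helpers, not functions exported by Little Heta.

```python
from pathlib import Path
from typing import Dict, List, Optional


def workspace_paths(home: Optional[Path] = None) -> Dict[str, Path]:
    """Build the documented workspace paths under <home>/.heta."""
    root = (home or Path.home()) / ".heta"
    kb = root / "workspace" / "kb"
    return {
        "config": root / "heta.yaml",
        "raw": kb / "raw",
        "wiki_index": kb / "wiki" / "index.md",
        "wiki_pages": kb / "wiki" / "pages",
        "wiki_log": kb / "wiki" / "log.md",
        "vector_db": kb / "db" / "wiki_vectors.sqlite3",
        "memory_db": root / "workspace" / "mem" / "mem.sqlite3",
        "skill": root / "skills" / "heta",
    }


def missing(paths: Dict[str, Path]) -> List[str]:
    """List the names of paths that do not exist yet."""
    return [name for name, path in paths.items() if not path.exists()]


paths = workspace_paths()
print(paths["config"])
print("missing:", missing(paths))
```

Before `heta init` has run, most or all entries are expected to show up as missing.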

## Development

@@ -113,11 +247,18 @@ pytest
Project layout:

```text
src/heta/        CLI, config, providers, and KB implementation
src/heta/        CLI, config, assistants, memory, and KB implementation
docs/            user and technical documentation
tests/           unit tests
pyproject.toml   package metadata and dependencies
```

## Community

If Little Heta is useful to you, please consider giving the project a star. If
you run into bugs, rough edges, or missing workflows, open an issue and tell us
what happened.

## License

Little Heta is released under the MIT License. See [LICENSE](LICENSE).
Binary file added docs/assets/little-heta-banner.png