A command line tool and Python library that automates the extraction of key information from invoices to support your accounting process. The library is very flexible and can be used on other types of business documents as well.
In essence, invoice2data simplifies getting data from invoices by:
- Automating text extraction — no more manual copying and pasting.
- Using templates for structure — handles different invoice layouts.
- Providing structured output — data ready for analysis or further processing.
This makes it a valuable tool for businesses and developers dealing with a large volume of invoices, saving time and reducing manual-entry errors. It:
- extracts text from PDF files with a pluggable, cascading backend: pdfium (default, no system deps), pdftotext, text, pdfminer, pdfplumber, or OCR (tesseract, ocrmypdf, docTR, paddleocr, gvision).
- searches for regex in the result using a YAML or JSON-based template system (with an optional AI fallback).
- saves results as CSV, JSON or XML, or renames PDF files to match the content.
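The idea of a cascading backend can be sketched in a few lines of Python. This is a minimal illustration, not invoice2data's actual implementation: the function `extract_text_cascading` and the three stand-in backends are hypothetical names invented for this example. Each backend is tried in order, and the first one that returns non-empty text wins.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical stand-in type for a text-extraction backend: it takes a
# file path and returns the extracted text ("" when nothing was found).
Backend = Callable[[str], str]


def extract_text_cascading(
    path: str, backends: Iterable[Tuple[str, Backend]]
) -> Tuple[Optional[str], str]:
    """Return (backend_name, text) for the first backend that yields
    non-empty text, or (None, "") if every backend fails."""
    for name, backend in backends:
        try:
            text = backend(path)
        except Exception:
            # A failing backend should not abort the cascade.
            continue
        if text.strip():
            return name, text
    return None, ""


# Hypothetical backends used only to exercise the cascade.
def broken_backend(path: str) -> str:
    raise RuntimeError("backend not installed")


def empty_backend(path: str) -> str:
    return ""  # e.g. a scanned PDF without a text layer


def ocr_like_backend(path: str) -> str:
    return "Invoice 30064443 from QualityHosting"


chain = [
    ("broken", broken_backend),
    ("empty", empty_backend),
    ("ocr", ocr_like_backend),
]
used, text = extract_text_cascading("invoice.pdf", chain)
print(used, "->", text)
```

Ordering the chain from cheapest to most expensive (plain text extraction first, OCR last) keeps the common case fast while still handling scanned documents.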
With the flexible template system you can:
- precisely match the content of PDF files
- use plugins to match line items and tables
- define static fields that are the same for every invoice
- define custom fields needed in your organisation or process
- have multiple regex per field (if layout or wording changes)
- define currency
- extract invoice-items using the lines-plugin developed by Holger Brunn
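To make the template features above concrete, here is a rough sketch of how keyword matching, static fields, and multiple regexes per field could work together. It uses only Python's `re` module. The `template` dict and the `apply_template` function are hypothetical and written for illustration; they are not the library's template schema or API, so consult the template creation guide for the real format.

```python
import re

# Hypothetical template, expressed as a plain dict for illustration only.
template = {
    "issuer": "QualityHosting",
    "keywords": ["QualityHosting"],
    "static": {"currency": "EUR"},  # same value for every invoice
    "fields": {
        # Several patterns per field cover layout or wording changes.
        "invoice_number": [
            r"Invoice No\.?\s*(\d+)",
            r"Rechnungsnr\.?\s*(\d+)",
        ],
        "amount": [r"Total\s+EUR\s+(\d+\.\d{2})"],
    },
}


def apply_template(text, template):
    """Return the extracted fields, or None if the template does not match."""
    if not all(keyword in text for keyword in template["keywords"]):
        return None
    result = {"issuer": template["issuer"], **template["static"]}
    for field, patterns in template["fields"].items():
        for pattern in patterns:
            match = re.search(pattern, text)
            if match:
                result[field] = match.group(1)
                break
        else:
            # No pattern matched: treat the field as required and give up.
            return None
    return result


sample = "QualityHosting AG\nRechnungsnr. 30064443\nTotal EUR 34.73\n"
print(apply_template(sample, template))
```

In this sketch the captured values stay as strings; converting amounts and dates to typed values would be a separate step.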
Go from PDF files to this:
{'issuer': 'QualityHosting', 'amount': 34.73, 'date': datetime.datetime(2014, 5, 7, 0, 0), 'invoice_number': '30064443', 'currency': 'EUR', 'desc': 'Invoice 30064443 from QualityHosting', 'template_name': 'com.qualityhosting.yml'}
{'issuer': 'Amazon EU', 'amount': 35.24, 'date': datetime.datetime(2014, 6, 4, 0, 0), 'invoice_number': 'EUVINS1-OF5-DE-120725895', 'currency': 'EUR', 'desc': 'Invoice EUVINS1-OF5-DE-120725895 from Amazon EU'}
{'issuer': 'Amazon Web Services', 'amount': 4.11, 'date': datetime.datetime(2014, 8, 3, 0, 0), 'invoice_number': '42183017', 'currency': 'USD', 'desc': 'Invoice 42183017 from Amazon Web Services'}
{'issuer': 'Envato', 'amount': 101.0, 'date': datetime.datetime(2015, 1, 28, 0, 0), 'invoice_number': '12429647', 'currency': 'USD', 'desc': 'Invoice 12429647 from Envato'}
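When working with such a result dict in your own Python code, note that the `date` value is a `datetime` object, which the standard `json` module cannot serialise without help. The sketch below shows one way to handle that. The helper name `encode_extra` is made up for this example, and the ISO 8601 date format is an assumption here, not necessarily what the command line tool writes.

```python
import datetime
import json

# One of the result dicts shown above.
result = {
    "issuer": "QualityHosting",
    "amount": 34.73,
    "date": datetime.datetime(2014, 5, 7, 0, 0),
    "invoice_number": "30064443",
    "currency": "EUR",
    "desc": "Invoice 30064443 from QualityHosting",
}


def encode_extra(value):
    """Convert types the json module cannot serialise on its own."""
    if isinstance(value, (datetime.datetime, datetime.date)):
        return value.isoformat()
    raise TypeError(f"Cannot serialise {type(value).__name__}")


# json.dumps calls encode_extra only for values it does not understand.
payload = json.dumps(result, default=encode_extra, indent=2)
print(payload)
```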
pip install invoice2data
invoice2data invoice.pdf # extract -> CSV
invoice2data --output-format json invoice.pdf # or JSON / XML
As a Python library:
from invoice2data import extract_data
result = extract_data("invoice.pdf")
No system libraries are required by default: the pdfium backend bundles its own engine. Optional backends and extras (poppler, OCR, AI, ...) are covered in the installation guide.
Full documentation: https://invoice2data.readthedocs.io/
- How it works — the extraction pipeline
- Installation — backends, OCR and optional extras
- Usage — all CLI options and common tasks
- Template creation — write templates for your invoices
- Recommended fields — the canonical output schema
- AI features — optional LLM fallback & template generation
- FAQ — including a comparison with other tools
If you are interested in improving this project, have a look at our contributor guide to get you started quickly.
- integrate with online OCR?
- try to 'guess' parameters for new invoice formats.
- apply machine learning to guess new parameters / template creation
- Data cleanup per field
- advanced table parsing with pypdf_table_extraction
- Harshit Joshi: contributed as a Google Summer of Code student.
- Holger Brunn: added support for parsing invoice items.
Contributions are very welcome. To learn more, see the Contributor Guide.
- Odoo, OCA module account_invoice_import_invoice2data
- OCR-Invoice (FOSS | C#)
- DeepLogic AI (Commercial | SaaS)
- Docparser (Commercial | Web Service)
- A-PDF (Commercial)
- PDFdeconstruct (Commercial)
- CVision (Commercial)