Mon Corpus Collection

A repository of Mon language (mnw) text data for NLP research, linguistic analysis, and model training.

Corpus Statistics

The collection contains data from news, encyclopedia entries, and social media.

Overview (2026-04-27)

Metric	Value
Total Files	8,783
Mon-related Characters	29,407,428
Raw Text Length	36,586,341
Language	Mon (mnw)
Script	Mon/Burmese Unicode

Public Dataset (Open Source)

Source	Files	Mon Chars	Content Type
Wikipedia	4,208	18,966,185	Encyclopedia articles
Mon News Agency	3,682	10,229,891	News and interviews
Telegram	889	53,131	Public social media
Total Public	8,779	29,249,207	Open Data Collection

Additional Sources

Source	Files	Mon Chars	Content Type
Custom Collections	4	158,221	Curated/Miscellaneous
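
The totals in the tables above can be cross-checked with a few lines of arithmetic. The sketch below simply copies the per-source figures from this README; the dictionary and function names are illustrative and are not part of the repository's scripts.

```python
# Per-source counts copied from the tables above: (files, Mon characters).
public_sources = {
    "Wikipedia": (4_208, 18_966_185),
    "Mon News Agency": (3_682, 10_229_891),
    "Telegram": (889, 53_131),
}
additional_sources = {
    "Custom Collections": (4, 158_221),
}


def totals(sources):
    """Sum the file and character counts over a dict of sources."""
    files = sum(f for f, _ in sources.values())
    chars = sum(c for _, c in sources.values())
    return files, chars


public_files, public_chars = totals(public_sources)
extra_files, extra_chars = totals(additional_sources)

print(public_files, public_chars)                               # 8779 29249207
print(public_files + extra_files, public_chars + extra_chars)   # 8783 29407428
```

The public subtotal plus the custom collections reproduces the overall figures in the Overview table.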

Technical Specifications

Encoding and Normalization

Encoding: UTF-8.
Unicode Blocks: Myanmar block (U+1000–U+109F) and extended blocks.
Normalization: NFC normalization is required for all data.
Linguistic Variants: Distinguishes between standard Myanmar characters and Mon-specific variants (e.g., Mon Nga ၚ U+105A vs. Burmese Nga င U+1004).
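
As a rough illustration of the points above, the following sketch applies NFC normalization, counts characters in the main Myanmar block, and tallies the two Nga letters separately. It uses only Python's standard library. The function names are hypothetical, and counting code points in U+1000–U+109F is only an assumption about how "Mon-related characters" might be measured; the repository's scripts may define the metric differently.

```python
import unicodedata

MYANMAR_START, MYANMAR_END = 0x1000, 0x109F  # main Myanmar block
MON_NGA = "\u105A"      # ၚ MYANMAR LETTER MON NGA
BURMESE_NGA = "\u1004"  # င MYANMAR LETTER NGA


def to_nfc(text):
    """Return the NFC-normalized form of the text."""
    return unicodedata.normalize("NFC", text)


def count_myanmar_chars(text):
    """Count code points that fall inside the main Myanmar block (assumed metric)."""
    return sum(MYANMAR_START <= ord(ch) <= MYANMAR_END for ch in text)


def nga_counts(text):
    """Tally Mon Nga and Burmese Nga separately."""
    return {"mon_nga": text.count(MON_NGA), "burmese_nga": text.count(BURMESE_NGA)}


# Illustrative sample: three Myanmar letters, both Nga variants, and Latin text.
sample = to_nfc("\u1019\u1014\u103A \u105A \u1004 abc")
print(count_myanmar_chars(sample))  # 5
print(nga_counts(sample))           # {'mon_nga': 1, 'burmese_nga': 1}
```

NFC matters in practice because a few Myanmar letters have canonical decompositions; for example, U+1025 followed by U+102E composes to U+1026 under NFC.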

Data Quality

Metadata and UI boilerplate are removed during extraction.
Files under 50 characters are excluded from the core collection.
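
The length rule could be expressed as a simple filter. This is a minimal sketch, assuming the threshold is measured in Unicode code points of the extracted text; the README does not say whether bytes, code points, or Mon characters are counted, and `keep_for_core` is a hypothetical name.

```python
MIN_CHARS = 50  # threshold stated in this README


def keep_for_core(text, min_chars=MIN_CHARS):
    """Return True if the text is long enough for the core collection.

    Assumption: length is counted in Unicode code points.
    """
    return len(text) >= min_chars


print(keep_for_core("x" * 49))  # False
print(keep_for_core("x" * 50))  # True
```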

Project Structure

.
├── monnews/               # Mon News Agency (IMNA) data
├── wikipedia/             # Mon Wikipedia data
├── telegram_mot_tip_ebook/  # Telegram channel messages
├── custom/                # Curated and legacy data
├── results/               # Analysis outputs (CSV/JSON)
├── AGENTS.md              # Engineering standards and role context
├── corpus_counter_normalized.py  # Corpus analysis utility
└── mon_cluster_counter.py        # Grapheme cluster analysis

Usage

Analyzing the Corpus

Use the analysis script to generate character and n-gram statistics.

# Basic analysis
python3 corpus_counter_normalized.py . --output-dir results

# Analysis with Mon-specific Nga normalization (င -> ၚ)
python3 corpus_counter_normalized.py . --output-dir results --normalize-mon-nga
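
For readers who want to see what the Nga normalization amounts to, here is a minimal sketch based only on the mapping named in the comment above (င -> ၚ). It is an assumption that the flag performs an unconditional replacement; the actual script may apply context-sensitive rules, and `normalize_mon_nga` is an illustrative name.

```python
BURMESE_NGA = "\u1004"  # င MYANMAR LETTER NGA
MON_NGA = "\u105A"      # ၚ MYANMAR LETTER MON NGA


def normalize_mon_nga(text):
    """Replace every Burmese Nga with Mon Nga (assumed behaviour of the flag)."""
    return text.replace(BURMESE_NGA, MON_NGA)


before = "\u1004\u1019"
after = normalize_mon_nga(before)
print([f"U+{ord(ch):04X}" for ch in after])  # ['U+105A', 'U+1019']
```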

Core Scripts

corpus_counter_normalized.py: Calculates character, bigram, and trigram frequencies.
mon_cluster_counter.py: Analyzes grapheme clusters.
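
The kinds of statistics these scripts report can be sketched as follows. This is not the repository's implementation: `ngram_counts` and `clusters` are hypothetical helpers, and the cluster splitter is a deliberate simplification that attaches combining marks (Unicode categories Mn and Mc) to the preceding base and keeps a consonant stacked after the virama U+1039 in the same cluster. It does not implement the full Unicode grapheme segmentation rules.

```python
import unicodedata
from collections import Counter

VIRAMA = "\u1039"  # MYANMAR SIGN VIRAMA, used for stacked consonants


def ngram_counts(text, n):
    """Count overlapping character n-grams of length n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def clusters(text):
    """Split text into simplified clusters (base plus attached marks)."""
    out = []
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        if out and (is_mark or out[-1].endswith(VIRAMA)):
            out[-1] += ch  # attach to the current cluster
        else:
            out.append(ch)  # start a new cluster
    return out


sample = "\u1019\u1014\u103A\u1019\u1014\u103A"  # the syllable မန် repeated twice

char_freq = ngram_counts(sample, 1)
bigram_freq = ngram_counts(sample, 2)
trigram_freq = ngram_counts(sample, 3)
cluster_freq = Counter(clusters(sample))

print(sum(bigram_freq.values()))   # 5
print(sum(trigram_freq.values()))  # 4
print(len(clusters(sample)))       # 4
```

Counting clusters rather than code points keeps a consonant and its asat or medial signs together, which is usually the more meaningful unit for Myanmar-script text.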

License and Attribution

This corpus is released under the MIT License.

If you use this data, attribute the Mon Corpus Collection and the original sources (IMNA, Wikipedia).

Contributing

Ensure all text is NFC normalized.
Follow the character standards defined in AGENTS.md.
Provide source attribution for new data.
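
Before submitting new data, the first and third items in this checklist could be verified with a small script. The sketch below is one possible approach, not an official tool of this repository; `check_contribution` is a hypothetical name, and the character standards in AGENTS.md are not checked because they are not reproduced here. `unicodedata.is_normalized` requires Python 3.8 or later.

```python
import unicodedata


def check_contribution(text, source):
    """Return a list of problems found in a candidate contribution."""
    problems = []
    if not unicodedata.is_normalized("NFC", text):
        problems.append("text is not NFC normalized")
    if not source.strip():
        problems.append("missing source attribution")
    return problems


# U+1025 followed by U+102E is the decomposed form of U+1026, so it fails the NFC check.
print(check_contribution("\u1025\u102E", ""))
print(check_contribution("\u1026", "IMNA"))  # []
```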

Contributors: Janakh Pon, Htaw Mon
