A repository of Mon language (mnw) text data for NLP research, linguistic analysis, and model training.
The collection contains data from news, encyclopedia entries, and social media.
| Metric | Value |
|---|---|
| Total Files | 8,783 |
| Mon-related Characters | 29,407,428 |
| Raw Text Length | 36,586,341 |
| Language | Mon (mnw) |
| Script | Mon/Burmese Unicode |

| Source | Files | Mon Chars | Content Type |
|---|---|---|---|
| Wikipedia | 4,208 | 18,966,185 | Encyclopedia articles |
| Mon News Agency | 3,682 | 10,229,891 | News and interviews |
| Telegram | 889 | 53,131 | Public social media |
| Total Public | 8,779 | 29,249,207 | Open Data Collection |

| Source | Files | Mon Chars | Content Type |
|---|---|---|---|
| Custom Collections | 4 | 158,221 | Curated/Miscellaneous |
- Encoding: UTF-8.
- Unicode Blocks: Myanmar block (U+1000–U+109F) and extended blocks.
- Normalization: NFC normalization is required for all data.
- Linguistic Variants: The corpus distinguishes standard Myanmar characters from Mon-specific variants (e.g., ၚ U+105A vs င U+1004).
- Metadata and UI boilerplate are removed during extraction.
- Files under 50 characters are excluded from the core collection.
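The rules above (NFC normalization, Myanmar-block characters, and the 50-character minimum) can be sketched roughly as follows. This is an illustrative sketch only, not the repository's extraction code: the function names are hypothetical, only the main Myanmar block (U+1000–U+109F) is counted, and it assumes the 50-character threshold applies to the stripped, normalized text length.

```python
import unicodedata

# Main Myanmar block; extended blocks are deliberately left out of this sketch.
MYANMAR_START, MYANMAR_END = 0x1000, 0x109F
MIN_CHARS = 50  # threshold from the list above; how it is measured is an assumption


def normalize_nfc(text: str) -> str:
    """Return the NFC-normalized form of the text."""
    return unicodedata.normalize("NFC", text)


def count_myanmar_chars(text: str) -> int:
    """Count code points that fall inside the main Myanmar block."""
    return sum(1 for ch in text if MYANMAR_START <= ord(ch) <= MYANMAR_END)


def keep_for_core(text: str) -> bool:
    """Hypothetical filter: keep texts with at least MIN_CHARS characters."""
    return len(normalize_nfc(text).strip()) >= MIN_CHARS


if __name__ == "__main__":
    sample = normalize_nfc("ဘာသာမန် (Mon)")
    print(count_myanmar_chars(sample), keep_for_core(sample))
```

Both ၚ (U+105A) and င (U+1004) fall inside the main block, so this count treats them alike; telling them apart requires comparing code points directly.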
```
.
├── monnews/            # Mon News Agency (IMNA) data
├── wikipedia/          # Mon Wikipedia data
├── telegram_mot_tip/   # Telegram channel messages
├── custom/             # Curated and legacy data
├── results/            # Analysis outputs (CSV/JSON)
├── AGENTS.md           # Engineering standards and role context
└── corpus_counter.py   # Corpus analysis utility
```
Use the analysis script to generate character and n-gram statistics.
```bash
# Basic analysis
python3 corpus_counter_normalized.py . --output-dir results

# Analysis with Mon-specific Nga normalization (င -> ၚ)
python3 corpus_counter_normalized.py . --output-dir results --normalize-mon-nga
```

- `corpus_counter_normalized.py`: Calculates character, bigram, and trigram frequencies.
- `mon_cluster_counter.py`: Analyzes grapheme clusters.
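To illustrate the kind of statistics described here, the following is a minimal sketch of code-point n-gram counting with an optional Nga normalization step (င to ၚ). It is not the actual implementation of `corpus_counter_normalized.py`: the function name and the `normalize_mon_nga` parameter are hypothetical, and the sketch counts raw code points, not grapheme clusters.

```python
from collections import Counter

NGA = "\u1004"      # င MYANMAR LETTER NGA
MON_NGA = "\u105A"  # ၚ MYANMAR LETTER MON NGA


def ngram_counts(text: str, n: int, normalize_mon_nga: bool = False) -> Counter:
    """Count overlapping n-grams of code points in the text.

    If normalize_mon_nga is set, every င is replaced with ၚ before counting,
    mirroring the direction shown in the usage comment (င -> ၚ).
    """
    if normalize_mon_nga:
        text = text.replace(NGA, MON_NGA)
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


if __name__ == "__main__":
    text = "ဘာသာမန်"
    for n, label in ((1, "characters"), (2, "bigrams"), (3, "trigrams")):
        print(label, ngram_counts(text, n).most_common(3))
```

Because Mon combines base letters with medials and vowel signs, code-point n-grams can split a visual syllable; a cluster-level analysis would need to segment the text first.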
This corpus is released under the MIT License.
If you use this data, attribute the Mon Corpus Collection and the original sources (IMNA, Wikipedia).
- Ensure all text is NFC normalized.
- Follow the character standards defined in AGENTS.md.
- Provide source attribution for new data.
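Before submitting new data, contributors could check NFC normalization with something like the sketch below. It assumes Python 3.8 or later (for `unicodedata.is_normalized`) and that the data is stored as UTF-8 `.txt` files; the file pattern and both function names are hypothetical and not part of the repository's tooling.

```python
import unicodedata
from pathlib import Path


def is_nfc(text: str) -> bool:
    """Return True if the text is already in NFC form."""
    return unicodedata.is_normalized("NFC", text)


def find_non_nfc_files(root: str, pattern: str = "*.txt") -> list:
    """List files under root whose contents are not NFC-normalized.

    The "*.txt" pattern is an assumption about how the corpus files are named.
    """
    return [
        path
        for path in sorted(Path(root).rglob(pattern))
        if not is_nfc(path.read_text(encoding="utf-8"))
    ]


if __name__ == "__main__":
    for path in find_non_nfc_files("."):
        print(f"Not NFC: {path}")
```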
Contributors: Janakh Pon, Htaw Mon