Pipelines for creating QA triplets from GSMA data for the Open Telecom LLMs project.
This repository contains end-to-end data processing pipelines that transform GSMA technical specifications and reports into high-quality synthetic question-answer datasets for training telecom-focused language models. The pipeline processes hundreds of documents through stages including document conversion, semantic chunking, synthetic Q&A generation using large language models, similarity analysis, quality filtering, and LLM-based validation. The resulting datasets are published to HuggingFace Hub in both contrastive learning format (for embedding models) and Q&A format (for retrieval-augmented generation). The repository includes three main pipelines: PRD (technical specifications), Discover (reports and whitepapers), and Annotation (human validation workflows with domain expert workspaces).
This project was developed fairly rapidly, and some design decisions made at the outset we now consider to be sub-optimal. Owing to time constraints, and an unwillingness to recreate time-consuming and expensive steps (such as question creation and validation), there is some technical debt that would ideally be resolved in a longer project. In retrospect, dvc was not a good fit for this workflow: while it is otherwise well suited to creating reproducible AI (machine learning) pipelines, several stages here are very time-consuming and expensive and are unlikely to be reproduced. Iterative (the dvc developers) now offer a tool called datachain, which was designed to address many of the shortcomings we experienced with dvc and is worth evaluating as a better fit.
Running the complete dvc pipelines will recreate the questions data and then the validation, both of which take a considerable amount of time, and owing to the non-deterministic nature of LLMs, the questions created will differ from the data delivered to the Open Telco project so far. For this reason I (@ivyleavedtoadflax) recommend not attempting to re-run the whole pipeline, unless the intention is to completely recreate the datasets produced for the project, which is probably not desirable.
In addition, the existing PRD pipeline will register as being in need of reproduction if you run `dvc status` or `dvc repro --dry`. This is because we made changes to pipeline components as we went, but did not recreate the PRD pipeline from the start, as doing so would have invalidated results that had already been created and annotated.
Some tasks (such as assigning a working group) would be better completed as part of the pipeline and incorporated at an early stage. We did not do this because it would have required recreating the questions data: we did not receive the working group mapping until after that data had been produced.
Initially we worked with a simple JSON file format with one document per chunk. Later this became a bottleneck, and we switched to parquet files. To avoid having to re-run the question creation task, we did not implement parquet at the beginning of the pipeline, so you will see the initial stages of the pipeline using JSON files and the later stages using parquet. Ideally we would have used parquet throughout.
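For anyone converting the remaining JSON stages, the change is mechanical. Below is a minimal sketch, assuming one JSON document per chunk file with a flat, consistent schema; the paths and field layout are illustrative rather than the repository's actual format:

```python
# Illustrative only: collect per-chunk JSON documents into a single parquet
# file (requires pandas plus a parquet engine such as pyarrow).
import json
from pathlib import Path

import pandas as pd


def json_chunks_to_parquet(input_dir: str, output_path: str) -> None:
    records = []
    for path in sorted(Path(input_dir).glob("*.json")):
        with path.open() as f:
            records.append(json.load(f))  # one chunk document per file
    # A single parquet file is far cheaper to read repeatedly than
    # thousands of small JSON files.
    pd.DataFrame(records).to_parquet(output_path, index=False)


if __name__ == "__main__":
    # Hypothetical paths for demonstration.
    json_chunks_to_parquet("data/prd/chunks_500", "data/prd/chunks_500.parquet")
```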
The filter stages of the workflows were implemented with binary classifiers distilled from larger models. We used the sieves framework to do this. Sieves is moving rapidly, so, to avoid adding instability to this project's dependencies, we did not attempt to implement the distillation process in this repository. Instead we used a sieves script to train the models externally in their own virtual environment, and simply loaded the models in the pipeline for inference.
An example script is included in the examples/ folder.
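At inference time the pipeline only needs to load a trained classifier and score text. Below is a minimal sketch of that loading step, assuming the distilled model can be loaded as a standard Hugging Face text-classification model; the model path, label name, and threshold are illustrative, not the repository's actual artefacts:

```python
# Illustrative only: load a distilled binary classifier and use it to filter
# chunks. The model path and label are assumptions for demonstration.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="models/chunk_procedures_classifier",  # hypothetical local path
)


def keep_chunk(text: str, min_probability: float = 0.5) -> bool:
    """Return True if the chunk passes the quality filter."""
    result = classifier(text, truncation=True)[0]
    # Assumes the positive label marks procedural/boilerplate content
    # that should be filtered out.
    if result["label"] == "procedural":
        return result["score"] < min_probability
    return True
```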
In the validation stage we need to send several hundred thousand requests to Qwen for validation, which is slow and expensive. I recommend pinning the provider to Cerebras, which is optimised for high-throughput inference at reasonable cost. Running with 50 concurrent tasks works without issue; running with 100 generated many 429 errors, so the optimum is probably somewhere in between.
The validation step requires several hundred thousand API calls. To manage this process effectively, we implemented an SQLite job tracker that records the outcome of each API call in an ephemeral database stored in `.dvc/.tmp/validation_checkpoints/requests.db`. This ensures that successful and failed tasks can be tracked and retried, something which was not achievable with dvc alone.
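The tracker itself is simple. The sketch below illustrates the idea (record each request's outcome so a re-run only retries pending or failed requests, with a semaphore capping concurrency at roughly 50); the table layout, column names, and `call_llm` helper are assumptions, not the repository's actual implementation:

```python
# Illustrative sketch of SQLite checkpointing for a large batch of LLM calls.
# The schema and helper names here are hypothetical.
import asyncio
import sqlite3

DB_PATH = ".dvc/.tmp/validation_checkpoints/requests.db"


def init_db(path: str = DB_PATH) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS requests ("
        "request_id TEXT PRIMARY KEY, status TEXT, error TEXT)"
    )
    return conn


async def validate_all(requests, call_llm, max_concurrent: int = 50):
    """Run validation calls with a concurrency cap, skipping completed ones."""
    conn = init_db()
    done = {
        row[0]
        for row in conn.execute("SELECT request_id FROM requests WHERE status = 'success'")
    }
    # ~50 concurrent requests avoided 429 errors in our runs; 100 did not.
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(request_id: str, payload: dict):
        async with semaphore:
            try:
                await call_llm(payload)
                status, error = "success", None
            except Exception as exc:  # failed calls stay retryable on the next run
                status, error = "failed", str(exc)
            conn.execute(
                "INSERT OR REPLACE INTO requests VALUES (?, ?, ?)",
                (request_id, status, error),
            )
            conn.commit()

    await asyncio.gather(
        *(run_one(rid, payload) for rid, payload in requests if rid not in done)
    )
```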
If you wish to run this step with dvc, the approach is to:

- Delete the database prior to running the `validate_requests` stage for the first time. Setting the `--force` parameter on `uv run gsma validation validate-requests` will have the same effect.
- Set a request limit (`50,000` is reasonable) to reduce memory overhead.
- Run the `validate_requests` stage multiple times until no further requests remain in the queue. If you run the stage with dvc it will show the stage as completed after the first run, since it has no knowledge of the checkpoints database, so you will need to run `dvc repro -sf pipelines/prd/validate_requests` to force re-running that stage. Passing the `-i` parameter will make the run interactive and allow you to confirm it prior to execution.
- You can monitor progress of the job by running `uv run scripts/check_validation_progress.py` (see the sketch after this list for the kind of query it performs).
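The progress check amounts to counting rows by status in the checkpoint database. A minimal sketch, reusing the hypothetical schema from the tracker sketch above (the actual script to use is `scripts/check_validation_progress.py`):

```python
# Illustrative only: summarise validation progress from the checkpoint DB.
import sqlite3

conn = sqlite3.connect(".dvc/.tmp/validation_checkpoints/requests.db")
for status, count in conn.execute(
    "SELECT status, COUNT(*) FROM requests GROUP BY status"
):
    print(f"{status}: {count}")
```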
- Install dependencies:

```bash
uv sync
```

- Configure AWS credentials (if using remote DVC storage):

```bash
# Set up your Mantis AWS key
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
# This may fail for some artefacts. See note below for DVC limitations.
```

- Pull latest data:

```bash
dvc pull
```

- Reproduce the pipeline (see the caveats above before doing this):

```bash
dvc repro
```

The data directory is organised as follows:

```
data/
├── raw/                                   # Original source documents (DOCX, PDF)
├── prd/                                   # PRD pipeline outputs
│   ├── processed/                         # Markdown files from DOCX conversion
│   ├── chunks_*/                          # Chunked data at different token sizes (500, 1000, 2000, 3000, 4000)
│   ├── questions_*/                       # Generated Q&A pairs per chunk size
│   ├── combined/                          # Merged chunks + questions with working group classification
│   ├── similarity/                        # Similarity analysis results (hashes, rankings, overlaps)
│   ├── exploded/                          # Question-centric format with positive/negative chunks
│   ├── filtered/                          # Quality-filtered questions and chunks
│   └── validation/                        # LLM validation results and final datasets
├── discover/                              # Discover pipeline outputs (similar structure to PRD)
└── gsma_prd_synthetic_with_subgroups/     # Annotated dataset with subgroup classifications
```
End-to-end pipeline for technical specifications:
- `process_documents`:
  - Converts DOCX → Markdown
  - Removes GSMA template boilerplate
  - Input: `data/raw/` → Output: `data/prd/processed/`
- `create_late_chunks` (5 stages):
  - Creates late chunks at 500/1000/2000/3000/4000 tokens
  - Uses sentence-transformers/all-MiniLM-L6-v2 embeddings
  - Output: `data/prd/chunks_{size}/`
- `generate_questions` (5 stages):
  - Generates 5/10/20/30/40 synthetic Q&A pairs per chunk size
  - Uses Cerebras GPT-OSS-120B via OpenRouter
  - Output: `data/prd/questions_{size}/`
- `data_combiner`:
  - Merges all chunks + questions with working group classification
  - Output: `data/prd/combined/`
- `similarity_hasher`:
  - Adds SHA-256 content hashes for deduplication
  - Output: `data/prd/similarity/hashed/`
- `similarity_ranker`:
  - FAISS IVFFlat similarity search (top-K=20, threshold=0.3); see the sketch after this list
  - Output: `data/prd/similarity/ranked/`
- `overlap_detector`:
  - Character offset-based text overlap detection (min 50 chars)
  - Output: `data/prd/similarity/overlaps/`
- `explode_questions`:
  - Question-centric format (min similarity: 0.35, max: 0.95)
  - Output: `data/prd/exploded/`
- `apply_question_filter`:
  - External reference classifier (filters unavailable content)
  - Output: `data/prd/filtered/questions/`
- `apply_chunk_filter`:
  - Procedures classifier + keyword exclusion
  - Filters: legal/procedural content, "prd@gsma.com" boilerplate
  - Output: `data/prd/filtered/chunks/`
- `filter_questions_by_chunk_quality`:
  - Combined quality filtering (min probability: 0.5)
  - Output: `data/prd/filtered/combined/`
- `validate_requests`:
  - LLM validation with Qwen 235B via Cerebras
  - 50 concurrent requests, 50k question limit
  - Output: `data/prd/validation/validated/`
- `create_validation_dataset`:
  - Dual format: embedding (contrastive) + QA (RAG)
  - Max 3 positives/negatives per question
  - Output: `data/prd/validation/datasets/`
- `upload_embedding_dataset`:
  - Uploads to HuggingFace: `mantisnlp/gsma_prd_synthetic_embedding`
- `upload_qa_dataset`:
  - Uploads to HuggingFace: `mantisnlp/gsma_prd_synthetic_qa`
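For orientation, here is a minimal sketch of the kind of index the similarity ranking stage builds: a FAISS IVFFlat index over normalised chunk embeddings, queried for the top 20 neighbours and thresholded at 0.3. The placeholder chunks and the `nlist` value are illustrative assumptions, not the pipeline's exact implementation:

```python
# Illustrative only: approximate nearest-neighbour search with FAISS IVFFlat
# over normalised sentence-transformer embeddings (inner product = cosine).
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
chunks = ["chunk one ...", "chunk two ...", "chunk three ..."]  # placeholder text
embeddings = model.encode(chunks, normalize_embeddings=True).astype("float32")

dim = embeddings.shape[1]
nlist = min(100, len(chunks))  # number of inverted-list clusters (illustrative)
quantizer = faiss.IndexFlatIP(dim)
index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(embeddings)
index.add(embeddings)

# Top-K = 20 neighbours per chunk, keeping only scores above the 0.3 threshold.
scores, neighbours = index.search(embeddings, 20)
threshold = 0.3
for i, (row_scores, row_ids) in enumerate(zip(scores, neighbours)):
    similar = [(int(j), float(s)) for j, s in zip(row_ids, row_scores)
               if j != i and j != -1 and s >= threshold]
```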
Similar structure for reports/whitepapers (304 PDF/DOCX documents):
- Includes web scraping with Playwright
- PDF processing via PyMuPDF
- Same chunking → validation → dataset creation workflow
- Outputs: `mantisnlp/gsma_discover_synthetic_embedding` and `mantisnlp/gsma_discover_synthetic_qa`
Human validation workflow with subgroup-based tasks:
- `add_subgroups`: Adds working group/subgroup classifications to datasets
- `upload_*_annotation`: Creates Argilla workspaces for domain experts (TSG, FASG, NG, RCS, eSIM)
Run the test suite with:

```bash
uv run pytest tests/ -v
```

Install dependencies with:

```bash
uv sync
```

The `gsma` CLI provides comprehensive tools for document processing, question generation, validation, filtering, and annotation management.
Document processing:

```bash
# Convert DOCX to Markdown
uv run gsma process <input_dir> <output_dir>

# Remove duplicate files
uv run gsma deduplicate <directory> [--execute]

# Create chunks from Markdown files
uv run gsma chunk <input_dir> <output_dir> [--chunker late] [--chunk-size 500]
```

Question generation:

```bash
# Generate synthetic Q&A pairs from chunks
uv run gsma questions generate-from-chunks <input_path> <output_path> \
    --num-questions 5 \
    --model cerebras/llama3.1-70b

# Combine questions with chunks
uv run gsma questions combine-questions <questions_path> <chunks_path> <output_path>
```

Similarity analysis:

```bash
# Combine data with working group classification
uv run gsma similarity combine <chunks_dir> <questions_dir> <output_path>

# Add SHA-256 content hashes
uv run gsma similarity hash <input_path> <output_path>

# FAISS similarity ranking
uv run gsma similarity rank <input_path> <output_path> --top-k 20

# Detect text overlaps
uv run gsma similarity detect-overlaps <input_path> <output_path>
```

Filtering:

```bash
# Apply chunk quality filter (procedures classifier)
uv run gsma filters apply-chunk-filter <input_path> <output_path>

# Apply question filter (external reference classifier)
uv run gsma filters apply-question-filter <input_path> <output_path>

# Filter questions by chunk quality
uv run gsma filters filter-questions-by-chunk-quality <input_path> <output_path>
```

Validation:

```bash
# Explode questions to question-centric format
uv run gsma validation explode-questions <input_path> <output_path>

# Validate Q&A pairs with LLM
uv run gsma validation validate-requests <input_path> <output_path> \
    --model cerebras/qwen-2.5-235b \
    --max-concurrent 50
```

Dataset creation and upload:

```bash
# Create datasets from validation results
uv run gsma datasets create-from-validation <input_path> <output_dir>

# Upload to HuggingFace Hub
uv run gsma datasets upload <dataset_path> <repo_name>
```

Annotation management (Argilla):

```bash
# Upload dataset for annotation
uv run gsma argilla upload --dataset-path <path> -w <workspace>

# Upload by subgroup
uv run gsma argilla upload-by-subgroup \
    --dataset-path <path> \
    --subgroup TSG \
    --sample-size 1000

# User management
uv run gsma argilla add-users -w TSG --count 10 --output-csv users.csv
uv run gsma argilla add-user -u alice -p secret123 -w TSG -w FASG
uv run gsma argilla list-users -w TSG
uv run gsma argilla list-workspaces
uv run gsma argilla list-datasets -w TSG

# Track annotation progress
uv run gsma argilla track-progress -w TSG

# Download annotated results
uv run gsma argilla download <dataset_name> --output-path <path> -w <workspace>

# Cleanup
uv run gsma argilla delete-user -u username
uv run gsma argilla delete-workspace TSG
```

Subgroup classification:

```bash
# Add subgroup classifications to dataset
uv run gsma add-subgroup-to-dataset \
    --dataset-repo mantisnlp/gsma_prd_synthetic \
    --working-groups data/working_groups_mapping.json \
    --output data/gsma_prd_synthetic_with_subgroups
```

Help:

```bash
# Get help for any command
uv run gsma --help
uv run gsma <command> --help
uv run gsma <command> <subcommand> --help
```