Ready-to-run example eval sets for the task families in the Pareta model marketplace. Each folder contains an items.jsonl file (plus a documents/ folder of real source documents for document tasks) that you can browse, download, or load in-app via "Try the example set".
These are the same bundled examples the product ships. They are built from public benchmarks (synthetic, CC0, or licensed eval corpora), not from customer data.
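For working with a downloaded folder outside the app, a minimal sketch of reading its items.jsonl might look like the following. It assumes only that the file is standard JSON Lines with one JSON object per line; the item schema is not described here, so the helper `load_items` (a hypothetical name, not part of any Pareta API) returns plain dicts and makes no claim about field names.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Union


def load_items(folder: Union[str, Path]) -> List[Dict[str, Any]]:
    """Read <folder>/items.jsonl and return one parsed record per non-empty line.

    Assumption: each line holds one JSON object (JSON Lines format).
    """
    path = Path(folder) / "items.jsonl"
    items: List[Dict[str, Any]] = []
    with path.open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                items.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Report the file and line so a bad record is easy to locate.
                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc
    return items
```

Calling `load_items("intent-classification")` on a local copy of that folder should return a list of records; inspect the first one to see which fields the task uses.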
| Task | Metric | Source | Items | Docs |
|---|---|---|---|---|
| agent-airline | Successful task | τ-bench airline | 10 | — |
| agent-retail | Successful task | τ-bench retail | 10 | — |
| code-generation | pass@1 | MBPP+ | 10 | — |
| contract-canonical-fields | F1 | Kleister-NDA | 10 | — |
| contract-clause-enumeration | F1 | CUAD | 10 | — |
| contract-key-fields | F1 | CUAD | 10 | — |
| contract-long-doc-fact | F1 | Kleister-Charity | 10 | — |
| contract-ma-deal-points | F1 | MAUD | 10 | — |
| doc-qa-abstractive | ANLS | DUDE | 10 | 10 |
| doc-qa-extractive | ANLS | DUDE + MP-DocVQA | 10 | 10 |
| doc-qa-list | ANLS | DUDE | 10 | 10 |
| doc-qa-refusal | NA-acc | DUDE | 10 | 10 |
| emotion-classification | F1 | GoEmotions | 10 | — |
| form-receipt-extraction | F1 | CORD-v2 + FUNSD + SROIE | 10 | 10 |
| function-completion | pass@1 | HumanEval+ | 10 | — |
| hate-offensive | F1 | Davidson | 10 | — |
| intent-classification | F1 | Banking77 | 10 | — |
| intent-in-scope | F1 | CLINC150 | 10 | — |
| intent-multilingual | F1 | MASSIVE | 10 | — |
| invoice-extraction | F1 | katanaml + FATURA2 | 10 | 10 |
| phi-redaction | F1 | MTSamples | 10 | — |
| pii-detection | F1 | ai4privacy | 10 | — |
| text-to-api | Syntax Match Accuracy | BFCL v3 | 10 | — |
| text-to-sql | Execution Accuracy | BIRD-SQL | 10 | — |
| toxic-binary | F1 | toxic-chat | 10 | — |
| toxic-content-multilabel | F1 | Jigsaw | 10 | — |
| unknown-intent | AUROC | CLINC150 OOS | 10 | — |
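The Items and Docs columns above can be checked against a local download with a short script. The sketch below assumes the layout described at the top (one subfolder per task, each with an items.jsonl and, for document tasks, a flat documents/ folder); `summarize` is a hypothetical helper name, and a task without a documents/ folder is reported with a document count of 0.

```python
from pathlib import Path
from typing import Dict, Tuple, Union


def summarize(root: Union[str, Path]) -> Dict[str, Tuple[int, int]]:
    """Map each task folder name to (item count, document count).

    Assumptions: one subfolder per task under `root`, items stored as
    JSON Lines in items.jsonl, documents stored as files in documents/.
    """
    summary: Dict[str, Tuple[int, int]] = {}
    for folder in sorted(Path(root).iterdir()):
        items_file = folder / "items.jsonl"
        if not folder.is_dir() or not items_file.is_file():
            continue  # skip anything that is not a task folder
        with items_file.open(encoding="utf-8") as fh:
            n_items = sum(1 for line in fh if line.strip())
        docs_dir = folder / "documents"
        n_docs = (
            sum(1 for p in docs_dir.iterdir() if p.is_file())
            if docs_dir.is_dir()
            else 0
        )
        summary[folder.name] = (n_items, n_docs)
    return summary
```

For a complete copy of the example sets, every entry returned by `summarize` would be expected to match the corresponding table row, for example 10 items and 10 documents for doc-qa-list.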
Generated by scripts/build-example-datasets.py in the Pareta repo.