I work on AI training data and financial agent evaluation, focusing on LLM data quality, trajectory-aware evaluation, annotation systems, preference data, synthetic data, and data governance.
My public work is intentionally centered on resources that can be reviewed, reused, and improved without relying on private company data or proprietary workflows.
- Financial agent evaluation: search, exact data lookup, filing QA, toy backtesting, forecasting cutoffs, tool-use traces, and compliance-boundary tasks
- 2026 agent evaluation: trajectory-aware grading, repeated-trial metrics, verifier evidence, and process-safety analysis
- Training data quality engineering for LLM systems
- Dataset cleaning, deduplication, inspection, and documentation
- Annotation quality, agreement, adjudication, and reviewer calibration
- Human preference data, RLHF / DPO data, and synthetic data evaluation
- Financial-domain LLM benchmarks, risk-aware evaluation, and data governance
- awesome-llm-training-data - A curated bilingual hub for LLM training data quality and financial agent evaluation, including Harbor workflows, Claw-style trajectory grading, public-data finance task specs, and deterministic verifier templates.
- Maintaining Awesome LLM Training Data & Agent Evaluation, including:
- Tracking upstream documentation proposals for LLM data and agent evaluation workflows:
- huggingface/datatrove#485 - dataset-audit example using filters, rejected-sample capture, metadata, and summary stats.
- argilla-io/argilla#5861 - annotation QA workflow using guidelines, suggestions, filters, and adjudication.
- harbor-framework/harbor#1700 - Claw-style trajectory-aware evaluation pattern with repeated attempts and safety evidence.
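As a rough illustration of what "repeated attempts" can mean in practice, here is a minimal sketch of repeated-trial metrics. It assumes each task is run several times and each attempt is graded pass/fail; the task IDs, function names, and choice of estimators (the standard combinatorial pass@k, plus an "all k attempts succeed" variant) are illustrative assumptions, not taken from the Harbor proposal or the repository.

```python
from math import comb
from statistics import mean


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k attempts passes) from n attempts with c passes."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k attempts failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_all_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k attempts pass) from n attempts with c passes."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


def summarize(trials: dict[str, list[bool]], k: int) -> dict[str, float]:
    """Average per-task metrics over all tasks (hypothetical report layout)."""
    rows = [(len(results), sum(results)) for results in trials.values()]
    return {
        "success_rate": mean(c / n for n, c in rows),
        f"pass@{k}": mean(pass_at_k(n, c, k) for n, c in rows),
        f"pass^{k}": mean(pass_all_k(n, c, k) for n, c in rows),
    }


# Toy data: two hypothetical tasks, four graded attempts each.
trials = {
    "filing_qa_001": [True, True, False, True],
    "data_lookup_002": [False, False, True, False],
}
report = summarize(trials, k=2)
print(report)
```

For the toy data above, the averages come out to a success rate of 0.5, pass@2 of 0.75, and pass^2 of 0.25. The gap between the last two numbers is the point of running repeated trials: an agent that sometimes succeeds can still be unreliable when every attempt has to be correct.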
- Prefer primary sources, reproducible resources, and practical engineering value.
- Avoid private company data, real user data, and proprietary workflows.
- Treat financial-domain AI evaluation as a governance problem, not a leaderboard exercise.
- Make data quality work visible through documentation, checklists, issues, and small useful contributions.
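I focus on the engineering of AI training data and financial agent evaluation. Key areas include LLM data quality, trajectory-aware evaluation, annotation systems, preference data, synthetic data, and data governance, as well as evaluation of financial search, data lookup, filing QA, backtesting, forecasting, and compliance boundaries.
Wherever possible, my public projects use public materials that can be reviewed, reused, and continuously improved. They contain no private company data, real user data, or proprietary workflows.
I currently maintain Awesome LLM Training Data & Agent Evaluation, and I am gradually building up a financial agent evaluation topic framework, a roadmap, public-data task specs, Harbor-style task templates, deterministic verifiers, Claw-style trajectory evaluation notes, and repeated-run metric examples.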