Pinned
- SecurityLab-UCD/ContractBench (Public)
  ContractBench: evaluating observation contract failures (validity + integrity) in LLM agents. 33 harbor-runnable API-contract tasks with deterministic programmatic evaluation.
Python 1
- SecurityLab-UCD/FuzzAug (Public)
  [EMNLP'25] FuzzAug: Data Augmentation by Coverage-guided Fuzzing for Neural Test Generation
- harbor-framework/harbor (Public)
  Framework for evaluating and improving agents
- radixark/miles (Public)
  Miles is an enterprise-facing reinforcement learning framework for LLM and VLM post-training, forked from and co-evolving with slime.
- SecurityLab-UCD/UniTSyn (Public)
  [ISSTA'24] A Large-Scale Dataset Capable of Enhancing the Prowess of Large Language Models for Program Testing
- UKGovernmentBEIS/inspect_evals (Public)
  Collection of evals for Inspect AI



