Large Language Models (LLMs) are increasingly deployed in high-impact domains such as healthcare, legal analysis, governance, and enterprise decision-making. While these models generate fluent responses, they often produce hallucinations: confident but unsupported or incorrect claims.
The LLM Reliability Engine is a system designed to audit, evaluate, and score the trustworthiness of LLM-generated responses using retrieval-based evidence verification and explainable scoring mechanisms.
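Unlike traditional retrieval-augmented generation (RAG) systems, which use retrieval to support generation, this project focuses on post-hoc evaluation: it assesses the reliability of a response after the response has been generated.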
- Claim-level verification of LLM responses
- Evidence retrieval using vector search (FAISS)
- Semantic alignment scoring between claims and evidence
- Explainable confidence score and hallucination risk labeling
- Structured, machine-readable reliability reports
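To make the last two features concrete, here is a minimal sketch of a machine-readable report with a risk label derived from the confidence score. The field names mirror the sample `/analyze` response in this README; the `ReliabilityReport` class, the `risk_label` helper, and both thresholds are hypothetical and chosen only for illustration, not taken from the project's code.

```python
import json
from dataclasses import asdict, dataclass, field


def risk_label(confidence: float,
               low_risk_min: float = 0.85,
               medium_risk_min: float = 0.5) -> str:
    """Map a confidence score in [0, 1] to a hallucination risk label.

    Both thresholds are illustrative assumptions, not the project's values.
    """
    if confidence >= low_risk_min:
        return "LOW"
    if confidence >= medium_risk_min:
        return "MEDIUM"
    return "HIGH"


@dataclass
class ReliabilityReport:
    """Hypothetical container for the fields shown in the sample response."""
    confidence_score: float
    unsupported_claims: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        # Derive the label at serialization time so it cannot drift
        # out of sync with the score.
        payload = asdict(self)
        payload["hallucination_risk"] = risk_label(self.confidence_score)
        return json.dumps(payload, indent=2)


report = ReliabilityReport(0.78, ["Air pollution causes all forms of cancer"])
print(report.to_json())
```

Deriving the label from the score, rather than storing it separately, keeps the two consistent and leaves the thresholds as the single place to tune per domain.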
- User Query: Input from the user.
- LLM Response Generator: The model generates a response.
- Evidence Retrieval: Relevant documents are fetched from the vector database.
- Claim Decomposition: The response is split into individual claims.
- Claim-Evidence Matching: Claims are compared against evidence.
- Reliability Aggregation: Scores are combined.
- Reliability Report: Final JSON output.
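The stages above can be sketched end to end with the standard library alone. This is a simplified illustration under stated assumptions, not the project's implementation: bag-of-words cosine similarity stands in for SentenceTransformers embeddings and FAISS search, the response is passed in directly instead of being generated by an LLM, and every function name and the `0.5` support threshold are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for an embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Evidence retrieval: the k documents most similar to the query."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: cosine(q, tokenize(d)), reverse=True)
    return ranked[:k]


def decompose(response: str) -> list[str]:
    """Claim decomposition: naive split on sentence-ending punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]


def evaluate(response: str, evidence: list[str], threshold: float = 0.5) -> dict:
    """Claim-evidence matching followed by a simple mean aggregation."""
    scored = []
    for claim in decompose(response):
        vec = tokenize(claim)
        best = max((cosine(vec, tokenize(e)) for e in evidence), default=0.0)
        scored.append((claim, best))
    confidence = sum(s for _, s in scored) / len(scored) if scored else 0.0
    return {
        "confidence_score": round(confidence, 2),
        "unsupported_claims": [c for c, s in scored if s < threshold],
    }


documents = [
    "Air pollution increases the risk of asthma and heart disease.",
    "Regular exercise improves cardiovascular health.",
    "Fine particulate matter can damage lung tissue.",
]
query = "What are the health impacts of air pollution?"
response = ("Air pollution increases the risk of asthma. "
            "Air pollution causes all forms of cancer.")

evidence = retrieve(query, documents)
result = evaluate(response, evidence)
print(result)
```

In this toy run the first claim closely matches the retrieved evidence while the cancer claim does not, so only the latter is reported as unsupported. Taking the best-matching evidence per claim and averaging across claims is one of several plausible aggregation choices.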
- Language: Python
- API Framework: FastAPI
- LLM Integration: OpenAI / Gemini
- Embeddings: SentenceTransformers
- Vector Store: FAISS
- ML Utilities: scikit-learn
- Frameworks: LangChain (minimal usage)
git clone https://github.com/maybemnv/llm-reliability-engine.git
cd llm-reliability-engine
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
Place PDF or text files inside data/knowledge_base/.
uvicorn src.api.main:app --reload
POST /analyze
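Once the server is running, the endpoint can be called from any HTTP client. The sketch below uses only the Python standard library and assumes the server listens on uvicorn's default address, `127.0.0.1:8000`. The helper names `build_analyze_request` and `analyze` are hypothetical; the `query` field is the one used in this README's sample request.

```python
import json
import urllib.request

# Assumption: the server was started with the uvicorn command above and no
# --host/--port flags, so it listens on uvicorn's default 127.0.0.1:8000.
BASE_URL = "http://127.0.0.1:8000"


def build_analyze_request(query: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /analyze request with a JSON body."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(query: str) -> dict:
    """Send the request and decode the JSON reliability report."""
    with urllib.request.urlopen(build_analyze_request(query), timeout=60) as resp:
        return json.load(resp)


# Building the request needs no running server; analyze() does.
request = build_analyze_request("What are the health impacts of air pollution?")
print(request.get_method(), request.full_url)
```

Separating request construction from sending makes the payload easy to inspect or test without a live server.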
Request:
{
  "query": "What are the health impacts of air pollution?"
}
Response:
{
  "confidence_score": 0.78,
  "hallucination_risk": "MEDIUM",
  "unsupported_claims": ["Air pollution causes all forms of cancer"]
}
- Semantic similarity does not guarantee factual correctness.
- Dependent on quality and coverage of the knowledge base.
- No real-time web verification.
- Sentence-level claim extraction may miss complex logic.
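Two of these limitations are easy to reproduce in a toy setting. The sketch below uses bag-of-words cosine similarity as a stand-in for embedding similarity (an assumption; the project uses SentenceTransformers, and how strongly a given embedding model shows the same effect will vary) together with a naive sentence splitter. The helper names are hypothetical.

```python
import math
import re
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity over lowercased word counts."""
    va = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    vb = Counter(re.findall(r"[a-z0-9]+", b.lower()))
    dot = sum(n * vb[t] for t, n in va.items())
    norm = (math.sqrt(sum(n * n for n in va.values()))
            * math.sqrt(sum(n * n for n in vb.values())))
    return dot / norm if norm else 0.0


def naive_claims(response: str) -> list[str]:
    """Split on sentence-ending punctuation, one claim per sentence."""
    return [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]


# 1. Similarity is not correctness: a claim that contradicts the evidence
#    still scores high because it shares almost every word with it.
evidence = "Air pollution does damage the lungs."
contradicted = "Air pollution does not damage the lungs."
print(round(bow_cosine(contradicted, evidence), 2))  # 0.93

# 2. Sentence-level extraction loses context: the second claim refers to
#    "This effect", which has no meaning once the sentences are separated.
claims = naive_claims(
    "Air pollution raises asthma risk. This effect is stronger in children."
)
print(claims[1])  # This effect is stronger in children.
```

Cases like the first one are why a high similarity score is better read as "related to the evidence" than as "entailed by the evidence".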
- Cross-claim logical consistency checks.
- Cross-encoder reranking for stronger verification.
- Domain-specific reliability calibration.
- Monitoring dashboards for enterprise usage.
Manav Kaushal
- GitHub: https://github.com/maybemnv
- LinkedIn: https://linkedin.com/in/maybmnv
This project is released for educational and research purposes.