A research-grade evaluation framework that compares base Mistral 7B with a LoRA fine-tuned medical model on 20 medical QA questions across 5 clinical domains.
- Base Mistral 7B avg ROUGE: 0.1971
- Fine-tuned LoRA avg ROUGE: 0.2321
- Overall improvement: +17.8% (relative to the base model's average)
- Win rate: fine-tuned model wins 13/20 questions (65%)
- Pharmacology: Base 0.2280 → Fine-tuned 0.2764 (+0.0484)
- Pathophysiology: Base 0.1229 → Fine-tuned 0.1816 (+0.0587)
- Anatomy & Physiology: Base 0.1749 → Fine-tuned 0.2271 (+0.0522)
- Treatment: Base 0.2727 → Fine-tuned 0.2823 (+0.0096)
- Symptoms & Diagnosis: Base 0.1868 → Fine-tuned 0.1928 (+0.0060)
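
The relative figures quoted in this README follow directly from the reported averages. As a minimal sketch (the dictionary layout and names are illustrative, not taken from the notebook), the arithmetic can be reproduced like this:

```python
# Reported average ROUGE scores, copied from the results above.
# The data layout and function names are illustrative only.
overall = {"base": 0.1971, "fine_tuned": 0.2321}
domains = {
    "Pharmacology": (0.2280, 0.2764),
    "Pathophysiology": (0.1229, 0.1816),
    "Anatomy & Physiology": (0.1749, 0.2271),
    "Treatment": (0.2727, 0.2823),
    "Symptoms & Diagnosis": (0.1868, 0.1928),
}


def relative_gain(base: float, tuned: float) -> float:
    """Relative improvement of tuned over base, in percent."""
    return (tuned - base) / base * 100


overall_gain = relative_gain(overall["base"], overall["fine_tuned"])
win_rate = 13 * 100 / 20  # fine-tuned wins 13 of 20 questions

# Relative gain per domain, plus the domain with the largest gain.
domain_gains = {name: relative_gain(b, t) for name, (b, t) in domains.items()}
best_domain = max(domain_gains, key=domain_gains.get)

print(f"Overall relative improvement: {overall_gain:+.1f}%")
print(f"Win rate: {win_rate:.0f}%")
for name, gain in sorted(domain_gains.items(), key=lambda kv: -kv[1]):
    base, tuned = domains[name]
    print(f"{name:<22} {tuned - base:+.4f} absolute, {gain:+.1f}% relative")
```

Looking at relative rather than absolute gains makes the spread between domains easier to see: Treatment and Symptoms & Diagnosis both improve by only a few percent, far less than the other three domains.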
- llm_evaluation_framework.ipynb — Full evaluation notebook
- eval_results.csv — Per-question results for both models
- eval_summary.json — Aggregated results summary
- evaluation_dashboard.html — Interactive Plotly dashboard
- research_analysis.md — Full research write-up
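- Gains in all 5 domains: the fine-tuned model has the higher average ROUGE in every category, though the margins for Treatment (+0.0096) and Symptoms & Diagnosis (+0.0060) are small
- Pathophysiology strongest: +0.0587 absolute, a +47.8% relative improvement
- ROUGE limitations revealed: the fine-tuned model does not win on 7 of 20 questions (35%), which points to the limits of lexical-overlap metrics
- LoRA efficiency: only 0.36% of parameters trained, yet a +17.8% improvement in average ROUGE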
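
For context on the parameter-efficiency point: a LoRA adapter on a weight matrix of shape d_out x d_in with rank r adds r * (d_in + d_out) trainable parameters. The sketch below applies that formula to the attention projections of Mistral 7B (hidden size 4096, key/value projection size 1024, 32 layers). The rank of 16 and the choice of target modules are assumptions for illustration. This README does not state the adapter configuration, and the resulting percentage also depends on how the base model's parameters are counted.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns A (rank x d_in) and B (d_out x rank), so the count is
    rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


# Mistral 7B attention projection shapes as (d_in, d_out).
ATTENTION_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
NUM_LAYERS = 32
ASSUMED_RANK = 16  # hypothetical: the adapter's real rank is not stated here


def adapter_size(shapes: dict, num_layers: int, rank: int) -> int:
    """Total LoRA parameters when every listed module is adapted in every layer."""
    per_layer = sum(lora_params(d_in, d_out, rank) for d_in, d_out in shapes.values())
    return per_layer * num_layers


def trainable_percent(trainable: int, total: int) -> float:
    """Share of trainable parameters, in percent, for a caller-supplied total."""
    return 100 * trainable / total


trainable = adapter_size(ATTENTION_SHAPES, NUM_LAYERS, ASSUMED_RANK)
print(f"Adapter parameters under these assumptions: {trainable:,}")
```

With PEFT, calling `print_trainable_parameters()` on the wrapped model reports the actual trainable count and percentage for a given adapter, which is the more reliable way to check the figure.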
- Models: Mistral-7B-Instruct-v0.3 + LoRA adapter
- Evaluation: ROUGE-1, ROUGE-2, ROUGE-L
- Visualization: Plotly interactive dashboard
- Framework: HuggingFace Transformers + PEFT
- Hardware: T4 GPU — Google Colab Free Tier
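
To make the ROUGE metrics listed above concrete, here is a simplified pure-Python sketch of ROUGE-N and ROUGE-L F1 scores. It uses lowercase whitespace tokenization and no stemming, so its numbers will generally differ from those of a standard ROUGE package, and it is not the notebook's implementation. All function names and the example sentences are illustrative.

```python
from collections import Counter


def _tokens(text: str) -> list:
    """Lowercase whitespace tokenization (a simplification)."""
    return text.lower().split()


def _f1(overlap: int, n_pred: int, n_ref: int) -> float:
    """F1 from an overlap count and the prediction/reference sizes."""
    if overlap == 0 or n_pred == 0 or n_ref == 0:
        return 0.0
    precision = overlap / n_pred
    recall = overlap / n_ref
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction: str, reference: str, n: int = 1) -> float:
    """ROUGE-N F1: clipped n-gram overlap between prediction and reference."""
    def ngrams(toks):
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    pred, ref = ngrams(_tokens(prediction)), ngrams(_tokens(reference))
    overlap = sum((pred & ref).values())  # Counter & keeps the minimum counts
    return _f1(overlap, sum(pred.values()), sum(ref.values()))


def rouge_l(prediction: str, reference: str) -> float:
    """ROUGE-L F1 based on the longest common subsequence (LCS) of tokens."""
    pred, ref = _tokens(prediction), _tokens(reference)
    prev = [0] * (len(ref) + 1)  # LCS lengths for the previous prediction token
    for tok in pred:
        curr = [0]
        for j, ref_tok in enumerate(ref, start=1):
            if tok == ref_tok:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return _f1(prev[-1], len(pred), len(ref))


# Toy sentences (not from the evaluation set): a paraphrase with no shared
# words gets no credit, however accurate it may be.
reference = "insulin lowers blood glucose"
paraphrase = "the hormone reduces sugar levels in the bloodstream"
print("identical  ROUGE-1:", rouge_n(reference, reference, n=1))
print("paraphrase ROUGE-1:", rouge_n(paraphrase, reference, n=1))
print("paraphrase ROUGE-L:", rouge_l(paraphrase, reference))
```

Because the score depends only on shared tokens, an answer that is correct but worded differently from the reference can score lower than a less accurate answer that reuses the reference's vocabulary.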
- Fine-tuned Model: https://huggingface.co/samurvivor-07/medical-mistral-lora
- Project 3 Medical QA LoRA: https://github.com/Boatengs/medical-qa-lora
- Project 2 SPORTZBOT RAG: https://github.com/Boatengs/sports-rag-chatbot-
- Project 1 Sentiment Analyzer: https://github.com/Boatengs/sentiment-analyzer
- HuggingFace Space: https://huggingface.co/spaces/samurvivor-07/llm-evaluation-framework