I'm a Data Scientist with a PhD and deep expertise in machine learning, predictive modelling, and biological sequence data. I build end-to-end ML pipelines, from raw data and feature engineering through to deployed, production-ready applications. I have a proven track record of turning complex, large-scale datasets into insights that shape strategic decision-making.
My domain is infectious disease: I develop models that turn genetic sequence data into actionable predictions, with real-world impact at scale. My work has been published in Nature Communications and adopted by the World Health Organization platform at GISAID.
- Predictive ML and NN models on high-dimensional biological data (ensemble methods, AdaBoost, Random Forest, XGBoost, MLP, CNN)
- End-to-end pipelines — data ingestion, feature engineering, model training, evaluation, and deployment
- Production web applications using Streamlit, deployed on cloud platforms with CI/CD
- Interpretable AI — feature importance, permutation importance, SHAP-style analysis, identifying what drives model predictions
SAP_H3N2_ML — Influenza Antigenic Prediction Model
An AdaBoost regression and classification model trained on genetic sequence data to predict antigenic properties of influenza viruses. Trained on historical seasonal data; evaluated prospectively on future seasons.
- Average AUROC of 0.92 across 14 held-out test seasons
- Nonlinear feature mapping from sequence mutations to phenotypic outcomes
- Adopted by the WHO for influenza surveillance and vaccine strain selection
- Published: Nature Communications, 2024
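To make the evaluation protocol concrete, here is a minimal sketch of prospective, season-by-season scoring. It is not the published pipeline: the helper names `auroc` and `prospective_splits` and the toy records are made up for illustration, and the model-fitting step is left as a comment so the example stays dependency-free.

```python
from statistics import mean

def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def prospective_splits(seasons, min_train_seasons=1):
    """Yield (test_season, train_idx, test_idx); training uses strictly earlier seasons."""
    ordered = sorted(set(seasons))
    for season in ordered[min_train_seasons:]:
        train_idx = [i for i, s in enumerate(seasons) if s < season]
        test_idx = [i for i, s in enumerate(seasons) if s == season]
        yield season, train_idx, test_idx

# Toy records (hypothetical): (season, true label, model score)
records = [
    (2019, 1, 0.9), (2019, 0, 0.2),
    (2020, 1, 0.8), (2020, 0, 0.3), (2020, 1, 0.4), (2020, 0, 0.6),
    (2021, 1, 0.7), (2021, 0, 0.1),
]
seasons = [r[0] for r in records]
labels = [r[1] for r in records]
scores = [r[2] for r in records]

per_season = {}
for season, train_idx, test_idx in prospective_splits(seasons):
    # A real pipeline would fit the model on train_idx here; the toy scores are fixed.
    per_season[season] = auroc(
        [labels[i] for i in test_idx],
        [scores[i] for i in test_idx],
    )

average_auroc = mean(list(per_season.values()))
print(per_season, average_auroc)
```

The key design point is that each test season is scored by a model that has only seen earlier seasons, so the averaged AUROC reflects how the model would have performed if used in real time.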
SAP_H3N2_ML_webapp — Deployed Prediction App
A production Streamlit application serving the AdaBoost model to end users — interactive single-sample prediction and CSV batch processing with downloadable results.
- 🌐 Live: Streamlit Cloud · Hugging Face Spaces
- Input validation, sequence encoding, and metadata handling baked in
- CI/CD deployment via GitHub Actions
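As a rough idea of what input validation and sequence encoding can look like before a sequence reaches the model, here is a small sketch. It assumes amino-acid input and a substitution-list encoding relative to a reference; the function names, the residue alphabet check, and the example sequences are hypothetical and not taken from the app's code.

```python
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def validate_sequence(seq, expected_length=None):
    """Return a cleaned, upper-case sequence or raise ValueError."""
    cleaned = seq.strip().upper()
    if not cleaned:
        raise ValueError("sequence is empty")
    invalid = sorted(set(cleaned) - VALID_AMINO_ACIDS)
    if invalid:
        raise ValueError(f"invalid residues: {''.join(invalid)}")
    if expected_length is not None and len(cleaned) != expected_length:
        raise ValueError(f"expected {expected_length} residues, got {len(cleaned)}")
    return cleaned

def substitutions(reference, query):
    """List point substitutions such as 'S8N' using 1-based positions."""
    if len(reference) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    return [
        f"{r}{i}{q}"
        for i, (r, q) in enumerate(zip(reference, query), start=1)
        if r != q
    ]

# Made-up 10-residue example; real inputs would be full-length aligned sequences.
reference = validate_sequence("MKTIIALSYI")
query = validate_sequence(" mktiialnyi\n", expected_length=len(reference))
subs = substitutions(reference, query)
print(subs)
```

Rejecting malformed input up front keeps batch CSV runs from failing midway and gives users a specific error message for the offending row.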


