This project is a Bioinformatics and Machine Learning application developed to classify cancer types using gene expression microarray data.
The system predicts whether a sample belongs to:
- Breast Cancer
- Liver Cancer
- Lung Cancer
- Skin Cancer
- No Cancer (none of the four cancer types above)
An ensemble learning approach combining SVM, Random Forest, KNN, and XGBoost models is used to improve prediction performance.
Live Application:
The gene expression datasets used in this project were obtained from the NCBI Gene Expression Omnibus (GEO) database.
https://www.ncbi.nlm.nih.gov/geo/
| Cancer Type | GEO Accession | Dataset Link |
|---|---|---|
| Skin Cancer (Melanoma / Nevus) | GSE3189 | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE3189 |
| Liver Cancer (HCC) | GSE14520 | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE14520 |
| Breast Cancer | GSE15852 | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE15852 |
| Lung Cancer | GSE19188 | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE19188 |
| Colon Cancer (Exploratory Dataset) | GSE44076 | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE44076 |
After merging and preprocessing:
- Total Samples: 757
- Original Gene Features: 22,268
- Final Selected Features: 1,268
- Cancer Classes:
  - Breast Cancer
  - Liver Cancer
  - Lung Cancer
  - Skin Cancer
  - No Cancer
The final dataset was created by combining the GEO datasets, standardizing gene expression values, and applying feature selection using Variance Thresholding, Mutual Information, and LASSO.
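As a rough illustration of the merging step, the sketch below stacks two toy expression tables and keeps only the probes they share. The frame names, probe IDs, and the `label` column are hypothetical, and the inner join is an assumption about how the datasets were combined, not a description of the project's actual code.

```python
import pandas as pd

# Hypothetical mini-datasets: rows are samples, columns are probe IDs.
breast = pd.DataFrame(
    {"P1": [1.0, 2.0], "P2": [3.0, 4.0], "P3": [5.0, 6.0]},
    index=["b1", "b2"],
)
liver = pd.DataFrame(
    {"P2": [7.0], "P3": [8.0], "P4": [9.0]},
    index=["l1"],
)

# Attach a class label to each dataset before stacking (assumed column name).
breast["label"] = "Breast Cancer"
liver["label"] = "Liver Cancer"

# join="inner" keeps only the columns present in every frame,
# so probes measured in just one dataset (P1, P4) are dropped.
merged = pd.concat([breast, liver], axis=0, join="inner")
print(merged)
```

Restricting to shared probes avoids introducing missing values, at the cost of discarding genes that are not measured in every dataset.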
| Item | Value |
|---|---|
| Total Samples | 757 |
| Original Gene Features | 22,268 |
| Metadata Columns | 5 |
| Final Selected Features | 1,268 |
- Breast Cancer
- Liver Cancer (HCC)
- Lung Cancer (ADC, SCC, LCC)
- Skin Cancer (Melanoma, Nevus)
- Healthy / Normal Samples
Removed low-variance genes that contribute little to classification.
| Threshold | Features Remaining |
|---|---|
| 1000 | 21,440 |
| 5000 | 19,192 |
| 10000 | 17,626 |
Selected Threshold = 5000
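A minimal sketch of variance-based filtering, assuming scikit-learn's `VarianceThreshold` was the tool used (the source names the technique but not the implementation). The data is synthetic. The filter has to run before standardization, because every gene has variance 1 once it has been standardized.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)

# Synthetic stand-in for an expression matrix: 50 samples x 6 genes.
# The first three genes vary widely; the last three are nearly constant.
scales = np.array([300.0, 200.0, 150.0, 5.0, 2.0, 1.0])
X = rng.normal(loc=1000.0, scale=scales, size=(50, 6))

# Keep only genes whose variance across samples exceeds the threshold.
selector = VarianceThreshold(threshold=5000.0)
X_reduced = selector.fit_transform(X)
kept = selector.get_support(indices=True)

print("variances:", np.round(selector.variances_, 1))
print("kept gene indices:", kept)
```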
Gene expression values were normalized using StandardScaler:
- Mean = 0
- Standard Deviation = 1
This puts all genes on a comparable scale, so that highly expressed genes do not dominate training.
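The sketch below shows one common way to apply `StandardScaler`: fit it on the training split only, then reuse the fitted scaler for new samples. The train/test split is an assumption. The source only confirms that a fitted scaler is saved as `scaler.pkl`, which is consistent with reusing it at prediction time.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic expression values: 100 samples x 4 genes, with toy binary labels.
X = rng.normal(loc=500.0, scale=120.0, size=(100, 4))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Learn per-gene mean and standard deviation from the training data only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same statistics to unseen samples; do not refit.
X_test_scaled = scaler.transform(X_test)

print("train mean per gene:", np.round(X_train_scaled.mean(axis=0), 6))
print("train std per gene:", np.round(X_train_scaled.std(axis=0), 6))
```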
Selected the most informative genes, ranked by their mutual information with the class label.
| Top Features Selected |
|---|
| 5,000 |
| 10,000 |
| 15,000 |
Selected: Top 10,000 genes
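A small sketch of top-k selection by mutual information, assuming scikit-learn's `SelectKBest` with `mutual_info_classif` (the source does not name the functions used). The data is synthetic, with two informative columns followed by eight noise columns, and `k=2` stands in for the 10,000 used in the project.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)

# Three classes, 40 samples each.
y = np.repeat([0, 1, 2], 40)

# Two genes whose level depends on the class, plus eight pure-noise genes.
informative = y[:, None] * 3.0 + rng.normal(scale=0.5, size=(120, 2))
noise = rng.normal(size=(120, 8))
X = np.hstack([informative, noise])

# Fix random_state so the mutual information estimate is reproducible.
score_func = partial(mutual_info_classif, random_state=0)

selector = SelectKBest(score_func=score_func, k=2)
X_top = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)

print("scores:", np.round(selector.scores_, 3))
print("kept gene indices:", kept)
```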
Further reduced dimensionality using L1 Regularization.
| C Value | Features Selected |
|---|---|
| 0.1 | 60 |
| 0.3 | 129 |
| 0.5 | 189 |
| 1.0 | 457 |
| 2.0 | 1,268 |
| 5.0 | 4,124 |
| 10.0 | 6,405 |
Final Selection:
- C = 2.0
- Features = 1,268
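The table above shows the number of retained features growing with `C`, which is the inverse regularization strength: a smaller `C` means a stronger L1 penalty and more coefficients driven to zero. The sketch below illustrates this with `SelectFromModel` wrapped around an L1-penalized `LinearSVC`. The choice of `LinearSVC` is an assumption. The source states only that L1 regularization was used, and an L1-penalized logistic regression would follow the same pattern.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic data: 3 informative genes followed by 27 noise genes.
y = np.repeat([0, 1, 2], 40)
informative = y[:, None] * 2.0 + rng.normal(scale=0.5, size=(120, 3))
noise = rng.normal(size=(120, 27))
X = np.hstack([informative, noise])

# Standardize so the L1 penalty treats all genes on the same scale.
X = (X - X.mean(axis=0)) / X.std(axis=0)


def count_selected(C):
    """Fit an L1-penalized linear model and count genes with non-zero weight."""
    model = LinearSVC(penalty="l1", dual=False, C=C, max_iter=10000, random_state=0)
    selector = SelectFromModel(model).fit(X, y)
    return int(selector.get_support().sum())


for C in (0.1, 0.5, 2.0):
    print(f"C={C}: {count_selected(C)} features selected")
```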
Best Parameters (SVM):
- Kernel = Linear
- C = 100
- Gamma = Scale (has no effect with the linear kernel)
Results:
- Accuracy = 91.45%
- F1 Score = 91.81%
Best Parameters (Random Forest):
- n_estimators = 200
- max_depth = 20
- min_samples_split = 2
- min_samples_leaf = 1
Results:
- Accuracy = 92.76%
- F1 Score = 92.95%
Best Parameters (KNN):
- n_neighbors = 9
- weights = distance
- p = 2
Results:
- Accuracy = 94.08%
- F1 Score = 94.13%
Best Parameters (XGBoost):
- n_estimators = 300
- max_depth = 6
- learning_rate = 0.3
- subsample = 0.8
- colsample_bytree = 0.8
Results:
- Accuracy = 93.42%
- F1 Score = 93.37%
Soft Voting Ensemble combining:
- SVM
- Random Forest
- KNN
- XGBoost
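Soft voting averages the class probabilities predicted by each model and picks the class with the highest mean. The sketch below builds such an ensemble with scikit-learn's `VotingClassifier`, using the hyperparameters reported above on synthetic data. To keep the example dependent on scikit-learn only, `GradientBoostingClassifier` stands in for XGBoost; this substitution is an illustration choice, not what the project uses. `SVC` needs `probability=True` so that it can contribute probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 5-class problem standing in for the gene expression data.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, class_sep=2.0, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Standardize using statistics from the training split only.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC(kernel="linear", C=100, gamma="scale",
                    probability=True, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=200, max_depth=20,
                                      min_samples_split=2, min_samples_leaf=1,
                                      random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=9, weights="distance", p=2)),
        # Stand-in for XGBoost (assumption made to avoid an extra dependency).
        ("gb", GradientBoostingClassifier(n_estimators=300, max_depth=6,
                                          learning_rate=0.3, subsample=0.8,
                                          random_state=0)),
    ],
    voting="soft",  # average predicted probabilities across models
)
ensemble.fit(X_train_s, y_train)

proba = ensemble.predict_proba(X_test_s)
print("test accuracy:", ensemble.score(X_test_s, y_test))
```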
| Metric | Score |
|---|---|
| Accuracy | 95.39% |
| Precision | 95.55% |
| Recall | 95.39% |
| F1 Score | 95.44% |
Best Model: Voting Ensemble
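The reported recall equals the reported accuracy (95.39%), which is what class-weighted averaging produces: weighted recall is always identical to accuracy. The sketch below computes the four metrics that way on toy labels. Weighted averaging is an assumption here, since the source does not state which averaging mode was used.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy labels for illustration only; not the project's predictions.
y_true = ["Breast", "Breast", "Liver", "Liver", "Lung",
          "Lung", "Skin", "Skin", "No Cancer", "No Cancer"]
y_pred = ["Breast", "Liver", "Liver", "Liver", "Lung",
          "Lung", "Skin", "No Cancer", "No Cancer", "No Cancer"]

accuracy = accuracy_score(y_true, y_pred)

# average="weighted" weights each class's score by its number of true samples.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```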
A Flask-based web application was developed for deployment.
Features:
- Upload gene expression CSV file
- Automatic preprocessing
- Cancer type prediction
- Download sample input file
- Interactive prediction results page
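One part of the automatic preprocessing is making an uploaded CSV match the feature order the model was trained on. The helper below is a hypothetical sketch of that step using pandas. The function name, the gene column names, and the error handling are all assumptions and are not taken from `app.py`.

```python
import io

import pandas as pd


def align_features(uploaded, expected_columns):
    """Return the uploaded table restricted to, and ordered by, expected_columns."""
    missing = [c for c in expected_columns if c not in uploaded.columns]
    if missing:
        raise ValueError(f"Missing gene columns: {missing}")

    # Drop extra columns, enforce the training order, and coerce to numbers.
    aligned = uploaded.loc[:, expected_columns].apply(pd.to_numeric, errors="coerce")
    if aligned.isna().any().any():
        raise ValueError("Uploaded file contains empty or non-numeric values")
    return aligned


# Hypothetical upload: columns out of order, plus one unrelated column.
csv_text = "GENE_B,GENE_A,notes\n2.0,1.0,x\n4.0,3.0,y\n"
uploaded = pd.read_csv(io.StringIO(csv_text))

aligned = align_features(uploaded, ["GENE_A", "GENE_B"])
print(aligned)
```

The aligned frame could then be passed to the saved scaler and model, for example ones loaded with `joblib.load` from `final_models/`.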
- Python
- Flask
- Pandas
- NumPy
- Scikit-Learn
- XGBoost
- Joblib
- Bootstrap 5
- HTML/CSS
Bio/
│
├── app.py
├── requirements.txt
├── merged_LASSO_c_2.csv
│
├── final_models/
│   ├── best_model.pkl
│   ├── scaler.pkl
│   └── label_encoder.pkl
│
├── templates/
│   ├── index.html
│   └── result.html
│
├── sample_input.csv
└── README.md
Developed as part of a Bioinformatics Machine Learning project for cancer classification using gene expression data.