Pavithrareddy2702/ML-Based-Cancer-Classification-Using-Gene-Expression

ML-Based-Multi-Cancer-Classification-Using-Gene-Expression

Overview

This project is a Bioinformatics and Machine Learning application developed to classify cancer types using gene expression microarray data.

The system predicts whether a sample belongs to:

  • Breast Cancer
  • Liver Cancer
  • Lung Cancer
  • Skin Cancer
  • No Cancer (none of the four cancer types above)

An ensemble learning approach combining SVM, Random Forest, KNN, and XGBoost models is used to improve prediction performance.


Live Demo (deployed on Render)

🔗 Live Application:

Dataset Information

  • Dataset Sources

The gene expression datasets used in this project were obtained from the NCBI Gene Expression Omnibus (GEO) database.

NCBI GEO Repository

https://www.ncbi.nlm.nih.gov/geo/

GEO Datasets Used

| Cancer Type | GEO Accession | Dataset Link |
|---|---|---|
| Skin Cancer (Melanoma / Nevus) | GSE3189 | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE3189 |
| Liver Cancer (HCC) | GSE14520 | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE14520 |
| Breast Cancer | GSE15852 | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE15852 |
| Lung Cancer | GSE19188 | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE19188 |
| Colon Cancer (Exploratory Dataset) | GSE44076 | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE44076 |

Final Dataset

After merging and preprocessing:

  • Total Samples: 757

  • Original Gene Features: 22,268

  • Final Selected Features: 1,268

  • Cancer Classes:

    • Breast Cancer
    • Liver Cancer
    • Lung Cancer
    • Skin Cancer
    • No Cancer

The final dataset was created by combining the GEO datasets, standardizing gene expression values, and applying feature selection using Variance Thresholding, Mutual Information, and LASSO.

Dataset Statistics

| Item | Value |
|---|---|
| Total Samples | 757 |
| Original Gene Features | 22,268 |
| Metadata Columns | 5 |
| Final Selected Features | 1,268 |

Cancer Categories

  • Breast Cancer
  • Liver Cancer (HCC)
  • Lung Cancer (ADC, SCC, LCC)
  • Skin Cancer (Melanoma, Nevus)
  • Healthy / Normal Samples

Data Preprocessing Pipeline

1. Variance Thresholding

Removed low-variance genes that contribute little to classification.

| Variance Threshold | Features Remaining |
|---|---|
| 1000 | 21,440 |
| 5000 | 19,192 |
| 10000 | 17,626 |

Selected Threshold = 5000
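As an illustration of this step, here is a minimal sketch using scikit-learn's VarianceThreshold. The README does not say which implementation was used, so the class choice and the synthetic data are assumptions; only the threshold of 5000 comes from the table above. A threshold that large only makes sense on unscaled intensities, which is consistent with this step running before standard scaling.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)

# Synthetic stand-in for a raw expression matrix: 500 samples x 6 genes.
# Each gene gets a different spread, so the per-gene variances differ widely
# (roughly 100, 1600, 10000, 22500, 25 and 40000).
spreads = np.array([10.0, 40.0, 100.0, 150.0, 5.0, 200.0])
X = rng.normal(loc=1000.0, scale=spreads, size=(500, 6))

# Threshold reported in the README; genes with variance <= threshold are dropped.
selector = VarianceThreshold(threshold=5000.0)
X_reduced = selector.fit_transform(X)

kept = selector.get_support()  # boolean mask over the original gene columns
print("per-gene variance:", np.round(selector.variances_, 1))
print("indices of genes kept:", np.flatnonzero(kept))
print("shape after filtering:", X_reduced.shape)
```

The fitted selector can be reused on new samples with `selector.transform`, so that they keep the same gene columns as the training data.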


2. Standard Scaling

Gene expression values were standardized per gene using StandardScaler:

  • Mean = 0
  • Standard Deviation = 1

This puts all genes on a common scale, so highly expressed genes do not dominate distance- and margin-based models such as KNN and SVM.
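A short sketch of how the scaling step is commonly applied: fit the scaler on the training split only and reuse its statistics for held-out data. The train/test split and the synthetic data are assumptions for illustration; the README does not describe how the data were split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Three synthetic "genes" on very different scales, plus toy binary labels.
X = rng.normal(loc=[5.0, 500.0, 50000.0], scale=[1.0, 100.0, 10000.0], size=(200, 3))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # apply the same statistics to held-out data

print("train means:", X_train_scaled.mean(axis=0).round(6))
print("train stds: ", X_train_scaled.std(axis=0).round(6))
```

Persisting the fitted scaler (the project structure lists a scaler.pkl) lets the same transformation be applied to uploaded samples at prediction time.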


3. Mutual Information Feature Selection

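Ranked genes by their mutual information with the cancer class label and kept the highest-scoring ones.

Candidate values for the number of top features: 5,000, 10,000, and 15,000.

Selected:

Top 10,000 genes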
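One way to implement top-k selection by mutual information is scikit-learn's SelectKBest with mutual_info_classif, sketched below. This pairing is an assumption (the README names only the technique), and the data are synthetic with a small k in place of 10,000.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic 3-class data: 40 features, of which only 5 carry class information.
X, y = make_classification(
    n_samples=300, n_features=40, n_informative=5, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# mutual_info_classif uses a randomized nearest-neighbor estimator,
# so the seed is fixed to make the ranking reproducible.
score_func = partial(mutual_info_classif, random_state=0)

selector = SelectKBest(score_func=score_func, k=10)  # k=10 here; the README reports k=10,000
X_top = selector.fit_transform(X, y)

print("selected column indices:", selector.get_support(indices=True))
print("shape after selection:", X_top.shape)
```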


4. LASSO Feature Selection

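Further reduced dimensionality using L1 regularization. C is the inverse regularization strength, so larger values of C penalize less and retain more genes.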

| C Value | Features Selected |
|---|---|
| 0.1 | 60 |
| 0.3 | 129 |
| 0.5 | 189 |
| 1.0 | 457 |
| 2.0 | 1,268 |
| 5.0 | 4,124 |
| 10.0 | 6,405 |

Final Selection:

  • C = 2.0
  • Features = 1,268
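The relationship between C and the number of retained features can be reproduced on toy data. The sketch below assumes an L1-penalized linear classifier (LinearSVC) wrapped in SelectFromModel; the README does not name the estimator behind its "LASSO" step, so this is one plausible setup rather than the project's actual code.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic 3-class data with 200 features, 10 of them informative.
X, y = make_classification(
    n_samples=300, n_features=200, n_informative=10, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X = StandardScaler().fit_transform(X)


def count_selected(C):
    """Fit an L1-penalized linear model and count features with non-zero weight."""
    l1_model = LinearSVC(penalty="l1", dual=False, C=C, max_iter=5000, random_state=0)
    selector = SelectFromModel(l1_model).fit(X, y)
    return int(selector.get_support().sum())


# Smaller C means a stronger L1 penalty and therefore fewer surviving features.
counts = {C: count_selected(C) for C in (0.01, 0.1, 1.0)}
for C, n_kept in counts.items():
    print(f"C={C}: {n_kept} features kept")
```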

Machine Learning Models

Support Vector Machine (SVM)

Best Parameters:

  • Kernel = Linear
  • C = 100
  • Gamma = Scale (has no effect with a linear kernel)

Results:

  • Accuracy = 91.45%
  • F1 Score = 91.81%

Random Forest

Best Parameters:

  • n_estimators = 200
  • max_depth = 20
  • min_samples_split = 2
  • min_samples_leaf = 1

Results:

  • Accuracy = 92.76%
  • F1 Score = 92.95%

K-Nearest Neighbors (KNN)

Best Parameters:

  • n_neighbors = 9
  • weights = distance
  • p = 2

Results:

  • Accuracy = 94.08%
  • F1 Score = 94.13%

XGBoost

Best Parameters:

  • n_estimators = 300
  • max_depth = 6
  • learning_rate = 0.3
  • subsample = 0.8
  • colsample_bytree = 0.8

Results:

  • Accuracy = 93.42%
  • F1 Score = 93.37%

Ensemble Learning

Soft Voting Ensemble combining:

  • SVM
  • Random Forest
  • KNN
  • XGBoost
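A minimal sketch of a soft voting ensemble with scikit-learn's VotingClassifier follows. The SVM, Random Forest, and KNN settings are the best parameters reported above. To keep the example dependent on scikit-learn only, GradientBoostingClassifier is used as a stand-in for XGBoost; an xgboost.XGBClassifier could take that slot instead. The data are synthetic, and the split is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 5-class problem standing in for the five sample categories.
X, y = make_classification(
    n_samples=400, n_features=30, n_informative=10, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, class_sep=2.0, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

estimators = [
    # probability=True is needed so SVC exposes predict_proba for soft voting.
    ("svm", SVC(kernel="linear", C=100, probability=True, random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=200, max_depth=20, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=9, weights="distance", p=2)),
    # Stand-in for XGBoost (assumption, see the note above).
    ("gb", GradientBoostingClassifier(random_state=0)),
]

# Soft voting averages the predicted class probabilities of all four models.
ensemble = make_pipeline(StandardScaler(), VotingClassifier(estimators, voting="soft"))
ensemble.fit(X_train, y_train)

accuracy = ensemble.score(X_test, y_test)
probabilities = ensemble.predict_proba(X_test)
print(f"held-out accuracy on synthetic data: {accuracy:.3f}")
```

Soft voting requires every member to provide class probabilities, which is why the SVM is configured differently from its default.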

Final Performance

| Metric | Score |
|---|---|
| Accuracy | 95.39% |
| Precision | 95.55% |
| Recall | 95.39% |
| F1 Score | 95.44% |
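The reported recall matches the accuracy exactly, which is what support-weighted averaging across classes produces. The README does not state the averaging mode, so this is an inference. The sketch below shows the identity on made-up toy labels, not on the project's predictions.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy labels for illustration only.
y_true = ["Breast", "Breast", "Liver", "Liver", "Lung",
          "Lung", "Skin", "No Cancer", "No Cancer", "No Cancer"]
y_pred = ["Breast", "Liver", "Liver", "Liver", "Lung",
          "Lung", "Skin", "No Cancer", "No Cancer", "Lung"]

accuracy = accuracy_score(y_true, y_pred)

# average="weighted" weights each class's score by its number of true samples.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

Weighted recall sums the correctly classified samples of every class and divides by the total sample count, which is the definition of accuracy.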

Best Model: Voting Ensemble


Web Application

A Flask-based web application was developed for deployment.

Features:

  • Upload gene expression CSV file
  • Automatic preprocessing
  • Cancer type prediction
  • Download sample input file
  • Interactive prediction results page
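The upload-and-predict flow might look roughly like the sketch below. Everything specific here is hypothetical: the `/predict` route, the `file` form field, the gene column names, and the placeholder predictor. The actual app presumably loads the pickled scaler, model, and label encoder from final_models/ and renders result.html rather than returning JSON.

```python
import io

import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical feature list; the real app would use the 1,268 selected genes.
EXPECTED_FEATURES = ["gene_a", "gene_b", "gene_c"]


def predict_labels(frame):
    """Placeholder standing in for scaler.transform followed by model.predict."""
    return ["No Cancer"] * len(frame)


@app.route("/predict", methods=["POST"])
def predict():
    uploaded = request.files.get("file")
    if uploaded is None or uploaded.filename == "":
        return jsonify(error="no CSV file uploaded"), 400

    # Read the upload fully into memory and parse it as CSV.
    frame = pd.read_csv(io.BytesIO(uploaded.read()))

    # Reject files that lack any gene column the model was trained on.
    missing = [name for name in EXPECTED_FEATURES if name not in frame.columns]
    if missing:
        return jsonify(error="missing gene columns", missing=missing), 400

    # Select columns in training order before handing them to the model.
    features = frame[EXPECTED_FEATURES]
    return jsonify(predictions=predict_labels(features))
```

Validating and reordering columns before prediction matters because a fitted scaler and model expect features in the exact order seen during training.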

Technologies Used

  • Python
  • Flask
  • Pandas
  • NumPy
  • Scikit-Learn
  • XGBoost
  • Joblib
  • Bootstrap 5
  • HTML/CSS

Project Structure

Bio/
│
├── app.py
├── requirements.txt
├── merged_LASSO_c_2.csv
│
├── final_models/
│   ├── best_model.pkl
│   ├── scaler.pkl
│   └── label_encoder.pkl
│
├── templates/
│   ├── index.html
│   └── result.html
│
├── sample_input.csv
└── README.md

Academic Project

Developed as part of a Bioinformatics Machine Learning project for cancer classification using gene expression data.
