Intelligent Digital Advertisement Classification System
Automatically categorises job ads, house listings, apartment rentals, retail postings, and banking ads using multiple machine learning algorithms.
- Project Overview
- Categories
- Dataset
- Algorithm Comparison
- Why Each Algorithm?
- Project Structure
- Setup & Installation
- Running the App
- Retraining the Model
- Results Summary
AdClassifier Pro is a full-stack NLP classification application that:
- Classifies digital advertisements into 5 categories in real-time
- Compares 6 different ML algorithms with metrics, charts, and confusion matrices
- Provides a Streamlit web dashboard for single-input classification, batch CSV/Excel processing, and algorithm comparison
- Handles class imbalance via synthetic data augmentation and `class_weight="balanced"`
The best-performing model (LinearSVC) is automatically saved and used in production.
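As a rough illustration of what `class_weight="balanced"` does, the sketch below applies scikit-learn's documented formula, `n_samples / (n_classes * class_count)`, to the class counts reported in the dataset section. It uses the full-dataset counts for simplicity; the actual weights would be computed on whatever training split the project scripts use, so treat the numbers as approximate.

```python
# Class counts as reported in this README (full dataset, after balancing).
counts = {
    "Sell – House": 607,
    "Jobs – Retail": 552,
    "Rent – Apartment": 550,
    "Jobs – IT": 549,
    "Banking": 547,
}

n_samples = sum(counts.values())
n_classes = len(counts)

# "balanced" weighting: rarer classes get a weight above 1, larger ones below 1.
weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}

for label, weight in weights.items():
    print(f"{label:<18} {weight:.3f}")
```

Because the classes are already close to balanced, every weight stays near 1.0, so the weighting only makes a small correction here.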
| Category | Description |
|---|---|
| Jobs – IT | Software engineering, DevOps, cloud, sysadmin, cybersecurity roles |
| Jobs – Retail | Cashier, sales associate, store manager, stock associate roles |
| Banking | Loan officer, financial advisor, credit analyst, accountant roles |
| Sell – House | Property listings with bedrooms, bathrooms, price, features |
| Rent – Apartment | Apartment rental ads with rent price, amenities, availability |
| Source | Samples |
|---|---|
| Original (`ConcatenatedDigitalAdData.xlsx`) | ~1,541 |
| Synthetic (generated via `retrain_model.py`) | ~1,270 |
| Total after merging | 2,805 |
Class distribution after balancing:
Sell – House 607
Jobs – Retail 552
Rent – Apartment 550
Jobs – IT 549
Banking 547
Train / Test split: 80% / 20% (stratified)
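A stratified 80/20 split keeps each category's share the same in the training and test sets. The following is a minimal sketch using scikit-learn's `train_test_split`; the toy texts, the two-class labels, and `random_state=42` are illustrative assumptions, not values taken from the project scripts.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy, deliberately imbalanced data: 10 banking ads and 40 IT job ads.
texts = [f"ad number {i}" for i in range(50)]
labels = ["Banking"] * 10 + ["Jobs – IT"] * 40

X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.2,      # 80% train / 20% test
    stratify=labels,    # preserve the class proportions in both parts
    random_state=42,    # assumed seed, only for reproducibility of the sketch
)

print("train:", Counter(y_train))
print("test: ", Counter(y_test))
```

With stratification, the 1:4 class ratio of the toy data is preserved in both the 40-sample training set and the 10-sample test set.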
Six algorithms were trained and evaluated on identical data splits. Results are ranked by Test Accuracy:
| Rank | Algorithm | Test Accuracy | Macro F1 | Weighted F1 | CV Mean (5-fold) | CV Std | Train Time |
|---|---|---|---|---|---|---|---|
| 1 | LinearSVC (Production) | 97.33% | 97.28% | 97.32% | 97.54% | ±0.35% | 0.8s |
| 2 | Logistic Regression | 96.97% | 96.91% | 96.97% | 97.54% | ±0.58% | 1.5s |
| 2 | Random Forest | 96.97% | 96.94% | 96.98% | 96.47% | ±0.67% | 1.4s |
| 4 | Multinomial Naive Bayes | 96.79% | 96.73% | 96.79% | 97.08% | ±0.49% | 0.3s |
| 5 | Gradient Boosting | 95.37% | 95.35% | 95.40% | 96.04% | ±0.73% | 65.9s |
| 6 | K-Nearest Neighbors | 94.65% | 94.54% | 94.63% | 95.22% | ±0.86% | 0.5s |
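The table reports both Macro F1 and Weighted F1. As a small sketch of the difference, the example below computes both with scikit-learn's `f1_score` on a made-up four-sample prediction: macro averaging treats every class equally, while weighted averaging scales each class's F1 by its number of true samples.

```python
from sklearn.metrics import f1_score

# Made-up labels: class "a" has 3 true samples, class "b" has 1.
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]

# Per-class F1: a = 0.8 (precision 1.0, recall 2/3), b = 2/3 (precision 0.5, recall 1.0)
macro_f1 = f1_score(y_true, y_pred, average="macro")        # (0.8 + 2/3) / 2
weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # (3 * 0.8 + 1 * 2/3) / 4

print(f"macro F1:    {macro_f1:.4f}")
print(f"weighted F1: {weighted_f1:.4f}")
```

On a near-balanced dataset such as this one the two averages stay close, which matches the small gaps between the Macro F1 and Weighted F1 columns.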
All algorithms use TF-IDF vectorization (bigrams, 20k max features) as input features.
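A TF-IDF configuration along these lines might look like the sketch below. It assumes that "bigrams" means unigrams plus bigrams (`ngram_range=(1, 2)`); the project scripts may use different settings, and the three sample ads are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

ads = [
    "Senior software engineer needed for cloud migration",
    "Two bedroom apartment for rent near downtown",
    "Loan officer position at regional bank",
]

# ngram_range=(1, 2) is an assumption: single words plus two-word phrases.
# max_features caps the vocabulary at the 20k most frequent terms.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=20_000)
X = vectorizer.fit_transform(ads)

print(X.shape)                                       # (documents, vocabulary size)
print("software engineer" in vectorizer.vocabulary_)  # bigram was learned
```

The result is a sparse matrix with one row per ad, which is the input format the classifiers in this comparison consume.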
Why chosen (LinearSVC): Linear SVMs are a strong, widely used choice for high-dimensional sparse text classification. TF-IDF produces very large sparse feature vectors in which linear boundaries often separate classes cleanly. LinearSVC has no `predict_proba` of its own, so CalibratedClassifierCV wraps it to produce probability estimates. It achieves the best test accuracy (97.33%) with the third-fastest training time (0.8s, behind Naive Bayes and K-Nearest Neighbors), making it a good fit for production.
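The wrapping described above can be sketched as a small pipeline. This is an illustration under assumptions, not the project's training code: the eight toy ads, the two-class setup, `cv=2` (chosen only because the toy set is tiny), and the TF-IDF settings are all hypothetical.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "software engineer needed for cloud team",
    "devops engineer with kubernetes experience",
    "senior python developer remote role",
    "cybersecurity analyst for network team",
    "two bedroom apartment for rent downtown",
    "studio apartment available, rent includes utilities",
    "apartment rental near the park with parking",
    "furnished flat for rent, pets allowed",
]
labels = ["Jobs – IT"] * 4 + ["Rent – Apartment"] * 4

# LinearSVC only exposes decision scores; the calibration wrapper fits a
# mapping from those scores to probabilities using internal cross-validation.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=20_000),
    CalibratedClassifierCV(LinearSVC(class_weight="balanced"), cv=2),
)
model.fit(texts, labels)

proba = model.predict_proba(["apartment for rent with two bedrooms"])[0]
for label, p in zip(model.classes_, proba):
    print(f"{label}: {p:.2f}")
```

Probabilities calibrated on so few samples are not meaningful in themselves; the point is only that the wrapped model gains a `predict_proba` method, which a confidence display needs.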
Why included (Logistic Regression): A natural probabilistic alternative to SVM. Outputs class probabilities directly, with no calibration wrapper needed. Tied for 2nd place on test accuracy (96.97%), and its CV mean matches LinearSVC (97.54%). Highly interpretable: coefficients directly indicate which words drive each classification.
Why included (Multinomial Naive Bayes): The classic NLP baseline algorithm. Assumes conditional independence of features given the class. Extremely fast (0.3s training) and achieved 96.79%, within 0.54 percentage points of the top model. Best choice when training resources are severely limited or real-time retraining is needed.
Why included (Random Forest): A bagging ensemble of 200 decision trees. Handles non-linear feature interactions, and averaging over many trees reduces overfitting. Tied for 2nd place on test accuracy (96.97%) but has a lower CV mean (96.47%), suggesting it generalizes slightly less consistently than linear models on text data.
Why included (Gradient Boosting): A sequential boosting ensemble in which each new tree corrects the errors of the previous ones. Strong on structured tabular data. Included here to show the accuracy/speed trade-off: only 95.37% accuracy while taking 65.9 seconds to train, far slower than the linear models on text (0.8s to 1.5s).
Why included (K-Nearest Neighbors): A non-parametric, instance-based learner. Classifies new samples by majority vote among the k=7 nearest training examples under cosine similarity. Included as a simple distance-based baseline. Performs weakest (94.65%), likely due to the curse of dimensionality in high-dimensional TF-IDF space.
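A k=7 cosine-distance classifier of this kind can be sketched with scikit-learn's `KNeighborsClassifier`. The ten toy ads and the query are invented for illustration; only `n_neighbors=7` and the cosine metric come from the description above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

it_ads = [
    "software engineer for cloud platform",
    "devops engineer wanted",
    "security engineer full time",
    "data engineer remote",
    "network engineer on site",
]
rent_ads = [
    "apartment for rent downtown",
    "studio apartment available now",
    "two bedroom apartment with balcony",
    "furnished apartment near campus",
    "apartment rental pets allowed",
]
texts = it_ads + rent_ads
labels = ["Jobs – IT"] * len(it_ads) + ["Rent – Apartment"] * len(rent_ads)

# metric="cosine" makes scikit-learn use brute-force search, which also
# works directly on the sparse TF-IDF matrix.
knn = make_pipeline(
    TfidfVectorizer(),
    KNeighborsClassifier(n_neighbors=7, metric="cosine"),
)
knn.fit(texts, labels)

prediction = knn.predict(["cloud engineer position"])[0]
print(prediction)
```

The query shares the word "engineer" with all five IT ads and no word with any rental ad, so the five IT ads are its closest neighbours and win the 7-neighbour vote.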
Digital-Advertisement-Classification/
│
├── streamlit_app.py                        # Main Streamlit web application
├── retrain_model.py                        # Generates synthetic data & retrains LinearSVC
├── compare_algorithms.py                   # Trains & compares all 6 algorithms
├── requirements.txt                        # Python dependencies
│
├── data/
│   ├── ConcatenatedDigitalAdData.xlsx      # Original labelled dataset
│   ├── synthetic_data.csv                  # Augmented synthetic training samples
│   └── comparison_report/                  # Auto-generated by compare_algorithms.py
│       ├── algorithm_comparison.csv        # Numerical results table
│       ├── algorithm_reasons.txt           # Text explanations per algorithm
│       ├── accuracy_comparison.png         # Bar chart: accuracy vs CV
│       ├── f1_comparison.png               # Bar chart: Macro F1 vs Weighted F1
│       ├── training_time.png               # Horizontal bar: training speed
│       ├── radar_comparison.png            # Radar chart: multi-metric overview
│       ├── cm_LinearSVC_Current.png        # Confusion matrix per algorithm
│       └── cm_*.png                        # (one per algorithm)
│
└── notebook/
    └── model/
        ├── adv_model.sav                   # Active production model (best algorithm)
        └── adv_model_backup.sav            # Previous model backup
Requirements: Python 3.9+
# 1. Clone the repository
git clone https://github.com/kuxall/Digital-Advertisement-Classification.git
cd Digital-Advertisement-Classification
# 2. Create and activate virtual environment
python -m venv .venv
.venv\Scripts\activate # Windows
# source .venv/bin/activate # Mac/Linux
# 3. Install dependencies
pip install -r requirements.txt

requirements.txt includes:
streamlit
scikit-learn
pandas
numpy
openpyxl
matplotlib
seaborn
# Start the Streamlit dashboard
streamlit run streamlit_app.py

The app opens at http://localhost:8501 with 4 pages:
| Page | Description |
|---|---|
| Dashboard | Data distribution, total samples, category counts |
| Classifier | Real-time single-ad classification with confidence scores |
| Batch Processor | Upload CSV/Excel for bulk classification and download |
| Model Comparison | Interactive charts and rankings for all 6 algorithms |
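# Regenerate synthetic data and retrain LinearSVC
python retrain_model.py
# Train and compare all 6 algorithms
python compare_algorithms.py

Running compare_algorithms.py will: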
- Train all 6 algorithms on the same data split
- Print a full comparison table to the console
- Save charts and CSVs to `data/comparison_report/`
- Automatically save the best model to `notebook/model/adv_model.sav`
- Back up the previous model to `adv_model_backup.sav`
After running, restart the Streamlit app; it will load the new best model automatically.
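The save-and-back-up step could look roughly like the sketch below. It assumes the `.sav` files are written with Python's `pickle` module, which this README does not confirm (joblib is another common choice); the helper names `save_with_backup` and `load_model` are hypothetical, and plain dicts stand in for fitted models.

```python
import pickle
import shutil
import tempfile
from pathlib import Path


def save_with_backup(model, model_path: Path, backup_path: Path) -> None:
    """Copy the current model file to the backup path, then write the new model."""
    if model_path.exists():
        shutil.copy2(model_path, backup_path)
    with model_path.open("wb") as fh:
        pickle.dump(model, fh)


def load_model(model_path: Path):
    """Load a pickled model; only unpickle files from a trusted source."""
    with model_path.open("rb") as fh:
        return pickle.load(fh)


# Demo in a temporary directory so no real project files are touched.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "adv_model.sav"
    backup_path = Path(tmp) / "adv_model_backup.sav"

    save_with_backup({"name": "old model"}, model_path, backup_path)
    save_with_backup({"name": "new model"}, model_path, backup_path)

    active = load_model(model_path)
    previous = load_model(backup_path)
    print("active:  ", active)
    print("previous:", previous)
```

Copying before overwriting means the previous model can be restored if the newly trained one misbehaves.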
The system reaches 97.33% test accuracy on a near-balanced dataset (547 to 607 samples per class):
- Best Model: LinearSVC + TF-IDF (bigrams)
- Test Accuracy: 97.33%
- 5-Fold CV Accuracy: 97.54% (±0.35%)
- Total Samples: 2,805 (real + synthetic), split 80% / 20% into train and test
- Categories: 5
- Training Time: < 1 second
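The linear models (LinearSVC, Logistic Regression) achieve the highest 5-fold CV accuracy (97.54%), ahead of the ensemble and distance-based methods, although Random Forest ties Logistic Regression on test accuracy (96.97%). This is consistent with the common observation that linear classifiers suit sparse, high-dimensional TF-IDF features. Naive Bayes is the recommended lightweight alternative if training speed is critical (0.3s, the fastest measured).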
Built with Python · scikit-learn · Streamlit · TF-IDF · LinearSVC