Homework Assignment: GitHub Repository Intelligence with LLMs and BERT
Course: Machine Learning / NLP / Applied AI
Total Score: 20 points
Technical implementation: 12 points
Presentation video: 8 points
Deadline: Friday — 11:59 PM
Description
The goal of this assignment is to build a complete weak-supervision NLP pipeline using:
GitHub API
Large Language Models (LLMs)
BERT-based models
Repository metadata
Open-source ecosystem signals
You will create a system capable of analyzing GitHub repositories and classifying them according to one of the following project tracks:
Available Project Tracks
Track A — Hiring-Oriented Repository Intelligence
Build a system that evaluates whether a repository reflects work expected from:
Intern-level engineering
Junior-level engineering
Senior-level engineering
Lead/Architect-level engineering
Template/Boilerplate/Replica repository
Low-value repository not worth detailed review
The objective is NOT to judge the developer directly.
The objective is:
estimate the engineering maturity and complexity reflected by the repository itself.
This can help:
recruiters,
engineering managers,
startups,
accelerators,
and technical screening systems.
The challenge is determining:
which repository signals matter,
how to summarize them,
and how to define engineering maturity categories.
Track B — Technology Innovation & Ecosystem Tracking
Build a system capable of identifying:
emerging technologies,
mature ecosystems,
declining technologies,
and experimental or niche technical areas
using GitHub repository activity and metadata.
The objective is NOT to predict whether code is “good.”
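The objective is:
analyze repository and ecosystem signals to understand technological momentum and innovation trends.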
Examples:
AI agents
vector databases
cybersecurity tooling
blockchain infrastructure
robotics frameworks
MLOps platforms
This can help:
investors,
researchers,
governments,
consulting firms,
and technology analysts.
The challenge is determining:
which GitHub signals represent innovation,
how to define “growth” or “decline,”
and how to convert repository behavior into measurable technological trends.
Main Objective
Build a complete pipeline that:
Collects repository information from the GitHub API
Creates repository summaries/features
Uses an LLM to generate weak labels
Fine-tunes a BERT-based classifier
Evaluates model performance
Explains the business and analytical value of the system
You are NOT given:
fixed categories,
fixed prompts,
fixed features,
or fixed methodologies.
Those decisions are part of the assignment.
Expected Repository Structure
Create a repository named exactly:
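For Track A:
github_hiring_repository_intelligence
For Track B:
github_technology_innovation_tracking
with the following structure:
repository_name/
│
├── app.py                      # Streamlit app
├── README.md                   # Project explanation
├── requirements.txt            # Dependencies
│
├── src/
│   ├── github_collector.py     # GitHub API extraction
│   ├── preprocessing.py        # Cleaning and transformations
│   ├── summarization.py        # Repository summary generation
│   ├── llm_labeling.py         # Weak labeling with LLMs
│   ├── train.py                # BERT fine-tuning
│   ├── evaluation.py           # Metrics and validation
│   ├── visualization.py        # Graphs and analysis
│   └── utils.py                # Helper functions
│
├── data/
│   ├── raw/
│   ├── processed/
│   ├── labeled/
│   └── splits/
│
├── models/
│   └── trained_models/
│
├── output/
│   ├── figures/
│   ├── tables/
│   └── metrics/
│
└── video/
    └── link.txt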
Required Pipeline
Your project must contain the following stages.
Stage 1 — GitHub Data Collection
You must use the GitHub API to collect repository information.
You are free to choose repositories and sampling strategies.
You may use:
REST API
GraphQL API
You must explain:
how repositories were selected,
why they were selected,
and how selection may affect the results.
Minimum Required Features
You must extract at least 6 repository-level signals.
Examples include:
number of contributors
commit frequency
stars/forks
issue activity
pull request activity
release frequency
README characteristics
workflow/CI presence
dependency updates
repository topics/tags
repository age
last activity date
You are encouraged to experiment with additional signals.
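As a starting point for this stage, the sketch below shows one possible way to request a repository from the GitHub REST API and reduce the JSON payload to a handful of signals. The function names (`build_repo_request`, `fetch_repo`, `extract_signals`) and the choice of signals are illustrative assumptions, not part of the assignment. The field names follow the response of `GET /repos/{owner}/{repo}`; contributor, commit, and release counts live on separate endpoints and are not covered here.

```python
import json
import urllib.request
from datetime import datetime, timezone

API_ROOT = "https://api.github.com"


def build_repo_request(owner, repo, token=None):
    """Prepare (but do not send) a GET request for one repository."""
    headers = {
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }
    if token:  # an access token raises the API rate limit
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(f"{API_ROOT}/repos/{owner}/{repo}", headers=headers)


def fetch_repo(owner, repo, token=None):
    """Send the request and decode the JSON body (needs network access)."""
    with urllib.request.urlopen(build_repo_request(owner, repo, token)) as resp:
        return json.load(resp)


def _parse_ts(value):
    # GitHub timestamps look like "2024-01-31T12:00:00Z".
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def extract_signals(payload, now=None):
    """Reduce a repository payload to a flat dict of signals."""
    now = now or datetime.now(timezone.utc)
    created = _parse_ts(payload["created_at"])
    pushed = _parse_ts(payload["pushed_at"])
    return {
        "full_name": payload["full_name"],
        "stars": payload["stargazers_count"],
        "forks": payload["forks_count"],
        # This GitHub counter includes open pull requests as well as issues.
        "open_issues": payload["open_issues_count"],
        "topics": payload.get("topics", []),
        "age_days": (now - created).days,
        "days_since_last_push": (now - pushed).days,
    }
```

Keeping the network call (`fetch_repo`) separate from the parsing step (`extract_signals`) means raw payloads can be cached under data/raw/ and re-processed without hitting the API again.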
Stage 2 — Repository Representation
You must convert repository information into a format usable by:
LLMs
and BERT models
This may include:
textual summaries,
structured prompts,
concatenated metadata,
or hybrid representations.
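Example:
Repository has 15 contributors, active CI/CD workflows,
weekly commits, regular releases, and extensive documentation.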
You must justify:
why your representation is useful,
and why it may help classification.
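One way to produce a textual summary is a small template function over the collected signals. The sketch below is only an illustration: the signal keys (`contributors`, `has_ci`, `commits_last_90_days`, `releases`, `readme_words`, `topics`) and the cadence thresholds are assumptions you would replace with your own choices.

```python
def describe_cadence(commits_last_90_days):
    """Map a commit count to a coarse phrase (thresholds are arbitrary)."""
    if commits_last_90_days >= 13:  # about one commit per week over 90 days
        return "at least weekly commits"
    if commits_last_90_days > 0:
        return "occasional commits"
    return "no recent commits"


def summarize_repository(signals):
    """Render a dict of repository signals as one short English summary."""
    ci = "active CI/CD workflows" if signals["has_ci"] else "no CI/CD workflows"
    topics = ", ".join(signals["topics"]) or "none"
    return (
        f"Repository has {signals['contributors']} contributors, {ci}, "
        f"{describe_cadence(signals['commits_last_90_days'])}, "
        f"{signals['releases']} releases, and a README of "
        f"{signals['readme_words']} words. Topics: {topics}."
    )
```

Because the same string is later fed to both the LLM and the BERT tokenizer, a deterministic template like this makes it easier to attribute changes in results to changes in the representation.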
Stage 3 — Weak Labeling with LLMs
You must use an LLM to generate labels for the training dataset.
Examples:
OpenAI
Claude
DeepSeek
Gemini
Mistral
Qwen
The LLM acts as:
the initial annotator of repository categories.
You must:
explain your prompt design,
justify your category definitions,
and discuss limitations of LLM-generated labels.
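The labeling step can be kept independent of any particular provider. The sketch below assumes a hypothetical `call_llm` function (any callable that takes a prompt string and returns the model's text answer) and a hypothetical label set; neither is prescribed by the assignment. It shows a constrained prompt plus a parser that rejects answers outside the allowed labels, which is one simple guard against noisy LLM output.

```python
# Hypothetical label set (Track B style); define your own categories.
LABELS = ["emerging", "mature", "declining", "niche"]


def build_prompt(summary, labels=LABELS):
    """Build a prompt that restricts the answer to one known label."""
    options = ", ".join(labels)
    return (
        "You are annotating GitHub repositories.\n"
        f"Choose exactly one label from: {options}.\n"
        "Answer with the label only.\n\n"
        f"Repository summary:\n{summary}"
    )


def parse_label(raw_response, labels=LABELS):
    """Return the matching label, or None when the answer is unusable."""
    cleaned = raw_response.strip().strip(".\"'").lower()
    return cleaned if cleaned in labels else None


def weak_label(summaries, call_llm, labels=LABELS):
    """Label each summary; call_llm is any function mapping str to str."""
    rows = []
    for summary in summaries:
        answer = call_llm(build_prompt(summary, labels))
        rows.append({"text": summary, "label": parse_label(answer, labels)})
    return rows
```

Rows whose label is `None` can be counted and reported as part of the discussion of LLM labeling limitations, then dropped or re-prompted before training.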
Stage 4 — Train / Validation / Test Split
You must create:
Train dataset
Validation dataset
Test dataset
Suggested split:
70% train
15% validation
15% test
The test dataset must remain unseen during training.
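A minimal sketch of a 70/15/15 split that keeps label proportions roughly equal across the three sets is shown below. It uses only the standard library; the function name `stratified_split` and the row format (dicts with a `label` key) are assumptions. scikit-learn's `train_test_split` with its `stratify` argument, applied twice, is a common alternative.

```python
import random
from collections import defaultdict


def stratified_split(rows, train=0.70, val=0.15, seed=42):
    """Split rows into train/val/test, keeping label proportions similar."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = {"train": [], "val": [], "test": []}
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = round(len(group) * train)
        n_val = round(len(group) * val)
        splits["train"] += group[:n_train]
        splits["val"] += group[n_train:n_train + n_val]
        # Whatever remains goes to the test set and is never used for training.
        splits["test"] += group[n_train + n_val:]
    return splits
```

Very small categories may end up with no validation or test rows under this scheme, which is worth checking before training.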
Stage 5 — Fine-Tuning a BERT-Based Model
You must fine-tune one lightweight transformer model.
Recommended options:
DistilBERT
ModernBERT
MiniLM
DeBERTa-v3-small
The objective is NOT massive-scale training.
The objective is:
learn how weak supervision pipelines work in realistic AI systems.
Input:
repository representations
Output:
repository category prediction
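The following is a rough sketch of how fine-tuning could look with the HuggingFace `Trainer`, not a reference solution. It assumes rows shaped like `{"text": ..., "label": ...}`, the `distilbert-base-uncased` checkpoint, and arbitrary hyperparameters (3 epochs, batch size 16, learning rate 2e-5, 256 tokens); all of these are placeholders to tune. The heavy libraries are imported inside `fine_tune` so the helper `build_label_maps` can be used and tested without them.

```python
def build_label_maps(labels):
    """Create the label/id mappings a classification head expects."""
    names = sorted(set(labels))
    label2id = {name: i for i, name in enumerate(names)}
    id2label = {i: name for name, i in label2id.items()}
    return label2id, id2label


def fine_tune(train_rows, val_rows, model_name="distilbert-base-uncased",
              output_dir="models/trained_models"):
    """Fine-tune a small transformer on weakly labeled summaries."""
    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    label2id, id2label = build_label_maps([r["label"] for r in train_rows])
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def encode(rows):
        # Every validation label must also occur in the training rows.
        ds = Dataset.from_dict({
            "text": [r["text"] for r in rows],
            "label": [label2id[r["label"]] for r in rows],
        })
        # Pad to a fixed length so the default data collator can batch rows.
        return ds.map(
            lambda batch: tokenizer(batch["text"], truncation=True,
                                    padding="max_length", max_length=256),
            batched=True,
        )

    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=len(label2id),
        id2label=id2label, label2id=label2id,
    )
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=3,
                             per_device_train_batch_size=16, learning_rate=2e-5)
    trainer = Trainer(model=model, args=args,
                      train_dataset=encode(train_rows),
                      eval_dataset=encode(val_rows))
    trainer.train()
    trainer.save_model(output_dir)
    return trainer
```

The test rows are deliberately not passed to this function, so they stay unseen until the evaluation stage.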
Stage 6 — Evaluation and Error Analysis
You must evaluate:
Accuracy
Precision
Recall
F1-score
You must also:
analyze common errors,
compare categories,
discuss weak points,
and explain possible improvements.
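To make the four metrics concrete, the sketch below computes accuracy, per-class precision, recall, and F1, and a macro-averaged F1 from two label lists using only the standard library. The function name `evaluate_predictions` is an assumption; in practice scikit-learn's metrics module provides equivalent functions. The returned confusion counts can feed the error analysis.

```python
from collections import Counter


def evaluate_predictions(y_true, y_pred):
    """Compute accuracy, per-class precision/recall/F1, and macro F1."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = Counter(zip(y_true, y_pred))  # (true, predicted) -> count
    per_class = {}
    for label in labels:
        tp = pairs[(label, label)]
        fp = sum(pairs[(t, label)] for t in labels if t != label)
        fn = sum(pairs[(label, p)] for p in labels if p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[label] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(pairs[(label, label)] for label in labels) / len(y_true)
    macro_f1 = sum(m["f1"] for m in per_class.values()) / len(labels)
    return {
        "accuracy": accuracy,
        "macro_f1": macro_f1,
        "per_class": per_class,
        "confusion": dict(pairs),
    }
```

Keep in mind that these scores measure agreement with the LLM's weak labels, not with a human-verified ground truth, unless part of the test set is reviewed manually.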
Track A — Required Analytical Questions
Question 1 — Engineering Maturity
Which repository signals appear most associated with:
intern-level repositories,
junior-level repositories,
senior-level repositories,
or lead-level repositories?
You must justify your reasoning.
Question 2 — Low-Value or Replica Repositories
How can repositories that are:
duplicated,
template-based,
unfinished,
or low-value
be differentiated from repositories showing meaningful engineering complexity?
You must define your logic.
Question 3 — Hiring Signal Interpretation
Why might your classification system be useful for:
recruiters,
startups,
technical interview pipelines,
or engineering managers?
You must explain:
business value,
limitations,
and ethical considerations.
Question 4 — Methodological Sensitivity
How do results change when:
repository features change,
prompts change,
or category definitions change?
You must compare:
one baseline approach
and one alternative approach.
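One simple way to quantify the difference between a baseline and an alternative approach is to label the same repositories twice and count how often the labels agree and which label pairs change most. The sketch below assumes two equally ordered lists of labels; the function name `compare_labelings` is hypothetical.

```python
from collections import Counter


def compare_labelings(baseline, alternative):
    """Summarize how two labelings of the same repositories differ."""
    if len(baseline) != len(alternative):
        raise ValueError("both labelings must cover the same repositories")
    # Count each (baseline label, alternative label) pair that disagrees.
    changes = Counter(
        (b, a) for b, a in zip(baseline, alternative) if b != a
    )
    agreement = 1 - sum(changes.values()) / len(baseline)
    return {"agreement": agreement, "changes": changes.most_common()}
```

The same comparison applies whether the change is in the prompt, the feature set, or the category definitions, as long as both runs cover the same repositories.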
Track B — Required Analytical Questions
Question 1 — Technology Momentum
Which repository signals appear associated with:
emerging technologies,
mature ecosystems,
declining technologies,
or experimental/niche areas?
You must justify your reasoning.
Question 2 — Innovation Signals
What types of GitHub activity appear to indicate:
technological growth,
ecosystem expansion,
or declining interest?
You must explain:
why you selected those signals,
and their limitations.
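As one possible definition of "growth" or "decline", the sketch below compares average activity in the most recent months with the window just before it. The window length and the thresholds (1.25 and 0.8) are arbitrary assumptions meant to be varied in the sensitivity analysis; the input is any monthly count series, such as commits, new stars, or opened issues.

```python
def growth_ratio(monthly_counts, window=3):
    """Compare mean activity in the latest window with the one before it."""
    if len(monthly_counts) < 2 * window:
        raise ValueError("need at least two full windows of history")
    recent = monthly_counts[-window:]
    previous = monthly_counts[-2 * window:-window]
    prev_mean = sum(previous) / window
    if prev_mean == 0:
        # No earlier activity: any recent activity counts as unbounded growth.
        return float("inf") if sum(recent) else 1.0
    return (sum(recent) / window) / prev_mean


def trend_label(ratio, up=1.25, down=0.8):
    """Turn a growth ratio into a coarse trend category."""
    if ratio >= up:
        return "growing"
    if ratio <= down:
        return "declining"
    return "stable"
```

A ratio like this is sensitive to one-off events such as a viral post or a bulk import of commits, which is one of the limitations worth discussing.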
Question 3 — Business and Economic Value
Why could this system be useful for:
investors,
consulting firms,
governments,
or technology researchers?
You must explain:
business value,
practical applications,
and limitations.
Question 4 — Methodological Sensitivity
How do results change when:
repository features,
growth definitions,
or prompts
change?
You must compare:
one baseline approach
and one alternative approach.
Technical Requirements
Your project must include:
HuggingFace Transformers
pandas
scikit-learn
matplotlib
seaborn
Streamlit
Optional:
PyTorch
datasets
accelerate
plotly
Streamlit Application
Score: 4 points
Your Streamlit app must contain exactly 4 tabs.
Tab 1 — Problem & Methodology
Include:
project objective
repository selection methodology
GitHub signals used
prompt strategy
dataset construction
limitations
Tab 2 — Exploratory Analysis
Include:
repository statistics
category distributions
signal comparisons
selected visualizations
You must explain:
why those visualizations were selected,
and what analytical insight they provide.
Tab 3 — Model Results
Include:
evaluation metrics
confusion matrix
category performance
baseline vs alternative comparison
Tab 4 — Interactive Repository Exploration
Include:
repository search/filtering
category predictions
metadata exploration
model prediction examples
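The four-tab layout maps directly onto Streamlit's `st.tabs`. The skeleton below is a sketch with placeholder text only; the function name `render_app`, the page title, and the placeholder strings are assumptions. Streamlit is imported inside the function so the module can be loaded without it.

```python
TAB_TITLES = [
    "Problem & Methodology",
    "Exploratory Analysis",
    "Model Results",
    "Interactive Repository Exploration",
]


def render_app():
    """Render the four required tabs with placeholder content."""
    import streamlit as st

    st.title("GitHub Repository Intelligence")
    problem, eda, results, explore = st.tabs(TAB_TITLES)

    with problem:
        st.header(TAB_TITLES[0])
        st.write("Objective, repository selection, signals, prompts, limitations.")
    with eda:
        st.header(TAB_TITLES[1])
        st.write("Repository statistics and category distributions go here.")
    with results:
        st.header(TAB_TITLES[2])
        st.write("Metrics, confusion matrix, baseline vs alternative comparison.")
    with explore:
        st.header(TAB_TITLES[3])
        query = st.text_input("Filter repositories by name")
        st.write(f"Showing repositories matching: {query!r}")


# In app.py, call render_app() at module level so that
# `streamlit run app.py` executes it.
```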
README.md Must Include
What does the project do?
Which track was selected?
What repositories were analyzed?
Which GitHub signals were used?
How were repository summaries created?
How were prompts designed?
How was the dataset split?
Which BERT model was used?
What were the final metrics?
What are the main limitations?
What are the possible business applications?
How to run the project?
How to run the Streamlit app?
Explanatory Video
Score: 8 points
Create a video of at most 5 minutes.
The video is NOT a coding walkthrough.
The video must be presented as:
a pitch of the idea and system.
The goal is to communicate:
analytical thinking,
business understanding,
AI pipeline design,
and practical usefulness.
The Video Must Explain
1. Problem Definition
What real-world problem are you solving?
2. Repository Signals
What GitHub information did you collect?
Examples:
contributors
commits
issue activity
releases
repository topics
workflow files
Why do you believe these are meaningful signals?
3. LLM Weak Labeling
What did you feed to the LLM?
Why do you think the LLM can help classify repositories?
4. Classification Logic
Which categories did you define?
Why are those categories useful?
5. Business Value
Who could use this system?
Why would it matter in reality?
Possible examples:
recruiting
investment analysis
technology trend analysis
startup evaluation
ecosystem monitoring
6. Model Performance
Show:
basic metrics,
confusion matrix,
examples of correct/incorrect predictions.
Important
The presentation should focus on:
ideas,
methodology,
reasoning,
and business usefulness.
Do NOT spend the presentation showing code line-by-line.
GitHub Workflow (MANDATORY)
❌ Do not work directly on main
✅ Create development branches
✅ Use descriptive commits
✅ Merge through Pull Requests
Example branches:
feature/github-scraping
feature/llm-labeling
feature/bert-training
feature/streamlit-dashboard
Grading Rubric
Technical Implementation — 12 points
Criteria | Points
-- | --
GitHub data collection pipeline | 2 pts
Repository representation and preprocessing | 2 pts
LLM weak labeling methodology | 2 pts
BERT fine-tuning pipeline | 2 pts
Evaluation and error analysis | 2 pts
Streamlit app completeness | 2 pts
Checklist Before Submitting
Repository has the correct name
GitHub API was used
At least 6 repository signals were extracted
LLM weak labeling was implemented
Train/validation/test split exists
BERT model was fine-tuned
Evaluation metrics are included
Streamlit app contains exactly 4 tabs
README explains methodology and findings
Video link exists in video/link.txt
Work was done using branches and Pull Requests
Repository is reproducible
Final Note
This assignment is intentionally designed to evaluate:
analytical reasoning,
AI system design,
weak supervision understanding,
and business thinking.
The most important part is NOT achieving the highest accuracy.
The most important part is being able to justify:
why you selected certain GitHub signals,
why your prompts make sense,
why your categories are meaningful,
and why your system could be useful in reality.