The main.py script is the primary entry point for OptiBlogAi's content research and analysis pipeline. It performs automated keyword research, web scraping, content extraction, and SEO analysis.
Before running main.py, ensure you have:
- Python Environment: Python 3.10+ with all dependencies installed
- Google API Credentials: Required for search functionality
- Environment Variables: Properly configured .env file
Create a .env file in the project root with:
GOOGLE_API_KEY=your_google_api_key_here
GOOGLE_CSE_ID=your_custom_search_engine_id_here
How to get these credentials:
- Google API Key: In the Google Cloud Console, enable the Custom Search JSON API and create an API key
- Google CSE ID: Create a Custom Search Engine at Google CSE
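Before running the pipeline, it can help to confirm that both variables are actually visible to the process. The following is a minimal sketch, not part of main.py: the helper name `missing_credentials` is hypothetical, and it assumes the values from .env have already been loaded into the environment (for example by python-dotenv).

```python
import os

# Variable names taken from the .env example above.
REQUIRED_VARS = ("GOOGLE_API_KEY", "GOOGLE_CSE_ID")


def missing_credentials(env=None):
    """Return the required variable names that are unset or blank.

    Hypothetical helper for illustration; main.py may validate differently.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not (env.get(name) or "").strip()]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Missing credentials: " + ", ".join(missing))
    else:
        print("All credentials are set.")
```

Passing a plain dictionary instead of the real environment makes the check easy to try without touching your shell configuration.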
# Clone the repository
git clone <repository-url>
cd OptiBlogAi
# Create virtual environment
python -m venv myenv
source myenv/bin/activate # On Windows: myenv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Install additional dependencies (if needed)
pip install python-dotenv
# From the project root directory
cd /path/to/OptiBlogAi # Replace with the location of your clone
python src/main.py
The script will prompt you for:
- Search Keyword: The topic you want to research
- Number of Results: How many top-ranking pages to analyze (recommended: 5-10)
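The second prompt expects a whole number. As a rough sketch of how such an answer could be validated before it reaches the search step, the hypothetical helper below converts the raw text and checks it against bounds. The default bounds are assumptions loosely based on the recommended range, not limits that main.py is known to enforce.

```python
def parse_result_count(raw, minimum=1, maximum=10):
    """Convert a prompt answer to an int within [minimum, maximum].

    Hypothetical helper; the bounds are illustrative assumptions.
    """
    try:
        count = int(raw.strip())
    except ValueError:
        raise ValueError(f"Not a whole number: {raw!r}") from None
    if not minimum <= count <= maximum:
        raise ValueError(f"Expected a value from {minimum} to {maximum}, got {count}")
    return count
```

Raising `ValueError` in both failure cases lets a caller re-prompt with a single `except` clause.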
$ python src/main.py
Enter a keyword to search: artificial intelligence in healthcare
Enter the number of results to fetch: 5
Scenario: You want to write a blog post about "sustainable fashion trends"
Input:
- Keyword: sustainable fashion trends
- Number of results: 8
Expected Output:
Title: 10 Sustainable Fashion Trends That Are Changing the Industry
Readability: 65.4
Top keywords: ['sustainable', 'fashion', 'trends', 'eco-friendly', 'organic']
---
Title: The Future of Sustainable Fashion: 2024 Trends
Readability: 72.1
Top keywords: ['sustainable', 'fashion', 'eco', 'brands', 'materials']
---
...
Generated Files:
- sustainable fashion trends_2025-07-15/ (directory with crawled content)
- extracted_content/sustainable fashion trends_2025-07-15/ (processed content)
Scenario: Analyzing competitors for "AI marketing tools"
Input:
- Keyword: AI marketing tools
- Number of results: 10
Expected Output:
Title: Best AI Marketing Tools for 2024: Complete Guide
Readability: 58.7
Top keywords: ['AI', 'marketing', 'tools', 'automation', 'software']
---
Title: 15 AI-Powered Marketing Tools Every Marketer Needs
Readability: 61.3
Top keywords: ['AI', 'marketing', 'tools', 'analytics', 'personalization']
---
...
Scenario: Exploring a specific niche like "indoor plant care for beginners"
Input:
- Keyword: indoor plant care for beginners
- Number of results: 6
Expected Output:
Title: Indoor Plant Care Guide: Complete Beginner's Handbook
Readability: 78.2
Top keywords: ['indoor', 'plants', 'care', 'beginners', 'watering']
---
Title: How to Care for Indoor Plants: A Step-by-Step Guide
Readability: 75.9
Top keywords: ['indoor', 'plants', 'care', 'light', 'watering']
---
...
project_root/
├── {keyword}_{date}/              # Raw crawled content
│   ├── page_1.html
│   ├── page_2.html
│   └── ...
└── extracted_content/
    └── {keyword}_{date}/          # Processed content
        ├── analysis_results.json
        ├── extracted_text_1.txt
        └── ...
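The layout above can be reproduced in a few lines if you want to locate a run's output from another script. This is a sketch under assumptions: `output_dirs` is a hypothetical helper, and the ISO date format is inferred from the example directory names in this guide rather than read from main.py.

```python
from datetime import date
from pathlib import Path


def output_dirs(keyword, run_date, root="."):
    """Return (raw_dir, extracted_dir) for one run, following the layout above.

    Hypothetical helper; assumes the date suffix uses YYYY-MM-DD.
    """
    run_name = f"{keyword}_{run_date.isoformat()}"
    root = Path(root)
    return root / run_name, root / "extracted_content" / run_name


raw_dir, extracted_dir = output_dirs("sustainable fashion trends", date(2025, 7, 15))
print(raw_dir)        # sustainable fashion trends_2025-07-15
print(extracted_dir)  # on POSIX: extracted_content/sustainable fashion trends_2025-07-15
```

Note that the keyword is used verbatim, so directory names can contain spaces; quote them when working in a shell.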
For each analyzed page, you'll see:
- Title: The main title of the webpage
- Readability Score: Flesch Reading Ease score (0-100, higher = easier to read)
- Top Keywords: The 5 most frequent keywords found in the content
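To make "most frequent keywords" concrete, here is a minimal frequency-count sketch. It is not the extraction logic from main.py: the function name, the lowercasing, and the small stop-word list are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption; main.py may filter differently).
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to"}


def top_keywords(text, n=5):
    """Return the n most frequent non-stop-words in text, lowercased."""
    words = re.findall(r"[a-z][a-z'-]*", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]
```

A plain frequency count like this favors repeated generic terms, which is one reason specific, long-tail search keywords tend to give more useful results.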
- 90-100: Very Easy (5th grade level)
- 80-90: Easy (6th grade level)
- 70-80: Fairly Easy (7th grade level)
- 60-70: Standard (8th-9th grade level)
- 50-60: Fairly Difficult (10th-12th grade level)
- 30-50: Difficult (College level)
- 0-30: Very Difficult (Graduate level)
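The score ranges above translate directly into a lookup. The sketch below pairs the standard Flesch Reading Ease formula with a band lookup; both function names are hypothetical, main.py may compute the score through a library instead, and assigning boundary values (such as exactly 80) to the easier band is an assumption, since the listed ranges overlap at their edges.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Standard Flesch Reading Ease formula, computed from raw counts."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


# (lower bound, label) pairs taken from the list above, highest first.
BANDS = [
    (90, "Very Easy"),
    (80, "Easy"),
    (70, "Fairly Easy"),
    (60, "Standard"),
    (50, "Fairly Difficult"),
    (30, "Difficult"),
]


def readability_band(score):
    """Map a score to its label; boundary scores go to the easier band (assumption)."""
    for lower, label in BANDS:
        if score >= lower:
            return label
    return "Very Difficult"
```

For example, a page scoring 65.4 falls in the "Standard" band, while one scoring 78.2 is "Fairly Easy".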
You can modify the script to customize:
# In main.py, modify these values:
crawler = WebpageCrawler(respect_robots=False) # Change to True to respect robots.txt
crawl_results = crawler.batch_crawl(urls, delay=2.0) # Adjust delay between requests
For processing multiple keywords, you could create a wrapper script:
# batch_process.py
keywords = [
    "artificial intelligence",
    "machine learning",
    "deep learning",
]

for keyword in keywords:
    # Modify main.py to accept the keyword as a parameter,
    # then call it programmatically here.
    print(f"Processing keyword: {keyword}")  # Placeholder until main.py is callable
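The wrapper's comment suggests modifying main.py to accept the keyword as a parameter. One possible way to do that, sketched here with `argparse`, is to accept optional flags and fall back to the existing prompts when they are omitted. The flag names `--keyword` and `--num-results` and both function names are hypothetical; they are not options that main.py currently provides.

```python
import argparse


def parse_args(argv=None):
    """Parse optional --keyword / --num-results flags (hypothetical names)."""
    parser = argparse.ArgumentParser(description="Keyword research pipeline")
    parser.add_argument("--keyword", help="search keyword; prompts if omitted")
    parser.add_argument("--num-results", type=int,
                        help="number of pages to analyze; prompts if omitted")
    return parser.parse_args(argv)


def resolve_inputs(args, ask=input):
    """Fall back to the interactive prompts for any value not given as a flag."""
    keyword = args.keyword or ask("Enter a keyword to search: ")
    num_results = args.num_results or int(ask("Enter the number of results to fetch: "))
    return keyword, num_results
```

With something like this in place, a batch script could invoke the pipeline once per keyword without any interactive input, while running the script with no flags would behave as it does today.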
- API Key Errors
  Error: Invalid API key
  Solution: Check your .env file and Google API key
- No Results Found
  Error: No search results
  Solution: Try different keywords or check CSE configuration
- Permission Errors
  Error: Permission denied creating directory
  Solution: Ensure write permissions in the project directory
- Import Errors
  Error: Module not found
  Solution: Run from project root and check sys.path configuration
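Permission errors in particular can be ruled out before a long crawl starts. The following is a small sketch of such a check using only the standard library; `can_write` is a hypothetical helper, not something main.py is known to include.

```python
import tempfile
from pathlib import Path


def can_write(directory="."):
    """Return True if a temporary file can be created inside directory."""
    try:
        # The file is removed automatically when the context manager exits.
        with tempfile.TemporaryFile(dir=Path(directory)):
            return True
    except OSError:
        # Covers missing directories as well as permission problems.
        return False
```

Calling `can_write()` from the project root before running the pipeline gives a quick yes or no answer without leaving any files behind.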
- Keyword Selection: Use specific, long-tail keywords for better results
- Result Limits: Start with 5-10 results to avoid rate limiting
- Delay Settings: Keep crawler delay at 2+ seconds to be respectful
- Storage Management: Regularly clean up generated directories
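For the storage management point, a dry-run listing of old run directories is a safe starting place. This sketch assumes directory names end in `_YYYY-MM-DD`, as in the examples in this guide; the helper name and the 30-day default are illustrative choices. It only reports candidates and deletes nothing.

```python
import re
from datetime import date, timedelta
from pathlib import Path

# Matches directory names ending in _YYYY-MM-DD (assumed naming scheme).
DATE_SUFFIX = re.compile(r"_(\d{4})-(\d{2})-(\d{2})$")


def stale_run_dirs(root, today, max_age_days=30):
    """List run directories under root whose date suffix is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    stale = []
    for path in sorted(Path(root).iterdir()):
        match = DATE_SUFFIX.search(path.name)
        if not (path.is_dir() and match):
            continue
        try:
            run_date = date(*map(int, match.groups()))
        except ValueError:
            continue  # Suffix looks like a date but is not a valid one.
        if run_date < cutoff:
            stale.append(path)
    return stale
```

Run it once against the project root and once against extracted_content/ to cover both output locations, then review the list before removing anything.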
This script is designed to work with:
- LLM Generator: Pass analyzed content to content generation pipeline
- SEO Tools: Use extracted keywords for SEO optimization
- Content Processor: Further refine and structure the content
After running main.py, you can:
- Analyze Results: Review the readability scores and keywords
- Generate Content: Use the research as input for AI content generation
- SEO Optimization: Apply the discovered keywords to your content strategy
- Competitive Analysis: Compare your content against the analyzed competitors
For issues or questions:
- Check the logs in the generated directories
- Review the extracted content for data quality
- Ensure all dependencies are properly installed
- Verify API credentials and quotas