This project automates the extraction, processing, and organization of scholarship data from websites using Selenium for web scraping and Google's Gemini AI for data structuring.
- Web Scraping: Extracts scholarship data from target websites
- AI Processing: Uses Gemini AI to structure raw data
- Excel Export: Saves processed data in organized Excel format
- Automatic Class Extraction: Identifies relevant HTML classes dynamically
- Error Handling: Robust retry mechanisms for API and web operations
- Python 3.8+
- Chrome browser
- Google Gemini API key
- Clone the repository:

  ```bash
  git clone https://github.com/qed42/scholarship-data-extractor
  cd scholarship-scraper
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up your Gemini API key:
  - Create a `.env` file in the project root
  - Add your API key:

    ```
    GEMINI_API_KEY=your_api_key_here
    ```
Run the main script:

```bash
python scholarship_scraper.py
```

The script will:
- Launch the Chrome browser
- Extract scholarship data
- Process it with Gemini AI
- Save the results to `scholarship_data.xlsx`
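The final export step can be sketched with `openpyxl`. The column names and the `save_to_excel` helper below are illustrative, not the script's actual API:

```python
from openpyxl import Workbook


def save_to_excel(rows, path="scholarship_data.xlsx"):
    """Write a list of scholarship dicts to an Excel workbook."""
    wb = Workbook()
    ws = wb.active
    ws.title = "Scholarships"
    # Hypothetical columns; the real script may use different fields.
    headers = ["name", "amount", "deadline", "eligibility"]
    ws.append([h.title() for h in headers])
    for row in rows:
        ws.append([row.get(h, "") for h in headers])
    wb.save(path)
```

Missing fields are written as empty cells rather than raising, which keeps the export robust when the AI output is incomplete.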
Edit `scholarship_scraper.py` to customize:
- `WEBSITE_URL`: Target scholarship website
- `KEYWORDS`: Class name keywords for filtering
- `OUTPUT_EXCEL`: Output file path
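For example, the configuration block might look like this (the values are placeholders, not the script's defaults):

```python
# Placeholder configuration values; adjust for your target site.
WEBSITE_URL = "https://example.com/scholarships"
KEYWORDS = ["scholarship", "award", "grant"]
OUTPUT_EXCEL = "scholarship_data.xlsx"
```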
The `requirements.txt` file includes:

```
selenium>=4.0
google-generativeai>=0.3.0
openpyxl>=3.0
webdriver-manager>=3.0
python-dotenv>=0.19
```
ChromeDriver Issues:
- Ensure Chrome is updated
- Run `python -m webdriver_manager update`
API Rate Limits:
- Script includes exponential backoff
- Consider upgrading API quota if needed
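The exponential-backoff idea can be sketched as a small wrapper; `with_backoff` is an illustrative helper, not the script's actual retry code:

```python
import time


def with_backoff(call, retries=5, base_delay=1.0):
    """Retry `call`, doubling the delay after each failure."""
    for attempt in range(retries):
        try:
            return call()
        except Exception:
            if attempt == retries - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Doubling the delay spreads retries out so a rate-limited API gets time to recover instead of being hammered at a fixed interval.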
Website Changes:
- Update `KEYWORDS` if class names change
- Adjust wait times if the website is slow
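The keyword matching behind `KEYWORDS` can be sketched as a simple case-insensitive substring check against an element's `class` attribute; `matches_keywords` is illustrative, not the script's actual function:

```python
def matches_keywords(class_attr: str, keywords) -> bool:
    """True if any keyword appears in the element's class attribute."""
    classes = class_attr.lower()
    return any(kw.lower() in classes for kw in keywords)
```

Because it matches substrings rather than exact class names, a rename from `scholarship-card` to `scholarship-tile` would still be caught, which is what makes the class extraction resilient to minor site changes.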
- Fork the repository
- Create your feature branch: `git checkout -b feature/AmazingFeature`
- Commit your changes: `git commit -m 'Add some AmazingFeature'`
- Push to the branch: `git push origin feature/AmazingFeature`
- Open a pull request