A Streamlit-based web application that leverages Firecrawl to scrape multiple websites using a user-defined schema. Input your API key, specify the data fields you want to extract (e.g., strings, numbers, booleans), list the URLs, and retrieve structured JSON results—all through an intuitive interface.
- Dynamic Schema Creation: Define custom extraction fields on the fly.
- Multi-URL Scraping: Scrape multiple websites with a single schema in one operation.
- Interactive UI: Real-time schema preview and easy input via Streamlit.
- Error Handling: Per-URL error reporting, so one failing site doesn't abort the whole batch (see the sketch after this list).
- JSON Output: Structured results for easy data processing.
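Under the hood, the flow is: build a Pydantic model from the user's field definitions, hand its JSON Schema to Firecrawl, and collect a result (or an error) per URL. The sketch below illustrates that loop, assuming firecrawl-py's `FirecrawlApp` client and its `scrape_url` method; the extraction parameter layout varies between SDK versions, so treat it as an assumption rather than the app's exact code:

```python
from firecrawl import FirecrawlApp
from pydantic import BaseModel


class PageData(BaseModel):
    """Stand-in schema; the app builds one dynamically from user input."""
    title: str
    description: str


def scrape_all(api_key: str, urls: list[str]) -> dict:
    client = FirecrawlApp(api_key=api_key)  # one client reused for every URL
    results = {}
    for url in urls:
        try:
            # Ask Firecrawl to extract fields matching the JSON Schema
            # (parameter layout assumed from the v1 SDK).
            page = client.scrape_url(
                url,
                params={
                    "formats": ["extract"],
                    "extract": {"schema": PageData.model_json_schema()},
                },
            )
            results[url] = page.get("extract", page)
        except Exception as exc:
            # Per-URL error handling: record the failure and keep going.
            results[url] = {"error": str(exc)}
    return results
```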
- Clone the repository:
git clone https://github.com/yourusername/firecrawl-website-scraper.git
cd firecrawl-website-scraper
- Install dependencies:
pip install streamlit firecrawl-py pydantic
- Ensure you have a valid Firecrawl API key (sign up at Firecrawl if needed).
- Run the app:
streamlit run app.py
- Open your browser to http://localhost:8501.
- Enter your Firecrawl API key.
- Define your schema by adding field names and types (e.g., "title" as String, "price" as Number); a sketch of how these map to a Pydantic model follows this list.
- Click "Update Schema" to save your schema.
- Enter URLs (one per line) in the text area.
- Click "Scrape URLs" to extract data.
- View the JSON results, with each URL’s data or errors displayed.
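The "Update Schema" step has to turn plain field names and type labels into a real Pydantic model at runtime. Pydantic's `create_model` supports exactly this; the sketch below assumes the UI offers String, Number, and Boolean labels, and the `build_model` helper is illustrative rather than the app's actual code:

```python
from pydantic import BaseModel, create_model

# Map the UI's type labels to Python types (labels assumed from the usage steps).
TYPE_MAP = {"String": str, "Number": float, "Boolean": bool}


def build_model(fields: dict[str, str]) -> type[BaseModel]:
    """Build a Pydantic model class from {field_name: type_label} pairs."""
    return create_model(
        "ExtractSchema",
        **{name: (TYPE_MAP[label], ...) for name, label in fields.items()},
    )


# Example: the schema from the sample below.
Schema = build_model({"title": "String", "description": "String"})
print(Schema.model_json_schema())  # the JSON Schema handed to Firecrawl
```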
Input URLs:
https://example.com
https://anothersite.com
Schema:
- title: String
- description: String
Output:
{
  "https://example.com": {
    "title": "Example Site",
    "description": "This is an example"
  },
  "https://anothersite.com": {
    "error": "404 Not Found"
  }
}
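If you want to reuse the results outside the app, the per-URL mapping flattens naturally into a table. A hypothetical post-processing snippet (the app currently only displays results in the browser, so the results.json file name here is illustrative):

```python
import json

import pandas as pd

# Results saved from the app (hypothetical file).
with open("results.json") as f:
    results = json.load(f)

# One row per URL; error-only entries produce an "error" column.
df = pd.DataFrame.from_dict(results, orient="index")
df.to_csv("results.csv", index_label="url")
```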
Feel free to fork this repository, submit pull requests, or open issues for bugs and feature requests. Contributions to enhance functionality (e.g., more field types, export options) are welcome!