This is the repository of the scraper for the DocSearch project. You can run it on your own, or ask us to crawl your documentation.
DocSearch is made of 3 different projects:
- The front-end of DocSearch: https://github.com/algolia/docsearch
- The scraper which browses & indexes web pages: https://github.com/algolia/docsearch-scraper
- The configurations for the scraper: https://github.com/algolia/docsearch-configs
This project is a collection of submodules, each one in its own directory:
- cli: A command line tool to manage DocSearch. Run `./docsearch` and follow the steps
- deployer: A tool used by Algolia to deploy the configurations in our Mesos infrastructure
- doctor: A monitoring/repair tool to check if the indices built by the scraper are in good shape
- playground: An HTML page to easily test DocSearch indices
- scraper: The core of the scraper. It reads the configuration file, fetches the web pages and indexes them in Algolia.
- Install Python (`brew install python` on macOS, which also installs pip, or `apt-get install python` on Debian/Ubuntu)
- Or any other way that works for you
git clone [email protected]:algolia/documentation-scraper.git
cd documentation-scraper
pip install -r requirements.txt
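If you prefer not to install the dependencies globally, a virtualenv works just as well (an optional sketch, not required by the project):

```shell
# optional: isolate the scraper's dependencies in a virtualenv
pip install virtualenv
virtualenv venv
. venv/bin/activate
pip install -r requirements.txt
```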
- Download geckodriver from https://github.com/mozilla/geckodriver/releases and extract it
- Rename the `geckodriver` executable to `wires` and make it accessible in your PATH
- Depending on what you want to do, you might also need to install docker, especially to run the tests
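On Linux or macOS, the geckodriver step looks roughly like this (a sketch: the archive name depends on the release and platform you downloaded, and any directory already in your PATH will do):

```shell
tar -xzf geckodriver-*.tar.gz    # extract the downloaded archive
mv geckodriver wires             # rename the executable to wires
sudo mv wires /usr/local/bin/    # make it accessible in the PATH
```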
Create a file named `.env` at the root of the project:

```
APPLICATION_ID=
API_KEY=
```
To get the APPLICATION_ID and API_KEY, you need to create an [Algolia account](https://www.algolia.com/users/sign_up).
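Both values come from your Algolia dashboard. Note that the scraper pushes records to your indices, so the key must have write permissions; the search-only key is not enough. Filled in, the file looks like this (placeholder values):

```
APPLICATION_ID=YourApplicationID
API_KEY=YourWriteAPIKey
```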
You should be able to do everything with the docsearch CLI tool:
```
$ ./docsearch
Docsearch CLI

Usage:
  ./docsearch command [options] [arguments]

Options:
  --help  Display help message

Available commands:
  test                  Run tests
  playground            Launch the playground
  run                   Run a config
 config
  config:bootstrap      Bootstrap a docsearch config
  config:docker-run     Run a config using docker
 docker
  docker:build-scraper  Build scraper images (dev, prod, test)
```
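For example, once docker is installed you can check that everything is wired up by running the test suite:

```shell
./docsearch test
```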
To use DocSearch, the first thing you need is to create the config for the crawler. For more details about configs, check out https://github.com/algolia/docsearch-configs: you'll find the list of available options and a lot of live, working examples.
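As a starting point, a minimal config looks roughly like this (a sketch: the index name, URL, and selectors are placeholders, and the authoritative option list lives in the docsearch-configs repository):

```json
{
  "index_name": "example",
  "start_urls": ["https://www.example.com/docs/"],
  "selectors": {
    "lvl0": "h1",
    "lvl1": "h2",
    "lvl2": "h3",
    "text": "p"
  }
}
```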
Without docker:

```shell
$ ./docsearch run /path/to/your/config
```

With docker:

```shell
$ ./docsearch docker:build-scraper                     # build the scraper images
$ ./docsearch config:docker-run /path/to/your/config   # run the config in a docker container
```
Open `./playground/index.html` in your browser, enter your credentials and your index name, and type some queries to make sure everything is OK.
Just add this snippet to your documentation:
<link rel="stylesheet" href="//cdn.jsdelivr.net/docsearch.js/2/docsearch.min.css" />
<script type="text/javascript" src="//cdn.jsdelivr.net/docsearch.js/2/docsearch.min.js"></script>
var search = docsearch({
apiKey: '<API_KEY>',
indexName: '<INDEX_NAME>',
inputSelector: '<YOUR_INPUT_DOM_SELECTOR>',
debug: false
});
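Note that `inputSelector` must match a search input that already exists in your page. With this hypothetical markup, for instance, you would pass `inputSelector: '#search-input'`:

```html
<input type="text" id="search-input" placeholder="Search the docs" />
```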
And you are good to go!
If you are an Algolia employee and want to manage a DocSearch account, you'll need to add the following variables to your `.env` file:

```
WEBSITE_USERNAME=
WEBSITE_PASSWORD=
SLACK_HOOK=
SCHEDULER_USERNAME=
SCHEDULER_PASSWORD=
DEPLOY_KEY=
```
The CLI will then expose more commands for you to run.
For some actions, like deploying, you might need to use different credentials than the ones in the `.env` file. To do so, override them when running the CLI tool:

```shell
APPLICATION_ID= API_KEY= ./docsearch deploy:configs
```