Add allowed_domains variable to spider definition #59
I made the following changes:
@@ -192,6 +193,9 @@ Attributes:
start_urls : list of strings
The list of URLs the spider will start crawling from

allowed_domains : list of strings : optional
This variable defines the list of domains that can be crawled. It can have the following values: "all" will crawl any domain or it can be a list of domains. If this variable is not set then the list of allowed domains is extracted from the start urls.
"This variable defines" is a bit redundant, if you look at how the other attributes are described.
"any" sounds better than "all" to me, for allowing any domain.
But I have another suggestion: what about supporting wildcards? Using * to denote all domains would also allow things like *.scrapinghub.com. This way we'd also keep allowed_domains always a list (single value types add clarity IMO).
I realize this suggestion would involve changes to Scrapy's OffsiteMiddleware, but it would be a welcome addition I think, and a relatively easy one. Perhaps it's easier (and more flexible) to support regexes instead of wildcard matching?
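To make the wildcard idea concrete, here is a rough sketch of shell-style matching against a request's host. It is only an illustration: is_allowed is a hypothetical helper, not part of Scrapy, and this is not how OffsiteMiddleware is implemented.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Hypothetical helper: True if the URL's host matches any pattern.

    Patterns use shell-style wildcards, so "*" matches every host and
    "*.scrapinghub.com" matches any subdomain of scrapinghub.com.
    """
    # hostname is already lowercased by urlparse; fall back to "" if missing
    host = urlparse(url).hostname or ""
    return any(fnmatchcase(host, pattern.lower()) for pattern in allowed_domains)
```

Note that with plain wildcard matching, *.scrapinghub.com does not match the bare scrapinghub.com host, so both entries would be needed in the list. A regex-based variant could cover both cases with one pattern, which is one argument for the regex option.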
I finally got time to get to this. I made the following changes to the PR:
Add allowed_domains variable to spider definition
The allowed_domains setting lets us control the domains filtered by the OffsiteMiddleware. If it is not set, the previous behavior applies (allowed_domains is extracted from start_urls).
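The fallback described here can be sketched roughly as follows. This is a minimal illustration under assumptions, not the code from this PR: resolve_allowed_domains is a hypothetical name, and the real change may extract domains differently.

```python
from urllib.parse import urlparse


def resolve_allowed_domains(start_urls, allowed_domains=None):
    """Hypothetical helper illustrating the described fallback.

    If allowed_domains is given, use it as is. Otherwise derive the
    list from the hosts of start_urls, keeping first-seen order.
    """
    if allowed_domains is not None:
        return list(allowed_domains)

    domains = []
    for url in start_urls:
        host = urlparse(url).hostname
        # Skip URLs without a host and avoid duplicate entries
        if host and host not in domains:
            domains.append(host)
    return domains
```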