Add allowed_domains variable to spider definition #59

Merged 1 commit into scrapy:master on May 14, 2013

Conversation

andresp99999 (Contributor)

The allowed_domains setting allows us to control the domains filtered by the OffsiteMiddleware. If it is not set, the previous behavior applies (allowed_domains is extracted from start_urls).
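For illustration, a minimal sketch of the kind of filtering this setting drives; this is not Scrapy's actual OffsiteMiddleware, and the function name and call shape are made up. A request is followed only when its host equals, or is a subdomain of, one of the allowed domains.

from urllib.parse import urlparse  # the 2013 codebase would use Python 2's `from urlparse import urlparse`

def should_follow(url, allowed_domains):
    # Allow exact matches and subdomains of any allowed domain.
    host = urlparse(url).netloc
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

print(should_follow("http://blog.example.com/post", ["example.com"]))  # True
print(should_follow("http://other.org/page", ["example.com"]))         # False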

@andresp99999 (Contributor, Author)

I made the following changes:

  • Add test for previous behavior.
  • Add test to check for 'all' value in allowed_domains.
  • Refactor code a bit as suggested.
  • Refactor line in tests to make it shorter than 80 characters.

@@ -192,6 +193,9 @@ Attributes:
start_urls : list of strings
The list of URLs the spider will start crawling from

allowed_domains : list of strings : optional
This variable defines the list of domains that can be crawled. It can have the following values: "all" will crawl any domain or it can be a list of domains. If this variable is not set then the list of allowed domains is extracted from the start urls.
Member

"This variable defines" is a bit redundant, if you look at how the other attributes are described.

Member

"any" sounds better than "all" to me, for allowing any domain.

But I have another suggestion: what about supporting wildcards? Using * to denote all domains would also allow things like *.scrapinghub.com. This way we'd also keep allowed_domains always a list (a single value type adds clarity IMO).

Member

I realize this suggestion would involve changes to Scrapy's OffsiteMiddleware, but it would be a welcome addition I think, and a relatively easy one. Perhaps it's easier (and more flexible) to support regexes instead of wildcard matching?
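For what it's worth, here is a hedged sketch of how the wildcard idea from the two comments above could be implemented by translating patterns to regexes; the helper names are hypothetical, and this is not Scrapy's middleware code.

import re

def compile_domain_patterns(allowed_domains):
    # Escape each pattern, then turn the escaped "*" back into ".*",
    # so "*.scrapinghub.com" matches any subdomain of scrapinghub.com
    # and a bare "*" matches every host.
    return [re.compile("^" + re.escape(p).replace(r"\*", ".*") + "$", re.IGNORECASE)
            for p in allowed_domains]

def is_allowed(host, patterns):
    return any(p.match(host) for p in patterns)

patterns = compile_domain_patterns(["*.scrapinghub.com"])
print(is_allowed("app.scrapinghub.com", patterns))  # True
print(is_allowed("example.com", patterns))          # False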

@andresp99999 (Contributor, Author)

I finally got time to get to this. I made the following changes to the PR:

  • Simplified the documentation as suggested.
  • Changed the definition a bit based on Pablo's suggestion; it currently works like this (see the sketch after this list):
    • If not set, the previous behavior is kept (use the list of start_urls).
    • If set to an empty list, all domains are allowed (similar to Scrapy's behavior when the value is empty).
    • If set to a list, use the domains in the list.
  • I like the idea of using wildcards, etc., but that would require changes to the Scrapy middleware. I think that can be done a little later (it should be a separate PR in Scrapy first anyway). I'll take a crack at it some time soon. For now I'd like to keep this PR simple.
  • Added the line to the validation schema as suggested.
  • Refactored the names of the test spiders to make them more meaningful.
  • Squashed all the commits into a single one.
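A compact sketch of the resolution order described in the list above; the function name and the dict-shaped spec are illustrative, not the project's actual API.

from urllib.parse import urlparse

def effective_allowed_domains(spec):
    domains = spec.get("allowed_domains")
    if domains is None:
        # Not set: keep the previous behavior, derive the domains from start_urls.
        return [urlparse(u).netloc for u in spec.get("start_urls", [])]
    if not domains:
        # Empty list: no offsite restriction at all.
        return None
    # Non-empty list: use the domains exactly as given.
    return domains

spec = {"start_urls": ["http://www.example.com/"]}
print(effective_allowed_domains(spec))  # ['www.example.com']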

kalessin added a commit that referenced this pull request on May 14, 2013:
Add allowed_domains variable to spider definition
@kalessin merged commit 2e4d514 into scrapy:master on May 14, 2013