
Generalization of link extractors (added classes and allow to define through specs) and added support for feed start urls #73

Merged: 11 commits merged into scrapy:master on Aug 12, 2013

Conversation

@kalessin (Contributor) commented Aug 1, 2013

No description provided.

type : string
Defines how to interpret the string in the 'value' attribute. Current supported values for this attribute are:

* ``column`` - value is an integer index indicating a column number. Link source is regarded as csv formatted ``scrapy.http.TextResponse``.
A contributor commented:

What about using the name "csv" or "csv_column" for this type? Just "column" seems a bit too generic. I think we can also make the column optional, with a default value of 0.

kalessin (Contributor, Author) replied:

+1 to csv_column. Just "csv" does not describe what the value field contains, which is the point of the 'type' field (except for the 'html' shortcut, where the given value is ignored).

About the value to pass: it cannot be optional, for consistency with the rest. The link extractor is described using two required fields, type and value, and that cannot depend on the specific type passed. In the case of the 'html' shortcut, which does not need a value, the value is ignored. But we cannot ignore the value in the case of csv_column.
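
For illustration, a minimal sketch of what such specs could look like, written as Python dicts. The two required fields, "type" and "value", are as described in this thread; "csv_column" is the rename proposed here, and the concrete values are assumptions.

```python
# Sketch only: every link extractor spec carries the two required fields
# "type" and "value". "csv_column" is the proposed rename of "column".
html_extractor = {
    "type": "html",
    "value": None,  # ignored for the "html" shortcut, but still present
}

csv_extractor = {
    "type": "csv_column",  # proposed rename of "column"
    "value": 0,            # integer index of the CSV column holding the URLs
}
```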

@kalessin (Contributor, Author) commented Aug 2, 2013

I added three new commits. Please check.

@andresp99999 (Contributor) commented:

LGTM

@@ -193,6 +194,9 @@ Attributes:
start_urls : list of strings
The list of URLs the spider will start crawling from

feed_start_urls : list of objects : optional
A list of feed objects. Used to define non-HTML feeds start urls.
A member commented:

By the field description I only get that it's for non-HTML urls; I don't know which types are supported nor how it differs from the standard start_urls. Shouldn't it be called nohtml_start_urls? Can slybot not get the type based on the Content-Type header?

I'm talking specifically about the documentation, without yet having looked at the code. But that's how documentation should work.

kalessin (Contributor, Author) replied:

I don't think nohtml_start_urls is a good name. The thing to fix is the text: remove the non-HTML part (the field accepts HTML ones anyway). In fact start_urls could be considered backward compatibility, but in new usage it could be merged with feed_start_urls, which should in that case change its name. Maybe start_requests is a better name (as each entry is not only a URL, but also a link extractor).

About using the Content-Type: it is not useful, because we still need to define which link extractor to apply.
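
As a rough illustration of that point, a feed_start_urls entry as discussed here would pair a URL with the link extractor to apply to its response; the field names below are assumptions, not the final spec.

```python
# Hypothetical shape of a feed_start_urls entry, per the discussion above:
# a URL plus the link extractor spec to apply. Field names are assumed
# for illustration only.
feed_start_urls = [
    {
        "url": "http://example.com/products.csv",
        "link_extractor": {"type": "csv_column", "value": 0},
    },
]
```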

@kalessin (Contributor, Author) commented Aug 2, 2013

There already exists a field called 'init_requests', for adding 'login' and 'form' requests. It seems natural to add the new start requests there: add a new request type for this kind ('start') and a new field to the 'Request' spec, link_extractor, which would be optional, both for compatibility and because it seems natural to leave the HTML one as the default.
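
A sketch of how that proposal could look in a spec. The 'start' type and the optional link_extractor field are the proposal itself; every other field name here (the login fields, "url") is an assumption for illustration.

```python
# Illustrative only: 'start' and its optional link_extractor field are the
# proposal above; the remaining field names are assumptions.
init_requests = [
    {"type": "login", "loginurl": "http://example.com/login",
     "username": "user", "password": "secret"},
    {"type": "start",
     "url": "http://example.com/products.csv",
     "link_extractor": {"type": "csv_column", "value": 0}},
    # no link_extractor: falls back to the default (HTML) behaviour
    {"type": "start", "url": "http://example.com/catalog"},
]
```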

@kalessin (Contributor, Author) commented Aug 5, 2013

I have moved feed_start_urls into init_requests by adding a new request type, 'start'. This is more convenient and simpler from the specification's point of view. Please check the new commit changes and let me know what you think.

This new start request admits an optional link_extractor parameter. If present, the callback for the request will be one generated to extract links only, using the defined link extractor. If not present, the request callback will be the default parse method (which extracts links and items), so it will behave exactly like a request created from start urls. What do you think about this approach?
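
A minimal sketch of that callback selection, not the PR's actual code: the helper, the extractor interface (links_to_follow) and the field names are assumptions used only to illustrate the described behaviour.

```python
from scrapy.http import Request

# Sketch only: if a 'start' request carries a link_extractor spec, build a
# links-only callback around it; otherwise fall back to the spider's
# default parse, exactly like a plain start url.
def request_from_start_spec(spider, spec, build_linkextractor):
    url = spec["url"]  # field name assumed for illustration
    le_spec = spec.get("link_extractor")
    if le_spec is None:
        # no link extractor defined: behave like a request from start_urls
        return Request(url, callback=spider.parse)

    link_extractor = build_linkextractor(le_spec)  # factory supplied by the caller

    def parse_links_only(response):
        # links_to_follow() is assumed here as the extractor interface;
        # extracted links are handed to the spider's normal parsing logic
        for link in link_extractor.links_to_follow(response):
            yield Request(link.url, callback=spider.parse)

    return Request(url, callback=parse_links_only)
```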

@@ -86,15 +78,32 @@ def __init__(self, name, spec, item_schemas, all_extractors, **kw):

         self.login_requests = []
         self.form_requests = []
-        for rdata in spec.get("init_requests", []):
+        self._start_requests = []
+        self._create_init_requests(spec.get("init_requests", []), **kw)
A member commented:

Don't propagate **kw deep; that always ends up in more complex and harder-to-read code. Define which [keyword] arguments each function receives and pass those.

kalessin (Contributor, Author) replied:

Good catch. The error was moving a line from __init__ into this method; it should have stayed in __init__.
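
A self-contained sketch of the point being made about **kw; the names are illustrative, not the PR's code.

```python
# Illustrative, not the PR's code: consume keyword arguments where they
# arrive (in __init__) and pass helpers only the explicit arguments they
# need, instead of forwarding **kw down the call chain.
class ExampleSpider:
    def __init__(self, name, spec, **kw):
        self.name = name
        start_urls = kw.pop("start_urls", [])  # interpreted here, once
        self._start_requests = []
        # the helper receives explicit arguments, not an opaque **kw
        self._create_init_requests(spec.get("init_requests", []), start_urls)

    def _create_init_requests(self, init_requests, start_urls):
        for rdata in init_requests:
            self._start_requests.append(rdata)
        self._start_requests.extend({"url": url} for url in start_urls)
```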

@pablohoffman (Member) commented:

I like the new approach more, especially because it reuses the already existing init_requests. I still wonder if there could be a way to merge those (start & init requests). I also left a couple of comments.

@kalessin (Contributor, Author) commented Aug 8, 2013

«I still wonder if there could be a way to merge those (start & init requests).»

Not sure what you mean by this. init_requests is a list that contains requests of different types: login, form and, now, 'start'. So in fact 'start' requests are part of init_requests.

@kalessin (Contributor, Author) commented Aug 8, 2013

Pushed two new commits fixing the observations made by Pablo.

pablohoffman added a commit that referenced this pull request Aug 12, 2013
Generalization of link extractors (added classes and allow to define through specs) and added support for feed start urls
@pablohoffman pablohoffman merged commit 31a3bb8 into scrapy:master Aug 12, 2013