Generalization of link extractors (added classes and allow to define through specs) and added support for feed start urls #73
Conversation
type : string
Defines how to interpret the string in the 'value' attribute. Currently supported values for this attribute are:

* ``column`` - value is an integer index indicating a column number. The link source is treated as a CSV-formatted ``scrapy.http.TextResponse``.
What about using the name "csv" or "csv_column" for this type? Just "column" seems a bit too generic. I think we could also make the column optional, with a default value of 0.
+1 to csv_column. Plain "csv" does not describe what the value field contains, which is the idea of the 'type' field (except for the 'html' shortcut, where the given value is ignored).
About the value to pass: it cannot be optional, for consistency with the rest. The link extractor is described using two required fields, type and value, and that cannot depend on the specific type passed. In the case of the 'html' shortcut, which does not need a value, the value is ignored. But we cannot ignore the value in the case of csv_column.
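The shape being discussed can be sketched as spec fragments like the following. This is a hedged illustration based only on the conversation above (field names `type` and `value` come from the thread; the authoritative schema lives in slybot's schemas.json and may differ):

```python
# Hypothetical link-extractor spec fragments, per the discussion above.
# Both 'type' and 'value' are required for consistency across all types.
html_extractor = {
    "type": "html",        # shortcut: 'value' is ignored for this type
    "value": None,
}

csv_extractor = {
    "type": "csv_column",  # proposed name, instead of the too-generic "column"
    "value": 0,            # integer column index in the CSV response
}

for spec in (html_extractor, csv_extractor):
    # Every extractor spec carries both required fields.
    assert "type" in spec and "value" in spec
```

The point of keeping `value` required even for `html` is that consumers of the spec can validate every extractor against one schema, rather than branching per type.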
I added three new commits. Please check.
LGTM
@@ -193,6 +194,9 @@ Attributes:
start_urls : list of strings
    The list of URLs the spider will start crawling from

feed_start_urls : list of objects : optional
    A list of feed objects, used to define non-HTML feed start URLs.
From the field description I only get that it's for non-HTML URLs; I don't know which types are supported nor how it differs from the standard start_urls. Shouldn't it be called nohtml_start_urls? Can slybot not infer the type from the Content-Type header?
I'm talking specifically about the documentation, without yet looking at the code. But that's how documentation should work.
I don't think nohtml_start_urls is a good name. The thing to fix is the text: remove the non-HTML part (since the field accepts HTML ones as well). In fact, start_urls could be considered backward compatibility; in new usage it could be merged with feed_start_urls, which should in that case change its name. Maybe start_requests is a better name (as each entry is not only a URL, but also a link extractor).
About using the Content-Type: it is not useful, because we still need to define which link extractor to apply.
There already exists a field called 'init_requests', for adding 'login' and 'form' requests. It seems natural to add the new start requests there: add a new type of request for this kind ('start'), and add a new field to the 'Request' spec, link_extractor, which would be optional, both for compatibility and because it seems natural to leave the HTML one as the default.
I have moved feed_start_urls into init_requests by adding a new request type, 'start'. This is more convenient and simpler from the specification's point of view. Please check the new commit changes and let me know what you think. This new start request admits an optional link_extractor parameter. If present, the callback for the request will be one generated to extract links only, using the defined link extractor. If not present, the request callback will be the default parse method (which extracts links and items), so it will behave exactly like a request created from start_urls. What do you think about this approach?
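The arrangement described above can be sketched as a spec fragment. This is an illustrative guess at the shape (the 'login' field names and URLs are hypothetical; only the 'start' type and the optional link_extractor field come from the discussion, and the real schema in schemas.json may differ):

```python
# Hypothetical 'init_requests' list mixing the existing request types
# with the new 'start' type discussed above.
init_requests = [
    {"type": "login",                       # pre-existing request type
     "loginurl": "http://example.com/login",
     "username": "user", "password": "secret"},
    {"type": "start",                       # new type: a start request
     "url": "http://example.com/feed.csv",
     # Optional: when present, the generated callback extracts links only.
     "link_extractor": {"type": "csv_column", "value": 0}},
    {"type": "start",                       # no link_extractor: behaves
     "url": "http://example.com/"},         # like a plain start_urls entry
]

# A spider would dispatch on the 'type' field when building requests.
start_requests = [r for r in init_requests if r["type"] == "start"]
```

Keeping 'start' inside init_requests means the spec needs no new top-level field, and the HTML default keeps old specs valid.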
@@ -86,15 +78,32 @@ def __init__(self, name, spec, item_schemas, all_extractors, **kw):

        self.login_requests = []
        self.form_requests = []
        for rdata in spec.get("init_requests", []):
        self._start_requests = []
        self._create_init_requests(spec.get("init_requests", []), **kw)
Don't propagate **kw deep; that always ends up in more complex and harder-to-read code. Define which keyword arguments each function receives and pass those explicitly.
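The advice above can be illustrated with a small sketch (function and parameter names here are hypothetical, not slybot's actual API):

```python
# Opaque: the reader cannot tell from the signature what this function
# actually needs, and typos in keyword names pass silently.
def create_init_requests_opaque(spec, **kw):
    return spec.get("init_requests", []), kw

# Explicit: the signature documents exactly which arguments are used,
# and unexpected keywords fail loudly at the call site.
def create_init_requests(spec, start_urls=()):
    return spec.get("init_requests", []), list(start_urls)

reqs, urls = create_init_requests(
    {"init_requests": []}, start_urls=["http://example.com"]
)
```

The explicit version also makes each layer independently testable, since its inputs are visible rather than threaded through a catch-all dict.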
Good catch. The error was that a line was moved from __init__ into this method when it should have stayed in __init__.
I like the new approach more, especially because it reuses the already existing
«I still wonder if there could be a way to merge those (start & init requests).» I'm not sure what you mean by this. init_requests is a list that contains requests of different types: login, form, and now 'start'. So in fact 'start' requests are already part of init_requests.
Pushed two new commits fixing observations made by Pablo |
extractor as subclass of xml one
for api compatibility with rest of link extractors
specs reference and schemas.json