""" This module implements the XMLFeedSpider which is the recommended spider to use for scraping from an XML feed.
See documentation in docs/topics/spiders.rst """
""" This class intends to be the base class for spiders that scrape from XML feeds.
You can choose whether to parse the file using the 'iternodes' iterator, an 'xml' selector, or an 'html' selector. In most cases, it's convenient to use iternodes, since it's a faster and cleaner. """
"""This overridable method is called for each result (item or request) returned by the spider, and it's intended to perform any last time processing required before returning the results to the framework core, for example setting the item GUIDs. It receives a list of results and the response which originated that results. It must return a list of results (Items or Requests). """ return results
"""You can override this function in order to make any changes you want to into the feed before parsing it. This function must return a response. """ return response
"""This method must be overriden with your custom spider functionality""" if hasattr(self, 'parse_item'): # backward compatibility return self.parse_item(response, selector) raise NotImplementedError
"""This method is called for the nodes matching the provided tag name (itertag). Receives the response and an XPathSelector for each node. Overriding this method is mandatory. Otherwise, you spider won't work. This method must return either a BaseItem, a Request, or a list containing any of them. """
for selector in nodes: ret = self.parse_node(response, selector) if isinstance(ret, (BaseItem, Request)): ret = [ret] if not isinstance(ret, (list, tuple)): raise TypeError('You cannot return an "%s" object from a spider' % type(ret).__name__) for result_item in self.process_results(response, ret): yield result_item
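The normalization step above (wrapping a lone item or request in a list, and rejecting any other scalar) can be sketched in isolation. This is a minimal illustration using a stand-in class rather than Scrapy's BaseItem and Request; the names are hypothetical:

```python
class FakeItem:
    """Stand-in for a Scrapy item, for illustration only."""
    pass

def normalize(ret, allowed=(FakeItem,)):
    # A single allowed object is wrapped in a one-element list,
    # so callers can always iterate over the result.
    if isinstance(ret, allowed):
        ret = [ret]
    # Anything else that is not already a list/tuple is a spider bug.
    if not isinstance(ret, (list, tuple)):
        raise TypeError('You cannot return an "%s" object from a spider'
                        % type(ret).__name__)
    return ret
```

This is why `parse_node` may return either a single object or a sequence: the base class flattens both shapes before handing results to `process_results`.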
    def parse(self, response):
        if not hasattr(self, 'parse_node'):
            raise NotConfigured('You must define parse_node method in order to scrape this XML feed')

        response = self.adapt_response(response)
        if self.iterator == 'iternodes':
            nodes = xmliter(response, self.itertag)
        elif self.iterator == 'xml':
            selector = XmlXPathSelector(response)
            self._register_namespaces(selector)
            nodes = selector.select('//%s' % self.itertag)
        elif self.iterator == 'html':
            selector = HtmlXPathSelector(response)
            self._register_namespaces(selector)
            nodes = selector.select('//%s' % self.itertag)
        else:
            raise NotSupported('Unsupported node iterator')

        return self.parse_nodes(response, nodes)
    def _register_namespaces(self, selector):
        for (prefix, uri) in self.namespaces:
            selector.register_namespace(prefix, uri)
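Conceptually, the 'iternodes' iterator walks the feed and yields one node per occurrence of `itertag`. A rough stdlib approximation of that behavior (not Scrapy's actual `xmliter`, which scans the raw response text) can be written with `xml.etree.ElementTree`:

```python
import xml.etree.ElementTree as ET

def iter_nodes(xml_text, itertag):
    # Yield every element whose tag matches itertag, anywhere in the
    # document -- similar in spirit to xmliter(response, self.itertag).
    root = ET.fromstring(xml_text)
    if root.tag == itertag:
        yield root
    for node in root.iter(itertag):
        if node is not root:
            yield node

feed = ("<rss><channel>"
        "<item><title>a</title></item>"
        "<item><title>b</title></item>"
        "</channel></rss>")
titles = [n.findtext('title') for n in iter_nodes(feed, 'item')]
# titles == ['a', 'b']
```

Each yielded node plays the role of the `selector` argument that `parse_nodes` passes to `parse_node`.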
"""Spider for parsing CSV feeds. It receives a CSV file in a response; iterates through each of its rows, and calls parse_row with a dict containing each field's data.
You can set some options regarding the CSV file, such as the delimiter and the file's headers. """
"""This method has the same purpose as the one in XMLFeedSpider""" return results
"""This method has the same purpose as the one in XMLFeedSpider""" return response
"""This method must be overriden with your custom spider functionality""" raise NotImplementedError
"""Receives a response and a dict (representing each row) with a key for each provided (or detected) header of the CSV file. This spider also gives the opportunity to override adapt_response and process_results methods for pre and post-processing purposes. """
for row in csviter(response, self.delimiter, self.headers): ret = self.parse_row(response, row) if isinstance(ret, (BaseItem, Request)): ret = [ret] if not isinstance(ret, (list, tuple)): raise TypeError('You cannot return an "%s" object from a spider' % type(ret).__name__) for result_item in self.process_results(response, ret): yield result_item
    def parse(self, response):
        if not hasattr(self, 'parse_row'):
            raise NotConfigured('You must define parse_row method in order to scrape this CSV feed')
        response = self.adapt_response(response)
        return self.parse_rows(response)
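The row iteration performed by csviter can be approximated with the stdlib csv module. This sketch assumes (as the docstring suggests, though this is not Scrapy's implementation) that explicitly provided headers take precedence and that, when headers is None, the first row is detected as the header row:

```python
import csv
import io

def iter_rows(csv_text, delimiter=None, headers=None):
    # Yield one dict per data row, keyed by the provided headers or,
    # when headers is None, by the values found in the first row.
    kwargs = {'delimiter': delimiter} if delimiter else {}
    reader = csv.reader(io.StringIO(csv_text), **kwargs)
    if headers is None:
        headers = next(reader)
    for row in reader:
        yield dict(zip(headers, row))

rows = list(iter_rows("id,name\n1,foo\n2,bar\n"))
# rows == [{'id': '1', 'name': 'foo'}, {'id': '2', 'name': 'bar'}]
```

Each yielded dict is what a `CSVFeedSpider` subclass receives as the `row` argument of `parse_row`.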