self.crawl in Detail

Official Documentation

The PySpider function for making network requests introduced here is self.crawl, and it is very powerful.

For a detailed explanation of every parameter, refer to the official self.crawl documentation.
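All of the self.crawl calls shown below live inside a pyspider handler class. For context, a minimal skeleton along the lines of pyspider's default project template (the URL here is just a placeholder):

    from pyspider.libs.base_handler import *

    class Handler(BaseHandler):

        @every(minutes=24 * 60)
        def on_start(self):
            # Schedule the first request; index_page runs when it finishes.
            self.crawl('http://httpbin.org/get', callback=self.index_page)

        def index_page(self, response):
            print(response.url)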

Adding Query Parameters to a GET Request

Pass a dict to the params argument of self.crawl, and PySpider will automatically encode the dict into the URL's query string.

Official example:

    self.crawl('http://httpbin.org/get', callback=self.callback, params={'a': 123, 'b': 'c'})

which is equivalent to:

    self.crawl('http://httpbin.org/get?a=123&b=c', callback=self.callback)
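Values in params are encoded for the URL automatically, so strings containing spaces or other special characters can be passed as-is. A small sketch (searchCallback is a hypothetical callback name):

    # 'q' contains a space; PySpider encodes the dict into the query
    # string, producing a URL like http://httpbin.org/get?q=hello+world&page=2
    self.crawl('http://httpbin.org/get',
        callback=self.searchCallback,
        params={"q": "hello world", "page": 2}
    )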

An example from my own earlier code:

    # TopSignTopUrl and getMoreUserCallback are defined elsewhere in the spider.
    topSignTopParam = {
        "start": 0,
        "rows": 20
    }
    self.crawl(TopSignTopUrl,
        callback=self.getMoreUserCallback,
        params=topSignTopParam,
        save={
            "baseUrl": TopSignTopUrl,
            "isNeedCheckNextPage": True,
            "curPageParam": topSignTopParam
        }
    )
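The save dict above carries everything needed to decide whether to fetch the next page. A minimal sketch of what the corresponding callback could look like, assuming the API returns JSON with a total-count field (the field name "total" is hypothetical):

    def getMoreUserCallback(self, response):
        saved = response.save
        curPageParam = saved["curPageParam"]

        # Hypothetical: assume the JSON response reports the total row count.
        total = response.json.get("total", 0)
        nextStart = curPageParam["start"] + curPageParam["rows"]

        if saved["isNeedCheckNextPage"] and nextStart < total:
            # Request the next page, carrying the updated paging state along.
            nextPageParam = dict(curPageParam, start=nextStart)
            self.crawl(saved["baseUrl"],
                callback=self.getMoreUserCallback,
                params=nextPageParam,
                save=dict(saved, curPageParam=nextPageParam)
            )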

Passing Extra Parameters to the callback Function

Use the save parameter of self.crawl, then read the value back in the callback via response.save. Since tasks are serialized, the value given to save must be JSON-serializable (a string, number, list, or dict).

Example:

    def getUserDetail(self, userId):
        # UserDetailUrl is the user-detail endpoint, defined elsewhere.
        self.crawl(UserDetailUrl,
            callback=self.userDetailCallback,
            params={"member_id": userId},
            save=userId
        )

    def userDetailCallback(self, response):
        userId = response.save
        print("userId=%s" % userId)

Running the callback Even When the Request Fails

Decorate the callback with @catch_status_code_error; the decorator is exported by pyspider.libs.base_handler, so the usual from pyspider.libs.base_handler import * at the top of the script makes it available.

Example:

    def picSeriesPage(self, response):
        ...
        self.crawl(curSerieDict["url"], callback=self.carModelSpecPage, save=curSerieDict)

    @catch_status_code_error
    def carModelSpecPage(self, response):
        curSerieDict = response.save
        print("curSerieDict=%s" % curSerieDict)
        ...
        return curSerieDict
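Because a decorated callback also runs for failed requests, it normally has to check response.status_code before parsing. A sketch of how the callback above could branch on it (the error handling shown is only an illustration):

    @catch_status_code_error
    def carModelSpecPage(self, response):
        curSerieDict = response.save
        if response.status_code != 200:
            # The request failed; log it instead of parsing the page.
            print("failed: status=%s url=%s" % (response.status_code, response.url))
            return
        # Normal parsing of a successful response goes here.
        ...
        return curSerieDict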
