1.1. Introduction to Scrapy

Scrapy is an open-source framework for extracting the data you need from websites in a fast, simple, and extensible way.

Features of Scrapy

  • Fast and powerful: you only write the rules for extracting the data, and Scrapy does the rest for you

  • Easily extensible: designed for extensibility, so you can plug in new functionality without touching the core (see the sketch after this list)

  • Portable: written in Python and compatible across platforms
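
For instance, new behavior is plugged in through a project's settings rather than by patching Scrapy's internals. Below is a minimal sketch using Scrapy's DOWNLOADER_MIDDLEWARES setting; the middleware class name (myproject.middlewares.CustomHeadersMiddleware) is a hypothetical example, not part of Scrapy itself:

# settings.py -- enable a custom downloader middleware at priority 543
# (CustomHeadersMiddleware is a hypothetical class you would define
# in myproject/middlewares.py)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomHeadersMiddleware': 543,
}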

Build and run your web spider

$ pip install scrapy 
$ cat > myspider.py <<EOF
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('.post-header>h2'):
            yield {'title': title.css('a ::text').get()}

        for next_page in response.css('a.next-posts-link'):
            yield response.follow(next_page, self.parse)
EOF
$ scrapy runspider myspider.py
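
This command crawls the start URL, follows the pagination links, and prints each scraped item to the console. To save the items to a file instead, you can use Scrapy's feed exports via the -o option (the filename posts.json is an arbitrary choice here):

$ scrapy runspider myspider.py -o posts.json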

Deploy your web spider

Scrapy supports several deployment options; choose whichever best fits your needs.

  • Deploy your spider to Scrapy Cloud

    $ pip install shub
    $ shub login
    Insert your Scrapinghub API Key: <API_KEY>
    
    # Deploy the spider to Scrapy Cloud
    $ shub deploy
    
    # Schedule the spider for execution
    $ shub schedule blogspider 
    Spider blogspider scheduled, watch it running here:
    https://app.scrapinghub.com/p/26731/job/1/8
    
    # Retrieve the scraped data
    $ shub items 26731/1/8
    {"title": "Improved Frontera: Web Crawling at Scale with Python 3 Support"}
    {"title": "How to Crawl the Web Politely with Scrapy"}
    ...
  • Or use Scrapyd to host the spider on your own server (see the sketch below)
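
    A minimal Scrapyd sketch, assuming the spider lives in a Scrapy project named myproject (scrapyd-deploy works on projects rather than standalone scripts) with a [deploy] target in scrapy.cfg pointing at the server:

    $ pip install scrapyd scrapyd-client
    $ scrapyd &                 # start the Scrapyd server in the background (default port 6800)
    $ cd myproject
    $ scrapyd-deploy            # package the project and upload it to the server
    
    # Schedule a run through Scrapyd's HTTP API
    $ curl http://localhost:6800/schedule.json -d project=myproject -d spider=blogspider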
