1.1. Introduction to Scrapy

Scrapy is an open-source framework for extracting the data you need from websites in a fast, simple, and extensible way.

Features of Scrapy

  • Fast and powerful: you only write the rules for extracting the data, and Scrapy does the rest for you

  • Easily extensible: designed for extensibility, so you can plug in new functionality without touching the core (see the sketch after this list)

  • Portable: written in Python and compatible across platforms
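
For instance, new behavior is plugged in through a project's settings rather than by patching Scrapy's internals. Below is a minimal sketch using Scrapy's DOWNLOADER_MIDDLEWARES setting; the middleware class name (myproject.middlewares.CustomHeadersMiddleware) is a hypothetical example, not part of Scrapy itself:

# settings.py -- enable a custom downloader middleware at priority 543
# (CustomHeadersMiddleware is a hypothetical class you would define
# in myproject/middlewares.py)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomHeadersMiddleware': 543,
}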

Build and run your web spider

$ pip install scrapy 
$ cat > myspider.py <<EOF
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('.post-header>h2'):
            yield {'title': title.css('a ::text').get()}

        for next_page in response.css('a.next-posts-link'):
            yield response.follow(next_page, self.parse)
EOF
$ scrapy runspider myspider.py
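
This command crawls the start URL, follows the pagination links, and prints each scraped item to the console. To save the items to a file instead, you can use Scrapy's feed exports via the -o option (the filename posts.json is an arbitrary choice here):

$ scrapy runspider myspider.py -o posts.json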

Deploy your web spider

Scrapy supports several deployment options; choose whichever best fits your needs.

  • Deploy your spider to Scrapy Cloud

    $ pip install shub
    $ shub login
    Insert your Scrapinghub API Key: <API_KEY>
    
    # Deploy the spider to Scrapy Cloud
    $ shub deploy
    
    # Schedule the spider for execution
    $ shub schedule blogspider 
    Spider blogspider scheduled, watch it running here:
    https://app.scrapinghub.com/p/26731/job/1/8
    
    # Retrieve the scraped data
    $ shub items 26731/1/8
    {"title": "Improved Frontera: Web Crawling at Scale with Python 3 Support"}
    {"title": "How to Crawl the Web Politely with Scrapy"}
    ...
  • Or use Scrapyd to host the spider on your own server (see the sketch below)
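
    A minimal Scrapyd sketch, assuming the spider lives in a Scrapy project named myproject (scrapyd-deploy works on projects rather than standalone scripts) with a [deploy] target in scrapy.cfg pointing at the server:

    $ pip install scrapyd scrapyd-client
    $ scrapyd &                 # start the Scrapyd server in the background (default port 6800)
    $ cd myproject
    $ scrapyd-deploy            # package the project and upload it to the server
    
    # Schedule a run through Scrapyd's HTTP API
    $ curl http://localhost:6800/schedule.json -d project=myproject -d spider=blogspider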
