1.1. Introduction to Scrapy
Scrapy is an open-source framework for extracting the data you need from websites in a fast, simple, yet extensible way.
Features of Scrapy
Fast and powerful: you only write the rules for extracting the data, and Scrapy does the rest for you
Easily extensible: designed for extension, so new functionality can be plugged in without touching the core (see the pipeline sketch below)
Portable: written in Python and compatible across platforms
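To make the extensibility point concrete, here is a minimal sketch of a custom item pipeline; the class name TitleCleanerPipeline and the myproject module path are hypothetical, and Scrapy loads such a component through the ITEM_PIPELINES setting rather than through any change to the framework core:

# pipelines.py -- a hypothetical pipeline; enable it in settings.py with
# ITEM_PIPELINES = {'myproject.pipelines.TitleCleanerPipeline': 300}
class TitleCleanerPipeline:
    def process_item(self, item, spider):
        # Normalize whitespace in the scraped title before it is exported
        item['title'] = item['title'].strip()
        return item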
Build and run your web spider
$ pip install scrapy
$ cat > myspider.py <<EOF
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('.post-header>h2'):
            yield {'title': title.css('a ::text').get()}

        for next_page in response.css('a.next-posts-link'):
            yield response.follow(next_page, self.parse)
EOF
$ scrapy runspider myspider.py
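If you want to keep the scraped items rather than just watch them in the log, Scrapy's -o option appends the output to a feed file, for example:

$ scrapy runspider myspider.py -o posts.json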
Deploy your web spider
Scrapy supports several deployment options; pick whichever one fits your situation.
Deploy a spider to Scrapy Cloud
$ pip install shub
$ shub login
Insert your Scrapinghub API Key: <API_KEY>

# Deploy the spider to Scrapy Cloud
$ shub deploy

# Schedule the spider for execution
$ shub schedule blogspider
Spider blogspider scheduled, watch it running here:
https://app.scrapinghub.com/p/26731/job/1/8

# Retrieve the scraped data
$ shub items 26731/1/8
{"title": "Improved Frontera: Web Crawling at Scale with Python 3 Support"}
{"title": "How to Crawl the Web Politely with Scrapy"}
...
Alternatively, use Scrapyd to host the spiders on your own server.
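As a rough sketch of that route (assuming a Scrapyd server already running on localhost:6800, a deploy target named default in scrapy.cfg, and a hypothetical project name myproject), deploying and scheduling look like this:

$ pip install scrapyd-client
# Package the project and upload it to the Scrapyd server
$ scrapyd-deploy default -p myproject
# Schedule a run of the spider through Scrapyd's JSON API
$ curl http://localhost:6800/schedule.json -d project=myproject -d spider=blogspider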