Extract the article URLs from a list page, hand them to Scrapy for download, and parse each result as it arrives
Extract the next page's URL, hand it to Scrapy for download, then feed the downloaded page back into parse
Crawling a single URL
Crawling multiple URLs
Crawler workflow
Collect the list of URLs to crawl
Download every URL in the list and parse the content we care about out of each response, turning unstructured data into structured data
Persist the parsed results
```python
import scrapy
from urllib import parse
from scrapy import Request
```

```python
for post_url in post_urls:
    yield Request(url=parse.urljoin(response.url, post_url), callback=self.parseDetail)
```
Explanation:
Request(): builds a request (the URL to download plus the callback to run once the page has been downloaded)
yield: hands the request built by Request() over to Scrapy for downloading
parse.urljoin(response.url, post_url): when post_url is a relative path (e.g. china.html), parse.urljoin() joins it with the response URL into a full URL (http://hello.com/hello/china.html); when post_url is already a full URL, it is passed through unchanged — see the sketch after this list
callback=self.parseDetail: once the URL has been downloaded, scrapy calls the function named parseDetail to parse the result
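A quick demonstration of the joining rules (the URLs are made up for illustration):

```python
from urllib import parse

# relative path: resolved against the directory of the base URL
parse.urljoin("http://hello.com/hello/index.html", "china.html")
# -> 'http://hello.com/hello/china.html'

# leading slash: resolved against the site root
parse.urljoin("http://hello.com/hello/index.html", "/china.html")
# -> 'http://hello.com/china.html'

# already a full URL: passed through unchanged
parse.urljoin("http://hello.com/hello/index.html", "http://other.com/a.html")
# -> 'http://other.com/a.html'
```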
Full example
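The full listing does not survive here; below is a minimal sketch of what the list-page parse() could look like, combining the two responsibilities described at the top (the .next.page-numbers selector is an assumption for illustration):

```python
def parse(self, response):
    # 1. extract each article URL on the list page and schedule it for
    #    download; parseDetail handles the downloaded article page
    post_urls = response.css("#archive .post-thumb a::attr(href)").extract()
    for post_url in post_urls:
        yield Request(url=parse.urljoin(response.url, post_url),
                      callback=self.parseDetail)

    # 2. extract the next-page URL and schedule it; the downloaded list
    #    page is fed back into parse() itself
    next_url = response.css(".next.page-numbers::attr(href)").extract_first("")
    if next_url:
        yield Request(url=parse.urljoin(response.url, next_url),
                      callback=self.parse)
```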
Crawling multiple URLs of different types
```python
post_nodes = response.css("#archive .post-thumb a")
for post_node in post_nodes:
    image_url = post_node.css("img::attr(src)").extract_first("")
    post_url = post_node.css("::attr(href)").extract_first("")
    yield Request(url=parse.urljoin(response.url, post_url), callback=self.parseDetail,
                  meta={"image_url": image_url})
```
Retrieving the image_url passed along via meta from the response on the detail page (the key must match the one set in meta above; the default "" guards against a missing key):

```python
image_url = response.meta.get("image_url", "")
```
Defining structured data
ArticleSpider/items.py
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ArticlespiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class JobBoleArticleItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    pub_time = scrapy.Field()
    content = scrapy.Field()
    like_num = scrapy.Field()
    favorite_num = scrapy.Field()
    comment_num = scrapy.Field()
    image_url = scrapy.Field()
    # assigning to an undeclared key raises KeyError, so the fields used
    # later in the spider and the images pipeline must be declared too
    url = scrapy.Field()
    front_image_path = scrapy.Field()
```
Tip: scrapy.Field() declares a field without constraining its type; a field can hold any Python object (strings, lists, tuples, etc.)
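A quick sketch of the dict-style access items support:

```python
from ArticleSpider.items import JobBoleArticleItem

item = JobBoleArticleItem()
item["title"] = "hello scrapy"   # dict-style assignment
print(item["title"])             # dict-style access
# item["unknown"] = 1            # would raise KeyError: field never declared
```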
Using the item
```python
from ArticleSpider.items import JobBoleArticleItem

class JobboleSpider(scrapy.Spider):
    def parseDetail(self, response):
        article_item = JobBoleArticleItem()
        ...
        article_item["title"] = title
        article_item["url"] = response.url
        article_item["image_url"] = [image_url]
        ...
        yield article_item
```
[image_url] wraps the single URL in a list, because the images pipeline expects the URL field to contain a list
yield article_item hands the populated article_item over to the item pipelines defined in pipelines.py
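For reference, a minimal sketch of the receiving end, essentially the stub Scrapy generates in pipelines.py:

```python
class ArticlespiderPipeline(object):
    def process_item(self, item, spider):
        # every item yielded by the spider passes through here;
        # store or transform it, then return it for the next pipeline
        return item
```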
Configuring item pipelines
ArticleSpider/settings.py
```python
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'ArticleSpider.pipelines.ArticlespiderPipeline': 300,
}
```
Tip: 300 is the pipeline's priority, which matters when several pipelines are registered: the lower the number, the earlier the pipeline runs
Downloading images
Configuring item pipelines
```python
import os

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'ArticleSpider.pipelines.ArticlespiderPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
# which item field holds the image URLs
IMAGES_URLS_FIELD = "image_url"
# where downloaded images are stored
project_dir = os.path.abspath(os.path.dirname(__file__))
IMAGES_STORE = os.path.join(project_dir, 'images')
# skip images smaller than these dimensions
IMAGES_MIN_HEIGHT = 100
IMAGES_MIN_WIDTH = 100
```
Tip: scrapy.pipelines.images.ImagesPipeline is a pipeline built into Scrapy (note that it requires the Pillow library to be installed). Besides ImagesPipeline, Scrapy also ships FilesPipeline and MediaPipeline; see the scrapy.pipelines package in the Scrapy source
Requirement
Obtain the file path of each downloaded image and fill it into the item's front_image_path field
Customizing the images pipeline
ArticleSpider/pipelines.py
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.pipelines.images import ImagesPipeline


class ArticleImagePipeline(ImagesPipeline):
    def item_completed(self, results, item, info):
        # results is a list of (success, info) tuples; on success the
        # info dict carries the download's 'url', 'path' and 'checksum'
        for ok, value in results:
            image_file_path = value["path"]
        item["front_image_path"] = image_file_path
        return item
```
Tip: we build our own pipeline on top of ImagesPipeline; for the details of ImagesPipeline, see the ImagesPipeline walkthrough
Note: a custom pipeline must return the item, so that it can be handed to the next pipeline for processing
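As an aside, the return step is also where filtering can happen; a sketch (the pipeline name is made up) using Scrapy's DropItem to discard incomplete items instead of passing them on:

```python
from scrapy.exceptions import DropItem

class DropIncompletePipeline(object):
    def process_item(self, item, spider):
        if not item.get("title"):
            # raising DropItem stops the item here; later pipelines never see it
            raise DropItem("missing title")
        return item  # otherwise pass the item on to the next pipeline
```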
Registering the pipeline
ArticleSpider/settings.py
```python
...
ITEM_PIPELINES = {
    ...
    'ArticleSpider.pipelines.ArticleImagePipeline': 1,
}
...
```
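Tip: with priority 1, ArticleImagePipeline runs before ArticlespiderPipeline (priority 300), so front_image_path is already filled in by the time the item reaches the later pipelines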
Custom utility classes
Differences between Python 2 and Python 3
In Python 3, every str is Unicode text
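A small illustration of the difference (Python 3 semantics):

```python
# Python 3: str is Unicode text; bytes is raw binary data
s = "中文"
b = s.encode("utf-8")            # explicit encode: str -> bytes
assert isinstance(s, str) and isinstance(b, bytes)
assert b.decode("utf-8") == s    # explicit decode: bytes -> str

# Python 2 (for contrast): the literal "中文" was a byte string,
# and u"中文" was needed to get Unicode text
```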