4.8 Writing the spider to complete the crawl

Extract the article URLs from the article list page, hand them to Scrapy to download, and parse each result as it comes back

Extract the next page's URL and hand it to Scrapy to download; after the download, hand the result to parse for parsing

Crawling a single URL

Crawling multiple URLs

Crawler workflow

Get the list of URLs to crawl

Download each URL in the list and parse the content we are interested in out of the downloaded result, turning unstructured data into structured data

Store the parsed results (a minimal skeleton of these steps is sketched below)
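
A minimal sketch of where the snippets in this section fit; the spider name, start URL, and CSS selectors are illustrative assumptions, not prescriptions:

import scrapy
from urllib import parse
from scrapy import Request

class JobboleSpider(scrapy.Spider):
    name = "jobbole"                                      # assumed spider name
    start_urls = ["http://blog.jobbole.com/all-posts/"]   # assumed list page

    def parse(self, response):
        # 1. collect the article URLs on the current list page
        post_urls = response.css("#archive .post-thumb a::attr(href)").extract()
        # 2. download each one and parse it in parseDetail
        for post_url in post_urls:
            yield Request(url=parse.urljoin(response.url, post_url),
                          callback=self.parseDetail)
        # follow the next page with this same parse method (assumed selector)
        next_url = response.css(".next.page-numbers::attr(href)").extract_first("")
        if next_url:
            yield Request(url=parse.urljoin(response.url, next_url),
                          callback=self.parse)

    def parseDetail(self, response):
        # 3. turn the article page into structured data (filled in below)
        pass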

import scrapy
from urllib import parse
from scrapy import Request

# inside parse(): post_urls holds the article links extracted from the list page
for post_url in post_urls:
    yield Request(url=parse.urljoin(response.url, post_url), callback=self.parseDetail)

Explanation:

Request(): builds a request (containing the URL to download and the callback to invoke once the page has been downloaded)

yield: hands the request built by Request() to the Scrapy engine for downloading

parse.urljoin(response.url, post_url): when post_url is a relative path (e.g. china.html), parse.urljoin() joins it with the page URL into a complete URL (e.g. http://hello.com/hello/china.html); when post_url is already a complete URL, it is returned unchanged

callback=self.parseDetail: once the URL has been downloaded, Scrapy calls the method named parseDetail to parse the download result
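
The joining behavior of parse.urljoin() can be checked directly in a Python shell:

from urllib import parse

# relative path: joined onto the page the link came from
parse.urljoin("http://hello.com/hello/", "china.html")
# -> 'http://hello.com/hello/china.html'

# already a complete URL: returned unchanged
parse.urljoin("http://hello.com/hello/", "http://other.com/a.html")
# -> 'http://other.com/a.html'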

Full code

Crawling multiple URLs of different types

post_nodes = response.css("#archive .post-thumb a")
for post_node in post_nodes:
    # extract the cover-image URL and the article URL from each node
    image_url = post_node.css("img::attr(src)").extract_first("")
    post_url = post_node.css("::attr(href)").extract_first("")
    # pass image_url along to parseDetail through the request's meta dict
    yield Request(url=parse.urljoin(response.url, post_url), callback=self.parseDetail,
                  meta={"image_url": image_url})

Retrieving post_url and image_url from the response (post_url is available as response.url; image_url travels through the request's meta dict)

# the key must match the one set on the Request above
image_url = response.meta.get("image_url", "")

Defining structured data

ArticleSpider/items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class ArticlespiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

class JobBoleArticleItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    author = scrapy.Field()
    pub_time = scrapy.Field()
    content = scrapy.Field()
    like_num = scrapy.Field()
    favorite_num = scrapy.Field()
    comment_num = scrapy.Field()
    image_url = scrapy.Field()
    front_image_path = scrapy.Field()  # filled in later by ArticleImagePipeline

Tip: scrapy.Field() declares an item field; a field can hold a value of any type (strings, tuples, lists, etc.)
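
Items read and write like dictionaries; a quick illustration using the JobBoleArticleItem defined above:

item = JobBoleArticleItem()
item["title"] = "hello scrapy"   # fields are set and read like dict keys
print(item["title"])             # hello scrapy
print(dict(item))                # {'title': 'hello scrapy'}
# assigning to a field that was not declared, e.g. item["foo"] = 1,
# raises KeyError, which catches typos early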

Using the item

from ArticleSpider.items import JobBoleArticleItem

class JobboleSpider(scrapy.Spider):

    def parseDetail(self, response):
        article_item = JobBoleArticleItem()
        ...
        article_item["title"] = title
        article_item["url"] = response.url
        article_item["image_url"] = [image_url]
        ...
        yield article_item

Wrapping image_url in [ ] passes it as a list; the ImagesPipeline configured below expects the URL field to contain a list of URLs

Calling yield article_item hands the populated article_item over to the item pipelines defined in pipelines.py
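
On the receiving end, the ArticlespiderPipeline generated by scrapy startproject gets every yielded item through its process_item method:

class ArticlespiderPipeline(object):
    def process_item(self, item, spider):
        # item is the article_item yielded by the spider; store it here,
        # then return it so the next pipeline can process it as well
        return item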

Configuring the item pipelines

ArticleSpider/settings.py

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'ArticleSpider.pipelines.ArticlespiderPipeline': 300
}

Tip: 300 is the pipeline's execution order. It matters when multiple pipelines are registered: the smaller the number, the earlier the pipeline runs.

Downloading images

Configuring the item pipelines

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'ArticleSpider.pipelines.ArticlespiderPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1
}

import os  # needed below to build IMAGES_STORE

# which item field holds the image URLs (must be a list of URLs)
IMAGES_URLS_FIELD = "image_url"
# where downloaded images are saved
project_dir = os.path.abspath(os.path.dirname(__file__))
IMAGES_STORE = os.path.join(project_dir, 'images')
# ignore images smaller than 100x100 (ImagesPipeline requires the Pillow library)
IMAGES_MIN_HEIGHT = 100
IMAGES_MIN_WIDTH = 100

Tip: scrapy.pipelines.images.ImagesPipeline is a pipeline built into Scrapy; besides ImagesPipeline, Scrapy also ships FilesPipeline and MediaPipeline, which you can find in the scrapy.pipelines package of the Scrapy source code
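
For example, the built-in FilesPipeline is enabled the same way, with its own counterpart settings (the field and directory names below are illustrative):

ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_URLS_FIELD = "file_urls"  # item field holding the list of file URLs
FILES_STORE = "files"           # directory where downloaded files are saved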

Requirement

Get the path where the downloaded image was stored and fill it into the item's front_image_path field

Customizing the image pipeline

ArticleSpider/pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.pipelines.images import ImagesPipeline


class ArticleImagePipeline(ImagesPipeline):
    def item_completed(self, results, item, info):
        # results is a list of (success, file_info) tuples; file_info["path"]
        # is the image's location relative to IMAGES_STORE
        for ok, value in results:
            if ok:
                item["front_image_path"] = value["path"]
        return item

Tip: we build our own pipeline on top of ImagesPipeline; for the details of ImagesPipeline, see the section Dissecting ImagesPipeline

Note: a custom pipeline must return the item so that it is passed on to the next pipeline for processing

Registering the pipeline

ArticleSpider/settings.py

...
ITEM_PIPELINES = {
    ...
    'ArticleSpider.pipelines.ArticleImagePipeline': 1
}
...

Custom utility classes

Differences between Python 2 and Python 3

In Python 3, all strings (str) are Unicode; binary data lives in the separate bytes type
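
A quick illustration of the difference:

# Python 3: str is Unicode text, bytes is a separate binary type
s = "爬虫"
b = s.encode("utf-8")     # str -> bytes
print(type(s), type(b))   # <class 'str'> <class 'bytes'>
print(b.decode("utf-8"))  # bytes -> str: 爬虫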
