Incremental crawler

I Incremental crawler

  • Concept: use the crawler to monitor a website for data updates, so that only the newly added or updated data is crawled.
  • How to perform incremental crawling:
    • Before sending a request, determine whether its URL has already been crawled
    • After parsing the content, determine whether that content has already been crawled
    • When writing to the storage medium, check whether the content already exists there
      • Analysis:

        It is not difficult to see that the core of incremental crawling is deduplication. As for the step at which deduplication is applied, each option has its own advantages and disadvantages. In my opinion, the first two ideas should be chosen according to the actual situation (or combined). The first idea suits websites where new pages keep appearing, such as new chapters of a novel or the daily news; the second suits websites whose existing pages get their content updated. The third idea is the last line of defense and achieves deduplication to the greatest possible extent.

  • Deduplication methods (see the sketch after this list)
    • Store the URLs generated during crawling in a Redis set. On the next crawl, before issuing a request, check whether its URL is already in the set of stored URLs; if it is, skip the request, otherwise send it.
    • Generate a unique identifier for the crawled page content and store it in a Redis set. On the next crawl, before persisting newly parsed data, check whether its unique identifier already exists in the Redis set.
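
A minimal, standalone sketch of both deduplication methods (not from the project code below), assuming a local Redis instance and the redis-py client; the key names crawled_urls and data_fingerprints and the helper names are illustrative. The whole check rests on sadd returning 1 when a value is new and 0 when it already exists.

import hashlib

from redis import Redis

conn = Redis(host='127.0.0.1', port=6379)

def should_request(url):
    # Method 1: URL-level deduplication; sadd returns 1 only if the url was not in the set yet
    return conn.sadd('crawled_urls', url) == 1

def should_store(content):
    # Method 2: content-level deduplication via a sha256 fingerprint of the parsed data
    fingerprint = hashlib.sha256(content.encode()).hexdigest()
    return conn.sadd('data_fingerprints', fingerprint) == 1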

II Project cases

- Requirement: crawl all movie detail data from the 4567tv website.

Crawler file:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from redis import Redis
from incrementPro.items import IncrementproItem
class MovieSpider(CrawlSpider):
    name = 'movie'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.4567tv.tv/frim/index7-11.html']

    rules = (
        Rule(LinkExtractor(allow=r'/frim/index7-\d+\.html'), callback='parse_item', follow=True),
    )
    # Create the Redis connection object
    conn = Redis(host='127.0.0.1', port=6379)
    def parse_item(self, response):
        li_list = response.xpath('//li[@class="p1 m1"]')
        for li in li_list:
            # Get the url of the detail page
            detail_url = 'http://www.4567tv.tv' + li.xpath('./a/@href').extract_first()
            # Save the detail page url into the Redis set; sadd returns 1 if the url is new, 0 if it already exists
            ex = self.conn.sadd('urls', detail_url)
            if ex == 1:
                print('This url has not been crawled yet, so its data can be crawled')
                yield scrapy.Request(url=detail_url, callback=self.parse_detail)
            else:
                print('The data has not been updated. There is no new data to crawl!')

    # Parse the movie name and genre from the detail page for persistent storage
    def parse_detail(self, response):
        item = IncrementproItem()
        item['name'] = response.xpath('//dt[@class="name"]/text()').extract_first()
        item['kind'] = response.xpath('//div[@class="ct-c"]/dl/dt[4]//text()').extract()
        item['kind'] = ''.join(item['kind'])
        yield item
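
The items.py file for the project is not shown; below is a minimal sketch of what it would need to declare, inferred from the fields the spider assigns (the second project's IncrementbydataproItem would analogously declare author and content):

# incrementPro/items.py (sketch, inferred from the fields used in the spider)
import scrapy

class IncrementproItem(scrapy.Item):
    name = scrapy.Field()   # movie name
    kind = scrapy.Field()   # movie genre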

Pipeline file:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json

from redis import Redis
class IncrementproPipeline(object):
    conn = None
    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)
    def process_item(self, item, spider):
        dic = {
            'name':item['name'],
            'kind':item['kind']
        }
        print(dic)
        # Redis lists only accept bytes/str/numbers, so serialize the dict to JSON first
        self.conn.lpush('movieData', json.dumps(dic))
        return item
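
As the boilerplate comment above notes, the pipeline only runs if it is registered in ITEM_PIPELINES. A sketch of the setting, assuming the default project layout and the class name shown above (300 is just the conventional priority value); the second project would register incrementByDataPro.pipelines.IncrementbydataproPipeline the same way:

# settings.py (sketch)
ITEM_PIPELINES = {
    'incrementPro.pipelines.IncrementproPipeline': 300,
}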

- Requirement: crawl the joke posts and author data from the Qiushibaike website.

Crawler file:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from incrementByDataPro.items import IncrementbydataproItem
from redis import Redis
import hashlib
class QiubaiSpider(CrawlSpider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    rules = (
        Rule(LinkExtractor(allow=r'/text/page/\d+/'), callback='parse_item', follow=True),
        Rule(LinkExtractor(allow=r'/text/$'), callback='parse_item', follow=True),
    )
    # Create the Redis connection object
    conn = Redis(host='127.0.0.1', port=6379)
    def parse_item(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')

        for div in div_list:
            item = IncrementbydataproItem()
            item['author'] = div.xpath('./div[1]/a[2]/h2/text() | ./div[1]/span[2]/h2/text()').extract_first()
            item['content'] = div.xpath('.//div[@class="content"]/span/text()').extract_first()

            # Generate a unique identifier (fingerprint) from the parsed data for Redis storage
            # The author field may be missing for some posts, so fall back to an empty string
            source = (item['author'] or '') + (item['content'] or '')
            source_id = hashlib.sha256(source.encode()).hexdigest()
            # Store the unique identifier of the parsed content in the Redis set data_id
            ex = self.conn.sadd('data_id', source_id)

            if ex == 1:
                print('This data has not been crawled yet and can be crawled...')
                yield item
            else:
                print('This data has already been crawled; there is no need to crawl it again!')



Pipeline file:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json

from redis import Redis
class IncrementbydataproPipeline(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        dic = {
            'author': item['author'],
            'content': item['content']
        }
        # print(dic)
        # Serialize the dict to JSON, since Redis lists only accept bytes/str/numbers
        self.conn.lpush('qiubaiData', json.dumps(dic))
        return item
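
Since the items are pushed as JSON strings, reading them back out of Redis is straightforward. A quick usage sketch, assuming the same local Redis instance and the list key used above:

import json

from redis import Redis

conn = Redis(host='127.0.0.1', port=6379)
# lrange returns the raw strings pushed by the pipeline; parse each one back into a dict
for raw in conn.lrange('qiubaiData', 0, -1):
    record = json.loads(raw)
    print(record['author'], record['content'])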
