I Incremental crawler
- Concept: use a crawler to monitor a website for data updates so that only the newly added data is crawled.
- How to perform incremental crawling:
- Before sending a request, check whether the URL has already been crawled
- After parsing, check whether the parsed content has already been crawled
- When writing to the storage medium, check whether the content already exists there
- Analysis:
It is not hard to see that the core of incremental crawling is deduplication. As for which step the deduplication should happen at, each option has its own advantages and disadvantages. In my view, the first two ideas should be chosen according to the actual situation (or combined). The first idea suits websites that keep producing new pages, such as new chapters of a novel or each day's latest news; the second suits websites whose existing pages get updated. The third idea is the last line of defense and achieves deduplication to the greatest extent.
- Deduplication methods:
- Store the URLs generated during crawling in a Redis set. On the next crawl, before issuing a request, check whether its URL is already in the set of stored URLs: if it is, skip the request; otherwise, send it.
- Build a unique fingerprint of the crawled page content and store it in a Redis set. On the next crawl, before persisting the data, check whether its fingerprint already exists in the set (a standalone sketch of both checks follows this list).
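Putting the two methods together, here is a minimal standalone sketch (independent of Scrapy) of what the set-based checks look like. It assumes a local Redis instance on the default port; the helper names `is_new_url` / `is_new_content` and the example inputs are mine, while the set names `urls` and `data_id` match the ones used in the projects below.

```python
# Minimal sketch of Redis-set-based deduplication (assumes a local Redis server)
import hashlib

from redis import Redis

conn = Redis(host='127.0.0.1', port=6379)

def is_new_url(url):
    # sadd returns 1 when the value was newly added to the set, 0 when it was already there
    return conn.sadd('urls', url) == 1

def is_new_content(text):
    # Fingerprint the parsed content and apply the same set-based check
    fingerprint = hashlib.sha256(text.encode()).hexdigest()
    return conn.sadd('data_id', fingerprint) == 1

if __name__ == '__main__':
    # True the first time a value is seen, False on every later run
    print(is_new_url('http://www.example.com/page/1'))
    print(is_new_content('some parsed page content'))
```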
II Project case
- Requirement: crawl all movie detail data from the 4567tv website.
Crawler file:
```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis
from incrementPro.items import IncrementproItem


class MovieSpider(CrawlSpider):
    name = 'movie'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.4567tv.tv/frim/index7-11.html']

    rules = (
        Rule(LinkExtractor(allow=r'/frim/index7-\d+\.html'), callback='parse_item', follow=True),
    )
    # Create the Redis connection object
    conn = Redis(host='127.0.0.1', port=6379)

    def parse_item(self, response):
        li_list = response.xpath('//li[@class="p1 m1"]')
        for li in li_list:
            # Get the url of the detail page
            detail_url = 'http://www.4567tv.tv' + li.xpath('./a/@href').extract_first()
            # Try to add the detail-page url to the Redis set; sadd returns 1 only if it was not there yet
            ex = self.conn.sadd('urls', detail_url)
            if ex == 1:
                print('This url has not been crawled yet, its data can be crawled.')
                yield scrapy.Request(url=detail_url, callback=self.parse_detail)
            else:
                print('The data has not been updated, there is no new data to crawl!')

    # Parse the movie name and type from the detail page for persistent storage
    def parse_detail(self, response):
        item = IncrementproItem()
        item['name'] = response.xpath('//dt[@class="name"]/text()').extract_first()
        item['kind'] = response.xpath('//div[@class="ct-c"]/dl/dt[4]//text()').extract()
        item['kind'] = ''.join(item['kind'])
        yield item
```
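The items file imported above is not shown in the original; a minimal sketch consistent with the two fields the spider fills in (`name` and `kind`):

```python
# incrementPro/items.py -- reconstructed from the fields used in the spider, not shown in the original
import scrapy


class IncrementproItem(scrapy.Item):
    name = scrapy.Field()  # movie name
    kind = scrapy.Field()  # movie type/genre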
Pipeline file:
```python
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json

from redis import Redis


class IncrementproPipeline(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        dic = {
            'name': item['name'],
            'kind': item['kind']
        }
        print(dic)
        # redis-py 3.x only accepts bytes/str/numbers as values, so serialize the dict first
        self.conn.lpush('movieData', json.dumps(dic))
        return item
```
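For the pipeline to run, it must be registered in `ITEM_PIPELINES`, as the pipeline file's own comment notes. A minimal settings excerpt, assuming the default module layout of a project named incrementPro (the dotted path and the priority value 300 are assumptions, not taken from the original):

```python
# incrementPro/settings.py (excerpt) -- assumed module path based on the project name
ITEM_PIPELINES = {
    'incrementPro.pipelines.IncrementproPipeline': 300,
}
```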
- Requirement: crawl the jokes and author data from Qiushibaike (qiushibaike.com).
Crawler file:
```python
# -*- coding: utf-8 -*-
import hashlib

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis
from incrementByDataPro.items import IncrementbydataproItem


class QiubaiSpider(CrawlSpider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    rules = (
        Rule(LinkExtractor(allow=r'/text/page/\d+/'), callback='parse_item', follow=True),
        Rule(LinkExtractor(allow=r'/text/$'), callback='parse_item', follow=True),
    )
    # Create the Redis connection object
    conn = Redis(host='127.0.0.1', port=6379)

    def parse_item(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            item = IncrementbydataproItem()
            item['author'] = div.xpath('./div[1]/a[2]/h2/text() | ./div[1]/span[2]/h2/text()').extract_first()
            item['content'] = div.xpath('.//div[@class="content"]/span/text()').extract_first()

            # Build a unique fingerprint of the parsed data for storage in Redis
            # (guard against missing fields: extract_first can return None)
            source = (item['author'] or '') + (item['content'] or '')
            source_id = hashlib.sha256(source.encode()).hexdigest()
            # Try to add the fingerprint to the Redis set data_id; sadd returns 1 only if it was not there yet
            ex = self.conn.sadd('data_id', source_id)
            if ex == 1:
                print('This data has not been crawled yet and can be crawled......')
                yield item
            else:
                print('This data has already been crawled, there is no need to crawl it again!!!')
```
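As with the first project, the items file is not shown in the original; a minimal sketch matching the fields used in the spider (`author` and `content`):

```python
# incrementByDataPro/items.py -- reconstructed from the fields used in the spider, not shown in the original
import scrapy


class IncrementbydataproItem(scrapy.Item):
    author = scrapy.Field()   # author of the post
    content = scrapy.Field()  # text of the post
```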
Pipeline file:
```python
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json

from redis import Redis


class IncrementbydataproPipeline(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        dic = {
            'author': item['author'],
            'content': item['content']
        }
        # print(dic)
        # redis-py 3.x only accepts bytes/str/numbers as values, so serialize the dict first
        self.conn.lpush('qiubaiData', json.dumps(dic))
        return item
```
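To verify that repeated runs only append fresh records, the Redis lists written by the two pipelines can be inspected directly. A quick check, assuming the same local Redis instance and the list names used above (this snippet is illustrative, not part of the original projects):

```python
# Quick sanity check of the stored results (assumes a local Redis server)
from redis import Redis

conn = Redis(host='127.0.0.1', port=6379)

# Number of records persisted by each pipeline
print(conn.llen('movieData'), 'movie records')
print(conn.llen('qiubaiData'), 'joke records')

# Most recently stored record from each list (lpush puts new items at index 0)
print(conn.lrange('movieData', 0, 0))
print(conn.lrange('qiubaiData', 0, 0))
```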