5. Crawler: the Scrapy framework, the Redis database and manual request sending

Crawler

1 Scrapy framework

1.1 introduction

The Scrapy framework is an application framework for asynchronous crawling. It is used for high-performance data parsing, high-performance persistent storage, whole-site data crawling, incremental crawling and distributed crawling.

1.2 environment installation

Windows

1. pip install wheel
2. Download Twisted from
   http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
3. Enter the directory of the downloaded Twisted file and execute pip install <Twisted file name>,
   for example pip install Twisted-20.3.0-cp38-cp38-win_amd64.whl
4. pip install pywin32
5. pip install scrapy

Linux

pip install scrapy

Explanation:
Twisted is a third-party component that the Scrapy framework relies on to implement asynchronous operations.

1.3 basic use

Operation                       Command
Create a project                scrapy startproject xxx (project name)
Create a crawler file           scrapy genspider xxx www.xxx.com (crawled domain)
Output crawled data to a file   scrapy crawl xxx -o xxx.json (a file of the given type)
Run a crawler                   scrapy crawl xxx (crawler name)
List all crawlers               scrapy list
Get configuration information   scrapy settings [options]
1.3.1 create project
scrapy startproject DemoScrapyProject
1.3.2 project directory
DemoScrapyProject
│  scrapy.cfg
│
└─DemoScrapyProject
    │  items.py
    │  middlewares.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    ├─spiders
    │  │  __init__.py
    │  │
    │  └─__pycache__
    └─__pycache__

Inside the project directory, under the folder named after the project, there is a folder called spiders, which serves as the crawler package.
At least one crawler file needs to be created in the spiders folder.

1.3.3 create crawler file
cd DemoScrapyProject
scrapy genspider test www.test.com 

Running the genspider command in the project directory automatically creates a crawler file named test.py in the spiders folder.

import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['www.test.com']
    start_urls = ['http://www.test.com/']

    def parse(self, response):
        pass
Class attribute    Explanation
name               The crawler name; it is the unique identifier of the crawler file and cannot be duplicated.
allowed_domains    The allowed domains; only pages under these domains are crawled.
start_urls         The list of starting urls; it stores the target urls that are about to be requested.
parse              Used for data parsing; the parameter response is the response object passed in automatically.
import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['www.test.com']
    start_urls = ['https://www.sougou.com/', 'https://www.baidu.com/']

    def parse(self, response):
        pass
1.3.4 modifying configuration files

settings.py

  1. Modify the log output level
LOG_LEVEL = 'ERROR'
  2. Do not obey the robots protocol
ROBOTSTXT_OBEY = False
  3. UA spoofing (set a browser User-Agent)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
1.3.5 run the project
scrapy crawl test
1.3.6 data parsing

Target: crawl the jokes on Qiushibaike (qiushibaike.com).
Method: use xpath to locate the tags, then extract the data.

There are two ways to extract data:
extract() takes the data out of a Selector object (or out of every Selector object in a list);
extract_first() takes the data out of the first Selector object in a list and returns a string.

selector_obj.extract()  takes the data out of selector_obj and returns a string.
[selector_obj1, selector_obj2, ...].extract()  takes the data out of each selector_obj and returns a list.

[selector_obj1, selector_obj2, ...].extract_first()  takes the data out of the first element selector_obj1 and returns a string.
def parse(self, response):
    div_list = response.xpath('//*[@id="content"]/div/div[2]/div')
    for each_div in div_list:
        # xpath() returns a list of Selector objects.
        # Extraction method 1: use extract()
        author_selector_list1 = each_div.xpath('./div[1]/a[2]/h2/text()')  # [<Selector xpath='./div[1]/a[2]/h2/text()' data='\n eat two bowls and serve again \n'>]
        author_list1 = author_selector_list1.extract()  # ['\n eat two bowls and serve again \n']
        author_str1 = author_list1[0]  # '\n eat two bowls and serve again \n'

        author_selector_obj2 = each_div.xpath('./div[1]/a[2]/h2/text()')[0]  # <Selector xpath='./div[1]/a[2]/h2/text()' data='\n eat two bowls and serve again \n'>
        author_str2 = author_selector_obj2.extract()  # '\n eat two bowls and serve again \n'

        # Extraction method 2: use extract_first()
        author_str3 = each_div.xpath('./div[1]/a[2]/h2/text()').extract_first()
        print(author_str3)
import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    # allowed_domains = ['www.test.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//*[@id="content"]/div/div[2]/div')
        for each_div in div_list:
            author_str = each_div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            content_list = each_div.xpath('./a/div/span//text()').extract()
            content_str = ''.join(content_list)
1.3.7 persistent storage

There are two ways:

  1. Persistent storage based on terminal instructions;
  2. Pipeline based persistent storage.
1.3.7.1 persistent storage based on terminal instructions

Limitations:

  1. Only the return value of the parse method can be persisted;
  2. It can only be stored in files of specified types, including json, jsonlines, jl, csv, xml, marshal, pickle;
  3. Writing data to the database is not supported.
import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    # allowed_domains = ['www.test.com']
    start_urls = ['https://www.qiushibaike.com/text/']
    
    def parse(self, response):
        result_list = []
        div_list = response.xpath('//*[@id="content"]/div/div[2]/div')
        for each_div in div_list:
            author_str = each_div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            content_list = each_div.xpath('./a/div/span//text()').extract()
            content_str = ''.join(content_list)
            temp_dict = {
                'author': author_str,
                'content': content_str
            }
            result_list.append(temp_dict)
        return result_list

Terminal command

scrapy crawl test -o result.csv
1.3.7.2 pipeline based persistent storage

Steps:

  1. Data parsing;
  2. Define the related attributes in items.py; the number of attributes must match the number of parsed fields;
  3. Store the parsed data in an Item type object;
  4. Submit the Item object to the pipeline;
  5. Receive an Item object in the pipeline, and make any form of persistent storage of the data stored in the Item object;
  6. Turn on the pipeline mechanism in the configuration file.

Step 1: perform data parsing.

test.py

import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    # allowed_domains = ['www.test.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//*[@id="content"]/div/div[2]/div')
        for each_div in div_list:
            author_str = each_div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            content_list = each_div.xpath('./a/div/span//text()').extract()
            content_str = ''.join(content_list)

Step 2: define the related attributes in items.py.

items.py

import scrapy

class DemoscrapyprojectItem(scrapy.Item):
    author = scrapy.Field()
    content = scrapy.Field()

Step 3: store the parsed data in an Item object.

test.py

import scrapy
from DemoScrapyProject.items import DemoscrapyprojectItem

class TestSpider(scrapy.Spider):
    name = 'test'
    # allowed_domains = ['www.test.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//*[@id="content"]/div/div[2]/div')
        for each_div in div_list:
            author_str = each_div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            content_list = each_div.xpath('./a/div/span//text()').extract()
            content_str = ''.join(content_list)
            
            # Instantiate the Item object
            item_obj = DemoscrapyprojectItem()
            item_obj['author'] = author_str
            item_obj['content'] = content_str

Step 4: submit the Item object to the pipeline.

test.py

import scrapy
from DemoScrapyProject.items import DemoscrapyprojectItem

class TestSpider(scrapy.Spider):
    name = 'test'
    # allowed_domains = ['www.test.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//*[@id="content"]/div/div[2]/div')
        for each_div in div_list:
            author_str = each_div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            content_list = each_div.xpath('./a/div/span//text()').extract()
            content_str = ''.join(content_list)

            # Instantiate Item object
            item_obj = DemoscrapyprojectItem()
            item_obj['author'] = author_str
            item_obj['content'] = content_str

            # Submit Item object to pipeline
            yield item_obj

Step 5: receive the Item object in the pipeline and persist the data stored in it.

pipelines.py

class DemoscrapyprojectPipeline:
    fp = None

    # Override the parent method to open the file.
    # This method will only be executed once when the crawler starts during the execution of the whole project.
    def open_spider(self, spider):
        self.fp = open('./result.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        '''
        Receives the Item object. This method is called once for each Item object that the crawler file submits to the pipeline.
        :param item: the received Item object
        :param spider: the object instantiated from the crawler class
        :return:
        '''
        author = item['author']
        content = item['content']
        self.fp.write(author)
        self.fp.write(content)
        return item

    # Override the parent method to close the file.
    # This method will only be executed once at the end of the crawler during the execution of the whole project.
    def close_spider(self, spider):
        self.fp.close()

The spider parameter in these methods refers to the object instantiated from the crawler class in test.py; it can be used to access the crawler class's data from within the pipeline file.
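
For example, an attribute defined on the crawler class can be read inside the pipeline through this parameter. A minimal sketch, assuming a hypothetical output_path attribute and a hypothetical SpiderAttrDemoPipeline class (neither is part of the original project):

test.py

import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.qiushibaike.com/text/']
    # Hypothetical class attribute, defined only to show access from the pipeline.
    output_path = './result.txt'

    def parse(self, response):
        pass

pipelines.py

class SpiderAttrDemoPipeline:
    def open_spider(self, spider):
        # spider is the TestSpider instance, so its class attributes are reachable here.
        print(spider.name)         # 'test'
        print(spider.output_path)  # './result.txt'

    def process_item(self, item, spider):
        return item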

Step 6: enable the pipeline mechanism in the configuration file.

settings.py

ITEM_PIPELINES = {
   'DemoScrapyProject.pipelines.DemoscrapyprojectPipeline': 300,
}
1.3.7.3 pipeline detail analysis
  1. Priority
    The number 300 indicates the priority of the pipeline. The smaller the number, the higher the priority.
    The higher a pipeline's priority, the earlier it is executed.

settings.py

ITEM_PIPELINES = {
   'DemoScrapyProject.pipelines.DemoscrapyprojectPipeline': 300,
}
  2. Multiple pipeline classes
    Multiple pipeline classes are generally used for data backup; each pipeline class stores the data in one kind of carrier.
    If you need to store data in MySQL and Redis at the same time, you need two pipeline classes.

  3. return item
    The crawler file only submits the Item object to the pipeline with the highest priority.
    Returning item at the end of a pipeline class's process_item method hands the Item object on to the next pipeline class.

For example, encapsulate a pipeline class that stores the data in a MySQL database.

pipelines.py

import pymysql

class MySQLPipeline:
    conn = None  # Database connection object
    cursor = None  # Cursor object 

    def open_spider(self, spider):
        self.conn = pymysql.Connect(
            host='127.0.0.1',
            port=3306,
            user='root',
            password='123456',
            db='spider_db',
            charset='utf8'
        )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        author = item['author']
        content = item['content']
        # A parameterized query keeps quotes in the scraped text from breaking the SQL.
        sql = 'insert into qiushi values (%s, %s)'
        try:
            self.cursor.execute(sql, (author, content))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
            
        return item  # Pass the item object on to the next pipeline class.

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

Register this pipeline class in the configuration file

settings.py

ITEM_PIPELINES = {
   'DemoScrapyProject.pipelines.DemoscrapyprojectPipeline': 300,
   'DemoScrapyProject.pipelines.MySQLPipeline': 301,
}

2. Introduction to redis database

2.1 introduction

Redis is a non-relational database.

Start the server: redis-server
Start the client: redis-cli

2.2 basic storage unit

  1. Set
  2. List
2.2.1 set
  1. Insert data
    Syntax: sadd <set name> <value>
    Note: a set cannot store duplicate data.
    Store the strings Ben and Elliot in the set name_set.
127.0.0.1:6379> sadd name_set Ben
(integer) 1

127.0.0.1:6379> sadd name_set Elliot
(integer) 1

127.0.0.1:6379> sadd name_set Ben
(integer) 0
  2. View data
    Syntax: smembers <set name>
    View the data in the set name_set.
127.0.0.1:6379> smembers name_set
1) "Elliot"
2) "Ben"
2.2.2 list
  1. Insert data
    Syntax: lpush <list name> <value>
    A list allows duplicate data to be stored.
    Store the strings Ben and Elliot in the list name_list.
127.0.0.1:6379> lpush name_list Ben
(integer) 1

127.0.0.1:6379> lpush name_list Elliot
(integer) 2

127.0.0.1:6379> lpush name_list Ben
(integer) 3
  2. View the list length
    Syntax: llen <list name>
    View the length of the list name_list.
127.0.0.1:6379> llen name_list
(integer) 3
  3. View data
    Syntax: lrange <list name> <start index> <end index>
127.0.0.1:6379> lrange name_list 0 1
1) "Ben"
2) "Elliot"

127.0.0.1:6379> lrange name_list 0 -1
1) "Ben"
2) "Elliot"
3) "Ben"

127.0.0.1:6379> lrange name_list 1 1
1) "Elliot"
2.2.3 general commands
  1. View all keys
127.0.0.1:6379> keys *
1) "name_list"
2) "name_set"
  2. Delete all data
127.0.0.1:6379> flushall
OK

2.3 using Scrapy together with Redis

Install the redis module; version 2.10.6 needs to be specified.
Newer versions of the redis module cannot directly store dict-type data into a Redis list.

pip install redis==2.10.6
2.3.1 encapsulating a pipeline class for Redis

pipelines.py

import redis

class RedisPipeline:
    conn = None

    def open_spider(self, spider):
        self.conn = redis.Redis(
            host='127.0.0.1',
            port=6379
        )

    def process_item(self, item, spider):
        self.conn.lpush('qiushiData', item)
        return item
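
As noted above, newer versions of the redis module do not accept a dict directly. A minimal workaround sketch, not part of the original project and shown only as an alternative under that assumption: serialize the item to a JSON string before pushing it, so that any redis client version accepts it.

import json
import redis

class RedisJsonPipeline:
    # Hypothetical variant of RedisPipeline for newer redis client versions.
    conn = None

    def open_spider(self, spider):
        self.conn = redis.Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        # dict(item) turns the scrapy Item into a plain dict; json.dumps turns it into a string.
        self.conn.lpush('qiushiData', json.dumps(dict(item), ensure_ascii=False))
        return item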
2.3.2 register the pipeline class in the configuration file

settings.py

ITEM_PIPELINES = {
   'DemoScrapyProject.pipelines.DemoscrapyprojectPipeline': 300,
   'DemoScrapyProject.pipelines.MySQLPipeline': 301,
   'DemoScrapyProject.pipelines.RedisPipeline': 302,
}
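
After a crawl finishes, the data written by the Redis pipeline can be inspected from Python as well. A quick check sketch, assuming the local server and the qiushiData list used above:

import redis

conn = redis.Redis(host='127.0.0.1', port=6379)
print(conn.llen('qiushiData'))          # number of stored items
print(conn.lrange('qiushiData', 0, 0))  # the most recently pushed item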

3. Whole-site data crawling

Objective: crawl and store the data from every page.

3.1 crawl the first page of data

>>> scrapy startproject duanziProject
>>> cd duanziProject
>>> scrapy genspider duanzi www.duanziwang.com/

settings.py

BOT_NAME = 'duanziProject'

SPIDER_MODULES = ['duanziProject.spiders']
NEWSPIDER_MODULE = 'duanziProject.spiders'

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'

ROBOTSTXT_OBEY = False

LOG_LEVEL = 'ERROR'

ITEM_PIPELINES = {
   'duanziProject.pipelines.DuanziprojectPipeline': 300,
}

items.py

import scrapy

class DuanziprojectItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()

duanzi.py

import scrapy
from duanziProject.items import DuanziprojectItem

class DuanziSpider(scrapy.Spider):
    name = 'duanzi'
    # allowed_domains = ['www.duanziwang.com/']
    start_urls = ['https://duanziwang.com/category/一句话段子/1/']

    def parse(self, response):
        article_list = response.xpath('/html/body/section/div/div/main/article')
        for each_article in article_list:
            title_str = each_article.xpath('./div[1]/h1/a/text()').extract_first()
            content_str = each_article.xpath('./div[2]/p/text()').extract_first()
            item = DuanziprojectItem()
            item['title'] = title_str
            item['content'] = content_str
            yield item

pipelines.py

class DuanziprojectPipeline:
    fp = None

    def open_spider(self, spider):
        self.fp = open('./duanzi.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(item['title'])
        self.fp.write(item['content'])
        return item

    def close_spider(self, spider):
        self.fp.close()

3.2 manual request sending

Instead of relying on Scrapy's built-in request sending, send requests explicitly in code.

3.2.1 manually sending a GET request
yield scrapy.Request(url, callback)

Send a GET request to the specified url and parse the response data in the callback function callback.

duanzi.py

import scrapy
from duanziProject.items import DuanziprojectItem

class DuanziSpider(scrapy.Spider):
    name = 'duanzi'
    # allowed_domains = ['www.duanziwang.com/']
    start_urls = ['https://duanziwang.com/category/一句话段子/1/']
    url_model = 'https://duanziwang.com/category/一句话段子/%d/'
    page_num = 2

    def parse(self, response):
        article_list = response.xpath('/html/body/section/div/div/main/article')
        for each_article in article_list:
            title_str = each_article.xpath('./div[1]/h1/a/text()').extract_first()
            content_str = each_article.xpath('./div[2]/p/text()').extract_first()
            item = DuanziprojectItem()
            item['title'] = title_str
            item['content'] = content_str
            yield item

        # End recursion condition
        if self.page_num < 10:
            new_url = format(self.url_model % self.page_num)
            self.page_num += 1

            # Send the request manually and call the callback function to parse the requested data.
            # The callback function is parse itself, so this is a recursive call.
            yield scrapy.Request(url=new_url, callback=self.parse)
3.2.2 manually sending a POST request
yield scrapy.FormRequest(url, formdata, callback)

The formdata parameter carries the POST request parameters as a dictionary.
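
A minimal sketch of a spider that sends a POST request manually; the target URLs, the kw parameter and the parse_post callback are assumptions made up for illustration, not part of the original project:

import scrapy

class PostDemoSpider(scrapy.Spider):
    name = 'post_demo'
    # httpbin.org is assumed here as a test endpoint that echoes back the POST body.
    start_urls = ['https://httpbin.org/forms/post']

    def parse(self, response):
        # Send the POST request manually; formdata carries the parameters as a dict of strings.
        yield scrapy.FormRequest(
            url='https://httpbin.org/post',
            formdata={'kw': 'scrapy'},
            callback=self.parse_post
        )

    def parse_post(self, response):
        # The echoed response shows the submitted form data.
        print(response.text)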

3.2.3 parent method start_requests

Question: how can a POST request be sent for each url in the start_urls list?
Key: override the parent method start_requests.
The parent method start_requests sends a GET request to each url in the start_urls list by default.

duanzi.py

import scrapy
from duanziProject.items import DuanziprojectItem

class DuanziSpider(scrapy.Spider):
    name = 'duanzi'
    start_urls = ['https://duanziwang.com/category/一句话段子/1/']
    ...

    # The original implementation in the parent class. The request method is GET.
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        ...

Override the parent class method start_requests

duanzi.py

import scrapy
from duanziProject.items import DuanziprojectItem

class DuanziSpider(scrapy.Spider):
    name = 'duanzi'
    start_urls = ['https://duanziwang.com/category/一句话段子/1/']
    ...

    # Override the parent method and change the request method to POST.
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.FormRequest(url=url, callback=self.parse)

    def parse(self, response):
        ...
