Python crawler with Scrapy, a case study: crawling an ancient poetry website

Crawling the Ancient Poetry and Prose website (gushiwen.cn)

Requirements

Crawl the poems on the site, extracting each poem's title, author, dynasty, and content.

Page Analysis

To check how the page is rendered, copy part of any poem's text and search for it in the page source. It is there, which means the page is loaded statically: the URL shown in the address bar is the crawl target, and the data can be fetched directly from it. Target URL: https://www.gushiwen.cn/.
Select any poem title, right-click and choose Inspect, and you will see that the title sits in an a tag under a p tag (the title text itself is wrapped in a b tag).

Collapsing the tags shows that the first p tag holds the title, the second p tag holds the author and dynasty, and the div tag below holds the body of the poem.

Collapsing further up, each poem is stored in a div[@class="sons"] tag, and all of the poems are stored inside div[@class="left"].
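Before writing the spider, these selectors can be verified interactively with scrapy shell; a minimal sketch using the class names and tags identified above (the same XPath expressions appear in the spider later in this article):

scrapy shell https://www.gushiwen.cn/
>>> sons = response.xpath('//div[@class="left"]/div[@class="sons"]')
>>> sons[0].xpath('.//b/text()').get()         # title of the first poem
>>> sons[0].xpath('.//p/a/text()').getall()    # [author, dynasty]
>>> ''.join(sons[0].xpath('.//div[@class="contson"]/text()').getall()).strip()  # poem body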

Scroll to the bottom of the page and you will see that the results are paginated. Click "next page" to see what the URL looks like; changing the 2 to 1 takes you back to the first page.
https://www.gushiwen.cn/default_2.aspx Page 2
https://www.gushiwen.cn/default_1.aspx First page
You can also jump to a page by entering a number at the bottom; either URL form reaches the same pages:
https://www.gushiwen.cn/default.aspx?page=2 Page 2
https://www.gushiwen.cn/default.aspx?page=1 First page
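Either pattern could be used to build the page URLs up front; a minimal sketch, assuming the ?page= form (the spider below follows the "next page" link instead):

page_urls = [f'https://www.gushiwen.cn/default.aspx?page={n}' for n in range(1, 4)]
# three URLs: ?page=1, ?page=2 and ?page=3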

Code Implementation

1. Create a Scrapy project

My PyCharm project directory is on drive D. Open cmd, type D: and press Enter to switch to drive D, then cd into the project directory (you can paste the path). Run scrapy startproject <project name>; once it is created, cd <project name> to enter the project, then run scrapy genspider <spider file name> <website domain> to create the spider file, as shown below.
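Concretely, the commands look roughly like this (the path after cd is a placeholder for your own project directory; the project name prose, the spider name gsw, and the domain gushiwen.cn match the code used later in this article):

D:
cd \path\to\your\pycharm\projects
scrapy startproject prose
cd prose
scrapy genspider gsw gushiwen.cn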

2. Configuring files

settings.py
LOG_LEVEL = 'WARNING'  # only log WARNING and above, to keep the console output clean
BOT_NAME = 'prose'

SPIDER_MODULES = ['prose.spiders']
NEWSPIDER_MODULE = 'prose.spiders'
ROBOTSTXT_OBEY = False  # do not obey robots.txt
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  # use a browser User-Agent instead of Scrapy's default one
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36 Edg/92.0.902.62'
}
ITEM_PIPELINES = {
   'prose.pipelines.ProsePipeline': 300,  # enable the pipeline; lower numbers run earlier
}

If you want to use variables and classes from another .py file, import them into the file you are working in:
Method 1: direct import. PyCharm treats the directory it opened as the root, so import starting from the top level, e.g. from day22.mySpider.mySpider.items import MyspiderItem
Method 2: set the root directory yourself. Right-click the folder you want as the root, choose Mark Directory as -> Sources Root, and then import relative to that folder, e.g. from mySpider.items import MyspiderItem

Right-click the "next page" link and inspect it: the link is an a tag whose href value is the URL of the next page, so we can paginate by following that href.

3. Other project files

Startup script

start.py
from scrapy import cmdline

# cmdline.execute(['scrapy', 'crawl', 'gsw'])  # Method 1: pass the command as a list
cmdline.execute('scrapy crawl gsw'.split(" "))  # Method 2: split the command string into the same list

Pipeline File

pipelines.py
import json

class ProsePipeline:
    # Method 1: open the file inside process_item every time an item arrives
    # def process_item(self, item, spider):
    #     # print(item)  # check that the pipeline receives data
    #     # item is an object; convert it to a dict and then to a JSON string.
    #     # ensure_ascii=False keeps Chinese characters readable.
    #     item_json = json.dumps(dict(item), ensure_ascii=False)
    #     with open('prose.txt', 'a', encoding='utf-8') as f:
    #         f.write(item_json + '\n')
    #     return item

    # Method 2: open the file once when the spider starts and close it when it finishes
    def open_spider(self, spider):
        self.f = open('prose.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # print(item)  # check that the pipeline receives data
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.f.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        self.f.close()
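As an aside, if all you want is this JSON-lines text file, Scrapy's built-in feed export can produce the same kind of output without a custom pipeline; a minimal sketch (the output filename is just an example):

scrapy crawl gsw -o prose.jl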

Item file

items.py
import scrapy
class ProseItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    dynasty = scrapy.Field()
    contents = scrapy.Field()
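ProseItem behaves like a dictionary, which is why the pipeline can call dict(item) on it before serializing to JSON; a small sketch with placeholder values:

item = ProseItem(title='a title', author='an author', dynasty='a dynasty', contents='poem text')
print(dict(item))  # {'title': 'a title', 'author': 'an author', 'dynasty': 'a dynasty', 'contents': 'poem text'}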

Crawler Files

gsw.py
import scrapy

from prose.items import ProseItem  # Import ProseItem class from items
class GswSpider(scrapy.Spider):
    name = 'gsw'
    allowed_domains = ['gushiwen.cn']
    start_urls = ['https://www.gushiwen.cn/']

    def parse(self, response):
        div_sons = response.xpath('.//div[@class="left"]/div[@class="sons"]')
        for div_son in div_sons:
            # Title of the poem
            title = div_son.xpath('.//b/text()').get()
            # Author and dynasty of the poem
            # getall() returns a list; the values are taken out by index
            # Note: without the leading "." the XPath searches the whole document instead of the
            # current div, so the same data comes back repeatedly
            # Some blocks between the poems (pictures etc.) match the selector but have no author/dynasty
            try:
                name = div_son.xpath('.//p/a/text()').getall()
                author = name[0]   # author
                dynasty = name[1]  # dynasty
                # extract() is equivalent to getall()
                # xpath returns a list whose elements have leading and trailing whitespace;
                # join them with an empty string into one string and strip() the ends
                contson = div_son.xpath('.//div[@class="contson"]/text()').extract()
                contents = ''.join(contson).strip()
                # Method 1: build a plain dictionary
                # item = {}
                # item['title'] = title
                # item['author'] = author
                # item['dynasty'] = dynasty
                # item['contents'] = contents
                # Method 2: instantiate an item object. Import the ProseItem class from items,
                # where the fields were defined in advance
                item = ProseItem(title=title, author=author, dynasty=dynasty, contents=contents)
                # print(item)
                yield item  # send the item to the pipeline
            except IndexError:
                # non-poem blocks have no author/dynasty, so name[0] / name[1] raises here
                print(title)
        next_page = response.xpath('.//a[@id="amore"]/@href').get()
        # print(next_page)
        # If there is a next page, follow it
        if next_page:
            # Method 1:
            # req = scrapy.Request(next_page)
            # yield req  # hand the request back to the engine, which runs it through the scheduler, downloader, spider and pipelines again
            yield scrapy.Request(
                url=next_page,
                # callback=self.parse  # the callback is the function that parses the response;
                # it defaults to parse, so it can be omitted when this same function handles it
            )
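One caveat worth noting: scrapy.Request needs an absolute URL. If the href taken from the "next page" link is ever a relative path rather than a full address, it can be joined against the current page with response.urljoin first; a small sketch of that variant of the end of parse:

next_page = response.xpath('.//a[@id="amore"]/@href').get()
if next_page:
    # urljoin turns a relative href into an absolute URL and leaves an absolute one unchanged
    yield scrapy.Request(url=response.urljoin(next_page))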

The console output will contain some blank values, because the page has pictures and other content between the poems; those blocks match the selector but are not poems, and the except branch prints them.

Tags: Python crawler

Posted by alba on Mon, 20 Sep 2021 16:22:29 +0530