Crawling the Ancient Poetry and Prose Website (gushiwen.cn)
Requirements
Crawl the poem data from the website: for each poem, collect its title, author, dynasty, and content.
Page Analysis
Copy part of any poem shown on the page and search for it in the page source. The text can be found there, which means the page is loaded statically: the URL shown in the address bar is the real target of the crawl, and the data can be requested directly from it. Target URL: https://www.gushiwen.cn/.
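The static-load check itself can be scripted. A minimal sketch, assuming the requests library is installed and that the site serves the page to a plain request; sample_text is a placeholder for the fragment you copied from a poem on the page:

import requests

resp = requests.get('https://www.gushiwen.cn/',
                    headers={'User-Agent': 'Mozilla/5.0'})
resp.encoding = resp.apparent_encoding  # let requests pick an encoding that handles Chinese

sample_text = '...'  # paste a copied poem fragment here
# True means the text is present in the raw HTML, i.e. the page is rendered statically
print(sample_text in resp.text)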
Select the title of any poem, right-click and choose Inspect; you will find that the title text is stored in an a tag under a p tag.
Collapse the tags inside the p tags and you will see that the first p tag stores the poem's title, the second p tag stores the author and dynasty, and the div tag below them stores the poem's content.
Collapsing the tags further up, we find that each poem is stored in a div[@class="sons"] tag, and all the poems are stored in the div[@class="left"] tag.
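These selectors can be checked before writing the full project, for example with requests plus scrapy's Selector. A minimal sketch, assuming the site serves the page to a plain requests call:

import requests
from scrapy.selector import Selector

resp = requests.get('https://www.gushiwen.cn/',
                    headers={'User-Agent': 'Mozilla/5.0'})
sel = Selector(text=resp.text)

# every poem block sits in div.sons inside div.left
for son in sel.xpath('//div[@class="left"]/div[@class="sons"]'):
    title = son.xpath('.//b/text()').get()                         # poem title
    meta = son.xpath('.//p/a/text()').getall()                     # expected: [author, dynasty]
    body = son.xpath('.//div[@class="contson"]/text()').getall()   # poem content fragments
    print(title, meta, ''.join(body).strip()[:20])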
Scroll to the bottom of the page and note that the results are paginated. Click "next page" and look at the URL; changing the 2 in the second page's URL to 1 brings up the first page.
https://www.gushiwen.cn/default_2.aspx Page 2
https://www.gushiwen.cn/default_1.aspx First page
You can also jump to a page by entering a page number at the bottom; either URL pattern works:
https://www.gushiwen.cn/default.aspx?page=2 Page 2
https://www.gushiwen.cn/default.aspx?page=1 First page
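Either URL pattern can be generated from a page number. A minimal sketch (the upper bound of 3 pages is arbitrary):

# the path-style pattern: default_1.aspx, default_2.aspx, ...
for page in range(1, 4):
    print('https://www.gushiwen.cn/default_{}.aspx'.format(page))

# the query-string pattern works the same way
for page in range(1, 4):
    print('https://www.gushiwen.cn/default.aspx?page={}'.format(page))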
Code Implementation
1. Create a scrapy project
My PyCharm project directory is on drive D. Open cmd, type cd followed by the project directory path (copied from PyCharm) and press Enter, then type D: and press Enter to switch to drive D and land in that directory. Then run scrapy startproject <project name> to create the project, cd <project name> to enter it after creation, and run scrapy genspider <crawler file name> <website domain> to create the crawler file. The concrete commands are shown below.
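Assuming the project name prose and the spider name gsw used throughout this write-up (the D:\... path is only a placeholder for your own PyCharm directory), the commands look like this:

cd D:\pycharm_projects\day22
D:
scrapy startproject prose
cd prose
scrapy genspider gsw gushiwen.cn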
2. Configuring files
settings.py

LOG_LEVEL = 'WARNING'
BOT_NAME = 'prose'

SPIDER_MODULES = ['prose.spiders']
NEWSPIDER_MODULE = 'prose.spiders'

ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36 Edg/92.0.902.62',
}

ITEM_PIPELINES = {
    'prose.pipelines.ProsePipeline': 300,
}
To use variables, classes, and methods defined in another .py file, import them into the file where they are needed.
Method 1: direct import. PyCharm treats the directory it opened as the root, so import starting from that top level, e.g. from day22.mySpider.mySpider.items import MyspiderItem
Method 2: set the root directory yourself. Right-click the folder you want to use as the root, choose Mark Directory as -> Sources Root, and then import relative to that folder, e.g. from mySpider.items import MyspiderItem
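A third option, not part of the two methods above, is to put the project root on sys.path yourself before importing; this can help when the code runs outside PyCharm. A minimal sketch, where the path is a placeholder for the outer mySpider folder of your own project:

import sys

sys.path.insert(0, r'D:\day22\mySpider')  # placeholder: the scrapy project root

from mySpider.items import MyspiderItem  # now resolves without marking Sources Root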
Right-click the "next page" link and inspect it: the link is an a tag whose href value is the URL of the next page, so pagination can be driven by that href attribute value.
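Inside a spider this can be done with response.follow, which also resolves a relative href against the current page. A minimal sketch that only follows the "next page" link (the spider name nextpage_demo is made up for this illustration; the real spider below builds a scrapy.Request explicitly instead):

import scrapy


class NextPageSpider(scrapy.Spider):
    name = 'nextpage_demo'  # hypothetical name, only for this sketch
    allowed_domains = ['gushiwen.cn']
    start_urls = ['https://www.gushiwen.cn/default_1.aspx']

    def parse(self, response):
        next_href = response.xpath('.//a[@id="amore"]/@href').get()
        if next_href:
            # response.follow resolves a relative href against the current URL
            yield response.follow(next_href, callback=self.parse)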
3. Other project files
Startup script
start.py

from scrapy import cmdline

# cmdline.execute(['scrapy', 'crawl', 'gsw'])     # Method 1: pass the command as a list
cmdline.execute('scrapy crawl gsw'.split(" "))    # Method 2: split the command string into a list
Pipeline File
pipelines.py

import json


class ProsePipeline:
    # First method: open the file on every item
    # def process_item(self, item, spider):
    #     # print(item)  # test whether the pipeline receives data; print it when it arrives
    #     # item is passed in as an object; cast it to a dict and then to a JSON string.
    #     # ensure_ascii=False keeps Chinese characters readable
    #     item_json = json.dumps(dict(item), ensure_ascii=False)
    #     with open('prose.txt', 'a', encoding='utf-8') as f:
    #         f.write(item_json + '\n')
    #     return item

    # Second method: open the file once when the spider starts and close it when it finishes
    def open_spider(self, spider):
        self.f = open('prose.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # print(item)  # test whether the pipeline receives data; print it when it arrives
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.f.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        self.f.close()
items.py

import scrapy


class ProseItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    dynasty = scrapy.Field()
    contents = scrapy.Field()
Crawler Files
gsw.py

import scrapy
from prose.items import ProseItem  # import the ProseItem class from items.py


class GswSpider(scrapy.Spider):
    name = 'gsw'
    allowed_domains = ['gushiwen.cn']
    start_urls = ['http://gushiwen.cn/']

    def parse(self, response):
        div_sons = response.xpath('.//div[@class="left"]/div[@class="sons"]')
        for div_son in div_sons:
            # Title of the poem
            title = div_son.xpath('.//b/text()').get()
            # Author and dynasty of the poem
            # getall() returns a list; the values are taken out by index
            # Note: without the leading "." the xpath is evaluated against the whole page,
            # so the same data would be printed repeatedly
            # There is other content (pictures etc.) between the poems, so empty values appear
            try:
                name = div_son.xpath('.//p/a/text()').getall()
                author = name[0]   # author
                dynasty = name[1]  # dynasty
                # extract() is equivalent to getall()
                # the list elements from xpath have whitespace at both ends; strip() removes it
                # joining the list elements with an empty string returns one string
                contson = div_son.xpath('.//div[@class="contson"]/text()').extract()
                contents = ''.join(contson).strip()

                # First method: plain dictionary
                # item = {}
                # item['title'] = title
                # item['author'] = author
                # item['dynasty'] = dynasty
                # item['contents'] = contents

                # Second method: instantiate an item object. Import the ProseItem class from
                # items.py, where the fields were defined in advance
                item = ProseItem(title=title, author=author, dynasty=dynasty, contents=contents)
                # print(item)
                yield item  # send the item to the pipeline
            except IndexError:
                print(title)

        next_page = response.xpath('.//a[@id="amore"]/@href').get()
        # print(next_page)
        # If the next page's URL was found, keep turning pages
        if next_page:
            # First method:
            # req = scrapy.Request(next_page)
            # yield req  # back to the engine, then on through the scheduler, downloader, spider, pipeline, ...
            yield scrapy.Request(
                url=next_page,
                # callback chooses which function parses this url; it defaults to the current
                # function, so it can be omitted when parse() handles the next page too
                # callback=self.parse
            )
The console output will show some blank values, because there are pictures and other non-poem blocks between the poems on the page, and matching those blocks yields empty results.
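Those blanks can also be filtered with an explicit check instead of the try/except used above. A minimal, self-contained sketch against a hand-written HTML fragment (hypothetical markup that only mimics the structure described earlier):

from scrapy.selector import Selector

# hand-written fragment: one poem block and one image-only block
html_text = '''
<div class="left">
  <div class="sons">
    <p><a><b>Poem Title</b></a></p>
    <p><a>Author</a><a>Dynasty</a></p>
    <div class="contson">Poem line one.</div>
  </div>
  <div class="sons"><img src="banner.png"/></div>
</div>
'''

for son in Selector(text=html_text).xpath('//div[@class="left"]/div[@class="sons"]'):
    title = son.xpath('.//b/text()').get()
    name = son.xpath('.//p/a/text()').getall()
    if not title or len(name) < 2:
        continue  # an image or other non-poem block: nothing to extract, skip it
    print(title, name[0], name[1])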