Python crawler tutorial: a hands-on guide to scraping web data

Today's web is full of useful data. With a little patience and some technical means, we can collect a great deal of valuable information. The "technical means" here is the web crawler. In this post I will share some crawler basics and an introductory tutorial:

What is a crawler?

A crawler is a program that automatically fetches web content. Search engines such as Google and Baidu, for example, run huge crawler systems every day that collect data from websites all over the world so it can be served to users when they search.

Crawler process

Abstractly, a web crawler boils down to the following steps (a minimal sketch follows the list):

  • Simulate a request. Act like a browser and open the target website.
  • Get the data. Once the website is open, we can automatically obtain the data we need.
  • Save the data. After getting the data, persist it to local files, a database, or other storage.
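Put together, a minimal sketch of those three steps might look like this (it uses the Requests library introduced below; the target URL and output file name are only placeholders):

import requests

# 1. Simulate a request: fetch the target page the way a browser would
res = requests.get('https://www.douban.com')

# 2. Get the data: the returned HTML is available as a string
html = res.text

# 3. Save the data: persist it to a local file (a database would work just as well)
with open('douban_home.html', 'w', encoding='utf-8') as f:
    f.write(html)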

So how do we use Python to write our own crawler? Here I want to focus on one Python library: Requests.

Using Requests

Requests is the go-to library for making HTTP requests in Python, and it is very simple and convenient to use.

Simulate sending HTTP request

Send GET request

When we open the Douban homepage in a browser, the request we send is in fact a GET request:

import requests
res = requests.get('http://www.douban.com')
print(res)
print(type(res))
>>>
<Response [200]>
<class 'requests.models.Response'>

As you can see, what we get back is a Response object.

If we want to access the data returned by the website, we can use the text or content attribute:

text: returns the data as a string

content: returns the data as raw bytes

print(type(res.text))
print(res.text)
>>>
<class 'str'> <!DOCTYPE HTML>
<html lang="zh-cmn-Hans" class="">
<head>
<meta charset="UTF-8">
<meta name="google-site-verification" content="ok0wCgT20tBBgo9_zat2iAcimtN4Ftf5ccsh092Xeyw" />
<meta name="description" content="Provide recommendations, comments and price comparisons of books, films and music records, as well as the unique cultural life of the city.">
<meta name="keywords" content="Watercress,radio broadcast,Landing watercress">.....
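To make the difference concrete, here is a small sketch continuing with the same res object (the file name is just an example): status_code tells us whether the request succeeded, text is the decoded string, and content is the raw bytes, which is what you want when saving non-text data such as images.

print(res.status_code)    # 200 when the request succeeds
print(type(res.text))     # <class 'str'>   - decoded HTML
print(type(res.content))  # <class 'bytes'> - raw response body

with open('douban_raw.html', 'wb') as f:  # binary mode matches the bytes in res.content
    f.write(res.content)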

Send POST request

A POST request is usually used to submit a form:

r = requests.post('http://www.xxxx.com', data={"key": "value"})

data carries the form fields to submit, as a dictionary.
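As a slightly fuller, hedged example (the URL is a placeholder, and the commented-out line assumes the endpoint happens to return JSON):

import requests

payload = {"username": "demo", "password": "secret"}  # form fields as a dictionary
res = requests.post('http://www.xxxx.com/login', data=payload)
print(res.status_code)
# print(res.json())  # only if the server responds with JSON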

Adding request headers

Some websites reject requests that arrive without proper headers, so we need to enrich the headers with information such as the User-Agent (UA), Cookie, Host, and so on:

header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36",
         "Cookie": "your cookie"}


Parsing HTML

Now that we have the data returned by the web page, i.e. the HTML code, we need to parse it and extract the information we want.

BeautifulSoup

BeautifulSoup is a Python library whose main job is to pull data out of web pages.

from bs4 import BeautifulSoup  # Import BeautifulSoup
# You can pass in a string or a file handle. Usually you first fetch the page content with requests, then parse it with BeautifulSoup.
soup = BeautifulSoup(html_doc, 'html.parser')  # Specify a parser here; the built-in html.parser or lxml both work.
print(soup.prettify())  # Print the parsed soup with standard indentation.

Some simple uses of BeautifulSoup

print(soup.title)  # Get the title element of the document
print(soup.title.name)  # Get the tag name of the title element
print(soup.title.string)  # Get the text content of the title
print(soup.p)  # Get the first p node in the document
print(soup.p['class'])  # Get the class attribute of the first p node
print(soup.find_all('a'))  # Get all a nodes in the document, returned as a list
print(soup.find_all('span', attrs={'style': "color:#ff0000"}))  # Get all span nodes whose style attribute matches, returned as a list
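The nodes returned by find_all are Tag objects, so their attributes and text can be read directly. A small sketch, continuing with the same soup object:

for a in soup.find_all('a'):
    print(a.get('href'), a.get_text(strip=True))  # link target and visible text of each a node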

Their specific usage and effects will be explained in detail in the hands-on example later.

XPath positioning

XPath is the path language for XML; it navigates and locates content through elements and attributes. Here are a few commonly used expressions:

  • node: selects all child nodes of the node node
  • /: selects from the root node
  • //: selects matching nodes anywhere under the current node
  • .: the current node
  • ..: the parent of the current node
  • @: selects attributes
  • text(): the text content under the current path

Some simple examples

xpath('node')   # Select all child nodes of the node node
xpath('/div')   # Select the div element starting from the root node
xpath('//div')  # Select all div elements anywhere in the document
xpath('./div')  # Select div elements under the current node
xpath('//@id')  # Select all id attributes

Of course, XPath is very powerful, but its syntax is also relatively complex. Fortunately, we can quickly obtain an element's XPath through the Chrome developer tools (right-click the element in the Elements panel and copy its XPath).

The resulting xpath is

//*[@id="anony-nav"]/div[1]/ul/li[1]/a

In practice, whether to use BeautifulSoup or XPath is entirely a matter of personal preference; use whichever you are more familiar and comfortable with.

Crawler practice: crawling Douban posters

From a celebrity's page on Douban Movie we can reach that celebrity's photo page. Take the actress Liu Tao as an example; her photo page address is

https://movie.douban.com/celebrity/1011562/photos/

Now let's analyze this page

Target website page analysis

Note: web page structures change all the time, so what you need to learn here is the method of analysis, which you can then apply to other websites. As the saying goes, it is better to teach someone to fish than to give them a fish.

Chrome developer tools

The Chrome developer tool (press F12 to open it) is a great tool for analyzing web pages. You must use it well.

Right-click any image and select "Inspect". The developer tools open and automatically jump to the location of that image in the HTML.

It is easy to see that each picture lives in an li tag, and the picture's address sits in the img tag inside that li.

Knowing this structure, we can parse the HTML page with BeautifulSoup or XPath and pull out the image addresses.

Code writing

We only need a few lines of code to extract the image URLs:

import requests
from bs4 import BeautifulSoup

url = 'https://movie.douban.com/celebrity/1011562/photos/'
res = requests.get(url).text
content = BeautifulSoup(res, "html.parser")
data = content.find_all('div', attrs={'class': 'cover'})
picture_list = []
for d in data:
    plist = d.find('img')['src']
    picture_list.append(plist)
print(picture_list)
>>>
['https://img1.doubanio.com/view/photo/m/public/p2564834267.jpg', 'https://img1.doubanio.com/view/photo/m/public/p860687617.jpg', 'https://img1.doubanio.com/view/photo/m/public/p2174001857.jpg', 'https://img1.doubanio.com/view/photo/m/public/p1563789129.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2363429946.jpg', 'https://img1.doubanio.com/view/photo/m/public/p2382591759.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2363269182.jpg', 'https://img1.doubanio.com/view/photo/m/public/p1959495269.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2356638830.jpg', 'https://img3.doubanio.com/view/photo/m/public/p1959495471.jpg', 'https://img3.doubanio.com/view/photo/m/public/p1834379290.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2325385303.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2361707270.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2325385321.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2196488184.jpg', 'https://img1.doubanio.com/view/photo/m/public/p2186019528.jpg', 'https://img1.doubanio.com/view/photo/m/public/p2363270277.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2325240501.jpg', 'https://img1.doubanio.com/view/photo/m/public/p2258657168.jpg', 'https://img1.doubanio.com/view/photo/m/public/p2319710627.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2319710591.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2311434791.jpg', 'https://img1.doubanio.com/view/photo/m/public/p2363270708.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2258657185.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2166193915.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2363265595.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2312085755.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2311434790.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2276569205.jpg', 'https://img1.doubanio.com/view/photo/m/public/p2165332728.jpg']

As you can see, we get a very clean list containing the poster addresses.
But this only covers the posters on one page. Looking at the site, we find there are many pages. How do we handle pagination?

Handling pagination

Let's click on the second page and see how the browser URL changes:

https://movie.douban.com/celebrity/1011562/photos/?type=C&start=30&sortby=like&size=a&subtype=a

We can see that several parameters have been appended to the URL.

Click the third page and keep watching the URL:

https://movie.douban.com/celebrity/1011562/photos/?type=C&start=60&sortby=like&size=a&subtype=a

Comparing the two, we can see that only start changes, while the other parameters stay the same and can be kept as they are.

We can also tell that this start parameter acts like a page offset: start=30 is the second page, start=60 is the third page, and so on, up to the last page at start=420.
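A quick way to confirm this is to let Requests build the query string via its params argument and walk through every start offset (just a sketch for checking the pattern; only start changes between pages):

import requests

base = 'https://movie.douban.com/celebrity/1011562/photos/'
for start in range(0, 450, 30):  # 0, 30, 60, ..., 420
    params = {'type': 'C', 'start': start, 'sortby': 'like', 'size': 'a', 'subtype': 'a'}
    res = requests.get(base, params=params)  # requests encodes params into ?type=C&start=...
    print(res.url, res.status_code)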

So now we are ready to write the pagination code.

First, wrap the HTML-parsing code above in a function:

def get_poster_url(res):
    content = BeautifulSoup(res, "html.parser")
    data = content.find_all('div', attrs={'class': 'cover'})
    picture_list = []
    for d in data:
        plist = d.find('img')['src']
        picture_list.append(plist)
    return picture_list

Then handle the pagination in another function that calls the one above:

def fire():
    page = 0
    for i in range(0, 450, 30):
        print("Start crawling page %s" % page)
        url = 'https://movie.douban.com/celebrity/1011562/photos/?type=C&start={}&sortby=like&size=a&subtype=a'.format(i)
        res = requests.get(url).text
        data = get_poster_url(res)
        page += 1

At this point all the poster URLs are stored in the data variable. Now we need a downloader to save the posters:

def download_picture(pic_l):
    if not os.path.exists(r'picture'):
        os.mkdir(r'picture')
    for i in pic_l:
        pic = requests.get(i)
        p_name = i.split('/')[7]
        with open('picture/' + p_name, 'wb') as f:
            f.write(pic.content)

Add the downloader to the fire function. To avoid disturbing Douban's normal service with overly frequent requests, we also sleep for 1 second between pages:

def fire():
    page = 0
    for i in range(0, 450, 30):
        print("Start crawling page %s" % page)
        url = 'https://movie.douban.com/celebrity/1011562/photos/?type=C&start={}&sortby=like&size=a&subtype=a'.format(i)
        res = requests.get(url).text
        data = get_poster_url(res)
        download_picture(data)
        page += 1
        time.sleep(1)

Next, run the fire function. When the program finishes, a picture folder appears in the current directory containing all the posters we downloaded.

Core code explanation

Now let's take a look at the complete code

import requests
from bs4 import BeautifulSoup
import time
import os


def fire():
    page = 0
    for i in range(0, 450, 30):
        print("Start crawling page %s" % page)
        url = 'https://movie.douban.com/celebrity/1011562/photos/?type=C&start={}&sortby=like&size=a&subtype=a'.format(i)
        res = requests.get(url).text
        data = get_poster_url(res)
        download_picture(data)
        page += 1
        time.sleep(1)


def get_poster_url(res):
    content = BeautifulSoup(res, "html.parser")
    data = content.find_all('div', attrs={'class': 'cover'})
    picture_list = []
    for d in data:
        plist = d.find('img')['src']
        picture_list.append(plist)
    return picture_list


def download_picture(pic_l):
    if not os.path.exists(r'picture'):
        os.mkdir(r'picture')
    for i in pic_l:
        pic = requests.get(i)
        p_name = i.split('/')[7]
        with open('picture/' + p_name, 'wb') as f:
            f.write(pic.content)


if __name__ == '__main__':
    fire()

fire function

This is the main execution function; it uses the range function to handle pagination.

  • The range function quickly creates a sequence of integers, which is very handy in a for loop. Here 0 means counting starts at 0, 450 is the upper bound (excluded), and 30 is the step, i.e. the increment between values. range(0, 450, 30) therefore yields 0, 30, 60, ..., 420 (see the short sketch after this list).
  • The format function is a string formatting method.
  • time.sleep(1) pauses execution for 1 second.
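A tiny sketch showing those three pieces in isolation (the URL template is the one used in fire):

import time

print(list(range(0, 450, 30)))  # [0, 30, 60, ..., 420], the start offset of every page
url = 'https://movie.douban.com/celebrity/1011562/photos/?type=C&start={}&sortby=like&size=a&subtype=a'
print(url.format(30))           # format substitutes 30 into the {} placeholder
time.sleep(1)                   # pause for one second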

get_poster_url function

This is the HTML-parsing function, built with BeautifulSoup.

  • The find_all method finds all div elements whose class is cover and returns them as a list.
  • The for loop walks through that list, takes the src of each image, and appends it to picture_list.
  • append is a list method that adds an element to the end of a list.

download_picture function

Simple picture downloader

  • First, os.path.exists checks whether a picture folder already exists in the current directory.
  • The os library wraps common operating-system operations; os.mkdir creates a folder.
  • split cuts the URL into pieces, and the element at index 7 is used as the file name of the stored picture (a slightly more defensive variant is sketched after this list).
  • The with statement opens the file and closes the file handle automatically when the block ends, so there is no need to call f.close() manually.
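As an aside, the same downloader can be written a little more defensively. The variant below is only a sketch (not the code used above): os.makedirs with exist_ok replaces the manual existence check, and os.path.basename replaces the hard-coded index 7.

import os
import requests

def download_picture_safe(pic_l, folder='picture'):
    os.makedirs(folder, exist_ok=True)  # create the folder only if it is missing
    for link in pic_l:
        name = os.path.basename(link)   # last URL segment, e.g. p2564834267.jpg
        with open(os.path.join(folder, name), 'wb') as f:
            f.write(requests.get(link).content)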

Summary

This section covered the basic crawling workflow and the Python libraries and methods involved, and walked through the whole process from page analysis to data storage with a practical example. At its core, a crawler is nothing more than simulating requests, parsing data, and saving data.

Of course, websites sometimes deploy various anti-crawling mechanisms, such as cookie verification, request-frequency checks, blocking non-browser access, JavaScript obfuscation, and so on. Countermeasures are then needed, for example copying cookies into the request headers, accessing the site through proxy IPs, or using Selenium to simulate a real browser.
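Purely for illustration, the Requests side of those countermeasures often looks something like the sketch below; the cookie value and proxy address are placeholders, not working credentials:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36",
    "Cookie": "your cookie",  # copied from the browser after logging in
}
proxies = {"http": "http://127.0.0.1:8888",   # placeholder proxy, replace with a real one
           "https": "http://127.0.0.1:8888"}

session = requests.Session()  # a Session keeps cookies across requests
res = session.get('https://movie.douban.com/', headers=headers, proxies=proxies, timeout=10)
print(res.status_code)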

Since this is not a dedicated crawler course, those techniques are left for you to explore.

