Python crawler tutorial! Teaching you to crawl web data step by step
2020-10-25 14:56 · Data analysis is not a thing
Today's internet is full of useful data. With patient observation plus a few technical tools, we can collect a great deal of valuable data. The "technical tool" here is the web crawler. Today I will share some crawler basics and an introductory tutorial.
What is a crawler?
A crawler is a program that automatically fetches web content. Search engines such as Google and Baidu, for example, run huge crawler systems every day, crawling data from websites all over the world to serve users when they search.
Crawler process
Abstracted, a web crawler boils down to the following steps:
- Request the page: simulate a browser and open the target website.
- Get the data: once the site is open, automatically extract the data we need.
- Save the data: persist what we obtained to local files, a database, or other storage.
So how do we write our own crawler in Python? Here I want to focus on one Python library: Requests.
Using Requests
Requests is a Python library for making HTTP requests, and it is very simple and convenient to use.
Simulate sending HTTP request
Send GET request
When we open the Douban homepage in a browser, the underlying request we send is actually a GET request:
import requests

res = requests.get('http://www.douban.com')
print(res)
print(type(res))

>>>
<Response [200]>
<class 'requests.models.Response'>
As you can see, what we get back is a Response object.
If we want the data the website returned, we can read it from the text or content attribute:
- text: returns the data as a string
- content: returns the data as bytes (binary)
print(type(res.text))
print(res.text)

>>>
<class 'str'>
<!DOCTYPE HTML>
<html lang="zh-cmn-Hans" class="">
<head>
<meta charset="UTF-8">
<meta name="google-site-verification" content="ok0wCgT20tBBgo9_zat2iAcimtN4Ftf5ccsh092Xeyw" />
<meta name="description" content="Provide recommendations, comments and price comparisons of books, films and music records, as well as the unique cultural life of the city.">
<meta name="keywords" content="Watercress,radio broadcast,Landing watercress">
.....
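The content attribute is what you want for binary payloads. Below is a minimal sketch of my own (not from the original article) that re-reads the same response as bytes and writes it straight to a file; the file name douban.html is just an illustrative choice:

import requests

res = requests.get('http://www.douban.com')
print(res.status_code)      # e.g. 200 when the request succeeds
print(type(res.content))    # <class 'bytes'> -- the raw binary payload

# Binary data can be written directly to disk, which is how images are saved later on
with open('douban.html', 'wb') as f:
    f.write(res.content)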
Send POST request
A POST request is usually used to submit a form:

r = requests.post('http://www.xxxx.com', data={"key": "value"})

The data argument carries the form information to be sent, as a dictionary.
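As a small illustration of my own (the endpoint httpbin.org is my choice for testing, not part of the original article), submitting a form and inspecting the response might look like this:

import requests

# httpbin.org echoes back whatever it receives, which makes it handy for experiments
res = requests.post('https://httpbin.org/post', data={"name": "douban", "page": 1})
print(res.status_code)       # 200 if the request succeeded
print(res.json()['form'])    # the form fields the server received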
Header enhancement
Some websites reject requests that carry no headers at all, so we need to enrich the headers with information such as the User-Agent (UA), Cookie, Host, and so on.
header?=?{"User-Agent":?"Mozilla/5.0?(Windows?NT?10.0;?Win64;?x64)?AppleWebKit/537.36?(KHTML,?like?Gecko)?Chrome/76.0.3809.100?Safari/537.36", ?????????"Cookie":?"your?cookie"} res?=?requests.get('http://www.xxx.com',?headers=header)
Parsing HTML
Now that we have obtained the data returned by the web page, i.e. the HTML code, we need to parse it to extract the information we want.
BeautifulSoup
BeautifulSoup is a Python library whose main job is to extract data from web pages.
from bs4 import BeautifulSoup  # Import BeautifulSoup

# You can pass in a string or a file handle. Generally you first fetch the web content
# with the requests library and then parse it with the soup.
soup = BeautifulSoup(html_doc, 'html.parser')  # A parser must be specified here; the default html.parser or lxml both work.

print(soup.prettify())  # Output the parsed soup in a neatly indented format
Some simple uses of BeautifulSoup
print(soup.title)              # Get the document's title tag
print(soup.title.name)         # Get the tag name of title
print(soup.title.string)       # Get the text content of the title
print(soup.p)                  # Get the first p node in the document
print(soup.p['class'])         # Get the class attribute of the first p node
print(soup.find_all('a'))      # Get all a nodes in the document, returned as a list
print(soup.find_all('span', attrs={'style': "color:#ff0000"}))  # Get all span nodes with a matching style, returned as a list
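To try these calls without fetching a real page, here is a minimal, self-contained sketch of my own with a small hand-written html_doc string (not from the original article):

from bs4 import BeautifulSoup

# A tiny hand-written document used only for illustration
html_doc = """
<html><head><title>Demo page</title></head>
<body>
<p class="intro">Hello</p>
<a href="https://example.com/a">link a</a>
<a href="https://example.com/b">link b</a>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.title.string)                          # Demo page
print(soup.p['class'])                            # ['intro']
print([a['href'] for a in soup.find_all('a')])    # ['https://example.com/a', 'https://example.com/b']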
The specific usage and effects will be shown in detail in the hands-on example later.
XPath positioning
XPath is a path language for XML; it navigates and locates content through elements and attributes. Some commonly used expressions:
- nodename: selects all child nodes of the named node
- /: selects from the root node
- //: selects matching nodes anywhere in the document, regardless of position
- .: selects the current node
- ..: selects the parent of the current node
- @: selects attributes
- text(): selects the text content under the current path
Some simple examples
xpath('node')    # Select all child nodes of the node element
xpath('/div')    # Select the div element starting from the root node
xpath('//div')   # Select all div elements in the document
xpath('./div')   # Select the div element under the current node
xpath('//@id')   # Select all id attributes
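The article does not tie these expressions to a specific Python library; one common option (my assumption here) is lxml. A minimal sketch, assuming lxml is installed:

from lxml import etree

# Tiny hand-written HTML used only for illustration
html = '<div id="nav"><ul><li><a href="https://movie.douban.com">movie</a></li></ul></div>'

tree = etree.HTML(html)                  # Parse the HTML into an element tree
print(tree.xpath('//a/@href'))           # ['https://movie.douban.com']  -- @ selects attributes
print(tree.xpath('//li/a/text()'))       # ['movie']                     -- text() selects text content
print(tree.xpath('//div[@id="nav"]'))    # [<Element div at ...>]        -- // searches the whole document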
XPath is very powerful, but its syntax is also fairly complex. Fortunately, we can quickly obtain an element's XPath with the Chrome developer tools, as shown in the figure below.
The resulting XPath is:
//*[@id="anony-nav"]/div[1]/ul/li[1]/a
In practice, whether you use BeautifulSoup or XPath is entirely a matter of personal preference; use whichever you are more skilled and comfortable with.
Crawler practice: crawling Douban posters
From a celebrity's page on Douban Movie we can open that celebrity's photo page. Taking Liu Tao as an example, her photo page address is
https://movie.douban.com/celebrity/1011562/photos/
Now let's analyze this page
Target website page analysis
Note: the structure of web pages changes all the time, so what you need to learn here is the method of analysis, which you can then apply to other websites. As the saying goes, it is better to teach a man to fish than to give him a fish.
Chrome developer tools
Chrome developer tools (press F12 to open them) are a great aid for analyzing web pages and well worth mastering.
Right-click on any image and select "Inspect". The developer tools open with the image's element automatically highlighted.
We can clearly see that each picture sits inside an li tag, and the picture's address is stored in the img tag within that li.
Knowing this structure, we can parse the HTML page with BeautifulSoup or XPath to extract the image addresses.
Code writing
Only a few lines of code are needed to extract the image URLs:
import requests
from bs4 import BeautifulSoup

url = 'https://movie.douban.com/celebrity/1011562/photos/'
res = requests.get(url).text
content = BeautifulSoup(res, "html.parser")
data = content.find_all('div', attrs={'class': 'cover'})
picture_list = []
for d in data:
    plist = d.find('img')['src']
    picture_list.append(plist)
print(picture_list)

>>>
['https://img1.doubanio.com/view/photo/m/public/p2564834267.jpg', 'https://img1.doubanio.com/view/photo/m/public/p860687617.jpg', 'https://img1.doubanio.com/view/photo/m/public/p2174001857.jpg', 'https://img1.doubanio.com/view/photo/m/public/p1563789129.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2363429946.jpg', 'https://img1.doubanio.com/view/photo/m/public/p2382591759.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2363269182.jpg', 'https://img1.doubanio.com/view/photo/m/public/p1959495269.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2356638830.jpg', 'https://img3.doubanio.com/view/photo/m/public/p1959495471.jpg', 'https://img3.doubanio.com/view/photo/m/public/p1834379290.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2325385303.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2361707270.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2325385321.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2196488184.jpg', 'https://img1.doubanio.com/view/photo/m/public/p2186019528.jpg', 'https://img1.doubanio.com/view/photo/m/public/p2363270277.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2325240501.jpg', 'https://img1.doubanio.com/view/photo/m/public/p2258657168.jpg', 'https://img1.doubanio.com/view/photo/m/public/p2319710627.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2319710591.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2311434791.jpg', 'https://img1.doubanio.com/view/photo/m/public/p2363270708.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2258657185.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2166193915.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2363265595.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2312085755.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2311434790.jpg', 'https://img3.doubanio.com/view/photo/m/public/p2276569205.jpg', 'https://img1.doubanio.com/view/photo/m/public/p2165332728.jpg']
As you can see, we get a very clean list containing the poster addresses.
But this is only the data from a single page of posters. Looking at the site, we find there are many pages, so how do we handle pagination?
Handling pagination
Let's click on the second page and watch how the browser URL changes:
https://movie.douban.com/celebrity/1011562/photos/?type=C&start=30&sortby=like&size=a&subtype=a
We can see that several parameters have been appended to the URL.
Click the third page and keep observing the URL:
https://movie.douban.com/celebrity/1011562/photos/?type=C&start=60&sortby=like&size=a&subtype=a
By observation, only start changes, while the other parameters stay fixed and can be copied as-is.
We can also tell that start acts like a page offset: start=30 is the second page, start=60 is the third page, and so on, up to start=420 for the last page.
So now we are ready to write the pagination code.
First, wrap the HTML-parsing code above into a function:
def get_poster_url(res):
    content = BeautifulSoup(res, "html.parser")
    data = content.find_all('div', attrs={'class': 'cover'})
    picture_list = []
    for d in data:
        plist = d.find('img')['src']
        picture_list.append(plist)
    return picture_list
Then handle the pagination in another function that calls the one above:
def fire():
    page = 0
    for i in range(0, 450, 30):
        print("Start crawling page %s" % page)
        url = 'https://movie.douban.com/celebrity/1011562/photos/?type=C&start={}&sortby=like&size=a&subtype=a'.format(i)
        res = requests.get(url).text
        data = get_poster_url(res)
        page += 1
At this point the poster URLs for each page end up in the data variable, and we need a downloader to save the posters:
def download_picture(pic_l):
    if not os.path.exists(r'picture'):
        os.mkdir(r'picture')
    for i in pic_l:
        pic = requests.get(i)
        p_name = i.split('/')[7]
        with open(os.path.join('picture', p_name), 'wb') as f:
            f.write(pic.content)
Add the downloader to the fire function. To avoid disrupting Douban's normal service with overly frequent requests, we also sleep for 1 second between pages:
def fire():
    page = 0
    for i in range(0, 450, 30):
        print("Start crawling page %s" % page)
        url = 'https://movie.douban.com/celebrity/1011562/photos/?type=C&start={}&sortby=like&size=a&subtype=a'.format(i)
        res = requests.get(url).text
        data = get_poster_url(res)
        download_picture(data)
        page += 1
        time.sleep(1)
Finally, run the fire function. When the program finishes, a picture folder will appear in the current directory containing all the posters we downloaded.
Core code explanation
Now let's take a look at the complete code
import requests
from bs4 import BeautifulSoup
import time
import os


def fire():
    page = 0
    for i in range(0, 450, 30):
        print("Start crawling page %s" % page)
        url = 'https://movie.douban.com/celebrity/1011562/photos/?type=C&start={}&sortby=like&size=a&subtype=a'.format(i)
        res = requests.get(url).text
        data = get_poster_url(res)
        download_picture(data)
        page += 1
        time.sleep(1)


def get_poster_url(res):
    content = BeautifulSoup(res, "html.parser")
    data = content.find_all('div', attrs={'class': 'cover'})
    picture_list = []
    for d in data:
        plist = d.find('img')['src']
        picture_list.append(plist)
    return picture_list


def download_picture(pic_l):
    if not os.path.exists(r'picture'):
        os.mkdir(r'picture')
    for i in pic_l:
        pic = requests.get(i)
        p_name = i.split('/')[7]
        with open(os.path.join('picture', p_name), 'wb') as f:
            f.write(pic.content)


if __name__ == '__main__':
    fire()
fire function
This is a main execution function that uses the range function to handle paging.
- The range function quickly creates a sequence of integers, which is very handy in a for loop. Here, 0 means counting starts at 0, 450 means the iteration stops before 450 (450 itself is excluded), and 30 is the step size, i.e. the increment of each step. range(0, 450, 30) produces 0, 30, 60, ..., 420 (see the short sketch after this list).
- The format function is a string-formatting method.
- time.sleep(1) pauses execution for 1 second.
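A quick illustration of my own showing what the range call and the format template above actually produce:

offsets = list(range(0, 450, 30))
print(offsets)       # [0, 30, 60, 90, 120, 150, 180, 210, 240, 270, 300, 330, 360, 390, 420]
print(offsets[-1])   # 420 -- the start value of the last page

url_template = 'https://movie.douban.com/celebrity/1011562/photos/?type=C&start={}&sortby=like&size=a&subtype=a'
print(url_template.format(offsets[1]))   # the URL of the second page (start=30)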
get_poster_url function
This function parses the HTML using BeautifulSoup.
- The find_all method finds all div elements whose class is "cover" and returns them as a list.
- The for loop iterates over that list, pulls out the src of each img, and appends it to picture_list.
- append is a list method that adds an element to the end of a list.
download_picture function
Simple picture downloader
- First, os.path.exists checks whether a picture folder already exists in the current directory.
- The os library is a commonly used library for interacting with the operating system; os.mkdir creates a folder.
- split cuts the URL string, and the element at index 7 is taken as the file name of the saved picture (see the short example after this list).
- The with statement is a convenient way to open a file; it closes the file handle automatically, so there is no need to call f.close() manually.
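To make the index-7 part concrete, here is a tiny illustration of my own using one of the poster URLs from the output above:

url = 'https://img1.doubanio.com/view/photo/m/public/p2564834267.jpg'
parts = url.split('/')
# ['https:', '', 'img1.doubanio.com', 'view', 'photo', 'm', 'public', 'p2564834267.jpg']
print(parts[7])   # p2564834267.jpg -- used as the saved file's name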
Summary
This section has covered the basic crawling workflow and the Python libraries and methods it relies on, walking through the whole process from page analysis to data storage with a practical example. At heart, a crawler is nothing more than simulating requests, parsing data, and saving data.
Of course, websites sometimes deploy various anti-crawling mechanisms, such as cookie verification, request-frequency checks, blocking non-browser access, JavaScript obfuscation, and so on. Countermeasures are then needed, for example putting captured cookies into the request headers, accessing the site through proxy IPs, or using Selenium to simulate a real browser.
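As a hedged sketch of the first two countermeasures only (the cookie value and proxy address below are placeholders, not working credentials), with the requests library this might look like:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36",
    "Cookie": "your cookie",            # a cookie captured from a logged-in browser session
}
proxies = {
    "http": "http://127.0.0.1:8080",    # placeholder proxy address
    "https": "http://127.0.0.1:8080",
}

res = requests.get('https://movie.douban.com/', headers=headers, proxies=proxies, timeout=10)
print(res.status_code)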
Since this is not a dedicated crawler course, these techniques are left for you to explore on your own.