3.4 Scraping the Maoyan TOP100 Movie Rankings

In this section, we use the requests library and regular expressions to scrape the Maoyan TOP100 movie chart. requests is more convenient to use than urllib, and since we have not yet systematically covered any HTML parsing library, we will use regular expressions as the parsing tool.

1. Objectives of this section

In this section, we will extract the name, release time, score, poster image, and other information for each movie in the Maoyan TOP100. The target URL is http://maoyan.com/board/4, and the extracted results will be saved to a file.

2. Preparation

Before starting this section, make sure the requests library is properly installed. If it is not, refer to the installation instructions in Chapter 1.

3. Crawl analysis

The target site we need to crawl is http://maoyan.com/board/4. Open it and you can see the ranking list, as shown in Figure 3-11.

Figure 3-11 list information

The top-ranked film is Farewell My Concubine. For each movie, the page displays the name, leading actors, release time, release region, score, poster image, and other information.

Scroll to the bottom of the page to find the pagination list. Click Page 2 and observe how the URL and page content change, as shown in Figure 3-12.

Figure 3-12 page URL change

You can see that the URL becomes http://maoyan.com/board/4?offset=10, with one extra parameter compared with the previous URL: offset=10. The results now displayed are the movies ranked 11-20, so we can tentatively infer that offset is an offset parameter. Click the next page: the URL becomes http://maoyan.com/board/4?offset=20, offset becomes 20, and the movies ranked 21-30 are displayed.

From this we can deduce the rule: offset is an offset value. If offset is n, the movies displayed are those ranked n+1 to n+10, with 10 per page. Therefore, to obtain all TOP100 movies, we only need to make 10 separate requests with offset set to 0, 10, 20, ..., 90. After fetching each page, we can extract the relevant information with regular expressions and thereby obtain all TOP100 movies.
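The offset rule above can be sketched as a quick URL enumeration (just a sketch to verify the rule; the actual script later builds each URL inside its main method):

```python
# Enumerate the 10 page URLs implied by the offset rule: 0, 10, ..., 90.
base = 'http://maoyan.com/board/4?offset={}'
urls = [base.format(offset) for offset in range(0, 100, 10)]
for url in urls:
    print(url)
```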

4. Grabbing the home page

Next, let's implement this process in code. First, grab the content of the first page. We implement a get_one_page method that takes a url parameter and returns the fetched page content, then call it from a main method. The preliminary implementation is as follows:

import requests  

def get_one_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 '
            '(KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'
    }

    response = requests.get(url, headers=headers)  
    if response.status_code == 200:  
        return response.text  
    return None  

def main():  
    url = 'http://maoyan.com/board/4'  
    html = get_one_page(url)  
    print(html)  

main()

After running this, you can successfully obtain the source code of the home page. With the source code in hand, we next need to parse the page and extract the information we want.

5. Extracting with regular expressions

Next, go back to the web page to look at its real source code. View it in the Network tab of the browser's developer tools, as shown in Figure 3-13.

Figure 3-13 source code

Note that you should not inspect the source code in the Elements tab here, because the source there may have been modified by JavaScript and differ from the original response. Instead, view the response of the original request in the Network tab.

View the source code of one of the entries, as shown in Figure 3-14.


Figure 3-14 source code

As you can see, each movie's information corresponds to a dd node. We use regular expressions to extract the movie information from it. First, extract the ranking, which is inside the i node whose class is board-index. Here non-greedy matching is used to extract the text of the i node. The regular expression is:

<dd>.*?board-index.*?>(.*?)</i>

Next, extract the movie's poster image. There is an a node after it with two img nodes inside. On inspection, the data-src attribute of the second img node holds the image link, so we extract the data-src attribute of the second img node. The regular expression is rewritten as:

<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)"

Next, extract the name of the movie. It is in the following p node whose class is name. We can therefore use name as a marker and extract the text of the a node inside it. The regular expression is rewritten as:

<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>

The starring actors, release time, score, and so on are extracted in the same way. The final regular expression is:

<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>

Such a regular expression matches one movie's entry, capturing 7 pieces of information. Next, extract all the matches by calling the findall method.
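To see the regular expression in action, here is a minimal sketch applying it with re.findall. The HTML snippet is a simplified, hypothetical fragment modeled on the dd structure described above, not the site's actual markup:

```python
import re

# Simplified stand-in for one dd node of the ranking page (hypothetical markup)
html = '''<dd>
<i class="board-index board-index-1">1</i>
<a href="/films/1200486"><img data-src="http://example.com/poster.jpg"></a>
<p class="name"><a href="/films/1200486">Farewell My Concubine</a></p>
<p class="star">Starring: Zhang Guorong</p>
<p class="releasetime">Release: 1993-01-01</p>
<p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>
</dd>'''

# re.S makes . match newlines, so one pattern can span the whole dd node
pattern = re.compile(
    '<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>'
    '.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>'
    '.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S)

# findall returns one 7-element tuple per movie
items = re.findall(pattern, html)
print(items)
```

Note the re.S flag: without it, `.` would not match newlines and the pattern would fail on multi-line markup.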

Next, we define a parse_one_page method to parse the page. It mainly uses the regular expression above to extract the content we want from the result returned by get_one_page. The implementation is as follows:

import re

def parse_one_page(html):
    pattern = re.compile(
        '<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>'
        '.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>'
        '.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    print(items)

In this way, you can successfully extract the information of all 10 movies on a page, in the form of a list. The output is as follows:

[('1', 'http://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg@160w_220h_1e_1c', 'Farewell My Concubine', '\n Starring: Zhang Guorong, Zhang Fengyi, Gong Li\n ', 'Release time: 1993-01-01 (Hong Kong, China)', '9.', '6'),
('2', 'http://p0.meituan.net/movie/__40191813__4767047.jpg@160w_220h_1e_1c', 'The Shawshank Redemption', '\n Starring: Tim Robbins, Morgan Freeman, Bob Gunton\n ', 'Release time: 1994-10-14 (USA)', '9.', '5'),
('3', 'http://p0.meituan.net/movie/fc9d78dd2ce84d20e53b6d1ae2eea4fb1515304.jpg@160w_220h_1e_1c', 'Léon: The Professional', '\n Starring: Jean Reno, Gary Oldman, Natalie Portman\n ', 'Release time: 1994-09-14 (France)', '9.', '5'),
('4', 'http://p0.meituan.net/movie/23/6009725.jpg@160w_220h_1e_1c', 'Roman Holiday', '\n Starring: Gregory Peck, Audrey Hepburn, Eddie Albert\n ', 'Release time: 1953-09-02 (USA)', '9.', '1'),
('5', 'http://p0.meituan.net/movie/53/1541925.jpg@160w_220h_1e_1c', 'Forrest Gump', '\n Starring: Tom Hanks, Robin Wright, Gary Sinise\n ', 'Release time: 1994-07-06 (USA)', '9.', '4'),
('6', 'http://p0.meituan.net/movie/11/324629.jpg@160w_220h_1e_1c', 'Titanic', '\n Starring: Leonardo DiCaprio, Kate Winslet, Billy Zane\n ', 'Release time: 1998-04-03', '9.', '5'),
('7', 'http://p0.meituan.net/movie/99/678407.jpg@160w_220h_1e_1c', 'My Neighbor Totoro', '\n Starring: Noriko Hidaka, Chika Sakamoto, Shigesato Itoi\n ', 'Release time: 1988-04-16 (Japan)', '9.', '2'),
('8', 'http://p0.meituan.net/movie/92/8212889.jpg@160w_220h_1e_1c', 'The Godfather', '\n Starring: Marlon Brando, Al Pacino, James Caan\n ', 'Release time: 1972-03-24 (USA)', '9.', '3'),
('9', 'http://p0.meituan.net/movie/62/109878.jpg@160w_220h_1e_1c', 'Flirting Scholar', '\n Starring: Stephen Chow, Gong Li, Zheng Peipei\n ', 'Release time: 1993-07-01 (Hong Kong, China)', '9.', '2'),
('10', 'http://p0.meituan.net/movie/9bf7d7b81001a9cf8adbac5a7cf7d766132425.jpg@160w_220h_1e_1c', 'Spirited Away', '\n Starring: Rumi Hiiragi, Miyu Irino, Mari Natsuki\n ', 'Release time: 2001-07-20 (Japan)', '9.', '3')]

However, this is not enough. The data is still messy, so let's process the matched results further: traverse them and generate a dictionary for each movie. The method is rewritten as follows:

def parse_one_page(html):
    pattern = re.compile(
        '<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>'
        '.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>'
        '.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2].strip(),
            'actor': item[3].strip()[3:] if len(item[3]) > 3 else '',
            'time': item[4].strip()[5:] if len(item[4]) > 5 else '',
            'score': item[5].strip() + item[6].strip()
        }

In this way, the ranking, image, title, actors, time, and score of each film are successfully extracted and assembled into dictionaries, forming structured data. The output is as follows:

{'image': 'http://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg@160w_220h_1e_1c', 'actor': 'Zhang Guorong, Zhang Fengyi, Gong Li', 'score': '9.6', 'index': '1', 'title': 'Farewell My Concubine', 'time': '1993-01-01 (Hong Kong, China)'}
{'image': 'http://p0.meituan.net/movie/__40191813__4767047.jpg@160w_220h_1e_1c', 'actor': 'Tim Robbins, Morgan Freeman, Bob Gunton', 'score': '9.5', 'index': '2', 'title': 'The Shawshank Redemption', 'time': '1994-10-14 (USA)'}
{'image': 'http://p0.meituan.net/movie/fc9d78dd2ce84d20e53b6d1ae2eea4fb1515304.jpg@160w_220h_1e_1c', 'actor': 'Jean Reno, Gary Oldman, Natalie Portman', 'score': '9.5', 'index': '3', 'title': 'Léon: The Professional', 'time': '1994-09-14 (France)'}
{'image': 'http://p0.meituan.net/movie/23/6009725.jpg@160w_220h_1e_1c', 'actor': 'Gregory Peck, Audrey Hepburn, Eddie Albert', 'score': '9.1', 'index': '4', 'title': 'Roman Holiday', 'time': '1953-09-02 (USA)'}
{'image': 'http://p0.meituan.net/movie/53/1541925.jpg@160w_220h_1e_1c', 'actor': 'Tom Hanks, Robin Wright, Gary Sinise', 'score': '9.4', 'index': '5', 'title': 'Forrest Gump', 'time': '1994-07-06 (USA)'}
{'image': 'http://p0.meituan.net/movie/11/324629.jpg@160w_220h_1e_1c', 'actor': 'Leonardo DiCaprio, Kate Winslet, Billy Zane', 'score': '9.5', 'index': '6', 'title': 'Titanic', 'time': '1998-04-03'}
{'image': 'http://p0.meituan.net/movie/99/678407.jpg@160w_220h_1e_1c', 'actor': 'Noriko Hidaka, Chika Sakamoto, Shigesato Itoi', 'score': '9.2', 'index': '7', 'title': 'My Neighbor Totoro', 'time': '1988-04-16 (Japan)'}
{'image': 'http://p0.meituan.net/movie/92/8212889.jpg@160w_220h_1e_1c', 'actor': 'Marlon Brando, Al Pacino, James Caan', 'score': '9.3', 'index': '8', 'title': 'The Godfather', 'time': '1972-03-24 (USA)'}
{'image': 'http://p0.meituan.net/movie/62/109878.jpg@160w_220h_1e_1c', 'actor': 'Stephen Chow, Gong Li, Zheng Peipei', 'score': '9.2', 'index': '9', 'title': 'Flirting Scholar', 'time': '1993-07-01 (Hong Kong, China)'}
{'image': 'http://p0.meituan.net/movie/9bf7d7b81001a9cf8adbac5a7cf7d766132425.jpg@160w_220h_1e_1c', 'actor': 'Rumi Hiiragi, Miyu Irino, Mari Natsuki', 'score': '9.3', 'index': '10', 'title': 'Spirited Away', 'time': '2001-07-20 (Japan)'}

So far, we have successfully extracted single page movie information.

6. Writing to a file

Next, we write the extracted results to a file, here simply a text file. Each dictionary is serialized with the dumps method of the json library, and the ensure_ascii parameter is set to False, which ensures the output is written as Chinese characters rather than Unicode escape sequences. The code is as follows:

import json

def write_to_file(content):
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

Calling the write_to_file method writes a dictionary to the text file. The content parameter here is the extraction result of one movie, which is a dictionary.
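To see why ensure_ascii matters here, compare the two serializations directly (a small sketch with an illustrative dictionary):

```python
import json

item = {'title': '霸王别姬', 'score': '9.6'}

# Default behavior: non-ASCII characters are escaped to \uXXXX sequences
print(json.dumps(item))
# With ensure_ascii=False, the Chinese text is written as-is
print(json.dumps(item, ensure_ascii=False))
```

The first call produces output like `{"title": "\u9738\u738b\u522b\u59ec", ...}`, which is hard to read in the result file; the second keeps the Chinese characters readable, which is why the script passes ensure_ascii=False.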

7. Integrating the code

Finally, implement the main method to call the previously defined methods and write the single-page movie results to the file. The relevant code is as follows:

def main():  
    url = 'http://maoyan.com/board/4'  
    html = get_one_page(url)  
    for item in parse_one_page(html):  
        write_to_file(item)

So far, we have completed the single-page extraction: the 10 movies on the home page can be successfully extracted and saved to the text file.

8. Paginated crawling

Since we want to crawl the TOP100 movies, we need to traverse the remaining pages, passing in the offset parameter to crawl the other 90 movies. Add the following call:

if __name__ == '__main__':  
    for i in range(10):  
        main(offset=i * 10)

For this, modify the main method to receive an offset value as the offset and use it to construct the URL to crawl. The implementation is as follows:

def main(offset):  
    url = 'http://maoyan.com/board/4?offset=' + str(offset)  
    html = get_one_page(url)  
    for item in parse_one_page(html):  
        print(item)  
        write_to_file(item)

With that, our Maoyan TOP100 crawler is complete. After a little tidying up, the full code is as follows:

import json  
import requests  
from requests.exceptions import RequestException  
import re  
import time  

def get_one_page(url):  
    try:  
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 '
                '(KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'
        }

        response = requests.get(url, headers=headers)  
        if response.status_code == 200:  
            return response.text  
        return None  
    except RequestException:  
        return None  

def parse_one_page(html):
    pattern = re.compile(
        '<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
        '.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
        '.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)  
    for item in items:  
        yield {'index': item[0],  
            'image': item[1],  
            'title': item[2],  
            'actor': item[3].strip()[3:],  
            'time': item[4].strip()[5:],  
            'score': item[5] + item[6]  
        }  

def write_to_file(content):  
    with open('result.txt', 'a', encoding='utf-8') as f:  
        f.write(json.dumps(content, ensure_ascii=False) + '\n')  

def main(offset):  
    url = 'http://maoyan.com/board/4?offset=' + str(offset)  
    html = get_one_page(url)  
    for item in parse_one_page(html):  
        print(item)  
        write_to_file(item)  

if __name__ == '__main__':  
    for i in range(10):  
        main(offset=i * 10)  
        time.sleep(1)

Maoyan now has many anti-crawling measures in place; if requests come too fast, the site stops responding, so a delay is added here.

9. Running results

Finally, run the code. The output is similar to the following:

{'index': '1', 'image': 'http://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg@160w_220h_1e_1c', 'title': 'Farewell My Concubine', 'actor': 'Zhang Guorong, Zhang Fengyi, Gong Li', 'time': '1993-01-01 (Hong Kong, China)', 'score': '9.6'}
{'index': '2', 'image': 'http://p0.meituan.net/movie/__40191813__4767047.jpg@160w_220h_1e_1c', 'title': 'The Shawshank Redemption', 'actor': 'Tim Robbins, Morgan Freeman, Bob Gunton', 'time': '1994-10-14 (USA)', 'score': '9.5'}
...
{'index': '98', 'image': 'http://p0.meituan.net/movie/76/7073389.jpg@160w_220h_1e_1c', 'title': 'Tokyo Story', 'actor': 'Chishu Ryu, Setsuko Hara, Haruko Sugimura', 'time': '1953-11-03 (Japan)', 'score': '9.1'}
{'index': '99', 'image': 'http://p0.meituan.net/movie/52/3420293.jpg@160w_220h_1e_1c', 'title': 'I Love You', 'actor': 'Song Zaihe, Li Caien, Ji Haiyan', 'time': '2011-02-17 (South Korea)', 'score': '9.0'}
{'index': '100', 'image': 'http://p1.meituan.net/movie/__44335138__8470779.jpg@160w_220h_1e_1c', 'title': 'Winged Migration', 'actor': 'Jacques Perrin, Philippe Labro', 'time': '2001-12-12 (France)', 'score': '9.1'}

The middle of the output is omitted here. As you can see, the information of all TOP100 movies has been successfully crawled.

At this time, let's look at the text file. The results are shown in Figure 3-15.

Figure 3-15 operation results

As you can see, all the movie information has been saved to the text file. It's done!

10. Code for this section

The code address of this section is: https://github.com/Python3WebSpider/MaoYan.

In this section, we practiced using requests and regular expressions by crawling the Maoyan TOP100 movie information. This is a most basic example; I hope it gives you a basic idea of how a crawler is implemented and a deeper understanding of the usage of these two libraries.

Tags: crawler

Posted by MadRhino on Tue, 31 May 2022 02:25:00 +0530