Using Python to crawl different categories of Douban movies

Using Python to crawl different categories of Douban movies

I have done a little text classification work before, and I have crawled thousands of introductions of films of different categories from Douban.

Crawling target

The goal of our crawling is Douban film and television. Open douban.com and click on a movie casually to see the introduction, comments and other information of the movie. What we need to crawl is the introduction of the movie.

thinking

Through the Network tool in the debugging tool of Chrome browser, we can see that the colleagues who load the page will send an Ajax request to query the movie list of the specified category.

The url field is the link to the details page.

On the details page, you can find the corresponding tag through the chrome debugger, right-click to view the source code, and use ctrl+f (common+f) to find that the current page has only one property="v:summary" tag.

code implementation

Since the number of crawls is relatively small, I use the lightweight crawler tool BeautifulSoup here

$ pip install bs4

The first step is to get the movie list and the url of its details page

types = ['love', 'action', 'terror']
for i in range(types):
    start = 0
    while start < 400:
        params = {
            "start": start,
            "genres": types[i]
        }
        targetUrl = url + 'start=' + str(start) + "&genres=" + types[i]
        try:
            r = requests.get(targetUrl)
        except:
            continue
        text = json.loads(r.text)
        movies = text['data']
        j = 0
        for movie in movies:
            j += 1
            info = getInfoByUrl(movie['url'])

The second step is to get the profile according to the url of the movie

def getInfoByUrl(url):
    try:
        res = requests.get(url)
        html = res.text
        soup = BeautifulSoup(html, 'lxml')
        span1 = soup.find('span', attrs={'property': 'v:summary'})
        span2 = soup.find('span', attrs={'class': 'hideen'})
        if span2 != None:
            return span2.text
        return span1.text
    except:
        return " "

Finally, save the results to different files according to the movie category

info = info.replace("\n", "")
info = info.replace("  ", "")
info = info.replace(" ", "")
print(i, start, j)
with open(files[i] + '.txt', 'a+') as f:
    f.write(info + "\n")

Crawling results

Python spider Py to start crawling. Finally, check the txt information in the current directory to get the results

The complete code has been uploaded to the official account [HackDev], and can be obtained by replying to "Douban" in the background.

Tags: Python

Posted by Online Connect on Wed, 01 Jun 2022 23:08:03 +0530