Using Python to crawl different categories of Douban movies
I recently did some text classification work, for which I crawled thousands of synopses of movies in different categories from Douban.
Crawling target
Our target is Douban Movies. Open douban.com and click on any movie to see its synopsis, comments, and other information. What we need to crawl is the synopsis of each movie.
Approach
Using the Network tab in the Chrome browser's developer tools, we can see that when the page loads, it sends an Ajax request to query the movie list for the specified category.
In the JSON response, the url field of each item is the link to that movie's details page.
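To make this concrete, here is a minimal sketch of querying that Ajax endpoint and printing the details-page link of each result. The endpoint and query parameters below are assumptions based on a typical Network-tab inspection; substitute whatever request your own inspection shows.

# Minimal sketch: query the Ajax endpoint seen in the Network tab and list
# the details-page links. The endpoint and parameters are assumptions;
# copy the real request from your own Network tab.
import json
import requests

ajax_url = ('https://movie.douban.com/j/new_search_subjects'
            '?sort=U&range=0,10&tags=&start=0&genres=love')
resp = requests.get(ajax_url)
movies = json.loads(resp.text)['data']   # the response is JSON with a 'data' list
for movie in movies:
    print(movie['url'])                  # link to this movie's details page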
On the details page, you can locate the corresponding tag with the Chrome debugger: right-click to view the page source and use Ctrl+F (Command+F on macOS) to confirm that the page has only one property="v:summary" tag.
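To see what that lookup does, here is a self-contained toy example; the HTML string is invented and only mirrors the relevant structure of a details page:

# Toy example: the HTML below is made up; it only demonstrates how the
# property="v:summary" lookup behaves.
from bs4 import BeautifulSoup

html = '<div><span property="v:summary">A sample movie synopsis.</span></div>'
soup = BeautifulSoup(html, 'lxml')
span = soup.find('span', attrs={'property': 'v:summary'})
print(span.text)   # -> A sample movie synopsis.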
Code implementation
Since the number of pages to crawl is small, I use the lightweight parsing library BeautifulSoup here, together with requests:
$ pip install bs4 lxml requests
The first step is to fetch the movie list and the url of each movie's details page.
import json
import requests

# Base Ajax URL observed in the Network tab; assumed here, so verify it
# against your own inspection.
url = 'https://movie.douban.com/j/new_search_subjects?sort=U&range=0,10&tags=&'

types = ['love', 'action', 'terror']
files = types   # one output file per category, used in step three

for i in range(len(types)):   # range(types) in the original raises a TypeError
    start = 0
    while start < 400:
        targetUrl = url + 'start=' + str(start) + '&genres=' + types[i]
        try:
            r = requests.get(targetUrl)
        except:
            continue   # retry this page on a network error
        text = json.loads(r.text)
        movies = text['data']
        j = 0
        for movie in movies:
            j += 1
            info = getInfoByUrl(movie['url'])
The second step is to fetch the synopsis from each movie's url.
from bs4 import BeautifulSoup

def getInfoByUrl(url):
    try:
        res = requests.get(url)
        html = res.text
        soup = BeautifulSoup(html, 'lxml')
        # Short synopsis shown on the page
        span1 = soup.find('span', attrs={'property': 'v:summary'})
        # Full synopsis, present but hidden when the short one is truncated;
        # 'hideen' in the original was a typo for 'hidden'
        span2 = soup.find('span', attrs={'class': 'hidden'})
        if span2 is not None:
            return span2.text
        return span1.text
    except:
        return " "
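As a quick sanity check, you can call the function on a single details page; the subject id in the url below is a placeholder, so substitute one taken from the Ajax response:

# Quick check; the subject id here is a placeholder, not a real movie.
print(getInfoByUrl('https://movie.douban.com/subject/0000000/'))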
Finally, save the results to different files according to the movie category
# These lines continue inside the for-movie loop from step one.
info = info.replace("\n", "")
info = info.replace(" ", "")
info = info.replace("\u3000", "")   # assumed: the original's second space replace targeted the full-width space
print(i, start, j)   # progress: category index, page offset, movie counter
with open(files[i] + '.txt', 'a+') as f:
    f.write(info + "\n")

# After the inner for loop, advance to the next page; the original snippet
# never incremented start, so the while loop would never terminate.
start += 20
Crawling results
Run python spider.py to start crawling. When it finishes, check the .txt files in the current directory for the results.
The complete code has been uploaded to the official account [HackDev]; reply "Douban" to the account to get it.