1. Introduction to BeautifulSoup
Beautiful Soup 4, like lxml, is an HTML/XML parser; its main job is to parse HTML/XML documents and extract data from them.
Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8, so you normally don't need to think about encodings at all. The exception is when the document doesn't declare an encoding and Beautiful Soup cannot detect it automatically; in that case you only need to state the original encoding yourself.
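A minimal sketch of both cases (assuming the local file `eg.html` used throughout this article; the `gb18030` fallback is a hypothetical example of stating the original encoding):

```python
from bs4 import BeautifulSoup

# Normally the encoding is detected automatically:
with open('eg.html', 'rb') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')
print(soup.original_encoding)  # the encoding Beautiful Soup detected

# If the document declares no encoding and detection fails,
# state the original encoding explicitly (hypothetical example):
with open('eg.html', 'rb') as f:
    soup = BeautifulSoup(f.read(), 'html.parser', from_encoding='gb18030')
```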
2. Simple use of BeautifulSoup
Suppose we have the following HTML text
```html
<!DOCTYPE html>
<html>
<head>
    <meta content="text/html;charset=utf-8" http-equiv="content-type"/>
    <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
    <meta content="always" name="referrer"/>
    <link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
    <title>Baidu once, you know</title>
</head>
<body link="#0000cc">
    <div id="wrapper">
        <div id="head">
            <div class="head_wrapper">
                <div id="u1">
                    <a class="mnav" href="http://news.baidu.com" name="tj_trnews">News</a>
                    <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a>
                    <a class="mnav" href="http://map.baidu.com" name="tj_trmap">Map</a>
                    <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">Video</a>
                    <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">Post Bar</a>
                    <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">More Products</a>
                </div>
            </div>
        </div>
    </div>
</body>
</html>
```
2.1 Creating BeautifulSoup objects
```python
from bs4 import BeautifulSoup

file = open('eg.html', 'rb')
html = file.read()
bs = BeautifulSoup(html, "html.parser")   # parse the HTML text
print(bs)
```
2.2 Getting Text Information
```python
print(bs.prettify())   # pretty-print the parsed document

# Get the <title> tag from the text above
print(bs.title)
# <title>Baidu once, you know</title>

# Get the text inside the tag
print(bs.title.text)
print(bs.title.string)
# Baidu once, you know
# Baidu once, you know

# Get the first <a> tag and its contents
print(bs.a)
# <a class="mnav" href="http://news.baidu.com" name="tj_trnews">News</a>
```
```python
# Get the first <div> tag and everything nested inside it
print(bs.div)
# Output:
# <div id="wrapper"> <div id="head"> <div class="head_wrapper"> <div id="u1"> <a class="mnav" href="http://news.baidu.com" name="tj_trnews">News</a> <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a> <a class="mnav" href="http://map.baidu.com" name="tj_trmap">Map</a> <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">Video</a> <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">Post Bar</a> <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">More Products</a> </div> </div> </div> </div>
```
2.2.1 The find and find_all methods
```python
tag_a = bs.find('a')
print(tag_a)
# Returns the first <a> tag:
# <a class="mnav" href="http://news.baidu.com" name="tj_trnews">News</a>

# Find a specific tag by its attributes
tag_a = bs.find('a', class_="bri")
print(tag_a)
# <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">More Products</a>

# find_all returns all matching <a> tags as a list
tag_all_a = bs.find_all('a')
print(tag_all_a)
# [<a class="mnav" href="http://news.baidu.com" name="tj_trnews">News</a>, <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a>, <a class="mnav" href="http://map.baidu.com" name="tj_trmap">Map</a>, <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">Video</a>, <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">Post Bar</a>, <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">More Products</a>]
```
Get the text content of all the tags above:
```python
l1 = []
for i in tag_all_a:
    l1.append(i.text)
print(l1)
# Output: ['News', 'hao123', 'Map', 'Video', 'Post Bar', 'More Products']
```
3. BeautifulSoup4's Four Object Types
BeautifulSoup4 converts a complex HTML document into a complex tree structure in which every node is a Python object. All objects fall into four types:
- Tag
- NavigableString
- BeautifulSoup
- Comment
3.1 Tag Class
Tag is a tag in HTML, for example:
```python
from bs4 import BeautifulSoup

file = open('eg.html', 'rb')
html = file.read()
bs = BeautifulSoup(html, "html.parser")

# Get the entire <title> tag
print(bs.title)
# Get the entire <head> tag
print(bs.head)
# Get the first <a> tag and everything inside it
print(bs.a)
# Its type
print(type(bs.a))   # <class 'bs4.element.Tag'>
```
We can easily retrieve these tags from the soup object by tag name, and each result has type bs4.element.Tag. Note, however, that this syntax finds only the first matching tag in the whole document.
For Tag, there are two important attributes, name and attrs:
```python
file = open('D:\\python Study\\Reptiles\\eg.html', 'rb')
html = file.read()
bs = BeautifulSoup(html, "html.parser")
bs.prettify()

print(bs.name)
print(bs.a.name)
print(bs.a.attrs)
# The bs object itself is special: its name is [document].
# For other tags, name is the tag's own name, e.g. an <a> tag's name is "a".
# attrs returns all attributes of the tag as a dictionary.
# Output:
# [document]
# a
# {'class': ['mnav'], 'href': 'http://news.baidu.com', 'name': 'tj_trnews'}

# You can also fetch a single attribute by name; these are equivalent:
print(bs.a['class'])      # same as bs.a.get('class')
# ['mnav']

# Attributes can be modified...
bs.a['class'] = "newClass"
print(bs.a['class'])
# newClass

# ...and deleted
del bs.a['class']
print(bs.a)   # the class attribute is gone
# <a href="http://news.baidu.com" name="tj_trnews">News</a>
```
3.2 NavigableString
Now that we can get a tag, the next question is how to get the text inside it. That is exactly what .string is for, for example:
```python
bs = BeautifulSoup(html, "html.parser")
print(bs.title.string)
print(type(bs.title.string))
# Baidu once, you know
# <class 'bs4.element.NavigableString'>
```
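One caveat worth knowing: .string only returns a value when a tag has exactly one child string; when a tag contains several children, .string is None, while .text concatenates all descendant text. A minimal sketch against the same sample page:

```python
# <div id="u1"> contains several <a> children, so .string is ambiguous
div = bs.find('div', id="u1")
print(div.string)  # None
print(div.text)    # all descendant text concatenated, including whitespace
```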
3.3 BeautifulSoup
The BeautifulSoup object represents the document as a whole. Most of the time you can treat it as a special Tag object: it has a type, a name, and attributes, for example:
```python
bs = BeautifulSoup(html, "html.parser")
print(type(bs.name))
print(bs.name)
print(bs.attrs)
# <class 'str'>
# [document]
# {}
```
3.4 Comment
A Comment object is a special type of NavigableString; when printed, it does not include the comment markers.
In our sample page the first <a> tag holds plain text, so we simply get a NavigableString:

```python
bs = BeautifulSoup(html, "html.parser")
print(bs.a)
print(bs.a.string)
print(type(bs.a.string))
# <a class="mnav" href="http://news.baidu.com" name="tj_trnews">News</a>
# News
# <class 'bs4.element.NavigableString'>
```
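Since the sample page contains no HTML comments, here is a minimal sketch with a hypothetical snippet that does contain one, showing the Comment type and the stripped markers:

```python
from bs4 import BeautifulSoup, Comment

snippet = '<a class="mnav" href="http://news.baidu.com"><!--News--></a>'
s = BeautifulSoup(snippet, "html.parser")
print(s.a.string)                       # News  (the <!-- --> markers are stripped)
print(type(s.a.string))                 # <class 'bs4.element.Comment'>
print(isinstance(s.a.string, Comment))  # True
```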
4. BeautifulSoup Traversal
4.1 Traversing the document tree
.contents: gets all the child nodes of a Tag and returns them as a list
```python
# The .contents attribute returns a tag's child nodes as a list
tag_head = bs.head.contents
# Use a list index to pick out one element
print(tag_head[1])
# Output: <meta content="text/html;charset=utf-8" http-equiv="content-type"/>
```
.children: gets all the child nodes of a Tag and returns an iterator
```python
for child in bs.body.children:
    print(child)
# Output:
# <div id="wrapper"> <div id="head"> <div class="head_wrapper"> <div id="u1"> <a class="mnav" href="http://news.baidu.com" name="tj_trnews">News</a> <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a> <a class="mnav" href="http://map.baidu.com" name="tj_trmap">Map</a> <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">Video</a> <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">Post Bar</a> <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">More Products</a> </div> </div> </div> </div>
```
4.2 Other methods
- .descendants: gets all of a Tag's descendant nodes
- .strings: if a Tag contains more than one string (i.e., text in its descendant nodes), .strings retrieves them all for iteration
- .stripped_strings: the same as .strings, but with the extra whitespace stripped out
- .parent: gets the Tag's parent node
- .parents: recursively yields all ancestor nodes, returning a generator
- .previous_sibling: gets the previous sibling of the current Tag; this is usually a string or whitespace, i.e., the whitespace or newline that sits between the current Tag and the previous one
- .next_sibling: gets the next sibling of the current Tag; likewise usually the whitespace or newline between the current Tag and the next one
- .previous_siblings: gets all preceding siblings of the current Tag, returning a generator
- .next_siblings: gets all following siblings of the current Tag, returning a generator
- .previous_element: gets the previous object (string or tag) in parse order; this may coincide with .previous_sibling, but usually differs
- .next_element: gets the next object (string or tag) in parse order; this may coincide with .next_sibling, but usually differs
- .previous_elements: returns a generator that walks backward through the document's parsed content
- .next_elements: returns a generator that walks forward through the document's parsed content
- .has_attr: checks whether the Tag has a given attribute (a short demo of several of these follows below)
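A minimal sketch exercising a few of these attributes on the sample page (the exact whitespace siblings depend on the parser):

```python
a = bs.a                            # the first <a> tag
print(a.parent.name)                # div  (the enclosing <div id="u1">)
print(repr(a.next_sibling))         # usually the whitespace between two <a> tags
print(a.next_sibling.next_sibling)  # the second <a> tag

for s in bs.div.stripped_strings:   # all descendant text, whitespace stripped
    print(s)

print(a.has_attr('class'))          # True
```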
5. Searching the Document Tree
5.1 find_all
find_all(name, attrs, recursive, text, **kwargs)
We briefly used find_all in the examples above; now let's look at more of its power: filters. Filters run throughout the entire search API and can be applied to tag names, node attributes, and more.
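Besides plain strings, the name parameter also accepts regular expressions and even functions; a minimal sketch:

```python
import re

# Regular expression: match all tags whose name starts with "b" (e.g. <body>)
for tag in bs.find_all(re.compile("^b")):
    print(tag.name)

# Function: match tags that have a class attribute but no id attribute
def class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

print(bs.find_all(class_but_no_id))
```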
5.1.1 name parameter
String filtering: finds tags that exactly match the given string
```python
a_list = bs.find_all("a")
print(a_list)
# [<a class="mnav" href="http://news.baidu.com" name="tj_trnews">News</a>, <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a>, <a class="mnav" href="http://map.baidu.com" name="tj_trmap">Map</a>, <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">Video</a>, <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">Post Bar</a>, <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">More Products</a>]
```
List: if a list is passed in, BeautifulSoup4 returns the nodes that match any element of the list
t_list = bs.find_all(["meta","link"]) for item in t_list: print(item) Output: <meta content="text/html;charset=utf-8" http-equiv="content-type"/> <meta content="IE=Edge" http-equiv="X-UA-Compatible"/> <meta content="always" name="referrer"/> <link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet"type="text/css"/>
5.1.2 kwargs parameter
```python
from bs4 import BeautifulSoup
import re

file = open('./aa.html', 'rb')
html = file.read()
bs = BeautifulSoup(html, "html.parser")

# Query tags with id="head"
t_list = bs.find_all(id="head")
print(t_list)
'''
Output:
[<div id="head"> <div class="head_wrapper"> <div id="u1"> <a class="mnav" href="http://news.baidu.com" name="tj_trnews">News</a> <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a> <a class="mnav" href="http://map.baidu.com" name="tj_trmap">Map</a> <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">Video</a> <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">Post Bar</a> <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">More Products</a> </div> </div> </div>]
'''

# Query tags whose href attribute matches http://news.baidu.com
# (re.compile builds a regular expression; re is Python's regex library)
t_list = bs.find_all(href=re.compile("http://news.baidu.com"))
print(t_list)
# [<a class="mnav" href="http://news.baidu.com" name="tj_trnews">News</a>]

# Query all tags that have a class attribute
# (note: class is a Python keyword, so bs4 uses class_ instead)
t_list = bs.find_all(class_=True)
for item in t_list:
    print(item)
```
5.1.3 attrs parameter
Using the attrs parameter, we can pass a dictionary to search for tags by arbitrary attributes:

```python
t_list = bs.find_all(attrs={"data-foo": "value"})
for item in t_list:
    print(item)
# (the sample page has no data-foo attribute, so nothing is printed;
#  the call is shown for the syntax)
```
5.1.4 text parameter
The text parameter lets you search for string content in the document. Like the name parameter, it accepts strings, regular expressions, and lists.
```python
t_list = bs.find_all(text="hao123")
for item in t_list:
    print(item)
# Output: hao123

t_list = bs.find_all(text=["hao123", "Map", "Post Bar"])
for item in t_list:
    print(item)
# Output:
# hao123
# Map
# Post Bar

# Regular expression matching
t_list = bs.find_all(text=re.compile(r"\d"))
for item in t_list:
    print(item)
# Output: hao123
```
5.1.5 limit parameter
A limit parameter can be passed in to cap the number of results. If a search matches 5 items and limit=2 is set, only the first two are returned.
t_list = bs.find_all("a",limit=2) for item in t_list: print(item) <a class="mnav" href="http://News.baidu. COM "name=" tj_ Trnews ">News</a> <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a>
5.2 CSS Selector
5.2.1 Find by Tag Name
```python
print(bs.select('title'))
print(bs.select('a'))
'''
[<title>Baidu once, you know</title>]
[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">News</a>, <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a>, <a class="mnav" href="http://map.baidu.com" name="tj_trmap">Map</a>, <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">Video</a>, <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">Post Bar</a>, <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">More Products</a>]
'''
```
5.2.2 Find by Class Name
```python
print(bs.select('.mnav'))
'''
[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">News</a>, <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a>, <a class="mnav" href="http://map.baidu.com" name="tj_trmap">Map</a>, <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">Video</a>, <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">Post Bar</a>]
'''
```
5.2.3 Find by id
```python
print(bs.select('#u1'))
'''
[<div id="u1"> <a class="mnav" href="http://news.baidu.com" name="tj_trnews">News</a> <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a> <a class="mnav" href="http://map.baidu.com" name="tj_trmap">Map</a> <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">Video</a> <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">Post Bar</a> <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">More Products</a> </div>]
'''
```
5.2.4 Attribute Lookup
print(bs.select('a[class="bri"]')) print(bs.select('a[href="http://tieba.baidu.com"]')) ''' [<a class="bri" href="//Www.baidu. Com/more/"name=" tj_ Briicon "style=" display: block; ">More Products</a>] [<a class="mnav" href="http://Tieba.baidu. COM "name=" tj_ Trtieba "> post it </a>] '''
6. Crawlers in Practice
6.1 Crawling Romance of the Three Kingdoms
Website: https://www.shicimingju.com/book/sanguoyanyi.html
We can look at the URLs of the first five chapters to find the pattern:

```
Chapter 1: https://www.shicimingju.com/book/sanguoyanyi/1.html
Chapter 2: https://www.shicimingju.com/book/sanguoyanyi/2.html
Chapter 3: https://www.shicimingju.com/book/sanguoyanyi/3.html
Chapter 4: https://www.shicimingju.com/book/sanguoyanyi/4.html
Chapter 5: https://www.shicimingju.com/book/sanguoyanyi/5.html
```

Only the chapter number before .html changes; the rest of the URL stays the same, so we can build all 120 chapter URLs in a loop:

```python
url1 = 'https://www.shicimingju.com/book/sanguoyanyi/'
for i in range(1, 121):
    url = url1 + str(i) + ".html"
```
Network analysis: copy the browser's User-Agent from the developer tools so our requests look like a normal browser:

```
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36
```
Web Page Analysis
```python
import requests
from bs4 import BeautifulSoup

url1 = 'https://www.shicimingju.com/book/sanguoyanyi/'
user_agent = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
for i in range(1, 121):
    url = url1 + str(i) + ".html"
    r = requests.get(url=url, headers=user_agent)
    r.encoding = 'utf-8'   # set the response encoding to utf-8
    bs = BeautifulSoup(r.text, 'html.parser')
    title = bs.find('div', class_="card bookmark-list").h1.text
    text = bs.find('div', class_="chapter_content").text
    name = "D:\\python Study\\Reptiles\\Romance of the Three Kingdoms\\Chapter {}.txt".format(i)
    with open(name, 'a', encoding='UTF-8') as f:
        f.writelines(title)
        f.writelines(text)
```
6.2 Crawl Pictures
Crawl the posters of the Douban Top 250 movies.
Web address analysis:
```
Page 1: https://movie.douban.com/top250?start=0&filter=
Page 2: https://movie.douban.com/top250?start=25&filter=
Last page: https://movie.douban.com/top250?start=225&filter=
```

The start parameter grows by 25 per page, so the page URLs can be generated in a loop:

```python
url_base = 'https://movie.douban.com/top250?start='
for i in range(0, 10):
    url = url_base + str(i * 25) + '&filter='
```
Web Page Analysis
All the poster containers share the unique attribute class="pic". Let's take the first page's URL as an example:
```python
import requests
from bs4 import BeautifulSoup

user_agent = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
url_base = 'https://movie.douban.com/top250?start='
url = url_base + str(0) + '&filter='
r = requests.get(url=url, headers=user_agent)
soup = BeautifulSoup(r.text, 'html.parser')
pic = soup.select('div[class="pic"]')
print(pic)
```
As the output shows, each picture's URL sits in the src attribute of an <img> tag nested inside an <a> tag:
```python
j = 1
for i in pic:
    img_i = i.a.img.get('src')               # the image URL from its src attribute
    content = requests.get(img_i).content    # the image's binary data
    path = 'No.{}Picture.jpg'.format(j)
    j = j + 1
    with open(path, 'wb') as f:
        f.write(content)
```
The pictures from the first page are downloaded successfully.
Next, crawl the posters of all 250 movies:
```python
import requests
from bs4 import BeautifulSoup

user_agent = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
url_base = 'https://movie.douban.com/top250?start='
j = 1
for i in range(0, 10):
    url = url_base + str(i * 25) + '&filter='
    r = requests.get(url=url, headers=user_agent)
    soup = BeautifulSoup(r.text, 'html.parser')
    pic = soup.select('div[class="pic"]')        # poster containers on this page
    for p in pic:                                # p, not i, to avoid shadowing the page counter
        img_i = p.a.img.get('src')               # the image URL from its src attribute
        content = requests.get(img_i).content    # the image's binary data
        path = 'D:\\python Study\\Reptiles\\picture\\No.{}Picture.jpg'.format(j)
        j = j + 1
        with open(path, 'wb') as f:
            f.write(content)
```
Error Resolution
Because we send requests in rapid succession, Douban's server may treat the traffic as abusive (much like a denial-of-service attack) and start rejecting our requests.
Solution:
- Send a browser-like User-Agent header, as in the scripts above.
- Throttle the requests: add `import time` at the top of the script and call `time.sleep(0.5)` inside the loop, after each request.
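Putting both fixes together, a minimal sketch of the throttled download loop (same assumptions and selectors as the script above):

```python
import time

import requests
from bs4 import BeautifulSoup

user_agent = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
url_base = 'https://movie.douban.com/top250?start='
j = 1
for i in range(0, 10):
    r = requests.get(url_base + str(i * 25) + '&filter=', headers=user_agent)
    soup = BeautifulSoup(r.text, 'html.parser')
    for p in soup.select('div[class="pic"]'):
        content = requests.get(p.a.img.get('src')).content
        with open('No.{}Picture.jpg'.format(j), 'wb') as f:
            f.write(content)
        j += 1
        time.sleep(0.5)  # pause after each request so the server is not flooded
```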