Experiment 1: Crawling and cleaning of big data
Experiment content:
Crawl laptop listings from Jingdong Mall (other product categories may also be chosen): https://list.jd.com/list.html?cat=670%2C671%2C672
The product information must include the price (p-price), the merchant (p-shop), the product number (data-sku), and the product image (as shown in the figure below). Other product fields may be added as appropriate, and at least five pages of products must be crawled. After the information is printed, it is stored in a txt file or an Excel file.
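For reference, each crawled product can be pictured as one record with the required fields; a minimal sketch (the values are made-up examples, not real data):

product = {
    'id': '100012345678',                   # data-sku, the product number
    'name': 'Lenovo ThinkPad notebook',     # cleaned product title
    'price': '5999.00',                     # text of the p-price element
    'shop': 'Some flagship store',          # merchant from the p-shop element
    'img': '//img1.360buyimg.com/xxx.jpg',  # image address (URL), not the image file
}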
Experiment procedure
- First, you need to understand the basic syntax of Python and HTML and analyze the structure of the web page, and then use Python to crawl it.
- Install the Requests and BeautifulSoup packages first: requests is used to send the HTTP requests, and BeautifulSoup is used to parse and clean the returned HTML.
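A minimal sketch of the installation command and the imports used below (beautifulsoup4 is the usual pip package name for BeautifulSoup):

# Install once from the command line: pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup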
- Use requests.get() to send the request; response is the returned result:
response = requests.get(local_url, headers=headers)
At first the headers attribute was not added and the returned result was empty. After looking into it, we found that a crawler written this way sends an obviously automated request to the server by default; under normal circumstances the website does not allow such access, and the returned text contains words such as "sorry" or "inaccessible". **We can get the website to accept the request and return the web page by changing the User-Agent field.** The property is added after inspecting the web page:
headers = { 'user-agent': "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Mobile Safari/537.36" }
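Putting the header and the request together, a minimal sketch of fetching the first page (the status-code check is an addition for illustration; JD may still block or redirect automated requests):

headers = {
    'user-agent': "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Mobile Safari/537.36"
}
local_url = 'https://list.jd.com/list.html?cat=670%2C671%2C672'
response = requests.get(local_url, headers=headers)
# A 200 status code plus a non-trivial body length suggests the User-Agent was accepted
print(response.status_code, len(response.text))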
- Analyze the page addresses of the subsequent pages.
It turns out that the page and s parameters are appended to the URL for each request, so the request address for subsequent pages is built locally:
if page != 1:
    local_url = local_url + '&page=' + str(page) + '&s=' + str(size) + '&click=0'
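The same logic can be wrapped in a small helper for illustration (build_page_url is a hypothetical name; size=30 matches the value passed in the main function below):

def build_page_url(base_url, page, size=30):
    # Page 1 is the base URL itself; later pages append the page, s and click parameters
    if page == 1:
        return base_url
    return base_url + '&page=' + str(page) + '&s=' + str(size) + '&click=0'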
- After getting the response data, analysis of the web page structure shows that all the required elements are contained under the li elements whose class name is gl-item.
So find all li elements whose class name is gl-item and save them:
li = bf.find_all('li', class_='gl-item')
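Here bf is the parsed page; a minimal sketch of how it can be built from the response (the parser name 'html.parser' is an assumption, lxml would also work):

# Parse the returned HTML so that find_all() can be used on it
bf = BeautifulSoup(response.text, 'html.parser')
li = bf.find_all('li', class_='gl-item')
print('Products found on this page:', len(li))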
- Traverse the li elements, create a local variable data to hold each product's data, and then, following the web page structure, clean each field required by the experiment:
for item in li:
    data = {}
    id = item.get('data-sku')
    name = item.find_all('div', class_='p-name')
    shop = item.find_all('div', class_='p-shop')
    price = item.find_all('div', class_='p-price')
    price = price[0].find_all('i')
    temp_shop = shop[0].find_all('span', class_='J_im_icon')
During processing it was found that the merchant information of some items cannot be crawled, so this case has to be handled:
    if len(temp_shop) == 0:
        data['shop'] = "The merchant information is incorrect!"
    else:
        data['shop'] = temp_shop[0].text
What is crawled for the image is its address (the extraction below assumes JD's lazy-loaded data-lazy-img attribute and falls back to src). The product name is then processed: because some product names start with the words "Jingpin computer", that prefix needs to be stripped:
    img_tag = item.find('img')
    # image address; data-lazy-img is an assumption for lazy-loaded images, src is the fallback
    img = img_tag.get('data-lazy-img') or img_tag.get('src')
    name = name[0].find_all('em')
    if name[0].text.startswith("Jingpin computer"):
        data['name'] = name[0].text[len("Jingpin computer"):].strip()
    else:
        data['name'] = name[0].text
- Print the product information and append it to the dataList array for storage:
    data['id'] = id
    data['price'] = price[0].text
    data['img'] = img
    print('number:', data['id'])
    print('name:', data['name'])
    print('price:', data['price'])
    print('shop:', data['shop'])
    dataList.append(data)
- Encapsulate it as a getdata function and call it in the main function:
if __name__ == '__main__':
    url = 'https://list.jd.com/list.html?cat=670%2C671%2C672'
    page = int(input("Please enter the number of pages to be crawled:"))
    for i in range(1, page + 1):
        getdata(url, i, 30)
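For completeness, a hedged sketch of how the snippets above might fit together inside getdata (the report does not show the full function, so the exact structure here is an assumption):

dataList = []  # collects one dict per product across all pages

def getdata(local_url, page, size):
    # Build the page-specific URL, request it, and parse the product list (assembled from the snippets above)
    if page != 1:
        local_url = local_url + '&page=' + str(page) + '&s=' + str(size) + '&click=0'
    response = requests.get(local_url, headers=headers)
    bf = BeautifulSoup(response.text, 'html.parser')
    for item in bf.find_all('li', class_='gl-item'):
        data = {}
        data['id'] = item.get('data-sku')
        # ... extract name, shop, price and img as shown above ...
        dataList.append(data)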
- Save the crawled data into a txt file, adding line breaks for easy viewing:
f = open('com.txt', 'w', encoding='utf-8')
for line in dataList:
    f.write(str(line))
    f.write('\n')
f.close()
print("Saved")
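The task statement also allows an Excel file. A minimal sketch using the standard csv module to produce a file that Excel can open directly (the file name com.csv and the column order are assumptions):

import csv

# Write one row per product; Excel opens this CSV directly, and utf-8-sig keeps non-ASCII names readable
with open('com.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.DictWriter(f, fieldnames=['id', 'name', 'price', 'shop', 'img'])
    writer.writeheader()
    writer.writerows(dataList)
print("Saved to com.csv")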