Big Data Pilot Practice Experiment 1

Experiment 1: Crawling and cleaning of big data

Experiment content:

Crawl laptops from Jingdong Mall (you can also choose other products)
https://list.jd.com/list.html?cat=670%2C671%2C672

The product information must include the price (p-price), the merchant (p-shop), the product number (data-sku), and the product picture (as shown in the figure below). Other product fields can be added as appropriate, and at least five pages of products must be crawled. After the information is printed, it is stored in a txt file or an Excel file.

Experiment procedure

  1. First, understand the basic syntax of Python and HTML and analyze the structure of the web page, then use Python to crawl it

  2. Install the Requests and BeautifulSoup packages first

requests is used to send the requests, and BeautifulSoup is used to parse and clean the returned data.
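
A minimal sketch of how the two packages fit together (the example URL and the html.parser backend here are illustrative, not part of the original script):

    import requests
    from bs4 import BeautifulSoup

    # fetch a page and hand the returned HTML to BeautifulSoup for parsing
    response = requests.get('https://example.com')
    soup = BeautifulSoup(response.text, 'html.parser')
    print(soup.title.text)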

  1. Use requests.get() to send the request; response is the returned result

    response = requests.get(local_url, headers=headers)
    

    At first the headers attribute was not added and the returned result was empty. After looking it up, we found that a crawler written this way identifies itself to the server by default; under normal circumstances the website does not allow access by crawlers, and the returned text contains words such as "sorry" or "inaccessible". **We can get the website to respond normally by changing the User-Agent field.** Add the attribute after inspecting the web page:

    headers = {
        'user-agent': "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Mobile Safari/537.36"
    }
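
    With the headers added, a quick way to confirm that the request now goes through is to inspect the response (this check is a sketch added for illustration, not part of the original script):

        print(response.status_code)   # 200 means the server accepted the request
        print(response.text[:200])    # a blocked request shows an apology page instead of the product listing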
    
  2. Analyze the page addresses of the subsequent pages

    It is found that the page and s parameters are appended to the URL when requesting later pages, so the addresses of the subsequent pages are built locally

        if page != 1:
            local_url = local_url + '&page=' + str(page) + '&s=' + str(size) + '&click=0'
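
    For example, with the base URL above, page 2 requested with size 30 becomes:

        https://list.jd.com/list.html?cat=670%2C671%2C672&page=2&s=30&click=0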
    
  3. After getting the response data, analysis of the web page structure shows that all of the required elements are contained in li elements whose class name is gl-item

    So find all li elements with the class name gl-item and save them:

    li = bf.find_all('li', class_='gl-item')
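
    Here bf is the BeautifulSoup object built from the response text; a minimal sketch of how it is created (the html.parser backend is an assumption, any installed parser works):

        bf = BeautifulSoup(response.text, 'html.parser')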
    
  4. Traverse the li elements, create a local dictionary data to hold each product's fields, then analyze the web page structure and clean out each piece of data required by the experiment

        for item in li:
            data = {}
            id = item.get('data-sku')
            name = item.find_all('div', class_='p-name')
            shop = item.find_all('div', class_='p-shop')
            price = item.find_all('div', class_='p-price')
            price = price[0].find_all('i')
            temp_shop = shop[0].find_all('span', class_='J_im_icon')
    

    During processing, it is found that the shop information sometimes cannot be crawled, so that case needs to be handled:

            if len(temp_shop) == 0:
                data['shop'] = "Shop information could not be crawled!"
            else:
                data['shop'] = temp_shop[0].text
    

    What is crawled for the image is its address (the extraction snippet is sketched at the end of this step). The product name is then processed: some product names are prefixed with the "Jingpin Computer" label, so that prefix needs to be stripped.

    name = name[0].find_all('em')
    # strip the Chinese 京品电脑 ("Jingpin Computer") label that prefixes some product names
    if name[0].text[0:4] == "京品电脑":
        data['name'] = name[0].text[6:]
    else:
        data['name'] = name[0].text
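
    The snippet that extracts the image address is not shown above; a minimal sketch of what it might look like (the data-lazy-img attribute is an assumption based on how the listing page lazy-loads pictures; fall back to src if it is absent):

        img_tag = item.find('img')
        if img_tag is not None:
            # lazy-loaded pictures may keep their address in data-lazy-img rather than src (assumption)
            img = img_tag.get('data-lazy-img') or img_tag.get('src') or ''
        else:
            img = ''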
    
  5. Print the product information and append it to the dataList list for storage

            data['id'] = id
            data['price'] = price[0].text
            data['img'] = img
            print('number:', data['id'])
            print('name:', data['name'])
            print('price:', data['price'])
            print('shop:', data['shop'])
            dataList.append(data)
    
  6. Encapsulate the steps above as a getdata function and call it in the main block; a sketch of the assembled function follows the snippet below

    if __name__ == '__main__':
        url = 'https://list.jd.com/list.html?cat=670%2C671%2C672'
        page = int(input("Please enter the number of pages to be crawled:"))
        for i in range(1, page + 1):
            getdata(url, i, 30)
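
    A sketch of what the assembled getdata function might look like, combining the snippets from the previous steps (dataList is assumed to be a module-level list, headers is the User-Agent dict from step 1, requests and BeautifulSoup are imported as shown earlier, and the 京品电脑 prefix handling from step 4 is omitted for brevity):

        dataList = []

        def getdata(local_url, page, size):
            # build the page-specific URL as in step 2
            if page != 1:
                local_url = local_url + '&page=' + str(page) + '&s=' + str(size) + '&click=0'
            response = requests.get(local_url, headers=headers)
            bf = BeautifulSoup(response.text, 'html.parser')
            # one li.gl-item per product, as in step 3
            for item in bf.find_all('li', class_='gl-item'):
                data = {}
                data['id'] = item.get('data-sku')
                name = item.find_all('div', class_='p-name')[0].find_all('em')
                price = item.find_all('div', class_='p-price')[0].find_all('i')
                temp_shop = item.find_all('div', class_='p-shop')[0].find_all('span', class_='J_im_icon')
                data['name'] = name[0].text
                data['price'] = price[0].text
                data['shop'] = temp_shop[0].text if temp_shop else "Shop information could not be crawled!"
                # image address, extracted as sketched in step 4 (data-lazy-img is an assumption)
                img_tag = item.find('img')
                data['img'] = (img_tag.get('data-lazy-img') or img_tag.get('src') or '') if img_tag else ''
                dataList.append(data)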
    
  7. Save the crawled data into a txt file and add line breaks for easy viewing

        f = open('com.txt', 'w', encoding='utf-8')  # utf-8 so product names are written safely
        for line in dataList:
            f.write(str(line))
            f.write('\n')
        f.close()
        print("Saved")
    

Operation result
