4. Selenium for Python crawlers

I. Introduction to selenium

Selenium is a web automation tool that was originally developed for automated website testing. It drives a real browser directly and supports all mainstream browsers, including browsers without an interface such as PhantomJS. It can receive instructions, make the browser load pages automatically, extract the data we need, and even take screenshots. With selenium we can easily reimplement the crawlers written earlier. Next, let's look at selenium in action.

1.1 download and configure chromedriver and the selenium module

Download chromedriver

Download address (either of the following works):
1. http://chromedriver.storage.googleapis.com/index.html

2. https://npm.taobao.org/mirrors/chromedriver/

Note:
The chromedriver version must match your browser version (matching the major version is enough); otherwise it will not work.
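
The matching rule above can be checked mechanically. The following is a minimal sketch that assumes version strings in the usual dotted form; the helper names and the sample version strings are illustrative and are not part of selenium or chromedriver.

```python
def major_version(version: str) -> int:
    """Return the leading (major) component of a dotted version string."""
    return int(version.strip().split(".")[0])


def same_major_version(browser_version: str, driver_version: str) -> bool:
    """True when the browser and the driver share the same major version."""
    return major_version(browser_version) == major_version(driver_version)


# Illustrative version strings only.
print(same_major_version("88.0.4324.150", "88.0.4324.96"))  # True
print(same_major_version("88.0.4324.150", "87.0.4280.88"))  # False
```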

My browser version:

Open the driver download interface: https://npm.taobao.org/mirrors/chromedriver/


Here are the corresponding version descriptions that should be downloaded:

My browser version is 88.0.43243, so I downloaded the chromedriver release whose major version is also 88.

Configure driver

After downloading, unzip the archive and place chromedriver.exe in the installation directory of the Chrome browser, then add that directory to the PATH environment variable.


After downloading chromedriver and installing the selenium module (for example with pip install selenium), run the following code and watch what happens:

from selenium import webdriver 

# If the driver's directory is not on PATH, pass the driver's absolute path via the executable_path parameter
# driver = webdriver.Chrome(executable_path='/home/worker/Desktop/driver/chromedriver')

# If the driver's directory is on PATH, executable_path is not needed
driver = webdriver.Chrome()

# Send a request to a url
driver.get("https://www.baidu.com/")

# Save the page as an image (reportedly unavailable in Chrome versions above 69)
# driver.save_screenshot("itcast.png")

print(driver.title) # Print the page title

# Quit the browser
driver.quit() # Be sure to quit! Otherwise leftover processes will remain!

Error: Win32 Error,Code:740 The requested operation requires elevation
Solution: right-click PyCharm and choose "Run as administrator".

Error: Message: 'chromedriver' executable may have wrong permissions
Cause: the driver is not in the expected location (it was not placed in the Chrome installation directory), or the PATH environment variable is not configured correctly.
Solution: go through all of the configuration steps above again.
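
Before starting the browser, it can help to confirm that the driver is actually reachable. The sketch below uses only the Python standard library; the helper names locate_driver and is_executable are illustrative, and the check is an assumption about where the second error usually comes from rather than an official diagnostic.

```python
import os
import shutil


def locate_driver(name="chromedriver"):
    """Return the full path of the driver found on PATH, or None."""
    return shutil.which(name)


def is_executable(path):
    """True if the path is an existing file the current user may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


path = locate_driver()
if path is None:
    print("chromedriver was not found on PATH")
else:
    print("found:", path, "executable:", is_executable(path))
```

is_executable is meant for the case where an explicit path is passed through executable_path instead of relying on PATH.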

1.2 running the phantomjs browser without an interface

PhantomJS is a WebKit-based "headless" browser that loads websites into memory and executes the JavaScript on the page. Note that support for PhantomJS was removed in Selenium 4, so the code below needs a Selenium 3 installation. Download address: http://phantomjs.org/download.html

from selenium import webdriver

# Example with an absolute path: driver = webdriver.PhantomJS(executable_path='/home/worker/Desktop/driver/phantomjs')
driver = webdriver.PhantomJS(executable_path='C:/Program Files (x86)/Google/Chrome/Application/phantomjs.exe')
# driver = webdriver.Chrome(executable_path='/home/worker/Desktop/driver/chromedriver')

# Make a request to a url
driver.get("http://www.itcast.cn/")

# Save the page as an image
driver.save_screenshot("itcast.png")

# Quit the browser
driver.quit() # Be sure to quit! Otherwise leftover processes will remain!

After running the code, the screenshot itcast.png has been saved.

The console shows some warnings, but they do not affect the result (they are covered in section VII of this post).
As shown above, Python code can call either Chrome or the headless PhantomJS browser and make it visit a website automatically.

1.3 usage scenarios of headless and headed browsers

  • During development we usually need to see what is happening, so we normally use a headed browser
  • Once the project is finished and deployed, the platform usually runs a server edition of the operating system. Such systems typically have no graphical interface, so a headless browser is needed for the crawler to run properly

II. Function and working principle of selenium

Selenium wraps the browser's native API in a more object-oriented Selenium WebDriver API, which can operate directly on the elements of a page and even on the browser itself (screenshots, window size, startup, shutdown, plug-in installation, certificate configuration, etc.)

  • A webdriver is essentially a web server that exposes a web API to the outside and wraps the various functions of the browser
  • Each browser has its own webdriver (for example, chromedriver for Chrome)
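
To make the "web server" point concrete: under the W3C WebDriver protocol, each driver call corresponds to an HTTP request sent to the driver process. The sketch below only builds such requests and does not send them; the function names are illustrative and the session id is a placeholder.

```python
import json


def navigate_request(session_id, url):
    """Build the HTTP request that driver.get(url) corresponds to."""
    return ("POST", f"/session/{session_id}/url", json.dumps({"url": url}))


def title_request(session_id):
    """Build the HTTP request that reading driver.title corresponds to."""
    return ("GET", f"/session/{session_id}/title", None)


method, path, body = navigate_request("SESSION_ID", "https://www.baidu.com/")
print(method, path, body)
print(*title_request("SESSION_ID"))
```

The selenium client library hides this exchange, which is why the Python code only ever deals with driver objects and element objects.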

III. Example: searching Baidu for "python" with Python

First find the id values of the search box and the "Baidu search" button

import time
from selenium import webdriver

# Instantiate the driver object by giving the path of chromedriver, here placed in the current directory
# driver = webdriver.Chrome(executable_path='./chromedriver')
# chromedriver's directory is already on PATH
driver = webdriver.Chrome()

# Control browser access url address
driver.get("https://www.baidu.com/")

# Search 'python' in Baidu search box
driver.find_element_by_id('kw').send_keys('python')
# Click 'Baidu search'
driver.find_element_by_id('su').click()

time.sleep(6)
# Exit browser
driver.quit()
  • The executable_path parameter in webdriver.Chrome(executable_path='./chromedriver') specifies the path of the downloaded chromedriver file
  • driver.find_element_by_id('kw').send_keys('python') locates the tag whose id attribute is 'kw' and types the string 'python' into it
  • driver.find_element_by_id('su').click() locates the tag whose id attribute is 'su' and clicks it
    • The click function triggers the js click event of the tag

Result:

The browser runs the search and then closes automatically after 6 seconds.

IV. Extracting data with selenium

4.1 common attributes and methods of the driver object

After instantiating the driver object, the following attributes and methods are commonly used:

  1. driver.page_source: the page source of the current tab after the browser has rendered it
  2. driver.current_url: the URL of the current tab
  3. driver.close(): closes the current tab; if it is the only tab, the whole browser is closed
  4. driver.quit(): closes the browser
  5. driver.forward(): goes forward one page
  6. driver.back(): goes back one page
  7. driver.save_screenshot(img_name): takes a screenshot of the page

4.2 locating tag elements with the driver object

In selenium, tags can be located in a variety of ways, each of which returns tag element objects

find_element_by_id                      (returns a single element)
find_element(s)_by_class_name           (locates elements by class name)
find_element(s)_by_name                 (locates elements by the value of the tag's name attribute)
find_element(s)_by_xpath                (locates elements by XPath)
find_element(s)_by_link_text            (locates elements by the full link text)
find_element(s)_by_partial_link_text    (locates elements by text contained in the link)
find_element(s)_by_tag_name             (locates elements by tag name)
find_element(s)_by_css_selector         (locates elements by CSS selector)
  • Note:
    • Differences between find_element and find_elements:
      • With the s, a list is returned; without the s, only the first matching tag object is returned
      • find_element throws an exception if nothing matches; find_elements returns an empty list if nothing matches
    • Difference between by_link_text and by_partial_link_text: the former matches the full link text, the latter matches any link containing the given text
    • Example of using the functions above:
      • driver.find_element_by_id('id_str')

4.3 extracting text content and attribute values from tag objects

find_element only locates an element; it does not return the data inside it. To operate on the element or get its data, use the following methods:

  • Click an element: element.click()

    • Clicks the located tag object
  • Type into an input box: element.send_keys(data)

    • Enters data into the located tag object
  • Get the text: element.text

    • Reads the text content through the text attribute of the located tag object
  • Get an attribute value: element.get_attribute("attribute name")

    • Pass the attribute name to the get_attribute function of the located tag object to get that attribute's value
  • Code implementation, as follows:

from selenium import webdriver

driver = webdriver.Chrome()

driver.get('http://www.itcast.cn/')

ret = driver.find_elements_by_tag_name('h2')
print(ret[0].text) 

ret = driver.find_elements_by_link_text('黑马程序员') # link text as shown on the page ("Dark horse programmer")
print(ret[0].get_attribute('href'))

driver.quit()

V. Other ways to use selenium

5.1 selenium tab switching

When selenium has opened several tabs in the browser, how do we switch between them? It takes two steps:

# 1. get the list of handles of all current tabs
current_windows = driver.window_handles

# 2. switch according to the index subscript of the tab handle list
driver.switch_to.window(current_windows[0])

Reference code example:

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.baidu.com/")

time.sleep(1)
driver.find_element_by_id('kw').send_keys('python')
time.sleep(1)
driver.find_element_by_id('su').click()
time.sleep(1)

# Create a new tab by executing js
js = 'window.open("https://www.sogou.com");'
driver.execute_script(js)
time.sleep(1)

# 1. get all current windows
windows = driver.window_handles

time.sleep(2)
# 2. switch according to the window index
driver.switch_to.window(windows[0])
time.sleep(2)
driver.switch_to.window(windows[1])

time.sleep(6)
driver.quit()
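
Indexing into window_handles assumes the new tab sits at a known position in the list. A slightly more defensive sketch is to record the handles before opening the tab and then pick the handle that was not there before. The helper newly_opened and the handle strings are illustrative.

```python
def newly_opened(before, after):
    """Return the handles present in `after` but not in `before`, in order."""
    seen = set(before)
    return [handle for handle in after if handle not in seen]


before = ["CDwindow-A"]
after = ["CDwindow-A", "CDwindow-B"]
print(newly_opened(before, after))  # ['CDwindow-B']

# With a real driver this would look like:
#   before = driver.window_handles
#   driver.execute_script('window.open("https://www.sogou.com");')
#   driver.switch_to.window(newly_opened(before, driver.window_handles)[0])
```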

5.2 switching into a frame with switch_to

Frames are a common html technique in which one page embeds another page. By default, selenium cannot access content inside a frame; the solution is driver.switch_to.frame(frame_element). We will learn this through a simulated QQ Mail login.

Find iframe tag

  • Reference code:
import time
from selenium import webdriver

driver = webdriver.Chrome()

url = 'https://mail.qq.com/cgi-bin/loginpage'
driver.get(url)
time.sleep(2)

login_frame = driver.find_element_by_id('login_frame') # Locate the frame element by id
driver.switch_to.frame(login_frame) # Turn to the frame

driver.find_element_by_xpath('//*[@id="u"]').send_keys('1596930226@qq.com')
time.sleep(2)

driver.find_element_by_xpath('//*[@id="p"]').send_keys('hahamimashicuode')
time.sleep(2)

driver.find_element_by_xpath('//*[@id="login_button"]').click()
time.sleep(2)

# To operate on elements outside the frame, switch out of it first
windows = driver.window_handles
driver.switch_to.window(windows[0])

content = driver.find_element_by_class_name('login_pictures_title').text
print(content)

driver.quit()

Summary:

  • Switch into the page nested in the located frame tag

    • driver.switch_to.frame(frame_element), where frame_element is the frame or iframe tag object located with a find_element_by_* function
  • Switch out of the frame by switching tabs

windows = driver.window_handles
driver.switch_to.window(windows[0])
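
Besides switching tabs, selenium also offers driver.switch_to.default_content(), which returns to the top-level page directly. The sketch below wraps entering and leaving a frame in a context manager so that leaving cannot be forgotten. inside_frame is a hypothetical helper, and the two small classes stand in for a real driver so that the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame):
    """Switch into `frame`, then always switch back to the top-level page."""
    driver.switch_to.frame(frame)
    try:
        yield driver
    finally:
        driver.switch_to.default_content()


class RecordingSwitchTo:
    """Stand-in for driver.switch_to that records the calls it receives."""

    def __init__(self):
        self.calls = []

    def frame(self, frame):
        self.calls.append(("frame", frame))

    def default_content(self):
        self.calls.append(("default_content",))


class FakeDriver:
    """Stand-in for a real driver object."""

    def __init__(self):
        self.switch_to = RecordingSwitchTo()


driver = FakeDriver()
with inside_frame(driver, "login_frame"):
    pass  # locate and fill in the elements inside the frame here
print(driver.switch_to.calls)  # [('frame', 'login_frame'), ('default_content',)]
```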

5.3 handling cookies with selenium

selenium can help us work with the cookies of a page, for example reading and deleting them.

5.3.1 getting cookies

driver.get_cookies() returns a list containing the complete cookie information: not only name and value, but also other fields such as domain. To use the cookies with the requests module, convert the list into a dictionary that maps each cookie's name to its value.

# Get all cookie information of the current tab
print(driver.get_cookies())
# Convert the cookies into a dictionary
cookies_dict = {cookie['name']: cookie['value'] for cookie in driver.get_cookies()}
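
The conversion can be tried without a browser. The sketch below applies the same dictionary comprehension to a hand-written list shaped like the return value of driver.get_cookies(); the cookie names, values and domain are made-up sample data, and cookies_to_dict is an illustrative helper name.

```python
def cookies_to_dict(cookie_list):
    """Map each cookie's name to its value, dropping the other fields."""
    return {cookie["name"]: cookie["value"] for cookie in cookie_list}


# Made-up sample in the shape returned by driver.get_cookies()
sample_cookies = [
    {"name": "sessionid", "value": "abc123", "domain": ".example.com", "path": "/"},
    {"name": "theme", "value": "dark", "domain": ".example.com", "path": "/"},
]

cookies_dict = cookies_to_dict(sample_cookies)
print(cookies_dict)  # {'sessionid': 'abc123', 'theme': 'dark'}

# With the requests module installed, the dictionary can be passed directly:
#   requests.get(url, cookies=cookies_dict)
```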

5.3.2 deleting cookies

# Delete a single cookie
driver.delete_cookie("CookieName")

# Delete all cookies
driver.delete_all_cookies()

5.4 making the browser execute js code with selenium

selenium can make the browser execute js code that we specify. Run the following code to see the effect:

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.itcast.cn/")
time.sleep(1)

js = 'window.scrollTo(0,document.body.scrollHeight)' # js statement
driver.execute_script(js) # Method of executing js

time.sleep(5)
driver.quit()

Method for executing js: driver.execute_script(js)

VI. Page waiting

While a page is loading, time is spent waiting for the website's server to respond. During this time a tag element may not be available yet because it has not been loaded. How do we deal with this? This section covers:

  1. Categories of page waiting
  2. Forced waiting
  3. Implicit waiting
  4. Explicit waiting
  5. Implementing page waiting manually

6.1 categories of page waiting

Page waiting in selenium falls into the following categories:

  1. Forced waiting
  2. Implicit waiting
  3. Explicit waiting

6.2 forced waiting (for reference)

  • time.sleep()
  • The disadvantage is that it is not intelligent: if the time is set too short, the element has not loaded yet; if it is set too long, time is wasted

6.3 implicit waiting

  • Implicit waiting applies to element location. It sets a maximum time during which selenium keeps trying to locate the element; as soon as the element is found, execution continues with the next step

  • If the element still cannot be located when the time runs out, find_element raises NoSuchElementException

  • Sample code

    from selenium import webdriver

    driver = webdriver.Chrome()

    driver.implicitly_wait(10) # Implicit wait, at most 10 seconds

    driver.get('https://www.baidu.com')

    # Locate the Baidu search box; selenium retries for up to 10 seconds
    driver.find_element_by_xpath('//*[@id="kw"]')

    driver.quit()


6.4 explicit waiting (for reference)

  • Check at a fixed interval whether the wait condition is met. Once it is met, stop waiting and continue with the following code

  • If the condition is not met, keep waiting until the specified time is exceeded, then raise a TimeoutException

  • Sample code

from selenium import webdriver  
from selenium.webdriver.support.wait import WebDriverWait  
from selenium.webdriver.support import expected_conditions as EC  
from selenium.webdriver.common.by import By 

driver = webdriver.Chrome()

driver.get('https://www.baidu.com')

# Explicit wait
WebDriverWait(driver, 20, 0.5).until(
    EC.presence_of_element_located((By.LINK_TEXT, 'hao123')))
# Parameter 20 is the maximum wait of 20 seconds
# Parameter 0.5 means the condition is checked once every 0.5 seconds
# EC.presence_of_element_located((By.LINK_TEXT, 'hao123')) locates the tag by its link text
# Every 0.5 seconds selenium checks whether a tag with this link text exists. If it does, execution continues; if it still does not exist after 20 seconds, an exception is thrown

print(driver.find_element_by_link_text('hao123').get_attribute('href'))
driver.quit() 

6.5 implementing page waiting manually

Having covered implicit, explicit and forced waiting, we find that no single method solves every page waiting problem, for example when the page must be scrolled to trigger asynchronous ajax loading. Taking the Taobao homepage as an example, we implement page waiting manually.

  • Principle:
    • Combine the ideas of forced waiting and explicit waiting in a manual implementation
    • Check repeatedly, either without limit or a limited number of times, whether a tag object has been loaded (whether it exists)
  • The implementation code is as follows:
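import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

# driver = webdriver.Chrome('/home/worker/Desktop/driver/chromedriver')
driver = webdriver.Chrome() # assumes chromedriver's directory is on PATH

driver.get('https://www.taobao.com/')
time.sleep(1)

# Try at most 10 times; use `while True` instead to retry without a limit
for i in range(1, 11):
    try:
        time.sleep(3)
        element = driver.find_element_by_xpath('//div[@class="shop-inner"]/h3[1]/a')
        print(element.get_attribute('href'))
        break
    except NoSuchElementException:
        # Element not loaded yet: scroll further down to trigger the ajax loading
        js = 'window.scrollTo(0, {})'.format(i * 500) # js statement
        driver.execute_script(js) # Execute the js
driver.quit()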
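
The retry loop above can be generalised into a small polling helper. The following is a minimal sketch that is independent of selenium; the name wait_until, its parameters and the demo condition are all illustrative. It calls a condition repeatedly, swallows the listed exception types while waiting, and raises the built-in TimeoutError when the time limit is exceeded.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5, ignored=(Exception,)):
    """Call condition() until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except ignored:
            pass  # treat the listed exceptions as "not ready yet"
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within {} seconds".format(timeout))
        time.sleep(interval)


# Demo condition that only succeeds on the third call
attempts = {"count": 0}


def ready_on_third_try():
    attempts["count"] += 1
    return "loaded" if attempts["count"] >= 3 else None


print(wait_until(ready_on_third_try, timeout=2, interval=0.01))  # loaded
```

With a real driver, the condition could be a lambda that calls a find_element function, with ignored narrowed to NoSuchElementException so that other errors are not hidden.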

VII. Enabling headless mode in selenium

Most servers have no graphical interface, and selenium can also control Chrome without one. In this section we learn how to enable this mode (known as headless mode).

  • How to enable headless mode
    • Instantiate a configuration object
      • options = webdriver.ChromeOptions()
    • Add the argument that enables headless mode to the configuration object
      • options.add_argument("--headless")
    • Add the argument that disables the gpu to the configuration object
      • options.add_argument("--disable-gpu")
    • Instantiate the driver object with the configuration object
      • driver = webdriver.Chrome(chrome_options=options)
  • Note: headless mode is only available in Chrome 59+ on macOS and Chrome 57+ on Linux!
  • The reference code is as follows:
from selenium import webdriver

options = webdriver.ChromeOptions() # Create a configuration object
options.add_argument("--headless") # Enable headless mode
options.add_argument("--disable-gpu") # Disable the gpu

# options.set_headless() # Another way to enable headless mode
driver = webdriver.Chrome(chrome_options=options) # Instantiate the driver object with the configuration

driver.get('http://www.itcast.cn')
print(driver.title)
driver.quit()

VIII. Using a proxy ip with selenium

A browser controlled by selenium can also use a proxy ip!

  • How to use a proxy ip

    • Instantiate a configuration object
      • options = webdriver.ChromeOptions()
    • Add the argument that sets the proxy ip to the configuration object
      • options.add_argument('--proxy-server=http://202.20.16.82:9527')
    • Instantiate the driver object with the configuration object
      • driver = webdriver.Chrome('./chromedriver', chrome_options=options)
  • The reference code is as follows:

from selenium import webdriver

options = webdriver.ChromeOptions() # Create a configuration object
options.add_argument('--proxy-server=http://202.20.16.82:9527') # Use a proxy ip

driver = webdriver.Chrome(chrome_options=options) # Instantiate the driver object with configuration

driver.get('http://www.itcast.cn')
print(driver.title)
driver.quit()

IX. Changing the user agent in selenium

When selenium controls Chrome, the user agent defaults to Chrome's own. In this section we learn how to use a different user agent.

  • How to change the user agent

    • Instantiate a configuration object
      • options = webdriver.ChromeOptions()
    • Add the argument that replaces the UA to the configuration object
      • options.add_argument('--user-agent=Mozilla/5.0 HAHA')
    • Instantiate the driver object with the configuration object
      • driver = webdriver.Chrome('./chromedriver', chrome_options=options)
  • The reference code is as follows:

from selenium import webdriver

options = webdriver.ChromeOptions() # Create a configuration object
options.add_argument('--user-agent=Mozilla/5.0 HAHA') # Replace user agent

driver = webdriver.Chrome('./chromedriver', chrome_options=options)

driver.get('http://www.itcast.cn')
print(driver.title)
driver.quit()
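
Sections VII to IX all follow the same pattern: create a configuration object, add arguments, and pass it to the driver. The sketch below collects the three kinds of argument in one place. build_chrome_arguments is an illustrative helper rather than a selenium function; it only builds the list of strings, so it runs without a browser.

```python
def build_chrome_arguments(headless=False, proxy=None, user_agent=None):
    """Return the Chrome command-line arguments for the chosen settings."""
    arguments = []
    if headless:
        arguments += ["--headless", "--disable-gpu"]
    if proxy:
        arguments.append("--proxy-server={}".format(proxy))
    if user_agent:
        arguments.append("--user-agent={}".format(user_agent))
    return arguments


arguments = build_chrome_arguments(
    headless=True,
    proxy="http://202.20.16.82:9527",
    user_agent="Mozilla/5.0 HAHA",
)
print(arguments)

# With selenium installed, each argument would then be applied:
#   options = webdriver.ChromeOptions()
#   for argument in arguments:
#       options.add_argument(argument)
#   driver = webdriver.Chrome(chrome_options=options)  # options=options in newer Selenium releases
```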

Tags: Python

Posted by jtbaker on Thu, 02 Jun 2022 22:32:02 +0530