# DrissionPage Introduction

https://github.com/g1879/DrissionPage

DrissionPage is a web automation tool written in Python that integrates the functionality of Selenium and Requests behind a unified, simple interface. Developers can switch freely between browser mode (like driving Selenium) and packet mode (like using Requests directly). Thanks to this, DrissionPage handles both dynamic pages that require JavaScript rendering and simple static pages with ease.
## Main Page Objects

DrissionPage provides three main page objects, each suited to its own use case:

- ChromiumPage: controls the browser directly; suited to scenarios that require interacting with the page, such as clicking buttons, entering text, and running JavaScript. Its performance is bound by the browser, so it runs more slowly and can use more memory.
- WebPage: a comprehensive page object that can both control the browser and send and receive data packets. It has two modes:
  - d mode: drives the browser; very powerful but slower;
  - s mode: handles data packets; fast, and suited to simpler packet-based scenarios.
- SessionPage: a lightweight page object designed purely for sending and receiving data packets, with no page interaction. It is highly efficient and an ideal choice for large-scale data scraping.
## Features

### Seamless Mode Switching
DrissionPage allows developers to switch freely between Selenium's browser driver and Requests' session. If rendering a web page is needed, use Selenium; if quick data scraping is desired, use Requests. For example, when encountering a web page with both dynamic and static content, one can quickly obtain static data using SessionPage and then switch to ChromiumPage or WebPage's d mode to handle dynamic content.
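The idea behind this switching can be sketched in plain Python. The functions below are hypothetical stand-ins, not DrissionPage's API (its real switch is WebPage's change_mode()): the point is simply that the cheap packet path is preferred, and the browser path is used only when JavaScript rendering is required.

```python
# Illustrative sketch only -- fetch_static/fetch_rendered are made-up
# stand-ins for a SessionPage-style packet fetch and a ChromiumPage-style
# browser fetch; they are not DrissionPage functions.

def fetch_static(url):
    # stand-in for a packet-mode fetch: fast, but runs no JavaScript
    return f"<html>static content of {url}</html>"

def fetch_rendered(url):
    # stand-in for a browser-mode fetch: slow, but executes JavaScript
    return f"<html>rendered content of {url}</html>"

def fetch(url, needs_js=False):
    """Pick the cheap packet path unless the page needs JS rendering."""
    return fetch_rendered(url) if needs_js else fetch_static(url)

print(fetch("https://example.com"))                 # static path
print(fetch("https://example.com", needs_js=True))  # rendered path
```

The same decision drives real usage: scrape listings with SessionPage, and fall back to a browser page object only for content that appears after rendering.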
### Simplified Interface

DrissionPage provides a unified interface that simplifies web automation and data scraping. Developers no longer need to learn the separate, complex APIs of Selenium and Requests, saving a great deal of learning and development time. For locating web elements, DrissionPage offers `ele()` and `eles()` methods similar to Selenium's, supporting several selector types (such as CSS selectors and XPath), which makes them particularly convenient to use.
### Flexible Customization

It lets users set request headers, proxies, timeouts, and more, making scraping more flexible. When a site's anti-scraping mechanisms get in the way, custom request headers and proxies can often help bypass those restrictions.
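As an illustration, the settings such a setup typically involves can be assembled as plain dictionaries; how they are passed to DrissionPage depends on its configuration API, so only the data itself is sketched here (USER_AGENTS and build_request_settings are our own hypothetical names):

```python
# Sketch of typical anti-scraping countermeasures as plain data.
# These names are illustrative; wiring them into DrissionPage is done
# through its own session/browser options and is not shown here.
import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
]

def build_request_settings(proxy=None, timeout=15):
    """Assemble headers, proxies, and a timeout for one request."""
    headers = {
        'User-Agent': random.choice(USER_AGENTS),  # rotate the UA string
        'Accept-Language': 'en-US,en;q=0.9',
    }
    # route both plain and TLS traffic through the same proxy, if given
    proxies = {'http': proxy, 'https': proxy} if proxy else {}
    return {'headers': headers, 'proxies': proxies, 'timeout': timeout}

settings = build_request_settings(proxy='http://127.0.0.1:8888')
print(settings['proxies']['https'])  # -> http://127.0.0.1:8888
```

Rotating the User-Agent and routing through a proxy are the two most common first steps when a site starts rejecting repeated requests.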
### Built-in Common Features

DrissionPage ships with many commonly used features, such as waiting for elements to load and automatic retries. On dynamic pages, elements can take time to appear; the element-waiting feature ensures operations run only after an element has fully loaded, avoiding errors caused by acting on a half-loaded page.
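The element-waiting idea can be sketched in a few lines of plain Python (a conceptual illustration, not DrissionPage's implementation): poll a condition until it holds or a deadline passes.

```python
# Conceptual wait-and-retry sketch -- not DrissionPage's own code.
import time

def wait_for(condition, timeout=5.0, interval=0.05):
    """Return True once condition() holds, False after timeout seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)  # back off briefly before retrying
    return False

# Simulate an element that "appears" on the third poll.
polls = {'n': 0}
def element_loaded():
    polls['n'] += 1
    return polls['n'] >= 3

print(wait_for(element_loaded, timeout=2.0))  # -> True
```

Building this loop into the library means callers never write it themselves: a lookup or click simply blocks until the element is ready or the timeout is reported.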
### Multi-Tab Operation

It can operate multiple browser tabs at the same time, even tabs that are not currently active, without switching between them. This is particularly useful when handling several pages at once and significantly improves efficiency.
### Packet Capture Upgrade: the Listen Feature

In DrissionPage 4.0 the packet capture feature was significantly improved: every page object now has a built-in listener, with stronger capabilities and a more reasonable API. This greatly helps with debugging and data collection.
#### Example Code

The following example can be run directly; it also records the elapsed time, so you can see how the listen feature is used:

```python
from DrissionPage import ChromiumPage
from TimePinner import Pinner  # small timing helper, a separate package
from pprint import pprint

page = ChromiumPage()
page.listen.start('api/getkeydata')  # specify the target to listen for, then start listening
pinner = Pinner(True, False)

page.get('http://www.hao123.com/')  # open the website
packet = page.listen.wait()          # wait to receive the data packet
pprint(packet.response.body)         # print the packet's content
pinner.pin('Time taken', True)
```

After running this code, the captured packet's content and the total time taken are printed, which helps with performance analysis and debugging.
### Page Access Logic Optimization

Version 3.x had two main issues with page connections: the `timeout` parameter of the browser page object's `get()` method only applied during the page-loading phase, not the connection phase; and the `none` loading strategy was of little practical use. Both issues are resolved in version 4.0, and users can now control when a connection is terminated.
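Why controlling both phases matters can be shown with a small, purely conceptual sketch (not DrissionPage's API): one deadline shared by a connect phase and a load phase, so that whichever phase overruns is the one cut off.

```python
# Conceptual sketch only -- not DrissionPage's API. One shared deadline
# covers both the connection phase and the loading phase, so a stall in
# either phase can be detected and abandoned.
import time

def load_page(connect, load, timeout=5.0):
    """Run connect() then load(); report which phase missed the deadline."""
    deadline = time.monotonic() + timeout
    if not connect(deadline):
        return 'connect timed out'
    if not load(deadline):
        return 'load timed out'
    return 'ok'

# Simulated phases: each returns True if it finished before the deadline.
def fast(deadline):
    return time.monotonic() < deadline  # finishes immediately

def slow(deadline):
    time.sleep(0.05)
    return False  # pretend this phase overran the deadline

print(load_page(fast, fast))  # -> ok
print(load_page(fast, slow))  # -> load timed out
```

Under the 3.x behavior described above, only the second check existed; a hang while connecting could not be interrupted by `timeout` at all.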
## Use Cases

### Web Automation Testing

Simulate user operations on web pages, Selenium-style, for automated testing. Functions such as login, registration, and form submission can be tested to verify a page's stability and reliability.
### Data Scraping

Fetch static pages the Requests way, and switch to browser mode when a complex page calls for it. This allows quick, efficient scraping of data such as news, product information, and social network content from a wide range of sites.
### Crawler Development

With its flexible mode switching and powerful element locating, DrissionPage is well suited to building many kinds of crawlers. Choosing the appropriate mode for each site's characteristics improves a crawler's efficiency and stability.
## Usage Examples

### Controlling the Browser

With the ChromiumPage object, browser automation such as logging in and filling out forms is straightforward.

```python
from DrissionPage import ChromiumPage

page = ChromiumPage()
page.get('https://gitee.com/login')  # open the login page

# find the account input box and enter the account
user_login = page.ele('#user_login')
user_login.input('Your Account')

# find the password input box and enter the password
user_password = page.ele('#user_password')
user_password.input('Your Password')

# find the login button and click it
login_button = page.ele('@value=Log In')
login_button.click()
```
### Scraping Data

With the SessionPage object, data can be scraped efficiently without any complex interaction with the page.

```python
from DrissionPage import SessionPage

page = SessionPage()
for i in range(1, 4):  # visit three listing pages
    page.get(f'https://gitee.com/explore/all?page={i}')  # open each page
    # find all project link elements
    links = page.eles('.title.project-namespace-path')
    for link in links:  # iterate over the link elements
        print(link.text, link.link)  # print each link's text and URL
```
### Page Analysis

With the WebPage object, you can switch flexibly between browser mode and packet mode to suit different analysis needs.

```python
from DrissionPage import WebPage

page = WebPage()
page.get('https://gitee.com/explore/all')  # open the page
page.change_mode()  # switch mode
# find the project list element, then the items inside it
items = page.ele('.ui.relaxed.divided.items.explore-repo__list').eles('.item')
for item in items:  # iterate over the projects
    print(item('t:h3').text)             # print the project title
    print(item('.project-desc.mb-1').text)  # print the project description
```
## Conclusion

DrissionPage is a powerful, user-friendly open-source Python package that offers efficient, flexible solutions for web automation and data scraping. By combining the functionality of Selenium and Requests behind seamless mode switching and a simple interface, it lets developers focus on business logic. Whether you are a novice or an experienced professional, DrissionPage is worth trying, and it makes a wide range of web automation tasks easier.