
The new-generation alternative to Selenium: DrissionPage

Today I would like to recommend a Python-based web automation tool: DrissionPage. It can control browsers, send and receive data packets, and even combine the two. In simple terms, it pairs the convenience of browser automation with the efficiency of the requests library.

There are usually two forms of web automation:

  1. Sending request packets directly to the server to obtain the required data, simulating the data flow.

  2. Interacting with the browser and web pages to simulate user interface operations.

The former is lightweight and fast; the requests library is a typical example. However, for websites that require login, it often has to deal with anti-crawling measures such as captchas, JS obfuscation, and signature parameters, which raises the barrier to entry. If the data is generated by JS calculations, the calculation process has to be reproduced, which makes development inefficient.

The latter drives a real browser to simulate user behavior; the Selenium library is a typical example. It can largely bypass these obstacles, but running a browser is far less efficient.

DrissionPage was therefore designed to combine the two in a single tool: you switch to whichever mode suits the task at hand, and a user-friendly API improves both development and execution efficiency.
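
As a minimal sketch of this dual-mode idea: DrissionPage's WebPage object can switch between browser mode ('d') and requests-style mode ('s') with change_mode(), carrying cookies across. The URL below is just the project's own page.

from DrissionPage import WebPage

# A WebPage starts in "d" (driver) mode, driving a real browser
page = WebPage()
page.get('https://gitee.com/g1879/DrissionPage')

# Switch to "s" (session) mode; subsequent requests go over plain HTTP,
# reusing the cookies collected in browser mode
page.change_mode()
page.get('https://gitee.com/g1879/DrissionPage')
print(page.mode)  # 's'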


Features:

  • No webdriver fingerprint, so it is harder for websites to detect the automation
  • No need to download different drivers for different versions of browsers
  • Faster execution speed
  • Can search for elements across iframes without switching in and out
  • Treat iframes as regular elements, making the logic clearer
  • Can operate on multiple tabs in the browser simultaneously, even if the tabs are not active, without switching
  • Can directly read browser cache to save images, without using GUI to click "Save As"
  • Can take screenshots of the entire webpage, including areas outside the viewport (browser version 90 and above; see the sketch after this list)
  • Can handle shadow-root in a non-open state
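
For example, the full-page screenshot is a one-liner. A hedged sketch (the file name and URL are arbitrary placeholders):

from DrissionPage import ChromiumPage

page = ChromiumPage()
page.get('https://gitee.com/g1879/DrissionPage')

# Capture the whole page, including content outside the viewport
# (per the feature list, this needs browser version 90 or above)
page.get_screenshot(path='full_page.png', full_page=True)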

Project address:

https://gitee.com/g1879/DrissionPage

Install DrissionPage using pip (the -i flag points at the Tsinghua University PyPI mirror; drop it to use the default index):

pip install DrissionPage -i https://pypi.tuna.tsinghua.edu.cn/simple

Application example: Scraping Maoyan's Top 100 Movies

This example demonstrates data scraping using a browser.

Target URL: https://www.maoyan.com/board/4

Example code:

The following code can be run directly.

Note that a Recorder object is used here; see the DataRecorder section below for details.

from DrissionPage import ChromiumPage
from DataRecorder import Recorder

# Create a page object
page = ChromiumPage()

# Create a recorder object
recorder = Recorder('data.csv')

# Access the webpage
page.get('https://www.maoyan.com/board/4')

while True:
    # Iterate through all dd elements on the page
    for mov in page.eles('t:dd'):
        # Extract rank, score, title, stars and release time
        num = mov('t:i').text
        score = mov('.score').text
        title = mov('@data-act=boarditem-click').attr('title')
        star = mov('.star').text
        release_time = mov('.releasetime').text
        # Add a row to the recorder's cache
        recorder.add_data((num, title, star, release_time, score))

    # Get the next-page button ('下一页' means "Next page"); click it if present
    btn = page('下一页', timeout=2)
    if btn:
        btn.click()
        page.wait.load_start()
    # Exit the loop if there is no next page
    else:
        break

recorder.record()  # Flush any remaining cached rows to data.csv

Now let's talk about this useful library, DataRecorder.

https://gitee.com/huiwei13/data-recorder

Although it is not widely known, it is very useful:

  • It caches data and writes it in batches once a set quantity is reached, reducing file read/write operations and lowering overhead.
  • It supports writing data from multiple threads simultaneously (see the sketch after this list).
  • If the target file is open elsewhere, it automatically waits for it to be closed before writing, avoiding data loss.
  • It provides good support for resuming a crawl from a breakpoint.
  • It makes transferring data in batches easy.
  • It can automatically create headers from dictionary data.
  • It automatically creates files and paths, reducing code volume.
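
A minimal concurrency sketch, assuming the cache_size constructor argument (per the DataRecorder docs) and the thread safety claimed above; the file name and row counts are arbitrary:

from threading import Thread
from DataRecorder import Recorder

# cache_size sets how many rows are buffered before each disk write
r = Recorder('threaded.csv', cache_size=50)

def worker(n):
    for i in range(100):
        r.add_data((n, i))  # called concurrently from several threads

threads = [Thread(target=worker, args=(n,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

r.record()  # flush whatever is still cached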

Recorder:

Recorder is a simple, intuitive, efficient, and practical tool that does exactly one thing: it continuously receives data and appends it to a file in order. It can accept a single row, or multiple rows at once.

It supports four file formats: csv, xlsx, json, and txt.

from DataRecorder import Recorder

data = ((1, 2, 3, 4), 
        (5, 6, 7, 8))

r = Recorder('data.csv')
r.add_data(data)   # Write multiple rows of data at once
r.add_data('abc')  # Write a single row of data
r.record()         # Flush the cache to disk immediately

Filler:

Filler is used to fill data into table files at specified coordinates. It is very flexible: you can give the coordinate of the top-left corner and fill in a two-dimensional data array. It also encapsulates progress tracking (useful for resuming an interrupted crawl) and can set links on cells.

It only supports csv and xlsx file formats.

from DataRecorder import Filler

f = Filler('results.csv')
f.add_data((1, 2, 3, 4), 'a2')  # Write a row of data starting from cell A2
f.add_data(((1, 2), (3, 4)), 'd4')  # Write a two-dimensional data array starting from cell D4