Today I would like to recommend a Python-based web automation tool: DrissionPage. It can control browsers, send and receive data packets, and even combine the two. In short, it pairs the convenience of browser automation with the efficiency of requests.
There are usually two forms of web automation:
- Sending request packets directly to the server to obtain the required data, simulating the data flow.
- Driving the browser and interacting with web pages, simulating user-interface operations.
The former is lightweight and fast, as with the requests library. However, on sites that require login, it often has to deal with anti-scraping measures such as captchas, JS obfuscation, and signed parameters, which raises the barrier considerably. If the data is generated by JS calculations, that calculation process has to be reproduced, which makes development inefficient.
The latter drives the browser directly to simulate user behaviour, as the Selenium library does. This largely bypasses those obstacles, but the browser itself runs slowly.
DrissionPage was therefore designed to merge the two in one tool: switch to the appropriate mode when needed, with a user-friendly API that improves both development and execution efficiency.
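DrissionPage's WebPage object embodies this switch: it starts in browser ('d') mode, and change_mode() flips it to packet ('s') mode. A minimal sketch (the URLs are placeholders; per the project docs, cookies carry over across the switch):

from DrissionPage import WebPage

page = WebPage()                      # starts in browser ('d') mode
page.get('https://example.com/login')
# ... complete the login with browser-mode element operations ...

page.change_mode()                    # switch to packet ('s') mode
page.get('https://example.com/data')  # now fetched as a plain request
print(page.title)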
Features:
- No webdriver fingerprint, so the browser is harder for sites to detect as automated
- No need to download different drivers for different versions of browsers
- Faster execution speed
- Can search for elements across iframes without switching in and out
- Treat iframes as regular elements, making the logic clearer
- Can operate on multiple browser tabs at the same time, even inactive ones, without switching to them
- Can directly read browser cache to save images, without using GUI to click "Save As"
- Can take screenshots of the entire webpage, including areas outside the viewport (requires browser version 90 or above; see the sketch after this list)
- Can handle shadow-root in a non-open state
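As an illustration of the full-page screenshot feature, a short sketch (parameter names follow the DrissionPage docs; adjust to your version):

from DrissionPage import ChromiumPage

page = ChromiumPage()
page.get('https://www.maoyan.com/board/4')
# Capture the whole page, including content outside the viewport
# (needs browser version 90+, as noted above)
page.get_screenshot(path='board.png', full_page=True)
page.quit()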
Project address:
https://gitee.com/g1879/DrissionPage
Install DrissionPage using pip (the -i option selects the Tsinghua PyPI mirror and can be omitted):
pip install DrissionPage -i https://pypi.tuna.tsinghua.edu.cn/simple
Application example: Scraping Maoyan's Top 100 Movies
This example demonstrates data scraping using a browser.
Target URL: https://www.maoyan.com/board/4
Example code:
The following code can be run directly. Note that it uses a Recorder object; see the DataRecorder section below for details.
from DrissionPage import ChromiumPage
from DataRecorder import Recorder

# Create a page object
page = ChromiumPage()
# Create a recorder object
recorder = Recorder('data.csv')

# Access the webpage
page.get('https://www.maoyan.com/board/4')

while True:
    # Iterate through all dd elements on the page
    for mov in page.eles('t:dd'):
        # Get the required information
        num = mov('t:i').text
        score = mov('.score').text
        title = mov('@data-act=boarditem-click').attr('title')
        star = mov('.star').text
        time = mov('.releasetime').text
        # Write to the recorder
        recorder.add_data((num, title, star, time, score))

    # Get the next-page button ('下一页' is the link text); click if it exists
    btn = page('下一页', timeout=2)
    if btn:
        btn.click()
        page.wait.load_start()
    # Exit the loop if it doesn't exist
    else:
        break

recorder.record()
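The same extraction also works in packet mode: SessionPage sends plain requests and applies the identical element syntax to the returned HTML. A sketch for the first page only, assuming Maoyan serves this page to a bare request (in practice it may demand extra headers or cookies):

from DrissionPage import SessionPage
from DataRecorder import Recorder

page = SessionPage()
recorder = Recorder('data_s.csv')
page.get('https://www.maoyan.com/board/4')

# Same locator syntax as in browser mode, applied to the response HTML
for mov in page.eles('t:dd'):
    num = mov('t:i').text
    title = mov('@data-act=boarditem-click').attr('title')
    recorder.add_data((num, title))

recorder.record()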
Now let's talk about this useful library, DataRecorder.
https://gitee.com/huiwei13/data-recorder
Although it is not widely known, it is very useful.
- Caches data and writes it out once a set quantity accumulates, reducing file read/write operations and overhead (see the sketch after this list)
- Supports writing data from multiple threads simultaneously
- Waits for the file to be closed before writing, avoiding data loss
- Provides good support for resuming a crawl from a breakpoint
- Makes it easy to transfer data in batches
- Can create headers automatically from dictionary data
- Creates files and paths automatically, reducing code volume
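For example, the cache threshold can be set when the Recorder is created (cache_size is a constructor parameter; 500 is this sketch's arbitrary choice):

from DataRecorder import Recorder

# Buffer 500 rows in memory, writing to disk only when the cache fills
r = Recorder('data.csv', cache_size=500)
for i in range(2000):
    r.add_data((i, i * 2))  # cached, not written immediately
r.record()                  # flush whatever remains in the cache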
Recorder:
Recorder is a simple, intuitive, efficient, and practical tool. It does only one thing: continuously receive data and append it to a file in order. It can accept a single row, or multiple rows at once.
It supports four file formats: csv, xlsx, json, and txt.
from DataRecorder import Recorder

data = ((1, 2, 3, 4),
        (5, 6, 7, 8))

r = Recorder('data.csv')
r.add_data(data)   # Write multiple rows of data at once
r.add_data('abc')  # Write a single row of data
r.record()         # Flush the cache to the file
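And, as noted in the list above, dictionary rows can create the header automatically. A sketch (the behaviour follows the DataRecorder feature list, so verify against your version):

from DataRecorder import Recorder

r = Recorder('movies.csv')
# Keys should become the header row of the new file; values fill the rows
r.add_data({'title': 'Movie A', 'score': 9.5})
r.add_data({'title': 'Movie B', 'score': 9.2})
r.record()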
Filler:
Filler is used to fill data into table files at coordinates you specify. It is very flexible: you can treat a given coordinate as the top-left corner and fill in a two-dimensional data block. It also encapsulates progress tracking, such as resuming a crawl from a breakpoint (see the sketch at the end of this section), and it can set links on cells.
It only supports csv and xlsx file formats.
from DataRecorder import Filler
f = Filler('results.csv')
f.add_data((1, 2, 3, 4), 'a2') # Write a row of data starting from cell A2
f.add_data(((1, 2), (3, 4)), 'd4') # Write a two-dimensional data array starting from cell D4
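The resume support mentioned above works through key and sign columns: Filler tracks which rows have already been processed. A loose sketch, assuming a tasks.csv with task keys in column 1 and column 2 used as the sign column (key_cols and sign_col are constructor parameters; the exact shape of the items f.keys yields is an assumption here, so check the DataRecorder docs):

from DataRecorder import Filler

f = Filler('tasks.csv', key_cols=1, sign_col=2)

# f.keys returns the key-column data of rows whose sign column is
# still empty, i.e. the rows not yet processed
for row in f.keys:
    print('would process:', row)
    # ... do the work for this row, then write a value into its
    # sign column so the row is skipped on the next run ...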