
Python Automation - Collection and Analysis of Hot Searches on Toutiao, Douyin, and Weibo

This is a common use case in self-media work. Today we will walk through it to gain insight into learning and using Python.

Effect
Running the .py script produces a local .xlsx data table file containing:

  1. Automatic classification; overall matching rate around 84%–96%.

  2. Word frequency statistics; a word trending on all three platforms at once signals sustained public attention and high information density.

  3. The average sentiment value of the text and a sentiment value for each title; chiefly a gauge of the emotional intensity of manually pinned hot searches.

  4. Part-of-speech analysis, marking words that may carry guiding or implanted framing: stacking up attributive and adverbial modifiers can always hype an otherwise ordinary story (see the short sketch below).
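
To illustrate item 4, here is a minimal sketch (assuming only that jieba is installed, with a made-up sample title) that tags a title with jieba.posseg and picks out the adjective and adverb modifiers ('a' and 'd' prefixes in jieba's tag set):

import jieba.posseg as pseg

# Tag a hypothetical sample title and print its adjective/adverb modifiers
for word, flag in pseg.cut('震惊全网的惊人内幕'):
    if flag.startswith(('a', 'd')):
        print(word, flag)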


Code#

import os
from datetime import datetime
import requests
from bs4 import BeautifulSoup
from openpyxl import Workbook
from snownlp import SnowNLP
import jieba
from collections import Counter
import jieba.posseg as pseg
import json
import urllib.request

# Date string used to name the output file.
def get_formatted_time():
    """
    Get the formatted current date
    :return: Date string in 'YYYY-MM-DD' format
    """
    now = datetime.now()
    return now.strftime('%Y-%m-%d')

# News parsing function
def get_news_from_url(url: str):
    """
    Fetch hot search news from the specified URL
    :param url: Webpage URL
    :return: List of hot search news
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    r = requests.get(url, headers=headers)
    r.encoding = 'utf-8'
    soup = BeautifulSoup(r.text, 'html.parser')
    toutiao_resoubang = soup.find_all('div', class_='single-entry tindent')

    resoubang_list = []
    for item in toutiao_resoubang:
        spans = item.find_all('span')
        for span in spans:
            resoubang_list.append(span.string)
    return resoubang_list

# Delete empty rows
def delete_empty_rows(sheet_name: str, wb: Workbook):
    """
    Delete empty rows in the specified worksheet
    :param sheet_name: Worksheet name
    :param wb: Excel workbook object
    https://notes-by-yangjinjie.readthedocs.io/zh_CN/latest/python/05-modules/openpyxl.html?highlight=openpyxl
    """
    ws = wb[sheet_name]
    # Iterate in reverse so that deleting a row does not shift the rows still to be checked
    for row in reversed(list(ws.iter_rows())):
        if all(cell.value is None for cell in row):
            ws.delete_rows(row[0].row)

# Count average index and sentiment score
def calculate_average_index_and_sentiment_score(sheet_name: str, wb: Workbook):
    """
    Calculate the average index and sentiment score of hot search news in the specified worksheet
    :param sheet_name: Worksheet name
    :param wb: Excel workbook object
    :return: Average index, sentiment score tuple
    """
    ws = wb[sheet_name]
    total_index = 0
    count = 0
    sentiment_score_list = []
    for row in ws.iter_rows(min_row=2, min_col=1, max_col=3):
        news_str = ''
        for cell in row:
            if cell.value is not None:
                news_str += str(cell.value)
        # Many functions, see: https://github.com/isnowfy/snownlp
        s = SnowNLP(news_str)
        sentiment_score = s.sentiments
        sentiment_score_list.append(sentiment_score)
        # row[2] refers to the third cell in each row, which is the third column.
        total_index += row[2].value
        count += 1
        ws.cell(row=row[0].row, column=4, value=sentiment_score)
    # Average index and sentiment score for each table
    return (total_index / count, sum(sentiment_score_list) / len(sentiment_score_list))

# Count word frequency
def calculate_word_count(sheet_names: list, wb: Workbook):
    """
    Count the top 30 most frequent words across the worksheets and write the results to a new worksheet
    :param sheet_names: Worksheet names
    :param wb: Excel workbook object
    The stop words file is fetched inside the function.
    Stop words are words that occur very frequently in natural language but usually carry
    no practical meaning and add little to text analysis tasks, such as "的", "了", etc.
    """
    
    # Stop words file
    stopwords_file = 'https://ghproxy.com/https://raw.githubusercontent.com/goto456/stopwords/master/cn_stopwords.txt'
    # Request stop words library
    response = requests.get(stopwords_file)
    stopwords = response.content.decode('utf-8').split('\n')

    # Remove the stop words from jieba's dictionary so they are not emitted as tokens
    for word in stopwords:
        jieba.del_word(word.strip())

    # Traverse all worksheets to count word frequency. Given the limits of the segmentation vocabulary, only rough statistics are possible.
    word_count = Counter()
    for sheet_name in sheet_names:
        ws = wb[sheet_name]
        for row in ws.iter_rows(min_row=2, min_col=1, max_col=3):
            news_str = ''
            for cell in row:
                if cell.value is not None:
                    news_str += str(cell.value)
            words = jieba.lcut(news_str)

            # Ignore numeric types.
            # words = [word for word in words if not(word.isdigit() or (word.replace('w', '').replace('.', '').isdigit()))]
            new_words = []
            for word in words:
                 # Ignore strings of length 0 or 1
                if len(word) <= 1:
                    continue
                # Remove numeric noise
                if not(word.isdigit() or (word.replace('w', '').replace('.', '').isdigit())):
                    new_words.append(word)
            words = new_words
            # Update word count
            word_count.update(words)

    # Remove stop words
    for word in list(word_count):
        if word in stopwords:
            del word_count[word]

    # Get the top 30 most frequent words
    top_words = word_count.most_common(30)
    # Create a new worksheet
    ws = wb.create_sheet(title='Word Frequency Statistics')
    # Add rows
    ws.append(['Rank', 'Word', 'Frequency'])
    # Start counting from 1, not 0
    for i, (word, freq) in enumerate(top_words, 1):
        ws.append([i, word, freq])


# Added classification function on 2023.5.9
def write_category_to_sheet(sheet_name: str, wb: Workbook):
    """
    Write the keyword classification information of news events into the fifth column of the Excel worksheet
    :param sheet_name: Worksheet name
    :param wb: Excel workbook object
    :jieba segmentation: https://github.com/fxsjy/jieba
    """

    # Call online classification dictionary json
    # Get JSON data from URL
    response = urllib.request.urlopen('https://ghproxy.com/https://raw.githubusercontent.com/hoochanlon/scripts/main/AQUICK/category_news.json')
    json_data = response.read().decode('utf-8')

    # Parse JSON data
    category_keywords = json.loads(json_data)

    # Start from the current sheet
    ws = wb[sheet_name]
    for row in ws.iter_rows(min_row=2, min_col=1, max_col=4):
        title_str = ''
        for cell in row:
            if cell.value is not None:
                title_str += str(cell.value)

        # Segment the title string into (word, flag) pairs
        words = pseg.cut(title_str)
        category = ''
        for word, flag in words:
            # To restrict matching to nouns, check the tag prefix with str.startswith:
            # if flag.startswith('n'):
            # .items() retrieves the key-value pairs of the classification dictionary
            for key, keywords in category_keywords.items():
                if word in keywords:
                    category = key
                    break
            if category:
                break
        if not category:
            # category = 'Others'
            category = 'Comprehensive'
        ws.cell(row=row[0].row, column=5, value=category)


# Write the parsed webpage data into the xlsx sheet by rule: three items per group, one group per row.
def write_news_to_sheet(news_list: list, sheet_name: str, wb: Workbook):
    """
    Write the news list into the Excel worksheet
    :param news_list: News list
    :param sheet_name: Worksheet name
    :param wb: Excel workbook object
    :cell.value.isnumeric() indicates whether the current string can be represented as a number
    :isinstance(cell.value, str) indicates whether the current value is a string
    """
    ws = wb.create_sheet(title=sheet_name)

    row = []
    for i, item in enumerate(news_list, start=1):
        if i >= 156:  # Keep roughly 50 groups of data; items from index 156 onward are other introductory Weibo articles
            continue
        if i % 3 == 1:  # The first item of each group of three is the rank label; strip the '、' separator
            item = item.replace("、", "")
        row.append(item)
        if i % 3 == 0:  # Every third item completes a group of three columns; append the row and start a new one
            ws.append(row)
            row = []
    
    # Start iterating each row of data from the second row
    for row in ws.iter_rows(min_row=2, min_col=1):
        for cell in row:
            if cell.column == 1 or cell.column == 3:
                # isinstance returns True if the value is an instance of the given type, otherwise False
                if isinstance(cell.value, str) and not cell.value.isnumeric():
                    # Pinned entries carry no index; substitute '185w', the (real-time) average index of ranks 2, 3 and 4
                    cell.value = cell.value.replace('[Pinned]', '185w')
                if isinstance(cell.value, str) and cell.value.isnumeric():
                    cell.value = int(cell.value)
                elif isinstance(cell.value, str):
                    cell.value = float(cell.value.replace('w', ''))

    ws.cell(row=1, column=3, value='Index (ten thousand)')
    ws.cell(row=1, column=4, value='Sentiment Score')
    ws.cell(row=1, column=5, value='Category')
    

def main():

    urls = ['http://resou.today/art/11.html', 'http://resou.today/art/22.html','http://resou.today/art/6.html']
    sheet_names = ["Today's Headlines Hot List", 'Douyin Current Affairs Hot List', 'Weibo Hot Search']

    wb = Workbook()

    wb.remove(wb['Sheet'])
    for url, sheet_name in zip(urls, sheet_names):
        news_list = get_news_from_url(url)
        # Write the parsed data from the webpage to xlsx
        write_news_to_sheet(news_list, sheet_name, wb)
        # Delete empty rows left by operations
        delete_empty_rows(sheet_name, wb)
        # Classification
        write_category_to_sheet(sheet_name, wb)

        # Calculate average index and average sentiment value for each table
        average_index, sentiment_score = calculate_average_index_and_sentiment_score(sheet_name, wb)
        print(f'{sheet_name} Average Index: {average_index:.2f} Sentiment Score: {sentiment_score:.2f}')
    
    # Word frequency statistics
    calculate_word_count(sheet_names, wb)

    # Perform cross-platform processing to save path
    save_path_xlsx_file = os.path.join(os.path.join(os.path.expanduser("~"), "Desktop"), 
                                       "resoubang_{}.xlsx".format(get_formatted_time()))
    # The default empty sheet was already removed above; save to the specified file
    # wb.remove(wb['Sheet']); wb.save(save_path_xlsx_file)
    wb.save(save_path_xlsx_file)

# If the current module is imported by other modules, the code below this condition will not be executed.
if __name__ == '__main__':
    main()

Analysis#

Looking at the code, the first part imports some commonly used Python libraries and modules.

import os: This is one of Python's standard libraries, used for interacting with the operating system, such as creating folders, deleting files, etc.

from datetime import datetime: Import the datetime method from the datetime module, which is used for handling date and time functions.

import requests: This is a very commonly used library for sending HTTP requests, such as fetching webpage content.

from bs4 import BeautifulSoup: Import the BeautifulSoup class from the BeautifulSoup module, which is used to parse and extract data from HTML or XML documents.

from openpyxl import Workbook: Import the Workbook class from the openpyxl module, which is used to manipulate Excel files.

from snownlp import SnowNLP: Import the SnowNLP class from the snownlp module, which is used for Chinese text sentiment analysis.

import jieba: This is a very commonly used Chinese word segmentation library, used to split Chinese text into individual words.

from collections import Counter: Import the Counter class from the collections module, which is used to count the occurrences of elements in an iterable.

import jieba.posseg as pseg: Import the posseg module from the jieba library for Chinese part-of-speech tagging.

import json: This is one of Python's standard libraries, used for handling JSON data.

import urllib.request: This is one of Python's standard libraries, used for sending HTTP requests and handling URLs.
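
To make these concrete, here is a minimal, hedged sketch (assuming snownlp and jieba are installed, with a made-up sample title) showing the three NLP pieces the script relies on:

from snownlp import SnowNLP
import jieba
from collections import Counter

title = '全国多地迎来强降雨天气'  # hypothetical sample title
print(SnowNLP(title).sentiments)  # sentiment score in [0, 1]; higher means more positive
words = [w for w in jieba.lcut(title) if len(w) > 1]  # segment, dropping one-character tokens
print(Counter(words).most_common(3))  # most frequent words, as done sheet-wide in the script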

Overall Explanation#

The script combines the libraries above to implement its different functions.

Next, the code defines a function get_formatted_time() to get the formatted current time.

Then, the code defines a function get_news_from_url(url: str) to fetch hot search news from the specified URL.

Next, the code defines a function delete_empty_rows(sheet_name: str, wb: Workbook) to delete empty rows in the specified worksheet.

Then, the code defines a function calculate_average_index_and_sentiment_score(sheet_name: str, wb: Workbook) to calculate the average index and sentiment score of hot search news in the specified worksheet.

Next, the code defines a function calculate_word_count(sheet_names: list, wb: Workbook) to count the top 30 most frequent words in the worksheets.

Then, the code defines a function write_category_to_sheet(sheet_name: str, wb: Workbook) to write the keyword classification information of news events into the Excel worksheet.

Next, the code defines a function write_news_to_sheet(news_list: list, sheet_name: str, wb: Workbook) to write the news list into the Excel worksheet.

Finally, the code defines a function main() to execute the main program logic. It first defines the list of URLs to fetch and the list of worksheet names, then creates an Excel workbook object. Next, it uses the get_news_from_url() function to fetch hot search news and uses the write_news_to_sheet() function to write the news list into the Excel worksheet. Then, it uses the delete_empty_rows() function to delete empty rows and the write_category_to_sheet() function to write keyword classification information into the Excel worksheet. Next, it uses the calculate_average_index_and_sentiment_score() function to calculate the average index and sentiment score and prints the results. Finally, it uses the calculate_word_count() function to count word frequency and saves the workbook as an xlsx file.

Finally, the code uses the if __name__ == '__main__': condition to check whether the current module is being executed directly, and if so, calls the main() function to execute the main program logic.

Usage#

Install Python
Download Python: First, you need to download and install the Python interpreter. You can find different versions of Python on the official Python website (https://www.python.org). Choose the version suitable for your operating system and download the installer.

Install Python: Run the downloaded installer and follow the instructions in the installation wizard. Make sure to select the option to add Python to the system path.

Run a Python file: Open Command Prompt (Windows) or a terminal (Mac/Linux) and navigate to the directory containing the script, using the cd command to switch directories, for example: cd C:\Users\YourUsername\Documents. Then execute the script with: python example.py. If everything goes smoothly, you will see the script's output in the command prompt or terminal.

Note: on systems where Python 3.x is invoked as python3, run python3 example.py to execute the file.

Set Python's Global Environment Variable#

Open the terminal application. You can find it in the "Applications" folder or use Spotlight search and type "Terminal."

In the terminal, enter the following command to edit your bash configuration file (if you are using Zsh, replace .bash_profile in the command with .zshrc):

nano ~/.bash_profile

The terminal will open a text editor displaying the contents of your bash configuration file (if you have not created a configuration file, it will be blank).

At the end of the file, add the following line to set Python's global environment variable:

export PATH="/usr/local/bin:$PATH"

This will add the /usr/local/bin directory to your PATH environment variable, which is usually where Python is installed.

Press Control + X to exit the text editor, then press Y to save changes, and finally press Enter to confirm the saved file name.

In the terminal, enter the following command to make the changes take effect:

source ~/.bash_profile

This will reload your bash configuration file, making the changes take effect immediately.

Now, you have successfully set Python's global environment variable. You can run the python command in the terminal to verify if it is set correctly.

Correctly Verify Server Certificates
On macOS, you may encounter certificate verification failures when sending HTTPS requests with Python's urllib library.

To resolve this issue, you can try the following steps:

Open the terminal application.

Enter the following command and press Enter:

/Applications/Python\ {your_python_version}/Install\ Certificates.command

Please replace {your_python_version} with the version number of Python you are currently using. For example, if you are using Python 3.10, the command should be:

/Applications/Python\ 3.10/Install\ Certificates.command

After executing the above command, it will automatically download and install the missing root certificates.

This command runs the Install Certificates.command script in the Python installation directory to install the missing root certificates. After executing this command, your Python environment should be able to correctly verify server certificates without encountering certificate verification failure errors.
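
If running that script is inconvenient, a minimal alternative sketch (assuming the third-party certifi package, installed via pip install certifi) is to pass urllib an SSL context built from certifi's CA bundle:

import ssl
import urllib.request

import certifi  # assumed installed: pip install certifi

# Build an SSL context backed by certifi's CA bundle and hand it to urlopen
context = ssl.create_default_context(cafile=certifi.where())
response = urllib.request.urlopen('https://www.python.org', context=context)
print(response.status)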

PIP Install File Dependencies
Enter the following command to check if pip is already installed:

pip --version

If pip is installed, you will see its version information; if not, you will see an error message.

If pip is not installed, you can use one of the following methods to install it:

For Mac and Linux users, run the following commands in the command line to install pip:

curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py

sudo python get-pip.py

For Windows users:

Open https://bootstrap.pypa.io/get-pip.py in your browser.

Save the content of the page as a Python script file named get-pip.py.

In the command line, navigate to the directory where the get-pip.py file is saved.

Run the following command to install pip:

python get-pip.py

After installation is complete, run the following command again to verify that pip has been installed successfully:

pip --version

If pip's version information is displayed, pip has been installed successfully.

Now you can install the required dependency libraries with the following terminal command: pip install requests beautifulsoup4 openpyxl snownlp jieba. The os and datetime modules are built into Python and need no installation. As for import jieba.posseg as pseg and import urllib.request, the former ships with jieba and the latter is part of Python's standard library, so neither needs to be installed separately. If you encounter permission issues during pip installation, add the --user flag to install the libraries into your user directory: pip install --user requests beautifulsoup4 openpyxl snownlp jieba.
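
Equivalently, the third-party dependencies can be kept in a requirements file; a minimal sketch (the name requirements.txt is only a convention, versions left unpinned) would contain:

requests
beautifulsoup4
openpyxl
snownlp
jieba

Then everything installs in one step with pip install -r requirements.txt.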

Finally, with the Python environment in place and the dependencies installed, you can run the script directly from the terminal with the Python interpreter.
