
Surprise! A Python web crawler needs only 10 lines of code to crawl a massive number of public account articles!

Preface
Since ChatGPT appeared, our ability to process text has gone up a whole level. Before that, after crawling text from the internet, we had to write a dedicated cleaning program to tidy it up. Now, with ChatGPT's help, we can clean text of almost any type and format in a few seconds, stripping HTML tags and more.
Even after the articles are cleaned, there is still a lot we can do with them, such as extracting key points or rewriting them, and ChatGPT handles these tasks easily.
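For example, here is a minimal sketch of handing crawled HTML to ChatGPT for cleaning. It assumes the official OpenAI Python SDK (version 1.x) and an OPENAI_API_KEY environment variable; the model name and prompt are only examples, not part of the original workflow.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def clean_with_chatgpt(raw_html: str) -> str:
    # Ask the model to strip HTML tags and boilerplate and return plain article text
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # example model name
        messages=[
            {"role": "system", "content": "Remove all HTML tags and boilerplate and return only the plain article text."},
            {"role": "user", "content": raw_html},
        ],
    )
    return response.choices[0].message.content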

As early as 2019, I wrote an article introducing how to crawl articles from WeChat public accounts. The method still works today, but back then we did little processing on the crawled articles. Things have changed, and there is now much more we can do with them, as long as we stay within laws and regulations.

Obtaining the URL of an Article
The most important step in web crawling is to find the URL addresses of the target pages, work out their pattern, and then crawl them one by one or with multiple threads. There are generally two ways to obtain the addresses. One is through page pagination: infer the pattern of the URL, for example a parameter like pageNum=num, and simply increment num. The other is hidden in the HTML itself: parse the hyperlinks in the current page, such as the <a href="..."> tags, and extract the URLs as the addresses for subsequent crawling.
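As an illustration of the two approaches, here is a minimal sketch; the site, the pageNum parameter, and the link extraction are placeholders, and it assumes the requests and BeautifulSoup libraries:

import requests
from bs4 import BeautifulSoup

# Approach 1: paginated URLs with a guessable pattern (pageNum is a placeholder parameter)
page_urls = [f"https://example.com/articles?pageNum={num}" for num in range(1, 6)]

# Approach 2: parse the <a href="..."> hyperlinks out of a page you already have
html = requests.get("https://example.com/articles", timeout=10).text
soup = BeautifulSoup(html, "html.parser")
links = [a["href"] for a in soup.find_all("a", href=True)]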

Unfortunately, neither of these methods works well for WeChat public accounts. When we open a public account article in the browser, we can see its URL format:

https://mp.weixin.qq.com/s?__biz=MzIxMTgyODczNg==&mid=2247483660&idx=1&sn=2c14b9b416e2d8eeed0cbbc6f44444e9&chksm=974e2863a039a1752605e6e76610eb39e5855c473edab16ab2c7c8aa2f624b4d892d26130110&token=20884314&lang=zh_CN#rd
Except for the domain name at the beginning, the trailing parameters follow no obvious pattern. Moreover, an article contains no link to the next one, so we cannot start from one article and crawl every article under the account.

But fortunately, we still have a way to obtain all the article addresses under a specific public account. We just need to save them and then crawl them one by one, which becomes much easier.

  1. First, you need to have a public account. If you don't have one, you can register one. This is a prerequisite. The steps to register a public account are relatively simple and can be done by yourself.
  2. After registering, log in to the WeChat public platform and click on "Drafts" on the left side to create a new article.

[Screenshot: creating a new article from the Drafts section of the WeChat public platform]
  3. On the new article page, click on the hyperlink at the top to open a popup window. Select "Public Account Article" and enter the name of the public account you want to crawl, as shown below:

[Screenshot: the hyperlink popup with "Public Account Article" selected and the account name entered]
  4. After selecting the account, you can see the list of all its articles. Now press F12 to open the browser's developer tools and view the network requests made by the page.

[Screenshot: the article list with the developer tools network panel open]
When you click on the next page, you can see the requested URL and the accompanying business parameters and request header parameters.

[Screenshot: the business parameters attached to the request]
The screenshot above shows the business parameters, and they are easy to understand. The important ones are "begin", which indicates which article to start querying from, and "count", which indicates how many articles to fetch per request. "fakeid" is the unique identifier of the public account; it differs for every account, so to crawl another public account you only need to change this parameter. "random" can be omitted. You can also see the corresponding results:

[Screenshot: the JSON response returned for the article list request]
Writing Code
With the above information, we can write the code. I am using Python 3.8. First, import the required libraries and define the URL, the request headers, and the business parameters.

import math
import random
import time

import pandas as pd
import requests

# Target URL
url = "https://mp.weixin.qq.com/cgi-bin/appmsg"
# Request header parameters
headers = {
  "Cookie": "ua_id=YF6RyP41YQa2QyQHAAAAAGXPy_he8M8KkNCUbRx0cVU=; pgv_pvi=2045358080; pgv_si=s4132856832; uuid=48da56b488e5c697909a13dfac91a819; bizuin=3231163757; ticket=5bd41c51e53cfce785e5c188f94240aac8fad8e3; ticket_id=gh_d5e73af61440; cert=bVSKoAHHVIldcRZp10_fd7p2aTEXrTi6; noticeLoginFlag=1; remember_acct=mf1832192%40smail.nju.edu.cn; data_bizuin=3231163757; data_ticket=XKgzAcTceBFDNN6cFXa4TZAVMlMlxhorD7A0r3vzCDkS++pgSpr55NFkQIN3N+/v; slave_sid=bU0yeTNOS2VxcEg5RktUQlZhd2xheVc5bjhoQTVhOHdhMnN2SlVIZGRtU3hvVXJpTWdWakVqcHowd3RuVF9HY19Udm1PbVpQMGVfcnhHVGJQQTVzckpQY042QlZZbnJzel9oam5SdjRFR0tGc0c1eExKQU9ybjgxVnZVZVBtSmVnc29ZcUJWVmNWWEFEaGtk; slave_user=gh_d5e73af61440; xid=93074c5a87a2e98ddb9e527aa204d0c7; openid2ticket_obaWXwJGb9VV9FiHPMcNq7OZzlzY=lw6SBHGUDQf1lFHqOeShfg39SU7awJMxhDVb4AbVXJM=; mm_lang=zh_CN",
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36",
}

# Business parameters
data = {
    "token": "1378111188",
    "lang": "zh_CN",
    "f": "json",
    "ajax": "1",
    "action": "list_ex",
    "begin": "0",
    "count": "5",
    "query": "",
    "fakeid": "MzU5MDUzMTk5Nw==",
    "type": "9",
}

The Cookie and token must be replaced with the values from your own logged-in session, which you can copy from the request shown in the network panel; to avoid hard-coding them, you can also read them from the environment, as in the sketch below.
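A minimal sketch of pulling the credentials from environment variables; the names WX_COOKIE and WX_TOKEN are hypothetical, not something the WeChat platform defines:

import os

# Hypothetical environment variable names; fall back to the hard-coded values above
headers["Cookie"] = os.environ.get("WX_COOKIE", headers["Cookie"])
data["token"] = os.environ.get("WX_TOKEN", data["token"])

With the credentials in place, send the request and parse the response.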

content_list = []
for i in range(20):
    data["begin"] = i*5
    time.sleep(5)
    # Use the get method to submit the request
    content_json = requests.get(url, headers=headers, params=data).json()
    # It returns a JSON, which contains the data of each page
    for item in content_json["app_msg_list"]:
        # Extract the title, link, and creation time of each article on the page
        items = []
        items.append(item["title"])
        items.append(item["link"])
        t = time.localtime(item["create_time"])
        items.append(time.strftime("%Y-%m-%d %H:%M:%S", t))
        content_list.append(items)

The outer for loop controls how many pages to crawl. Crawling 20 pages in one run, with 5 articles per page, gives 100 articles in total. First check how many pages of historical articles the public account actually has, and make sure the number of pages you request does not exceed it. On each iteration, data["begin"] is updated to indicate which article to start from, fetching 5 articles at a time. Note that you should not crawl too much or too frequently: wait a few seconds after each request, otherwise your IP and cookie may be blocked, and the public account itself may even be banned.

Finally, we just need to save the titles and URLs, and then we can crawl them one by one.

name = ['title', 'link', 'create_time']
test = pd.DataFrame(columns=name, data=content_list)
test.to_csv("url.csv", mode='a', encoding='utf-8')
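The article stops at saving the list; fetching each saved link is the second stage. Here is a minimal sketch of that step, assuming the requests and BeautifulSoup libraries (the extracted text can then be cleaned or summarized with ChatGPT as described in the preface):

import time
import random

import pandas as pd
import requests
from bs4 import BeautifulSoup

saved = pd.read_csv("url.csv")
# The append-mode saves above may repeat the header row; keep only real links
saved = saved[saved["link"].str.startswith("http", na=False)]

for link in saved["link"]:
    html = requests.get(link, timeout=10).text
    # get_text strips the HTML tags; the result can then be handed to ChatGPT for further cleaning
    text = BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)
    time.sleep(random.randint(5, 10))  # be polite between article requests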

To get all the historical articles, we need to obtain the total number of articles from "app_msg_cnt" and calculate how many pages there are, so we can crawl all the articles at once.

content_json = requests.get(url, headers=headers, params=data).json()
count = int(content_json["app_msg_cnt"])
print(count)
page = int(math.ceil(count / 5))
print(page)

To avoid crawling too frequently, the program needs to sleep between requests. After every 10 pages, we save the results collected so far and let it sleep for a noticeably longer time:

    # Inside the page loop: save progress and take a longer break every 10 pages
    if (i > 0) and (i % 10 == 0):
        name = ['title', 'link', 'create_time']
        test = pd.DataFrame(columns=name, data=content_list)
        test.to_csv("url.csv", mode='a', encoding='utf-8')
        print("Saved successfully for the " + str(i) + "th time")
        content_list = []
        time.sleep(random.randint(60, 90))
    else:
        time.sleep(random.randint(15, 25))
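Putting the pieces together, here is a sketch of what the full page loop might look like once page has been computed; it simply combines the snippets above and is not a verbatim copy of the repository code:

content_list = []
for i in range(page):
    data["begin"] = i * 5
    content_json = requests.get(url, headers=headers, params=data).json()
    for item in content_json["app_msg_list"]:
        t = time.localtime(item["create_time"])
        content_list.append([item["title"], item["link"],
                             time.strftime("%Y-%m-%d %H:%M:%S", t)])
    if (i > 0) and (i % 10 == 0):
        pd.DataFrame(columns=['title', 'link', 'create_time'],
                     data=content_list).to_csv("url.csv", mode='a', encoding='utf-8')
        print("Saved successfully for the " + str(i) + "th time")
        content_list = []
        time.sleep(random.randint(60, 90))
    else:
        time.sleep(random.randint(15, 25))

# Save whatever is still in memory after the last page
pd.DataFrame(columns=['title', 'link', 'create_time'],
             data=content_list).to_csv("url.csv", mode='a', encoding='utf-8')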

The complete code is available on GitHub and can be downloaded and used directly.
https://github.com/cxyxl66/WeChatCrawler
