Using web crawling to get Baidu Tieba posts

While now full of shitposting and trolling, Baidu Tieba used to be one of the largest and most-used Chinese communication platforms. It works somewhat like Reddit, in that people can create their own forum under different categories. (For more info, see https://en.wikipedia.org/wiki/Baidu_Tieba)

It bred a lot of internet subcultures back then. I also have fond, nostalgic memories of posting my own fiction and chatting with other people in a genuinely friendly environment about a decade ago.

This program crawls a post when you enter its URL. By default it only gets the posts from the LZ (thread starter), since those are the only ones I need.

Note that because of censorship and regulations, some posts containing “sensitive words” may already have been deleted. Tieba is also known for its unstable system: posts sometimes mysteriously disappear for no reason. During a wide “system maintenance” in 2019, all posts from before 2017 could no longer be found, and the news went viral. Many of them were salvageable and have since been restored, but I can’t guarantee that every post is still accessible.

Anyways, let’s get to the code.

I use this post for example: https://tieba.baidu.com/p/1816186611

Code reference: https://www.cnblogs.com/chengxuyuanaa/p/13065178.html

see_lz=1: only the posts from the thread starter (LZ) are returned

pn=1: the page number

# -*- coding:utf-8 -*-
from urllib import request
req = request.Request('https://tieba.baidu.com/p/1816186611?see_lz=1&pn=1')
response = request.urlopen(req).read().decode('utf-8')
print(response)

Note that there are similar snippets online for the same task, but most of them use urllib2, which raises an error in Python 3. urllib2 was split into urllib.request and urllib.error in Python 3, so replace the corresponding imports and calls: change urllib2 to urllib.request and it should work.
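A quick way to confirm the mapping without touching the network (a minimal sketch; the two submodules below are the Python 3 homes of the old urllib2 names):

```python
# Python 2:  import urllib2 ; urllib2.urlopen(url) ; urllib2.URLError
# Python 3 splits these across two submodules:
from urllib import request, error

# request.urlopen replaces urllib2.urlopen
# error.URLError replaces urllib2.URLError (it subclasses OSError in Python 3)
print(callable(request.urlopen))            # → True
print(issubclass(error.URLError, OSError))  # → True
```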

Use the following regular expression to match the title.

r'<h1 class="core_title_txt.*?>(.*?)</h1>',re.S
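To see what the pattern captures, here is a minimal round-trip with a made-up HTML fragment (the real page may use h1 or h3 depending on the template Tieba serves, as noted below):

```python
import re

# Illustrative fragment only; attribute values are invented for the demo.
page = '<h1 class="core_title_txt" title="demo">My Post Title</h1>'
pattern = re.compile(r'<h1 class="core_title_txt.*?>(.*?)</h1>', re.S)
m = re.search(pattern, page)
print(m.group(1))  # → My Post Title
```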

Tieba has changed from GBK to UTF-8, but I still ran into mojibake when opening the .txt file.

Remember to decode as ‘utf-8’ when requesting the content, and also pass encoding=‘utf-8’ when writing to the txt file.

response = request.urlopen(req).read().decode('utf-8', 'ignore')
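The writing side matters just as much: if you omit the encoding argument, open() falls back to the platform default (often a legacy codec like GBK on Chinese-locale Windows), which is a common source of the mojibake. A small sketch, using a stand-in string and the same test.txt filename as the script below:

```python
# Pass encoding explicitly on both write and read; otherwise open() uses
# the platform default codec and the file may come out garbled.
text = '贴吧标题'  # stand-in for decoded post content
with open('test.txt', 'w', encoding='utf-8') as f:
    f.write(text)

with open('test.txt', 'r', encoding='utf-8') as f:
    print(f.read())  # → 贴吧标题
```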

If it prints that the title is None, check whether the tag for the header has changed (I have seen both h1 and h3 occur).

If it returns the error message ‘URL is not valid’, the anti-spider mechanism might have been triggered, and you might want to wait for a while (at least that is what I did). I personally don’t need to crawl a massive amount of data from it.
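If waiting isn't enough, a common (hedged) mitigation is to send a browser-like User-Agent header and pause between page fetches; the header string below is just an example, and none of this is guaranteed to get past Tieba's anti-spider checks:

```python
import time
from urllib import request

def fetch(url, delay=2.0):
    # Browser-like User-Agent plus a polite pause between requests.
    req = request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    time.sleep(delay)
    return request.urlopen(req).read().decode('utf-8', 'ignore')

# The Request object can be inspected without hitting the network:
req = request.Request('https://tieba.baidu.com/p/1816186611?see_lz=1&pn=1',
                      headers={'User-Agent': 'Mozilla/5.0'})
print(req.get_header('User-agent'))  # → Mozilla/5.0
```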

Here is the code that I use. The regular expressions might change over time, so feel free to adjust accordingly.

# -*- coding: utf-8 -*-
# @Author  : TheParnassus
# @Time    : 2021/10/30 10:00
# @Function: get Baidu Tieba's post
import os
import re
from urllib import request

class RemoveTags:
    # delete img tags (and 7-space indent runs)
    removeImg = re.compile('<img.*?>| {7}')
    # delete address
    removeAddr = re.compile('<a.*?>|</a>')
    # replace tr to \n
    replaceLine = re.compile('<tr>|<div>|</div>|</p>')
    # replace td to \t
    replaceTD = re.compile('<td>')
    # replace paragraph with space
    replacePara = re.compile('<p.*?>')
    # replace linebreak with \n
    replaceBR = re.compile('<br><br>|<br>')
    # remove other tags
    removeOtherTag = re.compile('<.*?>')
    def replace(self,x):
        x = re.sub(self.removeImg,"",x)
        x = re.sub(self.removeAddr,"",x)
        x = re.sub(self.replaceLine,"\n",x)
        x = re.sub(self.replaceTD,"\t",x)
        x = re.sub(self.replacePara,"\n    ",x)
        x = re.sub(self.replaceBR,"\n",x)
        x = re.sub(self.removeOtherTag,"",x)
        return x.strip()


class Tieba:
    def __init__(self, baseurl, seelz):
        # url
        self.baseurl = baseurl
        # LZ post only = 1 by default
        self.seelz = '?see_lz=' + str(seelz)
        # removetags
        self.removetag = RemoveTags()

    def getpage(self, pn):
        # get the post in the page(pn)
        try:
            url = self.baseurl + self.seelz + '&pn=' + str(pn)
            req = request.Request(url)
            response = request.urlopen(req).read().decode('utf-8', 'ignore')
            # return in 'utf-8'
            return response
        except request.URLError as e:
            if hasattr(e, 'reason'):
                print('Connection failed, reason:', e.reason)
                return None

    def gettitle(self, page):
        # get the post title
        pattern = re.compile(r'<h3 class="core_title_txt.*?>(.*?)</h3>',re.S)
        result = re.search(pattern, page)
        if result:
            return result.group(1).strip()
        else:
            return None

    def getpagenum(self, page):
        # get the page number
        pattern = re.compile('<li class="l_reply_num.*?</span>.*?<span.*?>(.*?)</span>',re.S)
        result = re.search(pattern, page)
        if result:
            return result.group(1).strip()
        else:
            return None

    def getcontent(self, page):
        # get the post content
        pattern = re.compile('<div id="post_content_.*?>(.*?)</div>', re.S)
        items = re.findall(pattern, page)
        # add filtered contents without tags that are not needed 
        contents = []
        for item in items:
            content = "\n" + self.removetag.replace(item) + "\n"
            contents.append(content)
        return contents

    def start(self):
        indexpage = self.getpage(1)
        # bail out if the request itself failed
        if indexpage is None:
            return
        pageNum = self.getpagenum(indexpage)
        title = self.gettitle(indexpage)
        print(title)
        # the page does not contain anything
        if pageNum is None:
            print('The URL is no longer valid')
            return
        print('Total pages: %s' % pageNum)
        # start a new txt file
        with open('test.txt', 'w', encoding='utf-8') as f:
            f.write(str(title))
        for i in range(1, int(pageNum)+1):
            page = self.getpage(i)
            contents = self.getcontent(page)
            for content in contents:
                with open('test.txt', 'a', encoding='utf-8') as f:
                    f.write(str(content))
        # rename the txt file when done
        os.rename('test.txt', str(title) + '.txt')

if __name__ == '__main__':
    print("Please enter the post link")
    baseURL = 'https://tieba.baidu.com/p/' + str(input('https://tieba.baidu.com/p/'))
    # only thread starter's post will be recorded
    seelz = 1
    bdtb = Tieba(baseURL, seelz)
    bdtb.start()
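To sanity-check the tag-stripping logic without fetching a real page, here is a condensed standalone version of the RemoveTags pipeline (same substitution order, fewer rules) run on an invented HTML fragment:

```python
import re

# Condensed sketch of the RemoveTags pipeline: images dropped, link tags
# stripped but their text kept, <br> turned into newlines, leftovers removed.
def strip_tags(x):
    x = re.sub('<img.*?>', '', x)        # delete img tags
    x = re.sub('<a.*?>|</a>', '', x)     # delete link tags, keep link text
    x = re.sub('<br><br>|<br>', '\n', x) # line breaks to \n
    x = re.sub('<.*?>', '', x)           # remove any remaining tags
    return x.strip()

html = 'Hello<br><img src="x.png"><a href="y">world</a>'
print(strip_tags(html))  # → Hello\nworld (on two lines)
```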
