Extracting Web Information for Data Mining and Building a Corpus - 白红宇



Published: 2019-06-09


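My current internship project needs a large database (a corpus) built from blogs, Weibo posts, and Q&A sites. The collected content will then be used for training, and the end result is probably meant to be a model along the lines of a Chinese Siri.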

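The first step, a news scraper, is already running stably. The basic idea is to have a crawler fetch the page source of news portals. The news entries turn out to be fairly regular: each sits under an <li> or <table> node and carries a title, a time, and a link to the article.

Once these features are identified, the rest is straightforward: extract all the nodes with Winista.HtmlParser and check whether each one matches the news format we defined.

Naturally, regular expressions come into play for this check.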
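The filtering idea described above can be sketched in a few lines. This is not the Winista.HtmlParser code used in the project; it is a minimal Python illustration using only the standard library, and the sample markup, the `YYYY-MM-DD` date format, and the names `NewsListParser` and `extract_news` are assumptions made for the example.

```python
import re
from html.parser import HTMLParser

# Assumed date format for the sketch; real portals may use other formats.
DATE_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}")


class NewsListParser(HTMLParser):
    """Collect (title, url, date) from <li> items that look like news entries."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_li = False
        self._in_a = False
        self._href = None
        self._title = []
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True
            self._href, self._title, self._text = None, [], []
        elif tag == "a" and self._in_li and self._href is None:
            href = dict(attrs).get("href")
            if href:  # the first link with an href is treated as the headline
                self._in_a = True
                self._href = href

    def handle_data(self, data):
        if self._in_li:
            self._text.append(data)
            if self._in_a:
                self._title.append(data)

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_a = False
        elif tag == "li" and self._in_li:
            self._in_li = False
            title = "".join(self._title).strip()
            date = DATE_RE.search("".join(self._text))
            # Keep the item only if it has all three features: title, link, time.
            if self._href and title and date:
                self.items.append((title, self._href, date.group()))


def extract_news(html):
    parser = NewsListParser()
    parser.feed(html)
    return parser.items


SAMPLE = """
<ul>
  <li><a href="/news/1.html">First headline</a> <span>2012-11-01</span></li>
  <li><a href="/about.html">About us</a></li>
  <li><a href="/news/2.html">Second headline</a> 2012-11-02</li>
</ul>
"""
print(extract_news(SAMPLE))
```

The "About us" item is dropped because it has no time stamp, which is the kind of format check the paragraph above describes.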

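Recently I have been reading up on Weibo scraping and found that data mining is a fascinating field; it also made me realize how limited my own knowledge is. Here I collect some interesting material. I am saving some of it for later: what I cannot follow now, I may understand someday.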

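1. Fetching the basic profiles of 10 million Sina Weibo users and the 50 most recent posts of each crawled user

I don't know Python, but a classmate is working on this too, and I expect I will run into it sooner or later.

This approach is not easy to put into practice, though. First of all, the Sina API will not grant an ordinary user anything like 50,000w (about 500 million) connections per day. You would have to borrow the identity of some top-tier application, and apart from Sina itself or the government, who has that kind of access?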
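A rough back-of-the-envelope calculation shows why the quota matters. The script below makes at least three API calls per user (friend ids, profile, timeline); the hourly limit of 1,000 calls used here is purely a hypothetical figure for illustration, not Sina's actual quota, and `crawl_days` is a name invented for this sketch.

```python
def crawl_days(users, calls_per_user, hourly_limit, accounts=1):
    """Days needed to crawl `users` profiles under an hourly call quota."""
    total_calls = users * calls_per_user
    calls_per_day = hourly_limit * 24 * accounts
    return total_calls / calls_per_day


# 10 million users, 3 calls each, hypothetical quota of 1,000 calls per hour.
print(round(crawl_days(10_000_000, 3, 1000)))              # -> 1250
# The same job split across 5 sets of credentials, one process each.
print(round(crawl_days(10_000_000, 3, 1000, accounts=5)))  # -> 250
```

Under these assumed numbers a single account would need more than three years, which is why the script starts one process per set of credentials and why an unusually generous quota would be needed to finish in reasonable time.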


#!/usr/bin/python
#-*-coding:utf8-*-
# Python 2 script. Depends on the weibopy SDK and a legacy pymongo (Connection API).

from pprint import pprint
from weibopy.auth import OAuthHandler
from weibopy.api import API
from weibopy.binder import bind_api
from weibopy.error import WeibopError
import time, os, pickle, sys
import logging.config
from multiprocessing import Process
from pymongo import Connection

mongo_addr = 'localhost'
mongo_port = 27017
db_name = 'weibo'

# Attributes copied verbatim from a user object
USER_FIELDS = ['screen_name', 'name', 'province', 'city', 'location',
               'description', 'url', 'profile_image_url', 'domain', 'gender',
               'followers_count', 'friends_count', 'statuses_count',
               'favourites_count', 'created_at', 'following',
               'allow_all_act_msg', 'geo_enabled', 'verified']

# Attributes copied verbatim from a status object
# ('truncated' was misspelled as 'ntruncatedame' in the original)
STATUS_FIELDS = ['created_at', 'text', 'source', 'favorited', 'truncated',
                 'in_reply_to_status_id', 'in_reply_to_user_id',
                 'in_reply_to_screen_name', 'thumbnail_pic', 'bmiddle_pic',
                 'original_pic', 'mid']


class Sina_reptile():
    """
    Crawl Sina Weibo data
    """
    def __init__(self, consumer_key, consumer_secret):
        self.consumer_key, self.consumer_secret = consumer_key, consumer_secret
        self.connection = Connection(mongo_addr, mongo_port)
        self.db = self.connection[db_name]
        self.collection_userprofile = self.db['userprofile']
        self.collection_statuses = self.db['statuses']

    def getAtt(self, key):
        # Read an attribute of the object currently held in self.obj
        try:
            return getattr(self.obj, key)
        except Exception as e:
            print e
            return ''

    def getAttValue(self, obj, key):
        try:
            return getattr(obj, key)
        except Exception as e:
            print e
            return ''

    def auth(self):
        """
        Obtain the Sina Weibo access_token and access_secret
        """
        if len(self.consumer_key) == 0:
            print "Please set consumer_key"
            return

        # The original checked consumer_key twice; this check is for the secret
        if len(self.consumer_secret) == 0:
            print "Please set consumer_secret"
            return

        self.auth = OAuthHandler(self.consumer_key, self.consumer_secret)
        auth_url = self.auth.get_authorization_url()
        print 'Please authorize: ' + auth_url
        verifier = raw_input('PIN: ').strip()
        self.auth.get_access_token(verifier)
        self.api = API(self.auth)

    def setToken(self, token, tokenSecret):
        """
        Authenticate via OAuth so that Sina Weibo data can be fetched
        """
        self.auth = OAuthHandler(self.consumer_key, self.consumer_secret)
        self.auth.setToken(token, tokenSecret)
        self.api = API(self.auth)

    def get_userprofile(self, id):
        """
        Fetch a user's basic profile
        """
        try:
            userprofile = {'id': id}
            self.obj = self.api.get_user(id)
            for field in USER_FIELDS:
                userprofile[field] = self.getAtt(field)
        except WeibopError as e:      # details of the WeibopError are available in e
            print "error occured when access userprofile use user_id:", id
            print "Error:", e
            log.error("Error occured when access userprofile use user_id:{0}\nError:{1}".format(id, e), exc_info=sys.exc_info())
            return None

        return userprofile

    def get_specific_weibo(self, id):
        """
        Fetch a single status by its id
        """
        statusprofile = {'id': id}
        try:
            # Rebind the get_status function
            get_status = bind_api(path='/statuses/show/{id}.json',
                                  payload_type='status',
                                  allowed_param=['id'])
        except Exception:
            return "**bind error**"
        self.obj = get_status(self.api, id)
        for field in STATUS_FIELDS:
            statusprofile[field] = self.getAtt(field)
        statusprofile['geo'] = self.getAtt("geo")
        statusprofile['retweeted_status'] = self.getAtt("retweeted_status")
        return statusprofile

    def get_latest_weibo(self, user_id, count):
        """
        Fetch the user's `count` most recent statuses
        """
        statuses = []
        try:
            # error occur in the SDK
            timeline = self.api.user_timeline(count=count, user_id=user_id)
        except Exception as e:
            print "error occured when access status use user_id:", user_id
            print "Error:", e
            log.error("Error occured when access status use user_id:{0}\nError:{1}".format(user_id, e), exc_info=sys.exc_info())
            return None
        for line in timeline:
            self.obj = line
            # Build a fresh dict per status; the original reused one dict,
            # so every list entry ended up pointing at the same object
            statusprofile = {'usr_id': user_id, 'id': self.getAtt("id")}
            for field in STATUS_FIELDS:
                statusprofile[field] = self.getAtt(field)
            statusprofile['geo'] = repr(pickle.dumps(self.getAtt("geo"), pickle.HIGHEST_PROTOCOL))
            statusprofile['retweeted_status'] = repr(pickle.dumps(self.getAtt("retweeted_status"), pickle.HIGHEST_PROTOCOL))
            statuses.append(statusprofile)
        return statuses

    def friends_ids(self, id):
        """
        Fetch the ids of the users this user follows
        """
        next_cursor, cursor = 1, 0
        ids = []
        while 0 != next_cursor:
            fids = self.api.friends_ids(user_id=id, cursor=cursor)
            self.obj = fids
            ids.extend(self.getAtt("ids"))
            cursor = next_cursor = self.getAtt("next_cursor")
            previous_cursor = self.getAtt("previous_cursor")
        return ids

    def manage_access(self):
        """
        Throttle API access: sleep long enough to stay within the rate limit
        """
        info = self.api.rate_limit_status()
        self.obj = info
        # Spread the remaining hits evenly over the time left until the reset
        sleep_time = round(float(self.getAtt("reset_time_in_seconds")) / self.getAtt("remaining_hits"), 2) if self.getAtt("remaining_hits") else self.getAtt("reset_time_in_seconds")
        print self.getAtt("remaining_hits"), self.getAtt("reset_time_in_seconds"), self.getAtt("hourly_limit"), self.getAtt("reset_time")
        print "sleep time:", sleep_time, 'pid:', os.getpid()
        time.sleep(sleep_time + 1.5)

    def save_data(self, userprofile, statuses):
        self.collection_statuses.insert(statuses)
        self.collection_userprofile.insert(userprofile)


def reptile(sina_reptile, userid):
    # Level-by-level walk over the follow graph, starting from userid
    ids_num, ids, new_ids, return_ids = 1, [userid], [userid], []
    while ids_num <= 10000000:
        next_ids = []
        for id in new_ids:
            try:
                sina_reptile.manage_access()
                return_ids = sina_reptile.friends_ids(id)
                ids.extend(return_ids)
                userprofile = sina_reptile.get_userprofile(id)
                statuses = sina_reptile.get_latest_weibo(count=50, user_id=id)
                if statuses is None or userprofile is None:
                    continue
                sina_reptile.save_data(userprofile, statuses)
            except Exception as e:
                log.error("Error occured in reptile,id:{0}\nError:{1}".format(id, e), exc_info=sys.exc_info())
                time.sleep(60)
                continue
            ids_num += 1
            print ids_num
            if ids_num >= 10000000:
                break
            next_ids.extend(return_ids)
        next_ids, new_ids = new_ids, next_ids


def run_crawler(consumer_key, consumer_secret, key, secret, userid):
    try:
        sina_reptile = Sina_reptile(consumer_key, consumer_secret)
        sina_reptile.setToken(key, secret)
        reptile(sina_reptile, userid)
        sina_reptile.connection.close()
    except Exception as e:
        print e
        # The original used {1} and {2} with only two arguments, which raises IndexError
        log.error("Error occured in run_crawler,pid:{0}\nError:{1}".format(os.getpid(), e), exc_info=sys.exc_info())


if __name__ == "__main__":
    logging.config.fileConfig("logging.conf")
    log = logging.getLogger('logger_sina_reptile')
    # Each line of test.txt holds five space-separated values:
    # consumer_key consumer_secret token token_secret seed_user_id
    with open('test.txt') as f:
        for i in f.readlines():
            j = i.strip().split(' ')
            p = Process(target=run_crawler, args=(j[0], j[1], j[2], j[3], j[4]))
            p.start()
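Two ideas carry the script above: pacing requests so the remaining quota lasts until the limit resets, and walking the follow graph outward from a seed user. The sketch below isolates both in Python 3 with an in-memory graph standing in for the API. The names `pace`, `crawl_order`, and `GRAPH` are made up for the illustration, and the `seen` set is an addition: the script above does not track visited ids, so in a cyclic follow graph it can fetch the same user more than once.

```python
from collections import deque


def pace(seconds_until_reset, remaining_hits):
    """Seconds to wait per call so the remaining quota lasts until the reset."""
    if remaining_hits <= 0:
        return float(seconds_until_reset)  # quota used up: wait for the reset
    return round(seconds_until_reset / remaining_hits, 2)


def crawl_order(seed, get_friends, limit):
    """Breadth-first walk over the follow graph, visiting each id once."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue and len(order) < limit:
        uid = queue.popleft()
        order.append(uid)  # a real crawler would fetch and store the user here
        for fid in get_friends(uid):
            if fid not in seen:
                seen.add(fid)
                queue.append(fid)
    return order


# Toy follow graph: user id -> ids that user follows (hypothetical data).
GRAPH = {1: [2, 3], 2: [1, 4], 3: [4], 4: [1]}

print(crawl_order(1, lambda uid: GRAPH.get(uid, []), limit=10))  # -> [1, 2, 3, 4]
print(pace(1800, 600))                                           # -> 3.0
```

With 600 hits left and 1,800 seconds until the reset, one call every 3 seconds uses the quota evenly, which is the same calculation `manage_access` performs before each user.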

2. Scraping Sina Weibo with MetaSeeker

For some reason cnblogs treats this website as a forbidden word, so all I can do is post a screenshot.

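It is a tool built as a Firefox browser extension, and it is very capable. It presumably works by analyzing the DOM structure of the page and extracting content with regular expressions.

Users apparently do not need to understand much of that, though: following the tutorial to learn the workflow is enough to scrape content easily. It is only semi-automatic, however, and the data apparently has to be uploaded to the cloud.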

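MetaStudio usage example

Since you already know about this tool, I will skip the detailed introduction. Simply put, it is a Firefox extension that can scrape information from web pages. Here I use Sina Weibo as the example to show how MetaStudio is used.

First, log in to Weibo in Firefox (I suggest logging in to Xweibo, an open platform from Sina that makes scraping easier: http://demo.x.weibo.com/). Open any user's Weibo page. Suppose I want to scrape every post this person has ever published. You could do it by hand, post by post or page by page, but joking aside, here we use MetaStudio.

With the user's page open, launch the MetaStudio extension and enter the same URL in its address bar, then tick the checkbox to the right of the address bar. Next, in the Theme Editor on the right, enter a theme name; any name will do, say "dang1". Wait for the Weibo page to appear in the lower pane. I recommend scrolling this page all the way to the bottom; if it loads more content when you reach the bottom, keep scrolling until it stops. Then open the File menu, click "Refresh DOM" (刷新DOM), and move on to the next step.

Select the Clue Editor tab on the right and click newClue. For the radio button, choose Marker, and also tick the two checkboxes to the left of newClue. To be safe, refresh the DOM once more. In the browser pane below, click the "下一页" (next page) link; this node will serve as the marker for paging. Then, in the pane on the left, scroll up and down until you find the highlighted row. Expand it and select the #text under SPAN; once it is selected, the text content shown on the right should read "下一页". Right-click this #text and choose Clue Mapping (线索映射), then Marker Mapping (记号映射). Next, go up to the row whose class is list-footer, right-click it, and choose Clue Mapping (线索映射), Clue Mapping (线索映射), s_clue 0.

Next, click the Bucket Editor tab and click newBckt. Give it any name (abc, for example) and click OK; an abc entry now appears on the right. Right-click abc and choose Add Containment (添加包容). What you add here depends on your needs. Suppose I want to capture four things for each post: the time it was posted, the number of comments, the number of reposts, and the content. I then enter 内容 (content) as the information property, tick key, and confirm. Add the other containments in the same way.

Then go to the browser pane below, click the content of any post, and again find the highlighted row on the left. Locate the #text row that holds the post's content, right-click it, and choose Content Mapping (内容映射), then 内容 (content). Go up to feed-content, right-click it, and choose FreeFormat Mapping (FreeFormat映射), then abc. Map the other containments the same way, except that the feed-content step can be skipped.

Finally, in the Configuration menu select only Aggressive Mode (积极模式), and in Preferences select Prefer Class (偏向class) everywhere. Click Save in the upper right corner, and you are done.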
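What the walkthrough configures by clicking can also be expressed in code: a marker that finds the "下一页" link inside the list-footer node, and a bucket that collects the text under each feed-content node. The following is only a minimal standard-library sketch of that idea, not how MetaStudio works internally. The class names list-footer and feed-content come from the steps above, while the sample markup, the URLs, and the name `FeedPageParser` are assumptions for the example.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # tags with no end tag


class FeedPageParser(HTMLParser):
    """Collect post texts (class feed-content) and the next-page link (in list-footer)."""

    def __init__(self, next_label="下一页"):
        super().__init__()
        self.next_label = next_label
        self.posts = []
        self.next_url = None
        self._stack = []       # class list of every currently open tag
        self._post = None      # text pieces of the post being read
        self._post_depth = 0
        self._href = None      # href of the <a> being read inside list-footer
        self._link_text = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        self._stack.append(classes)
        if "feed-content" in classes and self._post is None:
            self._post, self._post_depth = [], len(self._stack)
        if tag == "a" and any("list-footer" in c for c in self._stack):
            self._href, self._link_text = attrs.get("href"), []

    def handle_data(self, data):
        if self._post is not None:
            self._post.append(data)
        if self._href is not None:
            self._link_text.append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._stack:
            return
        # Closing the tag that opened the post: store its text.
        if self._post is not None and len(self._stack) == self._post_depth:
            self.posts.append("".join(self._post).strip())
            self._post = None
        if tag == "a" and self._href is not None:
            if "".join(self._link_text).strip() == self.next_label:
                self.next_url = self._href
            self._href = None
        self._stack.pop()


PAGE = """
<div class="feed-list">
  <div class="feed-content"><p>First post <span>text</span></p></div>
  <div class="feed-content"><p>Second post</p></div>
</div>
<div class="list-footer">
  <a href="/u/1?page=1"><span>上一页</span></a>
  <a href="/u/1?page=3"><span>下一页</span></a>
</div>
"""
parser = FeedPageParser()
parser.feed(PAGE)
print(parser.posts)
print(parser.next_url)
```

A crawler built on this would fetch `next_url` and repeat until no next-page link is found. The posting time, comment count, and repost count could be collected the same way, each keyed on whatever class the page uses for it.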

3. ROST DetailMinner: an intelligence-analysis program for collecting web page information, developed by the ROST virtual learning team at Wuhan University

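4. Using R to extract Weibo data and visualize it

Quite interesting. This was the first time I had heard of R, and the word clouds and the visualizations of how posts spread really caught my eye.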
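The input to a word cloud is just a table of term frequencies. The material referred to here used R; the sketch below shows only the counting step, in Python to match the crawler earlier in this post. It assumes the posts are already split into space-separated words, which real Chinese text is not (a word segmenter would be needed first), and `term_frequencies` is a name invented for the example.

```python
from collections import Counter


def term_frequencies(posts, stopwords=frozenset()):
    """Count how often each word occurs across all posts, skipping stopwords."""
    counts = Counter()
    for post in posts:
        for token in post.split():
            token = token.lower()
            if token not in stopwords:
                counts[token] += 1
    return counts


posts = ["data mining is fun", "weibo data is big", "mining weibo data"]
freq = term_frequencies(posts, stopwords={"is"})
print(freq.most_common(1))  # -> [('data', 3)]
```

A word cloud renderer then scales each word by its count.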

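The main takeaway: extract the information from weibo.cn (the mobile version of Weibo), because the page source of weibo.com appears to be encrypted and the actual text cannot be seen in it.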

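My background is in electronics (so this is a bit off my proper track) and I lack knowledge of data mining, so criticism is welcome. This post is also a record of my own learning progress.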

Reposted from: https://www.cnblogs.com/zhangweilong/archive/2012/11/03/2753133.html
