Dataset:fetch_20newsgroups(20类新闻文本)数据集的简介、安装、使用方法之详细攻略

Dataset:fetch_20newsgroups(20类新闻文本)数据集的简介、安装、使用方法之详细攻略

 

 

 

目录

fetch_20newsgroups(20类新闻文本)数据集的简介

1、数据集信息

2、数据集标签20类别

3、数据集前三篇文章

fetch_20newsgroups(20类新闻文本)数据集的安装

fetch_20newsgroups(20类新闻文本)数据集的使用方法


 

 

fetch_20newsgroups(20类新闻文本)数据集的简介

        20 newsgroups数据集18000多篇新闻文章,一共涉及到20种话题,所以称作20newsgroups text dataset,分为两部分:训练集和测试集,通常用来做文本分类,均匀分为20个不同主题的新闻组集合。20newsgroups数据集是被用于文本分类、文本挖据和信息检索研究的国际标准数据集之一。一些新闻组的主题特别相似(e.g. comp.sys.ibm.pc.hardware/ comp.sys.mac.hardware),还有一些却完全不相关 (e.g misc.forsale /soc.religion.christian)。

 

1、数据集信息

数据集形状 (18846,)

    =================   ==========
    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features                  text
    =================   ==========

 

 

2、数据集标签20类别

 [‘alt.atheism’, ‘comp.graphics’, ‘comp.os.ms-windows.misc’, ‘comp.sys.ibm.pc.hardware’, ‘comp.sys.mac.hardware’, ‘comp.windows.x’, ‘misc.forsale’, ‘rec.autos’, ‘rec.motorcycles’, ‘rec.sport.baseball’, ‘rec.sport.hockey’, ‘sci.crypt’, ‘sci.electronics’, ‘sci.med’, ‘sci.space’, ‘soc.religion.christian’, ‘talk.politics.guns’, ‘talk.politics.mideast’, ‘talk.politics.misc’, ‘talk.religion.misc’]

 

3、数据集前三篇文章

 [“From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>\nSubject: Pens fans reactions\nOrganization: Post Office, Carnegie Mellon, Pittsburgh, PA\nLines: 12\nNNTP-Posting-Host: po4.andrew.cmu.edu\n\n\n\nI am sure some bashers of Pens fans are pretty confused about the lack\nof any kind of posts about the recent Pens massacre of the Devils. Actually,\nI am  bit puzzled too and a bit relieved. However, I am going to put an end\nto non-PIttsburghers’ relief with a bit of praise for the Pens. Man, they\nare killing those Devils worse than I thought. Jagr just showed you why\nhe is much better than his regular season stats. He is also a lot\nfo fun to watch in the playoffs. Bowman should let JAgr have a lot of\nfun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\nregular season game.          PENS RULE!!!\n\n”, ‘From: mblawson@midway.ecn.uoknor.edu (Matthew B Lawson)\nSubject: Which high-performance VLB video card?\nSummary: Seek recommendations for VLB video card\nNntp-Posting-Host: midway.ecn.uoknor.edu\nOrganization: Engineering Computer Network, University of Oklahoma, Norman, OK, USA\nKeywords: orchid, stealth, vlb\nLines: 21\n\n  My brother is in the market for a high-performance video card that supports\nVESA local bus with 1-2MB RAM.  Does anyone have suggestions/ideas on:\n\n  – Diamond Stealth Pro Local Bus\n\n  – Orchid Farenheit 1280\n\n  – ATI Graphics Ultra Pro\n\n  – Any other high-performance VLB card\n\n\nPlease post or email.  Thank you!\n\n  – Matt\n\n– \n    |  Matthew B. Lawson <————> (mblawson@essex.ecn.uoknor.edu)  |   \n  –+– “Now I, Nebuchadnezzar, praise and exalt and glorify the King  –+– \n    |   of heaven, because everything he does is right and all his ways  |   \n    |   are just.” – Nebuchadnezzar, king of Babylon, 562 B.C.           |   \n’]

 

 

fetch_20newsgroups(20类新闻文本)数据集的安装

fetch_20newsgroups(data_home=None, # 文件下载的路径
                   subset='train', # 加载那一部分数据集 train/test
                   categories=None, # 选取哪一类数据集[类别列表],默认20类
                   shuffle=True,  # 将数据集随机排序
                   random_state=42, # 随机数生成器
                   remove=(), # ('headers','footers','quotes') 去除部分文本
                   download_if_missing=True # 如果没有下载过,重新下载
                   )

news = fetch_20newsgroups(subset='all')

 

fetch_20newsgroups(20类新闻文本)数据集的使用方法

ML之LoR:利用pipeline对fetch_20newsgroups数据集(文本抽取TfidfVectorizer)采用SVC算法(GSCV)实现多分类
ML之NB:利用朴素贝叶斯NB算法(CountVectorizer+不去除停用词)对fetch_20newsgroups数据集(20类新闻文本)进行分类预测、评估

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

关注博主即可阅读全文


原创:https://www.panoramacn.com
源码网提供WordPress源码,帝国CMS源码discuz源码,微信小程序,小说源码,杰奇源码,thinkphp源码,ecshop模板源码,微擎模板源码,dede源码,织梦源码等。

专业搭建小说网站,小说程序,杰奇系列,微信小说系列,app系列小说

Dataset:fetch_20newsgroups(20类新闻文本)数据集的简介、安装、使用方法之详细攻略

免责声明,若由于商用引起版权纠纷,一切责任均由使用者承担。

您必须遵守我们的协议,如果您下载了该资源行为将被视为对《免责声明》全部内容的认可-> 联系客服 投诉资源
www.panoramacn.com资源全部来自互联网收集,仅供用于学习和交流,请勿用于商业用途。如有侵权、不妥之处,请联系站长并出示版权证明以便删除。 敬请谅解! 侵权删帖/违法举报/投稿等事物联系邮箱:2640602276@qq.com
未经允许不得转载:书荒源码源码网每日更新网站源码模板! » Dataset:fetch_20newsgroups(20类新闻文本)数据集的简介、安装、使用方法之详细攻略
关注我们小说电影免费看
关注我们,获取更多的全网素材资源,有趣有料!
120000+人已关注
分享到:
赞(0) 打赏

评论抢沙发

  • 昵称 (必填)
  • 邮箱 (必填)
  • 网址

您的打赏就是我分享的动力!

支付宝扫一扫打赏

微信扫一扫打赏