Notes on Latent Semantic Analysis (LSA) from 《统计学习方法》 (Statistical Learning Methods). This post summarizes the main parameters and attributes of sklearn.decomposition.TruncatedSVD and sklearn.feature_extraction.text.TfidfVectorizer from the official scikit-learn documentation, then walks through a small LSA example and its output. Thanks to the author of the article referenced below.
1. sklearn.decomposition.TruncatedSVD
class sklearn.decomposition.TruncatedSVD(n_components=2, algorithm='randomized', n_iter=5, random_state=None, tol=0.0)
Main parameters:

- n_components: default = 2. Number of topics (components) to extract.
- algorithm: default = "randomized". Which SVD solver to use.
- n_iter: optional, default = 5. Number of iterations for the randomized SVD solver; not used by ARPACK.
Attributes:

- components_: shape (n_components, n_features).
- explained_variance_: shape (n_components,). The variance of the training samples transformed by a projection to each component.
- explained_variance_ratio_: shape (n_components,). Percentage of variance explained by each of the selected components.
- singular_values_: shape (n_components,). The singular values corresponding to each of the selected components.
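As a quick illustration of these attributes, here is a minimal sketch of TruncatedSVD on a random sparse matrix (toy data, not part of the original example):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# A 100x100 sparse matrix with ~1% non-zero entries, standing in for a TF-IDF matrix
X = sparse_random(100, 100, density=0.01, random_state=42)

svd = TruncatedSVD(n_components=5, n_iter=7, random_state=42)
svd.fit(X)

print(svd.explained_variance_ratio_)        # variance explained per component
print(svd.explained_variance_ratio_.sum())  # total variance explained by the 5 components
print(svd.singular_values_)                 # the 5 largest singular values
```

TruncatedSVD works directly on scipy sparse matrices without densifying them, which is why it pairs naturally with TfidfVectorizer output.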
2. sklearn.feature_extraction.text.TfidfVectorizer

Converts a collection of raw documents to a matrix of TF-IDF features.

class sklearn.feature_extraction.text.TfidfVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float64'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())  # in scikit-learn >= 1.0, use get_feature_names_out()
print(X.shape)
print(X)
```
Running this prints the vocabulary, the matrix shape, and the sparse TF-IDF matrix as (row, column) index pairs with their weights:

```
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
(4, 9)
  (0, 8)	0.38408524091481483
  (0, 3)	0.38408524091481483
  (0, 6)	0.38408524091481483
  (0, 2)	0.5802858236844359
  (0, 1)	0.46979138557992045
  (1, 8)	0.281088674033753
  (1, 3)	0.281088674033753
  (1, 6)	0.281088674033753
  (1, 1)	0.6876235979836938
  (1, 5)	0.5386476208856763
  (2, 8)	0.267103787642168
  (2, 3)	0.267103787642168
  (2, 6)	0.267103787642168
  (2, 0)	0.511848512707169
  (2, 7)	0.511848512707169
  (2, 4)	0.511848512707169
  (3, 8)	0.38408524091481483
  (3, 3)	0.38408524091481483
  (3, 6)	0.38408524091481483
  (3, 2)	0.5802858236844359
  (3, 1)	0.46979138557992045
```
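These weights follow scikit-learn's defaults (smooth_idf=True, norm='l2', raw term counts as tf): idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing term t, with each row then l2-normalized. A quick hand-check of one value above (added here, not in the original post):

```python
import numpy as np

n = 4  # number of documents in the corpus
idf = lambda df: np.log((1 + n) / (1 + df)) + 1

# Document 0 contains: this, is, the, first, document (each once);
# their document frequencies are 4, 4, 4, 2, 3 respectively.
weights = np.array([idf(4), idf(4), idf(4), idf(2), idf(3)])
print(weights[3] / np.linalg.norm(weights))  # ~0.5803, matches entry (0, 2) for 'first'
```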
3. Code in Practice
```python
# -*- coding:utf-8 -*-
# @Python Version: 3.7
# @Time: 2020/5/1 10:27
# @Author: Michael Ming
# @Website: https://michael.blog.csdn.net/
# @File: 17.LSA.py
# @Reference: https://cloud.tencent.com/developer/article/1530432
import numpy as np
from sklearn.decomposition import TruncatedSVD  # LSA, latent semantic analysis
from sklearn.feature_extraction.text import TfidfVectorizer  # convert a text collection to a weight matrix

# 5 documents
docs = ["Love is patient, love is kind. It does not envy, it does not boast, it is not proud.",
        "It does not dishonor others, it is not self-seeking, it is not easily angered, it keeps no record of wrongs.",
        "Love does not delight in evil but rejoices with the truth.",
        "It always protects, always trusts, always hopes, always perseveres.",
        "Love never fails. But where there are prophecies, they will cease; where there are tongues, they will be stilled; where there is knowledge, it will pass away. (1 Corinthians 13:4-8 NIV)"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # convert to a TF-IDF weight matrix
print("--------TF-IDF weight matrix---------")
print(X)
print("--------features (words)---------")
words = vectorizer.get_feature_names()
print(words)
print(len(words), "features (words)")  # 52 words

topics = 4
lsa = TruncatedSVD(n_components=topics)  # latent semantic analysis with 4 topics
X1 = lsa.fit_transform(X)  # fit and transform
print("--------LSA singular values---------")
print(lsa.singular_values_)
print("--------5 documents in the 4-topic vector space---------")
print(X1)  # the 5 documents represented in the 4-topic vector space

pick_docs = 2  # pick the 2 most representative documents per topic
topic_docid = [X1[:, t].argsort()[:-(pick_docs + 1):-1] for t in range(topics)]
# argsort returns the indices that would sort the array (ascending)
print("--------2 most representative documents per topic---------")
print(topic_docid)

# print("--------lsa.components_---------")
# print(lsa.components_)  # 4 topics * 52 words: the topic vector space
pick_keywords = 3  # pick 3 keywords per topic
topic_keywdid = [lsa.components_[t].argsort()[:-(pick_keywords + 1):-1] for t in range(topics)]
print("--------3 keywords per topic---------")
print(topic_keywdid)

print("--------LSA analysis results---------")
for t in range(topics):
    print("Topic {}".format(t))
    print("\tkeywords: {}".format(", ".join(words[topic_keywdid[t][j]] for j in range(pick_keywords))))
    for i in range(pick_docs):
        print("\t\tdocument {}".format(i))
        print("\t\t", docs[topic_docid[t][i]])
```
```
--------TF-IDF weight matrix---------
  (0, 24)	0.3031801002944161
  (0, 19)	0.4547701504416241
  (0, 32)	0.2263512201359201
  (0, 22)	0.2263512201359201
  (0, 20)	0.3825669873635752
  (0, 12)	0.3031801002944161
  (0, 28)	0.4547701504416241
  (0, 14)	0.2263512201359201
  (0, 6)	0.2263512201359201
  (0, 36)	0.2263512201359201
  (1, 19)	0.28327311337182914
  (1, 20)	0.4765965465346523
  (1, 12)	0.14163655668591457
  (1, 28)	0.42490967005774366
  (1, 11)	0.21148886348790247
  (1, 30)	0.21148886348790247
  (1, 40)	0.21148886348790247
  (1, 39)	0.21148886348790247
  (1, 13)	0.21148886348790247
  (1, 2)	0.21148886348790247
  (1, 21)	0.21148886348790247
  (1, 27)	0.21148886348790247
  (1, 37)	0.21148886348790247
  (1, 29)	0.21148886348790247
  (1, 51)	0.21148886348790247
  :	:
  (3, 46)	0.22185332169737518
  (3, 17)	0.22185332169737518
  (3, 33)	0.22185332169737518
  (4, 24)	0.09483932399667956
  (4, 19)	0.09483932399667956
  (4, 20)	0.0797818291938777
  (4, 7)	0.1142518110942895
  (4, 25)	0.14161217495916
  (4, 16)	0.14161217495916
  (4, 48)	0.42483652487747997
  (4, 43)	0.42483652487747997
  (4, 3)	0.28322434991832
  (4, 34)	0.14161217495916
  (4, 44)	0.28322434991832
  (4, 49)	0.42483652487747997
  (4, 8)	0.14161217495916
  (4, 45)	0.14161217495916
  (4, 5)	0.14161217495916
  (4, 41)	0.14161217495916
  (4, 23)	0.14161217495916
  (4, 31)	0.14161217495916
  (4, 4)	0.14161217495916
  (4, 9)	0.14161217495916
  (4, 0)	0.14161217495916
  (4, 26)	0.14161217495916
--------features (words)---------
['13', 'always', 'angered', 'are', 'away', 'be', 'boast', 'but', 'cease', 'corinthians', 'delight', 'dishonor', 'does', 'easily', 'envy', 'evil', 'fails', 'hopes', 'in', 'is', 'it', 'keeps', 'kind', 'knowledge', 'love', 'never', 'niv', 'no', 'not', 'of', 'others', 'pass', 'patient', 'perseveres', 'prophecies', 'protects', 'proud', 'record', 'rejoices', 'seeking', 'self', 'stilled', 'the', 'there', 'they', 'tongues', 'trusts', 'truth', 'where', 'will', 'with', 'wrongs']
52 features (words)
--------LSA singular values---------
[1.29695724 1.00165234 0.98752651 0.94862686]
--------5 documents in the 4-topic vector space---------
[[ 0.85667347 -0.00334881 -0.11274158 -0.14912237]
 [ 0.80868148  0.09220662 -0.16057627 -0.33804609]
 [ 0.46603522 -0.3005665  -0.06851382  0.82322097]
 [ 0.13423034  0.92315127  0.22573307  0.2806665 ]
 [ 0.24297388 -0.22857306  0.9386499  -0.08314939]]
--------2 most representative documents per topic---------
[array([0, 1], dtype=int64), array([3, 1], dtype=int64), array([4, 3], dtype=int64), array([2, 3], dtype=int64)]
--------3 keywords per topic---------
[array([28, 20, 19], dtype=int64), array([ 1, 46, 33], dtype=int64), array([49, 48, 43], dtype=int64), array([10, 42, 18], dtype=int64)]
--------LSA analysis results---------
Topic 0
	keywords: not, it, is
		document 0
		 Love is patient, love is kind. It does not envy, it does not boast, it is not proud.
		document 1
		 It does not dishonor others, it is not self-seeking, it is not easily angered, it keeps no record of wrongs.
Topic 1
	keywords: always, trusts, perseveres
		document 0
		 It always protects, always trusts, always hopes, always perseveres.
		document 1
		 It does not dishonor others, it is not self-seeking, it is not easily angered, it keeps no record of wrongs.
Topic 2
	keywords: will, where, there
		document 0
		 Love never fails. But where there are prophecies, they will cease; where there are tongues, they will be stilled; where there is knowledge, it will pass away. (1 Corinthians 13:4-8 NIV)
		document 1
		 It always protects, always trusts, always hopes, always perseveres.
Topic 3
	keywords: delight, the, in
		document 0
		 Love does not delight in evil but rejoices with the truth.
		document 1
		 It always protects, always trusts, always hopes, always perseveres.
```
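The top-k selection in the code hinges on the negative-step slice argsort()[:-(k + 1):-1], which reads the ascending sort order backwards to get the indices of the k largest values. A tiny standalone illustration (toy array, not from the original post):

```python
import numpy as np

a = np.array([0.2, 0.9, 0.1, 0.5])
k = 2
order = a.argsort()          # ascending order of indices: [2 0 3 1]
print(order[:-(k + 1):-1])   # [1 3] -> indices of the two largest values
```

Applied per topic, this picks the documents with the largest coordinates along each topic axis of X1, and the words with the largest loadings in each row of lsa.components_.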
4. References
sklearn: Text topic analysis with TruncatedSVD (sklearn: 利用TruncatedSVD做文本主题分析), https://cloud.tencent.com/developer/article/1530432