Notes on Latent Semantic Analysis (LSA) from 《统计学习方法》 (Statistical Learning Methods). This post summarizes the main parameters and attributes of sklearn.decomposition.TruncatedSVD and sklearn.feature_extraction.text.TfidfVectorizer from the official scikit-learn documentation, then walks through a small LSA example and its output. Thanks to the author of the article referenced below.
1. sklearn.decomposition.TruncatedSVD
class sklearn.decomposition.TruncatedSVD(n_components=2, algorithm='randomized', n_iter=5, random_state=None, tol=0.0)
Main parameters:

- n_components: default = 2. Number of topics (components) to extract.
- algorithm: default = "randomized". Which SVD solver to use.
- n_iter: optional, default = 5. Number of iterations for the randomized SVD solver; not used by ARPACK.
Attributes:

- components_: shape (n_components, n_features).
- explained_variance_: shape (n_components,). The variance of the training samples transformed by a projection to each component.
- explained_variance_ratio_: shape (n_components,). Percentage of variance explained by each of the selected components.
- singular_values_: shape (n_components,). The singular values corresponding to each of the selected components.
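As a quick illustration of these attributes, here is a minimal sketch of TruncatedSVD on a random sparse matrix (toy data, not part of the original example):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# A 100x100 sparse matrix with ~1% non-zero entries, standing in for a TF-IDF matrix
X = sparse_random(100, 100, density=0.01, random_state=42)

svd = TruncatedSVD(n_components=5, n_iter=7, random_state=42)
svd.fit(X)

print(svd.explained_variance_ratio_)        # variance explained per component
print(svd.explained_variance_ratio_.sum())  # total variance explained by the 5 components
print(svd.singular_values_)                 # the 5 largest singular values
```

TruncatedSVD works directly on scipy sparse matrices without densifying them, which is why it pairs naturally with TfidfVectorizer output.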
2. sklearn.feature_extraction.text.TfidfVectorizer

Converts a collection of raw documents to a matrix of TF-IDF features.

class sklearn.feature_extraction.text.TfidfVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float64'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())  # in scikit-learn >= 1.0, use get_feature_names_out()
print(X.shape)
print(X)
```
Running this prints the vocabulary, the matrix shape, and the sparse TF-IDF matrix as (row, column) index pairs with their weights:

```
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
(4, 9)
  (0, 8)	0.38408524091481483
  (0, 3)	0.38408524091481483
  (0, 6)	0.38408524091481483
  (0, 2)	0.5802858236844359
  (0, 1)	0.46979138557992045
  (1, 8)	0.281088674033753
  (1, 3)	0.281088674033753
  (1, 6)	0.281088674033753
  (1, 1)	0.6876235979836938
  (1, 5)	0.5386476208856763
  (2, 8)	0.267103787642168
  (2, 3)	0.267103787642168
  (2, 6)	0.267103787642168
  (2, 0)	0.511848512707169
  (2, 7)	0.511848512707169
  (2, 4)	0.511848512707169
  (3, 8)	0.38408524091481483
  (3, 3)	0.38408524091481483
  (3, 6)	0.38408524091481483
  (3, 2)	0.5802858236844359
  (3, 1)	0.46979138557992045
```
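These weights follow scikit-learn's defaults (smooth_idf=True, norm='l2', raw term counts as tf): idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing term t, with each row then l2-normalized. A quick hand-check of one value above (added here, not in the original post):

```python
import numpy as np

n = 4  # number of documents in the corpus
idf = lambda df: np.log((1 + n) / (1 + df)) + 1

# Document 0 contains: this, is, the, first, document (each once);
# their document frequencies are 4, 4, 4, 2, 3 respectively.
weights = np.array([idf(4), idf(4), idf(4), idf(2), idf(3)])
print(weights[3] / np.linalg.norm(weights))  # ~0.5803, matches entry (0, 2) for 'first'
```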
3. Code in Practice
```python
# -*- coding:utf-8 -*-
# @Python Version: 3.7
# @Time: 2020/5/1 10:27
# @Author: Michael Ming
# @Website: https://michael.blog.csdn.net/
# @File: 17.LSA.py
# @Reference: https://cloud.tencent.com/developer/article/1530432
import numpy as np
from sklearn.decomposition import TruncatedSVD  # LSA, latent semantic analysis
from sklearn.feature_extraction.text import TfidfVectorizer  # convert a text collection to a weight matrix

# 5 documents
docs = ["Love is patient, love is kind. It does not envy, it does not boast, it is not proud.",
        "It does not dishonor others, it is not self-seeking, it is not easily angered, it keeps no record of wrongs.",
        "Love does not delight in evil but rejoices with the truth.",
        "It always protects, always trusts, always hopes, always perseveres.",
        "Love never fails. But where there are prophecies, they will cease; where there are tongues, they will be stilled; where there is knowledge, it will pass away. (1 Corinthians 13:4-8 NIV)"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # convert to a TF-IDF weight matrix
print("--------TF-IDF weight matrix---------")
print(X)
print("--------features (words)---------")
words = vectorizer.get_feature_names()
print(words)
print(len(words), "features (words)")  # 52 words

topics = 4
lsa = TruncatedSVD(n_components=topics)  # latent semantic analysis with 4 topics
X1 = lsa.fit_transform(X)  # fit and transform
print("--------LSA singular values---------")
print(lsa.singular_values_)
print("--------5 documents in the 4-topic vector space---------")
print(X1)  # the 5 documents represented in the 4-topic vector space

pick_docs = 2  # pick the 2 most representative documents per topic
topic_docid = [X1[:, t].argsort()[:-(pick_docs + 1):-1] for t in range(topics)]
# argsort returns the indices that would sort the array (ascending)
print("--------2 most representative documents per topic---------")
print(topic_docid)

# print("--------lsa.components_---------")
# print(lsa.components_)  # 4 topics * 52 words: the topic vector space
pick_keywords = 3  # pick 3 keywords per topic
topic_keywdid = [lsa.components_[t].argsort()[:-(pick_keywords + 1):-1] for t in range(topics)]
print("--------3 keywords per topic---------")
print(topic_keywdid)

print("--------LSA analysis results---------")
for t in range(topics):
    print("Topic {}".format(t))
    print("\tkeywords: {}".format(", ".join(words[topic_keywdid[t][j]] for j in range(pick_keywords))))
    for i in range(pick_docs):
        print("\t\tdocument {}".format(i))
        print("\t\t", docs[topic_docid[t][i]])
```
```
--------TF-IDF weight matrix---------
  (0, 24)	0.3031801002944161
  (0, 19)	0.4547701504416241
  (0, 32)	0.2263512201359201
  (0, 22)	0.2263512201359201
  (0, 20)	0.3825669873635752
  (0, 12)	0.3031801002944161
  (0, 28)	0.4547701504416241
  (0, 14)	0.2263512201359201
  (0, 6)	0.2263512201359201
  (0, 36)	0.2263512201359201
  (1, 19)	0.28327311337182914
  (1, 20)	0.4765965465346523
  (1, 12)	0.14163655668591457
  (1, 28)	0.42490967005774366
  (1, 11)	0.21148886348790247
  (1, 30)	0.21148886348790247
  (1, 40)	0.21148886348790247
  (1, 39)	0.21148886348790247
  (1, 13)	0.21148886348790247
  (1, 2)	0.21148886348790247
  (1, 21)	0.21148886348790247
  (1, 27)	0.21148886348790247
  (1, 37)	0.21148886348790247
  (1, 29)	0.21148886348790247
  (1, 51)	0.21148886348790247
  :	:
  (3, 46)	0.22185332169737518
  (3, 17)	0.22185332169737518
  (3, 33)	0.22185332169737518
  (4, 24)	0.09483932399667956
  (4, 19)	0.09483932399667956
  (4, 20)	0.0797818291938777
  (4, 7)	0.1142518110942895
  (4, 25)	0.14161217495916
  (4, 16)	0.14161217495916
  (4, 48)	0.42483652487747997
  (4, 43)	0.42483652487747997
  (4, 3)	0.28322434991832
  (4, 34)	0.14161217495916
  (4, 44)	0.28322434991832
  (4, 49)	0.42483652487747997
  (4, 8)	0.14161217495916
  (4, 45)	0.14161217495916
  (4, 5)	0.14161217495916
  (4, 41)	0.14161217495916
  (4, 23)	0.14161217495916
  (4, 31)	0.14161217495916
  (4, 4)	0.14161217495916
  (4, 9)	0.14161217495916
  (4, 0)	0.14161217495916
  (4, 26)	0.14161217495916
--------features (words)---------
['13', 'always', 'angered', 'are', 'away', 'be', 'boast', 'but', 'cease', 'corinthians', 'delight', 'dishonor', 'does', 'easily', 'envy', 'evil', 'fails', 'hopes', 'in', 'is', 'it', 'keeps', 'kind', 'knowledge', 'love', 'never', 'niv', 'no', 'not', 'of', 'others', 'pass', 'patient', 'perseveres', 'prophecies', 'protects', 'proud', 'record', 'rejoices', 'seeking', 'self', 'stilled', 'the', 'there', 'they', 'tongues', 'trusts', 'truth', 'where', 'will', 'with', 'wrongs']
52 features (words)
--------LSA singular values---------
[1.29695724 1.00165234 0.98752651 0.94862686]
--------5 documents in the 4-topic vector space---------
[[ 0.85667347 -0.00334881 -0.11274158 -0.14912237]
 [ 0.80868148  0.09220662 -0.16057627 -0.33804609]
 [ 0.46603522 -0.3005665  -0.06851382  0.82322097]
 [ 0.13423034  0.92315127  0.22573307  0.2806665 ]
 [ 0.24297388 -0.22857306  0.9386499  -0.08314939]]
--------2 most representative documents per topic---------
[array([0, 1], dtype=int64), array([3, 1], dtype=int64), array([4, 3], dtype=int64), array([2, 3], dtype=int64)]
--------3 keywords per topic---------
[array([28, 20, 19], dtype=int64), array([ 1, 46, 33], dtype=int64), array([49, 48, 43], dtype=int64), array([10, 42, 18], dtype=int64)]
--------LSA analysis results---------
Topic 0
	keywords: not, it, is
		document 0
		 Love is patient, love is kind. It does not envy, it does not boast, it is not proud.
		document 1
		 It does not dishonor others, it is not self-seeking, it is not easily angered, it keeps no record of wrongs.
Topic 1
	keywords: always, trusts, perseveres
		document 0
		 It always protects, always trusts, always hopes, always perseveres.
		document 1
		 It does not dishonor others, it is not self-seeking, it is not easily angered, it keeps no record of wrongs.
Topic 2
	keywords: will, where, there
		document 0
		 Love never fails. But where there are prophecies, they will cease; where there are tongues, they will be stilled; where there is knowledge, it will pass away. (1 Corinthians 13:4-8 NIV)
		document 1
		 It always protects, always trusts, always hopes, always perseveres.
Topic 3
	keywords: delight, the, in
		document 0
		 Love does not delight in evil but rejoices with the truth.
		document 1
		 It always protects, always trusts, always hopes, always perseveres.
```
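The top-k selection in the code hinges on the negative-step slice argsort()[:-(k + 1):-1], which reads the ascending sort order backwards to get the indices of the k largest values. A tiny standalone illustration (toy array, not from the original post):

```python
import numpy as np

a = np.array([0.2, 0.9, 0.1, 0.5])
k = 2
order = a.argsort()          # ascending order of indices: [2 0 3 1]
print(order[:-(k + 1):-1])   # [1 3] -> indices of the two largest values
```

Applied per topic, this picks the documents with the largest coordinates along each topic axis of X1, and the words with the largest loadings in each row of lsa.components_.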
4. References
sklearn: Text topic analysis with TruncatedSVD (sklearn: 利用TruncatedSVD做文本主题分析), https://cloud.tencent.com/developer/article/1530432