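Machine Learning, Decision Trees: Solving Classification Problems with Discrete Variables
by CyrusMay, 2020-05-20

Abstract
This article implements the calculation of entropy gain and entropy gain ratio in Python, builds a decision tree model for discrete variables, and wraps the code in classes so that readers can call it directly. The code is validated on a Kaggle dataset of discrete variables.

Computing entropy gain and entropy gain ratio
The CyrusEntropy class computes the entropy, conditional entropy, entropy gain (mutual information), and entropy gain ratio of discrete variables: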
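 .cal_entropy(): computes the entropy
 .cal_conditional_entropy(): computes the conditional entropy
 .cal_entropy_gain(): computes the entropy gain (mutual information)
 .cal_entropy_gain_ratio(): computes the entropy gain ratio
 Usage: create an object from the features and labels, then call the method you need.
 Features and labels should preferably be passed as a DataFrame, Series, or list.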
 To compute the entropy of a single variable, pass the same values as both the features and the labels.

    import copy

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder


    class CyrusEntropy(object):
        """
        Computes the entropy, conditional entropy, entropy gain (mutual
        information) and entropy gain ratio of discrete variables.
        .cal_entropy(): computes the entropy
        .cal_conditional_entropy(): computes the conditional entropy
        .cal_entropy_gain(): computes the entropy gain (mutual information)
        .cal_entropy_gain_ratio(): computes the entropy gain ratio
        Usage: create an object from the features and labels, then call
               the method you need.
               Features and labels should preferably be a DataFrame,
               Series, or list.
               To compute the entropy of a single variable, pass the same
               values as both the features and the labels.
        """

        def __init__(self, x, y):
            # Label-encode every feature column and the labels
            x = pd.DataFrame(x)
            y = pd.Series(y)
            x0 = copy.copy(x)
            y = copy.copy(y)
            for i in range(x.shape[1]):
                x0.iloc[:, i] = LabelEncoder().fit_transform(x.iloc[:, i])
            self.X = x0
            self.Y = pd.Series(LabelEncoder().fit_transform(y))

        def cal_entropy(self):
            # Returns ([H(X_i) for each feature], H(Y)), in bits
            x_entropy = []
            for i in range(self.X.shape[1]):
                number = np.array(self.X.iloc[:, i].value_counts())
                p = number / number.sum()
                x_entropy.append(np.sum(-p * np.log2(p)))
            number = np.array(self.Y.value_counts())
            p = number / number.sum()
            y_entropy = np.sum(-p * np.log2(p))
            return x_entropy, y_entropy

        def cal_conditional_entropy(self):
            # Returns [H(Y | X_i) for each feature]
            y_x_conditional_entropy = []
            for i in range(self.X.shape[1]):
                # Collect the labels observed for each value of feature i
                dict_flag = {}
                for j in range(self.X.shape[0]):
                    key = self.X.iloc[j, i]
                    dict_flag[key] = dict_flag.get(key, []) + [self.Y.iloc[j]]
                condition_value = 0
                for y_value in dict_flag.values():
                    number = np.array(pd.Series(y_value).value_counts())
                    p = number / number.sum()
                    # Weight each group's entropy by the group's share of the rows
                    condition_value += np.sum(-p * np.log2(p)) * len(y_value) / self.Y.shape[0]
                y_x_conditional_entropy.append(condition_value)
            return y_x_conditional_entropy

        def cal_entropy_gain(self):
            # Gain_i = H(Y) - H(Y | X_i)
            return list(np.array(self.cal_entropy()[1]) - np.array(self.cal_conditional_entropy()))

        def cal_entropy_gain_ratio(self):
            # GainRatio_i = Gain_i / H(X_i); a feature with zero entropy yields nan
            return list(np.array(self.cal_entropy_gain()) / np.array(self.cal_entropy()[0]))

Results for entropy gain and entropy gain ratio
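Before turning to the Kaggle data, the four quantities can be checked by hand on a toy example. The following is a minimal sketch, independent of the CyrusEntropy class: the helper names `entropy` and `conditional_entropy` and the `outlook`/`play` data are made up for illustration and are not part of the original code or dataset.

```python
import numpy as np
import pandas as pd


def entropy(values):
    """Shannon entropy in bits: H(V) = -sum_v p(v) * log2 p(v)."""
    p = pd.Series(values).value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())


def conditional_entropy(feature, label):
    """H(Y|X) = sum_x p(x) * H(Y | X = x)."""
    feature = pd.Series(feature).reset_index(drop=True)
    label = pd.Series(label).reset_index(drop=True)
    return float(sum(
        len(group) / len(label) * entropy(group)
        for _, group in label.groupby(feature)
    ))


# Hypothetical toy data, not taken from the Kaggle dataset
outlook = ["sunny", "sunny", "rainy", "rainy"]
play = ["no", "no", "yes", "no"]

h_y = entropy(play)                               # H(Y)
h_y_given_x = conditional_entropy(outlook, play)  # H(Y|X)
gain = h_y - h_y_given_x                          # entropy gain (mutual information)
gain_ratio = gain / entropy(outlook)              # gain divided by H(X)

print(h_y, h_y_given_x, gain, gain_ratio)
```

Here the "sunny" group is pure (entropy 0) and the "rainy" group is split evenly (entropy 1), so H(Y|X) is 0.5 bits; because H(X) is exactly 1 bit, the gain ratio equals the gain.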
Kaggle's description of the dataset:

 The Lifetime reality television show and social experiment, Married at First Sight, features men and women who sign up to marry a complete stranger they’ve never met before. Experts pair couples based on tests and interviews. After marriage, couples have only a few short weeks together to decide if they want to stay married or get a divorce. There have been 10 full seasons so far which provides interesting data to look at what factors may or may not play a role in their decisions at the end of eight weeks as well as longer-term outcomes since the show aired.

    if __name__ == "__main__":
        data = pd.read_csv("./mafs.csv", header=0)
        # Labels: the Status column
        Y = data.Status
        # Features: every column except Couple and Status
        X = data.drop(labels="Couple", axis=1)
        X = X.drop(labels="Status", axis=1)
        print(X.head(2))
 Entropy of each feature and of the labels

    # Build the entropy object
    entropy_model = CyrusEntropy(X, Y)
    # Entropy of each feature and of the labels
    entropy = entropy_model.cal_entropy()
    print(entropy)

    ([3.29646716508619, 3.1199965768508955, 6.087462841250342, 3.520444587294042, 1.0, 6.087462841250342, 0.8739810481273578, 0.0, 0.833764907210665, 0.833764907210665, 0.833764907210665, 0.833764907210665, 0.6722948170756379, 0.8739810481273578, 0.833764907210665], 0.833764907210665)

 Conditional entropy of the labels given each feature

    condition_entropy = entropy_model.cal_conditional_entropy()
    print(condition_entropy)

    [0.6655644259732555, 0.699248162082863, 0.0, 0.7352336969711815, 0.833764907210665, 0.0, 0.67371811971174, 0.833764907210665, 0.7982018075321516, 0.7982018075321516, 0.7982018075321516, 0.7982018075321516, 0.8255150132281116, 0.8067159627055736, 0.8276667497383372]

 Entropy gain (mutual information) of the labels with respect to each feature

    entropy_gain = entropy_model.cal_entropy_gain()
    print(entropy_gain)

    [0.1682004812374095, 0.13451674512780198, 0.833764907210665, 0.09853121023948352, 0.0, 0.833764907210665, 0.16004678749892498, 0.0, 0.035563099678513344, 0.035563099678513344, 0.035563099678513344, 0.035563099678513344, 0.00824989398255338, 0.027048944505091432, 0.006098157472327781]

 Entropy gain ratio of the labels with respect to each feature

    entropy_gain_rate = entropy_model.cal_entropy_gain_ratio()
    print(entropy_gain_rate)

    [0.05102446735064376, 0.04311438869063559, 0.1369642704939145, 0.02798828608042902, 0.0, 0.1369642704939145, 0.183123865033287, nan, 0.04265362978335057, 0.04265362978335057, 0.04265362978335057, 0.04265362978335057, 0.012271244360381867, 0.03094912019322165, 0.007314001128602282]

The nan in the eighth position comes from a feature whose entropy is 0 (it takes a single value), so its gain ratio is 0/0.

Decision tree model for discrete variables
The CyrusDecisionTreeDiscrete class builds a decision tree model for classification problems with discrete variables:
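 .fit(): fits (trains) the model
 .predict(): predicts labels
 .tree_net: the fitted decision tree
 Usage: create an instance of the class, call fit to train the model,
 then call predict to make predictions; the tree can be inspected through the tree_net attribute.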
 Features and labels should preferably be passed as a DataFrame, Series, or list.

    class CyrusDecisionTreeDiscrete(object):
        """
        Decision tree model for classification problems with discrete variables.
        .fit(): fits (trains) the model
        .predict(): predicts labels
        .tree_net: the fitted decision tree
        Usage: create an instance, call fit to train the model, then call
               predict; the tree can be inspected through tree_net.
               Features and labels should preferably be a DataFrame,
               Series, or list.
        """
        X = None
        Y = None

        def __init__(self, algorithm="ID3"):
            self.method = algorithm
            self.tree_net = {}

        def tree(self, x, y, dict_):
            # Split on the feature (column position) with the largest entropy gain
            entropy_model = CyrusEntropy(x, y)
            index = int(np.argmax(entropy_model.cal_entropy_gain()))
            dict_[index] = {}
            # Group the rows and their labels by the value of the chosen feature
            dict_x_flag = {}
            dict_y_flag = {}
            for i in range(x.shape[0]):
                value = x.iloc[i, index]
                dict_x_flag[value] = dict_x_flag.get(value, []) + [list(x.iloc[i, :])]
                dict_y_flag[value] = dict_y_flag.get(value, []) + [y.iloc[i]]
            for key in dict_x_flag:
                if pd.Series(dict_y_flag[key]).value_counts().shape[0] == 1:
                    # Pure branch: store the label as a leaf
                    dict_[index][key] = dict_y_flag[key][0]
                else:
                    # Mixed branch: build a subtree from the rows in this branch
                    dict_[index][key] = {}
                    self.tree(pd.DataFrame(dict_x_flag[key]),
                              pd.Series(dict_y_flag[key]),
                              dict_[index][key])

        def fit(self, x, y):
            self.X = pd.DataFrame(x)
            self.Y = pd.Series(y)
            self.tree(self.X, self.Y, self.tree_net)

        def cal_label(self, x, dict_):
            # Each node has a single key: the column position it splits on
            index = list(dict_.keys())[0]
            branch = dict_[index][x.iloc[index]]
            if isinstance(branch, dict):
                return self.cal_label(x, branch)
            return branch

        def predict(self, x):
            x = pd.DataFrame(x)
            y = []
            for i in range(x.shape[0]):
                se = pd.Series(x.iloc[i, :])
                y.append(self.cal_label(se, self.tree_net))
            return y

Results of the decision tree model
 训练并拟合模型
 模型预测# 建立决策树模型 tree_model = CyrusDecisionTreeDiscrete()  # 训练并拟合模型 tree_model.fit(X,Y)  # 模型预测 y_pre = tree_model.predict(X) print(y_pre) ['Married', 'Married', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Married', 'Married', 'Married', 'Married', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Married', 'Married', 'Divorced', 'Divorced', 'Married', 'Married', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Married', 'Married', 'Divorced', 'Divorced', 'Married', 'Married', 'Married', 'Married', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Married', 'Married', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced'] # 准确率检测 result = [1 if y_pre[i] == Y[i] else 0 for i in range(len(y_pre))] print("准确率为:",np.array(result).sum()/len(result)) 准确率为: 1.0 
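The tree_net attribute mentioned above is a nested dict: each node has a single key, the column position it splits on, whose value maps every observed feature value either to a label (a leaf) or to another node. The sketch below walks such a structure by hand. The `walk` helper and the `toy_tree` contents are hypothetical, written only to illustrate the layout; they are not the tree fitted on the Kaggle data.

```python
def walk(tree, row):
    """Follow a nested-dict tree until a leaf label is reached."""
    node = tree
    while isinstance(node, dict):
        # Each node has exactly one key: the column position to split on
        feature_index = next(iter(node))
        node = node[feature_index][row[feature_index]]
    return node


# Hypothetical tree: split on column 0; for value "b", split again on column 1
toy_tree = {0: {"a": "Married", "b": {1: {"x": "Divorced", "y": "Married"}}}}

print(walk(toy_tree, ["a", "x"]))  # Married
print(walk(toy_tree, ["b", "x"]))  # Divorced
print(walk(toy_tree, ["b", "y"]))  # Married
```

A feature value that never appeared during training has no entry in the dict, so a lookup like this raises a KeyError for it.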
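 If time could flow backward
 I think I would still
 throw myself into idling it away
 So be it, anyway
 I know that I
 tried my best
 (五月天 / Mayday, "一颗苹果")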