Based on the Kaggle course: https://www.kaggle.com/learn/feature-engineering
Next article: Feature Engineering 2. Categorical Encodings (prediction task: whether a user will download an app after clicking an ad)
1. Reading the data
Dataset: ks-projects-201801.csv
Pass the 'deadline' and 'launched' columns to parse_dates so that they are parsed as datetimes:
    import pandas as pd

    ks = pd.read_csv('ks-projects-201801.csv', parse_dates=['deadline', 'launched'])
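To see what parse_dates changes, here is a minimal sketch on a two-row in-memory CSV. The rows are made up for illustration and only mimic the column names of the real file; the point is that the parsed column gets a datetime dtype, which is what makes the .dt accessor available later.

```python
import io
import pandas as pd

# Hypothetical two-row sample using the same column names as the real file.
csv_text = """ID,launched,deadline
1,2017-09-02 04:43:57,2017-11-01
2,2016-02-26 13:38:27,2016-04-01
"""

# Without parse_dates the timestamp columns stay as plain text.
raw = pd.read_csv(io.StringIO(csv_text))
# With parse_dates they are converted to datetime values.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['deadline', 'launched'])

print(raw['launched'].dtype)
print(parsed['launched'].dtype)
# The .dt accessor only works on the parsed (datetime) column.
print(parsed['launched'].dt.hour.tolist())
```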
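The goal is to predict whether a Kickstarter project will succeed. The state column serves as the outcome label. Features such as category, currency, funding goal, country, and launch time (launched) can be used.

2. Processing the label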
The state column takes six distinct values:
    pd.unique(ks.state)

    array(['failed', 'canceled', 'successful', 'live', 'undefined', 'suspended'], dtype=object)
How many rows does each value have? Group by state and count the ID rows in each group:
    ks.groupby('state')['ID'].count()

    state
    canceled       38779
    failed        197719
    live            2799
    successful    133956
    suspended       1846
    undefined       3562
    Name: ID, dtype: int64
Drop the live rows, then label successful projects as 1 and all others as 0:
    ks = ks.query('state != "live"')  # discard projects that are still live
    ks = ks.assign(outcome=(ks['state'] == 'successful').astype(int))  # label as int: 1 or 0
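As a quick sanity check on the new label, the share of positive rows can be worked out directly from the per-state counts listed above. This is only a sketch of the arithmetic; the dictionary simply restates those counts.

```python
# Row counts per state, copied from the groupby output above.
state_counts = {
    'canceled': 38779, 'failed': 197719, 'live': 2799,
    'successful': 133956, 'suspended': 1846, 'undefined': 3562,
}

# Drop the live projects, then compute the share of rows with outcome == 1.
kept = {state: n for state, n in state_counts.items() if state != 'live'}
positive_rate = kept['successful'] / sum(kept.values())
print(f"rows kept: {sum(kept.values())}, positive rate: {positive_rate:.4f}")
# prints: rows kept: 375862, positive rate: 0.3564
```

Roughly 36% of the remaining projects are successful, so the classes are unbalanced but not severely so.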
3. Adding features
Split the launched timestamp into year, month, day, and hour, and add each as a new feature:
    ks = ks.assign(hour=ks.launched.dt.hour,
                   day=ks.launched.dt.day,
                   month=ks.launched.dt.month,
                   year=ks.launched.dt.year)
    ks.head()
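Convert the categorical features category, currency, and country to numbers with LabelEncoder:
    from sklearn.preprocessing import LabelEncoder

    cat_features = ['category', 'currency', 'country']
    encoder = LabelEncoder()
    # apply() calls fit_transform on each column in turn, so every column gets its own mapping
    encoded = ks[cat_features].apply(encoder.fit_transform)
    encoded.head(10)
Join the encoded columns with the numeric features and the label:
    X = ks[['goal', 'hour', 'day', 'month', 'year', 'outcome']].join(encoded)
    X.head()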
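What label encoding does to a column can be sketched in a few lines of plain Python: each distinct value is assigned an integer according to its sorted order. This is an illustration of the idea, not scikit-learn's implementation, and the label_encode helper and the sample values are hypothetical.

```python
def label_encode(values):
    """Map each distinct value to an integer by sorted order."""
    classes = sorted(set(values))
    mapping = {value: code for code, value in enumerate(classes)}
    return [mapping[value] for value in values], classes

currencies = ['USD', 'GBP', 'USD', 'EUR', 'GBP']  # hypothetical sample
codes, classes = label_encode(currencies)
print(classes)  # ['EUR', 'GBP', 'USD']
print(codes)    # [2, 1, 2, 0, 1]
```

The integers carry no real ordering between categories. Tree-based models can usually still work with such codes, but a linear model would read them as magnitudes.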
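4. Splitting the dataset
The data is split by simple slicing: the last 10% of rows form the test set, the 10% before that the validation set, and the rest the training set. sklearn.model_selection.StratifiedShuffleSplit can be used as an alternative that approximately preserves the label ratio in every split.
    valid_ratio = 0.1
    valid_size = int(len(X) * valid_ratio)

    train = X[: -2 * valid_size]
    valid = X[-2 * valid_size : -valid_size]
    test = X[-valid_size :]
Check that the share of positive labels is similar in each dataset:
    for each in [train, valid, test]:
        print("Outcome fraction = {:.4f}".format(each.outcome.mean()))

    Outcome fraction = 0.3570
    Outcome fraction = 0.3539
    Outcome fraction = 0.3542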
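For comparison, here is a minimal sketch of a stratified split with sklearn.model_selection.StratifiedShuffleSplit. The data is synthetic (1000 rows, 35% positive) and the variable names are assumptions for illustration; unlike the slicing above, this approach shuffles the rows, so it is not a drop-in replacement when row order matters.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical data: 1000 rows, 3 random features, 35% positive labels.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 3))
labels = np.array([1] * 350 + [0] * 650)

# One shuffled split that holds out 10% of the rows, stratified by label.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
train_idx, test_idx = next(splitter.split(features, labels))

print(f"train positive fraction: {labels[train_idx].mean():.4f}")
print(f"test positive fraction:  {labels[test_idx].mean():.4f}")
```

Both fractions should land very close to the overall 35%, which is the property the manual check above is looking for.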
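5. Training
    import lightgbm as lgb

    feature_cols = train.columns.drop('outcome')
    dtrain = lgb.Dataset(train[feature_cols], label=train['outcome'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['outcome'])

    param = {'num_leaves': 64, 'objective': 'binary'}
    param['metric'] = 'auc'
    num_round = 1000
    # Stop early once the validation AUC has not improved for 10 rounds.
    # Note: LightGBM 4.0 removed the early_stopping_rounds and verbose_eval arguments;
    # there, pass callbacks=[lgb.early_stopping(10, verbose=False)] instead.
    bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid],
                    early_stopping_rounds=10, verbose_eval=False)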
6. Prediction
    from sklearn import metrics

    # Predicted probabilities of success on the held-out test set
    ypred = bst.predict(test[feature_cols])
    score = metrics.roc_auc_score(test['outcome'], ypred)
    print(f"Test AUC score: {score}")
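To make the AUC score concrete: it can be read as the fraction of (positive, negative) pairs in which the positive example receives the higher predicted score, with ties counted as half. The sketch below computes it that way in plain Python; the roc_auc helper and the sample values are hypothetical, and the quadratic loop is meant for illustration rather than large datasets.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

y_true = [0, 0, 1, 1]             # hypothetical labels
y_score = [0.1, 0.4, 0.35, 0.8]   # hypothetical predicted probabilities
print(roc_auc(y_true, y_score))   # 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so the test AUC can be judged against those two reference points.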