Notes based on https://www.kaggle.com/learn/feature-engineering
Label Encoding and One-Hot Encoding were already covered in Intermediate Machine Learning. This post covers count encoding, target encoding, and singular value decomposition.
In brief: count encoding replaces each categorical value with the number of times it occurs; target encoding replaces a categorical value with the average value of the target for that value of the feature; CatBoost encoding is similar to target encoding in that it is based on the target probability for a given value.
The previous post used LabelEncoder() and got Validation AUC score: 0.7467.
    from sklearn.preprocessing import LabelEncoder

    # Label encoding (ks is the DataFrame from the previous post)
    cat_features = ['category', 'currency', 'country']
    encoder = LabelEncoder()
    encoded = ks[cat_features].apply(encoder.fit_transform)
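1. Count Encoding
Count encoding replaces each categorical value with the number of times it occurs. For example, if the value CN appears 100 times in a feature, every CN is replaced with the number 100.
Using category_encoders.CountEncoder(), the final score is Validation AUC score: 0.7486.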
    import category_encoders as ce

    cat_features = ['category', 'currency', 'country']
    count_enc = ce.CountEncoder()
    count_encoded = count_enc.fit_transform(ks[cat_features])

    # Join the count-encoded columns to the baseline data
    data = baseline_data.join(count_encoded.add_suffix("_count"))

    # Train a model on the baseline data plus the count encodings
    train, valid, test = get_data_splits(data)
    bst = train_model(train, valid)
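To make the idea concrete, count encoding can also be done by hand with plain pandas. This is only a minimal sketch on a made-up toy column (the data is hypothetical, not the Kickstarter dataset), and it is not meant to reproduce everything CountEncoder does:

```python
import pandas as pd

# Toy data (hypothetical); the column name mirrors the 'country' feature above.
df = pd.DataFrame({"country": ["CN", "US", "CN", "CA", "CN", "US"]})

# Count how often each category value occurs ...
counts = df["country"].value_counts()

# ... and replace every value with its count.
df["country_count"] = df["country"].map(counts)

print(df["country_count"].tolist())  # [3, 2, 3, 1, 3, 2]
```

CN occurs three times, so each CN row becomes 3, in the same way that CN would become 100 in the example above.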
2. Target Encoding
Using category_encoders.TargetEncoder(), the final score is Validation AUC score: 0.7491.
Target encoding replaces a categorical value with the average value of the target (the label) for that value of the feature.
For example, given the country value "CA", you'd calculate the average outcome for all the rows with country == 'CA', around 0.28, and use that mean in place of "CA".
This is often blended with the target probability over the entire dataset to reduce the variance of values with few occurrences. A category that appears in only a handful of rows has a noisy average, so pulling it toward the overall mean gives a more stable estimate.
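The blending step can be sketched as a weighted average of the per-category mean and the global mean. The additive form and the weight m below are assumptions chosen for illustration; the formula used inside category_encoders.TargetEncoder may differ. The toy data is hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): 'CA' appears 4 times, 'NZ' only once.
# In practice these statistics would be computed on the training rows only.
df = pd.DataFrame({
    "country": ["CA", "CA", "CA", "CA", "NZ"],
    "outcome": [1, 0, 0, 0, 1],
})

m = 2.0  # smoothing strength (assumed); larger m pulls harder toward the global mean
global_mean = df["outcome"].mean()  # 2 / 5 = 0.4

# Per-category mean and number of rows.
stats = df.groupby("country")["outcome"].agg(["mean", "count"])

# Weighted blend of the per-category mean and the global mean.
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

df["country_target"] = df["country"].map(smoothed)
print(smoothed)
```

Here the raw mean for NZ is 1.0 from a single row, while the blended value is 0.6; CA, with four rows, only moves from 0.25 to 0.3. Rare values are shrunk the most, which is where the variance reduction comes from.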
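This technique uses the targets to create new features, so do not include the validation or test data when fitting the encoder; doing so would be a form of target leakage.
Instead, you should learn the target encodings from the training dataset only and apply them to the other datasets.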
    import category_encoders as ce

    cat_features = ['category', 'currency', 'country']

    # Create the encoder itself
    target_enc = ce.TargetEncoder(cols=cat_features)

    train, valid, _ = get_data_splits(data)

    # Fit the encoder using the categorical features and target (training rows only)
    target_enc.fit(train[cat_features], train['outcome'])

    # Transform the features, rename the columns with _target suffix, and join to dataframe
    train = train.join(target_enc.transform(train[cat_features]).add_suffix('_target'))
    valid = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_target'))

    train.head()
    bst = train_model(train, valid)
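The fit-on-train, apply-everywhere pattern can be illustrated without the library. The sketch below uses hypothetical toy frames and unsmoothed means; falling back to the training mean for unseen categories is one possible design choice, not necessarily what TargetEncoder does.

```python
import pandas as pd

# Hypothetical toy split; 'outcome' is the label column, as in the post.
train = pd.DataFrame({
    "country": ["CA", "CA", "US", "US", "US"],
    "outcome": [1, 0, 1, 1, 0],
})
valid = pd.DataFrame({"country": ["CA", "US", "GB"]})  # 'GB' never appears in train

# Learn the encoding from the training rows only.
means = train.groupby("country")["outcome"].mean()
fallback = train["outcome"].mean()  # for categories unseen in train (assumed choice)

# Apply the learned mapping to both splits; validation labels are never used.
train["country_target"] = train["country"].map(means)
valid["country_target"] = valid["country"].map(means).fillna(fallback)

print(valid)
```

Because the mapping is built before the validation frame is touched, nothing about the validation labels can leak into the new feature.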
3. CatBoost Encoding
Using category_encoders.CatBoostEncoder(), the final score is Validation AUC score: 0.7492.
This is similar to target encoding in that it is based on the target probability for a given value.
However, with CatBoost, for each row the target probability is calculated only from the rows before it.
    import category_encoders as ce

    cat_features = ['category', 'currency', 'country']
    target_enc = ce.CatBoostEncoder(cols=cat_features)

    train, valid, _ = get_data_splits(data)

    # Fit on the training rows only, as with target encoding
    target_enc.fit(train[cat_features], train['outcome'])

    # Transform the features, rename the columns with _cb suffix, and join to dataframe
    train = train.join(target_enc.transform(train[cat_features]).add_suffix('_cb'))
    valid = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_cb'))

    bst = train_model(train, valid)
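The "only the rows before it" idea can be sketched with cumulative statistics in pandas. This is a simplified illustration on hypothetical toy data: the prior weight a and the exact formula are assumptions, and the real CatBoostEncoder may differ in its details.

```python
import pandas as pd

# Toy data (hypothetical); row order matters for this encoding.
df = pd.DataFrame({
    "country": ["CA", "US", "CA", "CA", "US"],
    "outcome": [1, 0, 0, 1, 1],
})

prior = df["outcome"].mean()  # global target mean, 3 / 5 = 0.6
a = 1.0                       # weight given to the prior (assumed)

grouped = df.groupby("country")["outcome"]
prev_count = grouped.cumcount()              # earlier rows with the same category
prev_sum = grouped.cumsum() - df["outcome"]  # sum of their outcomes, excluding this row

# Each row is encoded from the rows before it, blended with the prior.
df["country_cb"] = (prev_sum + a * prior) / (prev_count + a)

print(df)
```

The first CA row and the first US row have no history, so they both get the prior 0.6. The second CA row sees one earlier CA with outcome 1 and gets (1 + 0.6) / 2 = 0.8. A row's own label never enters its own encoding.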