Merge pull request #19 from ArtificialZeng/patch-2
Create README_EN.md
Showing 1 changed file with 283 additions and 0 deletions.
[English](./README_EN.md) | 简体中文

# What is autox_competition
An automated machine learning tool developed by AutoX for tabular data mining competitions.

# Get started quickly
## Use the following functions to get prediction results with one call:
```
autox.get_submit     # regression or classification problem
autox.get_submit_ts  # time series dataset
```
## Demo (by data type)
#### Binary classification problem
Kaggle Santander, AutoX solution:
- [colab](https://colab.research.google.com/drive/1HKOr3vK_Ty3Dftw2JF4SJWFtwxsBfcLz?usp=sharing)
- [kaggle-kernel](https://www.kaggle.com/poteman/autox-tutorial-santander/)

2021 China Information Geek Competition, Loan Anti-Fraud, AutoX solution (2021神州信息极客大赛-贷款反欺诈):
- [datafountain notebook](https://work.datafountain.cn/forum?type=3&id=5843)

#### Regression problem
DC Rent Forecast, AutoX solution (DC租金预测):
- [colab](https://colab.research.google.com/drive/1SxK_-_6oAE8OzDitXCy2JM29F9UE0Ujj?usp=sharing)
- [DClab](https://www.dclab.run/project_content.html?type=myproject&id=5393)

#### Time series prediction problem (multi-table)
2021 Alibaba Cloud Supply Chain Competition, AutoX solution (2021阿里云供应链大赛):
- [colab](https://colab.research.google.com/drive/1cw5ynTPqc5RWbVjQdvbnDHkq_1rTlxqe?usp=sharing)
- [Tianchi: two lines of code to solve the supply chain forecasting problem](https://tianchi.aliyun.com/forum/postDetail?spm=5176.12586969.1002.3.6b172672PZvjjb&postId=306418)
- [Alibaba supply chain forecasting: common features for time series forecasting problems](https://tianchi.aliyun.com/forum/postDetail?spm=5176.12586969.1002.3.6b17698aXo3jYP&postId=308014)

#### Tabular dataset that contains image data
Kaggle PetFinder, AutoX solution:
- [automl for petfinder: predicting the popularity of pet photos](https://www.kaggle.com/poteman/automl-for-petfinder-autox?scriptVersionId=81076747)

## Demo (by usage scenario)
#### Marketing scenario
[Predict whether bank customers will subscribe to a term deposit](https://www.kaggle.com/poteman/automl-for-bank-autox)
#### Risk control scenario
[Loan default prediction](https://www.kaggle.com/poteman/automl-for-loan-autox/)
#### Recommendation scenario
[Predict whether a mobile ad will be clicked](https://www.kaggle.com/poteman/automl-for-avazu-autox)

## Use the following functions to get the top-k important features with one call:
```
autox.get_top_features     # regression or classification problem
autox.get_top_features_ts  # time series dataset
```
## Demo: get the top-k important features

Kaggle Allstate, getting the top-k important features:
- [autox_get_top_features_Allstate](https://www.kaggle.com/poteman/autox-get-top-features-allstate?scriptVersionId=81484541)

# Contents
<!-- TOC -->

- [What is autox_competition](#what-is-autox_competition)
- [Get started quickly](#get-started-quickly)
- [Contents](#contents)
- [Result comparison](#result-comparison)
- [Summary of competition tricks](#summary-of-competition-tricks)
- [Feature engineering](./feature_engineer/README.md)
- [Feature selection](./feature_selection/README.md)

<!-- /TOC -->

# Result comparison
For auc and accuracy, higher is better. For all other metrics in the table, lower is better.

|data_type | single-or-multi | data_name | metric | AutoX | AutoGluon | H2O |
|----- | ------------- | ----------- |---------------- |---------------- | ----------------|----------------|
|binary classification | single-table | [Springleaf](https://www.kaggle.com/c/springleaf-marketing-response/) | auc | 0.78865 | 0.61141 | 0.78186 |
|binary classification | single-table |[stumbleupon](https://www.kaggle.com/c/stumbleupon/) | auc | 0.87177 | 0.81025 | 0.79039 |
|binary classification | single-table |[santander](https://www.kaggle.com/c/santander-customer-transaction-prediction/) | auc | 0.89196 | 0.64643 | 0.88775 |
|binary classification | multi-table |[IEEE](https://www.kaggle.com/c/ieee-fraud-detection/) | accuracy | 0.920809 | 0.724925 | 0.907818 |
|regression | single-table |[ventilator](https://www.kaggle.com/c/ventilator-pressure-prediction/) | mae | 0.755 | 8.434 | 4.221 |
|regression | single-table |[Allstate Claims Severity](https://www.kaggle.com/c/allstate-claims-severity)| mae | 1137.07885 | 1173.35917 | 1163.12014 |
|regression | single-table |[zhidemai](https://www.automl.ai/competitions/19) | mse | 1.0034 | 1.9466 | 1.1927|
|regression | single-table |[Tabular Playground Series - Aug 2021](https://www.kaggle.com/c/tabular-playground-series-aug-2021) | rmse | 7.87731 | 10.3944 | 7.8895|
|regression | single-table |[House Prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/) | rmse | 0.13043 | 0.13104 | 0.13161 |
|regression | single-table |[Restaurant Revenue](https://www.kaggle.com/c/restaurant-revenue-prediction/)| rmse | 2133204.32146 | 31913829.59876 | 28958013.69639 |
|regression | multi-table |[Elo Merchant Category Recommendation](https://www.kaggle.com/c/elo-merchant-category-recommendation/)| rmse | 3.72228 | 3.80801 | 22.88899 |
|regression-ts | single-table |[Demand Forecasting](https://www.kaggle.com/c/demand-forecasting-kernels-only/)| smape | 13.79241 | 25.39182 | 18.89678 |
|regression-ts | multi-table |[Walmart Recruiting](https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/)| wmae | 4660.99174 | 5024.16179 | 5128.31622 |
|regression-ts | multi-table |[Rossmann Store Sales](https://www.kaggle.com/c/rossmann-store-sales/)| RMSPE | 0.13850 | 0.20453 | 0.35757 |
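
The table names its metrics without defining them. As a rough guide, the sketch below computes two of the less familiar ones, smape and RMSPE, in a form commonly used for competitions. The zero-handling conventions and the percent scaling of smape are assumptions here, not something this README specifies, so check each competition's evaluation page for the exact definition.

```python
import math

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent (0 to 200)."""
    total = 0.0
    for a, f in zip(y_true, y_pred):
        denom = abs(a) + abs(f)
        # Assumed convention: a pair where both values are 0 adds no error.
        if denom != 0:
            total += 2.0 * abs(f - a) / denom
    return 100.0 * total / len(y_true)

def rmspe(y_true, y_pred):
    """Root mean squared percentage error; rows with a zero target are skipped."""
    ratios = [((a - f) / a) ** 2 for a, f in zip(y_true, y_pred) if a != 0]
    return math.sqrt(sum(ratios) / len(ratios))
```

Both functions take plain sequences of numbers, so they can be applied directly to a column of labels and a column of predictions.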

# Data types
- cat: categorical, an unordered categorical variable
- ord: ordinal, an ordered categorical variable
- num: numeric, a continuous variable
- datetime: a time variable in datetime format
- timestamp: a time variable in timestamp format
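
To make the type labels concrete, here is a minimal sketch of how a column of raw strings could be assigned one of `num`, `datetime` or `cat`. The function `infer_column_type`, its helper and the default datetime format are hypothetical; this is not the rule AutoX applies, and it does not attempt to separate `ord` from `cat` or `timestamp` from `num`.

```python
from datetime import datetime

def _all_parse(values, parser):
    """Return True when parser accepts every value without raising ValueError."""
    try:
        for v in values:
            parser(v)
    except ValueError:
        return False
    return True

def infer_column_type(values, datetime_format="%Y-%m-%d %H:%M:%S"):
    """Guess 'num', 'datetime' or 'cat' for a list of raw string values."""
    if _all_parse(values, float):
        return "num"
    if _all_parse(values, lambda v: datetime.strptime(v, datetime_format)):
        return "datetime"
    # Anything that is neither numeric nor a datetime is treated as categorical.
    return "cat"
```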

# Table relationships
```
"relations": [  # table relations (one of four kinds: 1-1, 1-M, M-1, M-M)
    {
        "related_to_main_table": "true",  # whether this relation involves the main table
        "left_entity": "overdue",         # left table name
        "left_on": ["new_user_id"],       # join keys of the left table
        "right_entity": "userinfo",       # right table name
        "right_on": ["new_user_id"],      # join keys of the right table
        "type": "1-1"                     # relation type between the left and right tables
    },
    {
        "related_to_main_table": "true",
        "left_entity": "overdue",
        "left_on": ["new_user_id"],
        "left_time_col": "flag1",
        "right_entity": "bank",
        "right_on": ["new_user_id"],
        "right_time_col": "flag1",
        "type": "1-M"
    },
    {
        "related_to_main_table": "true",
        "left_entity": "overdue",
        "left_on": ["new_user_id"],
        "left_time_col": "flag1",
        "right_entity": "browse",
        "right_on": ["new_user_id"],
        "right_time_col": "flag1",
        "type": "1-M"
    },
    {
        "related_to_main_table": "true",
        "left_entity": "overdue",
        "left_on": ["new_user_id"],
        "left_time_col": "flag1",
        "right_entity": "bill",
        "right_on": ["new_user_id"],
        "right_time_col": "flag1",
        "type": "1-M"
    }
]
```
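
A relation list like the one above is easy to get subtly wrong, so a small sanity check can help before passing it on. The sketch below is a hypothetical helper, not part of AutoX. The set of required keys is inferred from the example, and the rule that the two time columns appear together is an assumption.

```python
REQUIRED_KEYS = {"related_to_main_table", "left_entity", "left_on",
                 "right_entity", "right_on", "type"}
ALLOWED_TYPES = {"1-1", "1-M", "M-1", "M-M"}

def check_relation(relation):
    """Return a list of problems found in one relation entry (empty if none)."""
    problems = []
    missing = REQUIRED_KEYS - set(relation)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if relation.get("type") not in ALLOWED_TYPES:
        problems.append(f"unknown type: {relation.get('type')!r}")
    if len(relation.get("left_on", [])) != len(relation.get("right_on", [])):
        problems.append("left_on and right_on differ in length")
    # Assumption: time columns are given for both sides or for neither.
    if ("left_time_col" in relation) != ("right_time_col" in relation):
        problems.append("only one of left_time_col/right_time_col is set")
    return problems

relation = {
    "related_to_main_table": "true",
    "left_entity": "overdue",
    "left_on": ["new_user_id"],
    "right_entity": "userinfo",
    "right_on": ["new_user_id"],
    "type": "1-1",
}
print(check_relation(relation))  # prints []
```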
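
# Pipeline logic
- 1. Initialize the AutoX class
```
1.1 Read the data
1.2 Concatenate train and test
1.3 Identify the type of each column in the tables
1.4 Preprocess the data
```
- 2. Feature engineering
```
Feature engineering covers single-table features and multi-table features.
Every feature engineering class provides the following functions:
1. Automatically select the features that the current operation applies to;
2. Inspect the selected features;
3. Modify the features that the current operation applies to;
4. Compute the feature data and return features with the same number of rows, in the same order, as the main table.
```
- 3. Feature merging
```
Merge the constructed features into one large wide table. The number of rows stays the same and the number of columns grows.
```
- 4. Train/test split
```
Split the wide table into a training set and a test set.
```
- 5. Feature filtering
```
Filter the constructed features by comparing the distribution of each feature column in train and test, to avoid overfitting.
```
- 6. Model training
```
Train the model on the filtered wide-table features.
The model classes provide the following functions:
1. View the default model parameters;
2. Train the model;
3. Tune the model parameters;
4. View the feature importance of the model;
5. Predict with the model.
```
- 7. Model prediction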
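
Step 5 does not say which distribution test is used. As one possible illustration, the sketch below drops a feature when most of its test values never occur in train, a simple symptom of a train/test shift. The function names, the dict-of-lists layout and the 0.5 threshold are all hypothetical and do not describe AutoX's actual criterion.

```python
def unseen_ratio(train_values, test_values):
    """Share of test rows whose value never occurs in the train column."""
    seen = set(train_values)
    unseen = sum(1 for v in test_values if v not in seen)
    return unseen / len(test_values)

def filter_features(train, test, max_unseen_ratio=0.5):
    """Keep the columns whose test values are mostly covered by train.

    train and test map column name -> list of values.
    """
    return [col for col in train
            if unseen_ratio(train[col], test[col]) <= max_unseen_ratio]

train = {"city": ["a", "b", "a"], "user_id": [1, 2, 3]}
test = {"city": ["a", "b"], "user_id": [4, 5]}
print(filter_features(train, test))  # prints ['city']
```

Here `user_id` is dropped because none of its test values appear in train, so a model could only memorize it.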
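
# AutoX class
```
The AutoX class manages the datasets and the dataset information for the user.
Initializing the AutoX class performs the following steps:
1. Read the data;
2. Concatenate train and test;
3. Identify the type of each column in the tables;
4. Preprocess the data.
```
## Attributes
### info_: stores information about the dataset.
- info_['id']: List, the unique key that identifies a sample
- info_['target']: String, the label column of the table
- info_['shape_of_train']: Int, the number of samples in the train dataset
- info_['shape_of_test']: Int, the number of samples in the test dataset
- info_['feature_type']: Dict of Dict, the data types of the feature columns in the tables
- info_['train_name']: String, the name of the main training table
- info_['test_name']: String, the name of the main test table

### dfs_: stores all DataFrames, including the original tables and the constructed tables.
- dfs_['train_test']: the train and test data after concatenation
- dfs_['FE_feature_name']: data constructed by feature engineering, for example FE_count, FE_groupby
- dfs_['FE_all']: the dataset that merges the original features with all engineered features

## Methods
- concat_train_test: concatenates the training set and the test set; usually run right after reading the data
- split_train_test: separates the training set and the test set; usually run after feature engineering is complete
- get_submit: returns the predictions (it runs the complete machine learning pipeline along the way, including data preprocessing, feature engineering, model training, parameter tuning, model ensembling and prediction)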
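
The reason for concatenating and later splitting is that features computed on the combined table see both sets. The sketch below mimics that flow on plain Python rows. The two functions only borrow the method names above and are not the AutoX implementation; the rows, the `x_minus_mean` feature and the `info` dict are made up for illustration.

```python
def concat_train_test(train_rows, test_rows):
    """Stack train and test rows into one list; train rows come first."""
    return train_rows + test_rows

def split_train_test(all_rows, n_train):
    """Undo concat_train_test using the stored number of train rows."""
    return all_rows[:n_train], all_rows[n_train:]

train = [{"id": 1, "x": 3.0, "target": 1}, {"id": 2, "x": 5.0, "target": 0}]
test = [{"id": 3, "x": 4.0}]

# Remember the sizes, in the spirit of info_['shape_of_train'].
info = {"shape_of_train": len(train), "shape_of_test": len(test)}
train_test = concat_train_test(train, test)

# A feature computed on train_test uses both sets, here a global mean of x.
mean_x = sum(r["x"] for r in train_test) / len(train_test)
for r in train_test:
    r["x_minus_mean"] = r["x"] - mean_x

train_fe, test_fe = split_train_test(train_test, info["shape_of_train"])
```

Because the row order is preserved, splitting by the stored train size recovers exactly the original train and test rows, now with the extra feature.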
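
# Details of the operations in the AutoX pipeline

## Reading data
```
Read all files under the given path. By default, the main training table and the main test table are concatenated
before data preprocessing, feature engineering and the later steps, and they are split again before model prediction starts.
```

## Data preprocessing
```
- Parse year, month, day, hour, day of week and similar information from time columns
- Before each training run, drop invalid features (nunique equal to 1) from the data fed to the model
- Remove abnormal samples, and remove samples whose label is nan
```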
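
The three preprocessing steps can be sketched with the standard library as follows. The function names, the default datetime format and the row/column layouts are assumptions chosen for the example. Removing abnormal samples is left out because the README does not say how they are detected.

```python
from datetime import datetime

def expand_datetime(value, fmt="%Y-%m-%d %H:%M:%S"):
    """Split one timestamp string into calendar parts."""
    dt = datetime.strptime(value, fmt)
    return {"year": dt.year, "month": dt.month, "day": dt.day,
            "hour": dt.hour, "dayofweek": dt.weekday()}  # Monday is 0

def drop_constant_columns(columns):
    """Drop columns with a single distinct value (nunique equal to 1)."""
    return {name: vals for name, vals in columns.items() if len(set(vals)) > 1}

def drop_missing_label(rows, target):
    """Remove rows whose label is None or NaN."""
    def is_missing(v):
        return v is None or v != v  # NaN is the only value not equal to itself
    return [r for r in rows if not is_missing(r[target])]
```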

## Feature engineering
- 1-1 join features
```
```

- 1-M join features
```
- time diff features
- aggregate statistics features
```

- count features
```
For each feature column to operate on, use the number of samples in the whole dataset that share the current sample's value as the feature.
```

- target encoding features

- statistical features
```
Statistical features are extracted with two nested for loops.
The outer loop iterates over all selected grouping features (group_col),
the inner loop iterates over all selected aggregation features (agg_col).
Inside the inner loop,
for a categorical feature the computed statistic is nunique,
for a numeric feature the computed statistics are [median, std, sum, max, min, mean].
```

- shift features
```
```
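
The count and statistical features described above can be sketched without any dependencies. The helpers `count_feature` and `group_stats` are hypothetical names and handle one column pair at a time; the two nested loops from the description would call `group_stats` once per (group_col, agg_col) pair. Using the sample standard deviation, and returning `None` for single-row groups, is an assumption.

```python
from collections import Counter, defaultdict
import statistics

def count_feature(values):
    """For each row, the number of rows in the whole column sharing its value."""
    counts = Counter(values)
    return [counts[v] for v in values]

def group_stats(group_values, agg_values, categorical=False):
    """Per-row statistics of agg_values within the row's group.

    Categorical agg columns get nunique; numeric ones get
    median, std, sum, max, min and mean.
    """
    groups = defaultdict(list)
    for g, a in zip(group_values, agg_values):
        groups[g].append(a)
    if categorical:
        stats = {g: {"nunique": len(set(v))} for g, v in groups.items()}
    else:
        stats = {g: {
            "median": statistics.median(v),
            # Sample std needs two points; a single-row group gets None here.
            "std": statistics.stdev(v) if len(v) > 1 else None,
            "sum": sum(v), "max": max(v), "min": min(v),
            "mean": statistics.fmean(v),
        } for g, v in groups.items()}
    # One output row per input row, in the original order.
    return [stats[g] for g in group_values]
```

Both helpers return one value per input row in the original order, which matches the requirement that engineered features line up with the main table.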

## Model training
```
AutoX currently supports the following models:
1. LightGBM
2. XGBoost
3. TabNet
```

## Model ensembling
```
AutoX supports the following two ensembling methods. Bagging is used by default.
1. Stacking;
2. Bagging.
```
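
The README does not spell out how the ensembling is carried out. Assuming that bagging here means averaging the predictions of the individual models, the aggregation step could look like the sketch below. The function name and the prediction values are made up for illustration.

```python
def average_predictions(predictions, weights=None):
    """Blend per-model prediction lists into one list by (weighted) averaging."""
    n_models = len(predictions)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    if len(weights) != n_models:
        raise ValueError("one weight per model is required")
    total = sum(weights)
    # zip(*predictions) yields, for each sample, the tuple of all model outputs.
    return [sum(w * p for w, p in zip(weights, row)) / total
            for row in zip(*predictions)]

lgb_pred = [0.2, 0.8, 0.5]   # made-up predictions of a first model
xgb_pred = [0.4, 0.6, 0.7]   # made-up predictions of a second model
blend = average_predictions([lgb_pred, xgb_pred])
```

Stacking differs in that a second-level model is trained on the out-of-fold predictions of the first-level models instead of averaging them.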
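
# Summary of competition tricks
|Competition|Tricks|
|------|------|
|kaggle criteo|Bucket feature columns with a very large nunique. For example, for a feature with nunique above 10000, hash the value, truncate the hash to 4 digits, then apply label_encode.|
|zhidemai|article_id implicitly carries time information, so add a rank feature on article_id. For example, groupby(['date'])['article_id'].rank().|
|kaggle StumbleUpon|Use the text columns as input and train a BERT model.|
|kaggle ventilator|shift, diff and cumsum features aggregated by breath_id|
|kaggle Santander|Identify the fake test samples, remove them, then merge the rest with train and build global count features. Identification method: a real sample has at least one feature whose value is globally unique, while a fake sample has no globally unique feature value. Reference: [List of Fake Samples and Public/Private LB split](https://www.kaggle.com/yag320/list-of-fake-samples-and-public-private-lb-split)|
|kaggle Allstate Claims Severity|Train the model on log1p of the label and apply expm1 to the predictions; this lowers the mae by more than 35|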
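
The Allstate row relies on `log1p` and `expm1` being exact inverses, so a model can be trained in log space and its predictions mapped back. A minimal sketch of that round trip follows. The label values are made up and the "model" is a stand-in that returns the log-space labels unchanged.

```python
import math

def to_log_space(labels):
    """Compress a right-skewed, non-negative target with log(1 + y)."""
    return [math.log1p(y) for y in labels]

def from_log_space(preds):
    """Map log-space predictions back to the original scale with exp(p) - 1."""
    return [math.expm1(p) for p in preds]

labels = [120.0, 950.0, 40000.0]     # made-up, right-skewed claim amounts
log_labels = to_log_space(labels)    # train the model on these instead
# Stand-in for a model: pretend it predicts the log-space labels exactly.
log_preds = log_labels
preds = from_log_space(log_preds)    # back on the original scale
```

The transform reduces the influence of very large targets during training, which is one plausible reason for the mae improvement reported in the table.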
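
The criteo row describes bucketing a high-cardinality column by hashing, truncating and label encoding. The sketch below is one way to read that recipe. Using md5 and a modulo to keep 4 decimal digits is an assumption, and both function names are hypothetical.

```python
import hashlib

def hash_bucket(value, n_digits=4):
    """Map a value to a stable bucket id with at most n_digits decimal digits."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % (10 ** n_digits)

def label_encode(values):
    """Replace each distinct value by a consecutive integer code."""
    codes = {}
    # setdefault returns the existing code, or assigns the next free one.
    return [codes.setdefault(v, len(codes)) for v in values]

raw = ["device_a", "device_b", "device_a"]   # made-up high-cardinality values
buckets = [hash_bucket(v) for v in raw]
encoded = label_encode(buckets)
```

A cryptographic hash is used instead of Python's built-in `hash` because the latter is salted per process for strings, so bucket ids would change between runs.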