Merge pull request #19 from ArtificialZeng/patch-2
Create README_EN.md
enjoysport2022 authored Apr 27, 2022
2 parents ed4d522 + 66842eb commit 2d7a3f4
Showing 1 changed file with 283 additions and 0 deletions.
283 changes: 283 additions & 0 deletions autox/autox_competition/README_EN.md
@@ -0,0 +1,283 @@
[English](./README_EN.md) | 简体中文

# What is autox_competition
An automated machine learning tool developed by AutoX for tabular data mining competitions.

# Get started quickly
## Use the following functions to get prediction results in one click:
```
autox.get_submit # regression or classification problem
autox.get_submit_ts # time series dataset
```
## Demo (by data type)
#### Binary classification problem
Kaggle Santander - AutoX solution:
- [colab](https://colab.research.google.com/drive/1HKOr3vK_Ty3Dftw2JF4SJWFtwxsBfcLz?usp=sharing)
- [kaggle-kernel](https://www.kaggle.com/poteman/autox-tutorial-santander/)

2021 China Information Geek Competition - Loan Anti-Fraud - AutoX solution (2021神州信息极客大赛-贷款反欺诈-AutoX解决方案):
- [datafountain notebook](https://work.datafountain.cn/forum?type=3&id=5843)

#### Regression problem
DC Rent Forecast - AutoX solution (DC租金预测-AutoX解决方案):
- [colab](https://colab.research.google.com/drive/1SxK_-_6oAE8OzDitXCy2JM29F9UE0Ujj?usp=sharing)
- [DClab](https://www.dclab.run/project_content.html?type=myproject&id=5393)

#### Time series prediction problem (multi-table)
2021 Alibaba Cloud Supply Chain Competition - AutoX solution (2021阿里云供应链大赛-AutoX解决方案):
- [colab](https://colab.research.google.com/drive/1cw5ynTPqc5RWbVjQdvbnDHkq_1rTlxqe?usp=sharing)
- [Alibaba Tianchi: solving the supply chain forecasting problem with two lines of code](https://tianchi.aliyun.com/forum/postDetail?spm=5176.12586969.1002.3.6b172672PZvjjb&postId=306418)
- [Alibaba supply chain forecasting: common features for time series forecasting problems](https://tianchi.aliyun.com/forum/postDetail?spm=5176.12586969.1002.3.6b17698aXo3jYP&postId=308014)

#### Tabular dataset that contains image data
Kaggle PetFinder - AutoX solution:
- [AutoML for PetFinder: predicting the popularity of pet photos](https://www.kaggle.com/poteman/automl-for-petfinder-autox?scriptVersionId=81076747)


## Demo (by usage scenario)
#### Marketing scenario
[Predict whether bank customers will subscribe to a term deposit](https://www.kaggle.com/poteman/automl-for-bank-autox)
#### Risk control scenario
[Loan Default Prediction](https://www.kaggle.com/poteman/automl-for-loan-autox/)
#### Recommendation scenario
[Predict whether a mobile ad will be clicked](https://www.kaggle.com/poteman/automl-for-avazu-autox)



## Use the following functions to get the top-k most important features in one click:
```
autox.get_top_features # regression or classification problem
autox.get_top_features_ts # time series dataset
```
## Demo: getting the top-k important features

Kaggle Allstate, getting the top-k important features:
- [autox_get_top_features_Allstate](https://www.kaggle.com/poteman/autox-get-top-features-allstate?scriptVersionId=81484541)

# Contents
<!-- TOC -->

- [What is autox_competition](#what-is-autox_competition)
- [Get started quickly](#get-started-quickly)
- [Contents](#contents)
- [Result comparison](#result-comparison)
- [Summary of competition score-boosting tricks](#summary-of-competition-score-boosting-tricks)
- [Feature engineering](./feature_engineer/README.md)
- [Feature selection](./feature_selection/README.md)

<!-- /TOC -->

# Result comparison
|data_type | single-or-multi | data_name | metric | AutoX | AutoGluon | H2O |
|----- | ------------- | ----------- |---------------- |---------------- | ----------------|----------------|
|binary classification | single-table | [Springleaf](https://www.kaggle.com/c/springleaf-marketing-response/) | auc | 0.78865 | 0.61141 | 0.78186 |
|binary classification | single-table |[stumbleupon](https://www.kaggle.com/c/stumbleupon/) | auc | 0.87177 | 0.81025 | 0.79039 |
|binary classification | single-table |[santander](https://www.kaggle.com/c/santander-customer-transaction-prediction/) | auc | 0.89196 | 0.64643 | 0.88775 |
|binary classification | multi-table |[IEEE](https://www.kaggle.com/c/ieee-fraud-detection/) | accuracy | 0.920809 | 0.724925 | 0.907818 |
|regression | single-table |[ventilator](https://www.kaggle.com/c/ventilator-pressure-prediction/) | mae | 0.755 | 8.434 | 4.221 |
|regression | single-table |[Allstate Claims Severity](https://www.kaggle.com/c/allstate-claims-severity)| mae | 1137.07885 | 1173.35917 | 1163.12014 |
|regression | single-table |[zhidemai](https://www.automl.ai/competitions/19) | mse | 1.0034 | 1.9466 | 1.1927|
|regression | single-table |[Tabular Playground Series - Aug 2021](https://www.kaggle.com/c/tabular-playground-series-aug-2021) | rmse | 7.87731 | 10.3944 | 7.8895|
|regression | single-table |[House Prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/) | rmse | 0.13043 | 0.13104 | 0.13161 |
|regression | single-table |[Restaurant Revenue](https://www.kaggle.com/c/restaurant-revenue-prediction/)| rmse | 2133204.32146 | 31913829.59876 | 28958013.69639 |
|regression | multi-table |[Elo Merchant Category Recommendation](https://www.kaggle.com/c/elo-merchant-category-recommendation/)| rmse | 3.72228 | 3.80801 | 22.88899 |
|regression-ts | single-table |[Demand Forecasting](https://www.kaggle.com/c/demand-forecasting-kernels-only/)| smape | 13.79241 | 25.39182 | 18.89678 |
|regression-ts | multi-table |[Walmart Recruiting](https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/)| wmae | 4660.99174 | 5024.16179 | 5128.31622 |
|regression-ts | multi-table |[Rossmann Store Sales](https://www.kaggle.com/c/rossmann-store-sales/)| RMSPE | 0.13850 | 0.20453 | 0.35757 |


# Data types
- cat: categorical, an unordered categorical variable
- ord: ordinal, an ordered categorical variable
- num: numeric, a continuous variable
- datetime: a time variable stored as a datetime
- timestamp: a time variable stored as a timestamp
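
The README does not say how columns are assigned to these types. As a rough illustration only, the sketch below guesses a label from the pandas dtype; `infer_column_type` and its `max_categories` threshold are hypothetical names and values, not part of the AutoX API, and the `ord` and `timestamp` cases are left out because they usually need domain knowledge.

```python
import pandas as pd


def infer_column_type(series: pd.Series, max_categories: int = 20) -> str:
    """Guess one of the type labels above for a column (hypothetical heuristic)."""
    if pd.api.types.is_datetime64_any_dtype(series):
        return "datetime"
    if pd.api.types.is_numeric_dtype(series):
        # Integer columns with few distinct values look like coded categories.
        if pd.api.types.is_integer_dtype(series) and series.nunique() <= max_categories:
            return "cat"
        return "num"
    # Strings and everything else are treated as unordered categories.
    return "cat"


df = pd.DataFrame({
    "city": ["a", "b", "a", "c"],
    "price": [1.5, 2.0, 3.25, 4.0],
    "signup": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"]),
})
feature_type = {col: infer_column_type(df[col]) for col in df.columns}
print(feature_type)
```

A mapping of this shape (column name to type label) is the kind of information the `feature_type` entry of a dataset description would hold.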

# Table relationships
```
"relations": [  # table relations (four types are supported: 1-1, 1-M, M-1, M-M)
    {
        "related_to_main_table": "true",  # whether this relation involves the main table
        "left_entity": "overdue",         # left table name
        "left_on": ["new_user_id"],       # join key(s) of the left table
        "right_entity": "userinfo",       # right table name
        "right_on": ["new_user_id"],      # join key(s) of the right table
        "type": "1-1"                     # relationship type between the left and right tables
    },
    {
        "related_to_main_table": "true",
        "left_entity": "overdue",
        "left_on": ["new_user_id"],
        "left_time_col": "flag1",
        "right_entity": "bank",
        "right_on": ["new_user_id"],
        "right_time_col": "flag1",
        "type": "1-M"
    },
    {
        "related_to_main_table": "true",
        "left_entity": "overdue",
        "left_on": ["new_user_id"],
        "left_time_col": "flag1",
        "right_entity": "browse",
        "right_on": ["new_user_id"],
        "right_time_col": "flag1",
        "type": "1-M"
    },
    {
        "related_to_main_table": "true",
        "left_entity": "overdue",
        "left_on": ["new_user_id"],
        "left_time_col": "flag1",
        "right_entity": "bill",
        "right_on": ["new_user_id"],
        "right_time_col": "flag1",
        "type": "1-M"
    }
]
```
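
To make the relation entries more concrete, here is a minimal pandas sketch of how a 1-1 and a 1-M relation could be turned into columns on the main table. It is an assumption about the general technique, not AutoX's implementation: `join_relation` is a hypothetical helper, the toy tables are invented, and the time columns (`left_time_col`, `right_time_col`) are ignored for brevity.

```python
import pandas as pd


def join_relation(left, right, relation):
    """Attach right-table information to the left table for one relation (sketch)."""
    left_on, right_on = relation["left_on"], relation["right_on"]
    if relation["type"] == "1-M":
        # Many right rows per left row: aggregate first so the join cannot add rows.
        numeric_cols = right.select_dtypes("number").columns.difference(right_on)
        right = right.groupby(right_on)[list(numeric_cols)].agg(["mean", "max"])
        right.columns = [f"{relation['right_entity']}_{col}_{stat}" for col, stat in right.columns]
        right = right.reset_index()
    return left.merge(right, how="left", left_on=left_on, right_on=right_on)


# Invented toy tables that reuse the table names from the config above.
tables = {
    "overdue": pd.DataFrame({"new_user_id": [1, 2, 3], "label": [0, 1, 0]}),
    "userinfo": pd.DataFrame({"new_user_id": [1, 2, 3], "age": [30, 41, 25]}),
    "bank": pd.DataFrame({"new_user_id": [1, 1, 2], "amount": [10.0, 30.0, 5.0]}),
}
relations = [
    {"left_entity": "overdue", "left_on": ["new_user_id"],
     "right_entity": "userinfo", "right_on": ["new_user_id"], "type": "1-1"},
    {"left_entity": "overdue", "left_on": ["new_user_id"],
     "right_entity": "bank", "right_on": ["new_user_id"], "type": "1-M"},
]

wide = tables["overdue"]
for relation in relations:
    wide = join_relation(wide, tables[relation["right_entity"]], relation)
print(wide)
```

Aggregating the "many" side before the left join keeps the row count and row order of the main table unchanged, which is the property the pipeline description below asks of every feature.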

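# Pipeline logic
- 1. Initialize the AutoX class
```
1.1 Read the data
1.2 Concatenate train and test
1.3 Identify the type of each column in the tables
1.4 Preprocess the data
```
- 2. Feature engineering
```
Feature engineering covers single-table features and multi-table features.
Every feature engineering class provides the following functions:
1. Automatically select the features the current operation applies to;
2. Inspect the selected features;
3. Modify the features the current operation applies to;
4. Compute the features, returning them with the same number of rows and the same row order as the main table.
```
- 3. Feature merging
```
Merge the constructed features; the number of rows stays the same while the number of columns grows, and one large wide table is returned.
```
- 4. Splitting into training and test sets
```
Split the wide table into a training set and a test set.
```
- 5. Feature filtering
```
Filter the constructed features based on how each feature column is distributed in train and test, to avoid overfitting.
```
- 6. Model training
```
Train the model on the filtered wide-table features.
The model classes provide the following functions:
1. Inspect the model's default parameters;
2. Train the model;
3. Tune the model's hyperparameters;
4. Inspect the model's feature importances;
5. Predict with the model.
```
- 7. Model prediction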
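
The feature filtering step (step 5) is described only in general terms. One common way to compare a feature's train and test distributions is the two-sample Kolmogorov-Smirnov statistic; the sketch below uses it as an assumed stand-in, since the README does not name the criterion AutoX applies. `filter_shifted_features` and the `0.5` threshold are hypothetical.

```python
import numpy as np
import pandas as pd


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))


def filter_shifted_features(train, test, features, threshold=0.5):
    """Keep only the features whose train and test distributions look similar."""
    kept = []
    for col in features:
        stat = ks_statistic(train[col].dropna().to_numpy(), test[col].dropna().to_numpy())
        if stat < threshold:  # the threshold is an arbitrary illustrative value
            kept.append(col)
    return kept


rng = np.random.default_rng(0)
train = pd.DataFrame({"stable": rng.normal(size=500), "drifting": np.arange(500.0)})
test = pd.DataFrame({"stable": rng.normal(size=500), "drifting": np.arange(500.0) + 1000})
kept = filter_shifted_features(train, test, ["stable", "drifting"])
print(kept)
```

A feature such as `drifting`, whose test values never overlap the training range, is a typical source of overfitting: the model learns splits on values it will not see at prediction time.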

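# The AutoX class
```
The AutoX class automatically manages the datasets and the dataset information for the user.
Initializing the AutoX class performs the following steps:
1. Read the data;
2. Concatenate train and test;
3. Identify the type of each column in the tables;
4. Preprocess the data.
```
## Attributes
### info_: stores information about the dataset.
- info_['id']: List, the key column(s) that uniquely identify a sample
- info_['target']: String, the label column of the table
- info_['shape_of_train']: Int, number of samples in the train dataset
- info_['shape_of_test']: Int, number of samples in the test dataset
- info_['feature_type']: Dict of Dict, the data type of each feature column in each table
- info_['train_name']: String, name of the main training table
- info_['test_name']: String, name of the main test table

### dfs_: stores all DataFrames, both the original tables and the constructed ones.
- dfs_['train_test']: the train and test data after concatenation
- dfs_['FE_feature_name']: data constructed by a feature engineering step, for example FE_count or FE_groupby
- dfs_['FE_all']: the dataset that merges the original features with all engineered features

## Methods
- concat_train_test: concatenates the training and test sets; usually called right after reading the data
- split_train_test: separates the training and test sets; usually called after feature engineering is finished
- get_submit: returns the prediction results (internally runs the complete machine learning pipeline, including data preprocessing, feature engineering, model training, hyperparameter tuning, model ensembling, and model prediction)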
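
The concatenate-then-split pattern behind the first two methods can be sketched with plain pandas. The functions below borrow the method names for readability but are standalone, hypothetical helpers with invented signatures; they are not the AutoX methods themselves.

```python
import pandas as pd


def concat_train_test(train, test):
    """Stack train on top of test; also return the train row count for the later split."""
    combined = pd.concat([train, test], axis=0, ignore_index=True)
    return combined, len(train)


def split_train_test(combined, shape_of_train):
    """Undo the concatenation using the stored number of training rows."""
    train = combined.iloc[:shape_of_train].reset_index(drop=True)
    test = combined.iloc[shape_of_train:].reset_index(drop=True)
    return train, test


train = pd.DataFrame({"id": [1, 2, 3], "x": [1.0, 2.0, 3.0], "target": [0, 1, 0]})
test = pd.DataFrame({"id": [4, 5], "x": [4.0, 5.0]})

combined, shape_of_train = concat_train_test(train, test)
combined["x_times_2"] = combined["x"] * 2  # a feature computed once for both sets
new_train, new_test = split_train_test(combined, shape_of_train)
print(new_train.shape, new_test.shape)
```

Computing features on the combined frame guarantees that train and test get identical columns; the test rows simply carry a missing target until prediction.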

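# Details of each operation in the AutoX pipeline:

## Reading data
```
Reads all files under the given path. By default, the main training table and the main test table are concatenated
before the subsequent data preprocessing and feature engineering steps, and the training and test sets are split apart again before model prediction starts.
```

## Data preprocessing
```
- Parse year, month, day, hour, day of week, and similar information from time columns
- Before each training run, drop invalid features (nunique equal to 1) from the data fed to the model
- Remove abnormal samples and samples whose label is NaN
```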
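
The three preprocessing rules can be illustrated with a few lines of pandas. This is a sketch under assumptions: `preprocess` is a hypothetical helper applied to a training frame, the suffixes of the generated columns are invented, and the actual AutoX code may order or name things differently.

```python
import pandas as pd


def preprocess(df, time_col, target):
    """Expand a time column, drop rows without a label, then drop constant columns."""
    df = df.copy()
    ts = pd.to_datetime(df[time_col])
    df[f"{time_col}_year"] = ts.dt.year
    df[f"{time_col}_month"] = ts.dt.month
    df[f"{time_col}_day"] = ts.dt.day
    df[f"{time_col}_hour"] = ts.dt.hour
    df[f"{time_col}_dayofweek"] = ts.dt.dayofweek  # Monday is 0
    df = df.dropna(subset=[target])
    # A column with a single distinct value carries no information for the model.
    constant_cols = [c for c in df.columns if c != target and df[c].nunique() <= 1]
    return df.drop(columns=constant_cols)


raw = pd.DataFrame({
    "ts": ["2020-12-31 10:00:00", "2021-01-02 11:30:00", "2021-02-03 12:00:00"],
    "const": [1, 1, 1],
    "x": [0.1, 0.2, 0.3],
    "y": [1.0, None, 0.0],
})
clean = preprocess(raw, time_col="ts", target="y")
print(clean)
```

Dropping label-less rows before checking for constant columns matters: a column can become constant only after those rows are removed.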

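## Feature engineering
- 1-1 join features
```
```

- 1-M join features
```
- time diff features
- aggregate statistics features
```

- count features
```
For each column to be processed, the number of samples in the whole dataset that share the current sample's value in that column is used as a feature.
```

- target encoding features

- statistical features
```
Statistical features are extracted with two nested for loops.
The outer loop iterates over all selected grouping features (group_col),
the inner loop iterates over all selected aggregation features (agg_col).
Inside the inner loop,
for a categorical feature, the computed statistic is nunique;
for a numeric feature, the computed statistics are [median, std, sum, max, min, mean].
```

- shift features
```
```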
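
The count, statistical, and shift features described above map naturally onto pandas operations. The sketch below is one possible reading of those descriptions; the function names, the `__` column naming scheme, and the toy data are assumptions, and target encoding is not covered.

```python
import pandas as pd

NUMERIC_STATS = ["median", "std", "sum", "max", "min", "mean"]


def count_feature(df, col):
    """Number of rows in the whole frame that share this row's value of `col`."""
    return df[col].map(df[col].value_counts())


def groupby_features(df, group_cols, agg_cols, categorical):
    """Two nested loops: group columns outside, aggregated columns inside."""
    feats = pd.DataFrame(index=df.index)
    for group_col in group_cols:
        for agg_col in agg_cols:
            if agg_col == group_col:
                continue
            stats = ["nunique"] if agg_col in categorical else NUMERIC_STATS
            for stat in stats:
                # transform keeps one value per original row, in the original order
                feats[f"{group_col}__{agg_col}__{stat}"] = df.groupby(group_col)[agg_col].transform(stat)
    return feats


df = pd.DataFrame({
    "shop": ["a", "a", "b", "a"],
    "item": ["x", "y", "x", "x"],
    "sales": [1.0, 3.0, 5.0, 8.0],
})
shop_count = count_feature(df, "shop")
feats = groupby_features(df, ["shop"], ["item", "sales"], categorical={"item"})
# Shift feature: previous sales value within the same shop (rows assumed time-ordered).
sales_lag1 = df.groupby("shop")["sales"].shift(1)
print(feats.columns.tolist())
```

Using `transform` rather than `agg` returns one value per input row, so the features line up with the main table without a second join.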

## Model training
```
AutoX currently supports the following models:
1. LightGBM
2. XGBoost
3. TabNet
```

## Model ensembling
```
AutoX supports the following two ensembling methods; by default, Bagging is used.
1. Stacking;
2. Bagging.
```
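
For readers unfamiliar with the term, the sketch below shows bagging in its textbook form: fit the same learner on bootstrap resamples and average the predictions. It is an illustration under assumptions, not AutoX's ensembling code; a least-squares line stands in for the gradient-boosting models, and all function names are hypothetical.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit with an intercept; a stand-in for a real base model."""
    X1 = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef


def predict_linear(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef


def bagging_predict(X_train, y_train, X_test, n_models=10, seed=0):
    """Average the predictions of models fitted on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X_train), size=len(X_train))  # sample rows with replacement
        coef = fit_linear(X_train[idx], y_train[idx])
        preds.append(predict_linear(coef, X_test))
    return np.mean(preds, axis=0)


rng = np.random.default_rng(42)
X_train = np.linspace(0.0, 1.0, 50).reshape(-1, 1)
y_train = 2.0 * X_train[:, 0] + 1.0 + rng.normal(scale=0.05, size=50)
X_test = np.array([[0.5], [2.0]])
bagged = bagging_predict(X_train, y_train, X_test)
print(bagged)
```

Stacking differs in the last step: instead of averaging, the base models' out-of-fold predictions become the input features of a second-level model.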


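# Summary of competition score-boosting tricks:
|Competition|Magics|
|------|------|
|kaggle criteo|Bucket feature columns that have a very large nunique. For example, for features with nunique greater than 10000, hash the value, truncate the hash to 4 digits, and then apply label_encode.|
|zhidemai|article_id implicitly carries time information, so add a rank feature on article_id. For example, groupby(['date'])['article_id'].rank().|
|kaggle StumbleUpon|Use the text columns as input and train a BERT model.|
|kaggle ventilator|shift, diff, and cumsum features aggregated by breath_id|
|kaggle Santander|Identify the fake test samples, remove them, merge the remaining test samples with train, and build global count features. Identification method: a real sample has at least one feature whose value is globally unique, whereas a fake sample has no globally unique feature value. Reference: [List of Fake Samples and Public/Private LB split](https://www.kaggle.com/yag320/list-of-fake-samples-and-public-private-lb-split)|
|kaggle Allstate Claims Severity|Train the model on log1p of the label and apply expm1 to the predictions; this lowers MAE by more than 35.|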
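
The Santander row describes a small algorithm, so a sketch may help. The code below marks a test row as real when at least one of its feature values occurs exactly once in that column, then builds count features from train plus the real test rows only. Computing uniqueness within the test set is an assumption taken from the referenced notebook's idea; `find_real_samples` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def find_real_samples(test, feature_cols):
    """True where a row holds at least one value that is unique within its column."""
    has_unique = np.zeros(len(test), dtype=bool)
    for col in feature_cols:
        counts = test[col].map(test[col].value_counts())
        has_unique |= (counts == 1).to_numpy()
    return has_unique


feature_cols = ["f1", "f2"]
train = pd.DataFrame({"f1": [1.0, 3.0], "f2": [5.0, 9.0]})
test = pd.DataFrame({
    "f1": [1.0, 2.0, 2.0, 1.0],
    "f2": [5.0, 6.0, 7.0, 6.0],
})
is_real = find_real_samples(test, feature_cols)

# Count features are computed on train plus the real test rows only.
reference = pd.concat([train, test[is_real]], ignore_index=True)
for col in feature_cols:
    test[f"{col}_count"] = test[col].map(reference[col].value_counts()).fillna(0).astype(int)
print(is_real.tolist())
```

Excluding the fake rows keeps synthetic duplicates from inflating the counts, which is why they are removed before the counts are built.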
