2021 WeChat Big Data Challenge: https://algo.weixin.qq.com/
For a detailed write-up, see: https://zhuanlan.zhihu.com/p/399218898
Python: 3.6
tensorflow-gpu==1.15 # GPU version of TensorFlow
sentencepiece
gensim==3.8.3
pandas
PyYAML
tqdm
matplotlib
sklearn
recordclass
numba
./
├── README.md
├── requirements.txt, Python package requirements
├── init.sh, script for installing package requirements
├── train.sh, script for preparing train/inference data and training models, including pretrained models
├── inference.sh, script for inference
├── src
│ ├── prepare, code for preparing the train/test datasets
│ ├── train, code for training
│ ├── inference.py, main function for inference on the test dataset
│ ├── model, code for the model architecture
├── data
│ ├── wedata, dataset of the competition
│ │ ├── wechat_algo_data1, preliminary dataset
│ │ ├── wechat_algo_data2, semi-final dataset
│ ├── submission, prediction results after running inference.sh
│ ├── model, model files (e.g. TensorFlow checkpoints)
│ ├── preprocess, preprocessed data
│ ├── deepwalk, data for the DeepWalk algorithm
│ ├── match_tower, training samples for the Match Tower model
├── config, (optional) configuration files for your method (e.g. yaml file)
chmod u+x init.sh
chmod u+x inference.sh
./init.sh
./inference.sh
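ID features: user ID, device, feedid, authorid, videoplayseconds, description, bgm_song_id, bgm_singer_id, manual_keyword_list, manual_tag_list.
Statistical features: for each user, feed, and author, the total count and the per-label counts over the previous day and over the previous n days, together with their mean and standard deviation.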
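The windowed statistics described above could be computed along these lines. This is a minimal sketch using only the standard library, not the repository's actual code: the record layout `(key, day, label)`, the function name `window_stats`, and the choice to take the mean and standard deviation over the per-day label sums inside the window are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def window_stats(records, key, day, n):
    """Aggregate label statistics for `key` over the n days before `day`.

    records: iterable of (key, day, label) tuples with label in {0, 1}
             (hypothetical layout, chosen for this sketch).
    Only days in [day - n, day - 1] are used, so the target day itself
    never contributes to its own features.
    """
    per_day_count = defaultdict(int)  # number of records per day
    per_day_pos = defaultdict(int)    # number of positive labels per day
    for k, d, label in records:
        if k == key and day - n <= d < day:
            per_day_count[d] += 1
            per_day_pos[d] += label

    days = range(day - n, day)
    pos_by_day = [per_day_pos[d] for d in days]
    return {
        "count": sum(per_day_count[d] for d in days),
        "label_sum": sum(pos_by_day),
        # Assumption: mean/std are taken over per-day label sums.
        "label_mean_per_day": mean(pos_by_day),
        "label_std_per_day": pstdev(pos_by_day),
    }
```

Calling `window_stats(records, "u1", day=3, n=1)` gives the "previous day" variant, and a larger `n` gives the "previous n days" variant; the same function can be reused with feed or author IDs as the key.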
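- Match Tower model: takes the ID features as input and passes them through a multi-layer MLP.
- ALBERT: the history sequence is passed through an ALBERT model, which produces a logit and a sequence embedding.
- PLE model: takes the ID features, the statistical features, and the sequence embedding produced by ALBERT as input, and outputs logits for the 7 labels. These are then fused with the logits from Match Tower and ALBERT to produce the final predicted probabilities for the 7 labels.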
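The exact fusion rule is not specified above. As a minimal sketch of one common choice, the logits from the three models can be summed per label and passed through a sigmoid. The additive rule, the function name `fuse_logits`, and the assumption that every model supplies one logit per label are illustrative only and may differ from what the repository does.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def fuse_logits(ple, match_tower, albert):
    """Sum per-label logits from the three models and map to probabilities.

    Each argument is a sequence with one logit per label
    (assumed shape; 7 labels in this competition).
    """
    if not (len(ple) == len(match_tower) == len(albert)):
        raise ValueError("all inputs must have one logit per label")
    return [sigmoid(p + m + a) for p, m, a in zip(ple, match_tower, albert)]
```

Fusing in logit space rather than averaging probabilities lets a confident model shift the final score more strongly, which is one reason additive fusion is often tried first.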