[feature]: add docs for ev and sample_weight (#305)
* add docs for ev and sample_weight
chengmengli06 authored Nov 14, 2022
1 parent 272bd12 commit a87e641
Showing 5 changed files with 195 additions and 11 deletions.
9 changes: 5 additions & 4 deletions docs/source/eval.md
@@ -90,7 +90,7 @@ python -m easy_rec.python.eval --pipeline_config_path dwd_avazu_ctr_deepmodel.co
pai -name easy_rec_ext -project algo_public
-Dcmd=evaluate
-Dconfig=oss://easyrec/config/MultiTower/dwd_avazu_ctr_deepmodel_ext.config
-Dtables=odps://pai_online_project/tables/dwd_avazu_ctr_deepmodel_test
-Deval_tables=odps://pai_online_project/tables/dwd_avazu_ctr_deepmodel_test
-Dcluster='{"worker" : {"count":1, "cpu":1000, "gpu":100, "memory":40000}}'
-Dmodel_dir=oss://easyrec/ckpt/MultiTower
-Darn=acs:ram::xxx:role/xxx
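@@ -100,20 +100,21 @@ pai -name easy_rec_ext -project algo_public

- -Dcmd: evaluate, i.e. run model evaluation
- -Dconfig: same as for training
- -Dtables: only the test tables need to be specified
- -Deval_tables: specifies the test tables
- -Dcluster: evaluation needs no PS nodes; a single worker node is enough
- -Dmodel_dir: if specified, overrides the model_dir in the config; typically used for periodically scheduled jobs
- -Dcheckpoint_path: evaluates the given checkpoint, e.g. oss://easyrec/ckpt/MultiTower/model.ckpt-1000. If omitted, the latest checkpoint in model_dir is used.
- On the PAI internal version, arn and ossHost need not be specified separately; they are passed inside -Dbuckets
  - -Dbuckets=oss://easyrec/?role_arn=acs:ram::xxx:role/ev-ext-test-oss&host=oss-cn-beijing-internal.aliyuncs.com
- -Deval_result_path: name of the file in which the evaluation result is saved; defaults to eval_result.txt, and the file is saved under model_dir.

#### Distributed evaluation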

```sql
pai -name easy_rec_ext -project algo_public
-Dcmd=evaluate
-Dconfig=oss://easyrec/config/MultiTower/dwd_avazu_ctr_deepmodel_ext.config
-Dtables=odps://pai_online_project/tables/dwd_avazu_ctr_deepmodel_test
-Deval_tables=odps://pai_online_project/tables/dwd_avazu_ctr_deepmodel_test
-Dcluster='{"ps":{"count":1, "cpu":1000}, "worker" : {"count":3, "cpu":1000, "gpu":100, "memory":40000}}'
-Dmodel_dir=oss://easyrec/ckpt/MultiTower
-Dextra_params=" --distribute_eval True"
@@ -124,4 +125,4 @@ pai -name easy_rec_ext -project algo_public

- --distribute_eval: run evaluation in distributed mode

The evaluation result is written to the file "eval_result.txt" under model_dir
Evaluation result: written under model_dir to the file named by -Deval_result_path (default: eval_result.txt)
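
The rule for where the result lands (file name from -Deval_result_path, default eval_result.txt, directory model_dir) can be sketched in Python. This is only an illustration: `resolve_eval_result_path` is a hypothetical helper, not an EasyRec function, and it assumes the path is formed by a plain POSIX-style join.

```python
import posixpath

DEFAULT_EVAL_RESULT_NAME = "eval_result.txt"


def resolve_eval_result_path(model_dir, eval_result_path=None):
    """Hypothetical helper: return the location of the evaluation result.

    Assumes the file name is simply joined onto model_dir; posixpath keeps
    forward slashes so oss:// style locations stay intact.
    """
    name = eval_result_path or DEFAULT_EVAL_RESULT_NAME
    return posixpath.join(model_dir, name)


print(resolve_eval_result_path("oss://easyrec/ckpt/MultiTower"))
print(resolve_eval_result_path("oss://easyrec/ckpt/MultiTower", "eval_20221114.txt"))
```

The second call uses a made-up file name purely to show the override.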
42 changes: 35 additions & 7 deletions docs/source/feature/feature.rst
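@@ -92,8 +92,9 @@ IdFeature: discrete-valued / ID features
- vocab\_list:
specifies the vocabulary inline; suited to features with a small, enumerable set of values, such as day of week, month, or zodiac sign

- vocab\_file:
specifies the vocabulary through a file, for larger vocabularies. When submitting a tf job to the PAI cluster, the vocabulary file can be stored in OSS.
- vocab\_file: specifies the vocabulary through a file, for larger vocabularies.
  - Format: one word per line
  - Path: when submitting a tf job to the PAI cluster, the vocabulary file can be stored in OSS.

- NOTE: only one of hash\_bucket\_size, num\_buckets, vocab\_list and
vocab\_file may be specified; they cannot be combined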
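
A minimal Python sketch of the two points above: writing a vocabulary file with one word per line, and checking that a feature sets at most one of the four options. Both function names are hypothetical, and the feature is modelled as a plain dict rather than EasyRec's protobuf config.

```python
import os
import tempfile

VOCAB_OPTIONS = ("hash_bucket_size", "num_buckets", "vocab_list", "vocab_file")


def write_vocab_file(path, values):
    """Write unique, non-empty tokens to `path`, one per line, in first-seen order."""
    seen = set()
    with open(path, "w", encoding="utf-8") as f:
        for value in values:
            token = str(value).strip()
            if token and token not in seen:
                seen.add(token)
                f.write(token + "\n")
    return len(seen)


def check_single_vocab_option(feature):
    """Return the option that is set (or None); raise if several are combined."""
    present = [key for key in VOCAB_OPTIONS if feature.get(key) is not None]
    if len(present) > 1:
        raise ValueError("only one of %s may be set, got %s" % (VOCAB_OPTIONS, present))
    return present[0] if present else None


# Usage: write a small vocabulary into a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "vocab.txt")
    n_tokens = write_vocab_file(vocab_path, ["mon", "tue", "mon", " wed "])
    with open(vocab_path, encoding="utf-8") as f:
        vocab_lines = f.read().splitlines()
print(n_tokens, vocab_lines)
```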
@@ -177,6 +178,7 @@ TagFeature
The tags field can describe multiple attributes of an item

.. code:: protobuf
feature_config:{
features : {
input_names: 'properties'
@@ -196,6 +198,7 @@ The tags field can describe multiple attributes of an item
Weighted tag features are also supported, e.g. "sports:0.3\|entertainment:0.2\|military:0.5":

.. code:: protobuf
feature_config:{
features : {
input_names: 'tag_kvs'
@@ -206,9 +209,10 @@ The tags field can describe multiple attributes of an item
embedding_dim: 24
}
}
or as two separate inputs, "sports\|entertainment\|military" and "0.3\|0.2\|0.5"
or as two separate inputs, "sports\|entertainment\|military" and "0.3\|0.2\|0.5":

.. code:: protobuf
feature_config:{
features : {
input_names: 'tags'
@@ -224,9 +228,9 @@ NOTE:
~~~~~

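- If the data is stored in csv files, multiple tags are separated by an **intra-column separator**,
for example, csv columns are usually separated by commas (,), so a vertical bar (\|) can serve as the separator between tags.
- weights: the weight column corresponding to tags; in tables it is usually stored as a string.
- The number of weights must equal the number of tags, and the weights are separated by the **intra-column separator**.
For example: csv columns are usually separated by commas (,), so a vertical bar (\|) can serve as the separator between tags.
- weights: the weight column corresponding to tags; in tables it is usually stored as a string.
- The number of weights must equal the number of tags, and the weights are separated by the **intra-column separator**.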
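
The two input forms for weighted tags can be sketched in Python as follows. This is an illustrative parser under stated assumptions, not EasyRec's implementation: the function names are hypothetical, the intra-column separator is assumed to be "|" and the key-value separator ":".

```python
def parse_weighted_tags(value, sep="|", kv_sep=":"):
    """Parse the single-column form, e.g. "sports:0.3|entertainment:0.2"."""
    tags, weights = [], []
    for item in value.split(sep):
        # Split on the last kv_sep so tags containing ':' still parse.
        tag, weight = item.rsplit(kv_sep, 1)
        tags.append(tag)
        weights.append(float(weight))
    return tags, weights


def parse_tags_and_weights(tags_value, weights_value, sep="|"):
    """Parse the two-column form and enforce equal tag and weight counts."""
    tags = tags_value.split(sep)
    weights = [float(w) for w in weights_value.split(sep)]
    if len(tags) != len(weights):
        raise ValueError(
            "got %d tags but %d weights" % (len(tags), len(weights)))
    return tags, weights


print(parse_weighted_tags("sports:0.3|entertainment:0.2|military:0.5"))
print(parse_tags_and_weights("sports|entertainment|military", "0.3|0.2|0.5"))
```

Both calls yield the same tag list and weight list, which is the sense in which the two input forms are equivalent.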

SequenceFeature: behavior-sequence features
----------------------------------------------------------------
@@ -324,6 +328,7 @@ ComboFeature: combined features
Combines discrete input values, e.g. age + sex:

.. code:: protobuf
feature_config:{
features {
input_names: ["age", "sex"]
@@ -347,6 +352,7 @@ ExprFeature: expression features
Computing expression features inside EasyRec reduces model I/O and keeps offline and online features consistent.

.. code:: protobuf
data_config {
input_fields {
input_name: 'user_age'
@@ -369,7 +375,7 @@ ExprFeature: expression features
input_type: INT32
}
...
)
}
feature_config:{
features {
feature_name: "age_satisfy1"
@@ -408,6 +414,28 @@ ExprFeature: expression features
- The current version does not define operator precedence for "&" and "|"; use parentheses to make the intended precedence explicit.
- customized normalization: "tf.math.log1p(user_age) / 10.0"
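
Why parentheses matter can be seen with an analogy in plain Python. The sketch below uses Python's own precedence rules, which are not necessarily those of EasyRec's expression parser; the function names are hypothetical, and `log1p_normalize` merely mirrors the customized normalization expression above using `math.log1p` in place of `tf.math.log1p`.

```python
import math


def in_range_unparenthesized(age):
    # In Python, & binds tighter than comparisons, so this is parsed as the
    # chained comparison  age > (18 & age) < 30,  not the intended range test.
    return age > 18 & age < 30


def in_range_parenthesized(age):
    # Explicit parentheses give the intended meaning: 18 < age < 30.
    return (age > 18) & (age < 30)


def log1p_normalize(user_age):
    # Scalar stand-in for "tf.math.log1p(user_age) / 10.0".
    return math.log1p(user_age) / 10.0


for age in (25, 40):
    print(age, in_range_unparenthesized(age), in_range_parenthesized(age))
print(log1p_normalize(30))
```

For age 40 the two functions disagree, which is the kind of silent error that explicit parentheses avoid.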

EmbeddingVariable
----------------------------------------------------------------
Key-value hash: reduces hash collisions and supports feature admission and feature eviction.

.. code:: protobuf
model_config {
model_class: "MultiTower"
...
ev_params {
filter_freq: 2
}
}
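- Configuration options:
  - set ev_params individually per feature in feature_config
  - set ev_params once for all features in model_config

- ev_params : EVParams
  - filter_freq: frequency filter; low-frequency features are noisy, and filtering them out makes the model more robust
  - steps_to_live: feature eviction; expired features are evicted to keep the model from growing too large
- Note: only available when PAI-TF/DeepRec is installed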
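
The ideas behind filter_freq and steps_to_live can be illustrated with a toy simulation. This is not the PAI-TF/DeepRec implementation: `ToyEmbeddingTable` is a hypothetical class, and the exact admission and expiry conditions (count reaching filter_freq, more than steps_to_live steps since the last occurrence) are assumptions made for the sketch.

```python
class ToyEmbeddingTable:
    """Toy model of feature admission and eviction (assumed semantics)."""

    def __init__(self, filter_freq=0, steps_to_live=0):
        self.filter_freq = filter_freq      # occurrences needed for admission
        self.steps_to_live = steps_to_live  # 0 disables eviction
        self.counts = {}                    # key -> number of occurrences
        self.last_seen = {}                 # key -> step of last occurrence

    def observe(self, key, step):
        """Record one occurrence of `key` at training step `step`."""
        self.counts[key] = self.counts.get(key, 0) + 1
        self.last_seen[key] = step

    def is_admitted(self, key):
        """A key is admitted once it has been seen at least filter_freq times."""
        return self.counts.get(key, 0) >= self.filter_freq

    def evict_expired(self, step):
        """Drop keys not seen within the last steps_to_live steps."""
        if self.steps_to_live <= 0:
            return []
        expired = [k for k, last in self.last_seen.items()
                   if step - last > self.steps_to_live]
        for key in expired:
            del self.counts[key]
            del self.last_seen[key]
        return expired


table = ToyEmbeddingTable(filter_freq=2, steps_to_live=10)
table.observe("item_1", step=1)
print(table.is_admitted("item_1"))   # seen once, below filter_freq
table.observe("item_1", step=2)
print(table.is_admitted("item_1"))   # seen twice, admitted
table.observe("item_2", step=3)
print(table.evict_expired(step=13))  # item_1 was last seen 11 steps ago
```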

Feature selection
----------------------------------------------------------------
Applies variational dropout to the input layer to compute feature importance, then selects features by importance rank.
29 changes: 29 additions & 0 deletions docs/source/feature/odl_sample.md
@@ -5,6 +5,35 @@
- Offline samples can be built with SQL on MaxCompute or on Hive/Spark.
- [推荐算法定制 (Recommendation Algorithm Customization)](https://pairec.yuque.com/books/share/72cb101c-e89d-453b-be81-0fadf09db4dd) can be used to generate the offline-feature and offline-sample pipelines automatically.

## Sample weight

- Designate one input column as the sample weight
  - data_config.sample_weight
- Example:
```protobuf
data_config {
input_fields {
input_name: 'clk'
input_type: DOUBLE
}
input_fields {
input_name: 'field1'
input_type: STRING
}
...
input_fields {
input_name: 'sw'
input_type: DOUBLE
}
sample_weight: 'sw'
label_fields: 'clk'
batch_size: 1024
input_type: CSVInput
}
```
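
To make the effect of a sample-weight column concrete, here is a small Python sketch of a weighted binary log loss. It is an assumption-laden illustration, not EasyRec's loss code: `weighted_log_loss` is a hypothetical function, and normalising by the sum of weights is one common convention that may differ from what the framework does internally.

```python
import math


def weighted_log_loss(labels, probs, sample_weights, eps=1e-7):
    """Weighted mean of per-sample binary cross-entropy (assumed convention)."""
    total = 0.0
    weight_sum = 0.0
    for y, p, w in zip(labels, probs, sample_weights):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += w * loss
        weight_sum += w
    return total / weight_sum if weight_sum > 0 else 0.0


# 'clk' plays the role of the label column and 'sw' the sample-weight column.
clk = [1, 0, 1]
predicted = [0.8, 0.3, 0.4]
sw = [1.0, 1.0, 3.0]
print(weighted_log_loss(clk, predicted, sw))
print(weighted_log_loss(clk, predicted, [1.0, 1.0, 1.0]))
```

The third sample is poorly predicted, so giving it a weight of 3 raises the loss relative to the unweighted case.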

## Real-time samples

### Prerequisites
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -51,6 +51,7 @@ Welcome to easy_rec's documentation!
:maxdepth: 2
:caption: PREDICT

predict/input_output
predict/MaxCompute 离线预测
predict/Local 离线预测
predict/在线预测
125 changes: 125 additions & 0 deletions docs/source/predict/input_output.md
@@ -0,0 +1,125 @@
# Inputs and outputs

## Command

```bash
saved_model_cli show --all --dir export/1650854967
```

## Output

```
MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:
signature_def['serving_default']:
The given SavedModel SignatureDef contains the following input(s):
inputs['adgroup_id'] tensor_info:
dtype: DT_STRING
shape: (-1)
name: input_3:0
inputs['age_level'] tensor_info:
dtype: DT_STRING
shape: (-1)
name: input_12:0
inputs['brand'] tensor_info:
dtype: DT_STRING
shape: (-1)
name: input_7:0
inputs['campaign_id'] tensor_info:
dtype: DT_STRING
shape: (-1)
name: input_5:0
inputs['cate_id'] tensor_info:
dtype: DT_STRING
shape: (-1)
name: input_4:0
inputs['cms_group_id'] tensor_info:
dtype: DT_STRING
shape: (-1)
name: input_10:0
inputs['cms_segid'] tensor_info:
dtype: DT_STRING
shape: (-1)
name: input_9:0
inputs['customer'] tensor_info:
dtype: DT_STRING
shape: (-1)
name: input_6:0
inputs['final_gender_code'] tensor_info:
dtype: DT_STRING
shape: (-1)
name: input_11:0
inputs['new_user_class_level'] tensor_info:
dtype: DT_STRING
shape: (-1)
name: input_16:0
inputs['occupation'] tensor_info:
dtype: DT_STRING
shape: (-1)
name: input_15:0
inputs['pid'] tensor_info:
dtype: DT_STRING
shape: (-1)
name: input_2:0
inputs['price'] tensor_info:
dtype: DT_INT32
shape: (-1)
name: input_19:0
inputs['pvalue_level'] tensor_info:
dtype: DT_STRING
shape: (-1)
name: input_13:0
inputs['shopping_level'] tensor_info:
dtype: DT_STRING
shape: (-1)
name: input_14:0
inputs['tag_brand_list'] tensor_info:
dtype: DT_STRING
shape: (-1)
name: input_18:0
inputs['tag_category_list'] tensor_info:
dtype: DT_STRING
shape: (-1)
name: input_17:0
inputs['user_id'] tensor_info:
dtype: DT_STRING
shape: (-1)
name: input_8:0
The given SavedModel SignatureDef contains the following output(s):
outputs['item_emb'] tensor_info:
dtype: DT_STRING
shape: (-1)
name: ReduceJoin_1/ReduceJoin:0
outputs['item_tower_feature'] tensor_info:
dtype: DT_STRING
shape: (-1)
name: ReduceJoin_3/ReduceJoin:0
outputs['logits'] tensor_info:
dtype: DT_FLOAT
shape: (-1)
name: Reshape:0
outputs['probs'] tensor_info:
dtype: DT_FLOAT
shape: (-1)
name: Sigmoid:0
outputs['user_emb'] tensor_info:
dtype: DT_STRING
shape: (-1)
name: ReduceJoin/ReduceJoin:0
outputs['user_tower_feature'] tensor_info:
dtype: DT_STRING
shape: (-1)
name: ReduceJoin_2/ReduceJoin:0
Method name is: tensorflow/serving/predict
```

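- signature_def: defaults to serving_default
- inputs: list of inputs
  - dtype: type of the input tensor
  - shape: shape of the input tensor
  - name: name of the input Placeholder
- outputs: list of outputs
  - dtype: type of the output tensor
  - shape: shape of the output tensor
  - name: name of the output tensor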
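
The listing above is regular enough to be read programmatically. The Python sketch below parses text in that layout into dictionaries of inputs and outputs; `parse_signature` is a hypothetical helper written for this illustration, it relies only on the line patterns visible in the output above, and `SAMPLE` is an excerpt of that output.

```python
import re

# Patterns taken from the layout of the listing above.
_HEADER = re.compile(r"^(inputs|outputs)\['([^']+)'\] tensor_info:$")
_FIELD = re.compile(r"^(dtype|shape|name):\s*(.+)$")


def parse_signature(text):
    """Collect dtype/shape/name for every input and output in the listing."""
    result = {"inputs": {}, "outputs": {}}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        header = _HEADER.match(line)
        if header:
            kind, key = header.groups()
            current = result[kind].setdefault(key, {})
            continue
        field = _FIELD.match(line)
        if field and current is not None:
            current[field.group(1)] = field.group(2)
    return result


SAMPLE = """
signature_def['serving_default']:
The given SavedModel SignatureDef contains the following input(s):
inputs['price'] tensor_info:
dtype: DT_INT32
shape: (-1)
name: input_19:0
inputs['user_id'] tensor_info:
dtype: DT_STRING
shape: (-1)
name: input_8:0
The given SavedModel SignatureDef contains the following output(s):
outputs['probs'] tensor_info:
dtype: DT_FLOAT
shape: (-1)
name: Sigmoid:0
Method name is: tensorflow/serving/predict
"""

signature = parse_signature(SAMPLE)
for kind in ("inputs", "outputs"):
    for key, info in signature[kind].items():
        print(kind, key, info["dtype"], info["shape"], info["name"])
```

A mapping like this is handy when building request payloads, since each input key must be fed with a value of the listed dtype.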
