[feature]: add docs for ev and sample_weight (#305)

* add docs for ev and sample_weight
alibaba · Nov 14, 2022 · a87e641
1 parent 272bd12

Showing 5 changed files with 195 additions and 11 deletions.
diff --git a/docs/source/eval.md b/docs/source/eval.md
@@ -90,7 +90,7 @@ python -m easy_rec.python.eval --pipeline_config_path dwd_avazu_ctr_deepmodel.co
 pai -name easy_rec_ext -project algo_public
 -Dcmd=evaluate
 -Dconfig=oss://easyrec/config/MultiTower/dwd_avazu_ctr_deepmodel_ext.config
--Dtables=odps://pai_online_project/tables/dwd_avazu_ctr_deepmodel_test
+-Deval_tables=odps://pai_online_project/tables/dwd_avazu_ctr_deepmodel_test
 -Dcluster='{"worker" : {"count":1, "cpu":1000, "gpu":100, "memory":40000}}'
 -Dmodel_dir=oss://easyrec/ckpt/MultiTower
 -Darn=acs:ram::xxx:role/xxx
@@ -100,20 +100,21 @@ pai -name easy_rec_ext -project algo_public
 
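 - -Dcmd: evaluate, which runs model evaluation
 - -Dconfig: same as for training
-- -Dtables: only the test tables need to be specified
+- -Deval_tables: the test tables
 - -Dcluster: evaluation does not need PS nodes; a single worker node is enough
 - -Dmodel_dir: if specified, overrides the model_dir in the config; typically used for periodically scheduled jobs
 - -Dcheckpoint_path: use the given checkpoint_path, e.g. oss://easyrec/ckpt/MultiTower/model.ckpt-1000. If not specified, the latest ckpt file in model_dir is used.
 - On the PAI internal version, arn and ossHost do not need to be specified separately; they are passed in -Dbuckets
   - -Dbuckets=oss://easyrec/?role_arn=acs:ram::xxx:role/ev-ext-test-oss&host=oss-cn-beijing-internal.aliyuncs.com
+- -Deval_result_path: name of the file that stores the evaluation results; defaults to eval_result.txt and is saved under model_dir.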
 
 #### Distributed evaluation
 
 ```sql
 pai -name easy_rec_ext -project algo_public
 -Dcmd=evaluate
 -Dconfig=oss://easyrec/config/MultiTower/dwd_avazu_ctr_deepmodel_ext.config
--Dtables=odps://pai_online_project/tables/dwd_avazu_ctr_deepmodel_test
+-Deval_tables=odps://pai_online_project/tables/dwd_avazu_ctr_deepmodel_test
 -Dcluster='{"ps":{"count":1, "cpu":1000}, "worker" : {"count":3, "cpu":1000, "gpu":100, "memory":40000}}'
 -Dmodel_dir=oss://easyrec/ckpt/MultiTower
 -Dextra_params=" --distribute_eval True"
@@ -124,4 +125,4 @@ pai -name easy_rec_ext -project algo_public
 
 - -distribute_eval: run evaluation in distributed mode
 
-Evaluation results are written to the file "eval_result.txt" under model_dir.
+Evaluation results are written to the file named by -Deval_result_path (default: eval_result.txt) under model_dir.
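
As a rough illustration of the path rule described in the eval.md change, the location of the result file can be sketched in Python. This is only a sketch of the documented behaviour (file name from -Deval_result_path, default eval_result.txt, stored under model_dir); the helper name `eval_result_file` is hypothetical and not part of EasyRec.

```python
import posixpath

DEFAULT_EVAL_RESULT_NAME = "eval_result.txt"  # default file name stated in the docs


def eval_result_file(model_dir, eval_result_path=None):
    """Hypothetical helper: where the evaluation result file is expected.

    Per the docs, the file named by -Deval_result_path (default
    eval_result.txt) is saved under model_dir.
    """
    name = eval_result_path or DEFAULT_EVAL_RESULT_NAME
    # posixpath always joins with forward slashes, which suits oss:// style URIs.
    return posixpath.join(model_dir, name)


print(eval_result_file("oss://easyrec/ckpt/MultiTower"))
print(eval_result_file("oss://easyrec/ckpt/MultiTower", "eval_custom.txt"))
```

The second call uses a made-up file name purely to show the override.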
diff --git a/docs/source/feature/feature.rst b/docs/source/feature/feature.rst
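@@ -92,8 +92,9 @@ IdFeature: discrete-valued / ID features
 -  vocab\_list:
    specifies the vocabulary inline; suited to features with a small, enumerable set of values, such as day of week, month, or zodiac sign
 
--  vocab\_file:
-   specifies the vocabulary via a file; intended for larger vocabularies. When submitting a TF job to the PAI cluster, the vocabulary file can be stored in OSS.
+-  vocab\_file: specifies the vocabulary via a file; intended for larger vocabularies.
+    -  Format: one word per line
+    -  Path: when submitting a TF job to the PAI cluster, the vocabulary file can be stored in OSS.
 
 -  NOTE: only one of hash\_bucket\_size, num\_buckets, vocab\_list,
    and vocab\_file may be specified; they cannot be combined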
@@ -177,6 +178,7 @@ TagFeature
 The tags field can describe multiple attributes of an item
 
 .. code:: protobuf
+
   feature_config:{
     features : {
        input_names: 'properties'
@@ -196,6 +198,7 @@ The tags field can describe multiple attributes of an item
 Weighted tag features are also supported, e.g. "sports:0.3\|entertainment:0.2\|military:0.5":
 
 .. code:: protobuf
+
   feature_config:{
     features : {
        input_names: 'tag_kvs'
@@ -206,9 +209,10 @@ The tags field can describe multiple attributes of an item
        embedding_dim: 24
     }
   }
-Or as two separate inputs, "sports\|entertainment\|military" and "0.3\|0.2\|0.5"：
+Or as two separate inputs, "sports\|entertainment\|military" and "0.3\|0.2\|0.5":
 
 .. code:: protobuf
+
   feature_config:{
     features : {
        input_names: 'tags'
@@ -224,9 +228,9 @@ NOTE:
 ~~~~~
 
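 -  If the data is stored in csv files, multiple tags are separated by an **intra-column separator**;
-   for example： csv columns are usually separated by commas (,), so a vertical bar (\|) can serve as the separator between tags.
--  weights： the weight column for the tags, usually stored as a string column in the table.
--  The number of Weights must equal the number of tags, and they are separated by the **intra-column separator**.
+   for example: csv columns are usually separated by commas (,), so a vertical bar (\|) can serve as the separator between tags.
+-  weights: the weight column for the tags, usually stored as a string column in the table.
+-  The number of weights must equal the number of tags, and they are separated by the **intra-column separator**.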
 
 SequenceFeature: behavior sequence features
 ----------------------------------------------------------------
@@ -324,6 +328,7 @@ ComboFeature: combination features
 Combines discrete input values, e.g. age + sex:
 
 .. code:: protobuf
+
   feature_config:{
     features {
         input_names: ["age", "sex"]
@@ -347,6 +352,7 @@ ExprFeature: expression features
 Computing expression features inside EasyRec both reduces model I/O and keeps offline and online behavior consistent.
 
 .. code:: protobuf
+
   data_config {
       input_fields {
         input_name: 'user_age'
@@ -369,7 +375,7 @@ ExprFeature: expression features
         input_type: INT32
       }
     ...
-  )
+  }
   feature_config:{
       features {
        feature_name: "age_satisfy1"
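@@ -408,6 +414,28 @@ ExprFeature: expression features
     - The current version does not define operator precedence for "&" and "|"; use parentheses to make the intended precedence explicit.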
     - customized normalization: "tf.math.log1p(user_age) / 10.0"
 
+EmbeddingVariable
+----------------------------------------------------------------
+Key-value hashing: reduces hash collisions and supports feature admission and feature eviction.
+
+.. code:: protobuf
+
+  model_config {
+    model_class: "MultiTower"
+    ...
+    ev_params {
+      filter_freq: 2
+    }
+  }
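+- Configuration options:
+   - configure ev_params separately in feature_config
+   - configure ev_params once for all features in model_config
+
+- ev_params : EVParams
+   - filter_freq: frequency filtering; low-frequency features are noisy, and filtering them out makes the model more robust
+   - steps_to_live: feature eviction; evicts expired features to keep the model from growing too large
+- Note: only available when PAI-TF or DeepRec is installed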
+
 Feature selection
 ----------------------------------------------------------------
 Applies variational dropout to the input layer to compute feature importance, then selects features by importance ranking.
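
The two weighted-tag input forms covered in the TagFeature part of feature.rst (a single `tag:weight` column, or separate tags and weights columns) can be sketched in plain Python. This is a hedged illustration of the data format and of the "number of weights must equal number of tags" rule only, not EasyRec's parsing code; the function names and the English tag values are hypothetical, and the raw data uses a bare `|` (the `\|` in the docs is just RST escaping).

```python
def parse_weighted_tags(kv_string, sep="|", kv_sep=":"):
    """Split "tag:weight|tag:weight" into parallel tag and weight lists."""
    tags, weights = [], []
    for item in kv_string.split(sep):
        # rsplit keeps any kv_sep characters that appear inside the tag itself.
        tag, weight = item.rsplit(kv_sep, 1)
        tags.append(tag)
        weights.append(float(weight))
    return tags, weights


def pair_tags_with_weights(tags_string, weights_string, sep="|"):
    """Pair a tags column with a separate weights column of equal length."""
    tags = tags_string.split(sep)
    weights = [float(w) for w in weights_string.split(sep)]
    if len(tags) != len(weights):
        raise ValueError("got %d tags but %d weights" % (len(tags), len(weights)))
    return list(zip(tags, weights))


print(parse_weighted_tags("sports:0.3|entertainment:0.2|military:0.5"))
print(pair_tags_with_weights("sports|entertainment|military", "0.3|0.2|0.5"))
```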

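To make the two `ev_params` knobs from the EmbeddingVariable section more concrete, here is a toy Python model of frequency-based admission (`filter_freq`) and step-based eviction (`steps_to_live`). It is a sketch under stated assumptions and not the PAI-TF/DeepRec implementation: the class `ToyEmbeddingTable`, its methods, and the exact thresholds (admit once a key has been seen at least `filter_freq` times, evict when unseen for more than `steps_to_live` steps) are all assumptions made for illustration.

```python
class ToyEmbeddingTable:
    """Toy key-value table with frequency admission and step-based eviction.

    Illustrative only; the real EmbeddingVariable semantics may differ.
    """

    def __init__(self, filter_freq=0, steps_to_live=None):
        self.filter_freq = filter_freq
        self.steps_to_live = steps_to_live
        self.counts = {}       # key -> number of times the key was seen
        self.last_seen = {}    # key -> last step at which the key was seen
        self.admitted = set()  # keys that currently own an embedding slot

    def observe(self, key, step):
        """Record one occurrence of key; return True if it is admitted."""
        self.counts[key] = self.counts.get(key, 0) + 1
        self.last_seen[key] = step
        if self.counts[key] >= self.filter_freq:  # assumed admission rule
            self.admitted.add(key)
        return key in self.admitted

    def evict(self, step):
        """Drop keys not seen within the last steps_to_live steps."""
        if self.steps_to_live is None:
            return []
        stale = [k for k, s in self.last_seen.items()
                 if step - s > self.steps_to_live]  # assumed expiry rule
        for k in stale:
            self.counts.pop(k, None)
            self.last_seen.pop(k, None)
            self.admitted.discard(k)
        return stale


table = ToyEmbeddingTable(filter_freq=2, steps_to_live=100)
print(table.observe("item_42", step=1))  # first sighting, below filter_freq
print(table.observe("item_42", step=2))  # second sighting reaches filter_freq
print(table.evict(step=200))             # unseen for 198 steps, so it expires
```

A hash-bucket feature, by contrast, maps every ID to a fixed slot immediately, which is where collisions come from; the key-value table above only spends a slot on IDs that pass the admission rule.
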
diff --git a/docs/source/feature/odl_sample.md b/docs/source/feature/odl_sample.md
@@ -5,6 +5,35 @@
 - Offline samples can be built with SQL on MaxCompute or on Hive/Spark.
 - [推荐算法定制](https://pairec.yuque.com/books/share/72cb101c-e89d-453b-be81-0fadf09db4dd) can be used to automatically generate the pipelines for offline features and offline samples.
 
+## Sample weights
+
+- Designate one input column as the sample_weight
+  - data_config.sample_weight
+- Example:
+  ```protobuf
+    data_config {
+      input_fields {
+        input_name: 'clk'
+        input_type: DOUBLE
+      }
+      input_fields {
+        input_name: 'field1'
+        input_type: STRING
+      }
+      ...
+      input_fields {
+        input_name: 'sw'
+        input_type: DOUBLE
+      }
+
+      sample_weight: 'sw'
+
+      label_fields: 'clk'
+      batch_size: 1024
+      input_type: CSVInput
+    }
+  ```
+
 ## Real-time samples
 
 ### Prerequisites

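The effect of a `sample_weight` column such as `sw` can be illustrated with a weighted binary log loss in plain Python. This is a minimal sketch of the general idea only: how EasyRec actually applies and normalizes the weights is not stated in the docs above, so dividing by the sum of weights is an assumption, and the function name is hypothetical.

```python
import math


def weighted_log_loss(labels, probs, sample_weights, eps=1e-7):
    """Binary log loss where each sample contributes in proportion to its weight."""
    total, weight_sum = 0.0, 0.0
    for y, p, w in zip(labels, probs, sample_weights):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        loss = -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
        total += w * loss
        weight_sum += w
    # Assumption: normalize by the total weight (one common convention).
    return total / weight_sum


# Equal weights reduce to the ordinary mean; a zero weight drops the sample.
print(weighted_log_loss([1, 0], [0.5, 0.5], [1.0, 1.0]))
print(weighted_log_loss([1, 0], [0.5, 0.9], [1.0, 0.0]))
```
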
diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -51,6 +51,7 @@ Welcome to easy_rec's documentation!
    :maxdepth: 2
    :caption: PREDICT
 
+   predict/input_output
    predict/MaxCompute 离线预测
    predict/Local 离线预测
    predict/在线预测

diff --git a/docs/source/predict/input_output.md b/docs/source/predict/input_output.md
@@ -0,0 +1,125 @@
+# Inputs and outputs
+
+## Command
+
+```bash
+   saved_model_cli show --all --dir export/1650854967
+```
+
+## Output
+
+```
+MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:
+
+signature_def['serving_default']:
+  The given SavedModel SignatureDef contains the following input(s):
+    inputs['adgroup_id'] tensor_info:
+        dtype: DT_STRING
+        shape: (-1)
+        name: input_3:0
+    inputs['age_level'] tensor_info:
+        dtype: DT_STRING
+        shape: (-1)
+        name: input_12:0
+    inputs['brand'] tensor_info:
+        dtype: DT_STRING
+        shape: (-1)
+        name: input_7:0
+    inputs['campaign_id'] tensor_info:
+        dtype: DT_STRING
+        shape: (-1)
+        name: input_5:0
+    inputs['cate_id'] tensor_info:
+        dtype: DT_STRING
+        shape: (-1)
+        name: input_4:0
+    inputs['cms_group_id'] tensor_info:
+        dtype: DT_STRING
+        shape: (-1)
+        name: input_10:0
+    inputs['cms_segid'] tensor_info:
+        dtype: DT_STRING
+        shape: (-1)
+        name: input_9:0
+    inputs['customer'] tensor_info:
+        dtype: DT_STRING
+        shape: (-1)
+        name: input_6:0
+    inputs['final_gender_code'] tensor_info:
+        dtype: DT_STRING
+        shape: (-1)
+        name: input_11:0
+    inputs['new_user_class_level'] tensor_info:
+        dtype: DT_STRING
+        shape: (-1)
+        name: input_16:0
+    inputs['occupation'] tensor_info:
+        dtype: DT_STRING
+        shape: (-1)
+        name: input_15:0
+    inputs['pid'] tensor_info:
+        dtype: DT_STRING
+        shape: (-1)
+        name: input_2:0
+    inputs['price'] tensor_info:
+        dtype: DT_INT32
+        shape: (-1)
+        name: input_19:0
+    inputs['pvalue_level'] tensor_info:
+        dtype: DT_STRING
+        shape: (-1)
+        name: input_13:0
+    inputs['shopping_level'] tensor_info:
+        dtype: DT_STRING
+        shape: (-1)
+        name: input_14:0
+    inputs['tag_brand_list'] tensor_info:
+        dtype: DT_STRING
+        shape: (-1)
+        name: input_18:0
+    inputs['tag_category_list'] tensor_info:
+        dtype: DT_STRING
+        shape: (-1)
+        name: input_17:0
+    inputs['user_id'] tensor_info:
+        dtype: DT_STRING
+        shape: (-1)
+        name: input_8:0
+  The given SavedModel SignatureDef contains the following output(s):
+    outputs['item_emb'] tensor_info:
+        dtype: DT_STRING
+        shape: (-1)
+        name: ReduceJoin_1/ReduceJoin:0
+    outputs['item_tower_feature'] tensor_info:
+        dtype: DT_STRING
+        shape: (-1)
+        name: ReduceJoin_3/ReduceJoin:0
+    outputs['logits'] tensor_info:
+        dtype: DT_FLOAT
+        shape: (-1)
+        name: Reshape:0
+    outputs['probs'] tensor_info:
+        dtype: DT_FLOAT
+        shape: (-1)
+        name: Sigmoid:0
+    outputs['user_emb'] tensor_info:
+        dtype: DT_STRING
+        shape: (-1)
+        name: ReduceJoin/ReduceJoin:0
+    outputs['user_tower_feature'] tensor_info:
+        dtype: DT_STRING
+        shape: (-1)
+        name: ReduceJoin_2/ReduceJoin:0
+  Method name is: tensorflow/serving/predict
+
+```
+
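+- signature_def: defaults to serving_default
+- inputs: list of inputs
+  - dtype: dtype of the input tensor
+  - shape: shape of the input tensor
+  - name: name of the input Placeholder
+- outputs: list of outputs
+  - dtype: dtype of the output tensor
+  - shape: shape of the output tensor
+  - name: name of the output tensor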
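
When the listing above needs to be consumed by a script (for example, to check which dtype each input expects before building a request), the text printed by `saved_model_cli show` can be parsed with a few regular expressions. The sketch below assumes the output layout shown in input_output.md; `parse_signature` is a hypothetical helper, not part of TensorFlow or EasyRec, and the sample text is a shortened excerpt of that listing.

```python
import re

ENTRY_RE = re.compile(r"^\s*(inputs|outputs)\['([^']+)'\] tensor_info:\s*$")
FIELD_RE = re.compile(r"^\s*(dtype|shape|name):\s*(.+?)\s*$")


def parse_signature(text):
    """Collect dtype/shape/name for each input and output in the listing."""
    result = {"inputs": {}, "outputs": {}}
    current = None
    for line in text.splitlines():
        entry = ENTRY_RE.match(line)
        if entry:
            kind, key = entry.groups()
            current = result[kind].setdefault(key, {})
            continue
        field = FIELD_RE.match(line)
        if field and current is not None:
            current[field.group(1)] = field.group(2)
    return result


sample = """\
signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['price'] tensor_info:
        dtype: DT_INT32
        shape: (-1)
        name: input_19:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['probs'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: Sigmoid:0
"""
sig = parse_signature(sample)
print(sig["inputs"]["price"]["dtype"])   # DT_INT32
print(sig["outputs"]["probs"]["name"])   # Sigmoid:0
```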