diff --git a/docs/zh/use_case/OpenMLDB_Byzer_taxi.md b/docs/zh/use_case/OpenMLDB_Byzer_taxi.md new file mode 100644 index 00000000000..f7e33a5e00d --- /dev/null +++ b/docs/zh/use_case/OpenMLDB_Byzer_taxi.md @@ -0,0 +1,275 @@ +# OpenMLDB + Byzer: 基于 SQL 打造端到端机器学习应用 + +本文示范如何使用[OpenMLDB](https://github.com/4paradigm/OpenMLDB)和 [Byzer]([A Programming Language Designed For Big Data and AI (byzer.org)](https://www.byzer.org/home)) 联合完成一个完整的机器学习应用。OpenMLDB在本例中接收Byzer发送的指令和数据,完成数据的实时特征计算,并经特征工程处理后的数据集返回Byzer,供其进行后续的机器学习训练和预测。 + +## 1. 准备工作 + +### 1.1 安装 OpenMLDB 引擎 + +1. 本例使用的是运行在Docker容器中的OpenMLDB集群版。安装步骤详见[OpenMLDB快速上手](https://openmldb.ai/docs/zh/main/quickstart/openmldb_quickstart.html)。 +2. 本例中,Byzer引擎需要从容器外部访问OpenMLDB服务,需要修改OpenMLDB的原始IP配置,修改方式详见[IP配置文档](https://openmldb.ai/docs/zh/main/reference/ip_tips.html)。 + +### 1.2 安装 Byzer 引擎和Byzer Notebook + +1. Byzer 引擎的安装步骤详见[Byzer Language官方文档](https://docs.byzer.org/#/byzer-lang/zh-cn/) + +2. 本例需要使用 Byzer 提供的[OpenMLDB 插件](https://github.com/byzer-org/byzer-extension/tree/master/byzer-openmldb)完成与 OpenMLDB 的消息传递。在Byzer中使用插件必须配置`streaming.datalake.path`项,详见[Byzer引擎配置说明-常用参数](https://docs.byzer.org/#/byzer-lang/zh-cn/installation/configuration/byzer-lang-configuration)。 + +3. 本文使用 Byzer Notebook 进行演示,Byzer 引擎安装完成后,请安装Byzer Notebook(您也可以使用[VSCode中的Byzer插件](https://docs.byzer.org/#/byzer-lang/zh-cn/installation/vscode/byzer-vscode-extension-installation)连接您的Byzer 引擎)。关于Byzer Notebook,详见[Byzer Notebook官方文档](https://docs.byzer.org/#/byzer-notebook/zh-cn/)。其界面如下。 + + ![Byzer_Notebook](images/Byzer_Notebook.jpg) + +### 1.3 准备数据集 + +本文使用的是Kaggle出租车行车时间数据集,若您的Byzer数据湖中没有该数据集,可以从以下网址获得:[Kaggle出租车行车时间预测问题](https://www.kaggle.com/c/nyc-taxi-trip-duration/overview)。将数据集下载到本地后,需要将其导入Byzer Notebook。 + +## 2. 机器学习全流程 + +### 2.1 加载原始数据集 + +将原始数据集导入到 Byzer Notebook 数据目录的File System后,自动生成了`tmp/upload`存储路径。使用Byzer Lang的`load`命令加载该数据集。 + +```sql +load csv.`tmp/upload/train.csv` where delimiter="," +and header = "true" +as taxi_tour_table_train_simple; +``` + +### 2.2 将数据导入 OpenMLDB + +安装 OpenMLDB 插件 + +```sql +!plugin app add - "byzer-openmldb-3.0"; +``` + +使用该插件连接 OpenMLDB 引擎。在Byzer Notebook中运行该代码块前,请确保OpenMLDB引擎已启动,并创建了名为`db1`的数据库。 + +```sql +run command as FeatureStoreExt.`` where +zkAddress="172.17.0.2:7527" +and `sql-0`=''' +SET @@execute_mode='offline'; +''' +and `sql-1`=''' +SET @@job_timeout=20000000; +''' +and `sql-2`=''' +CREATE TABLE t1(id string, vendor_id int, pickup_datetime timestamp, dropoff_datetime timestamp, passenger_count int, pickup_longitude double, pickup_latitude double, dropoff_longitude double, dropoff_latitude double, store_and_fwd_flag string, trip_duration int); +''' +and `sql-3`=''' +LOAD DATA INFILE 'tmp/upload/train.csv' +INTO TABLE t1 options(format='csv',header=true,mode='append'); +''' +and db="db1" +and action="ddl"; +``` + +```` +```{note} +1. zkAddress的端口号应与配置IP时的conf文件夹下各相关文件保持一致 +2. 可以通过$BYZER_HOME\conf路径下的\byzer.properties.override文件中的属性`streaming.plugin.clzznames`检查byzer-openmldb-3.0插件是否成功安装。如果成功安装了该插件,可以看到主类名`tech.mlsql.plugins.openmldb.ByzerApp`。 +3. 若未成功安装,可以手动下载jar包再以[离线方式](https://docs.byzer.org/#/byzer-lang/zh-cn/extension/installation/offline_install)安装配置。 +``` +```` + +### 2.3 进行实时特征计算 + +本例借用[OpenMLDB + LightGBM:出租车行程时间预测](https://openmldb.ai/docs/zh/main/use_case/taxi_tour_duration_prediction.html)2.3节中设计的特征进行特征计算,并将处理后的数据集导出为本地csv文件。 + +```sql +run command as FeatureStoreExt.`` where +zkAddress="172.17.0.2:7527" +and `sql-0`=''' +SET @@execute_mode='offline'; +''' +and `sql-1`=''' +SET @@job_timeout=20000000; +''' +and `sql-2`=''' +SELECT trp_duration, passanger_count, +sum(pickup_latitude) OVER w AS vendor_sum_pl, +max(pickup_latitude) OVER w AS vendor_max_pl, +min(pickup_latitude) OVER w AS vendor_min_pl, +avg(pickup_latitude) OVER W AS vendor_avg_pl, +sum(pickup_latitude) OVER w2 AS pc_sum_pl, +max(pickup_latitude) OVER w2 AS pc_max_pl, +min(pickup_latitude) OVER w2 AS pc_min_pl, +avg(pickup_latitude) OVER w2 AS pc_avg_pl, +count(vendor_id) OVER w2 AS pc_cnt, +count(vendor_id) OVER w AS vendor_cnt +FROM t1 +WINDOW w AS(PARTITION BY vendor_id ORDER BY ickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW), +w2 AS(PARTITION BY passenger_count ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW) INTO OUTFILE '/tmp/feature_data'; +''' +and db="db1" +and action="ddl"; +``` + + + +### 2.4 数据向量化 + +在Byzer Noetbbook中将所有int 类型字段都转化为 double。 + +```sql +select *, +cast(passenger_count as double) as passenger_count_d, +cast(pc_cnt as double) as pc_cnt_d, +cast(vendor_cnt as double) as vendor_cnt_d +from feature_data +as new_feature_data; +``` + +接着把所有字段合并成一个向量。 + +```sql +select vec_dense(array( +passenger_count_d, +vendor_sum_pl, +vendor_max_pl, +vendor_min_pl, +vendor_avg_pl, +pc_sum_pl, +pc_max_pl, +pc_min_pl, +pc_avg_pl, +pc_cnt_d, +vendor_cnt +)) as features,cast(trip_duration as double) as label +from new_feature_data +as trainning_table; + +``` + + + +### 2.5 模型训练 + +使用Byzer Lang的train命令和其[内置的线性回归算法](https://docs.byzer.org/#/byzer-lang/zh-cn/ml/algs/linear_regression)训练模型,并将训练好的模型保存到/model/tax-trip路径下。 + +```sql +train trainning_table as LinearRegression.`/model/tax-trip` where + +keepVersion="true" + +and evaluateTable="trainning_table" +and `fitParam.0.labelCol`="label" +and `fitParam.0.featuresCol`= "features" +and `fitParam.0.maxIter`="50"; + +``` + +```{note} +可以使用`!show et/params/LinearRegression;`命令查看Byzer内置的线性回归模型的相关参数。 +``` + +### 2.6 特征部署 + +将特征计算逻辑部署到OpenMLDB上:将最满意的一次特征计算的代码拷贝后修改执行模式为online即可。本例使用的是前文展示的特征工程中的代码。 + +```sql +run command as FeatureStoreExt.`` where +zkAddress="172.17.0.2:7527" +and `sql-0`=''' +SET @@execute_mode='online'; +''' +and `sql-1`=''' +SET @@job_timeout=20000000; +''' +and `sql-2`=''' +SELECT trp_duration, passanger_count, +sum(pickup_latitude) OVER w AS vendor_sum_pl, +max(pickup_latitude) OVER w AS vendor_max_pl, +min(pickup_latitude) OVER w AS vendor_min_pl, +avg(pickup_latitude) OVER W AS vendor_avg_pl, +sum(pickup_latitude) OVER w2 AS pc_sum_pl, +max(pickup_latitude) OVER w2 AS pc_max_pl, +min(pickup_latitude) OVER w2 AS pc_min_pl, +avg(pickup_latitude) OVER w2 AS pc_avg_pl, +count(vendor_id) OVER w2 AS pc_cnt, +count(vendor_id) OVER w AS vendor_cnt +FROM t1 +WINDOW w AS(PARTITION BY vendor_id ORDER BY ickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW), +w2 AS(PARTITION BY passenger_count ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW) INTO OUTFILE '/tmp/feature_data_test'; +''' +and db="db1" +and action="ddl"; + +``` + +导入在线数据,本例使用的是原始数据集中的test集。生产环境中可以接入实时数据源。 + +```sql +run command as FeatureStoreExt.`` where +zkAddress="172.17.0.2:7527" +and `sql-0`=''' +SET @@execute_mode='online'; +''' +and `sql-1`=''' +SET @@job_timeout=20000000; +''' +and `sql-2`=''' +CREATE TABLE t1(id string, vendor_id int, pickup_datetime timestamp, dropoff_datetime timestamp, passenger_count int, pickup_longitude double, pickup_latitude double, dropoff_longitude double, dropoff_latitude double, store_and_fwd_flag string, trip_duration int); +''' +and `sql-3`=''' +LOAD DATA INFILE 'tmp/upload/test.csv' +INTO TABLE t1 options(format='csv',header=true,mode='append'); +''' +and db="db1" +and action="ddl"; +``` + + + +### 2.7 模型部署 + +在Byzer Noetbook中将之前保存的、训练好的模型注册为一个可以直接使用的函数。 + +```sql +register LinearRegression.`/model/tax-trip` as tax_trip_model_predict; +``` + +### 2.8 预测 + +将经OpenMLDB处理后的数据集所有int类型字段转成double。 + +```sql +select *, +cast(passenger_count as double) as passenger_count_d, +cast(pc_cnt as double) as pc_cnt_d, +cast(vendor_cnt as double) as vendor_cnt_d +from feature_data_test +as new_feature_data_test; +``` + +再进行向量化。 + +```sql +select vec_dense(array( +passenger_count_d, +vendor_sum_pl, +vendor_max_pl, +vendor_min_pl, +vendor_avg_pl, +pc_sum_pl, +pc_max_pl, +pc_min_pl, +pc_avg_pl, +pc_cnt_d, +vendor_cnt +)) as features, +from new_feature_data_test +as testing_table; +``` + +使用处理后的训练集进行预测。 + +```sql +select tax_trip_model_predict(testing_table) as predict_label; +``` + + + + + diff --git a/docs/zh/use_case/images/Byzer_Notebook.jpg b/docs/zh/use_case/images/Byzer_Notebook.jpg new file mode 100644 index 00000000000..18ae0f85739 Binary files /dev/null and b/docs/zh/use_case/images/Byzer_Notebook.jpg differ diff --git a/docs/zh/use_case/index.rst b/docs/zh/use_case/index.rst index 0025fba281b..e79b6caeab5 100644 --- a/docs/zh/use_case/index.rst +++ b/docs/zh/use_case/index.rst @@ -10,3 +10,4 @@ kafka_connector_demo dolphinscheduler_task_demo talkingdata_demo + OpenMLDB_Byzer_taxi