-
Notifications
You must be signed in to change notification settings - Fork 322
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
docs: added en version of tutorial/modes
- Loading branch information
1 parent
54dd3fe
commit bf643a9
Showing
3 changed files
with
136 additions
and
0 deletions.
There are no files selected for viewing
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,136 @@ | ||
# The Workflow of Cluster Version and Execution Mode | ||
|
||
OpenMLDB provides different execution modes at different stages of whole feature engineering workflow. Execution modes have been classified in detail to match every working steps, especially for the cluster version that often used in a production environment. You can find out the whole process, from feature extraction to online deployment, and corresponding execution mode in this manual. | ||
|
||
|
||
## 1. Overview of OpenMLDB Workflow | ||
|
||
### 1.1 The Whole Workflow of Feature Engineering | ||
|
||
The typical feature engineering process based on openMLDB, from feature extraction to online deployment, is as follows. | ||
|
||
1. **Offline Data Import** | ||
|
||
Offline data should be imported in this stage for subsequent offline feature extraction. | ||
|
||
2. **Offline Feature Development** | ||
|
||
The feature engineering script is developed and optimized until the quality is satisfied. Note that machine learning model development and tunning are involved as well during this step. However this article only focuses on the feature engineering based on OpenMLDB. | ||
|
||
3. **Online Deployment for Feature Scripts** | ||
|
||
After obtaining satisfactory feature extraction script, it is deployed online. | ||
|
||
4. **Import Data for Cold-start** | ||
|
||
Data in the windows of the online storage engine must be imported before it goes online. For example, if the script will aggregate data for the last three months, the data for those three months needs to be imported for cold-start. | ||
|
||
5. **Real-time Data Import** | ||
|
||
After the system being deployed online, the latest data needs to be imported to maintain the window computing logic as time goes by. Therefore, real-time data import is required. | ||
|
||
6. **Online Data Preview (Optional)** | ||
|
||
You can preview online data by running SQL commands in this stage. | ||
|
||
7. **Request Service in Real Time** | ||
|
||
After the solution is deployed and the input data stream is correctly connected, a real-time feature computing service is ready to respond to real-time requests. | ||
|
||
|
||
|
||
### 1.2 Overview of Cluster Execution Modes | ||
|
||
Since data objects are different in offline and online scenarios, their underlying storage and computing nodes are different. Therefore, OpenMLDB provides different execution modes to complete the processes mentioned in 1.1. The following table summarizes the execution modes used for each step in feature engineering. Important concepts about execution modes will be introduced later. | ||
|
||
|
||
| Stage | Execution Mode | Development Tool | Introduction | | ||
| ----------------------------------------- | -------------- | -------------------------- | ------------------------------------------------------------ | | ||
| 1.**Offline Data Import** | the offline mode | CLI | - `LOAD DATA` command<br /> | | ||
| 2. **Offline Feature Development** | the offline mode | CLI | - all SQL statements of OpenMLDB are supported<br />- some SQL queries (e.g.,`SELECT`) run in non-blocking asynchronous mode | | ||
| 3. **Feature Extraction Plan Deployment** | the offline mode | CLI | - `DEPLOY` command | | ||
| 4. **Import Data for Cold-start** | the online preview mode | CLI, Import Tools | - `LOAD DATA` command for CLI <br />- or you can use the independent import tool `openmldb-import` | | ||
| 5. **Real-time Data Import** | the online preview mode | REST APIs, Java/Python SDK | - Data insert APIs of OpenMLDB are called by third-party data sources to import real-time data. | | ||
| 6. **Online Data Preview (Optional)** | the online preview mode | CLI, Java/Python SDK | - Currently, only `SELECT` on columns, expressions and single-line functions are supported to be used on data preview<br />- Complex computing functions like `LAST JOIN`, `GROUP BY`, `HAVING`, `WINDOW` are not supported temporarily<br /> | | ||
| 7. **Real-time Feature Processing** | the online request mode | REST APIs, Java/Python SDK | - all SQL syntax of OpenMLDB is supported<br />- both REST APIs and Java SDK support single-line and batch request<br />- Python SDK only support single-line request | | ||
|
||
As the table shown above, execution modes can be categorized as `the offline mode`, `the online preview mode` and `the online request mode`. The following figure summarizes the entire feature engineering process and corresponding execution modes. Detailed introduction of each of these modes will be shown later in this page. | ||
|
||
![image-20220310170024349](images/modes-flow-en.png) | ||
|
||
### 1.3 Notes for the Standalone Version | ||
|
||
Although this doc focuses on the cluster version, it is necessary to involve a brief description of standalone version's execution modes. The execution modes of the standalone version are relatively simple. Because the storage nodes and compute nodes are unified for its offline data and online data, standalone version does not distinguish between the offline and online modes. That is, standalone version doesn't have the concept of execution mode using CLI. Any SQL syntax supported by OpenMLDB can directly run on the CLI. Therefore, the standalone version is especially suitable for quick evaluation and learning OpenMLDB SQL. However, in the stage of **real-time feature processing**, the standalone version still run in online request mode, which is the same as the cluster version. | ||
|
||
:::{note} | ||
If you only want to try OpenMLDB in a non-production environment, or learn and practice SQL, it is highly recommended to use the standalone version because of its faster and easier deployment. | ||
::: | ||
|
||
## 2. The Offline Mode | ||
|
||
As mentioned earlier, the offline data import, offline feature development and feature extraction deployment stages of cluster version are all running in offline mode. The management and computing of offline data are completed in this mode. Related computing nodes are supported by Spark release version, which has [been optimized for feature engineering by OpenMLDB](https://openmldb.ai/docs/zh/main/tutorial/openmldbspark_distribution.html). Common storage systems such as HDFS can be used on the storage nodes. | ||
|
||
The offline mode has following main features. | ||
|
||
- Offline mode supports all SQL syntax provided by OpenMLDB, including extended and optimized LAST JOIN, WINDOW UNION and other complicated SQL queries. | ||
|
||
- In the offline mode, some SQL commands run in non-blocking asynchronous mode. These commands include `LOAD DATA`, `SELECT` and `SELECT INTO` . | ||
|
||
- The above-mentioned non-blocking SQL commands is managed by TaskManager, try following commands to check and manage the execution. | ||
|
||
```bash | ||
SHOW JOBS | ||
SHOW JOB | ||
STOP JOB | ||
``` | ||
|
||
+ :::{Tip} | ||
Please pay attention that the `SELECT` command runs asynchronously in offline mode, which is totally different with many common relational database. Therefore, it is highly recommended to use `SELECT INTO` instead of `SELECT` for development and debugging. With `SELECT INTO`, the results can be exported to an external file and checked. | ||
::: | ||
|
||
+ The feature extraction plan deploying command `DEPLOY` executes in offline mode as well. | ||
:::{note} | ||
The deployment criterion has certain requirements for SQL, see [The Specification and Requirements of OpenMLDB SQL Deployment]() for more details. | ||
::: | ||
The offline mode can be set in following ways: | ||
|
||
- CLI: `SET @@execute_mode='offline'` | ||
|
||
The default CLI mode is also offline. | ||
|
||
- REST APIs, Java/Python SDK: offline mode is not supported. | ||
|
||
## 3. The Online Preview Mode | ||
|
||
The cold-start's online data import, real-time data import and online data preview are all running in online preview mode. This mode is responsible for online data management and preview. Online data storage and computation is supported by the Tablet. | ||
|
||
The online preview mode has following main features. | ||
|
||
- In online preview mode, all commands are executed synchronously except the `LOAD DATA` command which is used when importing online data. `LOAD DATA` is executed asynchronously in non-blocking mode as the way it executed in offline mode. | ||
- In order to view related data, only simple `SELECT ` commands on columns are supported in online preview mode currently, complex SQL queries for this purpose are not supported. As a result, this execution mode is not suitable for SQL feature development and optimizing, whcih should be completed in the offline mode or by standalone version. | ||
|
||
The online preview mode can be set in following ways: | ||
|
||
- CLI: `SET @@execute_mode='online'` | ||
- REST APIs, Java/Python SDK: these tools can be executed only in online mode. | ||
|
||
## 4. The Online Request mode | ||
|
||
After feature extraction plan is deployed and online data is imported, the real-time feature extraction service is ready. Then real-time features can be extracted via request mode, which is supported by REST APIs and Java/Python SDK. The online request mode is OpenMLDB's unique execution mode for supporting real-time computation online, which is distinct from common SQL queries in other databases. | ||
|
||
The online request mode requires following three inputs. | ||
|
||
1. **SQL feature script**, which is the SQL script used during feature deployment stage, defining the computing logic for feature extraction. | ||
2. **Online data**, acting as the window data in cold-start stage and real-time imported data after cold-start. Generally, it is the latest data in the time window that is defined by the feature script. For example, the aggregation function of an SQL script defines a time window for the last three months, so the online storage needs to keep the corresponding data for the last three months. | ||
3. **A real-time request row** , containing real-time behaviors that are currently taking place. The real-time request row is used for real-time feature extraction, such as credit card information in anti-fraud scenarios or search keywords in recommendation scenarios. | ||
|
||
Based on these three inputs, the online request mode will return a feature extraction result for each real-time request row. Its computing logic is: the request row will be virtually inserted into the correct position of the online data table according to the logic of SQL script (such as `PARTITION BY`, `ORDER BY`, etc.), and then the feature script will be applied to that request row, finally the result feature will be returned. The following diagram illustrates the procedure of the request mode. | ||
|
||
![modes-request](images/modes-request-en.png) | ||
|
||
The online request mode can be set in the following ways: | ||
|
||
- CLI: The online request mode isn't supported currently. | ||
- REST APIs: Support single-line and batch **request rows** request, see: [REST APIs](https://openmldb.ai/docs/en/main/quickstart/rest_api.html) for deatil. | ||
- Java SDK: Support single-line and batch **request rows** request, see: [Java SDK Quickstart ](https://openmldb.ai/docs/en/main/quickstart/java_sdk.html) for detail. | ||
- Python SDK: Only support single-line **request row** request, see:[Python SDK Quickstart](https://openmldb.ai/docs/en/main/quickstart/python_sdk.html)for detail. |