An engine template is an almost-complete implementation of an engine. In this Engine Template, we have integrated Apache Spark MLlib's Gradient-Boosted Trees by default.
The default use case of this classification Engine Template is to predict the service plan (plan) a user will subscribe to, based on three of the user's properties: attr0, attr1 and attr2.
You can customize it easily to fit your specific use case and needs.
We are going to show you how to create your own classification engine for production use based on this template.
By default, the template requires the following events to be collected:
- user $set event, which sets the attributes of the user
The input query is an array of feature values (3 features):
{"features": [0, 2, 0]}
The output is the predicted label:
{"label":0.0}
We will be using the sample data set from dataset
The training sample events have the following format (generated by data/import_eventserver.py):
client.create_event(
    event="$set",
    entity_type="user",
    entity_id=str(count),  # use the count num as user ID
    properties={
        "attr0": int(attr[0]),
        "attr1": int(attr[1]),
        "attr2": int(attr[2]),
        "plan": int(plan)
    }
)
First, install PredictionIO 0.9.5 if you haven't already.
Let's say you have installed PredictionIO at /home/yourname/PredictionIO/. For convenience, add PredictionIO's binary command path to your PATH, i.e. /home/yourname/PredictionIO/bin:
$ PATH=$PATH:/home/yourname/PredictionIO/bin; export PATH
If you launched a PredictionIO AWS instance, the path is /opt/PredictionIO/bin.
Once you have completed the installation process, please make sure all the components (PredictionIO Event Server, Elasticsearch, and HBase) are up and running.
If you launched a PredictionIO AWS instance, you can skip pio-start-all. All components should have been started automatically.
If you are using PostgreSQL or MySQL, run the following to start PredictionIO Event Server:
$ pio eventserver &
If you are instead running HBase and Elasticsearch, run the following to start the PredictionIO Event Server, HBase, and Elasticsearch together:
$ pio-start-all
You can check the status by running:
$ pio status
If everything is OK, you should see the following outputs:
...
(sleeping 5 seconds for all messages to show up...)
Your system is all ready to go.
To further troubleshoot, please see FAQ - Using PredictionIO.
Now let's create a new engine called GBRT_Classification by downloading the Classification Engine Template. Go to a directory where you want to put your engine and run the following, which clones the template into a directory named MyClassification:
$ git clone https://github.com/mohanaprasad1994/PredictionIO-MLlib-Decision-Trees-Template.git MyClassification
You will need to create a new App in PredictionIO to store all the data of your app. The data collected will be used for machine learning modeling.
Let's assume you want to use this engine in an application named "MyApp1". Run the following to create a new app "MyApp1":
$ pio app new MyApp1
You should find the following in the console output:
...
[INFO] [App$] Initialized Event Store for this app ID: 1.
[INFO] [App$] Created new app:
[INFO] [App$] Name: MyApp1
[INFO] [App$] ID: 1
[INFO] [App$] Access Key: 3mZWDzci2D5YsqAnqNnXH9SB6Rg3dsTBs8iHkK6X2i54IQsIZI1eEeQQyMfs7b3F
Note that an App ID and an Access Key are created for the app "MyApp1". You will need the Access Key when you collect data with the Event Server for this app.
You can list all of the apps you have created, along with their IDs and Access Keys, by running the following command:
$ pio app list
You should see a list of apps created. For example:
[INFO] [App$] Name | ID | Access Key | Allowed Event(s)
[INFO] [App$] MyApp1 | 1 | 3mZWDzci2D5YsqAnqNnXH9SB6Rg3dsTBs8iHkK6X2i54IQsIZI1eEeQQyMfs7b3F | (all)
[INFO] [App$] MyApp2 | 2 | io5lz6Eg4m3Xe4JZTBFE13GMAf1dhFl6ZteuJfrO84XpdOz9wRCrDU44EUaYuXq5 | (all)
[INFO] [App$] Finished listing 2 app(s).
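If you want to pick up the Access Key programmatically rather than copying it by hand, the listing can be parsed. The following is a minimal sketch that assumes the output keeps the pipe-separated layout shown above; parse_app_list is a hypothetical helper, not part of PredictionIO or its SDK.

```python
def parse_app_list(output):
    """Map app name -> (app ID, access key) from `pio app list` output.

    Hypothetical helper; assumes the pipe-separated layout shown above.
    """
    apps = {}
    for line in output.splitlines():
        # Drop the "[INFO] [App$]" logging prefix if present.
        text = line.split("[App$]", 1)[-1]
        fields = [field.strip() for field in text.split("|")]
        # Data rows have 4 fields and a numeric ID; the header row does not.
        if len(fields) == 4 and fields[1].isdigit():
            apps[fields[0]] = (int(fields[1]), fields[2])
    return apps


sample = """\
[INFO] [App$] Name | ID | Access Key | Allowed Event(s)
[INFO] [App$] MyApp1 | 1 | 3mZWDzci2D5YsqAnqNnXH9SB6Rg3dsTBs8iHkK6X2i54IQsIZI1eEeQQyMfs7b3F | (all)
[INFO] [App$] Finished listing 1 app(s).
"""
apps = parse_app_list(sample)
print(apps["MyApp1"][1])  # the Access Key of MyApp1
```

The header and summary lines are skipped because their second field is not a number.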
Next, let's collect some training data. By default, the Classification Engine Template reads 4 properties of a user record: attr0, attr1, attr2 and plan. This template requires '$set' user events.
This template can easily be customized to use a different or larger number of attributes.
You can send this data to the PredictionIO Event Server in real time by making an HTTP request or through the EventClient of an SDK. Please see App Integration Overview for more details on how to integrate your app with an SDK.
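The HTTP route can be sketched with the Python 3 standard library. This is a minimal sketch under stated assumptions: the Event Server is taken to listen on http://localhost:7070 (its usual default), make_event_request and the user ID u1 are hypothetical names for illustration, and YOUR_ACCESS_KEY stands in for the Access Key of your app. The request is only constructed here, not sent.

```python
import json
import urllib.request


def make_event_request(event, access_key, server="http://localhost:7070"):
    """Prepare (but do not send) a POST request that creates one event.

    Hypothetical helper; the server address is an assumed default.
    """
    url = "{}/events.json?accessKey={}".format(server, access_key)
    return urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# A `$set` event carrying the four properties the template reads.
event = {
    "event": "$set",
    "entityType": "user",
    "entityId": "u1",  # made-up user ID
    "properties": {"attr0": 2, "attr1": 0, "attr2": 0, "plan": 1},
}
request = make_event_request(event, "YOUR_ACCESS_KEY")
print(request.get_method(), request.full_url)

# To actually send it (requires a running Event Server):
# urllib.request.urlopen(request)
```

Note that the REST body uses camelCase keys (entityType, entityId), whereas the Python SDK call takes snake_case arguments.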
Although you can integrate your app with PredictionIO and collect training data in real time, we are going to import a sample dataset with the provided scripts for demonstration purposes.
A Python import script, import_eventserver.py, is provided to import the data to the Event Server using the Python SDK. Please upgrade to the latest Python SDK.
First, you will need to install the Python SDK in order to run the sample data import script. To install the Python SDK, run:
$ pip install predictionio
Make sure you are in the MyClassification directory created by the git clone step. Replace the value of the access_key parameter with your Access Key and run:
$ cd MyClassification
$ python data/import_eventserver.py --access_key obbiTuSOiMzyFKsvjjkDnWk1vcaHjcjrv9oT3mtN3y6fOlpJoVH459O1bPmDzCdv
You should see the following output:
Importing data...
N events are imported.
Now the training data is stored as events inside the Event Store.
Now you can build, train, and deploy the engine. First, make sure you are in the MyClassification directory.
Under the directory, you should find an engine.json file; this is where you specify parameters for the engine. Make sure the appName defined in the file matches the name of your app (for example, MyApp1). This links the template engine with the app.
Parameters for the Gradient-Boosted Trees model are also set here:
- numIterations: the number of boosting iterations used in training
- numClasses: the number of classes
- maxDepth: the maximum depth of each tree
{
  "id": "default",
  "description": "Default settings",
  "engineFactory": "org.template.classification.ClassificationEngine",
  "datasource": {
    "params": {
      "appName": "AppName"
    }
  },
  "algorithms": [
    {
      "name": "gbrt",
      "params": {
        "numIterations": 3,
        "numClasses": 2,
        "maxDepth": 5
      }
    }
  ]
}
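If you prefer to script the appName change instead of editing the file by hand, something like the following could work. It is a minimal sketch: set_app_name is a hypothetical helper, and the JSON string is a trimmed-down copy of the engine.json shown above rather than the file on disk.

```python
import json


def set_app_name(engine_json_text, app_name):
    """Return engine.json content with datasource.params.appName replaced.

    Hypothetical helper, not part of the template.
    """
    config = json.loads(engine_json_text)
    config["datasource"]["params"]["appName"] = app_name
    return json.dumps(config, indent=2)


# Trimmed-down copy of the engine.json shown above.
original = """
{
  "id": "default",
  "datasource": {"params": {"appName": "AppName"}},
  "algorithms": [
    {"name": "gbrt",
     "params": {"numIterations": 3, "numClasses": 2, "maxDepth": 5}}
  ]
}
"""
updated = json.loads(set_app_name(original, "MyApp1"))
print(updated["datasource"]["params"]["appName"])
print(updated["algorithms"][0]["params"])
```

To apply this to the real file, read engine.json into a string, pass it through set_app_name, and write the result back. The algorithm parameters are left untouched.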
Start by building your GBRT_Classification engine. Run the following command:
$ pio build --verbose
This command may take a few minutes the first time; all subsequent builds should take less than a minute. You can also run it without --verbose if you don't want to see all the log messages.
Upon successful build, you should see a console message similar to the following.
[INFO] [Console$] Your engine is ready for training.
To train your engine, run the following command:
$ pio train
When your engine is trained successfully, you should see a console message similar to the following.
[INFO] [CoreWorkflow$] Training completed successfully.
Now your engine is ready to deploy. Run:
$ pio deploy
When the engine is deployed successfully and running, you should see a console message similar to the following:
[INFO] [HttpListener] Bound to /0.0.0.0:8000
[INFO] [MasterActor] Bind successful. Ready to serve.
Do not kill the deployed engine process.
By default, the deployed engine binds to http://localhost:8000. You can visit that page in your web browser to check its status.
Now you can try to retrieve predicted results. For example, to predict the label (i.e. plan in this case) of a user with attr0=2, attr1=0 and attr2=0, send the JSON { "features": [2, 0, 0] } to the deployed engine, and it will return a JSON of the predicted plan. Send a query by making an HTTP request or through the EngineClient of an SDK:
import predictionio
# Connect to the deployed engine.
engine_client = predictionio.EngineClient(url="http://localhost:8000")
# Query with the three feature values, in the order attr0, attr1, attr2.
print(engine_client.send_query({"features": [2, 0, 0]}))
The following is a sample JSON response:
{"label":0.0}
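The same query can also be made as a plain HTTP request, without the SDK. The sketch below uses only the Python 3 standard library and assumes the engine is deployed at http://localhost:8000 and accepts queries at /queries.json; make_query_request and parse_label are hypothetical helper names. The request is built but not sent, so the snippet runs without a deployed engine.

```python
import json
import urllib.request


def make_query_request(features, engine_url="http://localhost:8000"):
    """Prepare (but do not send) a POST to the engine's query endpoint."""
    body = json.dumps({"features": list(features)}).encode("utf-8")
    return urllib.request.Request(
        engine_url + "/queries.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_label(response_text):
    """Extract the predicted label from the engine's JSON response."""
    return json.loads(response_text)["label"]


request = make_query_request([2, 0, 0])
print(request.full_url, request.data.decode("utf-8"))

# Parse the sample response shown above.
print(parse_label('{"label":0.0}'))

# To query a running engine:
# with urllib.request.urlopen(request) as resp:
#     print(parse_label(resp.read().decode("utf-8")))
```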