# Web Interface Design

This design doc describes the features and web pages needed to let users manage cloud PaddlePaddle jobs.

## Feature List

- Account Management
  - Registration, with an email informing the user whether registration succeeded
  - Account login/logout
  - Password changing and resetting
  - Download SSL keys
- Jupyter Notebook
  - Private Jupyter Notebook environment to run Python scripts
  - Private workspace
  - Submit jobs from the Jupyter Notebook
- Job Dashboard
  - Job history and currently running jobs
  - Performance monitoring
  - Quota monitoring
- Datasets
  - Public dataset viewing
  - Upload/download private datasets
  - Share datasets
- Models
  - Upload/download model files
  - Share/publish models
- Paddle Board
  - Training metrics visualization
    - cost
    - evaluator
    - user-defined metrics
- Serving
  - Submit serving instances
  - Deactivate serving
  - Serving performance monitoring
## Account Management

The account management pages are designed to satisfy multi-tenant use cases. Each account has a unique account ID for login, and the account owns one access key bound to one unique [Kubernetes namespace](https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/) in the cluster. Multiple users can log in to this account ID and operate on jobs and data files, but only the "master user" can make changes such as increasing the quota or managing account settings.

One example is [AWS IAM](https://aws.amazon.com/iam/?nc2=h_m1), but we can keep our design much simpler than that.

The current implementation in this repo supports only one user per Kubernetes namespace; we can implement multi-tenancy in the near future.
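
The provisioning flow is not fixed by this doc; as a rough sketch, registration could create one namespace and one access credential per account with the Kubernetes Python client. All names below are hypothetical.

```python
# Sketch: provision one Kubernetes namespace plus an access credential per
# account. The "acct-<id>" naming and the "owner" service account are
# assumptions for illustration, not the current implementation.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster
core = client.CoreV1Api()

def provision_account(account_id):
    namespace = "acct-%s" % account_id
    # One unique namespace per account.
    core.create_namespace(
        client.V1Namespace(metadata=client.V1ObjectMeta(name=namespace)))
    # A service account whose token can act as the account's access key.
    core.create_namespaced_service_account(
        namespace,
        client.V1ServiceAccount(metadata=client.V1ObjectMeta(name="owner")))
    return namespace
```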
Once a user has logged in, s/he will be redirected to the "Job Dashboard" page.

## Jupyter Notebook

When a user first enters the notebook page, start a [Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/) using the image `docker.paddlepaddle.org/book` in the Kubernetes cluster and add an [Ingress](https://kubernetes.io/docs/concepts/services-networking/ingress/) endpoint. This is already implemented.
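
For reference, a minimal sketch of creating such a notebook Deployment with the Kubernetes Python client is shown below; the resource names, labels, port, and namespace are assumptions for illustration, and the Ingress would be created similarly through the networking API.

```python
# Sketch: create a per-user notebook Deployment running the
# docker.paddlepaddle.org/book image. Names and the namespace are hypothetical.
from kubernetes import client, config

config.load_incluster_config()
apps = client.AppsV1Api()

labels = {"app": "notebook-user1"}
deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="notebook-user1"),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels=labels),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels=labels),
            spec=client.V1PodSpec(containers=[
                client.V1Container(
                    name="notebook",
                    image="docker.paddlepaddle.org/book",
                    ports=[client.V1ContainerPort(container_port=8888)],
                ),
            ]),
        ),
    ),
)
apps.create_namespaced_deployment(namespace="acct-42", body=deployment)
```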
<img src="pictures/notebook.png" width="500px" align="center">

Users can write Python programs in the web page and save them; the programs are stored on cloud storage. Users can also run a snippet like the one below to submit a cluster training job:
```python
# Submit a distributed training job from the notebook.
# The argument values below are only examples.
create_cloud_job(
    name="example-job",     # job name
    num_trainer=4,          # number of trainer processes
    mem_per_trainer="8Gi",  # memory per trainer
    gpu_per_trainer=1,      # GPUs per trainer
    cpu_per_trainer=4,      # CPUs per trainer
    num_ps=2,               # number of parameter servers
    mem_per_ps="4Gi",       # memory per parameter server
    cpu_per_ps=2,           # CPUs per parameter server
)
```
After this, job description and performance monitoring pages will be available in the "Job Dashboard".

## Job Dashboard

### Job history and currently running jobs

A web page containing a table that lists the jobs matching the user's filters. Users can only list jobs that they submitted themselves.

| jobname | start time | age | success | fails | actions       |
| ------- | ---------- | --- | ------- | ----- | ------------- |
| test1   | 2017-01-01 | 17m | 0       | 0     | stop/log/perf |

Users can filter the list using the following criteria (see the sketch after this list):

- status: running/stopped/failed
- time: job start time
- jobname: search by job name
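
The backend framework is not specified in this doc; as a sketch, a job-listing endpoint with these filters might look like the Flask-style handler below, where the in-memory `JOBS` list stands in for the real job store already restricted to the requesting user's own jobs.

```python
# Sketch of a job-listing endpoint with status/jobname filters, assuming a
# Flask-style REST backend. JOBS is a stand-in for the real job store.
from flask import Flask, jsonify, request

app = Flask(__name__)

JOBS = [
    {"jobname": "test1", "start_time": "2017-01-01", "age": "17m",
     "status": "running", "success": 0, "fails": 0},
]

@app.route("/api/v1/jobs")
def list_jobs():
    status = request.args.get("status")    # running/stopped/failed
    jobname = request.args.get("jobname")  # substring match on job name
    rows = [
        job for job in JOBS
        if (status is None or job["status"] == status)
        and (jobname is None or jobname in job["jobname"])
    ]
    return jsonify(rows)
```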
Viewing job logs:

Clicking the "log" button on the right of a job pops up a console frame at the bottom of the page showing the tail of the job log; by default it shows the first pod's log. On the left side of the pop-up console frame there is a vertical list of the job's pods; clicking one of the pods makes the console show its log.
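
A sketch of how the console frame could fetch the pod list and the log tail with the Kubernetes Python client; the namespace and the `paddle-job` label selector are assumptions for illustration.

```python
# Sketch: list a job's pods and read the tail of one pod's log for the
# console frame. The namespace and label selector are hypothetical.
from kubernetes import client, config

config.load_incluster_config()
core = client.CoreV1Api()

def tail_pod_log(namespace, pod_name, lines=100):
    # Returns the last `lines` lines of the pod's log as plain text.
    return core.read_namespaced_pod_log(
        name=pod_name, namespace=namespace, tail_lines=lines)

pods = core.list_namespaced_pod("acct-42", label_selector="paddle-job=test1")
pod_names = [p.metadata.name for p in pods.items]  # left-hand pod list
print(tail_pod_log("acct-42", pod_names[0]))        # default: first pod's log
```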
### Performance Monitoring

A web page containing graphs that monitor the job's resource usage over time (see the sketch after this list):

- CPU usage
- GPU usage
- memory usage
- network bandwidth
- disk I/O
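
This doc does not pick a metrics backend; assuming the cluster's container metrics are scraped into Prometheus, the page could pull a CPU-usage time series as sketched below. The Prometheus address, metric name, and labels are assumptions that depend on the monitoring setup.

```python
# Sketch: query a CPU-usage time series from Prometheus for one job's pods.
# The Prometheus URL and the label names are assumptions for illustration.
import requests

PROMETHEUS = "http://prometheus.monitoring:9090"

def cpu_usage(namespace, jobname, start, end, step="60s"):
    query = ('sum(rate(container_cpu_usage_seconds_total{namespace="%s",'
             'pod=~"%s-.*"}[5m]))' % (namespace, jobname))
    resp = requests.get(PROMETHEUS + "/api/v1/query_range",
                        params={"query": query, "start": start,
                                "end": end, "step": step})
    return resp.json()["data"]["result"]
```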
### Quota Monitoring

A web page displaying the total quota and the quota currently in use. It also displays total CPU time and GPU time over the last day, week, and month.
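
If quotas are enforced with Kubernetes `ResourceQuota` objects (an assumption, not specified here), the page can read both the total quota and the current usage directly from the account's namespace:

```python
# Sketch: read total quota (status.hard) and current usage (status.used)
# from the account's namespace. The namespace name is hypothetical.
from kubernetes import client, config

config.load_incluster_config()
core = client.CoreV1Api()

for quota in core.list_namespaced_resource_quota("acct-42").items:
    print(quota.metadata.name, quota.status.hard, quota.status.used)
```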
## Datasets and Models

Datasets and models are handled in much the same way: both are essentially a simple file management and sharing service.

- File listing and viewing page
- Upload/download page
- File sharing page
## Paddle Board

A web page containing graphs showing the job's internal status during training, with metrics like:

- cost (there can be multiple costs)
- evaluator output
- user-defined metrics

Users can calculate metrics and define the graphs like this:
```python
# Sketch from the design: my_train_network and my_evaluator are the user's own
# network configuration functions; output and label come from that configuration.
cost = my_train_network()
evaluator = my_evaluator(output, label)

def my_metric_graph(output, label):
    metric = paddle.auc(output, label)
    return metric

my_metric = my_metric_graph(output, label)
my_metric_value = output  # a raw variable's value can also be plotted

draw_board(cost, evaluator)
draw_board(my_metric)
draw_board(my_metric_value)
```
Calling `draw_board` writes graph data files to the distributed storage; the web page then loads the data and refreshes the graphs.
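
The on-disk format is not decided in this doc; a hypothetical trainer-side sketch of `draw_board` could simply append metric values to a JSON-lines file on the shared storage that the web page polls:

```python
# Hypothetical sketch of draw_board: append scalar metric values to a
# JSON-lines file on the distributed storage. The /pfs path, file layout,
# and signature are assumptions, not a fixed API.
import json
import os
import time

BOARD_DIR = "/pfs/home/user1/jobs/example-job/board"  # hypothetical mount

def draw_board(*metrics, step=None):
    os.makedirs(BOARD_DIR, exist_ok=True)
    with open(os.path.join(BOARD_DIR, "metrics.jsonl"), "a") as f:
        for metric in metrics:
            f.write(json.dumps({
                "name": getattr(metric, "name", str(metric)),
                "value": float(metric),  # assumes the metric is scalar-like
                "step": step,
                "time": time.time(),
            }) + "\n")
```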
## Serving

After training, or after uploading pre-trained models, a user can start a serving instance to serve the model as an inference HTTP service.

The Serving web page contains a table listing the currently running serving instances and a "Launch" button to configure and start a serving program.

Clicking the "Launch" button on this web page pops up a modal dialog to configure the job:
1. Model `tar.gz` files uploaded to the cloud.
1. Inference network configuration in `.proto` format; the user can also define the network in Python on the web page.
1. The amount of CPU/GPU resources to use in total for serving the model; the more resources there are, the more concurrent calls can be served.

Then, after clicking the "Launch" button in the pop-up dialog, a Kubernetes Deployment will be created to serve the model. The running serving instances will be listed on the same page.
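
For illustration, a client call against the resulting inference HTTP service might look like the snippet below; the URL and the JSON request/response layout are assumptions, since the serving API is not specified in this doc.

```python
# Sketch: call the inference HTTP service for a launched serving instance.
# The endpoint URL and payload format are hypothetical.
import requests

resp = requests.post(
    "http://serving.example.paddlepaddle.org/api/v1/predict",
    json={"instances": [[0.1, 0.2, 0.3]]},
)
print(resp.status_code, resp.json())
```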
Users can also scale up or shrink the resources used by the serving instances.
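
A sketch of how scaling could be done behind such an action, using the Kubernetes Python client; the Deployment name and namespace are hypothetical.

```python
# Sketch: change the replica count of a serving Deployment.
from kubernetes import client, config

config.load_incluster_config()
apps = client.AppsV1Api()

apps.patch_namespaced_deployment_scale(
    name="serving-example",   # hypothetical serving instance name
    namespace="acct-42",
    body={"spec": {"replicas": 3}},
)
```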