Paddle cloud web features design #378

Closed · wants to merge 7 commits
File renamed without changes.
Binary file added doc/design/pictures/notebook.png
153 changes: 153 additions & 0 deletions doc/design/web.md
# Web Interface design

This design doc describes the features and web pages needed to let users manage cloud PaddlePaddle jobs.

## Feature List

- Account Management
- Registration, send email to inform if registration succeeded
- Account Login/Logout
- Password changing, find back
> **Collaborator:** I think we don't need "find back"; maybe change it to "resetting", since we would probably only store a hashed password.

- Download SSL keys
> **Collaborator:** What are the SSL keys for? I thought authentication is currently done via token?

- Jupyter Notebook
- Private Jupyter Notebook environment to run Python scripts
- Private workspace
- Submit jobs from the Jupyter Notebook
- Job Dashboard
- Job history and currently running jobs
- Performance Monitoring
- Quota Monitoring
- Datasets
- Public Dataset viewing
> **Collaborator:** Dataset => dataset

- Upload/Download private datasets
- Share datasets
> **Collaborator:** Maybe this needs to be more specific: does it mean the dataset can be shared with anyone by a link, or just set visible to a certain group (similar to Unix file read permissions)?

- Models
- Upload/Download models file
- Share/Publish Models
- Paddle Board
- Training metrics visualization
- cost
- evaluator
- user-defined metrics
- Serving
- Submit serving instances
- Deactivate serving
- Serving performance monitoring
> **Collaborator:** I think we also need a feature: scale serving instances, and we can use HPA to implement auto-scaling.


## Account Management

The account management page is designed to satisfy multi-tenant use cases. Each account has a unique account ID for login, and the account owns one access key to one dedicated [Kubernetes namespace](https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/) in the cluster. Multiple users can log in to this account ID and operate jobs and data files. Only the "master user" can make modifications such as increasing quota or managing account settings.
> **Collaborator:** "log in to this account ID", or log in to one's own account which belongs to the group? If so, "master user" could be "group owner".


One example is [AWS IAM](https://aws.amazon.com/iam/?nc2=h_m1), but ours can be much simpler than that.

The current implementation under this repo supports only one user per Kubernetes namespace. We can implement multi-tenancy in the near future.

Once a user has logged in, s/he will be redirected to the "Job Dashboard" page.
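As the review suggests, the server should store only hashed passwords. Below is a minimal sketch, assuming PBKDF2 for hashing and a one-namespace-per-account convention; the function names and the `paddle-<id>` naming scheme are hypothetical, not the actual implementation:

```python
import hashlib
import os


def hash_password(password, salt=None):
    """Store only a salted hash, never the plain-text password."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest


def verify_password(password, salt, digest):
    """Re-derive the hash from the submitted password and compare."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000) == digest


def namespace_for_account(account_id):
    """One account maps to one dedicated Kubernetes namespace."""
    return "paddle-%s" % account_id.lower()
```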

## Jupyter Notebook

When a user first enters the notebook page, start a [Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/) using the image `docker.paddlepaddle.org/book` in the Kubernetes cluster, and add an [Ingress](https://kubernetes.io/docs/concepts/services-networking/ingress/) endpoint. This is already implemented.
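A sketch of the manifests this page's backend might create; the field values other than the image name are illustrative assumptions (the actual Service wiring and URL scheme are not specified in this doc):

```python
def notebook_deployment(user_id):
    """Per-user notebook Deployment running the PaddlePaddle book image."""
    name = "notebook-%s" % user_id
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {"containers": [{
                    "name": "notebook",
                    "image": "docker.paddlepaddle.org/book",
                    "ports": [{"containerPort": 8888}],
                }]},
            },
        },
    }


def notebook_ingress(user_id):
    """Ingress rule routing a per-user path to the notebook's service."""
    name = "notebook-%s" % user_id
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": name},
        "spec": {"rules": [{"http": {"paths": [{
            "path": "/notebook/%s" % user_id,
            "pathType": "Prefix",
            "backend": {"service": {"name": name, "port": {"number": 8888}}},
        }]}}]},
    }
```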

<img src="pictures/notebook.png" width="500px" align="center">

Users can write Python programs on the web page and save them, and the programs will be saved to cloud storage. Users can also run a script like the one below to submit a cluster training job:
> **Collaborator:** python => Python


```python
create_cloud_job(
name,
num_trainer,
mem_per_trainer,
gpu_per_trainer,
cpu_per_trainer,
num_ps,
mem_per_ps,
cpu_per_ps,
)
```

After this, job description and performance monitoring pages will be viewable from the "Job Dashboard".
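A hypothetical sketch of what `create_cloud_job` might produce on the server side; the returned spec shape and the example argument values are assumptions, not the actual implementation:

```python
def create_cloud_job(name, num_trainer, mem_per_trainer, gpu_per_trainer,
                     cpu_per_trainer, num_ps, mem_per_ps, cpu_per_ps):
    """Sketch: translate the call into trainer and parameter-server specs."""
    return {
        "name": name,
        "trainer": {
            "replicas": num_trainer,
            "resources": {"cpu": cpu_per_trainer,
                          "memory": mem_per_trainer,
                          "gpu": gpu_per_trainer},
        },
        "ps": {
            "replicas": num_ps,
            "resources": {"cpu": cpu_per_ps, "memory": mem_per_ps},
        },
    }


# Example call with illustrative values.
job = create_cloud_job("fit-a-line", num_trainer=4, mem_per_trainer="2Gi",
                       gpu_per_trainer=1, cpu_per_trainer=2,
                       num_ps=2, mem_per_ps="1Gi", cpu_per_ps=1)
```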

## Job Dashboard

### Job history and currently running jobs

A web page containing a table that lists jobs matching the user's filters. Users can only list jobs that they submitted themselves.

| jobname | start time | age | success | fails | actions       |
| ------- | ---------- | --- | ------- | ----- | ------------- |
| test1   | 2017-01-01 | 17m | 0       | 0     | stop/log/perf |

> **Collaborator:** Maybe we need more information on the job list, such as PS_READY, PS_TOTAL, TRAINER_READY, TRAINER_TOTAL.

Users can filter the list using:

- status: running/stopped/failed
- time: job start time
- jobname: search by jobname

Viewing job logs:

Clicking the "log" button to the right of a job pops up a console frame at the bottom of the page showing the tail of the job log; by default it shows the first pod's log. On the left side of the pop-up console frame there is a vertical list of the job's pods; clicking one of them switches the console to that pod's log.

### Performance Monitoring

A web page containing graphs that monitor the job's resource usage over time:

- CPU usage
- GPU usage
- memory usage
- network bandwidth
- disk I/O
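To draw these graphs, raw metric samples would need to be bucketed over time; a minimal sketch of one possible aggregation (the sampling pipeline itself is not specified in this doc):

```python
def downsample(samples, bucket_seconds=60):
    """Average raw (timestamp, value) samples into fixed time buckets."""
    buckets = {}
    for ts, value in samples:
        key = int(ts // bucket_seconds)
        buckets.setdefault(key, []).append(value)
    # One (bucket_start, mean) point per bucket, in time order.
    return [(key * bucket_seconds, sum(vals) / len(vals))
            for key, vals in sorted(buckets.items())]
```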

### Quota Monitoring

A web page displaying total quota and quota currently in use.

Also display total CPU time and GPU time over the last day, week, and month.
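A sketch of the quota computation behind this page; the resource names and report shape are assumptions:

```python
def quota_report(total, in_use):
    """Per-resource totals, usage, and percentage for the quota page."""
    return {
        res: {
            "total": total[res],
            "in_use": in_use.get(res, 0),
            "percent": 100.0 * in_use.get(res, 0) / total[res],
        }
        for res in total
    }
```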

## Datasets and Models

Datasets and models are quite similar; both behave like a simple file management and sharing service.
> **Collaborator:** Maybe we can add more information about the file sharing service, such as whether we can share files between users or namespaces, or just publish a link?


- File listing and viewing page
> **Collaborator:** file => File

- Upload/download page
- File sharing page
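Per the review discussion on sharing scope, one option is a Unix-like visibility check; a sketch in which the field names are hypothetical:

```python
def can_read(dataset, user):
    """Unix-like visibility: public, owner, or explicitly shared users."""
    if dataset["visibility"] == "public":
        return True
    if user == dataset["owner"]:
        return True
    return user in dataset.get("shared_with", [])
```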

## Paddle Board

> **Collaborator:** I didn't expect draw_board function calls in user programs. I am not sure how configurable TensorBoard is, but in my mind, PaddleBoard just needs to be able to present outputs from Evaluator operators aggregated/accumulated over minibatches.

> **Collaborator (author):** If we don't insert function calls in user programs, we need to automatically find out which variables represent the cost and the evaluator operator by default and draw their values on the web page. I'm not sure how to do that for now. Here is a short example of how TensorBoard configures metrics, using tf.summary: the user explicitly specifies which values to output for drawing.


A web page containing graphs showing the job's internal status while it is training, with metrics like:

- cost (there can be multiple costs)
- evaluator output
- user-defined metrics

Users can calculate metrics and define the graph like:

```python
cost = my_train_network()
evaluator = my_evaluator(output, label)

def my_metric_graph(output, label):
    metric = paddle.auc(output, label)
    return metric

my_metric = my_metric_graph(output, label)
my_metric_value = output

draw_board(cost, evaluator)
draw_board(my_metric)
draw_board(my_metric_value)
```

> **Collaborator:** I think `draw_board` should take only one variable that returns a scalar, and an optional name. E.g., `draw_board(evaluator, "evaluate result")`.

Calling `draw_board` will output graph files on the distributed storage, and then the web page can load the data and refresh the graph.
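The on-storage format of those graph files is not specified here. One possible sketch is a JSON-lines file per job that the web page polls; note the `draw_board` signature below is hypothetical and differs from the variable-based API above:

```python
import io
import json


def draw_board(name, step, value, out):
    """Append one scalar metric point; the web page polls this file to refresh."""
    out.write(json.dumps({"metric": name, "step": step, "value": value}) + "\n")


# Demonstrate with an in-memory buffer standing in for distributed storage.
buf = io.StringIO()
draw_board("cost", 1, 0.9, buf)
draw_board("cost", 2, 0.7, buf)
points = [json.loads(line) for line in buf.getvalue().splitlines()]
```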

## Serving

> **Collaborator:** It seems that a serving job differs from a training job in that the former doesn't have a master process. If so, each process in a serving job needs to present its own metrics, and there is no chance for them to present a PaddleBoard?

> **Collaborator (author):** I'm not sure what metrics to display when running inference (serving); the neural network configuration may not define cost functions, and there's no label to evaluate the result. Metrics like QPS (queries per second) are more like "monitoring", not PaddleBoard.


After training or uploading a pre-trained model, a user can start a serving instance to serve the model as an inference HTTP service.

The Serving web page contains a table listing the currently running serving instances and a "Launch" button to configure and start the serving program.

Clicking the "Launch" button on this page pops up a modal dialogue to configure the job:

1. The model `tar.gz` files on the cloud.
> **Collaborator:** Capitalize the first letter, same as below. "model tar.gz files to the cloud" => "The path of model files with suffix `tar.gz` on the cloud."

1. The inference network configuration in `.proto` format; users can also define the network in Python on the web page.
1. The number of CPU/GPU resources in total to use for serving the model; the more resources there are, the more concurrent calls can be served.
> **Collaborator:** Should we change "number of CPU/GPU resource" to the number of instances and CPU/Mem/GPU per instance? Otherwise it's hard for us to figure out how many instances to run (we don't know the model's properties or the user's serving requirements).

> **Collaborator:** Agree with @helinwang. Additionally, we can calculate the total resource usage and display it on the web site.


Then, clicking the "Launch" button on the pop-up dialogue creates a Kubernetes Deployment that serves the model. The current serving instances are listed on the same page.

Users can also scale the resources used by the serving instances up or down.
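A sketch of how the dialogue inputs could map to a serving Deployment, with scaling expressed as patching the replica count (the annotation key, image wiring, and spec shape are assumptions, not the actual implementation):

```python
def serving_deployment(name, model_path, replicas, cpu, memory, gpu=0):
    """Sketch: one Deployment per serving instance group."""
    app = "serving-%s" % name
    resources = {"cpu": cpu, "memory": memory}
    if gpu:
        resources["nvidia.com/gpu"] = gpu
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        # Hypothetical annotation recording which model archive to load.
        "metadata": {"name": app, "annotations": {"model-path": model_path}},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": app}},
            "template": {
                "metadata": {"labels": {"app": app}},
                "spec": {"containers": [{
                    "name": "serving",
                    "resources": {"limits": resources},
                }]},
            },
        },
    }


def scale(deployment, replicas):
    """Scaling up or down is just a patch of spec.replicas."""
    deployment["spec"]["replicas"] = replicas
    return deployment
```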