-## Design Doc for Optimization as a Service [WIP]
-
-
-
-### Contents
-
-- [Design Doc for Optimization as a Service \[WIP\]](#design-doc-for-optimization-as-a-service-wip)
-  - [Contents](#contents)
-  - [Overview](#overview)
-  - [Workflow of OaaS](#workflow-of-oaas)
-  - [Class definition diagram](#class-definition-diagram)
-  - [Extensibility](#extensibility)
-
-### Overview
-
-Optimization as a service(OaaS) is a platform that enables users to submit quantization tasks for their models and automatically dispatches these tasks to one or multiple nodes for accuracy-aware tuning. OaaS is designed to parallelize the tuning process in two levels: tuning and model. At the tuning level, OaaS execute the tuning process across multiple nodes for one model. At the model level, OaaS allocate free nodes to incoming requests automatically.
-
-
-### Workflow of OaaS
-
-```mermaid
-sequenceDiagram
-    participant Studio
-    participant TaskMonitor
-    participant Scheduler
-    participant Cluster
-    participant TaskLauncher
-    participant ResultMonitor
-    Par receive task
-    Studio ->> TaskMonitor: P1-1. Post quantization Request
-    TaskMonitor ->> TaskMonitor: P1-2. Add task to task DB
-    TaskMonitor ->> Studio: P1-3. Task received notification
-    and Schedule task
-    loop
-    Scheduler ->> Scheduler: P2-1. Pop task from task DB
-    Scheduler ->> Cluster: P2-2. Apply for resources
-    Note over Scheduler, Cluster: the number of Nodes
-    Cluster ->> Cluster: P2-3. Check the status of nodes in cluster
-    Cluster ->> Scheduler: P2-4. Resources info
-    Note over Scheduler, Cluster: host:socket list
-    Scheduler ->> TaskLauncher: P2-5. Dispatch task
-    end
-    and Run task
-    TaskLauncher ->> TaskLauncher: P3-1. Run task
-    Note over TaskLauncher, TaskLauncher: mpirun -np 4 -hostfile hostfile python main.py
-    TaskLauncher ->> TaskLauncher: P3-2. Wait task to finish...
-    TaskLauncher ->> Cluster: P3-3. Free resource
-    TaskLauncher ->> ResultMonitor: P3-4. Report the Acc and Perf
-    ResultMonitor ->> Studio: P3-5. Post result to Studio
-    and Query task status
-    Studio ->> ResultMonitor: P4-1. Query the status of the submitted task
-    ResultMonitor ->> Studio: P4-2. Post the status of queried task
-    End
-
+# Get started
+
+- [Get started](#get-started)
+  - [Install Neural Solution](#install-neural-solution)
+    - [Prerequisites](#prerequisites)
+    - [Method 1. Using pip](#method-1-using-pip)
+    - [Method 2. Building from source](#method-2-building-from-source)
+  - [Start service](#start-service)
+  - [Submit task](#submit-task)
+  - [Query task status](#query-task-status)
+  - [Stop service](#stop-service)
+  - [Inspect logs](#inspect-logs)
+
+## Install Neural Solution
+### Prerequisites
+- Install [Anaconda](https://docs.anaconda.com/free/anaconda/install/)
+- Install [Open MPI](https://www.open-mpi.org/faq/?category=building#easy-build)
+- Python 3.8 or later
+
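+One possible way to satisfy these prerequisites is to work inside a dedicated conda environment, for example (a minimal sketch; the environment name `ns` and the Python version are illustrative):
+
+```shell
+# create and activate an isolated environment for Neural Solution
+conda create -n ns python=3.10 -y
+conda activate ns
+# check that Open MPI is installed and that mpirun is on PATH
+mpirun --version
+```
+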
+There are two ways to install Neural Solution:
+### Method 1. Using pip
 ```
+pip install neural-solution
+```
+### Method 2. Building from source

-The optimization process is divided into four parts, each executed in separate threads.
-
-- Part 1. Posting new quantization task. (P1-1 -> P1-2 -> P1-3)
-
-- Part 2. Resource allocation and scheduling. (P2-1 -> P2-2 -> P2-3 -> P2-4 -> P2-5)
-
-- Part 3. Task execution and reporting. (P3-1 -> P3-2 -> P3-3 -> P3-4 -> P3-5)
+```shell
+# get source code
+git clone https://github.com/intel/neural-compressor
+cd neural-compressor

-- Part 4. Updating the status. (P4-1 -> P4-2)
+# install neural compressor
+pip install -r requirements.txt
+python setup.py install

-### Class definition diagram
+# install neural solution
+pip install -r neural_solution/requirements.txt
+python setup.py neural_solution install
+```

+## Start service
+
+```shell
+# Start neural solution service with custom configuration
+neural_solution start --task_monitor_port=22222 --result_monitor_port=33333 --restful_api_port=8001
+
+# Help Manual
+neural_solution -h
+# Help output
+
+usage: neural_solution {start,stop} [-h] [--hostfile HOSTFILE] [--restful_api_port RESTFUL_API_PORT] [--grpc_api_port GRPC_API_PORT]
+                       [--result_monitor_port RESULT_MONITOR_PORT] [--task_monitor_port TASK_MONITOR_PORT] [--api_type API_TYPE]
+                       [--workspace WORKSPACE] [--conda_env CONDA_ENV] [--upload_path UPLOAD_PATH]
+
+Neural Solution
+
+positional arguments:
+  {start,stop}          start/stop service
+
+optional arguments:
+  -h, --help            show this help message and exit
+  --hostfile HOSTFILE   start backend serve host file which contains all available nodes
+  --restful_api_port RESTFUL_API_PORT
+                        start restful serve with {restful_api_port}, default 8000
+  --grpc_api_port GRPC_API_PORT
+                        start gRPC with {restful_api_port}, default 8000
+  --result_monitor_port RESULT_MONITOR_PORT
+                        start serve for result monitor at {result_monitor_port}, default 3333
+  --task_monitor_port TASK_MONITOR_PORT
+                        start serve for task monitor at {task_monitor_port}, default 2222
+  --api_type API_TYPE   start web serve with all/grpc/restful, default all
+  --workspace WORKSPACE
+                        neural solution workspace, default "./ns_workspace"
+  --conda_env CONDA_ENV
+                        specify the running environment for the task
+  --upload_path UPLOAD_PATH
+                        specify the file path for the tasks

+```
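+
+These options can be combined as needed. For example, to run tasks inside a specific conda environment and keep all artifacts in a custom workspace (the environment name and path below are placeholders):
+
+```shell
+neural_solution start --conda_env=ns --workspace=/path/to/custom/workspace
+```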

-```mermaid
-classDiagram
+## Submit task

+- For RESTful API: `[user@server hf_model]$ curl -H "Content-Type: application/json" --data @./task.json http://localhost:8000/task/submit/`
+- For gRPC API: `python -m neural_solution.frontend.gRPC.client submit --request="test.json"`

+> For more details, please reference the [API description](./description_api.md) and [examples](../../examples/README.md).
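+
+> The RESTful example above posts a local `task.json`. Its exact schema is defined in the [API description](./description_api.md); the sketch below only illustrates the overall shape of such a file, and every field name and value in it is a placeholder rather than the authoritative format:
+
+```shell
+# write an illustrative task description (placeholder fields; see the API description
+# for the authoritative schema), then submit it to the RESTful frontend
+cat > task.json << 'EOF'
+{
+    "script_url": "https://example.com/path/to/your_quantization_script.py",
+    "optimized": "False",
+    "arguments": [],
+    "approach": "static",
+    "requirements": [],
+    "workers": 1
+}
+EOF
+curl -H "Content-Type: application/json" --data @./task.json http://localhost:8000/task/submit/
+```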

-TaskDB "1" --> "*" Task
-TaskMonitor --> TaskDB
-ResultMonitor --> TaskDB
-Scheduler --> TaskDB
-Scheduler --> Cluster
+## Query task status

+Query the task status and result by `task_id`.

-class Task{
-    + status
-    + get_status()
-    + update_status()
-}
+- For RESTful API: `[user@server hf_model]$ curl -X GET http://localhost:8000/task/status/{task_id}`
+- For gRPC API: `python -m neural_solution.frontend.gRPC.client query --task_id={task_id}`

-class TaskDB{
-    - task_collections
-    + append_task()
-    + get_all_pending_tasks()
-    + update_task_status()
-}
-class TaskMonitor{
-    - task_db
-    + wait_new_task()
-}
-class Scheduler{
-    - task_db
-    - cluster
-    + schedule_tasks()
-    + dispatch_task()
-    + launch_task()
-}
+> For more details, please reference the [API description](./description_api.md) and [examples](../../examples/README.md).
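+
+> A simple way to wait for completion from the command line is to poll the RESTful endpoint. The sketch below assumes the response is JSON containing a `status` field whose final value is `"done"`; check the API description for the actual field names and values:
+
+```shell
+task_id=bdf0bd1b2cc14bc19bce12d4f9b333c7   # replace with the id returned when the task was submitted
+# poll every 10 seconds until the reported status is "done" (field name and value are assumptions)
+until curl -s http://localhost:8000/task/status/${task_id} | grep -q '"status": *"done"'; do
+    sleep 10
+done
+curl -s http://localhost:8000/task/status/${task_id}
+```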

-class ResultMonitor{
-    - task_db
-    + query_task_status()
-}
-class Cluster{
-    - node_list
-    + free()
-    + reserve_resource()
-    + get_node_status()
-}
+## Stop service

+```shell
+# Stop neural solution service with default configuration
+neural_solution stop
 ```

+## Inspect logs
+
+The default logs are located in `./ns_workspace/`. Users can specify a custom workspace with `neural_solution start --workspace=/path/to/custom/workspace`.
+
+There are several logs under the workspace:
+
+```shell
+(ns) [username@servers ns_workspace]$ tree
+.
+├── db
+│   └── task.db # database that stores task-related information
+├── serve_log # service running logs
+│   ├── backend.log # backend log
+│   ├── frontend_grpc.log # gRPC frontend log
+│   └── frontend.log # HTTP/RESTful frontend log
+├── task_log # log for each task
+│   ├── task_bdf0bd1b2cc14bc19bce12d4f9b333c7.txt # task log
+│   └── ...
+└── task_workspace # workspace for each task
+    ...
+    ├── bdf0bd1b2cc14bc19bce12d4f9b333c7 # task_id
+    ...

-### Extensibility
-
-- The service can be deployed on various resource pool, including a set of worker nodes, such as a local cluster or cloud cluster (AWS and GCP).
+```
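+
+To follow a single task while it runs, one can tail its log under the workspace, for example (using the task id from the listing above; the path assumes the default workspace):
+
+```shell
+tail -f ./ns_workspace/task_log/task_bdf0bd1b2cc14bc19bce12d4f9b333c7.txt
+```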