[DO NOT MERGE] Python binding / wrapper look and feel #238
Conversation
# [](){
# obj.trace_release;
# TraceDelete();
# // assuming root trace is always released last..
Just a note: this assumption is not correct for OpenTelemetry tracing of BLS + ensemble cases.
In those cases, when can the `userp` be safely released? I believe that within the trace release callback, it needs to figure out that the current invocation is for the last trace, and only then can it delete `userp`.
The BLS tracing PR adds a tracker of spawned traces, and when no more release callbacks are expected, `userp` is deleted. Feel free to add any suggestions while it is in review.
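As an illustration of that bookkeeping (a minimal sketch with hypothetical names, not the actual implementation in the BLS tracing PR): keep a counter of spawned traces per root trace, decrement it in the release callback, and delete `userp` only when no further releases are expected.

```python
import threading


class TraceUserData:
    """Hypothetical per-root-trace bookkeeping used to decide when userp can go away."""

    def __init__(self):
        self._lock = threading.Lock()
        self._outstanding = 0         # traces spawned but not yet released
        self._no_more_spawns = False  # set once no new child traces can appear

    def trace_spawned(self):
        with self._lock:
            self._outstanding += 1

    def spawning_finished(self):
        with self._lock:
            self._no_more_spawns = True

    def trace_released(self):
        # Called from the trace release callback; returns True when this was
        # the last expected release and userp may be deleted.
        with self._lock:
            self._outstanding -= 1
            return self._no_more_spawns and self._outstanding == 0
```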
One issue with the current design is that it is tied closely to the Server APIs instead of starting with the user workflow and working our way backwards to the C-API.
Ideally I think we would want the user workflow to look like this:
from tritonserver import Model, TritonServer, Request
server = TritonServer()
model = Model('<onnx-model-object>|<path-to-an-onnx-model>|...')
server.load(model)
request = Request(...)
response_iterator = server.infer(model, request)
async_response_iterator = await server.async_infer(model, request)
We can also start with just the core inferencing API and gradually add more APIs once we see how the initial design gets used in the wild. I think we don't have to expose all the APIs in one pass.
I think one other important part that we need to figure out is how we want to expose the async APIs. We could perhaps follow the gRPC example and have a completely separate set of APIs for async. For example:
from tritonserver import Model, AsyncTritonServer, Request
server = AsyncTritonServer()
model = Model('<onnx-model-object>|<path-to-an-onnx-model>|...')
await server.load(model)
request = Request(...)
response_iterator = await server.infer(model, request)
The `AsyncTritonServer` would return a coroutine from all of its API calls.
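As a rough sketch of how the coroutine-returning variant could be wired up (hypothetical internals; it assumes the low-level binding exposes a callback-based `async_infer`), the completion callback is bridged onto the asyncio event loop:

```python
import asyncio


class AsyncTritonServer:
    """Sketch of an async facade over a callback-driven core (hypothetical)."""

    def __init__(self, core):
        self._core = core  # assumed low-level object with a callback-based infer

    async def infer(self, model, request):
        loop = asyncio.get_running_loop()
        responses: asyncio.Queue = asyncio.Queue()

        def on_response(response, final):
            # Invoked from a Triton worker thread; hand the result back to the
            # event loop thread safely.
            loop.call_soon_threadsafe(responses.put_nowait, (response, final))

        # Assumed callback-based entry point on the low-level core.
        self._core.async_infer(model, request, on_response)

        async def iterate():
            while True:
                response, final = await responses.get()
                yield response
                if final:
                    return

        # `await server.infer(model, request)` yields an async iterator,
        # matching the proposed usage above.
        return iterate()
```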
# [FIXME] exposing the actual C API binding below for users who want
# more customization over what the Python wrapper provides.
# one may invoke the API using bindings.xxx()
import _pybind as bindings
I would prefer to not expose this unless explicitly requested.
I was thinking about extracting the model interface as well, but I didn't extract it because it may reinforce the misconception that Triton model operations are with respect to the current object with the specified name/version, which can be confusing to the user in the scenario below:

model_0 = Model("simple", <vision model>)
server.load(model_0)
.... # some later time
model_1 = Model("simple", <language model>)
server.load(model_1)

where the user may expect `model_0` to still refer to the vision model. However, I do agree that the API can be condensed to make preparation of model loading easier, which is the same as what you have proposed, but the name will be something like `ModelStore`.

That being said, I still want to explore the possibility of providing the model abstraction. I think in such a case the wrapper will need to encapsulate the Triton model management and impose the limitation / assumption that models must be managed through this wrapper (so it can track all the model changes). For example, the load API would then return a model handle that is just the name of the loaded model, but with an additional "valid" attribute that hints to the user that the model may have been changed and lets them decide what to do with the "stale" model handle:

model_0 = server.load(ModelStore("simple", <vision model>))
assert(model_0.valid)
.... # some later time
model_1 = server.load(ModelStore("simple", <language model>))
assert(not model_0.valid)

I don't see the necessity of providing a synchronized infer API; the user can read from the response iterator afterwards.
I think that's a valid argument for the C-API, but for Python we need to integrate that with asyncIO. Also, I think it is not just the […]

Regarding having multiple models with the same name, I think we can document that all the models with the same name would point to the same underlying object, or we could also error out if the user tries to load another model with the same name.
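To make the stale-handle proposal above concrete, here is a minimal sketch (hypothetical names, simplified to name-only loading): the wrapper remembers the handle it issued for each model name and flips `valid` to False when that name is loaded again.

```python
class ModelHandle:
    """Name-based handle returned by load(); `valid` turns False if the
    underlying Triton model is later replaced."""

    def __init__(self, name: str):
        self.name = name
        self.valid = True


class ServerWrapper:
    def __init__(self):
        self._handles = {}  # model name -> most recently issued handle

    def load(self, name: str) -> ModelHandle:
        # Invalidate the previous handle for this name, if any, so the user
        # can tell their old reference is now "stale".
        previous = self._handles.get(name)
        if previous is not None:
            previous.valid = False
        handle = ModelHandle(name)
        self._handles[name] = handle
        return handle


server = ServerWrapper()
model_0 = server.load("simple")
assert model_0.valid
model_1 = server.load("simple")
assert not model_0.valid
```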
pass


class TraceReportor(abc.ABC):
TraceReportor -> TraceReporter
# buffer attributes..
self.buffer_address: int = 0
self.byte_size: int = 0
self.memory_type: Tuple[MemoryType, int] = (MemoryType.CPU, 0)
Explicit `DeviceId`, even if just an alias or light wrapper of `int`, for memory typing? I think it will read easier.
- self.memory_type: Tuple[MemoryType, int] = (MemoryType.CPU, 0)
+ self.memory_type: Tuple[MemoryType, DeviceId] = (MemoryType.CPU, 0)
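A minimal sketch of such an alias (illustrative only; the enum values here are placeholders, not Triton's actual memory-type constants):

```python
from enum import Enum
from typing import NewType, Tuple


class MemoryType(Enum):
    # Placeholder values for illustration only.
    CPU = 0
    GPU = 1


# Light wrapper of int purely so annotations read better.
DeviceId = NewType("DeviceId", int)

memory_type: Tuple[MemoryType, DeviceId] = (MemoryType.CPU, DeviceId(0))
```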
import queue


class Tensor:
Is it possible to have only a single Tensor class? I was wondering why the allocator implementation is tied to `Tensor`. We can have `Tensor.from_numpy` and `Tensor.from_dlpack` to create a tensor from other datatypes.
> Is it possible to have only a single Tensor class?

Can you elaborate on this? I think there will only be one `Tensor` class, to store the information needed for Triton (buffer attributes, tensor data, etc.). Although `InferenceRequest` accepts other tensor types that implement the DLPack interface, internally they are converted to `Tensor` and then processed with the Triton APIs. On this note I agree that we should have `Tensor.from_dlpack`. However, the conversion doesn't involve a data copy, so the numpy / other DLPack-compatible tensor must remain valid.

The part about the allocator in `Tensor` is related to output buffer allocation. The user controls how the tensor data is allocated, and that is probably where having multiple Tensor classes seems to be happening: even though the `Tensor` interface is unchanged, each implementation of the allocator has to set additional attributes in `Tensor` for allocation purposes.
I see what you are saying now: the `NumpyAllocator` does extend the `Tensor` class to add allocator-specific detail instead of directly adding attributes to a `Tensor` instance. I think this is just a different way to achieve my earlier comment about buffer allocation; users should process the output buffer via the DLPack / `Tensor` interface, which is allocator agnostic. My example usage is not appropriate as it assumes the allocator detail.
Exactly, I'm just saying that we should try to decouple the buffer allocation logic from the tensor representation.
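As a sketch of the decoupling being discussed (hypothetical names, not the PR's actual classes): a single `Tensor` that only describes a buffer and keeps its owner alive, construction helpers such as `from_numpy`, and allocators as separate objects that hand back plain `Tensor` instances.

```python
from dataclasses import dataclass
from typing import Any, Protocol, Tuple

import numpy


@dataclass
class Tensor:
    """Single tensor representation: buffer description plus a reference that
    keeps the owning object (numpy array, DLPack capsule owner, ...) alive."""
    buffer_address: int
    byte_size: int
    memory_type: Tuple[str, int]  # e.g. ("CPU", 0); illustrative only
    _owner: Any = None            # prevents the backing memory from being freed

    @classmethod
    def from_numpy(cls, array: numpy.ndarray) -> "Tensor":
        # No data copy: the Tensor simply points at the array's buffer.
        return cls(
            buffer_address=array.ctypes.data,
            byte_size=array.nbytes,
            memory_type=("CPU", 0),
            _owner=array,
        )


class Allocator(Protocol):
    """Output allocation is a separate concern from the tensor representation."""

    def allocate(self, byte_size: int) -> Tensor: ...


class NumpyAllocator:
    def allocate(self, byte_size: int) -> Tensor:
        backing = numpy.empty(byte_size, dtype=numpy.uint8)
        return Tensor.from_numpy(backing)
```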
pass


class TritonCore:
TritonServer?
# Example usage of the module:
# # Init
# options = Options()
# options.model_repositories = ["/path/to/models"]
I think it might be good if we can work on providing an in-memory model repository (or model definition) as part of this ticket. I think having to set up your model repository outside is not ideal for the optimal user experience.
# else:
# # May also interact with Tensor attributes directly
# res.append(numpy.from_dlpack(output))
import numpy
Can we move this import to the top?
ISO8601: int = 1


# [WIP] figure out "singleton"
Quick question: is this still a WIP, and will it be resolved before merging? Is it possible to elaborate on `figure out` if it is left for future work?
This will be resolved before merging. This PR will include the actual implementation of the wrapper.
# Activity related to timeline tracing
class TraceActivity(Enum):
Just an FYI, @Tabrizian is working (IIRC) on custom tracing for the Python backend, which introduces new `TraceActivity` values: #172. So we need to coordinate this at some point.
self.buffer_manager_thread_count: int = None

# Model stuff..
self.model_repositories: Iterable[str] = None
I was wondering if it would make sense to set the model repository path as a required field.
As in making the paths part of the init parameters? I.e. `__init__(self, model_paths)`.
Yes. I just thought that instead of doing this
options = Options()
options.model_repositories = ["/path/to/models"]
we can do
options = Options(["/path/to/models"])
given that the model repo paths are required when starting the server.
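A minimal sketch of that signature (illustrative only), with the repository paths required up front and the remaining options kept as attributes:

```python
from typing import Iterable, List, Optional


class Options:
    def __init__(self, model_repositories: Iterable[str]):
        # Repository paths are required because the server cannot start
        # without them; everything else remains an optional attribute.
        self.model_repositories: List[str] = list(model_repositories)
        self.buffer_manager_thread_count: Optional[int] = None


options = Options(["/path/to/models"])
```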
# [WIP] figure out "singleton"
class GlobalLogger:
- class GlobalLogger:
+ class Logger:
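If the [WIP] note is about making the logger a singleton, one common Python pattern (a sketch, not necessarily what this PR will do) is a single module-level instance that importers share:

```python
class Logger:
    """Process-wide logging configuration (hypothetical sketch)."""

    def __init__(self):
        self.verbose_level: int = 0
        self.log_format: int = 0  # e.g. the ISO8601 constant above

    def set_verbose(self, level: int) -> None:
        self.verbose_level = level


# The module-level instance acts as the singleton; importers share it, e.g.
#   from tritonserver import logger
logger = Logger()
```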
# finally: delete TRITONSERVER_Parameter
pass

def close(self):
- def close(self):
+ def _close(self):
def __init__(self,
             name: str,
             version: int = -1,
             consumed_callback: Callable = None,
Is there any way we could remove the need for this callback?
It's not needed unless the user wants to know when they may modify the content of the `InferenceRequest`, in most contexts the input tensors. But if there is no reuse of the request and its content, then it is not necessary:

input_np = np.array(...)
request = InferenceRequest(...)
request.inputs = {"input": input_np}
# async infer..
input_np[0] = ...  # modifying tensor content can cause issues if inference hasn't fully consumed the input
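For the case where the caller does want to reuse the input buffer, a sketch of how the callback could be used (assuming the `InferenceRequest` constructor from the diff above and an asynchronous submission elsewhere): signal an event from `consumed_callback` and wait on it before mutating the tensor.

```python
import threading

import numpy as np

input_np = np.zeros((1, 16), dtype=np.float32)
consumed = threading.Event()

# Hypothetical wiring: Triton invokes the callback once it no longer needs
# the request's input buffers.
request = InferenceRequest(
    name="simple",
    consumed_callback=consumed.set,
)
request.inputs = {"input": input_np}

# ... submit the request asynchronously ...

consumed.wait()      # safe point: the inputs have been fully consumed
input_np[0] = 42.0   # the buffer can now be reused for the next request
```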
`FIXME` marks the sections where further discussion is desired.

The wrapper is restricted in that a lot of the interaction with the in-process API is pre-defined (i.e. how to handle released `TRITONSERVER_Request` and `TRITONSERVER_Response`). The intention is that the user shouldn't need to handle object lifecycles explicitly. The lower-level binding will also be provided if they want finer control (i.e. reuse underlying objects to avoid alloc/release overhead).

At the end of `_server.py` is a basic example usage of the wrapper. At the end of `_infer.py` is an example implementation of an allocator.