
[DO NOT MERGE] Python binding / wrapper look and feel #238

Closed · wants to merge 3 commits

Conversation

@GuanLuo (Contributor) commented Aug 3, 2023

FIXME marks the sections where further discussion is desired.

The wrapper is restrictive in that much of the interaction with the in-process API is pre-defined (e.g. how released TRITONSERVER_Request and TRITONSERVER_Response objects are handled). The intention is that the user shouldn't need to manage object lifecycles explicitly. A lower-level binding will also be provided for users who want finer control (e.g. reusing underlying objects to avoid alloc/release overhead).

At the end of _server.py is a basic example usage of the wrapper. At the end of _infer.py is an example implementation of an allocator.

@GuanLuo requested review from Tabrizian and tanmayv25 on Aug 3, 2023 18:59
# [](){
# obj.trace_release;
# TraceDelete();
# // assuming root trace is always released last..
Contributor:

Just a note: this assumption is not correct for OpenTelemetry tracing of BLS + ensemble cases.

Contributor Author:

In those cases, when can the userp be safely released? I believe the trace release callback needs to figure out whether the current invocation is for the last trace, and only then delete userp.

Contributor:

The BLS tracing PR adds a tracker of spawned traces, and once no more release callbacks are expected, userp is deleted:

https://github.com/triton-inference-server/server/pull/6063/files#diff-c05667609085c993fd282ff18d0429945c44492b95c0d01fcbeec6946fda2db2R614-R626

Feel free to add any suggestions while it is in review.
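To make the lifecycle concrete, here is a minimal sketch of the spawned-trace tracking described above; the class, method names, and locking are assumptions for illustration, not the PR's actual code.

import threading


class TraceReleaseTracker:
    """Hypothetical helper: counts traces spawned under one root trace and
    frees the shared userp only after the last release callback has fired."""

    def __init__(self, cleanup):
        self._cleanup = cleanup          # callable that frees the shared userp
        self._outstanding = 0
        self._lock = threading.Lock()    # release callbacks may arrive on Triton threads

    def trace_spawned(self):
        with self._lock:
            self._outstanding += 1

    def trace_released(self):
        # Call this from each trace release callback.
        with self._lock:
            self._outstanding -= 1
            last = self._outstanding == 0
        if last:
            # No further release callbacks expected; safe to delete userp now.
            self._cleanup()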

@Tabrizian (Member) left a comment:

One issue with the current design is that it is tied closely to the Server APIs instead of starting with the user workflow and working our way backwards to the C-API.

Ideally I think we would want the user workflow to look like this:

from tritonserver import Model, TritonServer, Request


server = TritonServer()
model = Model('<onnx-model-object>|<path-to-an-onnx-model>|...')
server.load(model)

request = Request(...)
response_iterator = server.infer(model, request)
async_response_iterator = await server.async_infer(model, request)

We can also start with just the core inferencing API and gradually add more APIs once we see how the initial design gets used in the wild. I think we don't have to expose all the APIs in one pass.

I think one other important part that we need to figure out is how we would want to expose the async APIs. We could perhaps follow the gRPC example and have a completely separate set of APIs for async. For example:

from tritonserver import Model, AsyncTritonServer, Request


server = AsyncTritonServer()
model = Model('<onnx-model-object>|<path-to-an-onnx-model>|...')
await server.load(model)

request = Request(...)
response_iterator = await server.infer(model, request)

The AsyncTritonServer would return a coroutine from all of its API calls.
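As one possible shape for that, here is a minimal sketch of bridging a callback-driven response stream into async iteration with asyncio; the callback signature and names are assumptions, not a proposed implementation.

import asyncio

_DONE = object()  # sentinel marking the end of the response stream


class AsyncResponseIterator:
    """Hypothetical bridge from callback-based responses to `async for`."""

    def __init__(self, loop: asyncio.AbstractEventLoop):
        self._loop = loop
        self._queue: asyncio.Queue = asyncio.Queue()

    def response_callback(self, response, final: bool):
        # Invoked on a Triton-owned thread, so hand off to the event loop.
        if response is not None:
            self._loop.call_soon_threadsafe(self._queue.put_nowait, response)
        if final:
            self._loop.call_soon_threadsafe(self._queue.put_nowait, _DONE)

    def __aiter__(self):
        return self

    async def __anext__(self):
        item = await self._queue.get()
        if item is _DONE:
            raise StopAsyncIteration
        return item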

# [FIXME] exposing the actual C API binding below for users who want
# more customization over what the Python wrapper provides.
# one may invoke the API using bindings.xxx()
import _pybind as bindings
Member:

I would prefer to not expose this unless explicitly requested.

@GuanLuo (Contributor Author) commented Aug 4, 2023

I was thinking about extracting the model interface as well, but I didn't, because it may reinforce the misconception that Triton model operations are scoped to the particular object with the given name/version. That can be confusing to the user in the scenario below:

model_0 = Model("simple", <vision model>)
server.load(model_0)
.... # some later time
model_1 = Model("simple", <language model>)
server.load(model_1)

where the user may expect model_0 to be unchanged since it appears to be a separate entity, but in fact any API usage against model_0 will now point to the language model.

However, I do agree that the API can be condensed to make preparing a model for loading easier, which is the same as what you have proposed, but the name would be something like ModelStorage instead of Model to indicate that it is the static representation of the model. The runtime model API would still have to go through a Triton object.

That being said, I still want to explore the possibility of providing the model abstraction. I think in that case the wrapper would need to encapsulate Triton model management and impose the limitation / assumption that models must be managed through this wrapper (so it can track all the model changes). For example, the load API could return a model handle that is just the name of the loaded model, but with an additional "valid" attribute that hints to the user that the model may have changed and lets them decide what to do with the "stale" model handle:

model_0 = server.load(ModelStore("simple", <vision model>))
assert(model_0.valid)
.... # some later time
model_1 = server.load(ModelStore("simple", <language model>))
assert(not model_0.valid)

I don't see the necessity of providing a synchronized infer API; the user can read from the response iterator after async_infer returns, and it will block until a response is ready.
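For illustration, the blocking usage described here might look like the following; the object names reuse the earlier examples and process is a hypothetical user function.

# Illustrative only: `server`, `model`, and `request` are from the examples above.
response_iterator = server.async_infer(model, request)
for response in response_iterator:   # each step blocks until a response is ready
    process(response)                # `process` is a user-defined placeholder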

@Tabrizian (Member):

I don't see the necessity of providing a synchronized infer API; the user can read from the response iterator after async_infer returns, and it will block until a response is ready.

I think that's a valid argument for the C-API, but for Python we need to integrate that with asyncio. Also, I think it is not just the infer API: we probably need an async version of all the time-consuming Triton APIs (e.g., load/unload). I'm just wondering whether there is a way to avoid ending up with two async infer APIs (one callback-based and another that uses asyncio). We could provide only an asyncio-based async infer API, but that would make sync code a bit awkward. For example, users would have to run asyncio.run(server.async_infer(...)).
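One way to keep the sync path less awkward, sketched here under the assumption of an asyncio-only core API, is a thin synchronous facade that hides the asyncio.run call; the names are illustrative.

import asyncio


class SyncTritonServer:
    """Hypothetical sync facade over an asyncio-based server object."""

    def __init__(self, async_server):
        self._async_server = async_server

    def infer(self, model, request):
        # Runs the async infer coroutine to completion on a private event loop.
        return asyncio.run(self._async_server.infer(model, request))

    def load(self, model):
        return asyncio.run(self._async_server.load(model))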

Regarding having multiple models with the same name, I think we can either document that all models with the same name point to the same underlying object, or error out if the user tries to load another model with the same name.

pass


class TraceReportor(abc.ABC):
Contributor:

TraceReportor -> TraceReporter

# buffer attributes..
self.buffer_address: int = 0
self.byte_size: int = 0
self.memory_type: Tuple[MemoryType, int] = (MemoryType.CPU, 0)
@rmccorm4 (Contributor) commented Aug 21, 2023:

Explicit DeviceId, even if just an alias or light wrapper around int, for the memory typing? I think it would read easier.

Suggested change
self.memory_type: Tuple[MemoryType, int] = (MemoryType.CPU, 0)
self.memory_type: Tuple[MemoryType, DeviceID] = (MemoryType.CPU, 0)
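A minimal sketch of such an alias, assuming typing.NewType is acceptable; the MemoryType enum here is only a stand-in so the snippet is self-contained.

from enum import Enum
from typing import NewType, Tuple


class MemoryType(Enum):   # stand-in for the wrapper's MemoryType
    CPU = 0
    GPU = 1


# NewType keeps DeviceId a plain int at runtime while making the annotation
# read as (memory type, device id) rather than (MemoryType, int).
DeviceId = NewType("DeviceId", int)

memory_type: Tuple[MemoryType, DeviceId] = (MemoryType.CPU, DeviceId(0))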

import queue


class Tensor:
Member:

Is it possible to have only a single Tensor class? I was wondering why the allocator implementation is tied to Tensor. We can have Tensor.from_numpy and Tensor.from_dlpack to create a tensor from other datatypes.

Contributor Author:

Is it possible to have only a single Tensor class?

Can you elaborate on this? I think there will only be one Tensor class, to store the information needed by Triton (buffer attributes, tensor data, etc.). Although InferenceRequest accepts other tensor types that implement the DLPack interface, internally they are converted to Tensor and then processed with the Triton APIs. On this note I agree that we should have Tensor.from_dlpack. However, the conversion doesn't involve a data copy, so the numpy / other dlpack-compatible tensor must remain valid.

The part about the allocator in Tensor is related to output buffer allocation. The user controls how the tensor data is allocated, and that is probably where having multiple Tensor classes seems to come in: even though the Tensor interface is unchanged, each implementation of the allocator has to set additional attributes on Tensor for allocation purposes.
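A minimal sketch of the single-Tensor idea with from_numpy / from_dlpack constructors, keeping the zero-copy caveat above in mind; the attribute and method bodies are assumptions, not the wrapper's API.

import numpy


class Tensor:
    """Hypothetical single Tensor class; holds a reference, never copies."""

    def __init__(self, data):
        # No data copy: `data` must stay valid for the lifetime of this Tensor.
        self._data = data

    @classmethod
    def from_numpy(cls, array: numpy.ndarray) -> "Tensor":
        return cls(array)

    @classmethod
    def from_dlpack(cls, obj) -> "Tensor":
        if not hasattr(obj, "__dlpack__"):
            raise TypeError("object does not implement the DLPack protocol")
        return cls(obj)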

Contributor Author:

I see what you are saying now: the NumpyAllocator does extend the Tensor class to add allocator-specific detail instead of adding attributes directly to a Tensor instance. I think this is just a different way to achieve my earlier comment about buffer allocation: users should process the output buffer via the DLPack / Tensor interface, which is allocator agnostic. My example usage is not appropriate, as it assumes allocator details.

Member:

Exactly, I'm just saying that we should try to decouple the buffer allocation logic from the tensor representation.
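As a sketch of that decoupling, allocation could live behind its own small interface so Tensor never needs allocator-specific attributes; the names are assumptions.

import abc

import numpy


class Allocator(abc.ABC):
    """Hypothetical allocator interface, separate from the Tensor class."""

    @abc.abstractmethod
    def allocate(self, byte_size: int):
        """Return an object owning `byte_size` bytes of writable memory."""


class NumpyAllocator(Allocator):
    def allocate(self, byte_size: int):
        # Host-memory backing; a GPU allocator could instead return a device
        # buffer exposing the same DLPack interface.
        return numpy.empty(byte_size, dtype=numpy.uint8)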

pass


class TritonCore:
Member:

TritonServer?

# Example usage of the module:
# # Init
# options = Options()
# options.model_repositories = ["/path/to/models"]
Member:

I think it would be good if we can work on providing an in-memory model repository (or model definition) as part of this ticket. Having to set up your model repository externally is not ideal for the user experience.

# else:
#     # May also interact with Tensor attributes directly
#     res.append(numpy.from_dlpack(output))
import numpy
Contributor:

Can we move this import to the top?

ISO8601: int = 1


# [WIP] figure out "singleton"
Contributor:

Quick question: is this still a WIP that will be resolved before merging? Is it possible to elaborate on what needs to be figured out, or is it left for future work?

Contributor Author:

This will be resolved before merging. This PR will include the actual implementation of the wrapper.



# Activity related to timeline tracing
class TraceActivity(Enum):
Contributor:

Just an FYI, @Tabrizian is working (IIRC) on custom tracing for the Python backend, which introduces new TraceActivity values: #172. So we need to coordinate this at some point.

self.buffer_manager_thread_count: int = None

# Model stuff..
self.model_repositories: Iterable[str] = None
Contributor:

I was wondering if it would make sense to set the model repository path as a required field.

Contributor Author:

As in making the paths part of the init parameters? i.e. __init__(self, model_paths)

Contributor:

Yes. I just thought that instead of doing this

options = Options()
options.model_repositories = ["/path/to/models"]

we can do

options = Options(["/path/to/models"])

given that the model repo paths are required when starting the server.
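A minimal sketch of that constructor shape, with the repository paths required and validated up front; other options are omitted.

from typing import Iterable, List


class Options:
    """Hypothetical constructor making the model repositories mandatory."""

    def __init__(self, model_repositories: Iterable[str]):
        paths: List[str] = list(model_repositories)
        if not paths:
            raise ValueError("at least one model repository path is required")
        self.model_repositories = paths


# options = Options(["/path/to/models"])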



# [WIP] figure out "singleton"
class GlobalLogger:
Member:

Suggested change
class GlobalLogger:
class Logger:

# finally: delete TRITONSERVER_Parameter
pass

def close(self):
Member:

Suggested change
def close(self):
def _close(self):

def __init__(self,
name: str,
version: int = -1,
consumed_callback: Callable = None,
Member:

Is there any way we could remove the need for this callback?

Contributor Author:

It's not needed unless the user wants to know when they may modify the content of the InferenceRequest, in most contexts the input tensors. But if there is no reuse of the request and its content, then it is not necessary:

input_np = np.array(...)
request = InferenceRequest(...)
request.inputs = {"input": input_np}
# async infer..
input_np[0] = ...  # modifying the tensor content can cause issues if inference hasn't fully consumed the input
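For completeness, a sketch of how consumed_callback could be used to know when the input buffer is safe to modify; server.async_infer and the exact keyword arguments are assumptions based on the snippets above.

import threading

import numpy as np

# Illustrative only: `InferenceRequest` and `server` are the objects from the
# snippets above.
input_consumed = threading.Event()

input_np = np.zeros((1, 16), dtype=np.float32)
request = InferenceRequest(name="simple", consumed_callback=input_consumed.set)
request.inputs = {"input": input_np}

server.async_infer(request)   # hand the request to Triton

input_consumed.wait()         # Triton no longer references the input buffer
input_np[0] = 1.0             # now it is safe to modify / reuse the input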

@GuanLuo closed this Sep 6, 2024