[DO NOT MERGE] Python binding / wrapper look and feel #238
Conversation
# [](){
# obj.trace_release;
# TraceDelete();
# // assuming root trace is always released last..
Just a note: this assumption is not correct for OpenTelemetry tracing of BLS + ensemble cases.
In those cases, when can the `userp` be safely released? I believe that within the trace release callback, it needs to figure out that the current invocation is for the last trace, and only then can it delete `userp`.
The BLS tracing PR adds a tracker of spawned traces, and when no more release callbacks are expected, `userp` is deleted. Feel free to add any suggestions while it is in review.
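As an illustration of that bookkeeping (a minimal sketch with hypothetical names, not the actual implementation in the BLS tracing PR): keep a counter of spawned traces per root trace, decrement it in the release callback, and delete `userp` only when no further releases are expected.

```python
import threading


class TraceUserData:
    """Hypothetical per-root-trace bookkeeping used to decide when userp can go away."""

    def __init__(self):
        self._lock = threading.Lock()
        self._outstanding = 0         # traces spawned but not yet released
        self._no_more_spawns = False  # set once no new child traces can appear

    def trace_spawned(self):
        with self._lock:
            self._outstanding += 1

    def spawning_finished(self):
        with self._lock:
            self._no_more_spawns = True

    def trace_released(self):
        # Called from the trace release callback; returns True when this was
        # the last expected release and userp may be deleted.
        with self._lock:
            self._outstanding -= 1
            return self._no_more_spawns and self._outstanding == 0
```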
One issue with the current design is that it is tied closely to the Server APIs instead of starting with the user workflow and working our way backwards to the C-API.
Ideally I think we would want the user workflow to look like this:
from tritonserver import Model, TritonServer, Request
server = TritonServer()
model = Model('<onnx-model-object>|<path-to-an-onnx-model>|...')
server.load(model)
request = Request(...)
response_iterator = server.infer(model, request)
async_response_iterator = await server.async_infer(model, request)
We can also start with just the core inferencing API and gradually add more APIs once we see how the initial design gets used in the wild. I think we don't have to expose all the APIs in one pass.
I think one other important part that we need to figure out is how we want to expose the async APIs. We could perhaps follow the gRPC example and have a completely separate set of APIs for async. For example:
from tritonserver import Model, AsyncTritonServer, Request
server = AsyncTritonServer()
model = Model('<onnx-model-object>|<path-to-an-onnx-model>|...')
await server.load(model)
request = Request(...)
response_iterator = await server.infer(model, request)
The `AsyncTritonServer` would return a coroutine from all of its API calls.
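As a rough sketch of how the coroutine-returning variant could be wired up (hypothetical internals; it assumes the low-level binding exposes a callback-based `async_infer`), the completion callback is bridged onto the asyncio event loop:

```python
import asyncio


class AsyncTritonServer:
    """Sketch of an async facade over a callback-driven core (hypothetical)."""

    def __init__(self, core):
        self._core = core  # assumed low-level object with a callback-based infer

    async def infer(self, model, request):
        loop = asyncio.get_running_loop()
        responses: asyncio.Queue = asyncio.Queue()

        def on_response(response, final):
            # Invoked from a Triton worker thread; hand the result back to the
            # event loop thread safely.
            loop.call_soon_threadsafe(responses.put_nowait, (response, final))

        # Assumed callback-based entry point on the low-level core.
        self._core.async_infer(model, request, on_response)

        async def iterate():
            while True:
                response, final = await responses.get()
                yield response
                if final:
                    return

        # `await server.infer(model, request)` yields an async iterator,
        # matching the proposed usage above.
        return iterate()
```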
# [FIXME] exposing the actual C API binding below for users who want
# more customization over what the Python wrapper provides.
# one may invoke the API using bindings.xxx()
import _pybind as bindings
I would prefer to not expose this unless explicitly requested.
I was thinking about extracting the model interface as well, but I didn't extract it because it may reinforce the misconception that Triton model operations are with respect to the current object with the specified name/version, which can be confusing to the user in the scenario below:

model_0 = Model("simple", <vision model>)
server.load(model_0)
.... # some later time
model_1 = Model("simple", <language model>)
server.load(model_1)

where the user may expect `model_0` to still refer to the vision model. However, I do agree that the API can be condensed to make preparation of model loading easier, which is the same as what you have proposed, but the name will be something like `ModelStore`.

That being said, I still want to explore the possibility of providing the model abstraction. I think in such a case the wrapper will need to encapsulate the Triton model management and impose the limitation / assumption that models must be managed through this wrapper (so it can track all the model changes). For example, the load API would then return a model handle that is just the name of the loaded model, but with an additional "valid" attribute that hints to the user that the model may have been changed and lets them decide what to do with the "stale" model handle:

model_0 = server.load(ModelStore("simple", <vision model>))
assert(model_0.valid)
.... # some later time
model_1 = server.load(ModelStore("simple", <language model>))
assert(not model_0.valid)

I don't see the necessity of providing a synchronized infer API; the user can read from the response iterator afterwards.
I think that's a valid argument for the C-API, but for Python we need to integrate that with asyncIO. Also, I think it is not just the […]

Regarding having multiple models with the same name, I think we can document that all the models with the same name would point to the same underlying object, or we could also error out if the user tries to load another model with the same name.
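To make the stale-handle proposal above concrete, here is a minimal sketch (hypothetical names, simplified to name-only loading): the wrapper remembers the handle it issued for each model name and flips `valid` to False when that name is loaded again.

```python
class ModelHandle:
    """Name-based handle returned by load(); `valid` turns False if the
    underlying Triton model is later replaced."""

    def __init__(self, name: str):
        self.name = name
        self.valid = True


class ServerWrapper:
    def __init__(self):
        self._handles = {}  # model name -> most recently issued handle

    def load(self, name: str) -> ModelHandle:
        # Invalidate the previous handle for this name, if any, so the user
        # can tell their old reference is now "stale".
        previous = self._handles.get(name)
        if previous is not None:
            previous.valid = False
        handle = ModelHandle(name)
        self._handles[name] = handle
        return handle


server = ServerWrapper()
model_0 = server.load("simple")
assert model_0.valid
model_1 = server.load("simple")
assert not model_0.valid
```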
pass


class TraceReportor(abc.ABC):
TraceReportor -> TraceReporter
# buffer attributes..
self.buffer_address: int = 0
self.byte_size: int = 0
self.memory_type: Tuple[MemoryType, int] = (MemoryType.CPU, 0)
Explicit `DeviceId`, even if just an alias or light wrapper of `int`, for memory typing? I think it will read easier.
- self.memory_type: Tuple[MemoryType, int] = (MemoryType.CPU, 0)
+ self.memory_type: Tuple[MemoryType, DeviceId] = (MemoryType.CPU, 0)
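A minimal sketch of such an alias (illustrative only; the enum values here are placeholders, not Triton's actual memory-type constants):

```python
from enum import Enum
from typing import NewType, Tuple


class MemoryType(Enum):
    # Placeholder values for illustration only.
    CPU = 0
    GPU = 1


# Light wrapper of int purely so annotations read better.
DeviceId = NewType("DeviceId", int)

memory_type: Tuple[MemoryType, DeviceId] = (MemoryType.CPU, DeviceId(0))
```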
import queue


class Tensor:
Is it possible to have only a single Tensor class? I was wondering why the allocator implementation is tied to `Tensor`. We can have `Tensor.from_numpy` and `Tensor.from_dlpack` to create a tensor from other datatypes.
> Is it possible to have only a single Tensor class?

Can you elaborate on this? I think there will only be one `Tensor` class, to store the information needed for Triton (buffer attributes, tensor data, etc.). Although `InferenceRequest` accepts other tensor types that implement the DLPack interface, internally they are converted to `Tensor` and then processed with the Triton APIs. On this note I agree that we should have `Tensor.from_dlpack`. However, the conversion doesn't involve a data copy, so the numpy / other DLPack-compatible tensor must remain valid.

The part about the allocator in `Tensor` is related to output buffer allocation. The user controls how the tensor data is allocated, and that is probably where having multiple Tensor classes seems to be happening: even though the `Tensor` interface is unchanged, each implementation of the allocator has to set additional attributes in `Tensor` for allocation purposes.
I see what you are saying now: the `NumpyAllocator` does extend the `Tensor` class to add allocator-specific detail instead of directly adding attributes to a `Tensor` instance. I think this is just a different way to achieve my earlier comment about buffer allocation; users should process the output buffer via the DLPack / `Tensor` interface, which is allocator agnostic. My example usage is not appropriate as it assumes the allocator detail.
Exactly, I'm just saying that we should try to decouple the buffer allocation logic from the tensor representation.
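As a sketch of the decoupling being discussed (hypothetical names, not the PR's actual classes): a single `Tensor` that only describes a buffer and keeps its owner alive, construction helpers such as `from_numpy`, and allocators as separate objects that hand back plain `Tensor` instances.

```python
from dataclasses import dataclass
from typing import Any, Protocol, Tuple

import numpy


@dataclass
class Tensor:
    """Single tensor representation: buffer description plus a reference that
    keeps the owning object (numpy array, DLPack capsule owner, ...) alive."""
    buffer_address: int
    byte_size: int
    memory_type: Tuple[str, int]  # e.g. ("CPU", 0); illustrative only
    _owner: Any = None            # prevents the backing memory from being freed

    @classmethod
    def from_numpy(cls, array: numpy.ndarray) -> "Tensor":
        # No data copy: the Tensor simply points at the array's buffer.
        return cls(
            buffer_address=array.ctypes.data,
            byte_size=array.nbytes,
            memory_type=("CPU", 0),
            _owner=array,
        )


class Allocator(Protocol):
    """Output allocation is a separate concern from the tensor representation."""

    def allocate(self, byte_size: int) -> Tensor: ...


class NumpyAllocator:
    def allocate(self, byte_size: int) -> Tensor:
        backing = numpy.empty(byte_size, dtype=numpy.uint8)
        return Tensor.from_numpy(backing)
```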
pass


class TritonCore:
TritonServer?
# Example usage of the module:
# # Init
# options = Options()
# options.model_repositories = ["/path/to/models"]
I think it might be good if we can work on providing an in-memory model repository (or model definition) as part of this ticket. I think having to set up your model repository outside is not ideal for the optimal user experience.
# else:
# # May also interact with Tensor attributes directly
# res.append(numpy.from_dlpack(output))
import numpy
Can we move this import to the top?
ISO8601: int = 1


# [WIP] figure out "singleton"
Quick question: is this still a WIP, and will it be resolved before merging? Is it possible to elaborate on `figure out` if it is left for future work?
This will be resolved before merging. This PR will include the actual implementation of the wrapper.
# Activity related to timeline tracing
class TraceActivity(Enum):
Just an FYI, @Tabrizian is working (IIRC) on custom tracing for the Python backend, which introduces new `TraceActivity` values: #172. So we need to coordinate this at some point.
self.buffer_manager_thread_count: int = None

# Model stuff..
self.model_repositories: Iterable[str] = None
I was wondering if it would make sense to set the model repository path as a required field.
As in making the paths part of the init parameters? I.e. `__init__(self, model_paths)`.
Yes. I just thought that instead of doing this
options = Options()
options.model_repositories = ["/path/to/models"]
we can do
options = Options(["/path/to/models"])
given that the model repo paths are required when starting the server.
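A minimal sketch of that signature (illustrative only), with the repository paths required up front and the remaining options kept as attributes:

```python
from typing import Iterable, List, Optional


class Options:
    def __init__(self, model_repositories: Iterable[str]):
        # Repository paths are required because the server cannot start
        # without them; everything else remains an optional attribute.
        self.model_repositories: List[str] = list(model_repositories)
        self.buffer_manager_thread_count: Optional[int] = None


options = Options(["/path/to/models"])
```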
# [WIP] figure out "singleton"
class GlobalLogger:
- class GlobalLogger:
+ class Logger:
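If the [WIP] note is about making the logger a singleton, one common Python pattern (a sketch, not necessarily what this PR will do) is a single module-level instance that importers share:

```python
class Logger:
    """Process-wide logging configuration (hypothetical sketch)."""

    def __init__(self):
        self.verbose_level: int = 0
        self.log_format: int = 0  # e.g. the ISO8601 constant above

    def set_verbose(self, level: int) -> None:
        self.verbose_level = level


# The module-level instance acts as the singleton; importers share it, e.g.
#   from tritonserver import logger
logger = Logger()
```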
# finally: delete TRITONSERVER_Parameter
pass

def close(self):
- def close(self):
+ def _close(self):
def __init__(self,
             name: str,
             version: int = -1,
             consumed_callback: Callable = None,
Is there any way we could remove the need for this callback?
It's not needed unless the user wants to know when they may modify the content of the `InferenceRequest`, in most contexts the input tensors. But if there is no reuse of the request and its content, then it is not necessary:

input_np = np.array(...)
request = InferenceRequest(...)
request.inputs = {"input": input_np}
# async infer..
input_np[0] = ...  # modifying tensor content can cause issues if inference hasn't fully consumed the input
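For the case where the caller does want to reuse the input buffer, a sketch of how the callback could be used (assuming the `InferenceRequest` constructor from the diff above and an asynchronous submission elsewhere): signal an event from `consumed_callback` and wait on it before mutating the tensor.

```python
import threading

import numpy as np

input_np = np.zeros((1, 16), dtype=np.float32)
consumed = threading.Event()

# Hypothetical wiring: Triton invokes the callback once it no longer needs
# the request's input buffers.
request = InferenceRequest(
    name="simple",
    consumed_callback=consumed.set,
)
request.inputs = {"input": input_np}

# ... submit the request asynchronously ...

consumed.wait()      # safe point: the inputs have been fully consumed
input_np[0] = 42.0   # the buffer can now be reused for the next request
```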
`FIXME` marks the sections where further discussion is desired.

The wrapper is restricted in that a lot of the interaction with the in-process API is pre-defined (i.e. how to handle released `TRITONSERVER_Request` and `TRITONSERVER_Response`). The intention is that the user shouldn't need to handle object lifecycles explicitly. The lower-level binding will also be provided if they want finer control (i.e. reuse underlying objects to avoid alloc/release overhead).

At the end of `_server.py` is a basic example usage of the wrapper. At the end of `_infer.py` is an example implementation of an allocator.