Triton inference automatic resubmit #1190

@yimuchen

Description

Right now, the triton_wrapper seems to work fine as long as the server hosting the model scales correctly with the number of active jobs. If the server doesn't keep up, a random chunk raises a tritonclient.utils.InferenceServerException and the whole evaluation is terminated, which is very difficult to track down.

We could handle this upstream in the triton_wrapper class itself, or, if we think this kind of server-configuration handling should be done on the analyst side, it can be patched into any subclass of triton_wrapper with something like:

from coffea.ml_tools.triton_wrapper import triton_wrapper
from tritonclient.utils import InferenceServerException

class my_model_wrapper(triton_wrapper):
    def prepare_awkward(self, *args, **kwargs):
        pass # Or whatever is needed
        
    def numpy_call(self, output_list, input_dict, ncalls=0):
        """
        Overload the upstream numpy_call to allow up to 3 retries on transient
        server failures. (The default value of ncalls keeps the signature
        compatible with the existing upstream call.)
        """
        try:
            return super().numpy_call(output_list, input_dict)
        except InferenceServerException as err:
            print("Caught inference server exception:", err)
            if ncalls >= 3:
                print("Not resolved after 3 retries... this is an actual error")
                raise err
            return self.numpy_call(output_list, input_dict, ncalls + 1)
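
A possible refinement of the same idea (just a sketch, not part of the current coffea API): since the failures come from a server that can't keep up, pausing briefly before each retry gives it a chance to recover instead of being hit again immediately. The class name my_backoff_wrapper and the backoff schedule below are assumptions for illustration; only numpy_call and prepare_awkward come from the existing wrapper interface.

import time

from coffea.ml_tools.triton_wrapper import triton_wrapper
from tritonclient.utils import InferenceServerException

class my_backoff_wrapper(triton_wrapper):
    def prepare_awkward(self, *args, **kwargs):
        pass  # Or whatever is needed

    def numpy_call(self, output_list, input_dict, ncalls=0):
        """Retry up to 3 times, sleeping 1s, 2s, 4s between attempts."""
        try:
            return super().numpy_call(output_list, input_dict)
        except InferenceServerException as err:
            if ncalls >= 3:
                raise err
            # Hypothetical backoff schedule: wait 2**ncalls seconds before retrying
            time.sleep(2 ** ncalls)
            return self.numpy_call(output_list, input_dict, ncalls + 1)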
