Hi!

I am tasked with optimizing a model repository of ~40 models, most of which demand a couple of GB of VRAM. The models will not receive much traffic, but they do have to be on standby for the occasional inference.

After reading the docs, I concluded I should use the rate limiting feature. I defined a resource `MEM` and defined an instance group like so for each VRAM-bound model:
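For reference, an instance group with a rate-limiter resource looks roughly like this in `config.pbtxt` (the `count` values here are illustrative, not my actual numbers):

```protobuf
instance_group [
  {
    count: 1
    kind: KIND_GPU
    rate_limiter {
      resources [
        {
          name: "MEM"
          count: 2
        }
      ]
    }
  }
]
```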
I set `--rate-limit-resource=MEM:12` on the command used to start Triton, but Triton then proceeded to load all models in the repository regardless of the resource limit.
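Roughly, the launch command is along these lines (the repository path is a placeholder, and I am assuming rate limiting is enabled via `--rate-limit=execution_count`):

```shell
tritonserver \
  --model-repository=/path/to/model_repository \
  --rate-limit=execution_count \
  --rate-limit-resource=MEM:12
```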
I think I am missing something. Does rate limiting not apply to the initial loading? Or am I loading my models in the wrong place? Or am I not unloading correctly?
All models are Python based and operate like below:

```python
class TritonPythonModel:
    def initialize(self, args):
        # Model is loaded here.
        pass

    def execute(self, requests):
        # Inference...
        pass

    def finalize(self):
        # Model is unloaded here.
        pass
```
Help and any pointers are welcome!