Hi!

I am tasked with optimizing a model repository of ~40 models, most of which demand a couple of GB of VRAM. The models will not receive much traffic, but they do have to be on standby for the occasional inference.

After reading the docs, I concluded I should use the rate limiting feature. I defined a resource `MEM` and defined an instance group like so for each VRAM-bound model:
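For reference, an instance group with a rate-limiter resource looks roughly like this in `config.pbtxt` (the `count` values here are illustrative, not my actual numbers):

```protobuf
instance_group [
  {
    count: 1
    kind: KIND_GPU
    rate_limiter {
      resources [
        {
          name: "MEM"
          count: 2
        }
      ]
    }
  }
]
```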
I set `--rate-limit-resource=MEM:12` on the command used to start Triton, but Triton then proceeded to load all models in the repository regardless of the resource limit.
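Roughly, the launch command is along these lines (the repository path is a placeholder, and I am assuming rate limiting is enabled via `--rate-limit=execution_count`):

```shell
tritonserver \
  --model-repository=/path/to/model_repository \
  --rate-limit=execution_count \
  --rate-limit-resource=MEM:12
```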
I think I am missing something. Does rate limiting not apply to the initial loading? Or am I loading my models in the wrong place? Or am I not unloading correctly?
All models are Python based and operate like below:

```python
class TritonPythonModel:
    def initialize(self, args):
        # Model is loaded here.
        pass

    def execute(self, requests):
        # Inference...
        pass

    def finalize(self):
        # Model is unloaded here.
        pass
```
Help and any pointers are welcome!