Conversation

@prokotg (Collaborator) commented Oct 6, 2025

This PR introduces multiple type definitions on both the benchmark and endpoint ends. With these changes, we can define a list of required types on the benchmark end, for example: [chat, vlm]. On the endpoint side, we define capabilities, which must be a superset of the required types. More information is provided in the documentation.
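
For illustration, the two sides could look like this (a minimal sketch; the benchmark-side key is the one discussed in review below, while the endpoint-side shape is assumed from the evaluation.target.api_endpoint.type field):

# benchmark side: required type combinations
config:
  supported_endpoint_types:
    - [chat, vlm]

# endpoint side (shape assumed): declared capabilities
target:
  api_endpoint:
    type: [chat, vlm, completions]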

@prokotg requested review from a team as code owners October 6, 2025 15:43
copy-pr-bot commented Oct 6, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@prokotg (Collaborator, Author) commented Oct 6, 2025

/ok to test cc7936d

@wprazuch (Contributor) left a comment


Added some comments :)

benchmark_type_combination = [benchmark_type_combination]

if model_types.issuperset(set(benchmark_type_combination)):
    is_target_compatible = True
@wprazuch (Contributor) commented

We only keep information about the last successful check in this variable - is this not a potential issue? To be completely honest with you, I have a hard time understanding this part; could you elaborate on it? :D

@prokotg (Collaborator, Author) commented Oct 8, 2025

Benchmark requirements are strict. From a benchmark perspective, you define a list of accepted sets. For example, let's imagine that your benchmark is a VLM benchmark but accepts both base and chat models. Then, on the benchmark side, you would define evaluation.config.supported_endpoint_types as a list of:

config:
  supported_endpoint_types:
    - [vlm, chat]
    - [vlm, completions]

This means that your model must have all capabilities in at least one of these sets (the sets are implemented as lists, but that does not matter here IMO).

So the capabilities we define on the model endpoint must be a superset of at least one element from the supported_endpoint_types list.

Let's have a look at some examples.

  1. Model [chat] - neither a superset of [vlm, chat] nor [vlm, completions]
  2. Model [chat, vlm] - superset of [vlm, chat] but not [vlm, completions]
  3. Model [chat, vlm, completions] - superset of both [vlm, chat] and [vlm, completions]
  4. Model [chat, vlm, completions, audio] - superset of both [vlm, chat] and [vlm, completions]

It does not matter which one is accepted - it only matters that at least one is compatible. So we keep the last one.
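
Here is a minimal sketch of that check in isolation (names follow the reviewed snippet; the surrounding plumbing is assumed):

model_types = {"chat", "vlm", "completions"}  # endpoint capabilities

supported_endpoint_types = [["vlm", "chat"], ["vlm", "completions"]]

is_target_compatible = False
for benchmark_type_combination in supported_endpoint_types:
    # a single type may come in as a bare string; normalize it to a list
    if not isinstance(benchmark_type_combination, list):
        benchmark_type_combination = [benchmark_type_combination]
    if model_types.issuperset(set(benchmark_type_combination)):
        is_target_compatible = True  # at least one combination matches

print(is_target_compatible)  # True: example 3 above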

@prokotg (Collaborator, Author) commented

But I do see a clash when multiple combinations are compatible - which one is passed down to the evaluation harness?

if model_types.issuperset(set(benchmark_type_combination)):
    is_target_compatible = True

if evaluation.target.api_endpoint.type is None:
@wprazuch (Contributor) commented Oct 8, 2025

If I got this right, this will never be true, since if it's None it will already have been wrapped in a list at line 359.
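
A quick illustration of why (assuming the line-359 wrap runs before this check):

endpoint_type = None                   # evaluation.target.api_endpoint.type
if not isinstance(endpoint_type, list):
    endpoint_type = [endpoint_type]    # None becomes [None]

print(endpoint_type is None)           # False, so the `is None` branch is dead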

@prokotg (Collaborator, Author) commented

you're right, thanks!
