[NPUW] Enabled model with multiple outputs in LLMInferRequest #31520
Conversation
Force-pushed 28d3e37 to b5a0806
Force-pushed b5a0806 to 11ab2cb
// second and third and combine them using XOR
// and bit shifting:
return ((hash<std::size_t>()(port.get_index()) ^ (hash<const ov::Node*>()(port.get_node()) << 1)) >> 1) ^
Wouldn't it be enough just to have hash from port.get_node()?
I am not sure, as ov::Node can have multiple outputs (with different indices).
LOG_DEBUG("Input name " << input_name << " doesn't contain kv cache. Skipping.");
continue;
}
NPUW_ASSERT(m_kvcache_in_ports.find(input_name) == m_kvcache_in_ports.end());
This assert looks very suspicious
// model's I/O are appended to original model's I/O at the end,
// thus it is safe to loop over KVCache I/O blocks just using some
// start offsets.
std::size_t start_idx_in_outputs = 0u;
Where do we set this one to something different than 0?
In the constructor of LLMCompiledModel: https://github.com/openvinotoolkit/openvino/pull/31520/files#diff-7c0906d2d24827a3f37868e79470c1a41ea4bf66eae148a462c0442f8e9cc191R1087
@@ -531,9 +530,9 @@ void ov::npuw::LLMInferRequest::infer_prefill(ov::SoPtr<ov::ITensor> input_ids,
     if (m_lm_head_request) {
         LOG_DEBUG("Calling inference for LM head model.");
         m_lm_head_request->infer();
-        m_logits = m_lm_head_request->get_tensor(m_lm_head_logits_port);
+        update_out_tensors_from(m_lm_head_request);
Wrong. Only logits will be in LM head; other outputs should be gathered from prefill.