Description
What would you like to be added:
I would like to propose a new prefix-aware scorer. It assigns a numerical score to each target pod in an inference pool based on prefix matching: it uses historical prompt patterns to route requests to pods that have previously handled similar prompt segments.
The scorer tracks prefixes in an in-memory store that uses a fast hashing algorithm (xxHash) and an LRU (Least Recently Used) eviction policy to keep memory consumption bounded.
Despite its similarities with proposal #602 / #768, this scorer is a self-contained plugin that can be enabled on its own or used in conjunction with other scorers, and it has no impact on the internal structure of the EPP/scheduler.
Why is this needed:
Because it is lightweight and self-contained, this scorer can improve cache hit rates and efficiency without depending on the availability of a distributed KV-cache index. It does not guarantee that the selected pod holds the exact cached context for the current request, so it should be considered a best-effort, opportunistic heuristic.