AIBrix KV cache API #6

DwyaneShi · 2024-11-06T18:27:53Z

Summary

This PR introduces AIBrixBlobStorage, a new storage layer for KV tensors in which each unit of storage corresponds to a "chunk" of tokens (a fixed-size grouping of tokens). This approach improves storage efficiency and metadata management by:

Reducing metadata overhead compared to the v6d v1 API.
Avoiding the access overhead of the POSIX file API seen in the v6d v2 API.

Design Details

1. Chunk-based Storage Model: AIBrixBlobStorage stores KV tensors at the chunk level, allowing efficient retrieval and storage by grouping data in fixed-size chunks of tokens.

2. S3-FIFO Replacement Policy

The S3-FIFO policy optimizes chunk retention by being scan-resistant and recognizing frequently accessed ("hot") chunks.
The Main FIFO list mirrors the global chunk list.
The LocalSync function periodically persists new chunks in the Main FIFO list to the global chunk list. Persisted chunks that are evicted from the Main FIFO list are deleted from the global chunk list, which keeps the cache streamlined.
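
For readers unfamiliar with S3-FIFO, the sketch below shows the general shape of the policy: a small probationary FIFO, a main FIFO, and a ghost set of recently evicted names. It is a simplified illustration and not the code in this PR. The class name `S3FifoSketch`, the counter cap of 3, and the unbounded ghost set are assumptions made for brevity, and the LocalSync interaction with the global chunk list is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Simplified S3-FIFO sketch (hypothetical names, not the PR's classes).
class S3FifoSketch {
 public:
  S3FifoSketch(std::size_t small_cap, std::size_t main_cap)
      : small_cap_(small_cap), main_cap_(main_cap) {
    assert(small_cap_ > 0 && main_cap_ > 0);
  }

  // Records an access. Returns true on a hit; on a miss the name is admitted.
  bool Access(const std::string& name) {
    auto it = freq_.find(name);
    if (it != freq_.end()) {
      if (it->second < 3) ++it->second;  // saturating access counter
      return true;
    }
    if (ghost_.erase(name) > 0) {
      InsertMain(name);   // evicted recently: admit straight into main
    } else {
      InsertSmall(name);  // first sighting: probation in the small FIFO
    }
    return false;
  }

  bool Contains(const std::string& name) const { return freq_.count(name) > 0; }
  bool InMain(const std::string& name) const { return in_main_.count(name) > 0; }

 private:
  void InsertSmall(const std::string& name) {
    if (small_.size() >= small_cap_) EvictSmall();
    small_.push_back(name);
    freq_[name] = 0;
  }

  void InsertMain(const std::string& name) {
    while (main_.size() >= main_cap_) EvictMain();
    main_.push_back(name);
    in_main_.insert(name);
    freq_[name] = 0;
  }

  // One-hit wonders leave through the ghost set, which gives scan resistance.
  void EvictSmall() {
    const std::string victim = small_.front();
    small_.pop_front();
    if (freq_[victim] > 0) {
      InsertMain(victim);  // re-accessed while on probation: promote
    } else {
      freq_.erase(victim);
      ghost_.insert(victim);  // unbounded here; a real cache would cap it
    }
  }

  // Hot entries get another pass through main with a decremented counter.
  void EvictMain() {
    const std::string victim = main_.front();
    main_.pop_front();
    int& count = freq_[victim];
    if (count > 0) {
      --count;
      main_.push_back(victim);
    } else {
      freq_.erase(victim);
      in_main_.erase(victim);
    }
  }

  std::size_t small_cap_, main_cap_;
  std::deque<std::string> small_, main_;
  std::unordered_set<std::string> ghost_, in_main_;
  std::unordered_map<std::string, int> freq_;
};
```

In this sketch only entries that reach the main FIFO would be candidates for persisting to the global chunk list, which matches the description above of the Main FIFO list mirroring the global list.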

3. Chunk Naming

Each chunk has a unique name calculated by name = namespace + "_" + hash(hash(previous chunk) + tokens of current chunk).
This hashing mechanism provides unique chunk names that support efficient lookup and conflict avoidance.
The naming mechanism also encodes chunk lineage, because each name depends on the hash of the previous chunk.
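
The naming rule can be sketched as a rolling hash in which each chunk's hash is seeded with the hash of the previous chunk. The snippet below is an illustration only: it uses 64-bit FNV-1a as a stand-in (the hash actually used in the PR may differ), hashes the raw `int` bytes (so values depend on endianness), and assumes a trailing partial chunk is skipped. `Fnv1a` and `ChunkNames` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// 64-bit FNV-1a over raw bytes, continuing from `seed`.
// A stand-in for whatever hash function the PR really uses.
inline std::uint64_t Fnv1a(const void* data, std::size_t len,
                           std::uint64_t seed) {
  const auto* bytes = static_cast<const unsigned char*>(data);
  std::uint64_t h = seed;
  for (std::size_t i = 0; i < len; ++i) {
    h ^= bytes[i];
    h *= 1099511628211ULL;  // FNV prime
  }
  return h;
}

// name = namespace + "_" + hash(hash(previous chunk) + tokens of this chunk)
std::vector<std::string> ChunkNames(const std::string& ns,
                                    const std::vector<int>& tokens,
                                    std::size_t chunk_size) {
  assert(chunk_size > 0);
  std::vector<std::string> names;
  std::uint64_t prev = 14695981039346656037ULL;  // FNV offset basis
  for (std::size_t i = 0; i + chunk_size <= tokens.size(); i += chunk_size) {
    // Seeding with the previous hash folds the whole prefix into this name.
    prev = Fnv1a(tokens.data() + i, chunk_size * sizeof(int), prev);
    names.push_back(ns + "_" + std::to_string(prev));
  }
  return names;
}
```

Two prompts that share their first chunk get the same first name, while a chunk with identical tokens but a different prefix gets a different name. That is the lineage property described above.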

4. TTL-based Global GC

Each chunk’s last access time is recorded in its access_time label, which is only pushed to the global metadata during LocalSync to reduce frequent updates.
The GlobalGC function periodically checks all chunks for expiration by TTL.
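
A minimal sketch of the expiration check, assuming access times are stored as seconds since the epoch and that a chunk expires once it has been idle for longer than the TTL. `ChunkMeta` and `CollectExpired` are hypothetical names and do not appear in the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical view of the per-chunk global metadata.
struct ChunkMeta {
  std::string name;
  std::int64_t access_time;  // seconds since epoch, as last pushed by LocalSync
};

// Returns the names of chunks whose idle time exceeds the TTL.
std::vector<std::string> CollectExpired(const std::vector<ChunkMeta>& chunks,
                                        std::int64_t now,
                                        std::int64_t ttl_seconds) {
  assert(ttl_seconds >= 0);
  std::vector<std::string> expired;
  for (const ChunkMeta& chunk : chunks) {
    if (now - chunk.access_time > ttl_seconds) {
      expired.push_back(chunk.name);
    }
  }
  return expired;
}
```

Because access_time only reaches the global metadata during LocalSync, the value seen by a GC pass can lag behind the true last access by up to one sync interval.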

5. KVCacheChunk Abstraction

KVCacheChunk represents a fixed number of tokens within a chunk and organizes its object blob with:

All KV tensors stored first.
All tokens (including prefix and current tokens) appended afterward, allowing for exact token list matching.

Metadata in each chunk includes:

Namespace: Used as a chunk name prefix and to list all chunks.
Access time: Tracks chunk access for TTL-based GC.
md5sum: Covers all tokens to detect corruption when reconstructing chunks. This checksum excludes tensors to reduce compute overhead.
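
The blob layout and the exact token list match can be sketched as follows. This is an illustration under stated assumptions: tokens are stored as `int32_t` directly after the tensor bytes, and the helper names `PackChunkBlob` and `TokensMatch` are hypothetical. The md5sum check is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Layout assumed here: [all KV tensor bytes][prefix tokens + current tokens].
std::vector<std::uint8_t> PackChunkBlob(
    const std::vector<std::uint8_t>& tensors,
    const std::vector<std::int32_t>& all_tokens) {
  const std::size_t token_bytes = all_tokens.size() * sizeof(std::int32_t);
  std::vector<std::uint8_t> blob(tensors.size() + token_bytes);
  if (!tensors.empty()) {
    std::memcpy(blob.data(), tensors.data(), tensors.size());
  }
  if (token_bytes > 0) {
    std::memcpy(blob.data() + tensors.size(), all_tokens.data(), token_bytes);
  }
  return blob;
}

// True only if the stored token list equals prefix + tokens exactly.
bool TokensMatch(const std::vector<std::uint8_t>& blob,
                 std::size_t tensor_nbytes,
                 const std::vector<std::int32_t>& prefix,
                 const std::vector<std::int32_t>& tokens) {
  assert(tensor_nbytes <= blob.size());
  std::vector<std::int32_t> expected(prefix);
  expected.insert(expected.end(), tokens.begin(), tokens.end());
  const std::size_t token_bytes = blob.size() - tensor_nbytes;
  if (token_bytes != expected.size() * sizeof(std::int32_t)) return false;
  return token_bytes == 0 ||
         std::memcmp(blob.data() + tensor_nbytes, expected.data(),
                     token_bytes) == 0;
}
```

Storing the full token list after the tensors means a name collision between two different token sequences can be detected at query time by comparing tokens, at the cost of extra bytes per chunk.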

For more details on the S3-FIFO policy, refer to the post here.

modules/llm-cache/ds/kv_cache_chunk.cc

FuturisticWater · 2024-11-18T18:22:43Z

modules/llm-cache/ds/kv_cache_chunk.h

+// store all the tokens (including prefix tokens and current tokens
+// cached in the chunk), which will be used to avoid hash conflicts.
+//
+// In its metadata, we store the namespace (i.e., `ns_`), which will


Please also describe the intended usage for namespaces: e.g., is it intended to be used in a multi-tenant deployment scheme to distinguish the different tenants/models that store their KV chunks into the same KV cache?

FuturisticWater · 2024-11-18T18:25:11Z

modules/llm-cache/ds/kv_cache_chunk.h

+// compare it with the md5sum in the metadata. If they are the same,
+// we consider the chunk is valid. Otherwise, we consider the chunk is
+// corrupted. By far, we don't use the md5sum of the tensors to alleviate
+// the compute overhead.


Please explain why this decision is made: e.g., for simplicity or for performance?

BTW: "By far" => "So far".

FuturisticWater · 2024-11-18T18:28:50Z

modules/llm-cache/ds/kv_cache_chunk.h

+
+ private:
+  std::shared_ptr<Buffer> buffer_;
+  // number of prefix tokens and current tokens in the chunk


Defining the number of prefix tokens seems more natural here. In the comment, you can define "total number of tokens in the prompt up to this chunk = <num_prefix_tokens> + <chunk_size>".

FuturisticWater · 2024-11-18T18:34:53Z

modules/llm-cache/ds/kv_cache_chunk.h

+ public:
+  static std::unique_ptr<Object> Create() __attribute__((used)) {
+    return std::static_pointer_cast<Object>(
+        std::unique_ptr<KVCacheChunk>{new KVCacheChunk()});


How about simply std::make_unique<KVCacheChunk>() here instead of new explicitly?
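
A minimal sketch of what this suggestion could look like, using stand-in `Object` and `KVCacheChunk` types rather than the real vineyard classes. It assumes the constructor is accessible to `std::make_unique`; if the real constructor is private, `std::make_unique` cannot call it and the explicit `new` inside the class would still be needed.

```cpp
#include <cassert>
#include <memory>

// Stand-ins for the vineyard types; only the factory shape matters here.
struct Object {
  virtual ~Object() = default;
};

struct KVCacheChunk : Object {
  KVCacheChunk() = default;  // public in this sketch (assumption)

  // std::unique_ptr<KVCacheChunk> converts implicitly to
  // std::unique_ptr<Object>, so no explicit cast is required.
  static std::unique_ptr<Object> Create() {
    return std::make_unique<KVCacheChunk>();
  }
};

// Usage check: the factory yields a KVCacheChunk behind an Object pointer.
inline void CheckCreate() {
  std::unique_ptr<Object> obj = KVCacheChunk::Create();
  assert(dynamic_cast<KVCacheChunk*>(obj.get()) != nullptr);
}
```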

FuturisticWater · 2024-11-18T18:35:59Z

modules/llm-cache/ds/kv_cache_chunk.h

+
+  void Construct(const ObjectMeta& meta) override;
+
+  int GetChunkSize() { return chunk_size_; }


const method

FuturisticWater · 2024-11-18T19:07:46Z

modules/llm-cache/ds/kv_cache_chunk.h

+      const std::vector<std::vector<std::pair<LLMKV, LLMKV>>>& kv_tensor);
+
+  Status Query(const std::vector<int>& prefix, const std::vector<int>& tokens,
+               std::vector<std::vector<std::pair<LLMKV, LLMKV>>>& kv_tensor);


Use pointer type for output parameters.
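
For illustration, a pointer output parameter (together with the const getter requested in the previous comment) might look like the sketch below. `ChunkSketch`, the simplified `LLMKV` struct, and the `bool` return value are assumptions that keep the example self-contained; the real method returns a vineyard `Status`.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in; the real LLMKV is defined in vineyard's llm-cache.
struct LLMKV {
  void* data = nullptr;
  std::size_t length = 0;
};
using KVTensor = std::vector<std::vector<std::pair<LLMKV, LLMKV>>>;

class ChunkSketch {
 public:
  explicit ChunkSketch(int chunk_size) : chunk_size_(chunk_size) {}

  // The getter does not mutate state, so it is const-qualified.
  int GetChunkSize() const { return chunk_size_; }

  // Inputs are const references; the output is a pointer, so a call site
  // reads Query(prefix, tokens, &kv) and the mutation is visible there.
  bool Query(const std::vector<int>& prefix, const std::vector<int>& tokens,
             KVTensor* kv_tensor) const {
    assert(kv_tensor != nullptr);
    (void)prefix;  // unused in this sketch
    if (tokens.size() != static_cast<std::size_t>(chunk_size_)) return false;
    // Placeholder fill: one (K, V) pair per token for a single layer.
    kv_tensor->assign(tokens.size(),
                      std::vector<std::pair<LLMKV, LLMKV>>(1));
    return true;
  }

 private:
  int chunk_size_;
};
```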

FuturisticWater · 2024-11-18T19:12:30Z

modules/llm-cache/ds/kv_cache_chunk.h

+  static Status Make(std::shared_ptr<KVCacheChunkBuilder>& chunk_builder,
+                     RPCClient& rpc_client, int tensor_nbytes, int layer,
+                     int chunk_size, const std::string& kv_cache_ns,
+                     ObjectID chunk_id);


When querying/loading a KV chunk from vineyard, the shape of the chunk (e.g., the KV tensor dimensions) is actually determined by the existing vineyard objects, so there is no need to repeat it here.

Alternatively, if the shape is specified here, it should be validated against the shape loaded from vineyard.

FuturisticWater · 2024-11-18T19:28:20Z

modules/llm-cache/hash/md5.h

+
+namespace vineyard {
+
+std::string md5(const std::string& content) {


Please add a method comment pointing out that this method produces the MD5 as a human-readable string rather than as a raw byte sequence.
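
To make the distinction concrete: an MD5 digest is 16 raw bytes, while the human-readable form is those bytes rendered as 32 hexadecimal characters. The helper below is a hypothetical illustration of that rendering (`DigestToHex` is not part of the PR) and does not compute MD5 itself.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <string>

// Renders a 16-byte digest as 32 lowercase hex characters.
std::string DigestToHex(const std::array<std::uint8_t, 16>& digest) {
  static const char kHex[] = "0123456789abcdef";
  std::string out;
  out.reserve(digest.size() * 2);
  for (std::uint8_t byte : digest) {
    out.push_back(kHex[byte >> 4]);    // high nibble
    out.push_back(kHex[byte & 0x0f]);  // low nibble
  }
  assert(out.size() == 32);
  return out;
}
```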

FuturisticWater · 2024-11-18T19:33:40Z

modules/llm-cache/hash/hasher.h

+      }
+      candidates.insert(candidates.end(), tokens.begin() + i,
+                        tokens.begin() + i + chunkSize);
+      auto currHash =


An explicit type specifier is more suitable here, so that the bit size of the hash value is clear.

FuturisticWater · 2024-11-18T21:50:56Z

modules/llm-cache/hash/hasher.h

+   * The function thus produces a series of interdependent hash values, each
+   * influenced by the previous hash.
+   */
+  Status computeChunkHashesForTokens(const std::vector<int>& tokens,


Use pointer type for output parameter.

Please update the comment to mention the hash value bit-size.

FuturisticWater · 2024-11-18T22:55:51Z

modules/llm-cache/ds/kv_cache_chunk.h

+        std::unique_ptr<KVCacheChunk>{new KVCacheChunk()});
+  }
+
+  void Construct(const ObjectMeta& meta) override;


IIUC, this method can only be called after meta has been populated by fetching object metadata & data (since you're populating buffers also)? If so, please clearly document this or refer readers to the base class for this requirement.

FuturisticWater · 2024-11-25T21:16:47Z

modules/llm-cache/ds/kv_cache_chunk.cc

+  RETURN_ON_ASSERT(meta.GetKeyValue<int>(KVCacheChunk::kFieldNameLayer) ==
+                   layer_);
+  // We assume it's not possilbe to have same name and md5 of all tokens
+  RETURN_ON_ASSERT(meta.GetKeyValue<std::string>(KVCacheChunk::kFieldNameMd5) ==


Suggested wording: "We assume that it's not possible for two different sequences of tokens to have the same name and md5."

How safe is this assumption? Can we afford to compare the token sequences here to guard against hash collisions?

FuturisticWater · 2024-11-25T21:18:13Z

modules/llm-cache/ds/kv_cache_chunk.h

+  ~KVCacheChunkBuilder() = default;
+
+ private:
+  Status Construct();


Please add API comments here, in particular about its intended usage.

FuturisticWater · 2024-11-25T21:20:45Z

modules/llm-cache/ds/kv_cache_chunk.cc

+
+  RETURN_ON_ASSERT(rpc_client_.Connected());
+
+  VLOG(100) << "Constructing " << ObjectIDToString(chunk_id_);


When constructing a new KVCacheChunk instance, the chunk id initialized by Make above is an invalid id. Is chunk_id_ here still the invalid id? If not, when and where does it get initialized?

FuturisticWater · 2024-11-25T21:25:56Z

modules/llm-cache/ds/kv_cache_chunk.cc

+  }
+
+  // fetch from remote
+  if (object == nullptr) {


This section (from line 140 to 147) could be better structured as:

Suggested change

-  if (object == nullptr) {
+  if (!status.ok()) {
+    ...
+    return Status::ObjectNotExists();
+  }
+  if (object_meta.IsLocal()) {
+    object = rpc_client_.GetObject(chunk_id_);
+    RETURN_ON_ASSERT(object != nullptr);
+    return OkStatus();
+  }
+  // fetch from remote
+  ...

FuturisticWater · 2024-11-25T21:28:02Z

modules/llm-cache/ds/kv_cache_chunk.cc

+    RETURN_ON_ERROR(rpc_client_.ClusterInfo(cluster_info));
+    std::string rpc_endpoint =
+        cluster_info[object_meta.GetInstanceId()].value("rpc_endpoint", "");
+    if (!rpc_endpoint.empty()) {


Suggested change

-    if (!rpc_endpoint.empty()) {
+    if (rpc_endpoint.empty()) {
+       // log error obtaining rpc endpoint
+       return error_status;
+    }
+    // continue with the remaining steps

This is more specific than returning the generic "object is nullptr" error below.

DwyaneShi requested review from Jeffwan, FuturisticWater and happyandslow November 6, 2024 18:29

DwyaneShi force-pushed the dev.hybrid-api branch from 158fc25 to 50b87af on November 11, 2024 22:32

FuturisticWater reviewed Nov 14, 2024


DwyaneShi force-pushed the dev.hybrid-api branch from 42693cd to 553489b on November 14, 2024 04:54

FuturisticWater reviewed Nov 18, 2024


DwyaneShi force-pushed the dev.hybrid-api branch from 553489b to 27531d8 on November 20, 2024 22:26

DwyaneShi added 5 commits November 20, 2024 16:00

Add AIBrix LLM kv cache API

bc40a94

- rolling hash based approach to maintain the token lineage
- S3-FIFO inspired sync and GC mechanism

Signed-off-by: DwyaneShi <[email protected]>

build: use arrow 17.0.0 for testing

def5a45

some unit tests fail if using arrow 18.0.0

Signed-off-by: DwyaneShi <[email protected]>

build: fix dependency issue

09ebf7f

Signed-off-by: DwyaneShi <[email protected]>

Enhance comments for major components

06be803

Signed-off-by: DwyaneShi <[email protected]>

Add exception handling

a034f48

Signed-off-by: DwyaneShi <[email protected]>

DwyaneShi force-pushed the dev.hybrid-api branch from 27531d8 to a034f48 on November 21, 2024 00:00

FuturisticWater reviewed Nov 25, 2024
