From dad7d3c7fc1ede2b3395d35456035298d9f167bb Mon Sep 17 00:00:00 2001
From: dzlab
Date: Sun, 11 May 2025 11:30:23 -0700
Subject: [PATCH 1/6] Python RocksDB WideColumn Database

---
 _posts/2025-05-11.md | 118 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 118 insertions(+)
 create mode 100644 _posts/2025-05-11.md

diff --git a/_posts/2025-05-11.md b/_posts/2025-05-11.md
new file mode 100644
index 0000000..9ea745a
--- /dev/null
+++ b/_posts/2025-05-11.md
@@ -0,0 +1,118 @@
+---
+layout: post
+comments: true
+title: Pinterest's Wide Column Database in Python with RocksDB
+excerpt: Building a simplified version of Pinterest's Wide Column Database in Python with RocksDB
+categories: database
+tags: [python,rocksdb]
+toc: true
+img_excerpt:
+---
+
+
+In a recent article on [Pinterest Engineering Blog](https://medium.com/pinterest-engineering/building-pinterests-new-wide-column-database-using-rocksdb-f5277ee4e3d2), they described in detail how they implemented in C++ a RocksDB-based distributed wide column database called **Rockstorewidecolumn**. While their system tackles petabytes and millions of requests per second with a distributed architecture, the core concepts of mapping a wide column data model onto a key-value store like RocksDB are fascinating.
+
+This article explores how to implement a simpler, single-instance version of Pinterest's **Rockstorewidecolumn** in Python using the power and efficiency of RocksDB.
+
+
+### What's a Wide Column Database, Anyway?
+
+Think beyond traditional relational tables with fixed schemas. A wide column database offers:
+
+* **Rows:** Each identified by a unique `row_key`.
+* **Flexible Columns:** Each row can have a different set and number of columns. No predefined schema for all rows!
+* **Columnar Data:** Data is organized by columns within a row.
+* **Versioned Cells:** Often, values within a column can have multiple versions, typically timestamped.
+
+This model is great for use cases like user profiles (where users have varying attributes), time-series data, or, as Pinterest showed, storing user event sequences.
+
+### The Challenge: From Wide Columns to Simple Keys & Values
+
+RocksDB is an incredibly fast embedded key-value store. It doesn't inherently understand "rows," "columns," or "versions." It just knows keys and values, both of which are byte strings. Our main task is to cleverly design a **key structure** that lets us represent our wide column model.
+
+### Our Data Model & RocksDB Key Design
+
+1. **Logical View:**
+    * **Row Key:** Uniquely identifies a row (e.g., `user123`).
+    * **Column Name:** Identifies a specific attribute within a row (e.g., `email`, `last_login_event`).
+    * **Timestamp:** When this specific piece of data was recorded (e.g., milliseconds since epoch).
+    * **Value:** The actual data for that row, column, and timestamp.
+
+2. **Storage View (The RocksDB Key):**
+    To store a specific cell (a value for a given row, column, and time), we'll concatenate these elements into a single RocksDB key. A separator (like the null byte `\x00`) is crucial.
+
+    `RocksDB_Key = row_key_bytes + SEPARATOR + column_name_bytes + SEPARATOR + encoded_timestamp_bytes`
+
+    The `RocksDB_Value` will simply be the `column_value_bytes`.
+
+3. **The Timestamp Trick for Versioning:**
+    We want to retrieve the latest versions of a column first. RocksDB sorts keys lexicographically in ascending order. To get descending order for timestamps:
+    * Use integer timestamps (e.g., milliseconds since epoch).
+ * Store `MAX_POSSIBLE_TIMESTAMP - actual_timestamp`. + * Pack this inverted timestamp as a fixed-length, big-endian byte string (e.g., using Python's `struct.pack('>Q', inverted_timestamp)` for an 8-byte unsigned integer). + + This way, newer (smaller inverted) timestamps will sort before older ones. + + +``` ++----------------------+-----------------+-------------------+-----------------+-----------------------+-----------------+-------------------------------------------+ +| dataset_name_bytes | KEY_SEPARATOR | row_key_bytes | KEY_SEPARATOR | column_name_bytes | KEY_SEPARATOR | timestamp_bytes | ++----------------------+-----------------+-------------------+-----------------+-----------------------+-----------------+-------------------------------------------+ +| (String as UTF-8) | (Null Byte `\0`)| (String as UTF-8) | (Null Byte `\0`)| (String as UTF-8) | (Null Byte `\0`)| (8-byte uint64, Big-Endian, Inverted) | ++----------------------+-----------------+-------------------+-----------------+-----------------------+-----------------+-------------------------------------------+ + +``` + +### A Python Implementation Sketch + +We'll use the `python-rocksdb` library. Let's outline a `WideColumnDB` class: + + +**Key Implementation Points:** + +* **`_encode_key` / `_decode_key`:** These are the heart of the system, translating our logical model to and from RocksDB's byte strings. +* **`put_row`:** Takes a list of items for a row. Each item can optionally specify a timestamp. If not, the current server time is used. All writes for a single `put_row` call are wrapped in a `rocksdb.WriteBatch` for atomicity at the row-key level for that call. +* **`get_row`:** This is the most complex. + * It uses RocksDB's iterators and `seek()` operations. + * To get all columns for a row, it seeks to `row_key_bytes + SEPARATOR`. + * To get specific columns, it can either iterate and filter or seek to `row_key_bytes + SEPARATOR + column_name_bytes + SEPARATOR`. + * It collects up to `num_versions` for each requested column, respecting the (optional) time range. +* **`delete_row`:** Also uses iterators to find all keys matching the criteria (entire row, specific columns, or even specific versions) and deletes them using a `WriteBatch`. + +### Unlocking Wide Column Features + +With this key structure, several wide column features become quite natural: + +* **Versioned Values:** Automatically handled by including the timestamp in the key. Each update (even an "overwrite" of a conceptual column) with a new timestamp creates a new, distinct entry in RocksDB. +* **Time Range Queries:** The `get_row` method can filter versions based on `start_timestamp` and `end_timestamp` by examining the decoded timestamp from the key. +* **Out-of-Order Updates:** Clients can provide their own timestamps for data, allowing for backfills or event-time recording. +* **TTL (Time-to-Live):** + * **Read-time enforcement:** When reading, check `key_timestamp + configured_ttl < current_timestamp`. If expired, don't return it. + * **Physical deletion:** This is trickier for a simple implementation. RocksDB's compactions will eventually remove deleted data. A more advanced system might use RocksDB's compaction filters or a background process to scan and delete expired keys. 
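+
+To make the read-time TTL check concrete, here is a minimal sketch of the filtering a read path could apply, assuming the key layout above (inverted, big-endian timestamp in the last 8 bytes of the key) and a hypothetical `ttl_ms` setting:
+
+```python
+import struct
+import time
+
+MAX_UINT64 = 2**64 - 1
+
+def is_expired(rocksdb_key: bytes, ttl_ms: int) -> bool:
+    # The last 8 bytes of the key hold the inverted, big-endian timestamp.
+    (inverted_ts,) = struct.unpack('>Q', rocksdb_key[-8:])
+    written_ms = MAX_UINT64 - inverted_ts
+    now_ms = int(time.time() * 1000)
+    # Expired when key_timestamp + configured_ttl < current_timestamp.
+    return written_ms + ttl_ms < now_ms
+```
+
+A read path would simply skip any key for which `is_expired` returns `True`, leaving physical removal to compaction.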
+
+### Simplicity's Trade-offs
+
+This Python implementation is a great learning tool and can be surprisingly useful for smaller-scale applications:
+
+* **Single Instance:** It's not distributed, so no built-in replication, sharding, or high availability like Pinterest's Rockstorewidecolumn.
+* **Basic Compaction:** Relies on RocksDB's default compaction unless you delve into advanced configurations or custom filters for TTL.
+* **Pagination:** The `get_row` example above doesn't include pagination for very wide rows (many columns). This would require returning a "continuation token" (e.g., the last key part processed) for the client to pass in the next request.
+
+### What's Next?
+
+If you were to expand this:
+
+* **Dataset/Table Management:** Consider using RocksDB's "Column Families" for better logical separation of different datasets (tables) within a single DB instance.
+* **Advanced TTL:** Implement custom compaction filters or background jobs for efficient TTL enforcement.
+* **Robust Pagination:** Add proper marker-based pagination to `get_row`.
+* **Serialization:** Use a more robust serialization format than plain strings for values (e.g., JSON, MessagePack, Protobuf).
+
+### Conclusion
+
+Building a simplified wide column database on top of RocksDB in Python is an achievable and insightful exercise. By carefully designing your key structure, you can map complex data models onto a high-performance key-value store. While it won't rival enterprise-grade distributed systems, it provides a powerful local storage solution and a deeper understanding of how these systems work under the hood.
+
+Happy coding!
+
+## That's all folks
+
+I hope you enjoyed this article, feel free to leave a comment or reach out on twitter [@bachiirc](https://twitter.com/bachiirc).

From 7adcbdaa22249b5b3591daa4d2bbedb22e7262c2 Mon Sep 17 00:00:00 2001
From: dzlab
Date: Sun, 11 May 2025 12:07:11 -0700
Subject: [PATCH 2/6] added section for data modeling and key construction

---
 _posts/2025-05-11.md | 64 +++++++++++++++++++++++++++-----------------
 1 file changed, 39 insertions(+), 25 deletions(-)

diff --git a/_posts/2025-05-11.md b/_posts/2025-05-11.md
index 9ea745a..88bddac 100644
--- a/_posts/2025-05-11.md
+++ b/_posts/2025-05-11.md
@@ -24,43 +24,57 @@ Think beyond traditional relational tables with fixed schemas. A wide column dat
 * **Rows:** Each identified by a unique `row_key`.
 * **Flexible Columns:** Each row can have a different set and number of columns. No predefined schema for all rows!
 * **Columnar Data:** Data is organized by columns within a row.
 * **Versioned Cells:** Often, values within a column can have multiple versions, typically timestamped.
 
-This model is great for use cases like user profiles (where users have varying attributes), time-series data, or, as Pinterest showed, storing user event sequences.
+This model is great for use cases like user profiles, where some attributes may be present for one user and absent for another; time-series data; or, as Pinterest showed, storing user event sequences.
 
-### The Challenge: From Wide Columns to Simple Keys & Values
+### From Wide Columns to Simple Keys & Values
 
-RocksDB is an incredibly fast embedded key-value store. It doesn't inherently understand "rows," "columns," or "versions." It just knows keys and values, both of which are byte strings. Our main task is to cleverly design a **key structure** that lets us represent our wide column model.
+RocksDB is an incredibly fast embedded key-value store. However, it doesn't inherently understand "rows," "columns," or "versions." It just knows keys and values, both of which are byte strings. Our main task is to cleverly design a **key structure** that lets us represent our wide column model.
 
-### Our Data Model & RocksDB Key Design
+From Pinterest's article, the data model mapping (or **Logical View**) from wide columns to key-value looks like this:
 
-1. **Logical View:**
-    * **Row Key:** Uniquely identifies a row (e.g., `user123`).
-    * **Column Name:** Identifies a specific attribute within a row (e.g., `email`, `last_login_event`).
-    * **Timestamp:** When this specific piece of data was recorded (e.g., milliseconds since epoch).
-    * **Value:** The actual data for that row, column, and timestamp.
+* **Dataset:** A collection of data for a use case (like a table).
+* **Row:** Identified by a `row_key` (e.g., `user123`), contains items.
+* **Item:** A `column_name` identifying a specific attribute within a row (e.g., `email`, `last_login_event`) with a list of versioned cells.
+* **Cell:** A `timestamp` when this specific piece of data was recorded (e.g., milliseconds since epoch) and a `column_value` (the actual data).
 
-2. **Storage View (The RocksDB Key):**
-    To store a specific cell (a value for a given row, column, and time), we'll concatenate these elements into a single RocksDB key. A separator (like the null byte `\x00`) is crucial.
-
-    `RocksDB_Key = row_key_bytes + SEPARATOR + column_name_bytes + SEPARATOR + encoded_timestamp_bytes`
-
-    The `RocksDB_Value` will simply be the `column_value_bytes`.
+To store a specific cell (a value for a given dataset, row, column, and time), we can simply concatenate these elements into a single RocksDB key, using a separator like the null byte `\x00`. The choice of separator is crucial so that it cannot be confused with bytes that legitimately appear in the other key components. This **Storage View** is visually explained with the following diagram.
- `RocksDB_Key = row_key_bytes + SEPARATOR + column_name_bytes + SEPARATOR + encoded_timestamp_bytes` +``` ++----------------------+-----------------+-------------------+-----------------+-----------------------+-----------------+-----------------------------------------+ +| dataset_name_bytes | KEY_SEPARATOR | row_key_bytes | KEY_SEPARATOR | column_name_bytes | KEY_SEPARATOR | timestamp_bytes | ++----------------------+-----------------+-------------------+-----------------+-----------------------+-----------------+-----------------------------------------+ +| (String as UTF-8) | (Null Byte `\0`)| (String as UTF-8) | (Null Byte `\0`)| (String as UTF-8) | (Null Byte `\0`)| (8-byte uint64, Big-Endian, Inverted) | ++----------------------+-----------------+-------------------+-----------------+-----------------------+-----------------+-----------------------------------------+ +``` - The `RocksDB_Value` will simply be the `column_value_bytes`. +One other thing to consider in the implementation is the versioning, and the ability to retrieve the latest versions of a column first. -3. **The Timestamp Trick for Versioning:** - We want to retrieve the latest versions of a column first. RocksDB sorts keys lexicographically in ascending order. To get descending order for timestamps: - * Use integer timestamps (e.g., milliseconds since epoch). - * Store `MAX_POSSIBLE_TIMESTAMP - actual_timestamp`. - * Pack this inverted timestamp as a fixed-length, big-endian byte string (e.g., using Python's `struct.pack('>Q', inverted_timestamp)` for an 8-byte unsigned integer). +For this, we can use a Timestamp trick that leverages the fact that RocksDB sorts keys lexicographically in ascending order. In fact, we can get a descending order for timestamps as follows: +* Use integer timestamps (e.g., milliseconds since epoch). +* Store `MAX_POSSIBLE_TIMESTAMP - actual_timestamp`. +* Pack this inverted timestamp as a fixed-length, big-endian byte string (e.g., using Python's `struct.pack('>Q', inverted_timestamp)` for an 8-byte unsigned integer). - This way, newer (smaller inverted) timestamps will sort before older ones. +This way, newer (smaller inverted) timestamps will sort before older ones. 
+
+Here is a complete Python snippet that demonstrates how a key is constructed:
+
+```python
+import struct
+
+SEPARATOR = b"\x00"
+
+dataset = b"user_profile"
+row_key = b"user123"
+column_name = b"email"
+timestamp_ms = 1678886400000
+
+MAX_UINT64 = 2**64 - 1
+inverted_ts_bytes = struct.pack('>Q', MAX_UINT64 - timestamp_ms)
+
+# Assemble the full RocksDB key from its components
+rocksdb_key = dataset + SEPARATOR + row_key + SEPARATOR + column_name + SEPARATOR + inverted_ts_bytes
+print(rocksdb_key)
+```
+
+Which results in a key that looks like this:
+
+```python
+b'user_profile\x00user123\x00email\x00\xff\xff\xfey\x1a\x92\x8f\xff'
+```

From 8df8119dfbcd27be56fea02c70695d3f89689531 Mon Sep 17 00:00:00 2001
From: dzlab
Date: Sun, 11 May 2025 12:25:16 -0700
Subject: [PATCH 3/6] added description for implementation

---
 _posts/2025-05-11.md | 23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/_posts/2025-05-11.md b/_posts/2025-05-11.md
index 88bddac..e9b7061 100644
--- a/_posts/2025-05-11.md
+++ b/_posts/2025-05-11.md
@@ -77,21 +77,20 @@ Which results in a key that looks like this:
 b'user_profile\x00user123\x00email\x00\xff\xff\xfey\x1a\x92\x8f\xff'
 ```
 
-### A Python Implementation Sketch
+### Python Implementation
 
-We'll use the `python-rocksdb` library. Let's outline a `WideColumnDB` class:
+The full implementation of this datastore can be found in the [GitHub KVWC project](https://github.com/dzlab/vibecoding/tree/main/kvwc), specifically in the [WideColumnDB](https://github.com/dzlab/vibecoding/blob/main/kvwc/wide_column_db.py) class.
 
+Here are some key points from this implementation:
 
-**Key Implementation Points:**
-
-* **`_encode_key` / `_decode_key`:** These are the heart of the system, translating our logical model to and from RocksDB's byte strings.
-* **`put_row`:** Takes a list of items for a row. Each item can optionally specify a timestamp. If not, the current server time is used. All writes for a single `put_row` call are wrapped in a `rocksdb.WriteBatch` for atomicity at the row-key level for that call.
-* **`get_row`:** This is the most complex.
-    * It uses RocksDB's iterators and `seek()` operations.
-    * To get all columns for a row, it seeks to `row_key_bytes + SEPARATOR`.
-    * To get specific columns, it can either iterate and filter or seek to `row_key_bytes + SEPARATOR + column_name_bytes + SEPARATOR`.
-    * It collects up to `num_versions` for each requested column, respecting the (optional) time range.
-* **`delete_row`:** Also uses iterators to find all keys matching the criteria (entire row, specific columns, or even specific versions) and deletes them using a `WriteBatch`.
+* **`_encode_key` / `_decode_key`:** Facilitate translating our logical model to and from RocksDB's byte strings.
+* **`put_row`:** Takes a list of items for a row. Each item can optionally specify a timestamp. If not, the current server time is used. All writes for a single `put_row` call are wrapped in a `rocksdb.WriteBatch` for atomicity at the row-key level for that call.
+* **`get_row`:** The most complex method:
+    * It uses RocksDB's iterators and `seek()` operations.
+    * To get all columns for a row, it seeks to `row_key_bytes + SEPARATOR`.
+    * To get specific columns, it can either iterate and filter or seek to `row_key_bytes + SEPARATOR + column_name_bytes + SEPARATOR`.
+    * It collects up to `num_versions` for each requested column, respecting the (optional) time range.
+* **`delete_row`:** Also uses iterators to find all keys matching the criteria (entire row, specific columns, or even specific versions) and deletes them using a `WriteBatch`.

From b0ed7403291c8758bdde5de2db181eb28fe79706 Mon Sep 17 00:00:00 2001
From: dzlab
Date: Sun, 11 May 2025 12:39:59 -0700
Subject: [PATCH 4/6] sections for reflection

---
 _posts/2025-05-11.md | 36 +++++++++++++++++-------------------
 1 file changed, 17 insertions(+), 19 deletions(-)

diff --git a/_posts/2025-05-11.md b/_posts/2025-05-11.md
index e9b7061..cac174c 100644
--- a/_posts/2025-05-11.md
+++ b/_posts/2025-05-11.md
@@ -94,31 +94,29 @@ Here are some key points from this implementation:
 
 ### Unlocking Wide Column Features
 
-With this key structure, several wide column features become quite natural:
+With the chosen key structure in our implementation, several wide column features become quite natural:
 
-* **Versioned Values:** Automatically handled by including the timestamp in the key. Each update (even an "overwrite" of a conceptual column) with a new timestamp creates a new, distinct entry in RocksDB.
-* **Time Range Queries:** The `get_row` method can filter versions based on `start_timestamp` and `end_timestamp` by examining the decoded timestamp from the key.
-* **Out-of-Order Updates:** Clients can provide their own timestamps for data, allowing for backfills or event-time recording.
-* **TTL (Time-to-Live):**
-    * **Read-time enforcement:** When reading, check `key_timestamp + configured_ttl < current_timestamp`. If expired, don't return it.
-    * **Physical deletion:** This is trickier for a simple implementation. RocksDB's compactions will eventually remove deleted data. A more advanced system might use RocksDB's compaction filters or a background process to scan and delete expired keys.
+* **Versioned Values:** Automatically handled by including the timestamp in the key. Each update (even an "overwrite" of a conceptual column) with a new timestamp creates a new, distinct entry in RocksDB.
+* **Time Range Queries:** The `get_row` method can filter versions based on `start_timestamp` and `end_timestamp` by examining the decoded timestamp from the key.
+* **Out-of-Order Updates:** Clients can provide their own timestamps for data, allowing for backfills or event-time recording.
+* **TTL (Time-to-Live):**
+    * **Read-time enforcement:** When reading, check `key_timestamp + configured_ttl < current_timestamp`. If expired, don't return it.
+    * **Physical deletion:** This is trickier for a simple implementation. RocksDB's compactions will eventually remove deleted data. A more advanced system might use RocksDB's compaction filters or a background process to scan and delete expired keys.
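+
+As a quick illustration, here is a hypothetical usage sketch of these features with the `WideColumnDB` class (the method names follow the key points above, but the exact signatures are assumptions; check the KVWC repository for the real API):
+
+```python
+from kvwc.wide_column_db import WideColumnDB  # import path assumed from the repo layout
+
+db = WideColumnDB("/tmp/kvwc-demo")
+
+# Versioned values: two writes to the same column create two distinct entries.
+db.put_row("user_events", "user123", [("last_login_event", "web")])
+db.put_row("user_events", "user123", [("last_login_event", "mobile")])
+
+# Out-of-order updates: backfill an older event with an explicit timestamp.
+db.put_row("user_events", "user123", [("last_login_event", "ios", 1678886400000)])
+
+# Time range query: newest versions first, up to num_versions per column.
+rows = db.get_row("user_events", "user123",
+                  column_names=["last_login_event"],
+                  num_versions=10,
+                  start_timestamp=1678880000000)
+```
+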
-### Simplicity's Trade-offs
-
-This Python implementation is a great learning tool and can be surprisingly useful for smaller-scale applications:
+### What's Next?
 
-* **Single Instance:** It's not distributed, so no built-in replication, sharding, or high availability like Pinterest's Rockstorewidecolumn.
-* **Basic Compaction:** Relies on RocksDB's default compaction unless you delve into advanced configurations or custom filters for TTL.
-* **Pagination:** The `get_row` example above doesn't include pagination for very wide rows (many columns). This would require returning a "continuation token" (e.g., the last key part processed) for the client to pass in the next request.
+Despite the trade-offs made in our Python implementation, the datastore remains surprisingly useful for smaller-scale applications:
 
-### What's Next?
+* **Single Instance:** It's not distributed, so no built-in replication, sharding, or high availability like Pinterest's Rockstorewidecolumn.
+* **Basic Compaction:** Relies on RocksDB's default compaction unless you delve into advanced configurations or custom filters for TTL.
+* **Pagination:** The `get_row` example above doesn't include pagination for very wide rows (many columns). This would require returning a "continuation token" (e.g., the last key part processed) for the client to pass in the next request.
 
-If you were to expand this:
+Here are a few things to consider if we were to expand this implementation:
 
-* **Dataset/Table Management:** Consider using RocksDB's "Column Families" for better logical separation of different datasets (tables) within a single DB instance.
-* **Advanced TTL:** Implement custom compaction filters or background jobs for efficient TTL enforcement.
-* **Robust Pagination:** Add proper marker-based pagination to `get_row`.
-* **Serialization:** Use a more robust serialization format than plain strings for values (e.g., JSON, MessagePack, Protobuf).
+* **Dataset/Table Management:** Consider using RocksDB's "Column Families" for better logical separation of different datasets (tables) within a single DB instance.
+* **Advanced TTL:** Implement custom compaction filters or background jobs for efficient TTL enforcement.
+* **Robust Pagination:** Add proper marker-based pagination to `get_row`.
+* **Serialization:** Use a more robust serialization format than plain strings for values (e.g., JSON, MessagePack, Protobuf).

From c77fd37a4ba3deeeb776d124ae644bad75b2a35d Mon Sep 17 00:00:00 2001
From: dzlab
Date: Sun, 11 May 2025 12:43:34 -0700
Subject: [PATCH 5/6] added conclusion

---
 _posts/2025-05-11.md | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/_posts/2025-05-11.md b/_posts/2025-05-11.md
index cac174c..a9e6034 100644
--- a/_posts/2025-05-11.md
+++ b/_posts/2025-05-11.md
@@ -118,12 +118,8 @@ Here are a few things to consider if we were to expand this implementation:
 * **Robust Pagination:** Add proper marker-based pagination to `get_row`.
 * **Serialization:** Use a more robust serialization format than plain strings for values (e.g., JSON, MessagePack, Protobuf).
 
-### Conclusion
-
-Building a simplified wide column database on top of RocksDB in Python is an achievable and insightful exercise. By carefully designing your key structure, you can map complex data models onto a high-performance key-value store. While it won't rival enterprise-grade distributed systems, it provides a powerful local storage solution and a deeper understanding of how these systems work under the hood.
-
-Happy coding!
 ## That's all folks
+This article walked through the implementation of a simplified version of Pinterest's Rockstorewidecolumn. We demonstrated that by carefully designing a key structure, we can map complex data models onto a high-performance key-value store like RocksDB.
 
 I hope you enjoyed this article, feel free to leave a comment or reach out on twitter [@bachiirc](https://twitter.com/bachiirc).

From 9c62b29450d335b5abdc6c77fde1b21d63702dc6 Mon Sep 17 00:00:00 2001
From: dzlab
Date: Sun, 11 May 2025 12:44:35 -0700
Subject: [PATCH 6/6] rename article

---
 _posts/{2025-05-11.md => 2025-05-11-wide-column-database.md} | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename _posts/{2025-05-11.md => 2025-05-11-wide-column-database.md} (100%)

diff --git a/_posts/2025-05-11.md b/_posts/2025-05-11-wide-column-database.md
similarity index 100%
rename from _posts/2025-05-11.md
rename to _posts/2025-05-11-wide-column-database.md