Skip to content
This repository was archived by the owner on Oct 10, 2025. It is now read-only.

Conversation

1amageek
Copy link

@1amageek 1amageek commented Oct 4, 2025

Description

Fixes #6045

Fixes a critical bug where OverflowFile::checkpoint() unconditionally allocated a header page even when no data had been written, causing PrimaryKeyIndexStorageInfo corruption and database reopen failures.

Problem

When creating a VectorIndex without inserting any data, the database checkpoint completes successfully but corrupts the metadata. Reopening the database fails with an assertion error in hash_index.cpp:487:

KU_ASSERT(hashIndexStorageInfo.overflowHeaderPage == INVALID_PAGE_IDX);

Minimal Reproduction

auto db = std::make_unique<Database>("test.db");
auto conn = std::make_unique<Connection>(db.get());

conn->query("CREATE NODE TABLE Item(id STRING PRIMARY KEY, embedding FLOAT[3])");
conn->query("CALL CREATE_VECTOR_INDEX('Item', 'item_idx', 'embedding', metric := 'l2')");

conn.reset();
db.reset();

auto db2 = std::make_unique<Database>("test.db");  // ❌ ASSERTION FAILURE

Root Cause

In src/storage/overflow_file.cpp:236, OverflowFile::checkpoint() was unconditionally allocating a page even when no data had been written:

if (headerPageIdx == INVALID_PAGE_IDX) {
    this->headerPageIdx = getNewPageIdx(&pageAllocator);  // ❌ Allocates even when empty
    headerChanged = true;
}

Sequence of events:

  1. VectorIndex creation creates a PrimaryKeyIndex (for STRING primary key)
  2. PrimaryKeyIndex creates an OverflowFile (for strings >12 bytes) with headerPageIdx = INVALID_PAGE_IDX
  3. During checkpoint, OverflowFile::checkpoint() allocates a page unnecessarily
  4. This sets PrimaryKeyIndexStorageInfo.overflowHeaderPage = 1 (should be INVALID_PAGE_IDX)
  5. Corrupted metadata is serialized to disk
  6. On database reopen, assertion fails

Solution

Skip checkpoint when headerChanged == false, following the same design pattern as NodeTable::checkpoint() and RelTable::checkpoint():

void OverflowFile::checkpoint(PageAllocator& pageAllocator) {
    KU_ASSERT(fileHandle);
    // Skip checkpoint if no data has been written
    if (!headerChanged) {
        return;
    }
    if (headerPageIdx == INVALID_PAGE_IDX) {
        this->headerPageIdx = getNewPageIdx(&pageAllocator);
    }
    // ... rest of checkpoint logic
}

The headerChanged flag is only set to true when actual string data (>12 bytes) is written via OverflowFileHandle::setStringOverflow().

Benefits

  1. ✅ Eliminates metadata corruption
  2. ✅ Consistent with system-wide checkpoint design pattern
  3. ✅ Reduces unnecessary disk I/O for empty overflow files
  4. ✅ Fixes VectorIndex creation without data insertion

Testing

Added comprehensive test suite in test/storage/overflow_file_checkpoint_test.cpp with 5 test cases:

  1. InMemOverflowFileAlwaysAllocatesHeader - Verifies in-memory behavior
  2. ShortStringsDoNotTriggerOverflow - Verifies strings ≤12 bytes are inlined
  3. LongStringsDoTriggerOverflow - Verifies strings >12 bytes use overflow
  4. EmptyOverflowFileHeaderNotChanged - Documents the core bug fix
  5. VectorIndexCreationSequence - Documents the bug scenario

All tests pass:

[==========] Running 5 tests from 1 test suite.
[  PASSED  ] 5 tests.

Files Changed

  • src/storage/overflow_file.cpp - Added early return when headerChanged == false
  • test/storage/CMakeLists.txt - Added new test target
  • test/storage/overflow_file_checkpoint_test.cpp - New test file

Impact

This fix resolves crashes when:

  • Creating a VectorIndex without inserting data
  • Using tables with STRING primary keys and no long strings (≤12 bytes)
  • Working with empty databases that have indexes

Fixes a bug where OverflowFile::checkpoint() unconditionally allocated a header page even when no data had been written, causing PrimaryKeyIndexStorageInfo corruption and database reopen failures.

**Root cause:**
When creating a VectorIndex without inserting data, OverflowFile::checkpoint() allocated a page unnecessarily, setting overflowHeaderPage to a valid page index instead of INVALID_PAGE_IDX.

**Fix:**
Skip checkpoint when headerChanged == false, following the same design pattern as NodeTable and RelTable. The headerChanged flag is only set to true when actual string data (>12 bytes) is written via OverflowFileHandle::setStringOverflow().

**Test coverage:**
Added overflow_file_checkpoint_test.cpp with 5 test cases documenting the bug scenario and verifying correct behavior.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] OverflowFile::checkpoint() corrupts PrimaryKeyIndexStorageInfo when no data has been written

1 participant