Commit 0ca9b98

Fix spellcheck flow (#441)
* move spellcheck to pull request flow
* fixing attempt
* another attempt
* another attempt
* another
* another
* another one
* another one
1 parent 9212165 · commit 0ca9b98

6 files changed: +102 -107 lines

.github/spellcheck-settings.yml

Lines changed: 2 additions & 2 deletions

@@ -23,5 +23,5 @@ matrix:
       - blockquote
       - img
   sources:
-  - '*.md'
-  - 'docs/*.md'
+  - "*.md"
+  - "docs/*.md|!docs/benchmarks_developer.md"

.github/wordlist.txt

Lines changed: 4 additions & 0 deletions

@@ -49,3 +49,7 @@ valgrind
 vecsim
 virtualenv
 whl
+backport
+io
+redis
+hoc

.github/workflows/event-pull_request.yml

Lines changed: 11 additions & 0 deletions

@@ -46,12 +46,23 @@ jobs:
     uses: ./.github/workflows/codeql-analysis.yml
     secrets: inherit
 
+  spellcheck:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Spellcheck
+        uses: rojopolis/spellcheck-github-actions@v0
+        with:
+          config_path: .github/spellcheck-settings.yml
+          task_name: Markdown
+
   pr-validation:
     needs:
       - check-if-docs-only
       - basic-tests
       - coverage
       - codeql-analysis
+      - spellcheck
     runs-on: ubuntu-latest
     if: ${{ !cancelled() }}
     steps:

.github/workflows/spellcheck.yml

Lines changed: 0 additions & 20 deletions
This file was deleted.

SECURITY.md

Lines changed: 1 addition & 1 deletion

@@ -30,5 +30,5 @@ vulnerability, and only if we have a good reason to believe information about
 it is not yet public.
 
 This process involves providing an early notification about the vulnerability,
-its impact and mitigations to a short list of vendors under a time-limited
+its impact and mitigation to a short list of vendors under a time-limited
 embargo on public disclosure.

docs/benchmarks.md

Lines changed: 84 additions & 84 deletions
(Most changed lines in this file differ only in whitespace; they are shown below as unchanged context, with only textual changes marked.)

@@ -3,23 +3,23 @@
 ## Table of contents
 * [Overview](#overview)
 * [Run benchmarks](#run-benchmarks)
 * [Available sets](#available-sets)
   - [BM_VecSimBasics](#bm_vecsimbasics)
   - [BM_BatchIterator](#bm_batchiterator)
   - [BM_VecSimUpdatedIndex](#bm_vecsimupdatedindex)
 * [ann-benchmarks](#ann-benchmark)
 
 # Overview
 We use the [Google benchmark](https://github.com/google/benchmark) tool to run micro-benchmarks for the vector index functionality.
 Google benchmark is a popular tool for benchmarking code snippets, similar to unit tests. It offers a quick way to estimate the runtime of each use case based on several (identical) runs, and prints the results as output. For some tests, the output includes an additional "Recall" metric, which indicates the accuracy in the case of approximate search.
 **The recall** is calculated as the size of the intersection between the ground truth results (calculated by the flat algorithm) and the results returned by the approximate algorithm (HNSW in this case), divided by the number of ground truth results:
 $$ recall = \frac{|\text{approximate algorithm results } \cap
 \text{ ground truth results }|}{|\text{ground truth results}|}
 $$
 # Run benchmarks
 ## Required files
 The serialized index files used for micro-benchmarking and for running ann-benchmark are listed in
 `tests/benchmark/data/hnsw_indices.txt`.
 To download all the required files, run from the repository root directory:
 ```sh
 wget --no-check-certificate -q -i tests/benchmark/data/hnsw_indices/hnsw_indices_all.txt -P tests/benchmark/data
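To make the recall definition above concrete, here is a minimal sketch of the computation (the ID lists are hypothetical, not taken from the benchmark's code):

```python
# Sketch of the recall metric defined above:
# |approximate results ∩ ground truth results| / |ground truth results|.
def recall(approx_ids, ground_truth_ids):
    gt = set(ground_truth_ids)
    return len(set(approx_ids) & gt) / len(gt)

# Hypothetical top-10 result IDs for one query:
ground_truth = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # flat (brute-force) index
approximate  = [1, 2, 3, 4, 5, 6, 7, 8, 11, 12]  # HNSW index
print(recall(approximate, ground_truth))  # 0.8
```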
@@ -29,7 +29,7 @@ To run all test sets, call the following commands from the project root dir:
 make benchmark
 ```
 ### Running a Subset of Benchmarks
 To run only a subset of benchmarks that match a specified `<regex>`, set the `BENCHMARK_FILTER=<regex>` environment variable. For example:
 ```sh
 make benchmark BENCHMARK_FILTER=fp32*
 ```
@@ -38,108 +38,108 @@ make benchmark BENCHMARK_FILTER=fp32*
 There are currently 3 sets of benchmarks available: `BM_VecSimBasics`, `BM_BatchIterator`, and `BM_VecSimUpdatedIndex`. Each is templated according to the index data type. We run 10 iterations of each test case, unless otherwise specified.
 ## BM_VecSimBasics
 For each combination of data type (fp32/fp64), index type (single/multi), and indexing algorithm (flat, HNSW, and tiered-HNSW), the following test cases are included:
 1. Measure total index `memory` (runtime and iteration count are irrelevant for this use case, just the memory metric)
 2. `AddLabel` - runs for `DEFAULT_BLOCK_SIZE (= 1024)` iterations; in each one we add a new label to the index from the `queries` list. Note that for a single-value index each label contains one vector, so the number of new labels equals the number of new vectors.
    **results:** average time per label, average memory addition per vector, vectors per label.
    *At the end of the benchmark, we delete the added labels*
 3. `DeleteLabel` - runs for `DEFAULT_BLOCK_SIZE (= 1024)` iterations; in each one we delete a label from the index. Note that for a single-value index each label contains one vector, so the number of deleted labels equals the number of deleted vectors.
    **results:** average time per label, average memory addition per vector (a negative value means that the memory has decreased).
    *At the end of the benchmark, we restore the deleted vectors under the same labels*
 
 For the tiered-HNSW index, we also run these additional tests:
 
 4. `AddLabel_Async` - the same as the `AddLabel` test, but here we also account for the time it takes the background threads to ingest vectors into HNSW asynchronously (while in the `AddLabel` test for the tiered index we only measure the time until vectors are stored in the flat buffer).
 5. `DeleteLabel_Async` - the same as the `DeleteLabel` test, but here we also account for the time it takes the background threads to repair the HNSW graph after the deletion (while in the `DeleteLabel` test for the tiered index we only measure the time until vectors are marked as deleted). Note that the garbage collector of the tiered index is triggered when at least `swapJobsThreshold` vectors are ready to be evicted (this happens once all of their affected neighbors in the graph have been repaired). We run this test for `swapJobsThreshold` in `{1, 100, 1024(DEFAULT)}`, and we collect two additional metrics: `num_zombies`, the number of vectors that remain to be evicted *after* the test has finished, and `cleanup_time`, the number of milliseconds it took to clean up these zombies.
 
 In both tests, we should only consider the `real_time` metric (rather than `cpu_time`), since `cpu_time` only accounts for the time the main thread is running.
 #### **TopK benchmarks**
 Search for the `k` nearest neighbors of the query.
 6. Run a `Top_K` query over the flat index (using brute-force search), for each of `k=10`, `k=100`, and `k=500`
    **results:** average time per iteration
 7. Run a `Top_K` query over the HNSW index, for each pair of `{ef_runtime, k}` from the following:
    `{ef_runtime=10, k=10}`
    `{ef_runtime=200, k=10}`
    `{ef_runtime=100, k=100}`
    `{ef_runtime=200, k=100}`
    `{ef_runtime=500, k=500}`
    **results:** average time per iteration, recall
 8. Run a `Top_K` query over the tiered-HNSW index (in parallel from several threads), for each pair of `{ef_runtime, k}` as in the previous test (assuming all vectors are indexed into the HNSW graph). We run for `50` iterations to get a better sense of the parallelism that can be achieved. Here as well, we should only consider the `real_time` measurement rather than `cpu_time`.
 #### **Range query benchmarks**
 In a range query, we search for all the vectors in the index whose distance to the query vector is lower than the range.
 9. Run a `Range` query over the flat index (using brute-force search), for each of `radius=0.2`, `radius=0.35`, and `radius=0.5`
    **results:** average time per iteration, average number of results per iteration
 10. Run a `Range` query over the HNSW index, for each pair of `{radius, epsilon}` from the following:
    `{radius=0.2, epsilon=0.001}`
    `{radius=0.2, epsilon=0.01}`
    `{radius=0.2, epsilon=0.1}`
    `{radius=0.35, epsilon=0.001}`
    `{radius=0.35, epsilon=0.01}`
    `{radius=0.35, epsilon=0.1}`
    `{radius=0.5, epsilon=0.001}`
    `{radius=0.5, epsilon=0.01}`
    `{radius=0.5, epsilon=0.1}`
    **results:** average time per iteration, average number of results per iteration, recall
 
 ## BM_BatchIterator
 The purpose of these tests is to benchmark the batch iterator functionality. The batch iterator is a handle that enables running a `Top_K` query in batches, by repeatedly asking for the next best `n` results until there are no more results to return. For these test cases we use the same indices that were built for the basic benchmarks.
 The test cases are:
 1. Fixed batch size - Run a `Top_K` query for each pair of `{batch size, number of batches}` from the following:
    `{batch size=10, number of batches=1}`
    `{batch size=10, number of batches=3}`
    `{batch size=10, number of batches=5}`
    `{batch size=100, number of batches=1}`
    `{batch size=100, number of batches=3}`
    `{batch size=100, number of batches=5}`
    `{batch size=1000, number of batches=1}`
    `{batch size=1000, number of batches=3}`
    `{batch size=1000, number of batches=5}`
    **Flat index results:** Time per iteration, memory delta per iteration
    **HNSW index results:** Time per iteration, Recall, memory delta per iteration
 2. Variable batch size - Run a `Top_K` query where in each iteration the batch size is increased by a factor of 2, for each pair of `{batch initial size, number of batches}` from the following:
    `{batch initial size=10, number of batches=2}`
    `{batch initial size=10, number of batches=4}`
    `{batch initial size=100, number of batches=2}`
    `{batch initial size=100, number of batches=4}`
    `{batch initial size=1000, number of batches=2}`
    `{batch initial size=1000, number of batches=4}`
    **Flat index results:** Time per iteration
    **HNSW index results:** Time per iteration, Recall, memory delta per iteration
-3. Batches to Adhoc BF - In each iteration we run `Top_K` with an increasing `batch size` (initial size=10, increase factor=2) for `number of batches`, and then switch to ad-hoc BF. We define `step` as the ratio between the index size and the number of vectors to go over in ad-hoc BF. The tests run for each pair of `{step, number of batches}` from the following:
+3. Batches to Ad-hoc BF - In each iteration we run `Top_K` with an increasing `batch size` (initial size=10, increase factor=2) for `number of batches`, and then switch to ad-hoc BF. We define `step` as the ratio between the index size and the number of vectors to go over in ad-hoc BF. The tests run for each pair of `{step, number of batches}` from the following:
    `{step=5, number of batches=0}`
    `{step=5, number of batches=2}`
    `{step=5, number of batches=5}`
    `{step=10, number of batches=0}`
    `{step=10, number of batches=2}`
    `{step=10, number of batches=5}`
    `{step=20, number of batches=0}`
    `{step=20, number of batches=2}`
    `{step=20, number of batches=5}`
    **Flat index results:** Time per iteration
    **HNSW index results:** Time per iteration, memory delta per iteration
 
 ## BM_VecSimUpdatedIndex
 For this use case, we create two indices for each algorithm (flat and HNSW). The first index contains 500K vectors added to an empty index. The other index also contains 500K vectors, but these were added by overriding the 500K vectors of the aforementioned indices. Currently, we only test this for the FP32 single-value index.
 The test cases are:
 1. Index `total memory` **before** updating
 2. Index `total memory` **after** updating
 3. Run a `Top_K` query over the flat index **before** updating (using brute-force search), for each of `k=10`, `k=100`, and `k=500`
    **results:** average time per iteration
 4. Run a `Top_K` query over the flat index **after** updating (using brute-force search), for each of `k=10`, `k=100`, and `k=500`
    **results:** average time per iteration
 5. Run **100** iterations of a `Top_K` query over the HNSW index **before** updating, for each pair of `{ef_runtime, k}` from the following:
    `{ef_runtime=10, k=10}`
    `{ef_runtime=200, k=10}`
    `{ef_runtime=100, k=100}`
    `{ef_runtime=200, k=100}`
    `{ef_runtime=500, k=500}`
    **results:** average time per iteration, recall
 6. Run **100** iterations of a `Top_K` query over the HNSW index **after** updating, for each pair of `{ef_runtime, k}` from the following:
    `{ef_runtime=10, k=10}`
    `{ef_runtime=200, k=10}`
    `{ef_runtime=100, k=100}`
    `{ef_runtime=200, k=100}`
    `{ef_runtime=500, k=500}`
    **results:** average time per iteration, recall
 
 # ANN-Benchmark
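As an illustration of the batch-iterator pattern exercised by `BM_BatchIterator` in the diff above, here is a schematic consumption loop (the method names are hypothetical placeholders, not the library's actual API):

```python
# Schematic batch-iterator loop: repeatedly ask the handle for the next-best
# `batch_size` results until it is exhausted. All names are hypothetical.
def consume_in_batches(index, query, batch_size):
    results = []
    it = index.create_batch_iterator(query)   # hypothetical iterator handle
    while it.has_next():                      # more results still available?
        results.extend(it.get_next_results(batch_size))
    return results
```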

@@ -153,7 +153,7 @@ The `bm_dataset.py` script uses some of ANN-Benchmark datasets to measure this l
 5. mnist-784-euclidean
 6. sift-128-euclidean
 
 For each dataset, the script will build an HNSW index with pre-defined build parameters and persist it to a local file in the `./data` directory, which will be generated (example index file name: `glove-25-angular-M=16-ef=100.hnsw`). Note that if the file already exists in this path, the index will be loaded from it instead of being rebuilt.
 To download the serialized indices, run from the project's root directory:
 ```sh
 wget -q -i tests/benchmark/data/hnsw_indices_all/hnsw_indices_ann.txt -P tests/benchmark/data
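The load-or-build behavior described above is a simple caching pattern; a minimal sketch, with `build_hnsw_index`, `load_hnsw_index`, and `index.save` as hypothetical stand-ins for the script's logic:

```python
import os

# Sketch of the load-or-build caching described above. The helper functions
# and the save method are hypothetical placeholders, not bm_datasets.py code.
def get_index(dataset, M, ef_construction, data_dir="./data"):
    path = os.path.join(data_dir, f"{dataset}-M={M}-ef={ef_construction}.hnsw")
    if os.path.exists(path):
        return load_hnsw_index(path)    # reuse the persisted index
    index = build_hnsw_index(dataset, M, ef_construction)
    index.save(path)                    # persist for future runs
    return index
```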
@@ -166,4 +166,4 @@ Then, for 3 different pre-defined `ef_runtime` values, 1000 `Top_K` queries will
 To reproduce this benchmark, first install the project's python bindings and then invoke the script. From the project's root directory, run:
 ```py
 python3 tests/benchmark/bm_datasets.py
 ```
