Commit 75510df

authored
first commit
1 parent 196ad92 commit 75510df

1 file changed

+20
-18
lines changed

README.md

Lines changed: 20 additions & 18 deletions
Original file line number | Diff line number | Diff line change
@@ -41,7 +41,6 @@ table value 0 -> Cuda Out Of Memory
4141
```
4242

4343

44-
4544
## Introduction
4645

4746
This project uses tensorflow-java to accelerate computation on GPUs in a JVM environment and exposes the result as a service API.
@@ -51,37 +50,37 @@ Cosine Similarity has many uses, as it is typically computed over the eigenvalue
5150

5251
### ANN method has the following problems.
5352

54-
- It requires an initial build time for the vectors to be computed for similarity.
55-
- When extracting the top k similarities, the approximation performance decreases for values of k above a certain level.
56-
- Approximation performance decreases significantly as the dimensionality of the vector increases, i.e., if it is more than 100 to 256 dimensions when using an ANN, proper dimensionality reduction is required.**
53+
- **It requires an initial build time for the vectors to be computed for similarity.**
54+
- **When extracting the top k similarities, the approximation performance decreases for values of k above a certain level.**
55+
- **Approximation performance decreases significantly as the dimensionality of the vector increases, i.e., if it is more than 100 to 256 dimensions when using an ANN, proper dimensionality reduction is required.**
5756
- **When there are fewer than 100,000 vectors, ANNs do not offer a significant computational performance gain because of their structure.**
5857
- **This means that if you are dealing with relatively high dimensional vectors and need a large top k, or if there are not enough target vectors, ANNs will be less useful.**
5958

6059
### In this project, we address the problem as follows.
6160

62-
- **The target vectors are fixed in constant memory on the GPU by dynamically generating a model graph.
63-
- The inner operations, represented by metric operations, are dynamically batched with tensor operations in the Tensorflow Graph to process many at once.
64-
- Pre-processes and stores and loads L2norm and Transpose operations in advance to avoid unnecessary runtime operations.
65-
- **Implement Dynamic Batch through akka-http, akka-stream, and asynchronous processing of Akka Http to process hundreds to thousands of requests simultaneously.
61+
- **The target vectors are fixed in constant memory on the GPU by dynamically generating a model graph.**
62+
- **The inner operations, represented by metric operations, are dynamically batched with tensor operations in the Tensorflow Graph to process many at once.**
63+
- **Pre-processes and stores and loads L2norm and Transpose operations in advance to avoid unnecessary runtime operations.**
64+
- **Implement Dynamic Batch through akka-http, akka-stream, and asynchronous processing of Akka Http to process hundreds to thousands of requests simultaneously.**
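
The approach in the list above can be sketched in NumPy. This is only an illustration of the idea, not the project's TensorFlow graph: the function names `prepare_targets` and `batched_top_k` are hypothetical, and plain CPU arrays stand in for constants held in GPU memory. Target vectors are L2-normalized and transposed once up front, so each batch of queries needs a single matrix multiplication followed by an exact top-k selection.

```python
import numpy as np


def prepare_targets(targets: np.ndarray) -> np.ndarray:
    """L2-normalize each target row once and return the transpose (dim x n).

    Hypothetical helper: stands in for the precomputed L2norm and Transpose step.
    """
    norms = np.linalg.norm(targets, axis=1, keepdims=True)
    normalized = targets / np.maximum(norms, 1e-12)  # guard against zero vectors
    return np.ascontiguousarray(normalized.T.astype(np.float32))


def batched_top_k(queries: np.ndarray, targets_t: np.ndarray, k: int):
    """Exact cosine-similarity top-k for a whole batch of queries at once."""
    q_norms = np.linalg.norm(queries, axis=1, keepdims=True)
    q = (queries / np.maximum(q_norms, 1e-12)).astype(np.float32)
    scores = q @ targets_t                        # shape: (batch, n_targets)
    top_idx = np.argsort(-scores, axis=1)[:, :k]  # full sort, so no approximation
    top_scores = np.take_along_axis(scores, top_idx, axis=1)
    return top_idx, top_scores


# Usage with made-up data: 1000 target vectors of 64 dimensions, 2 queries.
rng = np.random.default_rng(0)
targets = rng.standard_normal((1000, 64)).astype(np.float32)
targets_t = prepare_targets(targets)
indices, similarities = batched_top_k(targets[[3, 10]], targets_t, k=5)
```

Because every score is computed and fully sorted, recall is 1 by construction; the cost is one `(batch, n_targets)` score matrix per batch.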
6665

6766
### Achieve the following performance and advantages over traditional best practices and SOTA.
6867

69-
- **Gain approximately 55 to 65% request per second (RPS) without sacrificing recall compared to SOTA (ScaNN, 0.9876) for http://ann-benchmarks.com.** **Gain approximately 55 to 65% RPS compared to SOTA (ScaNN, 0.9876) for http://ann-benchmarks.com.
68+
- **Gain approximately 55 to 65% request per second (RPS) without sacrificing recall compared to SOTA (ScaNN, 0.9876) for http://ann-benchmarks.com.**
7069
- **Loads in less than 2 seconds versus SOTA (ScaNN, 182 seconds) on the glove-100-angular benchmark dataset and spins up servers in less than 5 seconds when deployed.**
7170
- **For roughly 100,000 target vectors, throughput ranges from about 4000 requests per second (RPS) at 100 dimensions down to about 260 RPS at 2048 dimensions.**
72-
- Target vectors can be loaded as npy files via python's numpy format.
73-
- It uses the tensorflow runtime which is built for multiple environments, so it can be easily used on linux, windows, mac, etc.
71+
- **Target vectors can be loaded as npy files via python's numpy format.**
72+
- **It uses the tensorflow runtime which is built for multiple environments, so it can be easily used on linux, windows, mac, etc.**
7473
- **We recommend this example for relatively small production environments where throughput and latency matter and a simpler deployment pipeline is wanted without reducing recall.**
7574

7675
### Caveats.
77-
- **Comparison with ann-benchmarks is a lossless calculation with a Recall of 1 and measured with end2end of the REST API, not batch library calls.** **Comparison with ann-benchmarks is a lossless calculation with a Recall of 1.
78-
- Comparisons to ann-benchmarks are not a fair comparison. ann-benchmarks were measured on a CPU r5.4xlarge on AWS, which is a very different environment than the GPU in the current example.**
79-
- Numerical errors may be caused by implicit GEMM algorithm changes due to the behavior of cublas' MatmulAlgoGetHeuristic in dynamic batch situations.
80-
- **The maximum available Dynamic Batch size depends on the specifications of the GPU memory. In general, giving it as large a value as your memory allows will result in higher RPS performance.
76+
- **Comparison with ann-benchmarks is a lossless calculation with a Recall of 1 and measured with end2end of the REST API, not batch library calls.**
77+
- **Comparisons to ann-benchmarks are not a fair comparison. ann-benchmarks were measured on a CPU r5.4xlarge on AWS, which is a very different environment than the GPU in the current example.**
78+
- **Numerical errors may be caused by implicit GEMM algorithm changes due to the behavior of cublas' MatmulAlgoGetHeuristic in dynamic batch situations.**
79+
- **The maximum available Dynamic Batch size depends on the specifications of the GPU memory. In general, giving it as large a value as your memory allows will result in higher RPS performance.**
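
To make the last caveat concrete, here is a rough back-of-the-envelope estimate of how batch size relates to memory. It rests on a loud assumption that is not taken from the project: only the float32 score matrix (`batch x targets`) and the query batch (`batch x dim`) are counted, while top-k buffers, cuBLAS workspace, and runtime overhead are ignored. The function names are hypothetical, and real limits will be lower than this estimate.

```python
def score_matrix_bytes(batch_size: int, num_targets: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for one (batch_size x num_targets) score matrix (float32 by default)."""
    return batch_size * num_targets * bytes_per_value


def max_batch_size(free_bytes: int, num_targets: int, dim: int, bytes_per_value: int = 4) -> int:
    """Upper bound on batch size under the simplifying assumption above.

    Each query costs one row of scores plus its own vector; everything else is ignored.
    """
    per_query = (num_targets + dim) * bytes_per_value
    return free_bytes // per_query


# Example: 100,000 targets and a batch of 1,000 queries need 400,000,000 bytes
# (about 400 MB) for the float32 score matrix alone.
needed = score_matrix_bytes(batch_size=1000, num_targets=100_000)
```

In practice the usable value is best found by measuring on the target GPU and treating a figure like this only as a starting point.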
8180

8281

8382
## Default Configuration
84-
- Minimal code**, **Minimal dependencies**.
83+
- **Minimal code**, **Minimal dependencies**.
8584
- Use **Tensorflow-java-gpu** as the Serving Runtime
8685
- Configure the REST API via **AKKA-HTTP**.
8786
- Implement dynamic batching via **akka-stream**.
@@ -90,8 +89,8 @@ Cosine Similarity has many uses, as it is typically computed over the eigenvalue
9089

9190
## docker
9291
```
93-
docker build . -f Dockerfile -t akka:0.1
94-
docker run --gpus all -p 8080:8080 akka:0.1
92+
docker build . -f Dockerfile -t flasma:0.1
93+
docker run --gpus all -p 8080:8080 flasma:0.1
9594
```
9695

9796
## local build & run
@@ -140,3 +139,6 @@ print(c.dtype) #float32
140139
141140
np.save(f"./{item}-{dim}",c)
142141
```
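
A minimal round-trip sketch of the save step above, for checking a file before handing it to the server. The values of `item` and `dim` are made up, and treating `item` as the number of vectors is an assumption. Note that `np.save` appends `.npy` when the file name does not already end with it.

```python
import os
import tempfile

import numpy as np

item, dim = 1000, 100  # hypothetical values: 1000 vectors of 100 dimensions
c = np.random.default_rng(0).random((item, dim), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, f"{item}-{dim}")
    np.save(path, c)                 # written to disk as "<path>.npy"
    loaded = np.load(path + ".npy")  # read back fully into memory

print(loaded.dtype, loaded.shape)
```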
142+
143+
np.save(f"./{item}-{dim}",c)
144+
```
