Commit c4ba63e

made small edits to ChatQnA README.md file (#61)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent b68f385 commit c4ba63e

File tree

1 file changed: +6 −8 lines changed


ChatQnA/README.md

Lines changed: 6 additions & 8 deletions
@@ -126,17 +126,17 @@ cd ../../
 > [!NOTE]
 > If you modified any files and want that change introduced in this step, add `--build` to the end of the command to build the container image instead of pulling it from dockerhub.

-## Ingest data into redis
+## Ingest data into Redis

-After every time of redis container is launched, data should be ingested in the container ingestion steps:
+Each time the Redis container is launched, data should be ingested into the container using the commands:

 ```bash
 docker exec -it qna-rag-redis-server bash
 cd /ws
 python ingest.py
 ```

-Note: `ingest.py` will download the embedding model, please set the proxy if necessary.
+Note: `ingest.py` will download the embedding model. Please set the proxy if necessary.

 # Start LangChain Server
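If the embedding model download has to go through a proxy, a minimal sketch of the ingestion step with proxy variables set (the proxy address below is a placeholder, not something defined in this repository) looks like:

```bash
# Inside the qna-rag-redis-server container; the proxy endpoint is a placeholder.
docker exec -it qna-rag-redis-server bash
export http_proxy=http://proxy.example.com:8080
export https_proxy=http://proxy.example.com:8080
cd /ws
python ingest.py   # downloads the embedding model, then ingests data into Redis
```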

@@ -169,7 +169,7 @@ docker exec -it qna-rag-redis-server bash
 nohup python app/server.py &
 ```

-The LangChain backend service listens to port 8000 by port, you can customize it by change the code in `docker/qna-app/app/server.py`.
+The LangChain backend service listens to port 8000, you can customize it by changing the code in `docker/qna-app/app/server.py`.

 And then you can make requests like below to check the LangChain backend service status:
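A quick way to confirm the backend is reachable on the default port 8000 (the path below is only an assumption; the actual routes are defined in `docker/qna-app/app/server.py`):

```bash
# Any HTTP response (even a 404) indicates the server process is up and listening.
# Port 8000 is the default; adjust if server.py was changed.
curl -v http://127.0.0.1:8000/
```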

@@ -227,7 +227,7 @@ This will initiate the frontend service and launch the application.

 # Enable TGI Gaudi FP8 for higher throughput (Optional)

-The TGI Gaudi utilizes BFLOAT16 optimization as the default setting. If you aim to achieve higher throughput, you can enable FP8 quantization on the TGI Gaudi. According to our test results, FP8 quantization yields approximately a 1.8x performance gain compared to BFLOAT16. Please follow the below steps to enable FP8 quantization.
+The TGI Gaudi utilizes BFLOAT16 optimization as the default setting. If you aim to achieve higher throughput, you can enable FP8 quantization on the TGI Gaudi. Note that currently only Llama2 series and Mistral series models support FP8 quantization. Please follow the below steps to enable FP8 quantization.

 ## Prepare Metadata for FP8 Quantization

@@ -257,9 +257,7 @@ Then modify the `dump_stats_path` to "/data/hqt_output/measure" and update `dump
 docker run -p 8080:80 -e QUANT_CONFIG=/data/maxabs_quant.json -v $volume:/data --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:1.2.1 --model-id Intel/neural-chat-7b-v3-3
 ```

-Now the TGI Gaudi will launch the FP8 model by default. Please note that currently only Llama2 series and Mistral series models support FP8 quantization.
-
-And then you can make requests like below to check the service status:
+Now the TGI Gaudi will launch the FP8 model by default and you can make requests like below to check the service status:

 ```bash
 curl 127.0.0.1:8080/generate \
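The hunk ends mid-command; a complete request of that shape, following TGI's standard `/generate` API (the prompt and parameters below are illustrative), would look like:

```bash
# Standard TGI /generate request; prompt text and generation parameters are illustrative.
curl 127.0.0.1:8080/generate \
  -X POST \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":32}}' \
  -H 'Content-Type: application/json'
```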
