Skip to content

Commit 8bd14d1

update and polish llama2 trtllm_guide.md (#129)
1 parent 92574e6 commit 8bd14d1


Popular_Models_Guide/Llama2/trtllm_guide.md

Lines changed: 12 additions & 10 deletions
@@ -217,7 +217,7 @@ the Triton container. Simply follow the next steps:
 ```bash
 HF_LLAMA_MODEL=/Llama-2-7b-hf
 UNIFIED_CKPT_PATH=/tmp/ckpt/llama/7b/
-ENGINE_DIR=/engines
+ENGINE_DIR=/engines/llama-2-7b/1-gpu/
 CONVERT_CHKPT_SCRIPT=/tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py
 python3 ${CONVERT_CHKPT_SCRIPT} --model_dir ${HF_LLAMA_MODEL} --output_dir ${UNIFIED_CKPT_PATH} --dtype float16
 trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
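Not part of the commit, but a quick sanity check fits naturally after this hunk: once `trtllm-build` finishes, the engine directory should hold the serialized engine plus its build config. The exact file names below are an assumption based on typical TensorRT-LLM builds.

```bash
# Hypothetical sanity check (not in the diff): inspect the build output.
# Assumption: trtllm-build writes rank*.engine and config.json into ENGINE_DIR.
ls ${ENGINE_DIR}
```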
@@ -233,13 +233,12 @@ trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
 > located in the same llama examples folder.
 >
 > ```bash
-> python3 /tensorrtllm_backend/tensorrt_llm/examples/run.py --engine_dir=/engines/1-gpu/ --max_output_len 50 --tokenizer_dir /Llama-2-7b-hf --input_text "What is ML?"
+> python3 /tensorrtllm_backend/tensorrt_llm/examples/run.py --engine_dir=/engines/llama-2-7b/1-gpu/ --max_output_len 50 --tokenizer_dir /Llama-2-7b-hf --input_text "What is ML?"
 > ```
 > You should expect the following response:
 > ```
-> [TensorRT-LLM] TensorRT-LLM version: 0.9.0
+> [TensorRT-LLM] TensorRT-LLM version: 0.17.0.post1
 > ...
-> [TensorRT-LLM][INFO] Max KV cache pages per sequence: 1
 > Input [Text 0]: "<s> What is ML?"
 > Output [Text 0 Beam 0]: "
 > ML is a branch of AI that allows computers to learn from data, identify patterns, and make predictions. It is a powerful tool that can be used in a variety of industries, including healthcare, finance, and transportation."
@@ -269,19 +268,20 @@ Note: `TRITON_BACKEND` has two possible options: `tensorrtllm` and `python`. If
 # preprocessing
 TOKENIZER_DIR=/Llama-2-7b-hf/
 TOKENIZER_TYPE=auto
-ENGINE_DIR=/engines
+ENGINE_DIR=/engines/llama-2-7b/1-gpu/
 DECOUPLED_MODE=false
 MODEL_FOLDER=/opt/tritonserver/inflight_batcher_llm
 MAX_BATCH_SIZE=4
 INSTANCE_COUNT=1
 MAX_QUEUE_DELAY_MS=10000
 TRITON_BACKEND=tensorrtllm
+LOGITS_DATATYPE="TYPE_FP32"
 FILL_TEMPLATE_SCRIPT=/tensorrtllm_backend/tools/fill_template.py
 python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
 python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT}
-python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT}
-python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE}
-python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:${TRITON_BACKEND},triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching
+python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT},logits_datatype:${LOGITS_DATATYPE}
+python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},logits_datatype:${LOGITS_DATATYPE}
+python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:${TRITON_BACKEND},triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,encoder_input_features_data_type:TYPE_FP16,logits_datatype:${LOGITS_DATATYPE}
 ```
 
 3. Launch Tritonserver
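As an aside to the template-filling step in the hunk above (not part of the commit), one hedged way to confirm the new `logits_datatype` value actually reached the generated configs, assuming `fill_template.py` substitutes the placeholders in place:

```bash
# Hypothetical check (not in the diff): the filled configs should now mention
# TYPE_FP32 wherever logits_datatype was templated.
grep -n "TYPE_FP32" ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt \
  ${MODEL_FOLDER}/ensemble/config.pbtxt \
  ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt
```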
@@ -290,6 +290,7 @@ Use the [launch_triton_server.py](https://github.com/triton-inference-server/ten
 ```bash
 python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=<world size of the engine> --model_repo=/opt/tritonserver/inflight_batcher_llm
 ```
+`<world size of the engine>` is the number of GPUs you want to use to run the engine. Set it to 1 for single GPU deployment.
 > You should expect the following response:
 > ```
 > ...
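For the single-GPU engine built earlier in this guide, the note added in this hunk translates to the following concrete invocation (an illustration, not part of the diff):

```bash
# Example (assumption): the llama-2-7b engine above was built for one GPU,
# so the world size is 1.
python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=1 --model_repo=/opt/tritonserver/inflight_batcher_llm
```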
@@ -302,6 +303,7 @@ To stop Triton Server inside the container, run:
 ```bash
 pkill tritonserver
 ```
+Note: if launching Tritonserver fails for any reason, do not forget to run the command above to stop Triton Server before retrying. Otherwise, leftover processes can cause OOM or MPI issues.
 
 ### Send an inference request
 
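A hedged way to verify the cleanup suggested by the note above (not part of the commit), assuming standard Linux process tools are available in the container:

```bash
# Hypothetical check (not in the diff): confirm nothing is left running
# before relaunching Triton.
pgrep -a tritonserver || echo "no tritonserver processes remaining"
```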
@@ -335,14 +337,14 @@ curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What
 ```
 > You should expect the following response:
 > ```
-> {"context_logits":0.0,...,"text_output":"What is ML?\nML is a branch of AI that allows computers to learn from data, identify patterns, and make predictions. It is a powerful tool that can be used in a variety of industries, including healthcare, finance, and transportation."}
+> {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"What is ML?\nML is a branch of AI that allows computers to learn from data, identify patterns, and make predictions. It is a powerful tool that can be used in a variety of industries, including healthcare, finance, and transportation."}
 > ```
 
 ### Evaluating performance with Gen-AI Perf
 Gen-AI Perf is a command line tool for measuring the throughput and latency of generative AI models as served through an inference server.
 You can read more about Gen-AI Perf [here](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/client/src/c%2B%2B/perf_analyzer/genai-perf/README.html).
 
-To use Gen-AI Perf, run the following command in the same Triton docker container:
+To use Gen-AI Perf, run the following command in the same Triton docker container (i.e. nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk):
 ```bash
 genai-perf \
 profile \
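Going back to the generate request shown at the top of this hunk: if only the generated text is of interest, one possible follow-up (not part of the commit; `jq` availability and the exact request fields are assumptions) is to filter the JSON response:

```bash
# Hypothetical helper (not in the diff): extract only text_output from the
# /generate response. Assumes jq is installed and the request fields match
# the guide's example.
curl -s -X POST localhost:8000/v2/models/ensemble/generate \
  -d '{"text_input": "What is ML?", "max_tokens": 50, "bad_words": "", "stop_words": ""}' \
  | jq -r '.text_output'
```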
