> [TensorRT-LLM][INFO] Max KV cache pages per sequence: 1
> Input [Text 0]: "<s> What is ML?"
> Output [Text 0 Beam 0]: "
> ML is a branch of AI that allows computers to learn from data, identify patterns, and make predictions. It is a powerful tool that can be used in a variety of industries, including healthcare, finance, and transportation."
Note: `TRITON_BACKEND` has two possible options: `tensorrtllm` and `python`. If not specified, the default is `tensorrtllm`.

Use the [launch_triton_server.py](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/scripts/launch_triton_server.py) script to launch Triton Server:
```bash
python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=<world size of the engine> --model_repo=/opt/tritonserver/inflight_batcher_llm
```
`<world size of the engine>` is the number of GPUs you want to use to run the engine. Set it to 1 for single-GPU deployment.
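For example, to serve a single-GPU engine from the model repository used in this guide, the command above becomes:

```bash
python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=1 --model_repo=/opt/tritonserver/inflight_batcher_llm
```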
> You should expect the following response:
>```
> ...
>```

To stop Triton Server inside the container, run:
```bash
pkill tritonserver
```
Note: do not forget to run the above command to stop Triton Server if launching it failed for any reason; otherwise, a leftover process can cause OOM or MPI issues.
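The responses below come from querying the running server over HTTP with Triton's `generate` endpoint. A minimal sketch of such a request, assuming the default HTTP port 8000 and the `ensemble` model deployed above (the exact set of required fields, such as `bad_words` and `stop_words`, can vary by backend version, and the first response below includes `context_logits` only when the request asks for them):

```bash
# Send a generation request to the deployed ensemble model (hypothetical parameter values).
curl -X POST localhost:8000/v2/models/ensemble/generate \
  -d '{"text_input": "What is ML?", "max_tokens": 64, "bad_words": "", "stop_words": ""}'
```

> You should expect responses similar to:
>```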
> {"context_logits":0.0,...,"text_output":"What is ML?\nML is a branch of AI that allows computers to learn from data, identify patterns, and make predictions. It is a powerful tool that can be used in a variety of industries, including healthcare, finance, and transportation."}
> {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"What is ML?\nML is a branch of AI that allows computers to learn from data, identify patterns, and make predictions. It is a powerful tool that can be used in a variety of industries, including healthcare, finance, and transportation."}
>```
### Evaluating performance with Gen-AI Perf
Gen-AI Perf is a command line tool for measuring the throughput and latency of generative AI models as served through an inference server.
You can read more about Gen-AI Perf [here](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/client/src/c%2B%2B/perf_analyzer/genai-perf/README.html).
To use Gen-AI Perf, run the following command in the same Triton Docker container (i.e., nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk):
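A representative invocation is sketched below. Flag names follow recent genai-perf releases and may differ by version; the model name `ensemble` matches the deployment above, while the token counts, concurrency, and gRPC URL are illustrative assumptions:

```bash
# Profile the deployed ensemble model through Triton's tensorrtllm backend
# using synthetic prompts (~200 input tokens, ~100 output tokens, concurrency 1).
genai-perf profile \
  -m ensemble \
  --service-kind triton \
  --backend tensorrtllm \
  --streaming \
  --synthetic-input-tokens-mean 200 \
  --output-tokens-mean 100 \
  --concurrency 1 \
  --url localhost:8001
```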