
[Bug] Regression: "opea/vllm-gaudi:latest" container in crash loop #1038

@eero-t


Priority

Undecided

OS type

  • OS: Ubuntu 22.04
  • Kernel: 5.15.0

Hardware type

  • HW: Gaudi2
  • driver_ver: 1.16.2-f195ec4

Installation method

  • Pull docker images from hub.docker.com

Deploy method

  • Helm

Running nodes

Single Node

What's the version?

https://hub.docker.com/layers/opea/vllm-gaudi/latest/images/sha256-d2c0b0aa88cd26ae2084990663d8d789728f658bacacd8a49cc5b81a6a022c58

Description

The vllm-gaudi:latest container does not find any Gaudi devices and is stuck in a crash loop.

If I change the latest tag to 1.1, it works fine, i.e. this is a regression.

Reproduce steps

Run ChatQnA from GenAIInfra with vLLM:
$ helm install chatqna chatqna/ --skip-tests --values chatqna/gaudi-vllm-values.yaml ...
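For context, a minimal sketch of what the failing pod's spec needs in order to see the accelerators: the container must request Gaudi devices through the Habana Kubernetes device plugin. The resource name and structure below are assumptions based on the device plugin convention, not the exact contents of gaudi-vllm-values.yaml:

```yaml
# Hedged sketch: without a habana.ai/gaudi resource request, the device
# plugin never mounts the accelerator devices into the container, and
# HPU initialization fails at startup.
resources:
  limits:
    habana.ai/gaudi: 1
```

If the chart already sets this (it works with the 1.1 tag), the request is likely fine and the regression is inside the latest image itself.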

Raw log

$ kubectl logs chatqna-vllm-75dfb59d66-wp4vs
...
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py", line 132, in current_device
    init()
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py", line 71, in init
    _hpu_C.init()
RuntimeError: synStatus=8 [Device not found] Device acquire failed.
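The crash loop happens because `habana_frameworks.torch.hpu.init()` raises at import/startup time. A minimal sketch for probing device availability without crashing, using the same module path as the traceback (`is_available` is assumed to behave like the usual framework availability check):

```python
# Hedged sketch: probe HPU availability the way the traceback's code path
# does, but degrade gracefully instead of crash-looping.
try:
    import habana_frameworks.torch.hpu as hthpu  # same module as in the log
    available = hthpu.is_available()
except (ImportError, RuntimeError):
    # ImportError: Habana stack not installed in this environment;
    # RuntimeError: synStatus=8 "Device not found", as seen in the raw log.
    available = False

print("HPU available:", available)
```

Running this inside the failing container should print `HPU available: False`, confirming the problem is device acquisition rather than the vLLM code itself.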

Metadata

Status: Done