
[Bug] Regression: "opea/vllm-gaudi:latest" container in crash loop #1038

@eero-t


Priority

Undecided

OS type

  • OS: Ubuntu 22.04
  • Kernel: 5.15.0

Hardware type

  • HW: Gaudi2
  • driver_ver: 1.16.2-f195ec4

Installation method

  • Pull docker images from hub.docker.com

Deploy method

  • Helm

Running nodes

Single Node

What's the version?

https://hub.docker.com/layers/opea/vllm-gaudi/latest/images/sha256-d2c0b0aa88cd26ae2084990663d8d789728f658bacacd8a49cc5b81a6a022c58

Description

The vllm-gaudi:latest container does not find any Gaudi devices and is stuck in a crash loop.

If I change the latest tag to 1.1, it works fine, i.e. this is a regression.

Reproduce steps

Run ChatQnA from GenAIInfra with vLLM:
$ helm install chatqna chatqna/ --skip-tests --values chatqna/gaudi-vllm-values.yaml ...
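For context, a minimal sketch of what the failing pod's spec needs in order to see the accelerators: the container must request Gaudi devices through the Habana Kubernetes device plugin. The resource name and structure below are assumptions based on the device plugin convention, not the exact contents of gaudi-vllm-values.yaml:

```yaml
# Hedged sketch: without a habana.ai/gaudi resource request, the device
# plugin never mounts the accelerator devices into the container, and
# HPU initialization fails at startup.
resources:
  limits:
    habana.ai/gaudi: 1
```

If the chart already sets this (it works with the 1.1 tag), the request is likely fine and the regression is inside the latest image itself.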

Raw log

$ kubectl logs chatqna-vllm-75dfb59d66-wp4vs
...
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py", line 132, in current_device
    init()
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py", line 71, in init
    _hpu_C.init()
RuntimeError: synStatus=8 [Device not found] Device acquire failed.
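The crash loop happens because `habana_frameworks.torch.hpu.init()` raises at import/startup time. A minimal sketch for probing device availability without crashing, using the same module path as the traceback (`is_available` is assumed to behave like the usual framework availability check):

```python
# Hedged sketch: probe HPU availability the way the traceback's code path
# does, but degrade gracefully instead of crash-looping.
try:
    import habana_frameworks.torch.hpu as hthpu  # same module as in the log
    available = hthpu.is_available()
except (ImportError, RuntimeError):
    # ImportError: Habana stack not installed in this environment;
    # RuntimeError: synStatus=8 "Device not found", as seen in the raw log.
    available = False

print("HPU available:", available)
```

Running this inside the failing container should print `HPU available: False`, confirming the problem is device acquisition rather than the vLLM code itself.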

Metadata

Status: Done