Replies: 3 comments
-
I upgraded cmake to version 3.31.2 and I no longer get the error: I can compile and run, but the GPUs are still not detected at runtime.
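In case it helps anyone debugging the same thing, here is a rough runtime sanity check (llama-cli is the current binary name, older builds call it main; /path/to/model.gguf is a placeholder and the exact log wording varies by version):

# confirm the driver itself sees the GPUs
nvidia-smi -L
# request GPU offload and grep the startup log; a CUDA-enabled build
# should mention the ggml CUDA backend and the devices it found
./build/bin/llama-cli -m /path/to/model.gguf -ngl 99 -p "hi" 2>&1 | grep -i cuda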
-
If it's any help, I'm also stuck at this step, but I've tested other things and while the GPUs in the system show up there (e.g. in pytorch), llama.cpp still doesn't detect them. Our system is an 8x PCIe H200 NVL system.
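For anyone wanting to reproduce that check, something along these lines works (assuming a CUDA-enabled torch install; adjust the python interpreter name to your environment):

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
# should print True and the number of visible GPUs (8 here) when the driver/toolkit stack is healthy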
-
After a lot of swearing, I built the cuda-samples deviceQuery sample, and it output an error.
Googling that error suggested down/side-grading from the open driver to the proprietary one, which I did. After a reboot things mostly work (I still have a problem in torch with multiple GPUs, but I suspect that's a different problem entirely), so I suggest you try that?
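If you want to confirm which driver flavor is currently loaded before switching, something like this should tell you (a rough sketch; the exact strings vary with driver packaging):

# the open kernel modules are dual MIT/GPL licensed, the proprietary ones report "NVIDIA"
modinfo nvidia | grep -i license
# recent drivers also mention "Open Kernel Module" here when the open flavor is in use
cat /proc/driver/nvidia/version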
-
I'm having problems building llama.cpp on a DGX-H200.
Driver Version: 570.158.01 CUDA Version: 12.8
cmake version is 3.26.5
nvcc is in the normal /usr/local/cuda and is detected
running cmake -B build -DGGML_CUDA=1 ends with:
CMake Error in ggml/src/ggml-cuda/CMakeLists.txt:
CUDA_ARCHITECTURES is set to "native", but no GPU was detected.
If I override and build using:
cmake3 -B build -DGGML_CUDA=1 -DCMAKE_CUDA_ARCHITECTURES=89
It builds without error, but the resulting binary does not use the GPU when executed. I suspect something non-standard about our system is keeping llama.cpp from detecting the GPUs both at build time (native) and at run time. Is there something I overlooked in the configuration?
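For reference, two things worth checking, offered as a rough sketch (assuming the CUDA 12.x toolkit in /usr/local/cuda): whether the configure-time environment can see a device at all, and the architecture value itself, since the H200 is a Hopper part with compute capability 9.0, so 90 would be the matching value rather than 89 (which is Ada):

# can the build environment see the GPUs at configure time?
nvidia-smi -L
nvidia-smi --query-gpu=compute_cap --format=csv,noheader   # should print 9.0 on an H200
# explicit Hopper architecture instead of relying on native detection
cmake -B build -DGGML_CUDA=1 -DCMAKE_CUDA_ARCHITECTURES=90
cmake --build build -j

If nvidia-smi itself fails in that environment, the problem is below llama.cpp (driver or container device visibility) rather than in the build configuration.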