Description
Deployed with the v0.12.0 Docker image; the start command is as follows:
sudo docker run -d -v /home/tskj/MOD/:/home/MOD/ -e XINFERENCE_HOME=/home/MOD -p 9997:9997 --gpus all xprobe/xinference:v0.12.0 xinference-local -H 0.0.0.0 --log-level debug
The machine has 8 GPUs. After selecting 8 GPUs, the model fails to launch with the following error:
2024-06-12 05:08:11,767 xinference.core.worker 95 DEBUG Enter launch_builtin_model, args: (<xinference.core.worker.WorkerActor object at 0x7f013c4aa700>_uid': 'gpt-3.5-turbo-1-0', 'model_name': 'Qwen1.5-110B-Chat', 'model_size_in_billions': 110, 'model_format': 'pytorch', 'quantization': 'none', 'model_engl_type': 'LLM', 'n_gpu': 8, 'request_limits': None, 'peft_model_config': None, 'gpu_idx': None, 'gpu_memory_utilization': 0.9, 'max_model_len': 32768}
2024-06-12 05:08:11,767 xinference.core.worker 95 DEBUG GPU selected: [0, 1, 2, 3, 4, 5, 6, 7] for model gpt-3.5-turbo-1-0
2024-06-12 05:08:15,436 xinference.model.llm.core 95 DEBUG Launching gpt-3.5-turbo-1-0 with VLLMChatModel
2024-06-12 05:08:15,437 xinference.model.llm.llm_family 95 INFO Caching from URI: /home/MOD/Qwen/Qwen1.5-110B-Chat
2024-06-12 05:08:15,437 xinference.model.llm.llm_family 95 INFO Cache /home/MOD/Qwen/Qwen1.5-110B-Chat exists
2024-06-12 05:08:15,461 xinference.model.llm.vllm.core 210 INFO Loading gpt-3.5-turbo with following model config: {'gpu_memory_utilization': 0.9, 'max 'tokenizer_mode': 'auto', 'trust_remote_code': True, 'tensor_parallel_size': 8, 'block_size': 16, 'swap_space': 4, 'max_num_seqs': 256, 'quantization': Nose. Lora count: 0.
2024-06-12 05:08:17,542 WARNING services.py:2009 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67067904 bytes avharm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by p10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2024-06-12 05:08:18,666 INFO worker.py:1753 -- Started a local Ray instance.
INFO 06-12 05:08:20 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/home/MOD/Qwen/Qwen1.5-110B-Chat', speculative_config=None, D/Qwen/Qwen1.5-110B-Chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=Truat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, disable_custom_all_reduce=False, quantization=None, enforcache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_modeln/Qwen1.5-110B-Chat)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-12 05:08:45 utils.py:618] Found nccl from library libnccl.so.2
INFO 06-12 05:08:45 pynccl.py:65] vLLM is using nccl==2.20.5
(RayWorkerWrapper pid=6318) INFO 06-12 05:08:45 utils.py:618] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=6318) INFO 06-12 05:08:45 pynccl.py:65] vLLM is using nccl==2.20.5
ERROR 06-12 05:08:46 worker_base.py:148] Error executing method init_device. This might cause deadlock in distributed execution.
ERROR 06-12 05:08:46 worker_base.py:148] Traceback (most recent call last):
ERROR 06-12 05:08:46 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
ERROR 06-12 05:08:46 worker_base.py:148] return executor(*args, **kwargs)
ERROR 06-12 05:08:46 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 114, in init_device
ERROR 06-12 05:08:46 worker_base.py:148] init_worker_distributed_environment(self.parallel_config, self.rank,
ERROR 06-12 05:08:46 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 349, in init_worker_distributed_envir
ERROR 06-12 05:08:46 worker_base.py:148] ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
ERROR 06-12 05:08:46 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 239, in ensure_model_par
ERROR 06-12 05:08:46 worker_base.py:148] initialize_model_parallel(tensor_model_parallel_size,
ERROR 06-12 05:08:46 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 191, in initialize_model
ERROR 06-12 05:08:46 worker_base.py:148] _TP_PYNCCL_COMMUNICATOR = PyNcclCommunicator(
ERROR 06-12 05:08:46 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 94, in __in
ERROR 06-12 05:08:46 worker_base.py:148] self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
ERROR 06-12 05:08:46 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244nk
ERROR 06-12 05:08:46 worker_base.py:148] self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
ERROR 06-12 05:08:46 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223
ERROR 06-12 05:08:46 worker_base.py:148] raise RuntimeError(f"NCCL error: {error_str}")
ERROR 06-12 05:08:46 worker_base.py:148] RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148] Error executing method init_device. This might cause deadlock in distributed execution
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148] Traceback (most recent call last):
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 140,
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148] return executor(*args, **kwargs)
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 114, in i
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148] init_worker_distributed_environment(self.parallel_config, self.rank,
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 349, in ited_environment
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148] ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", lmodel_parallel_initialized
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148] initialize_model_parallel(tensor_model_parallel_size,
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", lize_model_parallel
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148] _TP_PYNCCL_COMMUNICATOR = PyNcclCommunicator(
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/device_communicators/, in init
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148] self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/device_communicators/ line 244, in ncclCommInitRank
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148] self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/device_communicators/ line 223, in NCCL_CHECK
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148] raise RuntimeError(f"NCCL error: {error_str}")
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148] RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details
The NCCL packages installed on the host are as follows:
(chatchat) yqga@dnb:~$ dpkg -l | grep nccl
ii libnccl-dev 2.21.5-1+cuda12.5 amd64 NVIDIA Collective Communication Library (NCCL) Development Files
ii libnccl2 2.21.5-1+cuda12.5 amd64 NVIDIA Collective Communication Library (NCCL) Runtime
ii nccl-local-repo-ubuntu2204-2.21.5-cuda12.5 1.0-1 amd64 nccl-local repository configuration files
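For reference, the log above carries two hints: the Ray warning that /dev/shm inside the container is only about 64 MB, and the NCCL error's suggestion to rerun with NCCL_DEBUG=INFO. A possible variant of the start command that follows both hints is shown below. This is only a sketch: the --shm-size value is the one Ray itself suggests in the warning and should be sized to the host's available RAM, and -e NCCL_DEBUG=INFO merely makes NCCL print diagnostic details, it does not fix anything by itself.
sudo docker run -d --shm-size=10.24gb -e NCCL_DEBUG=INFO -v /home/tskj/MOD/:/home/MOD/ -e XINFERENCE_HOME=/home/MOD -p 9997:9997 --gpus all xprobe/xinference:v0.12.0 xinference-local -H 0.0.0.0 --log-level debug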