Hello again,
Thank you again for your continued work on the pipeline!
I have been running the pipeline and found that the sd_cell_segmentation step fails with the following error:
ERROR ~ Error executing process > 'sd_segment_cells:sd_cell_segmentation (23)'
Caused by:
Process `sd_segment_cells:sd_cell_segmentation (23)` terminated with an error exit status (1)
Command executed:
python3.8 /opt/stardist_segment.py \
1189_003 \
1189_003-cp4-preprocessed_metadata.csv \
DNA1 \
2D_versatile_fluo \
default \
0.05 \
default \
1189_003-StarDist-Cells.csv \
1189_003-StarDist-Cell_Mask.tiff \
1 > stardist_segmentation_log.txt 2>&1
Command exit status:
1

The log below is from ./work/48/91c .... /stardist_segmentation_log.txt:
2025-09-02 17:31:26.636220: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /.singularity.d/libs
2025-09-02 17:31:26.636269: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2025-09-02 17:31:29.250180: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2025-09-02 17:31:29.252965: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /.singularity.d/libs
2025-09-02 17:31:29.252980: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2025-09-02 17:31:29.253007: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: node0424.palmetto.clemson.edu
2025-09-02 17:31:29.253011: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: node0424.palmetto.clemson.edu
2025-09-02 17:31:29.253071: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
2025-09-02 17:31:29.254470: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 550.163.1
2025-09-02 17:31:29.254655: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-09-02 17:31:29.254818: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
The selected sample name is: 1189_003
The selected probability threshold is: 0.05
Using the model's default value for NMS
Parsing metadata file: 1189_003-cp4-preprocessed_metadata.csv
The files to be loaded are: { ... }
The parsed labels are: ['DNA1']
Loading model: 2D_versatile_fluo default
Found model '2D_versatile_fluo' for 'StarDist2D'.
Loading network weights from 'weights_best.h5'.
Traceback (most recent call last):
  File "/opt/stardist_segment.py", line 190, in <module>
    model = load_model(model_name, model_path)
  File "/opt/stardist_segment.py", line 94, in load_model
    return StarDist2D.from_pretrained(model_to_load)
  File "/usr/local/lib/python3.8/dist-packages/csbdeep/models/base_model.py", line 79, in from_pretrained
    return get_model_instance(cls, name_or_alias)
  File "/usr/local/lib/python3.8/dist-packages/csbdeep/models/pretrained.py", line 102, in get_model_instance
    model = cls(config=None, name=path.stem, basedir=path.parent)
  File "/usr/local/lib/python3.8/dist-packages/stardist/models/model2d.py", line 292, in __init__
    super().__init__(config, name=name, basedir=basedir)
  File "/usr/local/lib/python3.8/dist-packages/stardist/models/base.py", line 220, in __init__
    super().__init__(config=config, name=name, basedir=basedir)
  File "/usr/local/lib/python3.8/dist-packages/csbdeep/models/base_model.py", line 113, in __init__
    self._find_and_load_weights()
  File "/usr/local/lib/python3.8/dist-packages/csbdeep/models/base_model.py", line 32, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/csbdeep/models/base_model.py", line 167, in _find_and_load_weights
    self.load_weights(weights_chosen.name)
  File "/usr/local/lib/python3.8/dist-packages/csbdeep/models/base_model.py", line 32, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/csbdeep/models/base_model.py", line 184, in load_weights
    self.keras_model.load_weights(str(self.logdir/name))
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/engine/training.py", line 2229, in load_weights
    hdf5_format.load_weights_from_hdf5_group(f, self.layers)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/saving/hdf5_format.py", line 696, in load_weights_from_hdf5_group
    weight_values = [np.asarray(g[weight_name]) for weight_name in weight_names]
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/saving/hdf5_format.py", line 696, in <listcomp>
    weight_values = [np.asarray(g[weight_name]) for weight_name in weight_names]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/usr/local/lib/python3.8/dist-packages/h5py/_hl/group.py", line 264, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: 'Unable to open object (bad local heap signature)'
When I enter the container and load the 2D_versatile_fluo model on its own, I do not encounter any errors.
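For reference, this is roughly the check I ran interactively inside the container (just the pretrained-model load, nothing else from the pipeline):

```python
# Minimal check run by hand inside the container: loading the pretrained
# model on its own (a single process) completes without error.
from stardist.models import StarDist2D

model = StarDist2D.from_pretrained('2D_versatile_fluo')
print(model)
```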
I believe this is related to a known issue with StarDist and h5py when multiple processes open the pretrained weights file at the same time [https://github.com/stardist/stardist/issues/93]. I have worked around it by setting executor.queueSize = 1 in my custom profile so that only one job for this step runs at a time, but that feels like a poor solution and I'm not sure of the proper way to handle it. It most likely does not come up on the test data, since only two samples are run at once; I would be willing to try different queue sizes on my local cluster to see when it starts to break. Let me know your thoughts! Another hacky option, discussed in the linked issue thread, is to randomly delay each process so they do not all try to load the model weights at once; a rough sketch of what I mean is below. Let me know if you need more than what is provided log-wise.
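Purely as an illustration of the random-delay idea (none of this is in stardist_segment.py today; the function name and the delay/retry values are made up), something like this around the from_pretrained call:

```python
import random
import time

from stardist.models import StarDist2D


def load_pretrained_with_jitter(name, max_attempts=5):
    """Sketch: stagger and retry the pretrained-model load so concurrent jobs
    are less likely to read the cached weights_best.h5 at the same moment."""
    # Random initial delay so parallel processes start loading at different times.
    time.sleep(random.uniform(0, 30))
    for attempt in range(1, max_attempts + 1):
        try:
            return StarDist2D.from_pretrained(name)
        except (KeyError, OSError):
            # The h5py KeyError above is the failure mode in the log;
            # back off and retry before giving up.
            if attempt == max_attempts:
                raise
            time.sleep(random.uniform(5, 15))


model = load_pretrained_with_jitter('2D_versatile_fluo')
```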
Again thanks for your time and effort!