Cannot validate any workbench or batch edit data sets #6431

Closed
emenslin opened this issue Apr 21, 2025 · 4 comments · Fixed by #6437
Labels: 2 - WorkBench, Batch Edit, regression
Milestone: 7.10.3

Comments

@emenslin
Collaborator

Describe the bug
When trying to validate a data set, it just stays on the first validation status screen and nothing happens. This happens on production and on the batch edit PRs #6196 and #6428, so it is preventing a lot of testing. It is happening on ojsmnh, calvert, and kuinvertpaleo, but not on ciscollecitons, so I'm not sure what the issue is.

To Reproduce
Steps to reproduce the behavior:

For batch edit:

  1. Create a query
  2. Press batch edit
  3. Try to validate
  4. See that it is stuck on the validation status dialog

For workbench:

  1. Create or import a data set
  2. Fill out rows if you created a new one
  3. Validate
  4. See error

Expected behavior
The data set should validate properly.

Screenshots
Batch edit:

04-21_14.15.mp4

Workbench:

04-21_14.16.mp4


@emenslin added the 2 - WorkBench, Batch Edit, and regression labels Apr 21, 2025
@emenslin added this to the 7.10.3 milestone Apr 21, 2025
@sharadsw
Contributor

Cannot recreate on ojsmnh20250314 right now. Will continue testing.
Batch edit validation worked: https://ojsmnh20250314-production.test.specifysystems.org/specify/workbench/219


The picklist workbench data set from the video worked: https://ojsmnh20250314-production.test.specifysystems.org/specify/workbench/201


Maybe it's a browser-side issue related to caching?

@melton-jason
Contributor

I'm able to consistently recreate the issue on the ojsmnh_2025_04_04_BatchEditLatest database on branch issue-6127 on the Test Panel.

issue_6431.mov

https://ojsmnh20250404batcheditlatest-issue-6127.test.specifysystems.org/specify/workbench/108

It doesn't look like a cache issue: it looks like a problem with the worker itself. More specifically, either the worker is unable to connect to redis/the backend, or vice versa.

Note that this problem also occurs for the record merging process, not just WorkBench operations.

issue_6431_merge.mov

Looking at the worker logs for this instance, it looks like the Celery worker lost its connection to redis and has since been unresponsive to the backend's attempts to communicate through redis.

Below are the logs for the instance's worker as well as the instance's docker-compose.yml backend and worker services.

worker_logs.txt

docker_config.txt

I'm unable to recreate the issue locally, and I'm unsure exactly what caused this disconnect on the Test Panel in the first place.
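For anyone else poking at this: a quick way to check whether any worker is still reachable over the broker is a Celery control ping. A minimal sketch, using the redis://redis:6379/0 broker URL from the logs; the app name here is illustrative, not Specify 7's actual Celery app:

```python
from celery import Celery

# Broker URL taken from the worker logs; the app name is only illustrative.
app = Celery("connectivity_check", broker="redis://redis:6379/0")

# ping() broadcasts a control message over the broker and collects replies
# from live workers. An empty list matches the symptom described above:
# the worker process is running but no longer listening on redis.
replies = app.control.ping(timeout=2.0)
print(replies if replies else "no workers responded within the timeout")
```

If this returns replies while validation still hangs, the problem is more likely in the task/status plumbing than in broker connectivity.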

@sharadsw
Contributor

I was able to recreate it once so far on https://ojsmnhbatchedit20250404-batchedit-broken-validation-debug.test.specifysystems.org/specify/workbench/345

[22/Apr/2025 11:55:32] [INFO] [specifyweb.workbench.upload.upload:83] rolling back save point: 'main upload' due to: 'no_commit option'
[2025-04-22 11:55:32,298: INFO/ForkPoolWorker-1] Task specifyweb.workbench.tasks.upload[b95ea291-27ed-4c35-80bb-ac5d88f9787b] succeeded in 0.14176161214709282s: None
[2025-04-22 11:56:51,308: INFO/MainProcess] sync with celery@598ae8dcbad9
[2025-04-22 12:23:37,460: INFO/MainProcess] sync with celery@9d9ac5e86154
[2025-04-22 12:30:01,355: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
  File "/opt/specify7/ve/lib/python3.8/site-packages/celery/worker/consumer/consumer.py", line 332, in start
    blueprint.start(self)
  File "/opt/specify7/ve/lib/python3.8/site-packages/celery/bootsteps.py", line 116, in start
    step.start(parent)
  File "/opt/specify7/ve/lib/python3.8/site-packages/celery/worker/consumer/consumer.py", line 628, in start
    c.loop(*c.loop_args())
  File "/opt/specify7/ve/lib/python3.8/site-packages/celery/worker/loops.py", line 97, in asynloop
    next(loop)
  File "/opt/specify7/ve/lib/python3.8/site-packages/kombu/asynchronous/hub.py", line 362, in create_loop
    cb(*cbargs)
  File "/opt/specify7/ve/lib/python3.8/site-packages/kombu/transport/redis.py", line 1326, in on_readable
    self.cycle.on_readable(fileno)
  File "/opt/specify7/ve/lib/python3.8/site-packages/kombu/transport/redis.py", line 562, in on_readable
    chan.handlers[type]()
  File "/opt/specify7/ve/lib/python3.8/site-packages/kombu/transport/redis.py", line 906, in _receive
    ret.append(self._receive_one(c))
  File "/opt/specify7/ve/lib/python3.8/site-packages/kombu/transport/redis.py", line 916, in _receive_one
    response = c.parse_response()
  File "/opt/specify7/ve/lib/python3.8/site-packages/redis/client.py", line 865, in parse_response
    response = self._execute(conn, try_read)
  File "/opt/specify7/ve/lib/python3.8/site-packages/redis/client.py", line 841, in _execute
    return conn.retry.call_with_retry(
  File "/opt/specify7/ve/lib/python3.8/site-packages/redis/retry.py", line 65, in call_with_retry
    fail(error)
  File "/opt/specify7/ve/lib/python3.8/site-packages/redis/client.py", line 843, in <lambda>
    lambda error: self._disconnect_raise_connect(conn, error),
  File "/opt/specify7/ve/lib/python3.8/site-packages/redis/client.py", line 830, in _disconnect_raise_connect
    raise error
  File "/opt/specify7/ve/lib/python3.8/site-packages/redis/retry.py", line 62, in call_with_retry
    return do()
  File "/opt/specify7/ve/lib/python3.8/site-packages/redis/client.py", line 842, in <lambda>
    lambda: command(*args, **kwargs),
  File "/opt/specify7/ve/lib/python3.8/site-packages/redis/client.py", line 863, in try_read
    return conn.read_response(disconnect_on_error=False, push_request=True)
  File "/opt/specify7/ve/lib/python3.8/site-packages/redis/connection.py", line 592, in read_response
    response = self._parser.read_response(disable_decoding=disable_decoding)
  File "/opt/specify7/ve/lib/python3.8/site-packages/redis/_parsers/resp2.py", line 15, in read_response
    result = self._read_response(disable_decoding=disable_decoding)
  File "/opt/specify7/ve/lib/python3.8/site-packages/redis/_parsers/resp2.py", line 25, in _read_response
    raw = self._buffer.readline()
  File "/opt/specify7/ve/lib/python3.8/site-packages/redis/_parsers/socket.py", line 115, in readline
    self._read_from_socket()
  File "/opt/specify7/ve/lib/python3.8/site-packages/redis/_parsers/socket.py", line 68, in _read_from_socket
    raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
redis.exceptions.ConnectionError: Connection closed by server.
[2025-04-22 12:30:01,435: WARNING/MainProcess] /opt/specify7/ve/lib/python3.8/site-packages/celery/worker/consumer/consumer.py:367: CPendingDeprecationWarning: 
In Celery 5.1 we introduced an optional breaking change which
on connection loss cancels all currently executed tasks with late acknowledgement enabled.
These tasks cannot be acknowledged as the connection is gone, and the tasks are automatically redelivered back to the queue.
You can enable this behavior using the worker_cancel_long_running_tasks_on_connection_loss setting.
In Celery 5.1 it is set to False by default. The setting will be set to True by default in Celery 6.0.

  warnings.warn(CANCEL_TASKS_BY_DEFAULT, CPendingDeprecationWarning)

[2025-04-22 12:30:20,919: INFO/MainProcess] Connected to redis://redis:6379/0
[2025-04-22 12:30:20,942: INFO/MainProcess] mingle: searching for neighbors
[2025-04-22 12:30:21,988: INFO/MainProcess] mingle: sync with 4 nodes
[2025-04-22 12:30:21,989: INFO/MainProcess] mingle: sync complete

It does look like some internal error with Celery. However, I was able to click Stop in the validation dialog, retry, and it started working again.
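The CPendingDeprecationWarning in the log above points at one relevant knob: with worker_cancel_long_running_tasks_on_connection_loss enabled, tasks that were in flight when the broker connection dropped get cancelled and redelivered instead of being left in limbo. A minimal sketch of the settings involved; where exactly this belongs in Specify 7's configuration is an assumption, but the option names are standard Celery configuration:

```python
from celery import Celery

# Illustrative app name; Specify 7 defines its Celery app elsewhere.
app = Celery("specify7_sketch", broker="redis://redis:6379/0")

# Cancel and redeliver late-acknowledged tasks when the broker connection is
# lost, rather than leaving them hanging (the behaviour the warning describes).
app.conf.worker_cancel_long_running_tasks_on_connection_loss = True

# Keep retrying the redis broker after a dropped connection instead of
# giving up after the default number of attempts.
app.conf.broker_connection_retry = True
app.conf.broker_connection_max_retries = None  # None means retry forever
```

This wouldn't explain why the connection dropped in the first place, but it should keep a dropped connection from leaving the validation dialog stuck.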

@sharadsw
Contributor

sharadsw commented Apr 23, 2025

My guess is that this has something to do with the recent Django upgrade to 4.2. I have created a branch that upgrades Celery to the latest version (#6437).

There doesn't seem to be a way to reliably recreate the worker disconnection, so we'll just have to keep testing the branch and hope it fixes the problem. Here's an instance with the latest Celery version: https://ojsmnhbatchedit20250404-issue-6431.test.specifysystems.org/specify/workbench/345

@sharadsw mentioned this issue Apr 23, 2025
@CarolineDenis linked a pull request Apr 24, 2025 that will close this issue