Potential memory leak in JEG server due to notebook websocket handler #1018
Comments
Hi @rahul26goyal - although I have not seen this, I also haven't looked into things at this level. I agree with your assessment from the client side but haven't spent time with your Notebook PR (I'm trying to be on vacation this month 😄 ). One thing you might be interested in and perhaps not aware of is EG has experimental HA/DR support that is disabled by default. With PR #737 (released in 2.1) we essentially introduced support for active/active HA, but your "DR" case should be applicable as well. With those changes, a potential 404 from the lookup by kernel_id could be averted when EG then checks a directory for that kernel's connection information. So, assuming the kernel persistence location is on a shared drive (or, in your case, even the local filesystem), EG will attempt to reconnect to the remote kernel. The configuration settings you'd need to look into/configure would be:
Please note that the attempts to locate a kernel from the Notebook server to the EG server, when no EG is running, may still result in a leak, so this information doesn't necessarily preclude the need to update the client side. I hope that helps. Since I am out of the office this month, please know that my responses may be more delayed than usual.
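The fallback described above (look the kernel up by kernel_id, then check a directory for its connection information before answering 404) can be pictured with a small sketch. This is not Enterprise Gateway code: the function name `find_connection_info`, the `live_kernels` dict, and the `<kernel_id>.json` file layout are all assumptions made for illustration only.

```python
import json
from pathlib import Path


def find_connection_info(kernel_id, live_kernels, persistence_dir):
    """Hypothetical lookup: in-memory first, then a persisted file.

    Returns the connection info dict, or None (the case that would
    surface to the client as a 404).
    """
    # 1. Normal path: the gateway still knows about the kernel.
    if kernel_id in live_kernels:
        return live_kernels[kernel_id]

    # 2. Fallback path: the gateway was restarted, so check the
    #    (assumed) persistence directory for saved connection info.
    candidate = Path(persistence_dir) / f"{kernel_id}.json"
    if candidate.is_file():
        info = json.loads(candidate.read_text())
        live_kernels[kernel_id] = info  # re-adopt the kernel
        return info

    # 3. Nothing in memory and nothing persisted.
    return None
```

With a lookup of this shape, a restarted gateway with an empty `live_kernels` can still resolve a kernel whose connection file survived on shared or local storage, which is why the 404 could be averted.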
Thanks a lot @kevin-bates for your response, and sorry to disturb you during your vacation. Have a good time :) You are right that the
Since this is being addressed in the respective server projects (Notebook and JupyterServer), I'm closing this issue.
Description
We have deployed JEG on a Kubernetes cluster and are trying to run the spark_python_kubernetes and python_kubernetes kernels with it. The JEG server pod runs out of memory every 6-10 hours. We have given the pod a maximum of 4 GB of memory.
This mainly happens when a Notebook UI client tries to re-establish a WebSocket connection to a previously running kernel, i.e. the Notebook UI still thinks the kernel exists but JEG no longer knows about it. This occurs when JEG is restarted and loses context about the previously running kernels, while the notebook is unaware of the JEG restart and keeps trying to connect to the existing kernel session(?).
Based on our testing, we think the _register_session method in the notebook kernel handler is causing the leak by creating a new session object each time the notebook hits the /api/kernels/<>/channels API. Note that JEG returns a 404 response to the notebook, but the notebook does not stop trying. More details about the issue are available here: jupyter/notebook#6244
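To make the suspected pattern concrete, here is a simplified model of it. It is not the notebook's handler code: `ChannelHandlerModel`, `open_sessions`, and the `cleanup_on_failure` flag are hypothetical names, and the model only assumes what is described above, namely that each connection attempt registers a new session object and that the 404 path does not release it.

```python
import uuid


class ChannelHandlerModel:
    """Toy stand-in for a websocket channel handler (hypothetical)."""

    # Class-level registry shared by every handler instance.
    open_sessions = {}

    def __init__(self, kernel_id, cleanup_on_failure):
        self.kernel_id = kernel_id
        self.session_id = uuid.uuid4().hex  # each retry gets a new session
        self.cleanup_on_failure = cleanup_on_failure

    @property
    def session_key(self):
        return f"{self.kernel_id}:{self.session_id}"

    def _register_session(self):
        # Stores a reference to the handler, keeping it alive.
        self.open_sessions[self.session_key] = self

    def open(self, known_kernels):
        self._register_session()
        if self.kernel_id not in known_kernels:
            if self.cleanup_on_failure:
                # Release the entry before reporting the failure.
                self.open_sessions.pop(self.session_key, None)
            return 404
        return 101  # websocket upgrade accepted


def simulate_retries(attempts, cleanup_on_failure):
    """Count registry entries left after repeated failed reconnects."""
    ChannelHandlerModel.open_sessions = {}
    known_kernels = set()  # gateway restarted: no kernels known
    for _ in range(attempts):
        handler = ChannelHandlerModel("stale-kernel", cleanup_on_failure)
        assert handler.open(known_kernels) == 404
    return len(ChannelHandlerModel.open_sessions)
```

Calling `simulate_retries(1000, cleanup_on_failure=False)` leaves one registry entry per attempt, while `cleanup_on_failure=True` leaves none. In this model the growth is unbounded for as long as the client keeps retrying, which would be consistent with a pod running out of memory over several hours.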
tagging @Vishwajeet0510 from our team working on this issue.
@kevin-bates : have you seen this behaviour earlier?
Screenshots / Logs
Environment
Enterprise Gateway Version [v 2.1.0]
Notebook Version [v 6.0.3]
Others [Artillery : 1.7.9]