
Potential memory leak in JEG server due to notebook websocket handler #1018


Closed
rahul26goyal opened this issue Dec 6, 2021 · 3 comments


@rahul26goyal
Contributor

Description

We have deployed JEG on a Kubernetes cluster and are trying to run the spark_python_kubernetes and python_kubernetes kernels with it. The JEG server pod runs out of memory every 6-10 hours. The pod has a memory limit of 4 GB.

This mainly happens when a Notebook UI client tries to re-establish a WebSocket connection to a previously running kernel, i.e. the Notebook UI still thinks the kernel exists but JEG no longer knows about it. This occurs when JEG is restarted and loses context about the previously running kernels, while the notebook is unaware of the restart and keeps trying to connect to the existing kernel session(?).

Based on our testing, we think the _register_session method in the notebook kernel handler is causing the leak by creating a new session object each time the notebook hits the /api/kernels/<>/channels API. Note that JEG returns a 404 response to the notebook, but the notebook does not stop retrying.
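To make the suspected pattern concrete, here is a minimal toy sketch of how registering a session before validating the kernel can grow memory on every retry. This is not the actual notebook handler code: `ToyChannelsHandler`, `handle_leaky`, `handle_checked`, and `simulate` are hypothetical names used only for illustration, and the real `_register_session` logic may differ.

```python
class ToyChannelsHandler:
    """Toy stand-in for a websocket channels handler (hypothetical, not notebook code)."""

    # Class-level registry shared by every handler instance, so entries
    # survive after the individual request has finished.
    open_sessions = {}

    def __init__(self, known_kernels):
        self.known_kernels = known_kernels

    def handle_leaky(self, kernel_id, attempt):
        # Registers first and validates second: the entry outlives the 404.
        self.open_sessions[f"{kernel_id}:{attempt}"] = object()
        if kernel_id not in self.known_kernels:
            return 404
        return 101

    def handle_checked(self, kernel_id, attempt):
        # Validates first, so an unknown kernel never adds an entry.
        if kernel_id not in self.known_kernels:
            return 404
        self.open_sessions[f"{kernel_id}:{attempt}"] = object()
        return 101


def simulate(method_name, attempts):
    """Retry against an unknown kernel and return how many entries remain."""
    ToyChannelsHandler.open_sessions.clear()
    for attempt in range(attempts):
        handler = ToyChannelsHandler(known_kernels=set())
        status = getattr(handler, method_name)("stale-kernel-id", attempt)
        assert status == 404
    return len(ToyChannelsHandler.open_sessions)
```

In this toy model, `simulate("handle_leaky", n)` leaves one entry per retry behind, while `simulate("handle_checked", n)` leaves none, which is the shape of growth we would expect if the registration happens before the 404 is returned.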

More details about the issue available here: jupyter/notebook#6244

Tagging @Vishwajeet0510 from our team, who is working on this issue.

@kevin-bates: have you seen this behaviour before?

Screenshots / Logs

Environment

Enterprise Gateway Version [v 2.1.0]
Notebook Version [v 6.0.3]
Others [Artillery : 1.7.9]

@kevin-bates
Member

Hi @rahul26goyal - although I have not seen this, I also haven't looked into things at this level. I agree with your assessment from the client side but haven't spent time with your Notebook PR (I'm trying to be on vacation this month 😄 ).

One thing you might be interested in, and perhaps not aware of, is that EG has experimental HA/DR support that is disabled by default. With PR #737 (released in 2.1) we essentially introduced support for active/active HA, but your "DR" case should be applicable as well. With those changes, a potential 404 from the lookup by kernel_id can be averted because EG then checks a directory for that kernel's connection information. So, assuming the kernel persistence location is on a shared drive (or, in your case, even the local filesystem), EG will attempt to reconnect to the remote kernel.
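As a rough illustration of that lookup order (in-memory first, then the persistence directory, and only then a 404), here is a small sketch. It is not EG's implementation: `find_kernel`, the one-JSON-file-per-kernel layout, and the file naming are assumptions made purely for the example.

```python
import json
from pathlib import Path


def find_kernel(kernel_id, in_memory, persistence_root):
    """Return connection info for kernel_id, or None (the 404 case)."""
    # 1. Normal path: the gateway already tracks this kernel.
    if kernel_id in in_memory:
        return in_memory[kernel_id]

    # 2. Fallback: look for persisted connection info.
    #    Hypothetical layout: one JSON file per kernel under persistence_root.
    candidate = Path(persistence_root) / f"{kernel_id}.json"
    if candidate.is_file():
        info = json.loads(candidate.read_text())
        in_memory[kernel_id] = info  # re-adopt the kernel after a restart
        return info

    # 3. Nothing in memory or on disk: the caller would answer 404.
    return None
```

With an empty `in_memory` dict (as after a restart), a kernel whose file still exists under `persistence_root` is found and re-adopted, while an unknown kernel_id still falls through to the 404 case.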

The configuration settings you'd need to look into/configure would be:

FileKernelSessionManager.enable_persistence = True
FileKernelSessionManager.persistence_root = /my/persistence/dir

Please note that the attempts to locate a kernel from the Notebook server to the EG server, when no EG is running, may still result in a leak, so this information doesn't necessarily preclude the need to update the client side.
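For the client side, one possible shape of a fix is a bounded reconnect loop that treats a 404 as final. The sketch below is only an illustration under stated assumptions, not what the Notebook PR does: `reconnect` and the `connect` callable (returning an HTTP-like status, or raising `ConnectionError` when the gateway is unreachable) are hypothetical, and a real client would also wait between attempts.

```python
def reconnect(connect, max_attempts=5):
    """Call connect() until it succeeds, reports 404, or attempts run out.

    Returns a (outcome, attempts_used) tuple.
    """
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        try:
            status = connect()
        except ConnectionError:
            # Gateway unreachable: retry, but only up to max_attempts.
            continue
        if status == 101:
            return "connected", attempts
        if status == 404:
            # The gateway no longer knows this kernel: retrying cannot help.
            return "kernel_gone", attempts
    return "gave_up", attempts
```

The design point is simply that both failure modes terminate: an unknown kernel stops after the first 404, and an unreachable gateway stops after `max_attempts`.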

I hope that helps. Since I am out of the office this month, please know that my responses may be more delayed than usual.

@rahul26goyal
Contributor Author

Thanks a lot, @kevin-bates, for your response, and sorry to disturb you during your vacation. Have a good time :)

You are right that HA/DR might not solve all the issues, but it will help mitigate the loss of kernel context when JEG restarts. I will try this out.

@kevin-bates
Member

Since this is being addressed in the respective server projects (Notebook and JupyterServer), I'm closing this issue.
