@shaneknapp
Contributor

we've found that adding the 20th percentile to server startup times gives a more realistic view of the typical single-user server startup duration.

image

this is highly non-critical, but very useful. :)

@shaneknapp
Contributor Author

hmm, not sure why the tests are failing...

@consideRatio
Member

consideRatio commented Nov 9, 2025

I thought about this quite a bit now, with some insights summarized below.

  1. Whenever a user starts a server, we can only tell that the spawn duration lies between two bucket boundaries, defined by JupyterHub's metrics, which are listed below. Note that le stands for "less than or equal to", so each entry counts every spawn that took at most that many seconds, and each entry is a counter that only ever increases.

    jupyterhub_server_spawn_duration_seconds_bucket{le="0.5",status="success"} 0.0
    jupyterhub_server_spawn_duration_seconds_bucket{le="1.0",status="success"} 0.0
    jupyterhub_server_spawn_duration_seconds_bucket{le="2.5",status="success"} 0.0
    jupyterhub_server_spawn_duration_seconds_bucket{le="5.0",status="success"} 0.0
    jupyterhub_server_spawn_duration_seconds_bucket{le="10.0",status="success"} 2.0
    jupyterhub_server_spawn_duration_seconds_bucket{le="15.0",status="success"} 2.0
    jupyterhub_server_spawn_duration_seconds_bucket{le="30.0",status="success"} 2.0
    jupyterhub_server_spawn_duration_seconds_bucket{le="60.0",status="success"} 2.0
    jupyterhub_server_spawn_duration_seconds_bucket{le="120.0",status="success"} 2.0
    jupyterhub_server_spawn_duration_seconds_bucket{le="180.0",status="success"} 2.0
    jupyterhub_server_spawn_duration_seconds_bucket{le="300.0",status="success"} 2.0
    jupyterhub_server_spawn_duration_seconds_bucket{le="600.0",status="success"} 2.0
    jupyterhub_server_spawn_duration_seconds_bucket{le="+Inf",status="success"} 2.0
    
  2. Often a data point reflects only a single server startup, and the percentiles then show up as multiple values spread between that bucket's lower and upper bounds. In the examples below, I used five percentiles: 0, 25, 50, 75, and 100.
    image
    image
    image

  3. Sometimes we have multiple server startup times recorded during a single timestep, and then it can look like this:
    image
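
To illustrate points 1 and 2, here is a minimal Python sketch of linear interpolation within a histogram bucket, which is the general idea behind PromQL's histogram_quantile. It is a simplification and not a copy of Prometheus's implementation, so edge cases (for example empty leading buckets at the 0th percentile) may behave differently. The names BUCKETS and estimate_quantile are made up for this example; the counts are the ones from the metrics listed above.

```python
import math

# Cumulative bucket counts copied from the metrics listed above: (le, count).
BUCKETS = [
    (0.5, 0), (1.0, 0), (2.5, 0), (5.0, 0), (10.0, 2), (15.0, 2),
    (30.0, 2), (60.0, 2), (120.0, 2), (180.0, 2), (300.0, 2),
    (600.0, 2), (math.inf, 2),
]


def estimate_quantile(q, buckets):
    """Estimate quantile q (0..1) by linear interpolation inside one bucket.

    Hypothetical helper: a simplified stand-in for histogram_quantile.
    """
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0  # assume the first bucket starts at 0
    for le, count in buckets:
        in_bucket = count - prev_count
        # Use the first non-empty bucket whose cumulative count reaches the rank.
        if in_bucket > 0 and count >= rank:
            if math.isinf(le):
                # No upper bound to interpolate towards; fall back to the lower one.
                return prev_le
            return prev_le + (le - prev_le) * (rank - prev_count) / in_bucket
        prev_le, prev_count = le, count
    return math.nan


percentiles = [0, 25, 50, 75, 100]
print([estimate_quantile(p / 100, BUCKETS) for p in percentiles])
# → [5.0, 6.25, 7.5, 8.75, 10.0]
```

All we actually know is that both spawns took between 5 and 10 seconds, yet the five percentiles fan out evenly across that range. The spread reflects the bucket width, not any real variation in spawn times.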

My current opinion

I think overall, the 99th or 100th percentile represents the worst case, while the 50th percentile represents the best guess of the typical spawn time. Both of these seem reasonable to me.

I think it could also make sense to see the best case alongside the worst case.

Beyond that, I think it's reasonable to add more points to help see the skew, such as 25 and 75, but we should maintain even spacing between all points, so 0, 25, 50, 75, 100, rather than 0, 20, 50, 100.
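
As a rough sketch of what evenly spaced points could look like as queries, the Python snippet below builds one PromQL expression per percentile. The function name spawn_duration_queries, the 5m rate window, and the exact query shape are assumptions for illustration, not what the dashboard in this repository necessarily uses.

```python
def spawn_duration_queries(n_intervals=4, window="5m"):
    """Build PromQL queries for n_intervals + 1 evenly spaced percentiles.

    Hypothetical helper; the rate window and aggregation are assumptions.
    """
    metric = "jupyterhub_server_spawn_duration_seconds_bucket"
    queries = {}
    for i in range(n_intervals + 1):
        q = i / n_intervals  # 0.0, 0.25, 0.5, 0.75, 1.0 for n_intervals=4
        queries[round(q * 100)] = (
            f"histogram_quantile({q}, "
            f'sum(rate({metric}{{status="success"}}[{window}])) by (le))'
        )
    return queries


for percentile, query in spawn_duration_queries().items():
    print(percentile, query)
```

Generating the quantiles from a single interval count keeps the spacing even by construction, so changing the number of points cannot produce an uneven set like 0, 20, 50, 100.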

@shaneknapp
Contributor Author

i'll go ahead and close this in favor of #161
