Need advice on Solid Queue's memory usage #330


Open
yjchieng opened this issue Sep 7, 2024 · 44 comments

Comments

@yjchieng

yjchieng commented Sep 7, 2024

Ruby: 3.3.4
Rails: 7.2.1
Solid Queue: 0.7.0, 0.8.2


I run a Rails app on an AWS EC2 instance with 1 GB of memory.
I've noticed the Solid Queue process takes up 15-20% of the instance's memory, making it the single largest process by memory usage.


What I checked:

  1. Checked memory usage by stopping/starting the process via supervisorctl
     (which I use to manage my Solid Queue process)

stop via supervisorctl - free memory 276MB
start via supervisorctl - free memory 117MB

Memory usage increases by 159MB.

  2. Stopped the supervisorctl service and ran "solid_queue:start" directly,
     to see whether this was something related to supervisor.

before solid_queue:start - free memory 252MB
after solid_queue:start - free memory 109MB

Memory usage increases by 143MB.

  3. Then I noticed there was a newer version and upgraded to 0.8.2 (was 0.7.0).

stop via supervisorctl - free memory 220MB
start via supervisorctl - free memory 38MB

Memory usage increases by 182MB.


I need some advice:

  1. Is 150-200MB the minimum requirement to run "solid_queue:start"?
  2. Is there any setting/feature that I can switch off to reduce memory usage?
  3. Is there any setting to limit the maximum memory usage?

And, thanks a lot for making this wonderful gem. :)

@rosa
Member

rosa commented Sep 7, 2024

Hey @yjchieng, thanks for opening this issue! 🙏 I think it depends a lot on your app. A brand new Rails app seems to use around 74.6MB of memory for me after booting (without Solid Queue, just running Puma). I think the consumption you're seeing comes from all the processes together, not just the supervisor: measuring free memory before and after starting the supervisor captures everything it forks. Are you running multiple workers or just one? Reducing the number of workers would help. Another thing that might help is using bin/jobs, which preloads the whole app before forking, but the gains there are usually quite modest.
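For reference, a minimal sketch of the kind of low-memory setup described above: a single dispatcher plus one single-threaded worker process (the numbers are illustrative, not a recommendation from this thread):

# config/queue.yml -- hypothetical low-memory configuration
production:
  dispatchers:
    - polling_interval: 1
      batch_size: 500
  workers:
    - queues: "*"
      threads: 1
      processes: 1
      polling_interval: 1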

@rosa
Member

rosa commented Sep 7, 2024

There might also be something else going on, because the only changes from version 0.7.0 to 0.8.2 were to the installation part of Solid Queue; nothing besides the initial installation changed, so the memory footprint shouldn't have changed. I imagine there is other stuff running on your AWS instance at the same time that might be consuming memory as well.

@Focus-me34

Focus-me34 commented Oct 28, 2024

Up 🆙🔥

I have huge memory issues in production (Rails 7.2 + Active Job + Solid Queue). Everything works just fine in dev mode, but in production there seems to be a memory leak. After restarting my production server, I'm at roughly ~75% RAM usage. Very quickly (we're talking minutes...) I get to ~100%. And if I let the app run over the weekend and come back on Monday (like today), I'm at... 288% RAM usage... I tried removing all the lines in my code related to Solid Queue, and I can confirm that this is what's causing the memory issue in production.

The exact error codes I'm getting, causing my app to crash in production (Heroku), are R14 and R15.

Any advice/suggestions would be very much appreciated fellow devs. Have an amazing day!

@rosa
Member

rosa commented Oct 28, 2024

@Focus-me34, what version of Solid Queue are you running? And when you say you're removing anything related to Solid Queue, what Active Job adapter are you using instead?

@Focus-me34

@rosa I'm using Solid Queue version 1.0.0. I checked all the sub-dependency versions; they all meet the requirements.
We haven't really tried any other adapter, since Solid Queue will be the default adapter in Rails 8. Hence, we really want to make it work this way.

Here's some of our setup code:

# scrape_rss_feed_job.rb
class ScrapingJob < ApplicationJob
  queue_as :default
  limits_concurrency to: 1, key: -> { "rss_feed_job" }, duration: 1.minute

  def perform
    Api::V1::EntriesController.fetch_latest_entries
  end
end

# recurring.yml
default: &default
  periodic_cleanup:
    class: ScrapeSecRssFeedJob
    schedule: every 2 minutes

development:
  <<: *default

test:
  <<: *default

production:
  <<: *default

# queue.yml
default: &default
  dispatchers:
    - polling_interval: 1
      batch_size: 500
      concurrency_maintenance_interval: 15
  workers:
    - queues: "*"
      threads: 3
      processes: <%= ENV.fetch("JOB_CONCURRENCY", 1) %>
      polling_interval: 0.1

development:
  <<: *default

test:
  <<: *default

production:
  <<: *default

Do you see anything weird?

@rosa
Member

rosa commented Oct 29, 2024

No, that looks good to me re: configuration. You said:

I tried removing all the lines in my code related to solidQueue, and I can confirm that this is what's causing the memory issue in production.

So, if you don't get any memory issues when not using Solid Queue, and you're not using another adapter, is that because you're not running any jobs at all? Because that would point to the jobs having a memory leak, not Solid Queue.

@Focus-me34

Focus-me34 commented Nov 7, 2024

Hey Rosa, sorry for the delayed reply! I've been very busy at work.

Here’s where we're at: I've been working on getting our company’s code running smoothly with Rails 7.2 and Solid Queue (in production Heroku). As I mentioned earlier, it’s been a huge challenge, and unfortunately, we haven’t had much success with it.

My colleague and I decided to take a closer look at our code to see if the problem was on our end. Since my last comment here, we’ve implemented tests, and I can confirm that the code is behaving exactly as expected.

Our next step, after troubleshooting the high memory usage on Heroku, was to switch away from Solid Queue and try a different job adapter (as you suggested). I set up Sidekiq as the adapter, and we saw a drastic improvement: memory usage dropped from around 170% of our 512 MB quota to a range of 25%-70%.

This leads me to believe that there might be a memory leak in production when using Solid Queue. From our observations, it seems that after the initial job execution completes, instance variables at the top of the function (which should reset to nil at the start of each job) are retaining the values from the previous iteration. We suspect this might be preventing the Garbage Collector from clearing memory properly between jobs.

Let me know if there's any more information I can provide to help you investigate. We’re really looking forward to moving back to using the built-in Solid Queue functionality once this issue is resolved.

[Edit: The job we're running involves two main dependencies. We scrape an RSS feed using Nokogiri and fetch a URL for each entry using httparty]
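For anyone running the same comparison, switching the Active Job adapter to Sidekiq is a one-line Rails setting (a sketch only; the Sidekiq server and Redis setup are separate concerns not covered in this thread):

# config/environments/production.rb -- hypothetical adapter swap for comparison
Rails.application.configure do
  config.active_job.queue_adapter = :sidekiq # switch back to :solid_queue once the issue is resolved
end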

@rajeevriitm

I am observing high memory usage when Solid Queue is used on Heroku as well. Have there been any solutions to this issue? Is there a temporary fix that can be used for now?

@arikarim

Solid Queue with the Puma plugin is using a stupidly high amount of memory.

@rosa
Member

rosa commented Mar 24, 2025

Hey @rajeevriitm, no, no real solutions. I had an async adapter that would run the supervisor, workers and dispatchers and everything together in the same process, which would save memory. However, this was scrapped from the 1.0 version. I need to push for that again.

In the meantime, you can try using rake solid_queue:start to run Solid Queue, since by default that won't preload your whole app, and see if that makes a difference. What's your configuration like?
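As a sketch of the two boot options discussed in this thread (the Procfile entry names are illustrative):

# Procfile -- hypothetical worker entries
worker: bundle exec rake solid_queue:start   # does not preload the whole app by default
# worker: bundle exec bin/jobs               # preloads the app before forking the processes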

@arikarim, thanks for your very helpful and useful comment 😒

@arikarim

Haha, sorry, I was just so tired of this...
until I found out that I was running rails s instead of bundle exec puma.
My bad.

@arikarim

I still think there are some strange issues: with Puma concurrency greater than 0, memory goes up 😢

@rajeevriitm

@rosa I have a simple configuration: the queue runs in a single-server setup, and I have a continuously running job that runs every 30 minutes.

queue.yml

default: &default
  dispatchers:
    - polling_interval: 2
      batch_size: 500
  workers:
    - queues: "*"
      threads: 1
      processes: 1
      polling_interval: 2

development:
  <<: *default

test:
  <<: *default

production:
  <<: *default

puma.rb

threads_count = ENV.fetch("RAILS_MAX_THREADS", 3)
threads threads_count, threads_count
port ENV.fetch("PORT", 3000)
plugin :tmp_restart

# Run the Solid Queue supervisor inside of Puma for single-server deployments
plugin :solid_queue 

pidfile ENV["PIDFILE"] if ENV["PIDFILE"]

@arikarim

@rosa there seems to be a problem with newer versions of Solid Queue. I downgraded my solid_queue gem to 1.1.0 and the memory issue is fixed. 😄

@IvanPakhomov99

IvanPakhomov99 commented Mar 26, 2025

+1, I'm running the latest solid_queue version and am definitely experiencing memory issues. Once I start the container, memory usage keeps increasing indefinitely until it hits the maximum capacity and the container restarts.

I also tried switching from bin/jobs to a Rake task, but that didn’t help either.

Note: I don't run any jobs.

default: &default
  dispatchers:
    - polling_interval: 1
      batch_size: 500
  workers:
    - queues: "*"
      threads: 3
      processes: <%= ENV.fetch("JOB_CONCURRENCY", 1) %>
      polling_interval: 0.1

development:
  <<: *default

test:
  <<: *default

production:
  <<: *default

Image

@eliasousa

eliasousa commented Mar 29, 2025

hi 👋 I'm experiencing the same issue:

Ruby: 3.4.2
Rails: 8.0.2
Solid Queue: 1.1.4
Puma: plugin :solid_queue

Running on a DigitalOcean droplet with 1 GB of memory. I'll try downgrading the solid_queue gem version to see if it improves.

@rosa
Member

rosa commented Mar 29, 2025

Hey all, so sorry about this! I've been swamped with other stuff at work, but I'm going to look into this on Monday.

@rajeevriitm

rajeevriitm commented Mar 30, 2025 via email

@rosa
Member

rosa commented Mar 31, 2025

@IvanPakhomov99, @rajeevriitm, could you try downgrading to version v1.1.2 and let me know if the issue persists? Also, what version of Ruby are you using?

@IvanPakhomov99

Hi @rosa, Thanks for your prompt response. I tested four different Solid Queue versions (1.1.0, 1.1.1, 1.1.2, and 1.1.4) but encountered the same issue across all of them.

Environment:

  • Ruby: 3.4.1
  • Rails: 8.0.1
  • Database: MySQL 8.0 (Solid Queue runs on the main instance)

Let me know if you need any additional details.

@rajeevriitm

@rosa I tried downgrading. The issue exists in 1.1.2 as well.

Ruby 3.4.1
Rails 8.0.1

Happy to help.

@rajeevriitm

Hey @rosa, were you able to identify the issue causing the memory leak?

@rosa
Member

rosa commented Apr 4, 2025

@rajeevriitm, I'm afraid I wasn't 😞 I reviewed all the code from v1.1.0 and didn't identify anything that could leak memory. This was before @IvanPakhomov99 shared that testing 1.1.0, 1.1.1, 1.1.2, and 1.1.4 made no difference. Then I tried running jobs of different kinds (recurring jobs of different types being enqueued every 5 seconds, long-running jobs, etc.) over a couple of days and couldn't reproduce any memory leaks. I think whatever is happening depends on what your jobs are doing or what you're loading in your app.

I also wonder whether this is not a memory leak but simply high memory usage. Solid Queue runs a different process for each worker, a process for the dispatcher, another process for the scheduler, and another for the supervisor. All those processes load your app, and even though the fork is done after loading the app, so CoW should ensure memory sharing, this is not the same as running a single process.

I don't have a better idea to reduce memory usage other than providing a single-process execution mode (what I've called "async" mode).
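To see this per-process layout on a running system, one option (a sketch, assuming a Rails console connected to the queue database) is to query the process registry that Solid Queue maintains:

# Rails console; the kinds and pids below are illustrative
SolidQueue::Process.pluck(:kind, :pid)
# => [["Supervisor", 2], ["Dispatcher", 53], ["Worker", 57]]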

@IvanPakhomov99

@rosa Thanks for taking the time to look into this! The key detail here is that it's a brand new service and I'm not running any jobs yet. I also confirmed that the service pod's memory is stable.

That said, I actually have an update — I downgraded to version 1.0.2, and memory usage looks stable now. I’ll stick with this version for now.

@chayuto

chayuto commented Apr 5, 2025

I have the same issue on Heroku too, with version 1.1.4.

As mentioned above, downgrading to version 1.0.2 solved the issue.

Image

@jeffcoh23

@rosa Thank you for your attentiveness to this issue. I just downgraded solid_queue to 1.0.2 and can confirm that it drastically reduced the R14 related errors/memory usage.

@rosa
Member

rosa commented Apr 7, 2025

Oh! Thanks a lot for confirming that! I had only looked down to version 1.1.0. I'm on-call this week so a bit short on time but will try to figure out why the memory increased from that version to 1.1.0.

@IvanPakhomov99

IvanPakhomov99 commented Apr 7, 2025

@rosa I’m not sure how relevant this is, but it might be worth taking another look at this commit: a152f26. It looks like interruptible_sleep was updated to use Promises.future on each call. Given how this method is used — often multiple times within a loop — it’s quite possible this could lead to increased memory consumption.

Even though the block uses .value, making it synchronous from the caller’s perspective, a new thread or fiber is still created under the hood for each call.

Sorry if that’s not the case — I might not be seeing the full picture. Just reviewing the changes between versions 1.1.0 and 1.0.2.
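A simplified illustration of the suspected pattern (not the actual Solid Queue code): each call builds a new future on concurrent-ruby's executor, so a tight polling loop allocates a future and its associated pool bookkeeping on every iteration, even though .value makes it look synchronous to the caller:

require "concurrent"

# Hypothetical stand-in for an interruptible sleep built on Promises.future
def future_backed_sleep(seconds)
  Concurrent::Promises.future { sleep seconds }.value # blocks the caller, but still schedules work on a pool
end

# Polling-loop shape: one future allocated per iteration
5.times { future_backed_sleep(0.1) }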

@jeffcoh23

jeffcoh23 commented Apr 12, 2025

@rosa Thank you for your attentiveness to this issue. I just downgraded solid_queue to 1.0.2 and can confirm that it drastically reduced the R14 related errors/memory usage.

Just to follow up on this - it seemed to work in the short term, but unfortunately I am back to where I started... still seeing those R14/memory issues consistently.

@zdennis

zdennis commented Apr 20, 2025

We are experiencing the same unbounded memory growth issues with solid_queue, which results in OOMs. We're on solid-queue 1.1.3 and Ruby 3.3.7.

We can recreate this issue by running bin/jobs and letting it run, without enqueuing any jobs.

solid-queue-worker (1.1.3)

Here's what the memory size of the solid-queue worker process ("solid-queue-worker(1.1.3): waiting for jobs in *") looks like:

Image

It's being sampled every minute for about 75 minutes and grows from about 141MB to 840MB in that time. This was run in development with YJIT on and eager loading off. We have run it with YJIT off and eager loading off, and the same issue persists.
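For anyone wanting to reproduce this kind of measurement, a minimal sketch of a once-a-minute RSS sampler (hypothetical, not the tool used for the charts above; assumes a Unix ps):

# rss_sampler.rb -- run as: ruby rss_sampler.rb <pid of the solid-queue worker>
pid = Integer(ARGV.fetch(0))
loop do
  rss_mb = `ps -o rss= -p #{pid}`.to_i / 1024 # resident set size in MB
  puts "#{Time.now} rss=#{rss_mb}MB"
  sleep 60
end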

solid-queue-supervisor (1.1.3)

Here's what the memory size of the solid-queue supervisor process solid-queue-supervisor(1.1.3) looks like over the same time frame:

Image

Memory here also seems to grow but at a much slower rate.

Memory growth

This issue persists in all environments regardless of YJIT, eager loading, or, seemingly, other environmental factors. If left unchecked, solid-queue will eventually hit OOM errors, causing the pod/container it is running in to be killed.

The memory growth for the worker never seems to plateau. We have tried increasing memory, and Solid Queue will just run until it consumes all of it. Because the worker grows at a much faster rate than the supervisor, it's unclear whether the supervisor will also experience unbounded memory growth. I suspect it will, given that memory growth is occurring when there are no jobs present.

No jobs during this benchmark

Just wanting to call out that there are no jobs being enqueued at all. It's an empty database.

solid-queue 1.1.4 behavior is up next

I'm now running with solid-queue 1.1.4, and so far the issue appears to exist there as well.

@rosa
Member

rosa commented Apr 20, 2025

Alright, I'm going to revert the change in #417 since that's looking like the most likely culprit as @IvanPakhomov99 said. I hope that fixes this problem.

Thanks a lot for the tests @zdennis and for writing this up 🙏

@zdennis

zdennis commented Apr 20, 2025

Thanks, @rosa . 🙇 Let us know when/how to test, and I'm happy to pull it down and re-run the same tests.

rosa added a commit that referenced this issue Apr 20, 2025
…rruptible

This is looking like the most likely culprit of a memory leak pointed
out for multiple people on #330
rosa added a commit that referenced this issue Apr 20, 2025
…rruptible

For all Rubies, and not just Ruby < 3.2.
This is looking like the most likely culprit of a memory leak pointed
out for multiple people on #330
rosa added a commit that referenced this issue Apr 20, 2025
…rruptible

For all Rubies, and not just Ruby < 3.2.
This is looking like the most likely culprit of a memory leak pointed
out for multiple people on #330
@rosa
Member

rosa commented Apr 20, 2025

Thanks a lot @zdennis, really appreciate it 🙏 I've just pushed version 1.1.5 with that change reverted, I hope this is it 🤞

@zdennis

zdennis commented Apr 20, 2025

solid_queue 1.1.5 is looking much better. After about 80 minutes so far, memory isn't growing. 🥳

Worker memory 🟢

Image

Plateaus right around 149MB for us.

Supervisor memory 🟢

And the supervisor also plateaus, right around 159MB:

Image

Concurrent::Promises.future issue

I wonder if this is related to the memory leak reported for Concurrent::Promises.future usage in ruby-concurrency/concurrent-ruby#960.

Thanks for pushing out 1.1.5 🙇

Thanks for quickly pushing out 1.1.5, @rosa! It is looking like it may be the ticket here. I'll keep this test running overnight (because I'm curious) and will post back with those results as well.

@zdennis

zdennis commented Apr 21, 2025

Here is the memory usage on the worker and supervisor processes over 15 hours. Again, this is with no jobs being processed.

Still looks great compared to prior releases. We'll bump and test this out in a live environment. I'll share results later this week or next week after it's had some time running in the wild.

Worker memory

Worker memory is still very low at 152MB.

Image

Supervisor memory

Supervisor memory is up to about 165.7MB. It very slowly grew from about 156MB to 160MB and then jumped up to 165MB after about 14 hours.

Image

I haven't been benchmarking/profiling GC major/minor runs, and we're not manually invoking GC.start, so it's not clear whether this is normal growth by Ruby or whether there is still a small leak in the supervisor process. Either way, 1.1.5 continues to look good.

@rafael-pissardo

rafael-pissardo commented Apr 23, 2025

@zdennis any surprises so far?

@zdennis

zdennis commented Apr 24, 2025

We've had it running in production since late Monday and we haven't had an OOM since. Here's a chart showing how over a similar timeframe (from Saturday to Monday) we did see an OOM with solid_queue 1.1.3. And then over the same number of days, Monday evening to today, we haven't seen an OOM and memory looks much better.

Image

We did notice that memory usage grows when polling (and when there are no active jobs). We have some recurring jobs and then we have past jobs. We are not invoking GC.start manually so I suspect the memory growth we're seeing with polling may be a result of Ruby not garbage collecting for a while, but this is just a guess.

I noticed solid_queue also uses Concurrent::Promises.future_on. I'm not sure whether it is susceptible to the same issue as Concurrent::Promises.future.

I'll post back again early next week after it's been up for a week, but so far it is still looking good. 🤞

@hms
Contributor

hms commented Apr 24, 2025

@rosa I’m not sure how relevant this is, but it might be worth taking another look at this commit: a152f26. It looks like interruptible_sleep was updated to use Promises.future on each call. Given how this method is used — often multiple times within a loop — it’s quite possible this could lead to increased memory consumption.

Even though the block uses .value, making it synchronous from the caller’s perspective, a new thread or fiber is still created under the hood for each call.

Sorry if that’s not the case — I might not be seeing the full picture. Just reviewing the changes between versions 1.1.0 and 1.0.2.

I've instrumented both the original (and now current) implementation of Interruptible and the change I made in [a152f26], and they result in exactly the same number of calls to poll. This strongly suggests the Interruptible implementation does not change SQ's behavior in terms of the number of iterations of the poll loop.

I've built a test harness to allow for isolating various configurations (Ruby versions, Rails versions, Database configurations, and SolidQueue configurations).

I've tested various implementations of Concurrent::Promises.future including the Interruptible implementation in the PR that I submitted and can also confirm that it does not leak.

While testing without any jobs (just worker polling), what I'm seeing is a lot of garbage being generated in between major GCs, which can be exacerbated by a shorter polling_interval, and which results in constant and significant memory growth. This forces Ruby to allocate more slabs to handle the object allocations. While the objects are GC'able, the slabs are forever (trackable via GC.stat and GC.stat_heap).

When I switch to jemalloc and MALLOC_MAX_ARENA=2, which is how my Heroku and local dev are configured, I don't see the memory growth issues at all.

More digging forthcoming.
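One way to watch for the slab growth described above (a sketch, not something from this thread): sample GC.stat inside the idling worker and compare over time, since heap_allocated_pages only ever grows within a process:

# Run periodically inside the worker process (e.g. from a console or a tiny recurring job)
s = GC.stat
puts "pages: #{s[:heap_allocated_pages]}  live slots: #{s[:heap_live_slots]}  major GCs: #{s[:major_gc_count]}"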

@rafael-pissardo

Also confirming here that the leak has been resolved in v1.1.5. 👍

Image

@jrodden1

jrodden1 commented May 13, 2025

@zdennis Any updates on how its been running in prod for you?

@rosa, I hate to rain on the parade, but I'm still having the memory growth issue even after upgrading to 1.1.5 on heroku. 😕

Image
Context

  • Rails 8.0.2
  • Ruby 3.4.2
  • Solid Queue 1.1.5
    • Solid Queue is configured for a single database setup.
  • I'm running 1 web dyno for puma and 1 worker dyno for solid_queue on heroku.
  • I'm using the jemalloc build pack on heroku with JEMALLOC_ENABLED set to true
  • I've tried using bundle exec bin/jobs as well as bundle exec rails solid_queue:start in my Procfile. It was just this afternoon when I switched to trying rails solid_queue:start in the Procfile so I'll report on how that goes... but it doesn't look promising as the memory usage is still climbing in a similar fashion.

Thoughts
I'm testing out this setup as a staging environment on heroku for now, but I've got about 2 weeks until I need to deploy this to prod for my client.

I have 5 recurring jobs specified in my recurring.yml. These are pretty much the only jobs running (except for a mailer that runs once per day, triggered by one of the jobs). Most of the time the queue is empty/idle, but memory usage continually grows. I'm curious whether the reason I'm having the issue while others are not is that I have multiple recurring jobs (one of which runs every minute)? 🤔

I've included the content of my config files. Perhaps I have something incorrectly configured? I'm open to thoughts or suggestions. 🙏 Should I try adding a recurring job that forces garbage collection to run or something like that as a workaround?

Solid Queue Config

# config/queue.yml
default: &default
  dispatchers:
    - polling_interval: 1
      batch_size: 500
  workers:
    - queues: "*"
      threads: 2
      processes: <%= ENV.fetch("JOB_CONCURRENCY", 1) %>
      polling_interval: 1

development:
  <<: *default
  database: app_development

test:
  <<: *default
  database: app_test

production:
  <<: *default
  url: <%= ENV["DATABASE_URL"] %>

SQ Recurring Jobs Config
(I've generic-ized the names of the jobs as this project is for a client)

# config/recurring.yml
recurring_maintenance: &recurring_maintenance
  job_one:
    class: "JobOne"
    priority: 0
    schedule: "every 60 seconds"
  job_two:
    class: "JobTwo"
    priority: 1
    schedule: "every 5 minutes"
  job_three:
    class: "JobThree"
    priority: 1
    schedule: "every 30 minutes"
  job_four:
    class: "JobFour"
    priority: 1
    schedule: "every weekday at 9am America/Detroit"
  job_five:
    class: "JobFive"
    priority: 1
    schedule: "30 15 15 * * 1-5 America/Detroit"

development:
  <<: *recurring_maintenance

production:
  <<: *recurring_maintenance

5/14 Next Morning Update
I checked on Heroku this morning, and switching to rails solid_queue:start still exhibits the same behavior as bin/jobs.

Image

@zdennis

zdennis commented May 14, 2025

@jrodden1 wrote:

@zdennis Any updates on how its been running in prod for you?

Our staging and production environments are still looking good for us.

Image

This isn't to say there's not another issue going on. Memory growth still occurs, but it's no longer unbounded and memory gets reclaimed. We have more memory available in our clusters than it appears you have in your Heroku app, so it could be that with a lower limit we'd be seeing some OOM issues too.

I am eagerly awaiting @hms's continued investigation, as I think he may be onto something. I'd love to dive in deeper, but I don't have the bandwidth right now.

Have you tried setting MALLOC_MAX_ARENA=2 as @hms mentioned?

@hms
Contributor

hms commented May 14, 2025

I have to admit I reprioritized my personal work: given it was assumed my patch was the root cause, and since my code was removed, I moved on.

I can resume my digging, but I think I need some help.

There are too many permutations between Ruby versions, Rails versions, and databases (and database versions). The community needs to identify a small subset of configurations to be our test beds, as it's a ton of work for me to manage all of the versions that have been mentioned in this thread.

Secondly, I need better info on whether people are seeing the leak with SQ simply idling, running standard jobs, and/or running recurring jobs. That's also a lot of permutations to manage. Even worse, idling and standard jobs run on the Worker while recurring jobs run on the Scheduler, making monitoring harder.

What I've found so far is as follows:

  • I was never able to reproduce the leak with 1.1.4 when letting SQ idle (suggested above as observable). If it was leaking, it was small enough that I didn't see it over my monitoring window.
  • I didn't see any real differences (other than what would be expected due to the slight differences in the code) in memory usage or object counts between V1.1.4 and V1.1.5 during my idle testing.
  • The Supervisor memory footprint growing over time is confusing and damning. The Supervisor doesn't really do much after SQ has launched, other than heartbeating via a separate thread and waiting for, and cleaning up, broken pipes from sub-workers terminating (and, as a side note, it wasn't touched by the code changes in question).

One test I haven't run yet, but which has me curious, is around Interruptible sleep. In the 1.1.5 release, Rosa kept (by accident???) one change that I made that removed a lot of unnecessary worker polling. If the problem is in some way tied to the poll loop, then putting back the original high-volume poll loop should make it a little easier to spot.

For those of you suffering from OOM issues, here is what I'm doing these days, which has been working well for months (see the sketch after this list):

  • I have all of my jobs that generate OOM pressure restricted to a single queue with single-threaded workers.

  • Using an around_perform block, I check memory before and after the job execution (the before info just helps monitor the jump for specific job types and isn't strictly needed).

  • If the Worker is over my memory limit (technically, it's SQ's aggregate memory footprint, but I only restart the memory-heavy worker):

    • execute Process.kill :SIGTERM, $$; sleep 0.2 -- which in theory allows the existing job to complete the required AJ and SQ housekeeping

    • prevents new jobs from starting, since SIGTERM is a graceful shutdown and terminates the poll loop. Note: there is a small race between the signal being processed and the shutdown flag actually being set, which is what prevents another job from starting. This is the reason for the sleep -- the signal is processed asynchronously, and the short sleep allows it to propagate. The sleep delays the job from finishing, which has the side effect of preventing a new job from starting.

    • causes the worker in question to terminate -- releasing all memory used by that worker

    • allows the Supervisor to restart it (hopefully without any pending jobs which take a long time before they run)

This approach has some pros and cons:

  • Pro: I don't get spurious R14 errors anymore. For me, this is big.
  • Con: It adds some overhead and delay for the impacted jobs. For me, this is small.
  • Con: It doesn't play nice with YJIT, since we're frequently restarting this worker. For me this is small and, with a small PR, could be managed better.

Despite the above, it works well enough that I have abandoned a PR to implement Worker restarts as native feature of SQ.

It should be noted that this works much more cleanly with a dedicated queue / single-threaded worker and known jobs that generate memory pressure. If you kill a worker with multiple threads, there is the risk of killing an active job, and with the current SQ design that job can take a while to be rerun or, even worse, can get caught in a run/die loop.
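A minimal sketch of the restart approach described above, assuming Linux/macOS and an illustrative 400 MB limit; the class name, queue name, threshold, and RSS check are hypothetical and not part of Solid Queue:

# app/jobs/memory_guarded_job.rb -- hypothetical base class for the memory-heavy queue
class MemoryGuardedJob < ApplicationJob
  MEMORY_LIMIT_MB = 400 # assumed per-worker budget; tune to the dyno/container size

  queue_as :heavy # assumed dedicated queue served by a single-threaded worker

  around_perform do |_job, block|
    begin
      block.call
    ensure
      rss_mb = `ps -o rss= -p #{Process.pid}`.to_i / 1024 # resident set size in MB
      if rss_mb > MEMORY_LIMIT_MB
        # Graceful shutdown: this worker finishes its AJ/SQ housekeeping and exits,
        # then the supervisor forks a fresh worker with a clean heap.
        Process.kill(:SIGTERM, Process.pid)
        sleep 0.2 # let the signal propagate before the job returns, so no new job starts
      end
    end
  end
end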

@jrodden1

I've done some further troubleshooting today.

I decided to just toggle off all the recurring jobs (just commented them all out of recurring.yml) and didn't manually run any jobs through the queue.

And... the memory usage stayed fairly consistent! So it doesn't look like it's growing wildly while just sitting idle.

Image

Since I didn't have any recurring jobs specified, the Scheduler process wasn't running.

[1] pry(main)> SolidQueue::Process.all
  SolidQueue::Process Load (3.9ms)  SELECT "solid_queue_processes".* FROM "solid_queue_processes"
[
  [0] #<SolidQueue::Process:0x00007f98a1a6f2c0> {
                   :id => 156,
                 :kind => "Supervisor",
    :last_heartbeat_at => 2025-05-14 14:54:31.070768000 MDT -06:00,
        :supervisor_id => nil,
                  :pid => 2,
             :hostname => "redacted",
             :metadata => {},
           :created_at => 2025-05-14 11:52:29.415980000 MDT -06:00,
                 :name => "supervisor-redacted"
  },
  [1] #<SolidQueue::Process:0x00007f98a3d1ed98> {
                   :id => 157,
                 :kind => "Worker",
    :last_heartbeat_at => 2025-05-14 14:54:31.086549000 MDT -06:00,
        :supervisor_id => 156,
                  :pid => 57,
             :hostname => "redacted",
             :metadata => {
      "polling_interval" => 1,
                "queues" => "*",
      "thread_pool_size" => 2
    },
           :created_at => 2025-05-14 11:52:29.449116000 MDT -06:00,
                 :name => "worker-redacted"
  },
  [2] #<SolidQueue::Process:0x00007f98a3d1ec58> {
                   :id => 158,
                 :kind => "Dispatcher",
    :last_heartbeat_at => 2025-05-14 14:54:31.102366000 MDT -06:00,
        :supervisor_id => 156,
                  :pid => 53,
             :hostname => "redacted",
             :metadata => {
                      "polling_interval" => 1,
                            "batch_size" => 500,
      "concurrency_maintenance_interval" => 600
    },
           :created_at => 2025-05-14 11:52:29.445760000 MDT -06:00,
                 :name => "dispatcher-redacted"
  }
]

This seems to narrow it down to something that happens when recurring jobs are being used...

I also found out that Heroku (for new apps since 2019) sets MALLOC_ARENA_MAX=2 by default
(I'm not sure if it was a typo, but @hms mentioned MALLOC_MAX_ARENA?).

I have tried setting both versions of the ENV var mentioned previously and didn't notice any difference in memory usage.

Also, I am using Heroku Postgres as my DB.

@jrodden1

Huzzah! Success!

I turned my recurring jobs back on toward the end of the work day yesterday and also changed my threads setting to 1 in queue.yml.
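The relevant change, sketched against the queue.yml posted earlier in this comment (only the worker threads value differs):

# config/queue.yml -- same as before, with worker threads reduced from 2 to 1
default: &default
  dispatchers:
    - polling_interval: 1
      batch_size: 500
  workers:
    - queues: "*"
      threads: 1
      processes: <%= ENV.fetch("JOB_CONCURRENCY", 1) %>
      polling_interval: 1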

Image

With this setup, this is the memory graph I'm getting now.

Image

While I don't have a lot of 'headroom' (only around ~100MB), I'm happy to see that the usage leveled off.

I'll be curious to see if this changes if/when I add additional recurring jobs, but for now, I think I'm good!

Thanks for the input @hms, @zdennis!
