Need advice on Solid Queue's memory usage #330


Open
yjchieng opened this issue Sep 7, 2024 · 44 comments

Comments

@yjchieng

yjchieng commented Sep 7, 2024

Ruby: 3.3.4
Rails: 7.2.1
Solid Queue: 0.7.0, 0.8.2


I run a Rails app on an AWS EC2 instance with 1 GB of memory.
I've noticed the Solid Queue process takes up 15-20% of the instance's memory, making it the single largest process by memory usage.


What I checked:

  1. Checked memory usage by stopping/starting the process via supervisorctl
     (which I use to manage my Solid Queue process)

stop via supervisorctl - free memory 276MB
start via supervisorctl - free memory 117MB

Memory usage increases by 159MB.

  2. Stopped the supervisorctl service and ran "solid_queue:start" directly,
     to see whether this was something related to supervisor.

before solid_queue:start - free memory 252MB
after solid_queue:start - free memory 109MB

Memory usage increases by 143MB.

  3. Then I noticed there was a newer version and upgraded to 0.8.2 (was 0.7.0).

stop via supervisorctl - free memory 220MB
start via supervisorctl - free memory 38MB

Memory usage increases by 182MB.


I need some advice:

  1. Is 150-200MB the minimum requirement to run "solid_queue:start"?
  2. Is there any setting/feature that I can switch off to reduce memory usage?
  3. Is there any setting to limit the maximum memory usage?

And, thanks a lot for making this wonderful gem. :)

@rosa
Member

rosa commented Sep 7, 2024

Hey @yjchieng, thanks for opening this issue! 🙏 I think it depends a lot on your app. A brand new Rails app seems to use around 74.6MB of memory for me after booting (without Solid Queue, just running Puma). I think the consumption you're seeing comes from all the processes together, not just the supervisor: measuring free memory before and after starting the supervisor captures everything it forks. Are you running multiple workers or just one? Reducing the number of workers would help. Another thing that might help is using bin/jobs, which preloads the whole app before forking, but the gains there are usually quite modest.
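For reference, a minimal sketch of the kind of low-memory setup described above: a single dispatcher plus one single-threaded worker process (the numbers are illustrative, not a recommendation from this thread):

# config/queue.yml -- hypothetical low-memory configuration
production:
  dispatchers:
    - polling_interval: 1
      batch_size: 500
  workers:
    - queues: "*"
      threads: 1
      processes: 1
      polling_interval: 1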

@rosa
Member

rosa commented Sep 7, 2024

There might also be something else going on, because the only changes from version 0.7.0 to 0.8.2 were to the installation part of Solid Queue; nothing besides the initial installation changed, so the memory footprint shouldn't have changed. I imagine there is other stuff running on your AWS instance at the same time that might be consuming memory as well.

@Focus-me34

Focus-me34 commented Oct 28, 2024

Up 🆙🔥

I have huge memory issues in production (Rails 7.2 + Active Job + Solid Queue). Everything works just fine in dev mode, but in production there seems to be a memory leak. After restarting my production server, I'm at roughly ~75% RAM usage. Very quickly (we're talking minutes...) I get to ~100%. And if I let the app run over the weekend and come back on Monday (like today), I'm at... 288% RAM usage... I tried removing all the lines in my code related to Solid Queue, and I can confirm that this is what's causing the memory issue in production.

The exact error codes I'm getting, causing my app to crash in production (Heroku), are R14 and R15.

Any advice/suggestions would be very much appreciated fellow devs. Have an amazing day!

@rosa
Member

rosa commented Oct 28, 2024

@Focus-me34, what version of Solid Queue are you running? And when you say you're removing anything related to Solid Queue, what Active Job adapter are you using instead?

@Focus-me34

@rosa I'm using Solid Queue version 1.0.0. I checked all the sub-dependency versions; they all meet the requirements.
We haven't really tried any other adapter, since Solid Queue will be the default adapter in Rails 8. Hence, we really want to make it work this way.

Here's some of our setup code:

# scrape_rss_feed_job.rb
class ScrapingJob < ApplicationJob
  queue_as :default
  limits_concurrency to: 1, key: -> { "rss_feed_job" }, duration: 1.minute

  def perform
    Api::V1::EntriesController.fetch_latest_entries
  end
end

# recurring.yml
default: &default
  periodic_cleanup:
    class: ScrapeSecRssFeedJob
    schedule: every 2 minutes

development:
  <<: *default

test:
  <<: *default

production:
  <<: *default

# queue.yml
default: &default
  dispatchers:
    - polling_interval: 1
      batch_size: 500
      concurrency_maintenance_interval: 15
  workers:
    - queues: "*"
      threads: 3
      processes: <%= ENV.fetch("JOB_CONCURRENCY", 1) %>
      polling_interval: 0.1

development:
  <<: *default

test:
  <<: *default

production:
  <<: *default

Do you see anything weird?

@rosa
Member

rosa commented Oct 29, 2024

No, that looks good to me re: configuration. You said:

I tried removing all the lines in my code related to solidQueue, and I can confirm that this is what's causing the memory issue in production.

So, if you don't get any memory issues when not using Solid Queue, and you're not using another adapter, is that because you're not running any jobs at all? Because that would point to the jobs having a memory leak, not Solid Queue.

@Focus-me34

Focus-me34 commented Nov 7, 2024

Hey Rosa, sorry for the delayed reply! I've been very busy at work.

Here’s where we're at: I've been working on getting our company’s code running smoothly with Rails 7.2 and Solid Queue (in production Heroku). As I mentioned earlier, it’s been a huge challenge, and unfortunately, we haven’t had much success with it.

My colleague and I decided to take a closer look at our code to see if the problem was on our end. Since my last comment here, we’ve implemented tests, and I can confirm that the code is behaving exactly as expected.

Our next step, after troubleshooting the high memory usage on Heroku, was to switch away from Solid Queue and try a different job adapter (as you suggested). I set up Sidekiq as the adapter, and we saw a drastic improvement: memory usage dropped from around 170% of our 512 MB quota to a range of 25%-70%.

This leads me to believe that there might be a memory leak in production when using Solid Queue. From our observations, it seems that after the initial job execution completes, instance variables at the top of the function (which should reset to nil at the start of each job) are retaining the values from the previous iteration. We suspect this might be preventing the Garbage Collector from clearing memory properly between jobs.

Let me know if there's any more information I can provide to help you investigate. We’re really looking forward to moving back to using the built-in Solid Queue functionality once this issue is resolved.

[Edit: The job we're running involves two main dependencies. We scrape an RSS feed using Nokogiri and fetch a URL for each entry using httparty]
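For anyone running the same comparison, switching the Active Job adapter to Sidekiq is a one-line Rails setting (a sketch only; the Sidekiq server and Redis setup are separate concerns not covered in this thread):

# config/environments/production.rb -- hypothetical adapter swap for comparison
Rails.application.configure do
  config.active_job.queue_adapter = :sidekiq # switch back to :solid_queue once the issue is resolved
end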

@rajeevriitm

I am observing high memory usage when Solid Queue is used on Heroku as well. Have there been any solutions to this issue? Is there a temporary fix that can be used for now?

@arikarim

Solid Queue with the Puma plugin is using a stupidly high amount of memory.

@rosa
Member

rosa commented Mar 24, 2025

Hey @rajeevriitm, no, no real solutions. I had an async adapter that would run the supervisor, workers and dispatchers and everything together in the same process, which would save memory. However, this was scrapped from the 1.0 version. I need to push for that again.

In the meantime, you can try using rake solid_queue:start to run Solid Queue, since by default that won't preload your whole app, and see if that makes a difference. What's your configuration like?
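As a sketch of the two boot options discussed in this thread (the Procfile entry names are illustrative):

# Procfile -- hypothetical worker entries
worker: bundle exec rake solid_queue:start   # does not preload the whole app by default
# worker: bundle exec bin/jobs               # preloads the app before forking the processes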

@arikarim, thanks for your very helpful and useful comment 😒

@arikarim

Haha, sorry, I was just so tired of this...
until I found out that I was running rails s instead of bundle exec puma.
My bad.

@arikarim

I still think there are some strange issues: with Puma concurrency greater than 0, memory goes up 😢

@rajeevriitm

@rosa I have a simple configuration: the queue runs in a single-server setup, and I have a continuously running job that runs every 30 minutes.

queue.yml

default: &default
  dispatchers:
    - polling_interval: 2
      batch_size: 500
  workers:
    - queues: "*"
      threads: 1
      processes: 1
      polling_interval: 2

development:
  <<: *default

test:
  <<: *default

production:
  <<: *default

puma.rb

threads_count = ENV.fetch("RAILS_MAX_THREADS", 3)
threads threads_count, threads_count
port ENV.fetch("PORT", 3000)
plugin :tmp_restart

# Run the Solid Queue supervisor inside of Puma for single-server deployments
plugin :solid_queue 

pidfile ENV["PIDFILE"] if ENV["PIDFILE"]

@arikarim

@rosa there seems to be a problem with newer versions of Solid Queue. I downgraded my solid_queue gem to 1.1.0 and the memory issue is fixed. 😄

@IvanPakhomov99

IvanPakhomov99 commented Mar 26, 2025

+1, I'm running the latest solid_queue version and am definitely experiencing memory issues. Once I start the container, memory usage keeps increasing indefinitely until it hits the maximum capacity and the container restarts.

I also tried switching from bin/jobs to a Rake task, but that didn’t help either.

Note: I don't run any jobs.

default: &default
  dispatchers:
    - polling_interval: 1
      batch_size: 500
  workers:
    - queues: "*"
      threads: 3
      processes: <%= ENV.fetch("JOB_CONCURRENCY", 1) %>
      polling_interval: 0.1

development:
  <<: *default

test:
  <<: *default

production:
  <<: *default

Image

@eliasousa

eliasousa commented Mar 29, 2025

hi 👋 I'm experiencing the same issue:

Ruby: 3.4.2
Rails: 8.0.2
Solid Queue: 1.1.4
Puma: plugin :solid_queue

Running on a DigitalOcean droplet with 1 GB of memory. I'll try downgrading the solid_queue gem version to see if it improves.

@rosa
Member

rosa commented Mar 29, 2025

Hey all, so sorry about this! I've been swamped with other stuff at work, but I'm going to look into this on Monday.

@rajeevriitm

rajeevriitm commented Mar 30, 2025 via email

@rosa
Member

rosa commented Mar 31, 2025

@IvanPakhomov99, @rajeevriitm, could you try downgrading to version v1.1.2 and let me know if the issue persists? Also, what version of Ruby are you using?

@IvanPakhomov99

Hi @rosa, Thanks for your prompt response. I tested four different Solid Queue versions (1.1.0, 1.1.1, 1.1.2, and 1.1.4) but encountered the same issue across all of them.

Environment:

  • Ruby: 3.4.1
  • Rails: 8.0.1
  • Database: MySQL 8.0 (Solid Queue runs on the main instance)

Let me know if you need any additional details.

@rajeevriitm

@rosa I tried downgrading. The issue exists in 1.1.2 as well.

Ruby 3.4.1
Rails 8.0.1

Happy to help.

@rajeevriitm

Hey @rosa, were you able to identify the issue causing the memory leak?

@rosa
Member

rosa commented Apr 4, 2025

@rajeevriitm, I'm afraid I wasn't 😞 I reviewed all the code from v1.1.0 and didn't identify anything that could leak memory. This was before @IvanPakhomov99 shared that testing 1.1.0, 1.1.1, 1.1.2, and 1.1.4 made no difference. Then I tried running jobs of different kinds (recurring jobs of different types being enqueued every 5 seconds, long-running jobs, etc.) over a couple of days and couldn't reproduce any memory leaks. I think whatever is happening depends on what your jobs are doing or what you're loading in your app.

I also wonder whether this is not a memory leak but simply high memory usage. Solid Queue runs a different process for each worker, a process for the dispatcher, another process for the scheduler, and another for the supervisor. All those processes load your app, and even though the fork is done after loading the app, so CoW should ensure memory sharing, this is not the same as running a single process.

I don't have a better idea to reduce memory usage other than providing a single-process execution mode (what I've called "async" mode).
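To see this per-process layout on a running system, one option (a sketch, assuming a Rails console connected to the queue database) is to query the process registry that Solid Queue maintains:

# Rails console; the kinds and pids below are illustrative
SolidQueue::Process.pluck(:kind, :pid)
# => [["Supervisor", 2], ["Dispatcher", 53], ["Worker", 57]]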

@IvanPakhomov99

@rosa Thanks for taking the time to look into this! The key detail here is that it's a brand new service and I'm not running any jobs yet. I also confirmed that the service pod's memory is stable.

That said, I actually have an update — I downgraded to version 1.0.2, and memory usage looks stable now. I’ll stick with this version for now.

@chayuto

chayuto commented Apr 5, 2025

I have the same issue on Heroku too, with version 1.1.4.

As mentioned above, downgrading to version 1.0.2 solved the issue.

Image

@jeffcoh23

@rosa Thank you for your attentiveness to this issue. I just downgraded solid_queue to 1.0.2 and can confirm that it drastically reduced the R14 related errors/memory usage.

@rosa
Member

rosa commented Apr 7, 2025

Oh! Thanks a lot for confirming that! I had only looked down to version 1.1.0. I'm on-call this week so a bit short on time but will try to figure out why the memory increased from that version to 1.1.0.

@IvanPakhomov99

IvanPakhomov99 commented Apr 7, 2025

@rosa I’m not sure how relevant this is, but it might be worth taking another look at this commit: a152f26. It looks like interruptible_sleep was updated to use Promises.future on each call. Given how this method is used — often multiple times within a loop — it’s quite possible this could lead to increased memory consumption.

Even though the block uses .value, making it synchronous from the caller’s perspective, a new thread or fiber is still created under the hood for each call.

Sorry if that’s not the case — I might not be seeing the full picture. Just reviewing the changes between versions 1.1.0 and 1.0.2.
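A simplified illustration of the suspected pattern (not the actual Solid Queue code): each call builds a new future on concurrent-ruby's executor, so a tight polling loop allocates a future and its associated pool bookkeeping on every iteration, even though .value makes it look synchronous to the caller:

require "concurrent"

# Hypothetical stand-in for an interruptible sleep built on Promises.future
def future_backed_sleep(seconds)
  Concurrent::Promises.future { sleep seconds }.value # blocks the caller, but still schedules work on a pool
end

# Polling-loop shape: one future allocated per iteration
5.times { future_backed_sleep(0.1) }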

@jeffcoh23

jeffcoh23 commented Apr 12, 2025

@rosa Thank you for your attentiveness to this issue. I just downgraded solid_queue to 1.0.2 and can confirm that it drastically reduced the R14 related errors/memory usage.

Just to follow up on this - it seemed to work in the short term, but unfortunately I am back to where I started... still seeing those R14/memory issues consistently.

@zdennis

zdennis commented Apr 20, 2025

We are experiencing the same unbounded memory growth issues with solid_queue, which results in OOMs. We're on solid-queue 1.1.3 and Ruby 3.3.7.

We can recreate this issue by running bin/jobs and letting it run, without enqueuing any jobs.

solid-queue-worker (1.1.3)

Here's what the memory size of the solid-queue worker process ("solid-queue-worker(1.1.3): waiting for jobs in *") looks like:

Image

It's being sampled every minute for about 75 minutes and grows from about 141MB to 840MB in that time. This was run in development with YJIT on and eager loading off. We have run it with YJIT off and eager loading off, and the same issue persists.
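For anyone wanting to reproduce this kind of measurement, a minimal sketch of a once-a-minute RSS sampler (hypothetical, not the tool used for the charts above; assumes a Unix ps):

# rss_sampler.rb -- run as: ruby rss_sampler.rb <pid of the solid-queue worker>
pid = Integer(ARGV.fetch(0))
loop do
  rss_mb = `ps -o rss= -p #{pid}`.to_i / 1024 # resident set size in MB
  puts "#{Time.now} rss=#{rss_mb}MB"
  sleep 60
end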

solid-queue-supervisor (1.1.3)

Here's what the memory size of the solid-queue supervisor process solid-queue-supervisor(1.1.3) looks like over the same time frame:

Image

Memory here also seems to grow but at a much slower rate.

Memory growth

This issue persists in all environments regardless of YJIT, eager loading, or, seemingly, other environmental factors. If left unchecked, solid-queue will eventually hit OOM errors, causing the pod/container it is running in to be killed.

The memory growth for the worker never seems to plateau. We have tried increasing memory, and Solid Queue will just run until it consumes all of it. Because the worker grows at a much faster rate than the supervisor, it's unclear whether the supervisor will also experience unbounded memory growth. I suspect it will, given that memory growth is occurring when there are no jobs present.

No jobs during this benchmark

Just wanting to call out that there are no jobs being enqueued at all. It's an empty database.

solid-queue 1.1.4 behavior is up next

I'm now running with solid-queue 1.1.4, and so far the issue appears to exist there as well.

@rosa
Member

rosa commented Apr 20, 2025

Alright, I'm going to revert the change in #417 since that's looking like the most likely culprit as @IvanPakhomov99 said. I hope that fixes this problem.

Thanks a lot for the tests @zdennis and for writing this up 🙏

@zdennis

zdennis commented Apr 20, 2025

Thanks, @rosa . 🙇 Let us know when/how to test, and I'm happy to pull it down and re-run the same tests.

rosa added a commit that referenced this issue Apr 20, 2025
…rruptible

This is looking like the most likely culprit of a memory leak pointed
out for multiple people on #330
rosa added a commit that referenced this issue Apr 20, 2025
…rruptible

For all Rubies, and not just Ruby < 3.2.
This is looking like the most likely culprit of a memory leak pointed
out for multiple people on #330
rosa added a commit that referenced this issue Apr 20, 2025
…rruptible

For all Rubies, and not just Ruby < 3.2.
This is looking like the most likely culprit of a memory leak pointed
out for multiple people on #330
@rosa
Member

rosa commented Apr 20, 2025

Thanks a lot @zdennis, really appreciate it 🙏 I've just pushed version 1.1.5 with that change reverted, I hope this is it 🤞

@zdennis

zdennis commented Apr 20, 2025

solid_queue 1.1.5 is looking much better. After about 80 minutes so far, memory isn't growing. 🥳

Worker memory 🟢

Image

Plateaus right around 149MB for us.

Supervisor memory 🟢

And the supervisor also plateaus, right around 159MB:

Image

Concurrent::Promises.future issue

I wonder if this is related to the memory leak reported for Concurrent::Promises.future usage in ruby-concurrency/concurrent-ruby#960.

Thanks for pushing out 1.1.5 🙇

Thanks for quickly pushing out 1.1.5, @rosa! It is looking like it may be the ticket here. I'll keep this test running overnight (because I'm curious) and will post back with those results as well.

@zdennis

zdennis commented Apr 21, 2025

Here is the memory usage on the worker and supervisor processes over 15 hours. Again, this is with no jobs being processed.

Still looks great compared to prior releases. We'll bump and test this out in a live environment. I'll share results later this week or next week after it's had some time running in the wild.

Worker memory

Worker memory is still very low at 152MB.

Image

Supervisor memory

Supervisor memory is up to about 165.7MB. It very slowly grew from about 156MB to 160MB and then jumped up to 165MB after about 14 hours.

Image

I haven't been benchmarking/profiling GC major/minor runs, and we're not manually invoking GC.start, so it's not clear whether this is normal growth by Ruby or whether there is still a small leak in the supervisor process. Either way, 1.1.5 continues to look good.

@rafael-pissardo

rafael-pissardo commented Apr 23, 2025

@zdennis any surprises so far?

@zdennis

zdennis commented Apr 24, 2025

We've had it running in production since late Monday and we haven't had an OOM since. Here's a chart showing how over a similar timeframe (from Saturday to Monday) we did see an OOM with solid_queue 1.1.3. And then over the same number of days, Monday evening to today, we haven't seen an OOM and memory looks much better.

Image

We did notice that memory usage grows when polling (and when there are no active jobs). We have some recurring jobs and then we have past jobs. We are not invoking GC.start manually so I suspect the memory growth we're seeing with polling may be a result of Ruby not garbage collecting for a while, but this is just a guess.

I noticed solid_queue also uses Concurrent::Promises.future_on. I'm not sure whether it is susceptible to the same issue as Concurrent::Promises.future.

I'll post back again early next week after it's been up for a week, but so far it is still looking good. 🤞

@hms
Contributor

hms commented Apr 24, 2025

@rosa I’m not sure how relevant this is, but it might be worth taking another look at this commit: a152f26. It looks like interruptible_sleep was updated to use Promises.future on each call. Given how this method is used — often multiple times within a loop — it’s quite possible this could lead to increased memory consumption.

Even though the block uses .value, making it synchronous from the caller’s perspective, a new thread or fiber is still created under the hood for each call.

Sorry if that’s not the case — I might not be seeing the full picture. Just reviewing the changes between versions 1.1.0 and 1.0.2.

I've instrumented both the original (and now current) implementation of Interruptible and the change I made in [a152f26], and they result in exactly the same number of calls to poll. This strongly suggests the Interruptible implementation does not change SQ's behavior in terms of the number of iterations of the poll loop.

I've built a test harness to allow for isolating various configurations (Ruby versions, Rails versions, Database configurations, and SolidQueue configurations).

I've tested various implementations of Concurrent::Promises.future including the Interruptible implementation in the PR that I submitted and can also confirm that it does not leak.

While testing without any jobs (just worker polling), what I'm seeing is a lot of garbage being generated in between major GCs, which can be exacerbated by a shorter polling_interval, and which results in constant and significant memory growth. This forces Ruby to allocate more slabs to handle the object allocations. While the objects are GC'able, the slabs are forever (trackable via GC.stat and GC.stat_heap).

When I switch to jemalloc and MALLOC_MAX_ARENA=2, which is how my Heroku and local dev are configured, I don't see the memory growth issues at all.

More digging forthcoming.
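One way to watch for the slab growth described above (a sketch, not something from this thread): sample GC.stat inside the idling worker and compare over time, since heap_allocated_pages only ever grows within a process:

# Run periodically inside the worker process (e.g. from a console or a tiny recurring job)
s = GC.stat
puts "pages: #{s[:heap_allocated_pages]}  live slots: #{s[:heap_live_slots]}  major GCs: #{s[:major_gc_count]}"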

@rafael-pissardo

Also confirming here that the leak has been resolved in v1.1.5. 👍

Image

@jrodden1

jrodden1 commented May 13, 2025

@zdennis Any updates on how its been running in prod for you?

@rosa, I hate to rain on the parade, but I'm still having the memory growth issue even after upgrading to 1.1.5 on heroku. 😕

Image
Context

  • Rails 8.0.2
  • Ruby 3.4.2
  • Solid Queue 1.1.5
    • Solid Queue is configured for a single database setup.
  • I'm running 1 web dyno for puma and 1 worker dyno for solid_queue on heroku.
  • I'm using the jemalloc build pack on heroku with JEMALLOC_ENABLED set to true
  • I've tried using bundle exec bin/jobs as well as bundle exec rails solid_queue:start in my Procfile. It was just this afternoon when I switched to trying rails solid_queue:start in the Procfile so I'll report on how that goes... but it doesn't look promising as the memory usage is still climbing in a similar fashion.

Thoughts
I'm testing out this setup as a staging environment on heroku for now, but I've got about 2 weeks until I need to deploy this to prod for my client.

I have 5 recurring jobs specified in my recurring.yml. These are pretty much the only jobs running (except for a mailer that runs once per day, triggered by one of the jobs). Most of the time the queue is empty/idle, but memory usage continually grows. I'm curious whether the reason I'm having the issue while others are not is that I have multiple recurring jobs (one of which runs every minute)? 🤔

I've included the content of my config files. Perhaps I have something incorrectly configured? I'm open to thoughts or suggestions. 🙏 Should I try adding a recurring job that forces garbage collection to run or something like that as a workaround?

Solid Queue Config

# config/queue.yml
default: &default
  dispatchers:
    - polling_interval: 1
      batch_size: 500
  workers:
    - queues: "*"
      threads: 2
      processes: <%= ENV.fetch("JOB_CONCURRENCY", 1) %>
      polling_interval: 1

development:
  <<: *default
  database: app_development

test:
  <<: *default
  database: app_test

production:
  <<: *default
  url: <%= ENV["DATABASE_URL"] %>

SQ Recurring Jobs Config
(I've generic-ized the names of the jobs as this project is for a client)

# config/recurring.yml
recurring_maintenance: &recurring_maintenance
  job_one:
    class: "JobOne"
    priority: 0
    schedule: "every 60 seconds"
  job_two:
    class: "JobTwo"
    priority: 1
    schedule: "every 5 minutes"
  job_three:
    class: "JobThree"
    priority: 1
    schedule: "every 30 minutes"
  job_four:
    class: "JobFour"
    priority: 1
    schedule: "every weekday at 9am America/Detroit"
  job_five:
    class: "JobFive"
    priority: 1
    schedule: "30 15 15 * * 1-5 America/Detroit"

development:
  <<: *recurring_maintenance

production:
  <<: *recurring_maintenance

5/14 Next Morning Update
I checked on Heroku this morning, and switching to rails solid_queue:start still exhibits the same behavior as bin/jobs.

Image

@zdennis

zdennis commented May 14, 2025

@jrodden1 wrote:

@zdennis Any updates on how its been running in prod for you?

Our staging and production environments are still looking good for us.

Image

This isn't to say there's not another issue going on. Memory growth still occurs, but it's no longer unbounded and memory gets reclaimed. We have more memory available in our clusters than it appears you have in your Heroku app, so it could be that with a lower limit we'd be seeing some OOM issues too.

I am eagerly awaiting @hms's continued investigation, as I think he may be onto something. I'd love to dive in deeper, but I don't have the bandwidth right now.

Have you tried setting MALLOC_MAX_ARENA=2 as @hms mentioned?

@hms
Contributor

hms commented May 14, 2025

I have to admit I reprioritized my personal work: given it was assumed my patch was the root cause, and since my code was removed, I moved on.

I can resume my digging, but I think I need some help.

There are too many permutations between Ruby versions, Rails versions, and databases (and database versions). The community needs to identify a small subset of configurations to be our test beds, as it's a ton of work for me to manage all of the versions that have been mentioned in this thread.

Secondly, I need better info on whether people are seeing the leak with SQ simply idling, running standard jobs, and/or running recurring jobs. That's also a lot of permutations to manage. Even worse, idling and standard jobs run on the Worker while recurring jobs run on the Scheduler, making monitoring harder.

What I've found so far is as follows:

  • I was never able to reproduce the leak with 1.1.4 when letting SQ idle (suggested above as observable). If it was leaking, it was small enough that I didn't see it over my monitoring window.
  • I didn't see any real differences (other than what would be expected due to the slight differences in the code) in memory usage or object counts between V1.1.4 and V1.1.5 during my idle testing.
  • The Supervisor memory footprint growing over time is confusing and damning. The Supervisor doesn't really do much after SQ has launched, other than heartbeating via a separate thread and waiting for, and cleaning up, broken pipes from sub-workers terminating (and, as a side note, it wasn't touched by the code changes in question).

One test I haven't run yet, but which has me curious, is around Interruptible sleep. In the 1.1.5 release, Rosa kept (by accident???) one change that I made that removed a lot of unnecessary worker polling. If the problem is in some way tied to the poll loop, then putting back the original high-volume poll loop should make it a little easier to spot.

For those of you suffering from OOM issues, here is what I'm doing these days, which has been working well for months (see the sketch after this list):

  • I have all of my jobs that generate OOM pressure restricted to a single queue with single-threaded workers.

  • Using an around_perform block, I check memory before and after the job execution (the before info just helps monitor the jump for specific job types and isn't strictly needed).

  • If the Worker is over my memory limit (technically, it's SQ's aggregate memory footprint, but I only restart the memory-heavy worker):

    • execute Process.kill :SIGTERM, $$; sleep 0.2 -- which in theory allows the existing job to complete the required AJ and SQ housekeeping

    • prevents new jobs from starting, since SIGTERM is a graceful shutdown and terminates the poll loop. Note: there is a small race between the signal being processed and the shutdown flag actually being set, which is what prevents another job from starting. This is the reason for the sleep -- the signal is processed asynchronously, and the short sleep allows it to propagate. The sleep delays the job from finishing, which has the side effect of preventing a new job from starting.

    • causes the worker in question to terminate -- releasing all memory used by that worker

    • allows the Supervisor to restart it (hopefully without any pending jobs which take a long time before they run)

This approach has some pros and cons:

  • Pro: I don't get spurious R14 errors anymore. For me, this is big.
  • Con: It adds some overhead and delay for the impacted jobs. For me, this is small.
  • Con: It doesn't play nice with YJIT, since we're frequently restarting this worker. For me this is small and, with a small PR, could be managed better.

Despite the above, it works well enough that I have abandoned a PR to implement Worker restarts as native feature of SQ.

It should be noted that this works much more cleanly with a dedicated queue / single-threaded worker and known jobs that generate memory pressure. If you kill a worker with multiple threads, there is the risk of killing an active job, and with the current SQ design that job can take a while to be rerun or, even worse, can get caught in a run/die loop.
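A minimal sketch of the restart approach described above, assuming Linux/macOS and an illustrative 400 MB limit; the class name, queue name, threshold, and RSS check are hypothetical and not part of Solid Queue:

# app/jobs/memory_guarded_job.rb -- hypothetical base class for the memory-heavy queue
class MemoryGuardedJob < ApplicationJob
  MEMORY_LIMIT_MB = 400 # assumed per-worker budget; tune to the dyno/container size

  queue_as :heavy # assumed dedicated queue served by a single-threaded worker

  around_perform do |_job, block|
    begin
      block.call
    ensure
      rss_mb = `ps -o rss= -p #{Process.pid}`.to_i / 1024 # resident set size in MB
      if rss_mb > MEMORY_LIMIT_MB
        # Graceful shutdown: this worker finishes its AJ/SQ housekeeping and exits,
        # then the supervisor forks a fresh worker with a clean heap.
        Process.kill(:SIGTERM, Process.pid)
        sleep 0.2 # let the signal propagate before the job returns, so no new job starts
      end
    end
  end
end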

@jrodden1

I've done some further troubleshooting today.

I decided to just toggle off all the recurring jobs (just commented them all out of recurring.yml) and didn't manually run any jobs through the queue.

And... the memory usage stayed fairly consistent! So it doesn't look like it's growing wildly while just sitting idle.

Image

Since I didn't have any recurring jobs specified, the Scheduler process wasn't running.

[1] pry(main)> SolidQueue::Process.all
  SolidQueue::Process Load (3.9ms)  SELECT "solid_queue_processes".* FROM "solid_queue_processes"
[
  [0] #<SolidQueue::Process:0x00007f98a1a6f2c0> {
                   :id => 156,
                 :kind => "Supervisor",
    :last_heartbeat_at => 2025-05-14 14:54:31.070768000 MDT -06:00,
        :supervisor_id => nil,
                  :pid => 2,
             :hostname => "redacted",
             :metadata => {},
           :created_at => 2025-05-14 11:52:29.415980000 MDT -06:00,
                 :name => "supervisor-redacted"
  },
  [1] #<SolidQueue::Process:0x00007f98a3d1ed98> {
                   :id => 157,
                 :kind => "Worker",
    :last_heartbeat_at => 2025-05-14 14:54:31.086549000 MDT -06:00,
        :supervisor_id => 156,
                  :pid => 57,
             :hostname => "redacted",
             :metadata => {
      "polling_interval" => 1,
                "queues" => "*",
      "thread_pool_size" => 2
    },
           :created_at => 2025-05-14 11:52:29.449116000 MDT -06:00,
                 :name => "worker-redacted"
  },
  [2] #<SolidQueue::Process:0x00007f98a3d1ec58> {
                   :id => 158,
                 :kind => "Dispatcher",
    :last_heartbeat_at => 2025-05-14 14:54:31.102366000 MDT -06:00,
        :supervisor_id => 156,
                  :pid => 53,
             :hostname => "redacted",
             :metadata => {
                      "polling_interval" => 1,
                            "batch_size" => 500,
      "concurrency_maintenance_interval" => 600
    },
           :created_at => 2025-05-14 11:52:29.445760000 MDT -06:00,
                 :name => "dispatcher-redacted"
  }
]

This seems to narrow it down to something that happens when recurring jobs are being used...

I also found out that Heroku (for new apps since 2019) sets MALLOC_ARENA_MAX=2 by default
(I'm not sure if it was a typo, but @hms mentioned MALLOC_MAX_ARENA?).

I have tried setting both versions of the ENV var mentioned previously and didn't notice any difference in memory usage.

Also, I am using Heroku Postgres as my DB.

@jrodden1

Huzzah! Success!

I turned my recurring jobs back on toward the end of the work day yesterday and also changed my threads setting to 1 in queue.yml.
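The relevant change, sketched against the queue.yml posted earlier in this comment (only the worker threads value differs):

# config/queue.yml -- same as before, with worker threads reduced from 2 to 1
default: &default
  dispatchers:
    - polling_interval: 1
      batch_size: 500
  workers:
    - queues: "*"
      threads: 1
      processes: <%= ENV.fetch("JOB_CONCURRENCY", 1) %>
      polling_interval: 1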

Image

With this setup, this is the memory graph I'm getting now.

Image

While I don't have a lot of 'headroom' (only around ~100MB), I'm happy to see that the usage leveled off.

I'll be curious to see if this changes if/when I add additional recurring jobs, but for now, I think I'm good!

Thanks for the input @hms, @zdennis!
