
Conversation

kallebysantos
Contributor

What kind of change does this PR introduce?

Refactor, upgrade

What is the current behavior?

Currently the ort rust backend uses ort rc-9 and onnx v1.20.1

What is the new behavior?

This PR introduces:

  • ort: library upgraded from rc-9 to rc-10
  • onnx: supported version bumped from 1.20.1 to 1.22.0

Additional context

The rc-10 version introduces the Compiler feature. I haven't explored it yet, but it seems it would allow AOT compilation during model caching, which could speed up cold starts.

Need help:
I would like to ask @nyannyacha 💚, if possible, to run k6 tests comparing this branch against the latest version.

@kallebysantos kallebysantos changed the title feat: onnx runtime upgrade feat(ai): onnx runtime upgrade Aug 21, 2025
@nyannyacha nyannyacha self-assigned this Aug 22, 2025
@nyannyacha
Contributor

Hello @kallebysantos 😋

I'm currently testing this PR locally, but it seems the scores returned by these dot product lines are quite different from the main branch.
Could you please check why these lines aren't working?

This PR

"sameScore 1.0000034757189356"
"diffScore 1.0000035978635893"

(These values should be 1 or less, but it seems they aren't? 🧐)

Main Branch

"sameScore 0.999999847816228"
"diffScore 0.8725680383163251"

@kallebysantos
Contributor Author

Hi @nyannyacha 💚

It seems I got the input_ids tensor shape wrong 🫠
It was supposed to be [1, size] ✅ instead of [size, 1]
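For reference, a minimal sketch of the corrected layout using ndarray (illustrative only, not the actual backend code): the token ids must form a single batch row of shape [1, size].

```rust
use ndarray::Array2;

// Build the input_ids tensor with shape [1, size]: one batch of `size` tokens.
// A [size, 1] layout would instead feed `size` batches of one token each,
// which is the mistake fixed in this commit.
fn build_input_ids(ids: &[i64]) -> Array2<i64> {
    Array2::from_shape_vec((1, ids.len()), ids.to_vec())
        .expect("shape (1, len) always matches len elements")
}

fn main() {
    let t = build_input_ids(&[101, 2023, 2003, 102]);
    assert_eq!(t.shape(), &[1, 4]);
}
```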

@nyannyacha
Contributor

Alright @kallebysantos, I ran integration tests on the latest commit locally, and it looks fine. 😄
As you requested, I'll check if any improvements were made through load testing with k6. I'll share the results tomorrow.
If everything looks good, we'll probably be able to merge it tomorrow.

Have a great day!

@nyannyacha nyannyacha changed the base branch from develop to main September 11, 2025 05:47
@nyannyacha nyannyacha changed the base branch from main to develop September 11, 2025 05:48
@nyannyacha
Contributor

Benchmark

This PR

vscode ➜ /workspaces/edge-runtime-shadow (pr/kallebysantos/594) $ k6 run ./k6/dist/specs/ort-rust-backend.js

          /\      |‾‾| /‾‾/   /‾‾/
     /\  /  \     |  |/  /   /  /
    /  \/    \    |     (   /   ‾‾\
   /          \   |  |\  \ |  (‾)  |
  / __________ \  |__| \__\ \_____/ .io

     execution: local
        script: ./k6/dist/specs/ort-rust-backend.js
        output: -

     scenarios: (100.00%) 1 scenario, 12 max VUs, 3m30s max duration (incl. graceful stop):
              * ortRustBackend: 12 looping VUs for 3m0s (gracefulStop: 30s)


     ✓ status is 200
     ✗ request cancelled
      ↳  0% — ✓ 0 / ✗ 4591

     checks.........................: 50.00% ✓ 4591      ✗ 4591
     data_received..................: 666 kB 3.5 kB/s
     data_sent......................: 1.6 MB 8.4 kB/s
     http_req_blocked...............: avg=2.01µs   min=291ns   med=1.08µs   max=1.1ms    p(90)=2.29µs   p(95)=2.51µs
     http_req_connecting............: avg=166ns    min=0s      med=0s       max=206.75µs p(90)=0s       p(95)=0s
     http_req_duration..............: avg=470.73ms min=11.23ms med=457ms    max=2.27s    p(90)=600.72ms p(95)=639.82ms
       { expected_response:true }...: avg=470.73ms min=11.23ms med=457ms    max=2.27s    p(90)=600.72ms p(95)=639.82ms
     http_req_failed................: 0.00%  ✓ 0         ✗ 4592
     http_req_receiving.............: avg=28.13µs  min=4.12µs  med=16.2µs   max=2.57ms   p(90)=44.12µs  p(95)=51.72µs
     http_req_sending...............: avg=7.89µs   min=1.91µs  med=5.08µs   max=356.5µs  p(90)=13.2µs   p(95)=15.18µs
     http_req_tls_handshaking.......: avg=0s       min=0s      med=0s       max=0s       p(90)=0s       p(95)=0s
     http_req_waiting...............: avg=470.7ms  min=11.11ms med=456.98ms max=2.26s    p(90)=600.47ms p(95)=639.81ms
     http_reqs......................: 4592   24.123616/s
     iteration_duration.............: avg=472.88ms min=83.45ms med=457.16ms max=9.39s    p(90)=600.92ms p(95)=640.04ms
     iterations.....................: 4591   24.118363/s
     vus............................: 12     min=0       max=12
     vus_max........................: 12     min=0       max=12


running (3m10.4s), 00/12 VUs, 4591 complete and 0 interrupted iterations
ortRustBackend ✓ [======================================] 12 VUs  3m0s

Main branch:

vscode ➜ /workspaces/edge-runtime-shadow (main) $ k6 run ./k6/dist/specs/ort-rust-backend.js

          /\      |‾‾| /‾‾/   /‾‾/
     /\  /  \     |  |/  /   /  /
    /  \/    \    |     (   /   ‾‾\
   /          \   |  |\  \ |  (‾)  |
  / __________ \  |__| \__\ \_____/ .io

     execution: local
        script: ./k6/dist/specs/ort-rust-backend.js
        output: -

     scenarios: (100.00%) 1 scenario, 12 max VUs, 3m30s max duration (incl. graceful stop):
              * ortRustBackend: 12 looping VUs for 3m0s (gracefulStop: 30s)

INFO[0148] Internal Server Error                         source=console
ERRO[0148] GoError: unexpected response
running at go.k6.io/k6/js/modules/k6.(*K6).Fail-fm (native)
ortRustBat ortRustBackend (file:///workspaces/edge-runtime-shadow/k6/dist/specs/ort-rust-backend.js:65:12(76))  executor=constant-vus scenario=ortRustBackend source=stacktrace
INFO[0161] Internal Server Error                         source=console
ERRO[0161] GoError: unexpected response
running at go.k6.io/k6/js/modules/k6.(*K6).Fail-fm (native)
ortRustBat ortRustBackend (file:///workspaces/edge-runtime-shadow/k6/dist/specs/ort-rust-backend.js:65:12(76))  executor=constant-vus scenario=ortRustBackend source=stacktrace

     ✗ status is 200
      ↳  99% — ✓ 31650 / ✗ 2
     ✗ request cancelled
      ↳  0% — ✓ 0 / ✗ 31652

     checks.........................: 49.99% ✓ 31650      ✗ 31654
     data_received..................: 4.6 MB 24 kB/s
     data_sent......................: 11 MB  59 kB/s
     http_req_blocked...............: avg=3.26µs  min=292ns  med=1.87µs  max=2.54ms p(90)=4.04µs   p(95)=4.54µs
     http_req_connecting............: avg=136ns   min=0s     med=0s      max=1.79ms p(90)=0s       p(95)=0s
     http_req_duration..............: avg=68.04ms min=8.67ms med=54.73ms max=1.38s  p(90)=122.38ms p(95)=134.58ms
       { expected_response:true }...: avg=68.04ms min=8.67ms med=54.73ms max=1.38s  p(90)=122.38ms p(95)=134.58ms
     http_req_failed................: 0.00%  ✓ 2          ✗ 31651
     http_req_receiving.............: avg=54.71µs min=4.87µs med=34.91µs max=5.19ms p(90)=91.42µs  p(95)=109.71µs
     http_req_sending...............: avg=19.92µs min=2.2µs  med=11µs    max=9.02ms p(90)=29.12µs  p(95)=40.59µs
     http_req_tls_handshaking.......: avg=0s      min=0s     med=0s      max=0s     p(90)=0s       p(95)=0s
     http_req_waiting...............: avg=67.96ms min=8.63ms med=54.66ms max=1.38s  p(90)=122.31ms p(95)=134.5ms
     http_reqs......................: 31653  166.199403/s
     iteration_duration.............: avg=68.54ms min=8.78ms med=54.92ms max=9.8s   p(90)=122.61ms p(95)=134.78ms
     iterations.....................: 31652  166.194152/s
     vus............................: 12     min=0        max=12
     vus_max........................: 12     min=0        max=12


running (3m10.5s), 00/12 VUs, 31652 complete and 0 interrupted iterations
ortRustBackend ✓ [======================================] 12 VUs  3m0s

Based on the benchmark, there appears to be a significant regression in the number of handled requests.
I haven't reviewed all the code, but it might be due to the impact of the std::sync::Mutex you introduced.
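To illustrate the suspicion with a minimal sketch (not the actual backend code; `slow_infer` simply stands in for Session::run): holding a single std::sync::Mutex around the session for the whole run serializes every concurrent inference request, which is consistent with the per-request latency jump in the numbers above.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

// Placeholder for Session::run: pretend one inference takes ~50ms.
fn slow_infer() {
    thread::sleep(Duration::from_millis(50));
}

fn main() {
    let session = Arc::new(Mutex::new(())); // stand-in for the ort Session
    let start = Instant::now();

    let handles: Vec<_> = (0..12) // mirrors the 12 k6 VUs
        .map(|_| {
            let session = Arc::clone(&session);
            thread::spawn(move || {
                let _guard = session.lock().unwrap(); // held for the entire run
                slow_infer();
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    // ~600ms total instead of ~50ms: the 12 runs executed one at a time.
    println!("elapsed: {:?}", start.elapsed());
}
```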

And could you change the base of this PR from develop to main?
The develop branch is currently unused; it will be used later for the Deno 2.x upgrade.

@kallebysantos kallebysantos changed the base branch from develop to main September 11, 2025 17:05
@kallebysantos
Contributor Author

Hi Nya, thanks for your feedback 💚
I'm still investigating a better way to fix this.

Here you can see more about the &mut self breaking change.

@nyannyacha
Contributor

nyannyacha commented Sep 12, 2025

Let's see what migration options they offered...

  • Batch requests, and increase the number of threads if using the CPU EP
    • I don't know how to use this yet, but it seems like the option that can handle the breaking change in rc10 with the least regression.
  • Create a single-threaded inference queue using an mpsc channel, and a one-shot channel to receive the results
    • This eventually linearizes the inference requests from multiple workers, so the regression will persist (a minimal sketch of this pattern follows below).
  • Put Sessions behind a Mutex
    • This is the option you currently selected. But wrapping a Mutex around sessions that are created only once per model does not semantically differ from the second option, so the regression will persist.
  • If you have the memory to spare, create multiple sessions. Maybe prepacked weights could be of use (though I can't recall if it supports non-CPU weights)
    • I'm not entirely sure what this does, but it seems PrepackedWeights can be shared.
      Even if we create a Session via the Builder using PrepackedWeights, it appears we must still call Builder::model_from_file anyhow.
      So this seems to create duplicate sessions for the same model. If that's correct, it won't be helpful for our case.
    • #1 #2

The options available to us appear to be either not upgrading to this version or choosing the first option, which introduces minimal regression.
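To make the second option concrete, here is a minimal, ort-free sketch of such an inference queue (names like Request and fake_infer are placeholders, not code from this repo): a single worker thread owns the session and drains a channel, callers get their result back over a private reply channel, and every inference still runs one at a time.

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for Session::run; the real worker would own the ort Session.
fn fake_infer(input: &[f32]) -> Vec<f32> {
    input.iter().map(|x| x * 2.0).collect()
}

struct Request {
    input: Vec<f32>,
    reply: mpsc::Sender<Vec<f32>>, // acts as the one-shot result channel
}

fn main() {
    let (tx, rx) = mpsc::channel::<Request>();

    // Single worker owns the (mutable) session, so &mut access is trivial,
    // but requests from all web workers are processed strictly in sequence.
    let worker = thread::spawn(move || {
        for req in rx {
            let output = fake_infer(&req.input);
            let _ = req.reply.send(output);
        }
    });

    // A caller enqueues a request and blocks on its private reply channel.
    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send(Request { input: vec![1.0, 2.0], reply: reply_tx }).unwrap();
    println!("{:?}", reply_rx.recv().unwrap());

    drop(tx); // close the queue so the worker loop ends
    worker.join().unwrap();
}
```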

...On the other hand, I also see comments like this:

...snip...
which is what all the run_async_* eventually call into, still accepts &self. Does these functions really modify the internal state in a thread-unsafe way? If yes, run_inner_async should also take &mut self otherwise this looks unsound to me.

pykeio/ort#402 (comment)

Indeed, if we look at the commit that changed the signature of self, we can see that only the public API signatures were changed to take &mut self, while no internal changes were made.
pykeio/ort@bd2aff7#diff-af36d5fa0ee7f11fecf4482ebfbe7a43c8eeb42769bdcb9f94d8f87e1d5afaf6

Internally, OrtApis::Run modifies state without using a mutex. Well, to be more specific, it will use a mutex only if the execution provider's ConcurrentRunSupported() flag is false, which only applies to WebGPU and XNNPACK; but in practice, CUDA has had issues with concurrent runs (pykeio/ort#348, pykeio/ort#321). So Session::run taking &self was technically unsafe, though you'd typically only ever run into issues if you were using a non-CPU EP.

If we don't mind practicing black magic, there are ways to trick rustc and make the reference to the shared Session mutable and pass it around. 🙃 (Though it does seem quite risky since CUDA EP can also be used. 😅)

So Session::run taking &self was technically unsafe, though you'd typically only ever run into issues if you were using a non-CPU EP.

Maybe, based on this statement, we could wrap it with a Mutex only when using the CUDA EP and fully accept the regression there, while for the CPU EP we could use black magic to bypass rustc's function signature checks, completely resolving the regression.

What do you think? 😋

@kallebysantos
Contributor Author

kallebysantos commented Sep 12, 2025

Hi @nyannyacha thanks for helping 💚

for CPU EP, we could use black magic to bypass rustc's function signature checks

I'm OK with that 🧙‍♂️🪄 - I'd just like to refer to this other comment

since CUDA EP can also be used.

To be honest, I'm not really sure if the CUDA support is still working. It became harder to test since I don't have easy access to a GPU machine, and I think I found some problems the last time I tried.

In my opinion, we should focus on CPU only, then add GPU support later based on demand.
