Set memory pool through RMM #2866

viclafargue · 2025-11-11T15:19:44Z

This PR updates the device_resources_snmg utility so that the method in charge of setting up the RMM pools on the GPUs actually works in a multi-GPU context. Previously, this method set a pool workspace on the device resource, which had no effect. It now sets the per-device memory resource of each GPU to a pool through RMM.
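The per-device idea can be illustrated with a small standard-library model. This is only a sketch under stated assumptions, not the code in this PR and not RMM's API: `memory_resource`, `resource_registry`, and `per_device_pools` are hypothetical names, and restoring the previous resources on destruction is an assumption about what a safe owner might do.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Toy stand-in for a device memory resource (hypothetical, NOT an RMM type).
struct memory_resource {
  virtual ~memory_resource() = default;
  virtual const char* name() const = 0;
};

struct default_resource final : memory_resource {
  const char* name() const override { return "default"; }
};

struct pool_resource final : memory_resource {
  explicit pool_resource(int percent) : percent_of_free_memory(percent) {}
  const char* name() const override { return "pool"; }
  int percent_of_free_memory;
};

// Toy per-device registry: maps a device id to its current resource.
class resource_registry {
 public:
  memory_resource* get(int device) {
    auto it = current_.find(device);
    return it == current_.end() ? &fallback_ : it->second;
  }
  // Installs `mr` for `device` and returns the resource it replaced.
  memory_resource* set(int device, memory_resource* mr) {
    memory_resource* previous = get(device);
    current_[device] = mr;
    return previous;
  }

 private:
  default_resource fallback_;
  std::unordered_map<int, memory_resource*> current_;
};

// Owns one pool per device; restores the previous resources on destruction
// (assumption: the restore step is illustrative, not taken from the PR).
class per_device_pools {
 public:
  per_device_pools(resource_registry& registry, int num_devices, int percent)
    : registry_(registry) {
    for (int dev = 0; dev < num_devices; ++dev) {
      pools_.push_back(std::make_unique<pool_resource>(percent));
      previous_.push_back(registry_.set(dev, pools_.back().get()));
    }
  }
  ~per_device_pools() {
    // Runs before pools_ is destroyed, so the registry never dangles.
    for (std::size_t dev = 0; dev < pools_.size(); ++dev) {
      registry_.set(static_cast<int>(dev), previous_[dev]);
    }
  }
  per_device_pools(const per_device_pools&) = delete;
  per_device_pools& operator=(const per_device_pools&) = delete;

 private:
  resource_registry& registry_;
  std::vector<std::unique_ptr<pool_resource>> pools_;
  std::vector<memory_resource*> previous_;
};
```

In this model every allocation that looks up the registry for its device sees the pool, which is the property a handle-local workspace setting would not give.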

Benchmark (figure not included): search throughput for 100-row queries on 2-GPU CAGRA indexes built with NN Descent.

achirkin

Hi @viclafargue, could you please update the description of the PR to let the reviewers know what problem it addresses?
I suspect this is about perf bottlenecks related to locking the CUDA context in the multi-GPU setup. If so, please attach the benchmarks. I'm also a little concerned about the lifetime of the objects allocated with the per-device pools. Could there be a situation where an object outlives its memory pool and segfaults the program?

viclafargue · 2025-11-12T10:19:37Z

> I'm also concerned a little about the lifetime of the objects allocated with the per-device pools. Could there be a situation when an object outlives its memory pool and segfaults the program?

Yes, that is a real issue. Once a user sets a memory pool (e.g. handle.set_memory_pool(80)), anything allocated afterward must be released before the handle is released. That is why managing this on the RAFT side would be preferable.
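The ordering constraint can be made concrete with a toy model. The sketch below uses only the C++ standard library. `toy_pool`, `toy_buffer`, and `safe_to_release` are hypothetical names, not RAFT or RMM types. The buffer holds a non-owning pointer to its pool, so its destructor is only valid while the pool is still alive.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Toy pool that counts allocations still in flight (hypothetical type).
class toy_pool {
 public:
  void* allocate(std::size_t bytes) {
    ++outstanding_;
    return ::operator new(bytes);
  }
  void deallocate(void* p) noexcept {
    ::operator delete(p);
    --outstanding_;
  }
  std::size_t outstanding() const { return outstanding_; }

 private:
  std::size_t outstanding_ = 0;
};

// Buffer with a NON-owning pointer to its pool. Destroying it after the pool
// is gone would be a use-after-free, which is the hazard discussed above.
class toy_buffer {
 public:
  toy_buffer(toy_pool& pool, std::size_t bytes)
    : pool_(&pool), data_(pool.allocate(bytes)) {}
  ~toy_buffer() { pool_->deallocate(data_); }
  toy_buffer(const toy_buffer&) = delete;
  toy_buffer& operator=(const toy_buffer&) = delete;

 private:
  toy_pool* pool_;
  void* data_;
};

// The pool (or the handle owning it) may only go away once this is true.
bool safe_to_release(const toy_pool& pool) { return pool.outstanding() == 0; }
```

A check like `safe_to_release` is one way an owner could detect the problem before tearing a pool down. Whether RAFT should assert, warn, or defer destruction is a design choice this sketch leaves open.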

jinsolp

Thanks Victor!

cpp/include/raft/core/device_resources_snmg.hpp

bdice · 2025-11-13T19:31:24Z

> Could there be a situation when an object outlives its memory pool and segfaults the program?

Yes, this is a tricky problem with RMM today. I am working to improve this for 26.02 or 26.04.

This class of problems will be eliminated by RMM migrating to the CCCL 3.2 memory resource design. There is a new form of memory resource ownership in any_resource, which is maybe-shared. Stateful resources will have shared (refcounted) ownership but stateless resources will not, to avoid overhead.
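As a rough illustration of the "maybe-shared" idea, the standard-library sketch below lets a buffer co-own a stateful pool through a reference count, while a stateless resource is simply held by value. It is not the CCCL any_resource API. `counting_pool`, `stateless_resource`, `shared_buffer`, and `value_buffer` are hypothetical names, and `std::shared_ptr` stands in for whatever refcounting the real design uses.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <utility>

// Stateful toy pool: its bookkeeping must outlive every allocation.
class counting_pool {
 public:
  void* allocate(std::size_t bytes) {
    ++live_;
    return ::operator new(bytes);
  }
  void deallocate(void* p) noexcept {
    ::operator delete(p);
    --live_;
  }
  std::size_t live() const { return live_; }

 private:
  std::size_t live_ = 0;
};

// Stateless toy resource: nothing to keep alive, so no refcount is needed.
struct stateless_resource {
  void* allocate(std::size_t bytes) { return ::operator new(bytes); }
  void deallocate(void* p) noexcept { ::operator delete(p); }
};

// Buffer that co-owns a stateful pool; the pool lives until the last user.
class shared_buffer {
 public:
  shared_buffer(std::shared_ptr<counting_pool> pool, std::size_t bytes)
    : pool_(std::move(pool)), data_(pool_->allocate(bytes)) {}
  ~shared_buffer() { pool_->deallocate(data_); }
  shared_buffer(const shared_buffer&) = delete;
  shared_buffer& operator=(const shared_buffer&) = delete;
  void* data() const { return data_; }

 private:
  std::shared_ptr<counting_pool> pool_;  // declared first: initialized first
  void* data_;
};

// Buffer over a stateless resource: holds the resource by value.
class value_buffer {
 public:
  explicit value_buffer(std::size_t bytes) : data_(mr_.allocate(bytes)) {}
  ~value_buffer() { mr_.deallocate(data_); }
  value_buffer(const value_buffer&) = delete;
  value_buffer& operator=(const value_buffer&) = delete;
  void* data() const { return data_; }

 private:
  stateless_resource mr_;
  void* data_;
};
```

With this shape, dropping the original owner of the pool while a `shared_buffer` is still alive is harmless: the pool is destroyed only when the last buffer releases it.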

This issue comment (rapidsai/rmm#2011 (comment)) tracks adoption of the new memory resources, which is my primary active project right now. I am hopeful that 26.02 will entirely resolve this class of issues, though deprecation cycles may force it out to 26.04.

viclafargue · 2025-12-15T15:18:20Z

@achirkin @jinsolp Do you think we can merge the PR as-is for now and come back with a follow-up once the new RMM feature is available, or would you like to wait for it?

jinsolp · 2025-12-15T18:34:22Z

I'm fine either way!

Set memory pool through RMM

dc819a8

viclafargue assigned tfeher and achirkin Nov 11, 2025

viclafargue requested a review from a team as a code owner November 11, 2025 15:19

github-project-automation bot added this to Vector Search, ML, & Data Mining Release Board Nov 11, 2025

github-project-automation bot moved this to Todo in Vector Search, ML, & Data Mining Release Board Nov 11, 2025

viclafargue added the bug and non-breaking labels Nov 11, 2025

Additional safety additions

8281092

viclafargue assigned jinsolp Nov 11, 2025

achirkin requested changes Nov 11, 2025

viclafargue requested a review from achirkin November 12, 2025 10:25

code style

61f7b40

jinsolp reviewed Nov 12, 2025

cpp/include/raft/core/device_resources_snmg.hpp (outdated, resolved)

Answering review

d801ba1

viclafargue changed the base branch from main to release/25.12 November 17, 2025 17:10

viclafargue requested a review from jinsolp November 18, 2025 17:55

Merge branch 'release/25.12' into set-mem-pool-with-rmm

7d00dc8

viclafargue changed the base branch from release/25.12 to main December 1, 2025 17:49

viclafargue added 5 commits December 3, 2025 13:41

Merge branch 'main' into set-mem-pool-with-rmm

047db23

Merge branch 'main' into set-mem-pool-with-rmm

e769ee5

Merge branch 'main' into set-mem-pool-with-rmm

cee934b

fix RMM headers

747ed1a

Merge branch 'main' into set-mem-pool-with-rmm

434a8cb
