Contributor

@viclafargue viclafargue commented Nov 11, 2025

This PR updates the device_resources_snmg utility so that the method in charge of setting up the RMM pools on the GPUs actually works in a multi-GPU context. This method used to set a pool workspace on the device resource, which had no effect. We now instead set the per-device memory resource of each GPU to a pool through RMM.

Benchmark of search throughput for 100-row queries on 2-GPU CAGRA indexes built with NN Descent:
[Figure: benchmark]

Contributor

@achirkin achirkin left a comment

Hi @viclafargue, could you please update the description of the PR to let the reviewers know what problem it addresses?
I suspect this is about perf bottlenecks related to locking the CUDA context in the multi-GPU setup. If so, please attach the benchmarks. I'm also concerned a little about the lifetime of the objects allocated with the per-device pools. Could there be a situation when an object outlives its memory pool and segfaults the program?

@viclafargue
Contributor Author

> I'm also concerned a little about the lifetime of the objects allocated with the per-device pools. Could there be a situation when an object outlives its memory pool and segfaults the program?

Yes, that is a real issue. Once a user sets a memory pool (e.g. handle.set_memory_pool(80)), anything allocated afterward must be released before the handle is released. That is why managing this from the RAFT side would be preferable.
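To make the ordering constraint concrete, here is a minimal sketch using only the C++ standard library. `ToyPool` and `ToyBuffer` are hypothetical stand-ins invented for illustration; they are not the RMM or RAFT types, and host `operator new` replaces device allocation. The sketch only shows the safe case, where every buffer is released while the pool it points to is still alive.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical toy pool: it only counts live allocations.
// Assumption: this stands in for a per-device pool, nothing more.
class ToyPool {
 public:
  void* allocate(std::size_t bytes) {
    ++live_;
    return ::operator new(bytes);
  }
  void deallocate(void* ptr) noexcept {
    --live_;
    ::operator delete(ptr);
  }
  std::size_t live() const noexcept { return live_; }

 private:
  std::size_t live_{0};
};

// Hypothetical buffer that keeps a raw, non-owning pointer to its pool.
// If the pool were destroyed first, the destructor below would dereference
// a dangling pointer (undefined behavior, possibly a segfault).
class ToyBuffer {
 public:
  ToyBuffer(ToyPool* pool, std::size_t bytes)
      : pool_{pool}, ptr_{pool->allocate(bytes)} {}
  ~ToyBuffer() { pool_->deallocate(ptr_); }
  ToyBuffer(const ToyBuffer&) = delete;
  ToyBuffer& operator=(const ToyBuffer&) = delete;

 private:
  ToyPool* pool_;
  void* ptr_;
};

// Safe ordering: the pool is created first and destroyed last.
std::size_t live_after_safe_scope() {
  ToyPool pool;
  {
    ToyBuffer a{&pool, 256};
    ToyBuffer b{&pool, 512};
    assert(pool.live() == 2);
  }  // a and b are released here, while the pool is still alive
  return pool.live();
}
```

Calling `live_after_safe_scope()` returns the number of allocations still outstanding once the inner scope has closed. The unsafe case (a buffer that outlives the pool) is deliberately not executed, because its behavior is undefined.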

@viclafargue viclafargue requested a review from achirkin November 12, 2025 10:25
Contributor

@jinsolp jinsolp left a comment

Thanks Victor!

@bdice
Contributor

bdice commented Nov 13, 2025

> Could there be a situation when an object outlives its memory pool and segfaults the program?

Yes, this is a tricky problem with RMM today. I am working to improve this for 26.02 or 26.04.

This class of problems will be eliminated by RMM migrating to the CCCL 3.2 memory resource design. There is a new form of memory resource ownership in any_resource, which is maybe-shared. Stateful resources will have shared (refcounted) ownership but stateless resources will not, to avoid overhead.

This issue comment (rapidsai/rmm#2011 (comment)) tracks adoption of the new memory resources, which is my primary active project right now. I am hopeful that 26.02 will entirely resolve this class of issues, though deprecation cycles may force it out to 26.04.
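As a rough illustration of why shared (refcounted) ownership of a stateful resource removes this class of problem, here is a sketch that uses `std::shared_ptr` as a stand-in. This is an assumption made for illustration only: it is not the CCCL `any_resource` API, and `CountingPool`, `OwningBuffer`, and `pool_alive_after_handle_reset` are hypothetical names. Each buffer holds a reference to its pool, so the pool cannot be destroyed while any allocation is outstanding.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <utility>

// Hypothetical stateful resource. It reports its own destruction through
// an external flag so the caller can observe when it really goes away.
class CountingPool {
 public:
  explicit CountingPool(bool* destroyed) : destroyed_{destroyed} {}
  ~CountingPool() { *destroyed_ = true; }
  void* allocate(std::size_t bytes) {
    ++live_;
    return ::operator new(bytes);
  }
  void deallocate(void* ptr) noexcept {
    --live_;
    ::operator delete(ptr);
  }
  std::size_t live() const noexcept { return live_; }

 private:
  bool* destroyed_;
  std::size_t live_{0};
};

// Hypothetical buffer that shares ownership of the pool it allocates from.
class OwningBuffer {
 public:
  OwningBuffer(std::shared_ptr<CountingPool> pool, std::size_t bytes)
      : pool_{std::move(pool)}, ptr_{pool_->allocate(bytes)} {}
  ~OwningBuffer() { pool_->deallocate(ptr_); }
  OwningBuffer(const OwningBuffer&) = delete;
  OwningBuffer& operator=(const OwningBuffer&) = delete;

 private:
  std::shared_ptr<CountingPool> pool_;  // initialized before ptr_
  void* ptr_;
};

// The "handle" drops its reference first, yet the pool stays alive
// until the last buffer that uses it has been destroyed.
bool pool_alive_after_handle_reset(bool& destroyed) {
  auto pool = std::make_shared<CountingPool>(&destroyed);
  OwningBuffer buf{pool, 1024};
  pool.reset();       // the handle releases the pool
  return !destroyed;  // true: buf's reference keeps the pool alive
}
```

In this toy model the pool is only destroyed when `buf` goes out of scope at the end of the function. A stateless resource would not need the reference count at all, which matches the overhead argument above.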

@viclafargue viclafargue changed the base branch from main to release/25.12 November 17, 2025 17:10
@viclafargue viclafargue requested a review from jinsolp November 18, 2025 17:55
@viclafargue viclafargue changed the base branch from release/25.12 to main December 1, 2025 17:49
@viclafargue
Contributor Author

@achirkin @jinsolp Do you think we can merge the PR as-is for now and come back with a follow-up once the new RMM feature is available, or would you prefer to wait for it?

@jinsolp
Contributor

jinsolp commented Dec 15, 2025

I'm fine either way!

Labels

bug (Something isn't working), non-breaking (Non-breaking change)

6 participants