[feat] support megatron-LM training by mcore_adapter #9237
Conversation
Summary of Changes
Hello @Kuangdd01, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request, authored by Kuangdd01, introduces initial support for Megatron-LM training via the mcore_adapter library.
Highlights
Code Review
This pull request introduces support for Megatron-LM training using mcore_adapter. The changes are comprehensive, touching argument parsing, training workflows, and configuration. The core logic resides in the new src/llamafactory/train/mca/workflow.py file. While the implementation appears to cover the main requirements, there are several areas for improvement regarding maintainability, clarity, and robustness. My review focuses on addressing code smells like "hacks" and FIXMEs, improving the extensibility of model-specific logic, and making the code safer by avoiding in-place modifications of shared objects. Additionally, there's an inconsistency in how MCA-specific logic is triggered (environment variable vs. config option) that should be resolved for better predictability.
src/llamafactory/cli.py (outdated)
```python
if is_env_enabled("USE_MCA"):
    # force use torchrun
    os.environ["FORCE_TORCHRUN"] = "1"

if command == "train" and (is_env_enabled("FORCE_TORCHRUN") or (get_device_count() > 1 and not use_ray())):
```
The use of is_env_enabled("USE_MCA") to control logic in both cli.py and parser.py is inconsistent with using finetuning_args.use_mca in tuner.py. This can lead to confusing behavior where setting use_mca: true in a config file doesn't work unless the USE_MCA environment variable is also set. This dependency on an environment variable for a core logic switch makes configuration less transparent and more error-prone.
To make use_mca the single source of truth, consider a two-pass approach for argument parsing. You could first parse just the FinetuningArguments to check the value of use_mca, and then, based on that, select the appropriate full set of argument classes for the second pass. This would eliminate the need for the USE_MCA environment variable.
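A minimal sketch of that two-pass idea, assuming `HfArgumentParser` from `transformers` and that the existing dataclasses are importable from `llamafactory.hparams` (the import path, the function name, and the placeholder swap for an MCA-specific training-arguments class are assumptions, not code from this PR):

```python
from transformers import HfArgumentParser, Seq2SeqTrainingArguments

# Existing LLaMA-Factory dataclasses; this import path is assumed, not taken from the diff.
from llamafactory.hparams import DataArguments, FinetuningArguments, ModelArguments


def parse_args_two_pass(args=None):
    """Pass 1: peek at `use_mca`; pass 2: parse the full argument set it implies."""
    # Only FinetuningArguments is parsed here; unrelated arguments stay in the remainder.
    finetuning_args, _ = HfArgumentParser(FinetuningArguments).parse_args_into_dataclasses(
        args=args, return_remaining_strings=True
    )

    training_args_cls = Seq2SeqTrainingArguments
    if finetuning_args.use_mca:
        # Swap in the mcore_adapter training-arguments class here (its name is not shown in the diff).
        training_args_cls = Seq2SeqTrainingArguments  # placeholder

    parser = HfArgumentParser((ModelArguments, DataArguments, training_args_cls, FinetuningArguments))
    return parser.parse_args_into_dataclasses(args=args)
```

With this shape, `use_mca: true` in the config becomes the single switch and the `USE_MCA` environment variable is no longer needed.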
Code Review
This pull request introduces support for Megatron-LM training using mcore_adapter. The changes are quite extensive, adding new example configurations, modifying argument parsing, and creating new training workflows. The overall strategy of conditionally using mcore_adapter based on an environment variable is sound. However, I've identified several areas for improvement, particularly concerning code duplication in the new workflow files, and some inconsistencies in argument handling. My review includes specific suggestions to enhance code clarity, consistency, and maintainability.
```python
template = get_template_and_fix_tokenizer(tokenizer, data_args)

# dataset needs +1 then cut back due to MCA shift logic
data_args.cutoff_len += 1
```
The pattern of incrementing data_args.cutoff_len, calling get_dataset, and then decrementing it is repeated across run_pt, run_sft, and run_dpo. This in-place modification can be error-prone and makes the code less clean. Consider creating a copy of data_args with the modified cutoff_len or using a context manager to handle this temporarily to avoid side effects and reduce repetition.
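One possible shape for the context-manager variant (a sketch; `data_args` and `get_dataset` are the existing objects from the workflow, the helper name is made up, and the `get_dataset` call is abbreviated):

```python
from contextlib import contextmanager


@contextmanager
def extended_cutoff_len(data_args, offset=1):
    """Temporarily extend data_args.cutoff_len (for the MCA shift logic) and always restore it."""
    data_args.cutoff_len += offset
    try:
        yield data_args
    finally:
        data_args.cutoff_len -= offset


# Usage inside run_pt / run_sft / run_dpo, replacing the manual += 1 / -= 1 pair:
# with extended_cutoff_len(data_args):
#     dataset_module = get_dataset(...)  # same arguments as the current call
```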
```python
model = AutoModel.from_pretrained(model_args.model_name_or_path, training_args)

from transformers import DataCollatorForSeq2Seq
```
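For reference, a typical `DataCollatorForSeq2Seq` construction looks like the sketch below; the tokenizer path and argument values are illustrative and not taken from this PR:

```python
from transformers import AutoTokenizer, DataCollatorForSeq2Seq

# Example tokenizer only; the workflow uses the tokenizer loaded for the target model.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# label_pad_token_id=-100 keeps padded label positions out of the loss;
# pad_to_multiple_of=8 pads sequences to a tensor-core-friendly length.
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    label_pad_token_id=-100,
    pad_to_multiple_of=8,
)
```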
hiyouga left a comment
LGTM
When I try to run it, it shows that mca_config does not exist in model_name_or_path. Do we need to convert the HF model to a Megatron model before tuning?
I guess it is a network error. Try loading the model locally.
```yaml
model_name_or_path: /data/ptmodels/Qwen/Qwen3-Next-80B-A3B-Instruct
do_train: true
output_dir: /data/sftmodels/FinMDT-ThoughtPOP/llma-Qwen-next-sft-dot1
tensor_model_parallel_size: 1
```
cd /data/code/llama-mcore/LLaMA-Factory && …
Why do I still get the following error with this configuration on 8 H20 GPUs? torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 5 has a total capacity of 95.08 GiB of which 17.88 MiB is free. Including non-PyTorch memory, this process has 95.06 GiB memory in use. Of the allocated memory 91.95 GiB is allocated by PyTorch, and 12.15 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
For Qwen3-Next (80B), we may need at least 16×80GB of GPU memory just to hold the model parameters and optimizer states if we use normal training recipes.
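A rough back-of-the-envelope check of that estimate (the per-parameter byte counts below are assumptions for a bf16 + fp32-master-weights + Adam setup, not measurements from this PR):

```python
params = 80e9  # approximate parameter count of Qwen3-Next-80B

# Assumed bytes per parameter under mixed-precision Adam-style training:
#   2 (bf16 weights) + 2 (bf16 grads) + 4 (fp32 master weights) + 4 + 4 (Adam moments)
bytes_per_param = 2 + 2 + 4 + 4 + 4

total_gib = params * bytes_per_param / 1024**3
print(f"~{total_gib:.0f} GiB for weights, gradients and optimizer state")  # ~1192 GiB
print(f"~{total_gib / 80:.1f} x 80 GB GPUs, before activations")           # ~14.9
```

Once activations and other buffers are added, that lands around the 16×80GB figure mentioned above.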
What does this PR do?
Try to introduce Megatron-Core (mcore) training by mcore_adapter.
Todo List
Quick Start
You can refer to this document for detailed info.
Env Setup
We offer a Dockerfile here.
Our experiments
1. Qwen2VL-7B-instruct training
A800 × 8, dataset: llava_en_1k, cutoff_len: 4096
The setting may not be fair, but it's just a correctness check.
loss curve

2. Qwen3-30B-A3B-2507-Instruct
A800 × 16, dataset: OpenR1-Math-94k, cutoff_len: 4096
loss curve (mcore only)

3. E2E SFT on the identity dataset
Acknowledgement
Thanks to the ROLL team for this mcore adapter plugin!
Todo
Before submitting