
Commit 3e1da1f

Update references to new GitHub org (deepspeedai) (#462)
Signed-off-by: Logan Adams <loadams@microsoft.com>
1 parent f4157be · commit 3e1da1f

File tree

7 files changed: 12 additions and 12 deletions


README.md

Lines changed: 2 additions & 2 deletions

@@ -1,11 +1,11 @@
## Latest News
- * [2023/07] Synced with [upstream](https://github.com/NVIDIA/Megatron-LM) over 1k commits; see the [rebase folder](https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/rebase) for more details on features and updated performance.
+ * [2023/07] Synced with [upstream](https://github.com/NVIDIA/Megatron-LM) over 1k commits; see the [rebase folder](https://github.com/deepspeedai/Megatron-DeepSpeed/tree/main/examples_deepspeed/rebase) for more details on features and updated performance.

## Megatron-DeepSpeed
DeepSpeed version of NVIDIA's Megatron-LM that adds support for several additional features such as MoE model training, Curriculum Learning, 3D Parallelism, and others. The ```examples_deepspeed/``` folder includes example scripts for the features supported by DeepSpeed.

### Recent sync with NVIDIA/Megatron-LM
- In July 2023, we synced with the NVIDIA/Megatron-LM repo (from which this repo is forked) by git-merging 1100+ commits. Details can be found in the ```examples_deepspeed/rebase``` folder. Given the number of merged commits, bugs can appear in cases we have not tested, and your contributions (bug reports, bug-fix pull requests) are highly welcome. We also created a [backup branch](https://github.com/microsoft/Megatron-DeepSpeed/tree/before_rebase), which is the version before this sync. This backup branch is only for comparison tests and for temporary use when you need to debug the main branch. We do not plan to continue supporting the pre-sync version.
+ In July 2023, we synced with the NVIDIA/Megatron-LM repo (from which this repo is forked) by git-merging 1100+ commits. Details can be found in the ```examples_deepspeed/rebase``` folder. Given the number of merged commits, bugs can appear in cases we have not tested, and your contributions (bug reports, bug-fix pull requests) are highly welcome. We also created a [backup branch](https://github.com/deepspeedai/Megatron-DeepSpeed/tree/before_rebase), which is the version before this sync. This backup branch is only for comparison tests and for temporary use when you need to debug the main branch. We do not plan to continue supporting the pre-sync version.

### Run on Azure and AzureML
To try out DeepSpeed on Azure, this fork of Megatron offers easy-to-use recipes and bash scripts. We strongly recommend starting with the AzureML recipe in the ```examples_deepspeed/azureml``` folder. If you have a custom infrastructure (e.g., HPC clusters) or an Azure VM based environment, please refer to the bash scripts in the ```examples_deepspeed/azure``` folder.

examples_deepspeed/bert_with_pile/README.md

Lines changed: 1 addition & 1 deletion

@@ -8,7 +8,7 @@ This ```bert_with_pile``` folder includes examples about BERT pre-training (usin

As a reference performance number, our measurements show that our example is able to achieve a throughput of up to 145 TFLOPs per GPU when pre-training a 1.3B BERT model (with ZeRO stage-1, without model parallelism, with 64 NVIDIA A100 GPUs, with batch size 4096 (64 per GPU), and with activation checkpointing).

- One thing to note is that this pre-training recipe is NOT a strict reproduction of the [original BERT paper](https://arxiv.org/abs/1810.04805): the Pile data is larger than the data used in the original BERT (and the data used by the Megatron paper); Megatron-LM introduces some changes to the BERT model (see details in the [Megatron paper](https://arxiv.org/abs/1909.08053)); and the training hyperparameters are also different. Overall these differences lead to longer training time but also better model quality than the original BERT (see the MNLI score below), and they support large model scale through the combination of ZeRO and model parallelism. If you don't have enough computation budget, we recommend reducing the total training iterations (```train_iters``` in the script) and potentially increasing the learning rate at the same time. If you want to strictly reproduce the original BERT, we recommend using our [other BERT example](https://github.com/microsoft/DeepSpeedExamples/tree/master/bing_bert).
+ One thing to note is that this pre-training recipe is NOT a strict reproduction of the [original BERT paper](https://arxiv.org/abs/1810.04805): the Pile data is larger than the data used in the original BERT (and the data used by the Megatron paper); Megatron-LM introduces some changes to the BERT model (see details in the [Megatron paper](https://arxiv.org/abs/1909.08053)); and the training hyperparameters are also different. Overall these differences lead to longer training time but also better model quality than the original BERT (see the MNLI score below), and they support large model scale through the combination of ZeRO and model parallelism. If you don't have enough computation budget, we recommend reducing the total training iterations (```train_iters``` in the script) and potentially increasing the learning rate at the same time. If you want to strictly reproduce the original BERT, we recommend using our [other BERT example](https://github.com/deepspeedai/DeepSpeedExamples/tree/master/bing_bert).

## BERT MNLI fine-tuning
```ds_finetune_bert_mnli.sh``` is the script for BERT MNLI fine-tuning, following the hyperparameters in the [Megatron paper](https://arxiv.org/abs/1909.08053). As a reference, the table below presents the scores using the model pre-trained with the script above, compared with the scores of the original BERT and the Megatron paper's BERT. Our BERT-Large's score is slightly lower than the Megatron paper's, mainly due to the different data we used (the Pile data is much more diverse and larger than the data in the Megatron paper, which potentially has a negative effect on small million-scale models).
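The budget note in the diff above (reducing ```train_iters``` while nudging the learning rate up) can be made concrete with a minimal sketch. The variable names and values below are illustrative assumptions, not the recipe's actual settings; edit the pre-training script in this folder rather than copying these numbers.

```shell
# Hedged sketch: trading training length for a slightly higher learning rate
# when compute is limited. The numbers are illustrative assumptions only.
TRAIN_ITERS=1000000   # reduced from a longer full-budget schedule
LR=1.5e-4             # modestly higher than the LR paired with the long schedule

# These would typically be forwarded to the training script as Megatron-style
# arguments, e.g. --train-iters ${TRAIN_ITERS} --lr ${LR}
echo "--train-iters ${TRAIN_ITERS} --lr ${LR}"
```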

examples_deepspeed/bert_with_pile/prepare_pile_data.py

Lines changed: 1 addition & 1 deletion

@@ -102,7 +102,7 @@ def pile_merge(file_path):
# usage during merge is about 600GB. If you don't have enough memory,
# one solution is to directly use the 30 data chunks as multiple
# datasets. See '--data-path' in
- # github.com/microsoft/Megatron-DeepSpeed/blob/main/megatron/arguments.py
+ # https://github.com/deepspeedai/Megatron-DeepSpeed/blob/main/megatron/arguments.py
pile_merge(file_path)
else:
if sys.argv[1] == "range":
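For readers who prefer to skip the memory-hungry merge step mentioned in the comment above, here is a hedged sketch of how the 30 unmerged chunks could be fed to training as a blended dataset through ```--data-path```. The prefix paths and the equal weights are illustrative assumptions, not values produced verbatim by this script; ```megatron/arguments.py``` remains the authoritative reference for the argument format.

```shell
# Hedged sketch: build a weighted --data-path value from 30 preprocessed chunks
# instead of merging them into one file. "1" is an equal sampling weight; the
# prefix paths are illustrative assumptions.
DATA_PATH=""
for i in $(seq -w 0 29); do
    DATA_PATH="${DATA_PATH} 1 /data/pile/chunk_${i}/pile_bert_train_text_sentence"
done

# The assembled value would then be passed to the training script, e.g.:
#   --data-path ${DATA_PATH}
echo "--data-path${DATA_PATH}"
```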

examples_deepspeed/data_efficiency/bert/pile_data_download_preprocess.py

Lines changed: 1 addition & 1 deletion

@@ -103,7 +103,7 @@ def pile_merge(file_path):
# usage during merge is about 600GB. If you don't have enough memory,
# one solution is to directly use the 30 data chunks as multiple
# datasets. See '--data-path' in
- # github.com/microsoft/Megatron-DeepSpeed/blob/main/megatron/arguments.py
+ # https://github.com/deepspeedai/Megatron-DeepSpeed/blob/main/megatron/arguments.py
pile_merge(file_path)
else:
if sys.argv[1] == "range":

examples_deepspeed/deepspeed4science/megatron_long_seq_support/README.md

Lines changed: 2 additions & 2 deletions

@@ -23,8 +23,8 @@ Resolved Issues:

```shell
# clone source code
- git clone https://github.com/microsoft/DeepSpeed.git
- git clone https://github.com/microsoft/Megatron-DeepSpeed.git
+ git clone https://github.com/deepspeedai/DeepSpeed.git
+ git clone https://github.com/deepspeedai/Megatron-DeepSpeed.git
git clone https://github.com/NVIDIA/apex

# create a new virtual environment

examples_deepspeed/rebase/README.md

Lines changed: 3 additions & 3 deletions

@@ -1,7 +1,7 @@
# July 2023 sync with NVIDIA/Megatron-LM
This folder includes details about the recent sync with the NVIDIA/Megatron-LM repo (from which this repo is forked). It includes example scripts we used for testing after the sync, together with this README documentation about what was tested.

- We also created a [backup branch](https://github.com/microsoft/Megatron-DeepSpeed/tree/before_rebase), which is the version before this sync. This branch is only for comparison tests and for temporary use when debugging the main branch. We do not plan to continue supporting the pre-sync version.
+ We also created a [backup branch](https://github.com/deepspeedai/Megatron-DeepSpeed/tree/before_rebase), which is the version before this sync. This branch is only for comparison tests and for temporary use when debugging the main branch. We do not plan to continue supporting the pre-sync version.

## List of rebase efforts/achievements
* Enabling Megatron-LM's sequence parallel.

@@ -26,7 +26,7 @@ In addition, below is a performance/convergence comparison between before and af

| Case | TFLOPs (per GPU) | Validation loss at step 200 | Training script |
| ---- | ---------------- | --------------------------- | --------------- |
- | Before sync, GPT-3 13B, 3D parallelism | 50 | 5.73 | [script (in the backup branch)](https://github.com/microsoft/Megatron-DeepSpeed/blob/before_rebase/examples/before_rebase_test/ds_pretrain_gpt_13B.sh) |
+ | Before sync, GPT-3 13B, 3D parallelism | 50 | 5.73 | [script (in the backup branch)](https://github.com/deepspeedai/Megatron-DeepSpeed/blob/before_rebase/examples/before_rebase_test/ds_pretrain_gpt_13B.sh) |
| After sync, GPT-3 13B, 3D parallelism | 55.6 | 5.71 | [script](ds_pretrain_gpt_13B.sh) |

Finally, we provide a [toy example script](ds_pretrain_gpt_125M.sh) that users can try as a first test.

@@ -44,4 +44,4 @@ We also tested and verified that the Rotary Positional Embedding (RoPE) introduc

## Notes/TODOs
* After the sync, DeepSpeed still relies on the older activation checkpointing mechanism (see the function ```_checkpointed_forward``` in ```Megatron-DeepSpeed/megatron/model/transformer.py```), since we didn't have time to integrate with the new version yet. Contributions are very welcome.
- * (Aug 2023 update) With the contribution from 3P users (https://github.com/microsoft/Megatron-DeepSpeed/pull/225), it is now also possible to use Megatron-LM's newer activation checkpointing mechanism. However, it is currently still not compatible with DeepSpeed, so you won't be able to combine it with any DeepSpeed technologies. We, the DeepSpeed team, compared the [older mechanism](ds_pretrain_gpt_1.3B.sh) and the [newer mechanism](ds_pretrain_gpt_1.3B_megatron_checkpointing.sh) on 1 DGX-2 node (16 V100) and found that the older mechanism has less memory saving (older max allocated 15241 MB, newer 12924 MB) and higher throughput (older 23.11 TFLOPs, newer 17.26 TFLOPs). Thus we currently still recommend using the older mechanism, both because of the similar checkpointing performance and (more importantly) because only the older mechanism is compatible with DeepSpeed (and in this case you can combine it with ZeRO to achieve more memory saving).
+ * (Aug 2023 update) With the contribution from 3P users (https://github.com/deepspeedai/Megatron-DeepSpeed/pull/225), it is now also possible to use Megatron-LM's newer activation checkpointing mechanism. However, it is currently still not compatible with DeepSpeed, so you won't be able to combine it with any DeepSpeed technologies. We, the DeepSpeed team, compared the [older mechanism](ds_pretrain_gpt_1.3B.sh) and the [newer mechanism](ds_pretrain_gpt_1.3B_megatron_checkpointing.sh) on 1 DGX-2 node (16 V100) and found that the older mechanism has less memory saving (older max allocated 15241 MB, newer 12924 MB) and higher throughput (older 23.11 TFLOPs, newer 17.26 TFLOPs). Thus we currently still recommend using the older mechanism, both because of the similar checkpointing performance and (more importantly) because only the older mechanism is compatible with DeepSpeed (and in this case you can combine it with ZeRO to achieve more memory saving).
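As a hedged illustration of trying the toy example script referenced in this diff: the working directory and the absence of extra arguments below are assumptions for illustration; check the script itself for its actual expectations (data paths, number of GPUs, DeepSpeed installation).

```shell
# Hedged sketch: running the toy first-test script from its folder.
# Paths and prerequisites are assumptions; see the script for its real requirements.
cd Megatron-DeepSpeed/examples_deepspeed/rebase
bash ds_pretrain_gpt_125M.sh
```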

examples_deepspeed/universal_checkpointing/README.md

Lines changed: 2 additions & 2 deletions

@@ -73,9 +73,9 @@ bash examples_deepspeed/universal_checkpointing/megatron_gpt/run_universal_bf16.
```
This resumption script effects the loading of the universal checkpoint rather than the ZeRO checkpoint in the folder by passing the `--universal-checkpoint` command-line flag to the main training script (i.e., `pretrain_gpt.py`).

- Please see the corresponding [pull request](https://github.com/microsoft/Megatron-DeepSpeed/pull/276) for visualizations of matching loss values between original and universal checkpoint runs for the bf16 and fp16 examples.
+ Please see the corresponding [pull request](https://github.com/deepspeedai/Megatron-DeepSpeed/pull/276) for visualizations of matching loss values between original and universal checkpoint runs for the bf16 and fp16 examples.

- Combining sequence parallelism with data parallelism is another good use case for universal checkpointing; see the [sp pull request](https://github.com/microsoft/DeepSpeed/pull/4752) for an example and a visualization of matching loss values.
+ Combining sequence parallelism with data parallelism is another good use case for universal checkpointing; see the [sp pull request](https://github.com/deepspeedai/DeepSpeed/pull/4752) for an example and a visualization of matching loss values.

Notes: Model weights produced with the ```--no-pipeline-parallel``` parameter and model weights produced without it are currently not supported for mutual conversion.
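To make the `--universal-checkpoint` mechanism above concrete, here is a minimal hedged sketch of what a resumption launch might add on top of an existing recipe. The variable name and the abbreviated invocation are assumptions for illustration; the actual recipes in this folder assemble a much longer argument list.

```shell
# Hedged sketch: the only addition for a universal-checkpoint resume is the
# flag below; all other model/data/DeepSpeed arguments are assumed to match
# the original training recipe.
UNIVERSAL_ARGS="--universal-checkpoint"

# Illustrative invocation (argument list heavily abbreviated):
#   deepspeed pretrain_gpt.py ${UNIVERSAL_ARGS} <existing training options>
echo "Resume flag: ${UNIVERSAL_ARGS}"
```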
