3 changes: 2 additions & 1 deletion nemo/utils/import_utils.py
@@ -12,7 +12,8 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-# This file is taken from https://github.com/NVIDIA/NeMo-Curator, which is adapted from cuML's safe_imports module:
+# This file is taken from https://github.com/NVIDIA-NeMo/Curator/blob/dask/nemo_curator/utils/import_utils.py,
+# which is adapted from cuML's safe_imports module:
 # https://github.com/rapidsai/cuml/blob/e93166ea0dddfa8ef2f68c6335012af4420bc8ac/python/cuml/internals/safe_imports.py
2 changes: 1 addition & 1 deletion tutorials/llm/llama/README.rst
@@ -16,7 +16,7 @@ This repository contains Jupyter Notebook tutorials using the NeMo Framework for
 - Perform LoRA PEFT on Llama 3 8B Instruct using a dataset for bio-medical domain question answering. Deploy multiple LoRA adapters with NVIDIA NIM.
 * - `Llama 3.1 Law-Domain LoRA Fine-Tuning and Deployment with NeMo Framework and NVIDIA NIM <./sdg-law-title-generation>`_
 - `Law StackExchange <https://huggingface.co/datasets/ymoslem/Law-StackExchange>`_
-- Perform LoRA PEFT on Llama 3.1 8B Instruct using a synthetically augmented version of Law StackExchange with NeMo Framework, followed by deployment with NVIDIA NIM. As a prerequisite, follow the tutorial for `data curation using NeMo Curator <https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/peft-curation-with-sdg>`_.
+- Perform LoRA PEFT on Llama 3.1 8B Instruct using a synthetically augmented version of Law StackExchange with NeMo Framework, followed by deployment with NVIDIA NIM. As a prerequisite, follow the tutorial for `data curation using NeMo Curator <https://github.com/NVIDIA-NeMo/Curator/tree/dask/tutorials/peft-curation-with-sdg>`_.
 * - `Llama 3.1 Pruning and Distillation with NeMo Framework <./pruning-distillation>`_
 - `WikiText-103-v1 <https://huggingface.co/datasets/Salesforce/wikitext/viewer/wikitext-103-v1>`_
 - Perform pruning and distillation on Llama 3.1 8B using the WikiText-103-v1 dataset with NeMo Framework.
4 changes: 2 additions & 2 deletions tutorials/llm/llama/domain-adaptive-pretraining/README.md
@@ -17,9 +17,9 @@ Here, we share a tutorial with best practices on custom tokenization and DAPT (D
 
 * In this tutorial, we will leverage chip domain/hardware datasets from open-source GitHub repositories, wiki URLs, and academic papers. Therefore, as a prerequisite, the user should curate the domain-specific and general-purpose data using NeMo Curator and place them in the directories mentioned below.
 
-* `./code/data` should contain curated data from the chip domain after processing with NeMo Curator. The playbook for DAPT data curation can be found [here](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/dapt-curation)
+* `./code/data` should contain curated data from the chip domain after processing with NeMo Curator. The playbook for DAPT data curation can be found [here](https://github.com/NVIDIA-NeMo/Curator/tree/dask/tutorials/dapt-curation). Please note that this tutorial uses NeMo Curator version 0.9.0 or lower.
 
-* `./code/general_data` should contain open-source general-purpose data that llama-2 was trained on. This data will help identify token/vocabulary differences between general-purpose and domain-specific datasets. Data can be downloaded from [Wikipedia](https://huggingface.co/datasets/legacy-datasets/wikipedia), [CommonCrawl](https://data.commoncrawl.org/), etc., and curated with [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/single_node_tutorial)
+* `./code/general_data` should contain open-source general-purpose data that llama-2 was trained on. This data will help identify token/vocabulary differences between general-purpose and domain-specific datasets. Data can be downloaded from [Wikipedia](https://huggingface.co/datasets/legacy-datasets/wikipedia), [CommonCrawl](https://data.commoncrawl.org/), etc., and curated with [NeMo Curator](https://github.com/NVIDIA-NeMo/Curator/tree/dask/tutorials/single_node_tutorial). Please note that this tutorial uses NeMo Curator version 0.9.0 or lower.
 
 
 ## Custom Tokenization for DAPT
@@ -74,7 +74,7 @@
 "- Step 6: Merge the new embeddings with the original embedding table (in llama2-2-70b) to get the final <b>Domain Adapted Tokenizer</b>.\n",
 "## Data\n",
 "\n",
-"In this playbook, we will leverage chip domain/hardware datasets from open-source GitHub repositories, wiki URLs, and academic papers. Data has been processed and curated using [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator/tree/main) as shown in this [playbook](https://github.com/jvamaraju/ndc_dapt_playbook/tree/dapt_jv)"
+"In this playbook, we will leverage chip domain/hardware datasets from open-source GitHub repositories, wiki URLs, and academic papers. Data has been processed and curated using [NeMo Curator](https://github.com/NVIDIA-NeMo/Curator/tree/dask) as shown in this [playbook](https://github.com/jvamaraju/ndc_dapt_playbook/tree/dapt_jv). Please note that this tutorial uses NeMo Curator version 0.9.0 or lower."
 ]
 },
 {
@@ -69,7 +69,7 @@
 "source": [
 "# Data\n",
 "\n",
-"* In this playbook, we will leverage chip domain/hardware datasets from open-source GitHub repositories, wiki URLs, and academic papers. Data has been processed and curated using [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator/tree/main) as shown in this [playbook](https://github.com/jvamaraju/ndc_dapt_playbook/tree/dapt_jv)"
+"* In this playbook, we will leverage chip domain/hardware datasets from open-source GitHub repositories, wiki URLs, and academic papers. Data has been processed and curated using [NeMo Curator](https://github.com/NVIDIA-NeMo/Curator/tree/dask) as shown in this [playbook](https://github.com/jvamaraju/ndc_dapt_playbook/tree/dapt_jv). Please note that this tutorial uses NeMo Curator version 0.9.0 or lower."
 ]
 },
 {
2 changes: 1 addition & 1 deletion tutorials/llm/reasoning/README.md
@@ -8,7 +8,7 @@ This recipe is inspired by the [Llama Nemotron family of models](https://www.nvi
 
 Check out the following resources that are used in this tutorial.
 * [Llama-Nemotron-Post-Training-Data](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset), an open source dataset for instilling reasoning behavior in large language models.
-* The tutorial on [curating the Llama Nemotron Reasoning Dataset with NVIDIA NeMo Curator](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/llama-nemotron-data-curation).
+* The tutorial on [curating the Llama Nemotron Reasoning Dataset with NVIDIA NeMo Curator](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/text/llama-nemotron-data-curation).
 You will need the output from that tutorial for training a reasoning model.
 
 ## Hardware Requirements
4 changes: 2 additions & 2 deletions tutorials/llm/reasoning/Reasoning-SFT.ipynb
@@ -21,7 +21,7 @@
 "### 🧰 Tools and Resources\n",
 "* [NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html)\n",
 "* [Llama-Nemotron-Post-Training-Data](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset), an open source dataset for instilling reasoning behavior in large language models.\n",
-"* [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator) for data curation\n",
+"* [NeMo Curator](https://github.com/NVIDIA-NeMo/Curator) for data curation\n",
 "\n",
 "## 📌 Requirements\n",
 "\n",
@@ -32,7 +32,7 @@
 "* A valid Hugging Face API token with access to the [Meta LLaMa 3.1-8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model (since this is a gated model).\n",
 "\n",
 "### Dataset\n",
-"To follow along, you would need an appropriate reasoning dataset. Check out the tutorial on [curating the Llama Nemotron Reasoning Dataset with NVIDIA NeMo Curator](https://github.com/NVIDIA-NeMo/Curator/tree/dask/tutorials/llama-nemotron-data-curation).\n",
+"To follow along, you would need an appropriate reasoning dataset. Check out the tutorial on [curating the Llama Nemotron Reasoning Dataset with NVIDIA NeMo Curator](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/text/llama-nemotron-data-curation).\n",
 "You will need the output from that tutorial as the training set input to this playbook!\n",
 "\n",
 "### Hardware Requirements\n",