Skip to content

Conversation

mheryerznkanyan
Copy link

No description provided.

Comment on lines +1 to +5
processors_to_run: "0:"
workspace_dir: /workspace/nemo_capstone
final_manifest: ${workspace_dir}/final_manifest.json

processors:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add

  1. Nvidia copyright text
  2. config documentation text

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do you use this docker compose file? Is it possible to run the scripts without it?

Comment on lines +25 to +27
Make sure to install yt-dlp tool before funning this code.

Tool link: https://github.com/yt-dlp/yt-dlp
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you use a 3rd party tool here, could you add a specific check in the script whether the tool is installed or not, and log message for user to install it?

Comment on lines +29 to +46
Args:
raw_data_dir (str): Root directory of the files to be added to the manifest. Recursively searches for files with the given 'extension'.
output_field (str): Field to store the file paths in the dataset. Default is "audio_filepath".
extension (str): Extension of the files to include in the dataset. Default is "wav".
**kwargs: Additional keyword arguments for the base class `BaseParallelProcessor`.
"""

def __init__(
self,
raw_data_dir: str,
output_field: str = "audio_filepath",
# extension: str = "wav",
**kwargs,
):
super().__init__(**kwargs)
self.raw_data_dir = Path(raw_data_dir)
self.output_field = output_field
file_path = "sdp/processors/datasets/ytdlp/search_terms.json"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. If you don't use an "extension", remove it from both the commented version and from documentation of the function
  2. For convenience we use "key" instead of "field". Replace output_field with output_key
  3. Maybe it would be more convenient to change file_path to more informative name? Also could you remove it's hardcoding, make it a variable passed from config file with default value?

Comment on lines +28 to +31
Args:
links_filepath_field (str): Field to get the YouTube video link.
output_audio_path (str): Path to save the downloaded audio files.
**kwargs: Additional keyword arguments for the base class `BaseParallelProcessor`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use "key" instead of field

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to use this as an example, not as a working configuration, please mention somewhere how user should work with this file and remove any personal information from here, like the name

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants