yt dlp initial pull request #63
base: main
Conversation
processors_to_run: "0:"
workspace_dir: /workspace/nemo_capstone
final_manifest: ${workspace_dir}/final_manifest.json

processors:
Please add:
- NVIDIA copyright text
- config documentation text
Where do you use this docker compose file? Is it possible to run the scripts without it?
Make sure to install the yt-dlp tool before running this code.

Tool link: https://github.com/yt-dlp/yt-dlp
Since you use a third-party tool here, could you add an explicit check in the script for whether the tool is installed, and log a message asking the user to install it if it is missing?
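One way such a check could look is sketched below. It assumes the executable is available on PATH under the name `yt-dlp`; the helper name `ensure_tool_installed` and its parameters are hypothetical and not part of this PR.

```python
import logging
import shutil

logger = logging.getLogger(__name__)


def ensure_tool_installed(
    tool: str = "yt-dlp",
    install_url: str = "https://github.com/yt-dlp/yt-dlp",
) -> str:
    """Return the full path of `tool`, or raise if it is not on PATH.

    Hypothetical helper for illustration only; not part of the PR.
    """
    # shutil.which returns None when no matching executable is found on PATH.
    path = shutil.which(tool)
    if path is None:
        message = (
            f"'{tool}' was not found on PATH. "
            f"Please install it before running this code: {install_url}"
        )
        logger.error(message)
        raise RuntimeError(message)
    return path
```

Calling this once in the processor's constructor would make a missing installation fail fast with a clear message, rather than partway through a download.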
    Args:
        raw_data_dir (str): Root directory of the files to be added to the manifest. Recursively searches for files with the given 'extension'.
        output_field (str): Field to store the file paths in the dataset. Default is "audio_filepath".
        extension (str): Extension of the files to include in the dataset. Default is "wav".
        **kwargs: Additional keyword arguments for the base class `BaseParallelProcessor`.
    """

    def __init__(
        self,
        raw_data_dir: str,
        output_field: str = "audio_filepath",
        # extension: str = "wav",
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.raw_data_dir = Path(raw_data_dir)
        self.output_field = output_field
        file_path = "sdp/processors/datasets/ytdlp/search_terms.json"
- If you don't use "extension", remove it from both the commented-out code and the function's documentation.
- For consistency, we use "key" instead of "field". Please replace output_field with output_key.
- Could you give file_path a more informative name? Also, could you remove its hard-coded value and make it a parameter passed from the config file, with a default value?
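The three points above could be addressed roughly as follows. This is only a sketch: the class name `ExampleManifestProcessor`, the parameter name `search_terms_file`, and the constant `DEFAULT_SEARCH_TERMS_FILE` are hypothetical, and `BaseParallelProcessor` is replaced by a stand-in so the snippet runs on its own.

```python
from pathlib import Path

# Assumed default, taken from the path that is currently hard-coded in the PR.
DEFAULT_SEARCH_TERMS_FILE = "sdp/processors/datasets/ytdlp/search_terms.json"


class BaseParallelProcessor:
    """Stand-in for the real base class so this sketch is self-contained."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


class ExampleManifestProcessor(BaseParallelProcessor):
    """Hypothetical processor illustrating the requested changes.

    Args:
        raw_data_dir (str): Root directory of the files to be added to the manifest.
        output_key (str): Key to store the file paths in the dataset.
            Default is "audio_filepath".
        search_terms_file (str): Path to the JSON file with search terms.
            Can be overridden from the config file.
        **kwargs: Additional keyword arguments for the base class.
    """

    def __init__(
        self,
        raw_data_dir: str,
        output_key: str = "audio_filepath",
        search_terms_file: str = DEFAULT_SEARCH_TERMS_FILE,
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.raw_data_dir = Path(raw_data_dir)
        self.output_key = output_key
        # No longer hard-coded: the value comes from the config, with a default.
        self.search_terms_file = Path(search_terms_file)
```

With this shape, the unused "extension" argument is gone from both the signature and the docstring, and the search terms path can be set in the YAML config like any other processor argument.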
    Args:
        links_filepath_field (str): Field to get the YouTube video link.
        output_audio_path (str): Path to save the downloaded audio files.
        **kwargs: Additional keyword arguments for the base class `BaseParallelProcessor`.
Please use "key" instead of "field".
If you want to use this as an example rather than as a working configuration, please mention somewhere how the user should work with this file, and remove any personal information from it, such as the name.
No description provided.