
@github-classroom github-classroom bot commented Jul 16, 2024

👋! GitHub Classroom created this pull request as a place for your teacher to leave feedback on your work. It will update automatically. Don’t close or merge this pull request, unless you’re instructed to do so by your teacher.
In this pull request, your teacher can leave comments and feedback on your code. Click the Subscribe button to be notified if that happens.
Click the Files changed or Commits tab to see all of the changes pushed to main since the assignment started. Your teacher can see this too.

Notes for teachers

Use this PR to leave feedback. Here are some tips:

  • Click the Files changed tab to see all of the changes pushed to main since the assignment started. To leave comments on specific lines of code, put your cursor over a line of code and click the blue + (plus sign). To learn more about comments, read “Commenting on a pull request”.
  • Click the Commits tab to see the commits pushed to main. Click a commit to see specific changes.
  • If you turned on autograding, then click the Checks tab to see the results.
  • This page is an overview. It shows commits, line comments, and general comments. You can leave a general comment below.
    For more information about this pull request, read “Leaving assignment feedback in GitHub”.

Subscribed: @romerocruzsa @BJ-KHALED @Aliya-Daire @RonaldoLopezTucux @Edrop7

github-classroom bot and others added 14 commits July 16, 2024 22:01
Included team link and members. Also, added project description pdf document provided.
* feat: tried loading data through the huggingface API. decided to work on locally stored data. started prototyping with model and our CORAAL:ATL data.

* docs: added documentation for the prototyping-model notebook

* feat: added training loop and model benchmarks using only data subsets

* feat: evaluated WER score between pre-trained whisper and coraal:atl-trained whisper
* feat: started evaluating hyperparameter tuning configurations for improved model inference
* feat: completed preliminary EDA on Svarah-Indian American accented English dataset

* fix: added files to not be tracked in gitignore
* unzipping and extracting all files from coraal.zip to notebook

* looking into files in the coraal directory

* working on eda of interview date and making a df

* visual bar chart of interviews per year

* evaluation on other meta data now

* pause for today

* doing EDA on metadata folder with 6 txt files

* comparing histograms to bar charts

* test pushing data to branch to see if I get error and correcting error

* evaluating different visuals for metadata folder

* column analysis for dfs

* Identifying useful columns for features

* cleaning up df for running in model

* eda on cleaning df to be readable by model

* evaluating different cleaning processes for transcript_df

* checking over random samples of cleaned data for mistakes

* pause on work

* removing rows from transcript df of special symbols and keeping punctuation

* pause on sampling of cleaned df

* combining and cleaning df using only one loop

* cleaned the data for varying desires and implementing into sebastian code

* 5draft for cleaning data on transcript

* added in comments

* removed outputs

* looking for the max & min duration of audio files in dev folder

* attempting to create a desirable df

* fixed issue with getting a duplicate row in indian accent nb and commenting in cleaning data nb

* commenting on changes

* analysis of column data

* documenting steps

* commenting on steps of different code blocks

* commenting on steps for code blocks and what they do
* feat: created a manifest.json for saving our data from coraal:atl dataset

* fix: fixed gitignore to avoid tracking unwanted json files

* feat: migrated notebook prototyping code to python scripts for running training and inference through the command line

* fix: (at least trying) for merge conflict
* Categorical EDA w Draft One of Utils.py

* Feat: Uniqueness Scoring Methodology Created

* Documenting work from last week: establishing dependencies for whisper

* Code optimizations and data visualization improvements

* Uniqueness Score Generator for Transcript in utils

* Finished ATL data analysis and visuals
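
Several commits above mention evaluating a WER score between pre-trained Whisper and the CORAAL:ATL-trained Whisper. The commits do not show how that score was computed, so the following is only a minimal sketch of the standard word error rate definition (word-level edit distance divided by the number of reference words). The function name `word_error_rate` and the lowercase/whitespace normalization are assumptions made for illustration; the project may well have used a library and a different text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; not taken from the repository.
    """
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Comparing two models under this definition means running both on the same reference transcripts and averaging (or pooling) the scores; a lower value indicates fewer word-level errors.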

4 participants