notnews/nbc_transcripts

NBC transcripts

NBC used to provide transcripts of some of its shows at the now-defunct http://www.nbcnews.com/id/3719710. An archived copy of that page is available at https://web.archive.org/web/20170601234403/http://www.nbcnews.com/id/3719710.

nbc_crawl.py crawls the links to the news transcripts and produces a list of all of them. nbc_extract.py then downloads and parses each transcript, appends some metadata, and dumps the result to a CSV file.
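As a rough illustration of the crawl step, here is a minimal sketch of collecting links and their titles from an index page using only the standard library. It is not the code in nbc_crawl.py: the `LinkCollector` class and `collect_links` function are hypothetical names, and the sketch assumes the page has already been fetched as an HTML string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (absolute URL, link text) pairs from <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # URL of the anchor currently being read
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_links(html, base_url):
    """Return every (url, title) pair found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

A real crawl would also need to filter for transcript pages and follow any pagination, which depends on the structure of the archived site.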

The raw HTML files and the final CSV can be downloaded from http://dx.doi.org/10.7910/DVN/ND1TCV.

For a list of all the links, along with the title of the show and the date, see here.

Here's the yearly breakdown of the final dataset (5,369 rows):

2008 2009 2010 2011 2012 2013 2014 
  76  434  752 1042 1164 1177  724 
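As a quick sanity check, the yearly counts can be summed to confirm they match the stated total. The counts below are copied from the table; the variable names are only illustrative.

```python
# Yearly row counts copied from the breakdown above.
yearly_counts = {
    2008: 76,
    2009: 434,
    2010: 752,
    2011: 1042,
    2012: 1164,
    2013: 1177,
    2014: 724,
}

total = sum(yearly_counts.values())
print(total)  # 5369, matching the stated number of rows
```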

Notes

  • The scripts date from 2014.
  • Some news transcripts have a typo in the date string, e.g. 'Thusday', 'Februrary', etc. For those transcripts the script failed to fill in the date column.
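One possible way to recover dates with misspelled day or month names is to snap each word to the closest known name before parsing. This is a minimal sketch, not part of the original scripts. The date format `"%A, %B %d, %Y"` is an assumption about what the transcripts contain, and `fix_token` and `parse_date` are hypothetical helpers.

```python
import difflib
import re
from datetime import datetime

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
        "Saturday", "Sunday"]


def fix_token(word):
    """Replace a word with the closest day or month name, if one is close enough."""
    match = difflib.get_close_matches(word.capitalize(), MONTHS + DAYS,
                                      n=1, cutoff=0.8)
    return match[0] if match else word


def parse_date(text):
    """Parse a date string, tolerating small typos in day and month names.

    Assumes a format like 'Thursday, February 6, 2014'; returns None
    when the corrected string still does not parse.
    """
    fixed = re.sub(r"[A-Za-z]+", lambda m: fix_token(m.group(0)), text)
    try:
        return datetime.strptime(fixed, "%A, %B %d, %Y").date()
    except ValueError:
        return None
```

With this approach, `parse_date("Thusday, Februrary 6, 2014")` resolves to 6 February 2014. Strings that still fail to parse return `None`, so they can be flagged for manual review instead of being silently left blank.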
