Scrape top GitHub repositories and users based on keywords.
I used this tool to analyze the top 1k machine learning users and create an interactive map to search for users based on their location.
Installation
pip install top-github-scraper
Add Credentials
To make sure you can scrape many repositories and users, add your GitHub credentials to a .env file.
touch .env
Add your username and token to the .env file:
GITHUB_USERNAME=yourusername
GITHUB_TOKEN=yourtoken
View full documentation here.
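Before starting a long scrape, it can help to confirm that both variables are actually present in the .env file. The sketch below is only a sanity check written with the standard library; the helper names `read_env_file` and `missing_credentials` are hypothetical and are not part of top-github-scraper, and how the package itself loads the file is not shown here.

```python
from pathlib import Path

# The two variable names come from the .env example above.
REQUIRED_KEYS = ("GITHUB_USERNAME", "GITHUB_TOKEN")


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_credentials(path=".env"):
    """Return the required keys that are absent or empty in the file."""
    env_path = Path(path)
    values = read_env_file(env_path) if env_path.exists() else {}
    return [key for key in REQUIRED_KEYS if not values.get(key)]


missing = missing_credentials()
if missing:
    print("Missing in .env:", ", ".join(missing))
else:
    print("Both credentials are set.")
```

An empty list from `missing_credentials()` means both entries were found with non-empty values; it does not verify that the token is valid.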
from top_github_scraper import get_top_repo_urls
get_top_repo_urls(keyword="machine learning", stop_page=10)
Output at top_repo_urls_<keyword>_<sort_by>_<start_page>_<end_page>.json:
[
    "/josephmisiti/awesome-machine-learning",
    "/wepe/MachineLearning",
    "/udacity/machine-learning",
    "/Jack-Cherish/Machine-Learning",
    "/ZuzooVn/machine-learning-for-software-engineers",
    "/rasbt/python-machine-learning-book",
    "/lawlite19/MachineLearning_Python",
    "/lazyprogrammer/machine_learning_examples",
    "/trekhleb/homemade-machine-learning",
    "/ujjwalkarn/Machine-Learning-Tutorials"
]
from top_github_scraper import get_top_repos
get_top_repos("machine learning", stop_page=10)
Output for 1 repository at top_repo_info_<keyword>_<sort_by>_<start_page>_<end_page>.json:
{
        "stargazers_count": 48620,
        "forks_count": 12155,
        "contributors": {
            "login": [
                "josephmisiti",
                "josephmmisiti",
                "hslatman",
                "0asa",
                "ajkl",
                "ipcenas",
                "cogmission",
                "spekulatius",
                "basickarl",
                "NathanEpstein"
            ],
            "url": [
                "https://api.github.com/users/josephmisiti",
                "https://api.github.com/users/josephmmisiti",
                "https://api.github.com/users/hslatman",
                "https://api.github.com/users/0asa",
                "https://api.github.com/users/ajkl",
                "https://api.github.com/users/ipcenas",
                "https://api.github.com/users/cogmission",
                "https://api.github.com/users/spekulatius",
                "https://api.github.com/users/basickarl",
                "https://api.github.com/users/NathanEpstein"
            ],
            "contributions": [
                671,
                105,
                21,
                12,
                11,
                9,
                8,
                7,
                7,
                7
            ]
        }
    }
from top_github_scraper import get_top_contributors
get_top_contributors("machine learning", stop_page=10)
Output at top_contributor_info_<keyword>_<sort_by>_<start_page>_<end_page>.csv:
|  | login | url | type | name | company | location | email | hireable | bio | public_repos | public_gists | followers | following |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | josephmisiti | https://api.github.com/users/josephmisiti | User | Joseph Misiti | Math & Pencil | "Brooklyn, NY" |  | True | Mathematician & Co-founder of Math & Pencil | 229 | 142 | 2705 | 275 |
| 1 | josephmmisiti | https://api.github.com/users/josephmmisiti | User |  |  |  |  |  |  | 0 | 0 | 2 | 0 |
| 2 | hslatman | https://api.github.com/users/hslatman | User | Herman Slatman | DistributIT |  |  |  |  | 133 | 20 | 469 | 67 |
| 3 | 0asa | https://api.github.com/users/0asa | User | Vincent Botta |  | Belgium |  |  | "Innovation Engineer @evs-broadcast, previously Data Scientist @kensuio, E-Marketing Tools Manager @Diagenode, cofounder @Antibody-Adviser and photographer" | 35 | 15 | 25 | 16 |
| 4 | ajkl | https://api.github.com/users/ajkl | User | Ajinkya Kale |  |  | [email protected] |  |  | 58 | 1 | 29 | 4 |
| 5 | ipcenas | https://api.github.com/users/ipcenas | User |  |  |  |  |  |  | 79 | 0 | 1 | 0 |
| 6 | cogmission | https://api.github.com/users/cogmission | User | David Ray |  | Third planet from the sun... | [email protected] |  | Humanity's freedom and abundance through the pursuit of technological innovation in the area of cognitive applications - Cognition Mission | 30 | 19 | 54 | 44 |
| 7 | spekulatius | https://api.github.com/users/spekulatius | User | Peter Thaleikis | @bringyourownideas | 127.0.0.1 |  | True | Software engineer focused on solutions using open source and simply filling in the gaps to fulfill the requirements. | 42 | 1 | 232 | 920 |
| 8 | basickarl | https://api.github.com/users/basickarl | User | Karl Morrison |  | "Malmö, Sweden" | [email protected] |  | The question is: Will you take me seriously | 5 | 1 | 12 | 6 |
| 9 | NathanEpstein | https://api.github.com/users/NathanEpstein | User | Nathan Epstein |  | "New York, NY" | [email protected] | True |  | 23 | 12 | 208 | 0 |
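In the repository JSON shown earlier, the `contributors` object stores `login`, `url`, and `contributions` as three parallel lists. A minimal sketch of zipping them into one record per contributor follows; `contributors_to_records` is a hypothetical helper, not part of the package, and the sample is trimmed from the output above.

```python
def contributors_to_records(contributors):
    """Turn parallel login/url/contributions lists into one dict per contributor."""
    logins = contributors["login"]
    urls = contributors["url"]
    counts = contributors["contributions"]
    if not (len(logins) == len(urls) == len(counts)):
        raise ValueError("contributor lists have different lengths")
    return [
        {"login": login, "url": url, "contributions": count}
        for login, url, count in zip(logins, urls, counts)
    ]


# Sample trimmed to the first three contributors of the output shown earlier.
repo = {
    "stargazers_count": 48620,
    "forks_count": 12155,
    "contributors": {
        "login": ["josephmisiti", "josephmmisiti", "hslatman"],
        "url": [
            "https://api.github.com/users/josephmisiti",
            "https://api.github.com/users/josephmmisiti",
            "https://api.github.com/users/hslatman",
        ],
        "contributions": [671, 105, 21],
    },
}

records = contributors_to_records(repo["contributors"])
total = sum(record["contributions"] for record in records)
for record in records:
    share = record["contributions"] / total
    print(f'{record["login"]}: {record["contributions"]} commits ({share:.0%} of sample)')
```

Working with one dict per contributor makes it easier to sort, filter, or join against the contributor CSV by `login`.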
from top_github_scraper import get_top_users
get_top_users("machine learning", stop_page=10)
Output at top_user_info_<keyword>_<start_page>_<end_page>.csv:
|  | login | url | type | name | company | location | email | hireable | bio | public_repos | public_gists | followers | following |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | rasbt | https://api.github.com/users/rasbt | User | Sebastian Raschka | UW-Madison | "Madison, WI" |  |  | "Machine Learning researcher & open source contributor. Author of ""Python Machine Learning."" Asst. Prof. of Statistics @ UW-Madison." | 71 | 5 | 13888 | 35 |
| 1 | tqchen | https://api.github.com/users/tqchen | User | Tianqi Chen | "CMU, OctoML" |  |  |  | Large scale Machine Learning | 28 | 1 | 8611 | 126 |
| 2 | halfrost | https://api.github.com/users/halfrost | User | halfrost | @Alibaba | Shanghai China | [email protected] |  | 💪天道酬勤,勤能补拙。博观而约取,厚积而薄发。Gopher / Rustacean / iOS Dev. / Machine Learning / Retired acmer / Math / Philosophy / Technical Writer. | 22 | 0 | 8566 | 314 |
| 3 | ageron | https://api.github.com/users/ageron | User | Aurélien Geron |  | Paris |  |  | Author of the book Hands-On Machine Learning with Scikit-Learn and TensorFlow. Former PM of YouTube video classification and founder & CTO of a telco operator. | 43 | 16 | 8383 | 2 |
| 4 | chiphuyen | https://api.github.com/users/chiphuyen | User | Chip Huyen | https://snorkel.ai | "Mountain View, CA" |  | True | Developing tools and best practices for machine learning production. | 19 | 1 | 7839 | 15 |
| 5 | rhiever | https://api.github.com/users/rhiever | User | Randy Olson | FOXO BioScience | "Vancouver, WA" | [email protected] |  | "Chief Data Scientist, @FOXOBioScience. AI, Machine Learning, and Data Visualization specialist. Community leader for /r/DataIsBeautiful." | 77 | 17 | 5363 | 13 |
| 6 | lexfridman | https://api.github.com/users/lexfridman | User | Lex Fridman | MIT | "Cambridge, MA" |  |  | "AI researcher working on autonomous vehicles, human-robot interaction, and machine learning at MIT and beyond." | 2 | 0 | 5031 | 0 |
| 7 | eriklindernoren | https://api.github.com/users/eriklindernoren | User | Erik Linder-Norén |  | "Stockholm, Sweden" | [email protected] |  | "ML engineer at Apple. Excited about machine learning, basketball and building things." | 24 | 0 | 3764 | 11 |
| 8 | roboticcam | https://api.github.com/users/roboticcam | User | A/Prof Richard Xu 徐亦达教授 | University of Technology Sydney | Sydney Australia |  |  | "I am an A/Professor in Machine Learning at UTS. manage a large research team of postdoc, PhD students close to 30 people" | 10 | 0 | 3561 | 0 |
| 9 | ogrisel | https://api.github.com/users/ogrisel | User | Olivier Grisel | Inria | "Paris, France" | [email protected] |  | Machine Learning Engineer a Inria Saclay (Parietal team). | 174 | 93 | 3237 | 116 |
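Once the user CSV exists, it can be loaded and ranked with the standard library alone. The sketch below assumes the file has a header row with the column names shown in the table above; the inline sample stands in for the real file so the example runs on its own, and `load_users` and `top_by_followers` are hypothetical helper names.

```python
import csv
import io

NUMERIC_FIELDS = ("public_repos", "public_gists", "followers", "following")


def load_users(csv_file):
    """Read user rows from an open CSV file and cast the count columns to int."""
    rows = []
    for row in csv.DictReader(csv_file):
        for field in NUMERIC_FIELDS:
            value = row.get(field)
            # int(float(...)) also accepts values written as "71.0";
            # empty or missing cells are treated as 0 (an assumption).
            row[field] = int(float(value)) if value else 0
        rows.append(row)
    return rows


def top_by_followers(rows, n=3):
    """Return the n rows with the most followers, highest first."""
    return sorted(rows, key=lambda row: row["followers"], reverse=True)[:n]


# Inline sample with a subset of the columns; values are taken from the table above.
SAMPLE_CSV = """login,location,public_repos,public_gists,followers,following
ogrisel,"Paris, France",174,93,3237,116
rasbt,"Madison, WI",71,5,13888,35
tqchen,,28,1,8611,126
"""

users = load_users(io.StringIO(SAMPLE_CSV))
print([user["login"] for user in top_by_followers(users, n=2)])
# prints ['rasbt', 'tqchen']
```

To read the real output instead of the sample, pass `open(path, newline="", encoding="utf-8")` to `load_users`, where `path` is the generated top_user_info_... file.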
View a full list of parameters here.
top-github-scraper scrapes the owners as well as the contributors of the top repositories that appear in GitHub's search results for a specific keyword.
For each user, top-github-scraper scrapes 16 data points:
- login: Username
- url: URL of the user
- type: Whether this account is a user or an organization
- name: Name of the user
- company: User's company
- location: User's location
- email: User's email
- hireable: Whether the user is hireable
- bio: Short description of the user
- public_repos: Number of public repositories the user has (including forked repositories)
- public_gists: Number of public gists the user has (including forked gists)
- followers: Number of followers the user has
- following: Number of people the user is following
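Since `location` is the field that drives a location-based search, a first step is often to count users per location string. The sketch below does that for a list of user dicts; `count_by_location` is a hypothetical helper and the sample values are taken from the tables above.

```python
from collections import Counter


def count_by_location(users):
    """Count users per non-empty location string, exactly as entered."""
    counts = Counter()
    for user in users:
        location = (user.get("location") or "").strip()
        if location:
            counts[location] += 1
    return counts


users = [
    {"login": "ageron", "location": "Paris"},
    {"login": "ogrisel", "location": "Paris, France"},
    {"login": "tqchen", "location": ""},
    {"login": "rasbt", "location": "Madison, WI"},
]

for location, count in count_by_location(users).most_common():
    print(f"{location}: {count}")
```

Locations are free-form text, so "Paris" and "Paris, France" are counted separately here, and entries such as "127.0.0.1" or "Third planet from the sun..." will not resolve to a real place. Plotting users on a map would need an extra normalization or geocoding step that this sketch does not cover.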

