Paper Datasets

An extracted list of published papers and author data across multiple scientific domains, which powers indiainresearch.org.

To view a subset of the data, visit: indiainresearch.org

Features

Data enriched using OpenAlex, arXiv, and LLM-based parsing of PDFs and conference websites.
1:1 mapping between paper authors and their associated institutions.
Author rank (undergrad, postgrad, postdoc, faculty, industry) available for certain venues.
Author affiliation history available.
Standard topics assigned to every paper from OpenAlex and LLMs.
Author and affiliation matching done using LLMs.

Data

Computer Science

Artificial Intelligence

Subfield	Venue	Data
Machine Learning	NeurIPS	2024
	ICML	2024
	ICLR	2024
Computer Vision	CVPR	2024
	WACV	2024
	ECCV	2024
Natural Language Processing	EMNLP	2024
	ACL	2024
	NAACL	2024
	EACL	2024
	TACL	2024
	LREC-COLING	2024

Systems and Networks

Subfield	Venue	Data
Mobile Computing	MobiSys	2024
	UbiComp/IMWUT	2024
	SenSys	2024
	MobiCom	2024
Databases	VLDB	2024
	SIGMOD	2024
	PODS	2024
Operating Systems	OSDI	2024
Computer Networks	NSDI	2024
	SIGCOMM	2024

Theory

Subfield	Venue	Data
Algorithms and Complexity	STOC	2024
	FOCS	2024
	SODA	2024

Schema

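Each paper record follows the schema below and can be parsed with the following pydantic models. The models are listed top-down for readability; to run them as code, define the types and author models before Paper, since Paper refers to them.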

Paper model

from pydantic import BaseModel

class Paper(BaseModel):
    id: str                                                         # unique ID for the paper, unused for now
    openalex_id: str | None = None                                  # corresponding ID from OpenAlex database
    doi: str | None = None                                          # DOI if present
    conf_id: str | None = None                                      # ID used by the corresponding conference
    title: str | None = None                                        # Paper title
    authorships: list[AuthorLink] = []                              # list of paper, author relations
    primary_location: dict | None = None                            # https://docs.openalex.org/api-entities/works/work-object#primary_location
    open_access: dict | None = None                                 # https://docs.openalex.org/api-entities/works/work-object#the-openaccess-object
    best_oa_location: dict | None = None                            # https://docs.openalex.org/api-entities/works/work-object#best_oa_location
    citation_normalized_percentile: dict | None = None              # FWCI percentile
    fwci: float | None = None                                       # FWCI
    primary_topic: TopicLink | None = None                          # top ranked topic
    publication_venue: str                                          # unique code for publication venue, usually same as conference name
    publication_year: int                                           # publication or conference year
    related_works: list[str] = []                                   # OpenAlex IDs of related works
    topics: list[TopicLink] = []                                    # top-ranked topics, up to 3
    keywords: list[dict] = []                                       # keywords from OpenAlex
    link: str | None = None                                         # Primary webpage for the paper. (prefer this)
    pdf_link: str | None = None                                     # Primary PDF for the paper if open access (prefer this)

    status: str | None = None                                       # Oral, Poster, Spotlight (from Paper Copilot)
    track: str | None = None                                        # Conference track
    github_link: str | None = None
    project_link: str | None = None
    video_link: str | None = None
    openaccess_link: str | None = None
    poster_link: str | None = None
    openreview_link: str | None = None
    arxiv_link: str | None = None
    proceeding_link: str | None = None

    author_names_from_paper: list[str] = []                         # author names scraped from the PDF or website, or taken from Paper Copilot (avoid using)
    aff_names_from_paper: list[str] = []                            # author affiliations scraped from the PDF or website, or taken from Paper Copilot (avoid using)
    aff_domains_from_paper: list[str] = []                          # author domains scraped from the PDF or website, or taken from Paper Copilot (avoid using)
    author_rank_from_paper: list[str] = []                          # undergrad, postgrad, faculty, researcher, engineer etc. (avoid using)
    openreview_ids_from_paper: list[str] = []                       # OpenReview IDs (avoid using)

    keywords_from_paper: list[str] = []                             # keywords from conference or pdf scraping
    primary_area_from_paper: str | None = None                      # primary area from paper
    overall_rating_from_paper: list[int] = []
    percent_overall_rating_from_paper: list[float] = []
    novelty_from_paper: list[int] = []
    percent_novelty_from_paper: list[float] = []
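
The Paper model marks `link` and `pdf_link` as the preferred URLs and carries several other optional link fields. A minimal sketch of choosing a URL for a record is shown below. It uses only the standard library and plain dicts, and it assumes records are JSON-like objects whose keys match the Paper fields. The helper name `best_url`, the fallback order, and the sample record (including its placeholder URL) are illustrative assumptions, not part of the dataset.

```python
from typing import Optional

# `pdf_link` and `link` are marked "prefer this" in the schema.
# The order of the remaining fields is an assumption made for illustration.
PREFERRED_FIELDS = ("pdf_link", "link")
FALLBACK_FIELDS = (
    "openaccess_link",
    "proceeding_link",
    "openreview_link",
    "arxiv_link",
)


def best_url(paper: dict) -> Optional[str]:
    """Return the first non-empty URL for a paper record, or None."""
    for field in (*PREFERRED_FIELDS, *FALLBACK_FIELDS):
        url = paper.get(field)
        if url:  # skips missing keys, None, and empty strings
            return url
    return None


# Hypothetical record with only the fields needed here; the URL is a placeholder.
sample = {
    "id": "p1",
    "publication_venue": "NeurIPS",
    "publication_year": 2024,
    "link": None,
    "pdf_link": None,
    "arxiv_link": "https://arxiv.org/abs/0000.00000",
}

print(best_url(sample))
```

Because every link field defaults to `None`, checking for a truthy value covers both absent keys and explicit nulls.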

Author model

from pydantic import BaseModel

class Institution(BaseModel):
    id: str                                                         # unique ID for the institution, unused for now
    openalex_id: str | None = None                                  # OpenAlex ID for the institution
    display_name: str                                               # https://docs.openalex.org/api-entities/institutions/institution-object
    display_name_acronyms: list[str] = []
    display_name_alternatives: list[str] = []
    ror: str | None = None
    homepage_url: str | None = None
    country_code: str | None = None
    type: InstitutionType                                           # institute type as a custom type
    latlon: tuple[float, float] | None = None                       # latitude and longitude

class AuthorInstitutionLink(BaseModel):
    rank: AuthorRank | None = None                                  # author rank as undergrad, postgrad, faculty, industry etc.
    institution: Institution | None = None                          # institution affiliation of author
    years: list[int] = []                                           # known years associated with institute

class InstitutionLink(BaseModel):
    rank: AuthorRank | None = None                                  # author rank as undergrad, postgrad, faculty, industry etc.
    institution: Institution | None = None                          # institution of author used in association with this corresponding paper

class Author(BaseModel):
    id: str                                                         # unique ID for the author, unused for now
    openalex_id: str | None = None                                  # OpenAlex ID for the author
    orcid: str | None = None                                        # ORCID (preferred)
    openreview_id: str | None = None                                # OpenReview (preferred)
    name: str
    email: str | None = None
    work: list[AuthorInstitutionLink] = []                          # work history, unused for now
    education: list[AuthorInstitutionLink] = []                     # education history, unused for now
    affiliations: list[AuthorInstitutionLink] = []                  # paper affiliation history of the author

class AuthorLink(BaseModel):
    position: AuthorPosition | None = None                          # first, middle or last author
    author: Author                                                  # author model
    institutions: list[InstitutionLink] = []                        # institutions of author used in association with this corresponding paper
    countries: list[str] = []                                       # countries these institutions belong to

class Topic(BaseModel):                                             # same as https://docs.openalex.org/api-entities/topics
    id: str
    openalex_id: str
    display_name: str
    subfield: dict
    field: dict
    domain: dict

class TopicLink(BaseModel):                                         # same as https://docs.openalex.org/api-entities/topics
    score: float
    topic: Topic
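
The AuthorLink entries in `authorships` tie each author to a position, institutions, and countries for a given paper. The sketch below shows one way these fields might be traversed, again using plain dicts shaped like the models above. The function names, the sample records, and the country values are assumptions for illustration; the actual encoding of `countries` should be checked against the data files.

```python
from collections import Counter
from typing import Optional


def papers_per_country(papers: list) -> Counter:
    """Count papers per country, counting each country at most once per paper."""
    counts = Counter()
    for paper in papers:
        seen = set()
        for link in paper.get("authorships", []):
            seen.update(link.get("countries", []))
        counts.update(seen)
    return counts


def first_author_name(paper: dict) -> Optional[str]:
    """Return the name of the author whose position is 'first', if any."""
    for link in paper.get("authorships", []):
        if link.get("position") == "first":
            return link["author"]["name"]
    return None


# Hypothetical records; names and country values are placeholders.
papers = [
    {
        "authorships": [
            {"position": "first", "author": {"id": "a1", "name": "Author A"}, "countries": ["IN"]},
            {"position": "last", "author": {"id": "a2", "name": "Author B"}, "countries": ["IN", "US"]},
        ]
    },
    {
        "authorships": [
            {"position": "first", "author": {"id": "a3", "name": "Author C"}, "countries": ["US"]},
        ]
    },
]

counts = papers_per_country(papers)
print(counts["IN"], counts["US"])
print(first_author_name(papers[0]))
```

Collecting countries into a set per paper avoids counting a paper twice when several of its authors share a country.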

Types

from enum import Enum

class InstitutionType(str, Enum):
    EDUCATION = "education"
    HEALTHCARE = "healthcare"
    COMPANY = "company"
    ARCHIVE = "archive"
    NONPROFIT = "nonprofit"
    GOVERNMENT = "government"
    FACILITY = "facility"
    FUNDER = "funder"
    OTHER = "other"

class AuthorRank(str, Enum):
    UNDERGRAD = "undergrad"
    POSTGRAD = "postgrad"
    POSTDOC = "postdoc"
    FACULTY = "faculty"
    INDUSTRY = "industry"

class AuthorPosition(str, Enum):
    FIRST = "first"
    MIDDLE = "middle"
    LAST = "last"
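
The Paper model notes that `author_rank_from_paper` may hold values such as researcher or engineer, which are not members of AuthorRank. A minimal sketch of mapping raw strings onto the enum is shown below. `AuthorRank` is copied from the types above so the block runs on its own; the helper `parse_rank` and its choice to return `None` for unknown values are assumptions, not the dataset's own logic.

```python
from enum import Enum
from typing import Optional


class AuthorRank(str, Enum):  # copied from the types above
    UNDERGRAD = "undergrad"
    POSTGRAD = "postgrad"
    POSTDOC = "postdoc"
    FACULTY = "faculty"
    INDUSTRY = "industry"


def parse_rank(raw: Optional[str]) -> Optional[AuthorRank]:
    """Map a raw string to an AuthorRank, or None if it is not a known rank."""
    if raw is None:
        return None
    try:
        # Enum lookup by value; normalise case and surrounding whitespace first.
        return AuthorRank(raw.strip().lower())
    except ValueError:
        return None


print(parse_rank(" Faculty "))
print(parse_rank("engineer"))
# Members of a str-based Enum compare equal to their plain string values.
print(AuthorRank.POSTDOC == "postdoc")
```

Since the enums subclass `str`, records can be filtered against plain string values without converting them first.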

Sources

OpenAlex, arXiv, dblp, Paper Copilot, ACL Anthology.

Note: Some data has been analyzed by AI and may be incorrect.

Need assistance?

Custom requests for scraping and licensing can be sent to contact[AT]indiainresearch.org.

License and Attribution

If you find this useful, please consider starring and sharing the repository.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
