Extracted paper data that powers indiainresearch.org. Contains paper and author data for popular CS conferences in 2024.

IndiaInResearch/paper-data

Paper Datasets

Extracted list of published papers and author data across multiple domains of science which powers indiainresearch.org.

To view a subset of the data, visit: indiainresearch.org

Features

  1. Data enriched using OpenAlex, arXiv, and LLM-based parsing of PDFs and conference websites.
  2. 1:1 mapping between paper authors and their associated institutions.
  3. Author rank (undergrad, postgrad, faculty, industry) available for certain venues.
  4. Author affiliation history available.
  5. Standard topics assigned to every paper from OpenAlex and LLMs.
  6. Author and affiliation matching done using LLMs.

Data

Computer Science

Artificial Intelligence

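| Subfield | Venue | Data |
| --- | --- | --- |
| Machine Learning | NeurIPS | 2024 |
| | ICML | 2024 |
| | ICLR | 2024 |
| Computer Vision | CVPR | 2024 |
| | WACV | 2024 |
| | ECCV | 2024 |
| Natural Language Processing | EMNLP | 2024 |
| | ACL | 2024 |
| | NAACL | 2024 |
| | EACL | 2024 |
| | TACL | 2024 |
| | LREC-COLING | 2024 |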

Systems and Networks

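| Subfield | Venue | Data |
| --- | --- | --- |
| Mobile Computing | MobiSys | 2024 |
| | UbiComp/IMWUT | 2024 |
| | SenSys | 2024 |
| | MobiCom | 2024 |
| Databases | VLDB | 2024 |
| | SIGMOD | 2024 |
| | PODS | 2024 |
| Operating Systems | OSDI | 2024 |
| Computer Networks | NSDI | 2024 |
| | SIGCOMM | 2024 |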

Theory

| Subfield | Venue | Data |
| --- | --- | --- |
| Algorithms and Complexity | STOC | 2024 |
| | FOCS | 2024 |
| | SODA | 2024 |

Schema

Each paper follows the schema below and can be parsed with these pydantic models:

Paper model

from pydantic import BaseModel

class Paper(BaseModel):
    id: str                                                         # unique ID for the paper, unused for now
    openalex_id: str | None = None                                  # corresponding ID from OpenAlex database
    doi: str | None = None                                          # DOI if present
    conf_id: str | None = None                                      # ID used by the corresponding conference
    title: str | None = None                                        # Paper title
    authorships: list[AuthorLink] = []                              # list of paper, author relations
    primary_location: dict | None = None                            # https://docs.openalex.org/api-entities/works/work-object#primary_location
    open_access: dict | None = None                                 # https://docs.openalex.org/api-entities/works/work-object#the-openaccess-object
    best_oa_location: dict | None = None                            # https://docs.openalex.org/api-entities/works/work-object#best_oa_location
    citation_normalized_percentile: dict | None = None              # FWCI percentile
    fwci: float | None = None                                       # FWCI
    primary_topic: TopicLink | None = None                          # top ranked topic
    publication_venue: str                                          # unique code for publication venue, usually same as conference name
    publication_year: int                                           # publication or conference year
    related_works: list[str] = []                                   # OpenAlex IDs of related works
    topics: list[TopicLink] = []                                    # top-ranked topics, up to 3
    keywords: list[dict] = []                                       # keywords from OpenAlex
    link: str | None = None                                         # Primary webpage for the paper. (prefer this)
    pdf_link: str | None = None                                     # Primary PDF for the paper if open access (prefer this)

    status: str | None = None                                       # Oral, Poster, Spotlight (from Paper Copilot)
    track: str | None = None                                        # Conference track
    github_link: str | None = None                                   
    project_link: str | None = None
    video_link: str | None = None
    openaccess_link: str | None = None
    poster_link: str | None = None
    openreview_link: str | None = None
    arxiv_link: str | None = None
    proceeding_link: str | None = None

    author_names_from_paper: list[str] = []                         # list of author names by scraping PDF or website or from Paper Copilot. (avoid using)
    aff_names_from_paper: list[str] = []                            # list of author affiliations by scraping PDF or website or from Paper Copilot. (avoid using)
    aff_domains_from_paper: list[str] = []                          # list of author domains by scraping PDF or website or from Paper Copilot. (avoid using)
    author_rank_from_paper: list[str] = []                          # undergrad, postgrad, faculty, researcher, engineer etc. (avoid using)
    openreview_ids_from_paper: list[str] = []                       # OpenReview IDs (avoid using)

    keywords_from_paper: list[str] = []                             # keywords from conference or pdf scraping
    primary_area_from_paper: str | None = None                      # primary area from paper
    overall_rating_from_paper: list[int] = []
    percent_overall_rating_from_paper: list[float] = []
    novelty_from_paper: list[int] = []
    percent_novelty_from_paper: list[float] = []

Author model

class Institution(BaseModel):
    id: str                                                         # unique ID for the institution, unused for now
    openalex_id: str | None = None                                  # OpenAlex ID for the institution
    display_name: str                                               # https://docs.openalex.org/api-entities/institutions/institution-object
    display_name_acronyms: list[str] = []
    display_name_alternatives: list[str] = []
    ror: str | None = None
    homepage_url: str | None = None
    country_code: str | None = None
    type: InstitutionType                                           # institute type as a custom type
    latlon: tuple[float, float] | None = None                       # latitude and longitude

class AuthorInstitutionLink(BaseModel):
    rank: AuthorRank | None = None                                  # author rank as undergrad, postgrad, faculty, industry etc.
    institution: Institution | None = None                          # institution affiliation of author
    years: list[int] = []                                           # known years associated with institute

class InstitutionLink(BaseModel):
    rank: AuthorRank | None = None                                  # author rank as undergrad, postgrad, faculty, industry etc.
    institution: Institution | None = None                          # institution of author used in association with this corresponding paper

class Author(BaseModel):
    id: str                                                         # unique ID for the author, unused for now
    openalex_id: str | None = None                                  # OpenAlex ID for the author
    orcid: str | None = None                                        # ORCID (preferred)
    openreview_id: str | None = None                                # OpenReview (preferred)
    name: str
    email: str | None = None
    work: list[AuthorInstitutionLink] = []                          # work history, unused for now
    education: list[AuthorInstitutionLink] = []                     # education history, unused for now
    affiliations: list[AuthorInstitutionLink] = []                  # paper affiliation history of the author

class AuthorLink(BaseModel):
    position: AuthorPosition | None = None                          # first, middle or last author
    author: Author                                                  # author model
    institutions: list[InstitutionLink] = []                        # institutions of author used in association with this corresponding paper
    countries: list[str] = []                                       # countries these institutions belong to

class Topic(BaseModel):                                             # same as https://docs.openalex.org/api-entities/topics
    id: str
    openalex_id: str
    display_name: str
    subfield: dict
    field: dict
    domain: dict

class TopicLink(BaseModel):                                         # same as https://docs.openalex.org/api-entities/topics
    score: float
    topic: Topic
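
Because every `AuthorLink` carries a `countries` list, per-country paper counts can be derived without touching the institution objects. The sketch below assumes records loaded as plain dicts with the field names above; the function name `papers_per_country` and the two-letter codes in the sample data are hypothetical. Each paper is counted once per country, however many of its authors share that country.

```python
from __future__ import annotations

from collections import Counter


def papers_per_country(papers: list[dict]) -> Counter:
    """Count papers per country, counting each paper once per country."""
    counts: Counter = Counter()
    for paper in papers:
        # Collect the distinct countries across all authorships of this paper.
        countries: set[str] = set()
        for authorship in paper.get("authorships", []):
            countries.update(authorship.get("countries", []))
        counts.update(countries)
    return counts


# Hypothetical records containing only the fields this sketch reads.
papers = [
    {"authorships": [{"countries": ["IN"]}, {"countries": ["IN", "US"]}]},
    {"authorships": [{"countries": ["US"]}]},
]
print(papers_per_country(papers))
```

Deduplicating through a set first avoids inflating a country's count when several co-authors on the same paper are based there.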

Types

from enum import Enum

class InstitutionType(str, Enum):
    EDUCATION = "education"
    HEALTHCARE = "healthcare"
    COMPANY = "company"
    ARCHIVE = "archive"
    NONPROFIT = "nonprofit"
    GOVERNMENT = "government"
    FACILITY = "facility"
    FUNDER = "funder"
    OTHER = "other"

class AuthorRank(str, Enum):
    UNDERGRAD = "undergrad"
    POSTGRAD = "postgrad"
    POSTDOC = "postdoc"
    FACULTY = "faculty"
    INDUSTRY = "industry"

class AuthorPosition(str, Enum):
    FIRST = "first"
    MIDDLE = "middle"
    LAST = "last"
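
Free-text fields such as `author_rank_from_paper` can hold values (for example "researcher" or "engineer") that are not members of `AuthorRank`. A minimal sketch of tolerant parsing follows, with the enum copied from the listing above; the helper name `parse_rank` is a hypothetical addition, and returning `None` for unknown values is one possible policy rather than the project's own.

```python
from __future__ import annotations

from enum import Enum


class AuthorRank(str, Enum):  # copied from the Types listing above
    UNDERGRAD = "undergrad"
    POSTGRAD = "postgrad"
    POSTDOC = "postdoc"
    FACULTY = "faculty"
    INDUSTRY = "industry"


def parse_rank(value: str | None) -> AuthorRank | None:
    """Map a raw string to AuthorRank, returning None for unknown values."""
    try:
        return AuthorRank(value)
    except ValueError:
        # Enum lookup by value raises ValueError for anything not listed.
        return None


raw_ranks = ["faculty", "engineer", None, "postdoc"]
print([parse_rank(r) for r in raw_ranks])
```

Since `AuthorRank` subclasses `str`, a parsed member still compares equal to its raw string, so `parse_rank("faculty") == "faculty"` holds.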

Sources

OpenAlex, arXiv, dblp, Paper Copilot, ACL Anthology.

Note: Some data has been analyzed by AI and may be incorrect.

Need assistance?

Custom requests for scraping and licensing can be sent to contact[AT]indiainresearch.org.

License and Attribution

If you find this useful, please consider starring and sharing the repository.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
