This repository contains an extracted list of published papers and author data across multiple domains of science. It powers indiainresearch.org.
To view a subset of the data, visit: indiainresearch.org
- Data is enriched using OpenAlex, arXiv, and LLM-based parsing of PDFs and conference websites.
- 1:1 mapping between paper authors and associated institutions.
- Author rank (undergrad, postgrad, faculty, industry) available for certain venues.
- Author affiliation history available.
- Standard topics for every paper assigned from OpenAlex and LLMs.
- Author and affiliation matching done using LLMs.
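
Subfield | Venue | Data |
---|---|---|
Machine Learning | NeurIPS | 2024 |
Machine Learning | ICML | 2024 |
Machine Learning | ICLR | 2024 |
Computer Vision | CVPR | 2024 |
Computer Vision | WACV | 2024 |
Computer Vision | ECCV | 2024 |
Natural Language Processing | EMNLP | 2024 |
Natural Language Processing | ACL | 2024 |
Natural Language Processing | NAACL | 2024 |
Natural Language Processing | EACL | 2024 |
Natural Language Processing | TACL | 2024 |
Natural Language Processing | LREC-COLING | 2024 |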

Subfield | Venue | Data |
---|---|---|
Mobile Computing | MobiSys | 2024 |
Mobile Computing | UbiComp/IMWUT | 2024 |
Mobile Computing | SenSys | 2024 |
Mobile Computing | MobiCom | 2024 |
Databases | VLDB | 2024 |
Databases | SIGMOD | 2024 |
Databases | PODS | 2024 |
Operating Systems | OSDI | 2024 |
Computer Networks | NSDI | 2024 |
Computer Networks | SIGCOMM | 2024 |

Subfield | Venue | Data |
---|---|---|
Algorithms and Complexity | STOC | 2024 |
Algorithms and Complexity | FOCS | 2024 |
Algorithms and Complexity | SODA | 2024 |

Each paper follows the schema below and can be parsed with these pydantic models:
from __future__ import annotations  # postponed annotations: some models reference classes defined further down

from enum import Enum

from pydantic import BaseModel


class Paper(BaseModel):
    id: str  # unique ID for the paper, unused for now
    openalex_id: str | None = None  # corresponding ID from OpenAlex database
    doi: str | None = None  # DOI if present
    conf_id: str | None = None  # ID used by the corresponding conference
    title: str | None = None  # Paper title
    authorships: list[AuthorLink] = []  # list of paper-author relations
    primary_location: dict | None = None  # https://docs.openalex.org/api-entities/works/work-object#primary_location
    open_access: dict | None = None  # https://docs.openalex.org/api-entities/works/work-object#the-openaccess-object
    best_oa_location: dict | None = None  # https://docs.openalex.org/api-entities/works/work-object#best_oa_location
    citation_normalized_percentile: dict | None = None  # normalized citation percentile from OpenAlex
    fwci: float | None = None  # Field-Weighted Citation Impact (FWCI)
    primary_topic: TopicLink | None = None  # top ranked topic
    publication_venue: str  # unique code for publication venue, usually same as conference name
    publication_year: int  # publication or conference year
    related_works: list[str] = []  # OpenAlex IDs of related works
    topics: list[TopicLink] = []  # top ranked topics, up to 3
    keywords: list[dict] = []  # keywords from OpenAlex
    link: str | None = None  # Primary webpage for the paper. (prefer this)
    pdf_link: str | None = None  # Primary PDF for the paper if open access (prefer this)
    status: str | None = None  # Oral, Poster, Spotlight (from Paper Copilot)
    track: str | None = None  # Conference track
    github_link: str | None = None
    project_link: str | None = None
    video_link: str | None = None
    openaccess_link: str | None = None
    poster_link: str | None = None
    openreview_link: str | None = None
    arxiv_link: str | None = None
    proceeding_link: str | None = None
    author_names_from_paper: list[str] = []  # author names scraped from the PDF or website, or from Paper Copilot (avoid using)
    aff_names_from_paper: list[str] = []  # author affiliations scraped from the PDF or website, or from Paper Copilot (avoid using)
    aff_domains_from_paper: list[str] = []  # author domains scraped from the PDF or website, or from Paper Copilot (avoid using)
    author_rank_from_paper: list[str] = []  # undergrad, postgrad, faculty, researcher, engineer etc. (avoid using)
    openreview_ids_from_paper: list[str] = []  # OpenReview IDs (avoid using)
    keywords_from_paper: list[str] = []  # keywords from conference or PDF scraping
    primary_area_from_paper: str | None = None  # primary area from paper
    overall_rating_from_paper: list[int] = []
    percent_overall_rating_from_paper: list[float] = []
    novelty_from_paper: list[int] = []
    percent_novelty_from_paper: list[float] = []

class Institution(BaseModel):
    id: str  # unique ID for the institution, unused for now
    openalex_id: str | None = None  # OpenAlex ID for the institution
    display_name: str  # https://docs.openalex.org/api-entities/institutions/institution-object
    display_name_acronyms: list[str] = []
    display_name_alternatives: list[str] = []
    ror: str | None = None
    homepage_url: str | None = None
    country_code: str | None = None
    type: InstitutionType  # institution type as a custom type
    latlon: tuple[float, float] | None = None  # latitude and longitude

class AuthorInstitutionLink(BaseModel):
    rank: AuthorRank | None = None  # author rank as undergrad, postgrad, faculty, industry etc.
    institution: Institution | None = None  # institution affiliation of author
    years: list[int] = []  # known years associated with the institution

class InstitutionLink(BaseModel):
    rank: AuthorRank | None = None  # author rank as undergrad, postgrad, faculty, industry etc.
    institution: Institution | None = None  # institution of author used in association with this corresponding paper

class Author(BaseModel):
    id: str  # unique ID for the author, unused for now
    openalex_id: str | None = None  # OpenAlex ID for the author
    orcid: str | None = None  # ORCID (preferred)
    openreview_id: str | None = None  # OpenReview (preferred)
    name: str
    email: str | None = None
    work: list[AuthorInstitutionLink] = []  # work history, unused for now
    education: list[AuthorInstitutionLink] = []  # education history, unused for now
    affiliations: list[AuthorInstitutionLink] = []  # paper affiliation history of the author

class AuthorLink(BaseModel):
    position: AuthorPosition | None = None  # first, middle or last author
    author: Author  # author model
    institutions: list[InstitutionLink] = []  # institutions of author used in association with this corresponding paper
    countries: list[str] = []  # countries these institutions belong to

class Topic(BaseModel):  # same as https://docs.openalex.org/api-entities/topics
    id: str
    openalex_id: str
    display_name: str
    subfield: dict
    field: dict
    domain: dict

class TopicLink(BaseModel):  # same as https://docs.openalex.org/api-entities/topics
    score: float
    topic: Topic

class InstitutionType(str, Enum):
    EDUCATION = "education"
    HEALTHCARE = "healthcare"
    COMPANY = "company"
    ARCHIVE = "archive"
    NONPROFIT = "nonprofit"
    GOVERNMENT = "government"
    FACILITY = "facility"
    FUNDER = "funder"
    OTHER = "other"

class AuthorRank(str, Enum):
    UNDERGRAD = "undergrad"
    POSTGRAD = "postgrad"
    POSTDOC = "postdoc"
    FACULTY = "faculty"
    INDUSTRY = "industry"

class AuthorPosition(str, Enum):
    FIRST = "first"
    MIDDLE = "middle"
    LAST = "last"

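
To illustrate how the nested schema above might be traversed, here is a minimal sketch that works on a plain dictionary shaped like a `Paper` record, using only the standard library. It rests on several assumptions: that records are available as JSON-like dictionaries, that `countries` holds two-letter codes such as `"IN"`, and that the fallback order in `preferred_link` is reasonable. The sample record and the helper names (`author_rows`, `has_author_from`, `preferred_link`) are hypothetical and not part of the dataset.

```python
# Hypothetical record shaped like the Paper schema; every value is made up.
sample_paper = {
    "id": "p1",
    "title": "An Example Paper",
    "publication_venue": "NeurIPS",
    "publication_year": 2024,
    "link": None,
    "openreview_link": "https://openreview.net/forum?id=EXAMPLE",
    "authorships": [
        {
            "position": "first",
            "author": {"id": "a1", "name": "Author One"},
            "institutions": [
                {
                    "rank": "postgrad",
                    "institution": {
                        "id": "i1",
                        "display_name": "Example Institute",
                        "country_code": "IN",
                        "type": "education",
                    },
                }
            ],
            "countries": ["IN"],  # assumed to be two-letter country codes
        },
        {
            "position": "last",
            "author": {"id": "a2", "name": "Author Two"},
            "institutions": [],  # affiliation unknown
            "countries": [],
        },
    ],
}


def author_rows(paper):
    """Flatten authorships into one row per (author, institution) pair."""
    rows = []
    for link in paper.get("authorships", []):
        # Keep authors without a known institution as a row with empty fields.
        inst_links = link.get("institutions") or [None]
        for inst_link in inst_links:
            inst_link = inst_link or {}
            inst = inst_link.get("institution") or {}
            rows.append({
                "author": link["author"]["name"],
                "position": link.get("position"),
                "rank": inst_link.get("rank"),
                "institution": inst.get("display_name"),
                "country_code": inst.get("country_code"),
            })
    return rows


def has_author_from(paper, country_code):
    """True if any authorship lists the given country."""
    return any(
        country_code in link.get("countries", [])
        for link in paper.get("authorships", [])
    )


def preferred_link(paper):
    """Return `link` when set (the schema marks it as preferred).

    The fallback order after `link` is an assumption of this sketch.
    """
    for key in ("link", "openreview_link", "arxiv_link", "proceeding_link"):
        if paper.get(key):
            return paper[key]
    return None


if __name__ == "__main__":
    for row in author_rows(sample_paper):
        print(row)
    print(has_author_from(sample_paper, "IN"))
    print(preferred_link(sample_paper))
```

The same traversal should carry over to parsed pydantic objects by swapping dictionary lookups for attribute access, for example `link.author.name` in place of `link["author"]["name"]`.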
OpenAlex, arXiv, dblp, Paper Copilot, ACL Anthology.
Note: Some data has been analyzed by AI and may be incorrect.
Custom requests for scraping and licensing can be sent to contact[AT]indiainresearch.org.
If you find this useful, please consider starring and sharing the repository.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.