@google-research-datasets

Google Research Datasets

Datasets released by Google Research

Pinned

  1. natural-questions Public

    Natural Questions (NQ) contains real user questions issued to Google search, and answers found from Wikipedia by annotators. NQ is designed for the training and evaluation of automatic question ans…

    Python 996 157

  2. conceptual-captions Public

    Conceptual Captions is a dataset containing (image-URL, caption) pairs designed for the training and evaluation of machine learned image captioning systems.

    Shell 534 26

  3. Objectron Public

    Objectron is a dataset of short, object-centric video clips. In addition, the videos also contain AR session metadata including camera poses, sparse point-clouds and planes. In each video, the came…

    Jupyter Notebook 2.3k 261

  4. wit Public

    WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

    1k 42

  5. paws Public

    This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase ident…

    Python 557 54

  6. dstc8-schema-guided-dialogue Public

    The Schema-Guided Dialogue Dataset

    Python 565 128

Repositories

Showing 10 of 166 repositories
  • egotempo Public
    Jupyter Notebook 11 CC-BY-4.0 0 2 0 Updated Apr 26, 2025
  • artydiqa Public

    ArTyDi-QA is a dataset for Question Answering (QA) and Question Generation (QG) in Modern Standard Arabic (MSA), adapted from TyDiQA. It features extractive QA where models find answer spans or identify unanswerable questions, and a QG task involving formulating questions from context and answer pairs.

    0 0 0 0 Updated Apr 23, 2025
  • Amplify_SSA Public

    An annotated dataset of 8,091 adversarial queries in seven Sub-Saharan African languages.

    Jupyter Notebook 0 0 0 0 Updated Apr 18, 2025
  • web-images Public

    Images gathered from the Internet in 2023 and some metadata

    HTML 2 2 0 0 Updated Mar 19, 2025
  • screen_qa Public

    The ScreenQA dataset was introduced in the paper "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots". It contains ~86K question-answer pairs collected by human annotators for ~35K screenshots from Rico. It is intended for training and evaluating models capable of screen content understanding via question answering.

    Python 114 CC-BY-4.0 8 3 0 Updated Feb 7, 2025
  • adversarial-nibbler Public

    This dataset contains results from all rounds of Adversarial Nibbler. This data includes adversarial prompts fed into public generative text2image models and validations for unsafe images. There will be two sets of data: all prompts submitted and all prompts attempted (sent to t2i models but not submitted as unsafe).

    21 CC-BY-4.0 3 0 0 Updated Feb 3, 2025
  • cube Public

    CUBE is a benchmark for evaluating the cultural competence of text-to-image (T2I) models.

    8 CC-BY-4.0 0 3 0 Updated Jan 20, 2025
  • global_streamflow_model_paper Public
    Jupyter Notebook 57 Apache-2.0 16 3 0 Updated Jan 17, 2025
  • hiertext Public

    The HierText dataset contains ~12k images from the Open Images dataset v6 with a large number of text entities. We provide word-, line-, and paragraph-level annotations.

    Jupyter Notebook 283 CC-BY-SA-4.0 25 0 1 Updated Dec 2, 2024
  • scin Public

    The SCIN dataset contains 10,000+ images of dermatology conditions, crowdsourced with informed consent from US internet users. Contributions include self-reported demographic and symptom information and dermatologist labels. The dataset also contains estimated Fitzpatrick skin type and Monk Skin Tone.

    Jupyter Notebook 108 10 2 0 Updated Nov 23, 2024
