
sciknoworg/ALD-E-ImageMiner


ALD/E-ImageMiner

Project Overview

ALD/E-ImageMiner is an annotation project on figures from the atomic layer deposition (ALD) and atomic layer etching (ALE) literature, two areas within the broader field of materials science and engineering. Within each of these two categories, the data is further organized into the sub-categories experimental-usecase and simulation-usecase.

It aims to host gold-standard annotations for chart classification, data extraction, summarization, and question answering, and provides both pilot and full-phase data to support multimodal AI research in scientific image understanding.

🗂️ Directory Structure

We have compiled the dataset for annotation in this repository, structured into clearly defined categories and sub-categories.
The layout reflects the distinction between ALD and ALE literature, as well as between experimental and simulation studies, making it easier to navigate both the pilot and full annotation phases.

data
├── pilot-annotation-task
│   ├── atomic-layer-deposition
│   │   ├── experimental-usecase
│   │   │   ├── paper #
│   │   │   │   ├── images
│   │   │   │   │   ├── figures
│   │   │   │   │   │   ├── filename.jpg            # (JPEG) actual figure image extracted using MinerU
│   │   │   │   │   │   ├── filename.caption.txt    # (Text) figure caption extracted from the paper.
│   │   │   │   │   │   ├── filename.class.txt      # (Text) chart visualization class/category extracted using Qwen2.5-VL
│   │   │   │   │   │   ├── filename.data.txt       # (Text) data extracted as a markdown table using instruction-tuned Qwen2.5-VL
│   │   │   │   │   │   └── filename.summary.txt    # (Text) summarization of chart visualization extracted using Qwen2.5-VL
│   │   │   │   │   ├── formulas
│   │   │   │   │   │   └── filename.jpg            # (JPEG) actual formula image extracted using MinerU
│   │   │   │   │   └── tables
│   │   │   │   │       └── filename.jpg            # (JPEG) actual table image extracted using MinerU
│   │   │   │   ├── Author et al.pdf                # (PDF) actual PDF document
│   │   │   │   ├── content.json                    # (JSON) structured content extracted using MinerU
│   │   │   │   ├── content.md                      # (Markdown) structured content extracted using MinerU
│   │   │   │   ├── content.tei.xml                 # (TEI-XML) structured content extracted using GROBID
│   │   │   │   ├── content.txt                     # (Text) unstructured content extracted using MinerU
│   │   │   │   └── layout.json                     # (JSON) bounding box and segmentation data from MinerU
│   │   │   └── ...
│   │   └── simulation-usecase
│   │       └── ...
│   └── atomic-layer-etching
│       └── ...
└── full-annotation-task
    ├── atomic-layer-deposition
    │   ├── experimental-usecase
    │   └── simulation-usecase
    └── atomic-layer-etching
        ├── experimental-usecase
        └── simulation-usecase
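As a rough illustration of how this layout might be consumed, the sketch below collects each figure image in one paper folder together with its caption, class, data, and summary files. It assumes that the sidecar files share the image's base name (for example `x.jpg` next to `x.caption.txt`); the function name `collect_figures` and the record keys are hypothetical and not part of the repository.

```python
from pathlib import Path

# Sidecar suffixes as listed in the directory tree above.
SIDECAR_SUFFIXES = ("caption", "class", "data", "summary")


def collect_figures(paper_dir):
    """Return one record per figure image under <paper_dir>/images/figures.

    Assumption: sidecar files reuse the image's base name, e.g.
    "x.jpg" is paired with "x.caption.txt", "x.class.txt", and so on.
    """
    figures_dir = Path(paper_dir) / "images" / "figures"
    records = []
    for image_path in sorted(figures_dir.glob("*.jpg")):
        record = {"image": image_path}
        for suffix in SIDECAR_SUFFIXES:
            sidecar = image_path.with_name(f"{image_path.stem}.{suffix}.txt")
            # A sidecar may be missing; store None instead of raising.
            if sidecar.exists():
                record[suffix] = sidecar.read_text(encoding="utf-8").strip()
            else:
                record[suffix] = None
        records.append(record)
    return records
```

Calling `collect_figures` on a single paper folder returns a list of dictionaries, which can then be filtered by the `class` value or checked for missing annotations.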

🛠️ Tools Used

  • GROBID (GeneRation Of BIbliographic Data) → scholarly PDF parsing into TEI XML.
  • GROBID Python Client → Python interface to GROBID.
  • MinerU → extraction of structured text, figures, formulas, and tables from PDFs. MinerU is an open-source tool from OpenDataLab that converts PDF documents into structured, machine-readable formats such as Markdown and JSON. It can interpret the complex layout of research papers, including figures, tables, formulas, and text.
  • Qwen2.5-VL → multimodal LLM applied for classification, extraction, and summarization. Specifically, we used Qwen2.5-VL-7B-Instruct.
    The Prompts.md file documents the prompts used for information extraction (figure type, data, summary, and figure labels).
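To give a sense of what the GROBID output can be used for, here is a minimal sketch that reads a `content.tei.xml` file with Python's standard library and pulls out the paper title and figure descriptions. It assumes the file uses the standard TEI namespace and that figure captions sit in `figDesc` elements inside `figure` elements; the function name `read_title_and_figure_captions` is hypothetical, and the exact element layout should be checked against the actual files.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; assumed to be the one used in content.tei.xml.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def read_title_and_figure_captions(tei_path):
    """Return (title, captions) from a TEI XML file.

    Assumption: the title is in teiHeader/.../titleStmt/title and each
    caption is the text of a figDesc element inside a figure element.
    """
    root = ET.parse(tei_path).getroot()

    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    title = "".join(title_el.itertext()).strip() if title_el is not None else None

    captions = []
    for figure in root.iterfind(".//tei:figure", TEI_NS):
        desc = figure.find("tei:figDesc", TEI_NS)
        if desc is not None:
            captions.append("".join(desc.itertext()).strip())
    return title, captions
```

Captions recovered this way could be compared against the `filename.caption.txt` files as a consistency check between the two extraction tools.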

📊 Dataset Statistics

Overall

| Category | Sub-category | PDFs | Figures | Formulas | Tables |
|---|---|---|---|---|---|
| atomic-layer-deposition | experimental-usecase | 66 | 552 | 102 | 76 |
| atomic-layer-deposition | simulation-usecase | 58 | 579 | 413 | 131 |
| atomic-layer-etching | experimental-usecase | 47 | 461 | 116 | 28 |
| atomic-layer-etching | simulation-usecase | 34 | 400 | 226 | 60 |
| Total | - | 205 | 1,992 | 857 | 295 |
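The totals can be re-derived from the four rows above. The short sketch below transcribes the counts from the table and sums them overall and per category; the names `ROWS`, `column_totals`, and `totals_by_category` are illustrative only.

```python
# Counts transcribed from the table above:
# (category, sub-category, PDFs, figures, formulas, tables)
ROWS = [
    ("atomic-layer-deposition", "experimental-usecase", 66, 552, 102, 76),
    ("atomic-layer-deposition", "simulation-usecase", 58, 579, 413, 131),
    ("atomic-layer-etching", "experimental-usecase", 47, 461, 116, 28),
    ("atomic-layer-etching", "simulation-usecase", 34, 400, 226, 60),
]


def column_totals(rows):
    """Sum the four numeric columns over all rows."""
    return tuple(sum(row[i] for row in rows) for i in range(2, 6))


def totals_by_category(rows):
    """Sum the numeric columns separately for each top-level category."""
    grouped = {}
    for category, _sub, *counts in rows:
        previous = grouped.get(category, (0, 0, 0, 0))
        grouped[category] = tuple(a + b for a, b in zip(previous, counts))
    return grouped


print(column_totals(ROWS))  # (205, 1992, 857, 295)
print(totals_by_category(ROWS))
```

The overall sums match the Total row: 205 PDFs, 1,992 figures, 857 formulas, and 295 tables.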

Figure type classification

We have defined a taxonomy of 40 figure types, including "unknown". The full taxonomy, with descriptions, parent taxonomy categories, and aliases, is available in figure_taxonomy.tsv. The ALD/E-ImageMiner project focuses only on figures whose parent taxonomy category is quantitative plot.
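Selecting the quantitative-plot figure types from the taxonomy file could look roughly like the sketch below. The column names `figure_type` and `parent_category` are assumptions made for illustration; the actual header names in figure_taxonomy.tsv may differ and should be passed in accordingly. The function name is likewise hypothetical.

```python
import csv


def load_types_for_parent(
    tsv_file,
    type_column="figure_type",          # hypothetical column name
    parent_column="parent_category",    # hypothetical column name
    target_parent="quantitative plot",
):
    """Return the figure types whose parent category equals target_parent.

    tsv_file is an open text file (or any iterable of lines) containing
    tab-separated values with a header row.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    wanted = target_parent.strip().lower()
    return [
        row[type_column].strip()
        for row in reader
        if row[parent_column].strip().lower() == wanted
    ]
```

A typical call would open the file with `open("figure_taxonomy.tsv", newline="", encoding="utf-8")` and pass the handle in, overriding the column arguments to match the real header.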

Separate statistics are also available for each annotation-task dataset, i.e., pilot-annotation-task and full-annotation-task.

| Figure Type | Auto Labels | Human Labels |
|---|---|---|
| 3d bar chart | 5 | 0 |
| 3d scatter plot | 23 | 0 |
| apparatus diagram | 98 | 0 |
| area chart | 6 | 0 |
| band diagram | 12 | 0 |
| bar chart | 46 | 0 |
| box plot | 4 | 0 |
| bubble chart | 1 | 0 |
| conceptual diagram | 127 | 0 |
| formula | 3 | 0 |
| grouped bar chart | 26 | 0 |
| heatmap | 89 | 0 |
| histogram | 2 | 0 |
| image panel | 526 | 0 |
| line chart | 1066 | 0 |
| line plot | 2 | 0 |
| map/geo chart | 4 | 0 |
| molecular structure diagram | 807 | 0 |
| multi-axis chart | 114 | 0 |
| multiple line chart | 44 | 0 |
| network diagram | 1 | 0 |
| periodic table map | 3 | 0 |
| pie chart | 8 | 0 |
| polar chart | 14 | 0 |
| process flow diagram | 28 | 0 |
| reaction scheme | 443 | 0 |
| scatter plot | 201 | 0 |
| spectra chart | 419 | 0 |
| stacked bar chart | 4 | 0 |
| table | 6 | 0 |
| timeline chart | 6 | 0 |
| unknown | 12 | 0 |
| Total | 4150 | 0 |
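To see how the automatic labels are distributed, the counts above can be ranked and converted into shares of the total. The sketch below transcribes the Auto Labels column from the table; the names `AUTO_LABEL_COUNTS` and `top_figure_types` are illustrative only.

```python
# Auto-label counts transcribed from the table above (32 listed types).
AUTO_LABEL_COUNTS = {
    "3d bar chart": 5, "3d scatter plot": 23, "apparatus diagram": 98,
    "area chart": 6, "band diagram": 12, "bar chart": 46, "box plot": 4,
    "bubble chart": 1, "conceptual diagram": 127, "formula": 3,
    "grouped bar chart": 26, "heatmap": 89, "histogram": 2,
    "image panel": 526, "line chart": 1066, "line plot": 2,
    "map/geo chart": 4, "molecular structure diagram": 807,
    "multi-axis chart": 114, "multiple line chart": 44,
    "network diagram": 1, "periodic table map": 3, "pie chart": 8,
    "polar chart": 14, "process flow diagram": 28, "reaction scheme": 443,
    "scatter plot": 201, "spectra chart": 419, "stacked bar chart": 4,
    "table": 6, "timeline chart": 6, "unknown": 12,
}


def top_figure_types(counts, k=3):
    """Return the k most frequent types as (name, count, share of total)."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(name, count, count / total) for name, count in ranked[:k]]


for name, count, share in top_figure_types(AUTO_LABEL_COUNTS):
    print(f"{name}: {count} ({share:.1%})")
```

The counts sum to the 4150 reported in the Total row, and the three most frequent automatic labels are line chart, molecular structure diagram, and image panel.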

📖 Citation

The ALD/E-ImageMiner project vision is described in the following working paper, pre-released on Zenodo.
Please cite this paper if you find this work useful:

@misc{d_souza_2025_17130928,
  author       = {D'Souza, Jennifer},
  title        = {A Pathway to General-Purpose Scientific AI:
                   Multimodal Comprehension of Scientific Images},
  month        = sep,
  year         = 2025,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.17130928},
  url          = {https://doi.org/10.5281/zenodo.17130928},
}

⭐ Acknowledgements

The ALD/E-ImageMiner project is supported by:

  • The NFDI4DataScience initiative, funded by the German Research Foundation (DFG, Grant ID: 460234259) under the Speedboat Annotation Project funding scheme.

  • The AI-Aware Pathways to Sustainable Semiconductor Process and Manufacturing Technologies (AWASES) initiative (Mackus et al., 2024), funded by Merck and Intel, with collaboration between Eindhoven University, Leibniz University Hannover’s L3S Research Centre, and University of Warwick. AWASES hosts three fully funded PhD positions and supports advances in generative AI, multimodal models, and FAIR scientific knowledge graph construction.

About

Annotation of images in Atomic Layer Deposition and Etching for scientific QA
