GitHub - sciknoworg/ALD-E-ImageMiner: Annotation of images in Atomic Layer Deposition and Etching for scientific QA

Project Overview

ALD/E-ImageMiner is an annotation project for figures from the literature on atomic layer deposition (ALD) and atomic layer etching (ALE), two areas within the broader field of materials science and engineering. Within each of these two categories, the data is further organized into the sub-categories experimental-usecase and simulation-usecase.

It aims to host gold-standard annotations for chart classification, data extraction, summarization, and question answering, and provides both pilot and full-phase data to support multimodal AI research in scientific image understanding.

🗂️ Directory Structure

We have compiled the dataset for annotation in this repository, structured into clearly defined categories and sub-categories.
The layout reflects the distinction between ALD and ALE literature, as well as between experimental and simulation studies, making it easier to navigate both the pilot and full annotation phases.

data
├── pilot-annotation-task
│   ├── atomic-layer-deposition
│   │   ├── experimental-usecase
│   │   │   ├── paper #
│   │   │   │   ├── images
│   │   │   │   │   ├── figures
│   │   │   │   │   │   ├── filename.jpg            # (JPEG) actual figure image extracted using MinerU
│   │   │   │   │   │   ├── filename.caption.txt    # (Text) figure caption extracted from the paper.
│   │   │   │   │   │   ├── filename.class.txt      # (Text) chart visualization class/category extracted using Qwen 2.5 VL
│   │   │   │   │   │   ├── filename.data.txt       # (Text) data extracted as a markdown table using instruction-tuned Qwen 2.5 VL
│   │   │   │   │   │   └── filename.summary.txt    # (Text) summarization of chart visualization extracted using Qwen 2.5 VL
│   │   │   │   │   ├── formulas
│   │   │   │   │   │   └── filename.jpg            # (JPEG) actual formula image extracted using MinerU
│   │   │   │   │   └── tables
│   │   │   │   │       └── filename.jpg            # (JPEG) actual table image extracted using MinerU
│   │   │   │   ├── Author et al.pdf                # (PDF) actual PDF document
│   │   │   │   ├── content.json                    # (JSON) structured content extracted using MinerU
│   │   │   │   ├── content.md                      # (Markdown) structured content extracted using MinerU
│   │   │   │   ├── content.tei.xml                 # (TEI-XML) structured content extracted using GROBID
│   │   │   │   ├── content.txt                     # (Text) unstructured content extracted using MinerU
│   │   │   │   └── layout.json                     # (JSON) bounding box and segmentation data from MinerU
│   │   │   └── ...
│   │   └── simulation-usecase
│   │       └── ...
│   └── atomic-layer-etching
│       └── ...
└── full-annotation-task
    ├── atomic-layer-deposition
    │   ├── experimental-usecase
    │   └── simulation-usecase
    └── atomic-layer-etching
        ├── experimental-usecase
        └── simulation-usecase
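
The layout above can be traversed programmatically. The following is a minimal sketch, not part of the repository's tooling: it assumes every figure image is a `.jpg` located at `<task>/<category>/<sub-category>/<paper>/images/figures/` and that the sidecar annotation files share the image's base name, as the tree suggests. The names `FigureRecord` and `iter_figures` are hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Iterator

# Sidecar suffixes as they appear in the directory tree above.
SIDECAR_KINDS = ("caption", "class", "data", "summary")


@dataclass
class FigureRecord:
    image: Path
    task: str          # e.g. "pilot-annotation-task"
    category: str      # e.g. "atomic-layer-deposition"
    subcategory: str   # e.g. "experimental-usecase"
    paper: str         # name of the paper directory
    annotations: Dict[str, str]  # kind -> text, only for sidecar files that exist


def iter_figures(data_root: Path) -> Iterator[FigureRecord]:
    """Yield one record per figure image found under data_root."""
    if not data_root.is_dir():
        return
    # Assumed layout: <task>/<category>/<sub-category>/<paper>/images/figures/<name>.jpg
    for image in sorted(data_root.glob("*/*/*/*/images/figures/*.jpg")):
        task, category, subcategory, paper = image.relative_to(data_root).parts[:4]
        annotations = {}
        for kind in SIDECAR_KINDS:
            sidecar = image.with_name(f"{image.stem}.{kind}.txt")
            if sidecar.is_file():
                annotations[kind] = sidecar.read_text(encoding="utf-8").strip()
        yield FigureRecord(image, task, category, subcategory, paper, annotations)
```

Calling `iter_figures(Path("data"))` from the repository root would then yield one record per figure, with missing sidecar files simply absent from `annotations` rather than raising an error.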

🛠️ Tools Used

GROBID (GeneRation Of BIbliographic Data) → scholarly PDF parsing into TEI XML.
GROBID Python Client → Python interface to GROBID.
MinerU → extraction of structured text, figures, formulas, and tables from PDFs. Developed by OpenDataLab, it is an open-source tool that converts PDF documents into structured, machine-readable formats such as Markdown and JSON. MinerU can interpret the complex layout of research papers, including figures, tables, formulas, and text.
Qwen2.5-VL → multimodal LLM applied for classification, extraction, and summarization. Specifically, we used Qwen2.5-VL-7B-Instruct.
The Prompts.md file documents the prompts used for information extraction (figure type, data, summary, and figure labels).
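
Because the extracted chart data is stored as a Markdown table in the `.data.txt` files, downstream code will typically need to turn that text back into rows. The sketch below is one possible way to do so using only the standard library. It assumes a pipe-delimited table whose rows start with `|` and contain no escaped pipes; the actual model output may deviate from this, and `parse_markdown_table` is a hypothetical helper, not a function shipped with the repository.

```python
from typing import Dict, List


def _split_row(line: str) -> List[str]:
    """Split one pipe-delimited table line into stripped cell strings."""
    line = line.strip()
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|"):
        line = line[:-1]
    return [cell.strip() for cell in line.split("|")]


def parse_markdown_table(text: str) -> List[Dict[str, str]]:
    """Parse the pipe-delimited Markdown table lines in text into row dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip().startswith("|")]
    if len(lines) < 2:
        return []
    header = _split_row(lines[0])
    rows = []
    for line in lines[1:]:
        cells = _split_row(line)
        # Skip the separator row, e.g. |---|:---:|
        if all(cell and set(cell) <= set("-: ") for cell in cells):
            continue
        # Pad or trim ragged rows so every dict has the same keys.
        cells = (cells + [""] * len(header))[: len(header)]
        rows.append(dict(zip(header, cells)))
    return rows
```

All values are kept as strings on purpose: extracted cells may contain units or ranges, so numeric conversion is better left to the caller.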

📊 Dataset Statistics

Overall

Category	Sub-category	PDFs	Figures	Formulas	Tables
atomic-layer-deposition	experimental-usecase	66	552	102	76
atomic-layer-deposition	simulation-usecase	58	579	413	131
atomic-layer-etching	experimental-usecase	47	461	116	28
atomic-layer-etching	simulation-usecase	34	400	226	60
Total	-	205	1,992	857	295
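
As a quick consistency check, the Total row can be reproduced from the four per-category rows. The snippet below is only an illustration: the counts are copied by hand from the table above, and the helper names are hypothetical.

```python
# Per-category counts copied from the table above: (PDFs, Figures, Formulas, Tables).
COLUMNS = ("PDFs", "Figures", "Formulas", "Tables")
COUNTS = {
    ("atomic-layer-deposition", "experimental-usecase"): (66, 552, 102, 76),
    ("atomic-layer-deposition", "simulation-usecase"): (58, 579, 413, 131),
    ("atomic-layer-etching", "experimental-usecase"): (47, 461, 116, 28),
    ("atomic-layer-etching", "simulation-usecase"): (34, 400, 226, 60),
}


def column_totals(counts):
    """Sum each column over all (category, sub-category) rows."""
    return {name: sum(row[i] for row in counts.values()) for i, name in enumerate(COLUMNS)}


def totals_by_category(counts):
    """Sum each column per top-level category, ignoring the sub-category."""
    totals = {}
    for (category, _subcategory), row in counts.items():
        previous = totals.get(category, (0,) * len(COLUMNS))
        totals[category] = tuple(a + b for a, b in zip(previous, row))
    return totals


print(column_totals(COUNTS))
# {'PDFs': 205, 'Figures': 1992, 'Formulas': 857, 'Tables': 295}
```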

Figure type classification

We have defined a taxonomy of 40 figure types, including "unknown". The full taxonomy, with descriptions, parent taxonomy categories, and aliases, is available in figure_taxonomy.tsv. The ALD/E-ImageMiner project focuses only on figures whose parent taxonomy category is quantitative plot.
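
Selecting the figure types that belong to a given parent category could look roughly like the sketch below. The column names `figure_type` and `parent_category` are assumptions made for illustration, since the header of figure_taxonomy.tsv is not reproduced here; pass the real column names if they differ. The function name is likewise hypothetical.

```python
import csv
from typing import List, TextIO


def figure_types_in_category(
    tsv_file: TextIO,
    parent: str = "quantitative plot",
    type_col: str = "figure_type",        # assumed column name
    parent_col: str = "parent_category",  # assumed column name
) -> List[str]:
    """Return the figure types whose parent category matches `parent` (case-insensitive)."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    missing = {type_col, parent_col} - set(reader.fieldnames or [])
    if missing:
        raise KeyError(f"columns not found in taxonomy file: {sorted(missing)}")
    wanted = parent.strip().lower()
    return [
        row[type_col]
        for row in reader
        if (row[parent_col] or "").strip().lower() == wanted
    ]
```

A typical call would open the file with `open("figure_taxonomy.tsv", newline="", encoding="utf-8")` and pass the handle to the function. The explicit column check is there so that a mismatch with the real header fails loudly instead of returning an empty list.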

Separate statistics are also available for each annotation task dataset, i.e., pilot-annotation-task and full-annotation-task.

Figure Type	Auto Labels	Human Labels
3d bar chart	5	0
3d scatter plot	23	0
apparatus diagram	98	0
area chart	6	0
band diagram	12	0
bar chart	46	0
box plot	4	0
bubble chart	1	0
conceptual diagram	127	0
formula	3	0
grouped bar chart	26	0
heatmap	89	0
histogram	2	0
image panel	526	0
line chart	1066	0
line plot	2	0
map/geo chart	4	0
molecular structure diagram	807	0
multi-axis chart	114	0
multiple line chart	44	0
network diagram	1	0
periodic table map	3	0
pie chart	8	0
polar chart	14	0
process flow diagram	28	0
reaction scheme	443	0
scatter plot	201	0
spectra chart	419	0
stacked bar chart	4	0
table	6	0
timeline chart	6	0
unknown	12	0
Total	4150	0

📖 Citation

The ALD/E-ImageMiner project vision is described in the following working paper, pre-released on Zenodo.
Please cite this paper if you find this work useful:

@misc{d_souza_2025_17130928,
  author       = {D'Souza, Jennifer},
  title        = {A Pathway to General-Purpose Scientific AI:
                   Multimodal Comprehension of Scientific Images},
  month        = sep,
  year         = 2025,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.17130928},
  url          = {https://doi.org/10.5281/zenodo.17130928},
}

⭐ Acknowledgements

The ALD/E-ImageMiner project is supported by:

The NFDI4DataScience initiative, funded by the German Research Foundation (DFG, Grant ID: 460234259) under the Speedboat Annotation Project funding scheme.
The AI-Aware Pathways to Sustainable Semiconductor Process and Manufacturing Technologies (AWASES) initiative (Mackus et al., 2024), funded by Merck and Intel, with collaboration between Eindhoven University, Leibniz University Hannover’s L3S Research Centre, and University of Warwick. AWASES hosts three fully funded PhD positions and supports advances in generative AI, multimodal models, and FAIR scientific knowledge graph construction.
