openproblems datasets

This repository contains dataset loaders and processing workflows.

Common datasets

Pipeline topology

%%| column: screen-inset-shaded
flowchart LR
  file_dataset(Dataset+Pca+Hvg)
  file_normalized(Normalized Dataset)
  file_pca(Dataset+Pca)
  file_raw(Raw Dataset)
  comp_dataset_loader[/Dataset Loader/]
  comp_normalization[/Normalization/]
  comp_processor_hvg[/Processor Hvg/]
  comp_processor_pca[/Processor Pca/]
  file_raw---comp_normalization
  file_pca---comp_processor_hvg
  file_normalized---comp_processor_pca
  comp_dataset_loader-->file_raw
  comp_normalization-->file_normalized
  comp_processor_hvg-->file_dataset
  comp_processor_pca-->file_pca
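
The flowchart above is a linear chain, so the order in which components must run can be derived from the file dependencies alone. A minimal sketch, assuming the short component and file names below (taken from the flowchart node ids, not from any configuration file in this repository):

```python
from graphlib import TopologicalSorter

# component -> (input file, output file); None means the component has no input.
# Names mirror the flowchart node ids and are illustrative only.
PIPELINE = {
    "processor_hvg": ("pca", "dataset"),
    "processor_pca": ("normalized", "pca"),
    "normalization": ("raw", "normalized"),
    "dataset_loader": (None, "raw"),
}


def execution_order(pipeline):
    """Return component names in an order that respects file dependencies."""
    # Map each file to the component that produces it.
    producer = {out: comp for comp, (_, out) in pipeline.items()}
    # Each component depends on the producer of its input file, if it has one.
    graph = {
        comp: {producer[inp]} if inp is not None else set()
        for comp, (inp, _) in pipeline.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(PIPELINE)
print(order)
# ['dataset_loader', 'normalization', 'processor_pca', 'processor_hvg']
```

Because every component has at most one predecessor, the resulting order is unique regardless of how the dictionary is written.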

File format API

`Dataset+Pca+Hvg`

A normalised dataset with a PCA embedding and HVG selection.

Used in:

processor hvg: output (as output)

Slots:

struct	name	type	description
layers	counts	integer	Raw counts
layers	normalized	double	Normalised expression values
obs	celltype	string	Cell type information
obs	batch	string	Batch information
obs	tissue	string	Tissue information
obs	size_factors	double	The size factors created by the normalisation method, if any.
var	hvg	boolean	Whether or not the feature is considered to be a ‘highly variable gene’
var	hvg_score	integer	A ranking of the features by hvg.
obsm	X_pca	double	The resulting PCA embedding.
varm	pca_loadings	double	The PCA loadings matrix.
uns	dataset_id	string	A unique identifier for the dataset
uns	normalization_id	string	Which normalization was used
uns	pca_variance	double	The PCA variance objects.

Example:

AnnData object
 obs: 'celltype', 'batch', 'tissue', 'size_factors'
 var: 'hvg', 'hvg_score'
 uns: 'dataset_id', 'normalization_id', 'pca_variance'
 obsm: 'X_pca'
 varm: 'pca_loadings'
 layers: 'counts', 'normalized'

`Normalized Dataset`

A normalized dataset

Used in:

normalization: output (as output)
processor pca: input (as input)

Slots:

struct	name	type	description
layers	counts	integer	Raw counts
layers	normalized	double	Normalised expression values
obs	celltype	string	Cell type information
obs	batch	string	Batch information
obs	tissue	string	Tissue information
obs	size_factors	double	The size factors created by the normalisation method, if any.
uns	dataset_id	string	A unique identifier for the dataset
uns	normalization_id	string	Which normalization was used

Example:

AnnData object
 obs: 'celltype', 'batch', 'tissue', 'size_factors'
 uns: 'dataset_id', 'normalization_id'
 layers: 'counts', 'normalized'

`Dataset+Pca`

A normalised dataset with a PCA embedding.

Used in:

processor hvg: input (as input)
processor pca: output (as output)

Slots:

struct	name	type	description
layers	counts	integer	Raw counts
layers	normalized	double	Normalised expression values
obs	celltype	string	Cell type information
obs	batch	string	Batch information
obs	tissue	string	Tissue information
obs	size_factors	double	The size factors created by the normalisation method, if any.
obsm	X_pca	double	The resulting PCA embedding.
varm	pca_loadings	double	The PCA loadings matrix.
uns	dataset_id	string	A unique identifier for the dataset
uns	normalization_id	string	Which normalization was used
uns	pca_variance	double	The PCA variance objects.

Example:

AnnData object
 obs: 'celltype', 'batch', 'tissue', 'size_factors'
 uns: 'dataset_id', 'normalization_id', 'pca_variance'
 obsm: 'X_pca'
 varm: 'pca_loadings'
 layers: 'counts', 'normalized'

`Raw Dataset`

An unprocessed dataset as output by a dataset loader.

Used in:

dataset loader: output (as output)
normalization: input (as input)

Slots:

struct	name	type	description
layers	counts	integer	Raw counts
obs	celltype	string	Cell type information
obs	batch	string	Batch information
obs	tissue	string	Tissue information
uns	dataset_id	string	A unique identifier for the dataset

Example:

AnnData object
 obs: 'celltype', 'batch', 'tissue'
 uns: 'dataset_id'
 layers: 'counts'
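
Each file format above adds slots to the one before it in the pipeline, so the four slot tables can be written as one base specification plus three increments. The following is a sketch of a slot checker built on that idea. The names `extend` and `missing_slots` are hypothetical helpers, and every listed slot is treated as required, which is an assumption (the tables describe `size_factors` as present "if any").

```python
from types import SimpleNamespace

# Slot names per AnnData struct, copied from the slot tables above.
RAW = {
    "layers": {"counts"},
    "obs": {"celltype", "batch", "tissue"},
    "uns": {"dataset_id"},
}
NORMALIZED_EXTRA = {
    "layers": {"normalized"},
    "obs": {"size_factors"},
    "uns": {"normalization_id"},
}
PCA_EXTRA = {
    "obsm": {"X_pca"},
    "varm": {"pca_loadings"},
    "uns": {"pca_variance"},
}
HVG_EXTRA = {"var": {"hvg", "hvg_score"}}


def extend(base, extra):
    """Return a new spec containing every slot of `base` plus those in `extra`."""
    structs = set(base) | set(extra)
    return {s: base.get(s, set()) | extra.get(s, set()) for s in structs}


NORMALIZED = extend(RAW, NORMALIZED_EXTRA)
PCA = extend(NORMALIZED, PCA_EXTRA)
PCA_HVG = extend(PCA, HVG_EXTRA)


def missing_slots(adata, spec):
    """List 'struct.name' entries that `spec` lists but `adata` lacks.

    Works on any object whose structs (obs, var, uns, ...) support `in`,
    which covers AnnData objects as well as the stand-in used below.
    """
    missing = []
    for struct, names in sorted(spec.items()):
        container = getattr(adata, struct, {})
        missing += [f"{struct}.{n}" for n in sorted(names) if n not in container]
    return missing


# Stand-in for a loaded raw dataset; a real AnnData object could be passed instead.
raw = SimpleNamespace(
    layers={"counts": None},
    obs={"celltype": None, "batch": None, "tissue": None},
    uns={"dataset_id": "example"},
)
print(missing_slots(raw, RAW))         # []
print(missing_slots(raw, NORMALIZED))
# ['layers.normalized', 'obs.size_factors', 'uns.normalization_id']
```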

Component API

`Dataset Loader`

Arguments:

Name	Type	Direction	Description
`--output`	Raw Dataset	output	An unprocessed dataset as output by a dataset loader.

`Normalization`

Arguments:

Name	Type	Direction	Description
`--input`	Raw Dataset	input	An unprocessed dataset as output by a dataset loader.
`--output`	Normalized Dataset	output	A normalized dataset
`--layer_output`	`string`	input	The name of the layer in which to store the normalized data.
`--obs_size_factors`	`string`	input	In which .obs slot to store the size factors (if any).
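
How the two string arguments route results into named slots can be sketched with plain dictionaries standing in for an AnnData object. The normalisation itself (log1p of counts divided by a per-cell library-size factor) is only one common choice and is an assumption here, as are the function name, the default slot names, and the `normalization_id` value; the table above does not say which method this repository applies.

```python
import math


def normalize(dataset, layer_output="normalized", obs_size_factors="size_factors"):
    """Return a copy of `dataset` with a normalised layer and size factors added.

    `dataset` is a plain dict with "layers", "obs" and "uns" entries, used
    here as a stand-in for an AnnData object. Assumes every cell has a
    non-zero total count.
    """
    counts = dataset["layers"]["counts"]  # rows are cells, columns are features
    totals = [sum(row) for row in counts]
    mean_total = sum(totals) / len(totals)
    # Size factor: each cell's library size relative to the mean library size.
    size_factors = [t / mean_total for t in totals]
    normalized = [
        [math.log1p(x / sf) for x in row]
        for row, sf in zip(counts, size_factors)
    ]
    return {
        "layers": {**dataset["layers"], layer_output: normalized},
        "obs": {**dataset["obs"], obs_size_factors: size_factors},
        # Hypothetical identifier, chosen for this example only.
        "uns": {**dataset["uns"], "normalization_id": "log_library_size"},
    }


raw = {
    "layers": {"counts": [[1, 1], [3, 3]]},
    "obs": {"celltype": ["a", "b"], "batch": ["x", "x"], "tissue": ["t", "t"]},
    "uns": {"dataset_id": "example"},
}
out = normalize(raw)
print(out["obs"]["size_factors"])  # [0.5, 1.5]
```

The raw `counts` layer is carried over unchanged, which matches the `Normalized Dataset` format keeping both `counts` and `normalized`.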

`Processor Hvg`

Arguments:

Name	Type	Direction	Description
`--input`	Dataset+Pca	input	A normalised dataset with a PCA embedding.
`--layer_input`	`string`	input	Which layer to use as input for the HVG selection.
`--output`	Dataset+Pca+Hvg	output	A normalised dataset with a PCA embedding and HVG selection.
`--var_hvg`	`string`	input	In which .var slot to store whether a feature is considered to be hvg.
`--var_hvg_score`	`string`	input	In which .var slot to store a ranking of the features by variance.
`--num_features`	`integer`	input	The number of HVG to select
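
The relationship between `--num_features`, the score slot and the boolean slot can be illustrated as follows. This is a sketch under stated assumptions: a higher score is taken to mean a more variable feature, the scores are made up, and `select_hvg` is a hypothetical helper rather than the component's actual code.

```python
def select_hvg(scores, num_features):
    """Return one boolean per feature, True for the `num_features` top-scoring ones."""
    if num_features < 0:
        raise ValueError("num_features must be non-negative")
    # Feature indices sorted by descending score; ties keep their original order.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    selected = set(ranked[:num_features])
    return [i in selected for i in range(len(scores))]


scores = [2, 15, 9, 1]  # made-up per-feature scores
var = {"hvg_score": scores, "hvg": select_hvg(scores, num_features=2)}
print(var["hvg"])  # [False, True, True, False]
```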

`Processor Pca`

Arguments:

Name	Type	Direction	Description
`--input`	Normalized Dataset	input	A normalized dataset
`--layer_input`	`string`	input	Which layer to use as input for the PCA.
`--output`	Dataset+Pca	output	A normalised dataset with a PCA embedding.
`--obsm_embedding`	`string`	input	In which .obsm slot to store the resulting embedding.
`--varm_loadings`	`string`	input	In which .varm slot to store the resulting loadings matrix.
`--uns_variance`	`string`	input	In which .uns slot to store the resulting variance objects.
`--num_components`	`integer`	input	Number of principal components to compute. Defaults to 50, or to one less than the minimum dimension size of the selected representation if that is smaller.
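
The `--num_components` default is easiest to read as "at most 50, and never more than one less than the smaller matrix dimension". Assuming that reading, the rule can be sketched as a small function; `default_num_components` and its `cap` parameter are hypothetical names used only for illustration.

```python
def default_num_components(n_obs, n_vars, cap=50):
    """Default number of principal components for an n_obs x n_vars matrix.

    Uses `cap` components when the data allow it, otherwise one less than
    the smaller of the two dimensions.
    """
    return min(cap, min(n_obs, n_vars) - 1)


print(default_num_components(1000, 2000))  # 50
print(default_num_components(30, 2000))    # 29
print(default_num_components(50, 2000))    # 49
```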
