Commit 381c33b

Merge pull request #54 from HiDiHlabs/restructure
Small improvements/refactorings
2 parents 2ebd2e3 + c80bc28 commit 381c33b

10 files changed: +89 -58 lines changed

.pre-commit-config.yaml

Lines changed: 3 additions & 3 deletions
@@ -15,14 +15,14 @@ repos:
       - id: no-commit-to-branch
         args: [--branch=main]
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.11.4
+    rev: v0.11.12
     hooks:
       # Linter
-      - id: ruff
+      - id: ruff-check
       # Formatter
       - id: ruff-format
   - repo: https://github.com/pre-commit/mirrors-mypy
-    rev: v1.15.0
+    rev: v1.16.0
     hooks:
       - id: mypy
         additional_dependencies:

.readthedocs.yaml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 version: 2
 build:
-  os: ubuntu-22.04
+  os: ubuntu-24.04
   tools:
     python: "3.12"
 sphinx:

LICENSE

Lines changed: 2 additions & 1 deletion
@@ -1,6 +1,7 @@
 MIT License

-Copyright (c) 2023 sebastiantiesmeyer
+Copyright (c) 2025 Sebastian Tiesmeyer, Niklas Müller-Bötticher, Naveed Ishaque,
+Roland Eils, Berlin Institute of Health @ Charité

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

README.md

Lines changed: 3 additions & 3 deletions
@@ -11,7 +11,7 @@ Much of spatial biology uses microscopic tissue slices to study the spatial dist
 ![3D slice visualization](docs/resources/cell_overlap_visualization.jpg)

 Ovrl.py is a quality-control tool for spatial transcriptomics data that can help analysts find sources of vertical signal inconsistency in their data.
-It is works with imaging-based spatial transcriptomics data, such as 10x genomics' Xenium or vizgen's MERFISH platforms.
+It works with imaging-based spatial transcriptomics data, such as 10x genomics' Xenium or vizgen's MERSCOPE platforms.
 The main feature of the tool is the production of 'signal integrity maps' that can help analysts identify sources of signal inconsistency in their data.
 Users can also use the built-in 3D visualisation tool to explore regions of signal inconsistency in their data on a molecular level.

@@ -38,7 +38,7 @@ import pandas as pd
 import ovrlpy

 # define ovrlpy analysis parameters
-n_components = 20
+n_components = 20 # number of PCA components

 # load the data
 coordinate_df = pd.read_csv('path/to/coordinate_file.csv')
@@ -90,7 +90,7 @@ doublet_to_show = 0

 x, y = doublets["x", "y"].row(doublet_to_show)

-fig = ovrlpy.plot_region_of_interest(dataset, x, y, window_size=window_size)
+fig = ovrlpy.plot_region_of_interest(dataset, x, y, window_size=50)
 ```

 ![plot_region_of_interest output](docs/resources/plot_roi.png)

docs/source/index.rst

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ Introduction
 In spatial biology, tissue slices are commonly used to study the spatial distribution of cells and molecules. However, since these slices represent 3D structures in 2D, overlapping structures in the vertical dimension can lead to artefacts and inconsistencies in the data.

 **ovrlpy** is a quality-control tool for spatial transcriptomics data that can help analysts find sources of vertical signal inconsistency in their data.
-It is works with imaging-based spatial transcriptomics data, such as 10x Genomics' Xenium or Vizgen's MERFISH platforms.
+It works with imaging-based spatial transcriptomics data, such as 10x Genomics' Xenium or Vizgen's MERSCOPE platforms.
 The main feature of the tool is the production of 'signal integrity maps' that can help analysts identify sources of signal inconsistency in their data.
 Users can also use the built-in 3D visualisation tool to explore regions of signal inconsistency in their data on a molecular level.

docs/source/tutorials/vizgen_liver.ipynb

Lines changed: 3 additions & 3 deletions
@@ -5,9 +5,9 @@
    "id": "8ef9e021-1b2e-4086-868a-86d7e77b6f04",
    "metadata": {},
    "source": [
-    "# MERFISH mouse liver\n",
+    "# MERSCOPE mouse liver\n",
     "\n",
-    "In this notebook, we will use ovrlpy to investigate the [Vizgen MERFISH's mouse liver dataset](https://info.vizgen.com/mouse-liver-data).\n",
+    "In this notebook, we will use ovrlpy to investigate the [Vizgen MERSCOPE's mouse liver dataset](https://info.vizgen.com/mouse-liver-data).\n",
     "\n",
     "We want to create a signal embedding of the transcriptome, and a vertical signal incoherence map to identify locations with a high risk of containing spatial doublets."
    ]
@@ -78,7 +78,7 @@
     }
    ],
    "source": [
-    "coordinate_df = ovrlpy.io.read_MERFISH(data_path / \"detected_transcripts.csv\")\n",
+    "coordinate_df = ovrlpy.io.read_MERSCOPE(data_path / \"detected_transcripts.csv\")\n",
     "\n",
     "print(f\"Number of transcripts: {len(coordinate_df):,}\")"
    ]

ovrlpy/_ovrlp.py

Lines changed: 19 additions & 19 deletions
@@ -84,13 +84,13 @@ class Ovrlp:
         The center of gravity of each celltype in the 2D embedding, used for UMAP annotation.
     celltype_assignments : numpy.ndarray
         The assignments of the cell types.
-    pca_2d : sklearn.decomposition.PCA
-        The PCA object used for the 2D embedding.
-    embedder_2d : umap.UMAP
+    pca : sklearn.decomposition.PCA
+        The PCA object used for the 2D embedding and calculating the VSI score.
+    umap_2d : umap.UMAP
         The UMAP object used for the 2D embedding.
-    pca_3d : sklearn.decomposition.PCA
+    pca_rgb : sklearn.decomposition.PCA
         The PCA object used for the 3D RGB embedding.
-    embedder_3d : umap.UMAP
+    umap_rgb : umap.UMAP
         The UMAP object used for the 3D RGB embedding.
     genes : list
         A list of genes to utilize in the model.
@@ -147,10 +147,10 @@ def __init__(
         n_jobs = n_workers if cumap_kwargs.get("random_state") is None else 1
         cumap_kwargs["n_jobs"] = n_jobs

-        self.pca_2d = PCA(n_components=n_components, random_state=random_state)
-        self.embedder_2d = UMAP(**(umap_kwargs | {"n_components": 2}))
-        self.pca_3d = PCA(n_components=3, random_state=random_state)
-        self.embedder_3d = UMAP(**(cumap_kwargs | {"n_components": 3}))
+        self.pca = PCA(n_components=n_components, random_state=random_state)
+        self.umap_2d = UMAP(**(umap_kwargs | {"n_components": 2}))
+        self.pca_rgb = PCA(n_components=3, random_state=random_state)
+        self.umap_rgb = UMAP(**(cumap_kwargs | {"n_components": 3}))

     def process_coordinates(self, gridsize: float = 1, **kwargs):
         """
@@ -225,19 +225,19 @@ def fit_pseudocells(self, pseudocells: AnnData, *, fit_umap: bool = True):

         self.pseudocells = pseudocells
         X = pseudocells[:, self.genes].X
-        self.pca_2d.fit(X)
+        self.pca.fit(X)

         if fit_umap:
-            factors = self.pca_2d.transform(X)
+            factors = self.pca.transform(X)

             print(f"Modeling {factors.shape[1]} pseudo-celltype clusters;")

-            self.pseudocells.obsm["2D_UMAP"] = self.embedder_2d.fit_transform(factors)
+            self.pseudocells.obsm["2D_UMAP"] = self.umap_2d.fit_transform(factors)

-            embedding_color = self.embedder_3d.fit_transform(
+            embedding_color = self.umap_rgb.fit_transform(
                 factors / norm(factors, axis=1, keepdims=True)
             )
-            embedding_color = _fill_color_axes(embedding_color, self.pca_3d, fit=True)
+            embedding_color = _fill_color_axes(embedding_color, self.pca_rgb, fit=True)

             self._colors_min_max = (
                 embedding_color.min(axis=0),
@@ -405,7 +405,7 @@ def compute_VSI(self, *, min_transcripts: float = 2):
                 _calculate_embedding,
                 gene_queue,
                 patch_mask,
-                self.pca_2d.components_,
+                self.pca.components_,
                 bandwidth=self.KDE_bandwidth,
                 dtype=self.dtype,
             )
@@ -602,11 +602,11 @@ def transform_pseudocells(

         embedding, embedding_color = _transform_embeddings(
             pseudocells.to_numpy(),
-            self.pca_2d,
-            embedder_2d=self.embedder_2d,
-            embedder_3d=self.embedder_3d,
+            self.pca,
+            umap_2d=self.umap_2d,
+            umap_rgb=self.umap_rgb,
         )
-        embedding_color = _fill_color_axes(embedding_color, self.pca_3d)
+        embedding_color = _fill_color_axes(embedding_color, self.pca_rgb)
         color_min, color_max = self._colors_min_max
         embedding_color = (embedding_color - color_min) / (color_max - color_min)
         embedding_color = np.clip(embedding_color, 0, 1)
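
The constructor in this file builds each UMAP from an expression such as `umap_kwargs | {"n_components": 2}`. The following is a minimal sketch of how that dict-union expression behaves, assuming Python 3.9 or newer. The contents of `umap_kwargs` and the `with_components` helper are hypothetical and not part of ovrlpy; they only illustrate that the right-hand operand wins on key conflicts and that the caller's dict is left untouched.

```python
# Hypothetical user-supplied keyword arguments; not taken from ovrlpy.
umap_kwargs = {"n_neighbors": 20, "n_components": 5, "random_state": 42}


def with_components(kwargs: dict, n_components: int) -> dict:
    """Return a copy of kwargs with n_components forced to the given value."""
    # `a | b` builds a new dict; keys in b override keys in a.
    return kwargs | {"n_components": n_components}


kwargs_2d = with_components(umap_kwargs, 2)
kwargs_rgb = with_components(umap_kwargs, 3)

print(kwargs_2d)
print(kwargs_rgb)
print(umap_kwargs["n_components"])  # the caller's dict is unchanged
```

Because the override sits on the right-hand side, a user-supplied `n_components` cannot change the dimensionality of either embedding.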

ovrlpy/_plotting.py

Lines changed: 5 additions & 4 deletions
@@ -29,6 +29,8 @@
     ][::-1],
 )

+VSI = "vertical signal integrity"
+

 def _plot_scalebar(ax: Axes, dx: float = 1, units="um", **kwargs):
     ax.add_artist(ScaleBar(dx, units=units, **kwargs))
@@ -323,15 +325,14 @@ def plot_signal_integrity(
         bars = ax_hist.barh(bins[1:-1], vals[1:], height=0.01)
         for i, bar in enumerate(bars):
             bar.set_color(colors[i])
-        ax_hist.set(ylim=(0, 1), ylabel="signal integrity")
+        ax_hist.set(ylim=(0, 1), ylabel=VSI, xticks=[])
         ax_hist.yaxis.tick_right()
         ax_hist.yaxis.set_label_position("right")
-        ax_hist.set_xticks([], [])
         ax_hist.invert_xaxis()
         ax_hist.spines[["top", "bottom", "left"]].set_visible(False)

     else:
-        fig.colorbar(img)
+        fig.colorbar(img, label=VSI)

     return fig

@@ -402,7 +403,7 @@ def plot_region_of_interest(

     ax_integrity.set_title("ROI, signal integrity")
     ax_integrity.invert_yaxis()
-    fig.colorbar(img)
+    fig.colorbar(img, label=VSI)

     ax_integrity.set_xlim(x - window_size, x + window_size)
     ax_integrity.set_ylim(y - window_size, y + window_size)

ovrlpy/_utils.py

Lines changed: 3 additions & 5 deletions
@@ -78,14 +78,12 @@ def _minmax_scaling(x: np.ndarray):
     return (x - x_min) / (x_max - x_min)


-def _transform_embeddings(expression, pca: PCA, embedder_2d: UMAP, embedder_3d: UMAP):
+def _transform_embeddings(expression, pca: PCA, umap_2d: UMAP, umap_rgb: UMAP):
     """fit the expression data into the umap embeddings after PCA transformation"""
     factors = pca.transform(expression)

-    embedding = embedder_2d.transform(factors)
-    embedding_color = embedder_3d.transform(
-        factors / norm(factors, axis=1, keepdims=True)
-    )
+    embedding = umap_2d.transform(factors)
+    embedding_color = umap_rgb.transform(factors / norm(factors, axis=1, keepdims=True))

     return embedding, embedding_color
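
`_transform_embeddings` rescales every row of `factors` to unit length before passing it to the RGB UMAP. Below is a minimal pure-Python sketch of that row-wise L2 normalization, written without NumPy so it runs standalone. `normalize_rows` is a hypothetical helper, not an ovrlpy function, and unlike the NumPy expression it raises `ZeroDivisionError` on an all-zero row instead of producing NaNs.

```python
import math


def normalize_rows(rows: list[list[float]]) -> list[list[float]]:
    """Divide each row by its Euclidean (L2) norm."""
    normalized = []
    for row in rows:
        length = math.sqrt(sum(value * value for value in row))
        normalized.append([value / length for value in row])
    return normalized


# Two toy "PCA factor" rows; the values are made up for illustration.
factors = [[3.0, 4.0], [0.0, 2.0]]
unit_rows = normalize_rows(factors)
print(unit_rows)  # [[0.6, 0.8], [0.0, 1.0]]
```

After this step only the direction of each factor vector is kept, so the colour embedding does not depend on the overall magnitude of a row.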

ovrlpy/io.py

Lines changed: 49 additions & 18 deletions
@@ -41,6 +41,7 @@ def read_Xenium(
     *,
     min_qv: float | None = None,
     remove_features: Collection[str] = XENIUM_CTRLS,
+    additional_columns: Collection[str] = [],
     n_threads: int | None = None,
 ) -> pl.DataFrame:
     """
@@ -56,6 +57,8 @@ def read_Xenium(
     remove_features : collections.abc.Collection[str], optional
         List of regex patterns to filter the 'feature_name' column,
         :py:attr:`ovrlpy.io.XENIUM_CTRLS` by default.
+    additional_columns : collections.abc.Collection[str], optional
+        Additional columns to load from the transcripts file.
     n_threads : int | None, optional
         Number of threads used for parsing the input file.
         If None, will default to number of available CPUs.
@@ -65,7 +68,7 @@ def read_Xenium(
     polars.DataFrame
     """
     filepath = Path(filepath)
-    columns = list(_XENIUM_COLUMNS.keys())
+    columns = list(set(_XENIUM_COLUMNS.keys()) | set(additional_columns))

     if filepath.suffix == ".parquet":
         transcripts = pl.scan_parquet(filepath)
@@ -87,7 +90,7 @@ def read_Xenium(
             )

     else:
-        if min_qv is not None:
+        if min_qv is not None and "qv" not in additional_columns:
             columns.append("qv")
         transcripts = pl.read_csv(
             filepath,
@@ -97,26 +100,29 @@ def read_Xenium(
         )

         if min_qv is not None:
-            transcripts = transcripts.filter(pl.col("qv") >= min_qv).drop("qv")
+            transcripts = transcripts.filter(pl.col("qv") >= min_qv)
+            if "qv" not in additional_columns:
+                transcripts = transcripts.drop("qv")

     transcripts = transcripts.rename(_XENIUM_COLUMNS)
     transcripts = _filter_genes(transcripts, remove_features)

     return transcripts


-# Vizgen MERFISH
-_MERFISH_COLUMNS = {"gene": "gene", "global_x": "x", "global_y": "y", "global_z": "z"}
+# Vizgen MERSCOPE
+_MERSCOPE_COLUMNS = {"gene": "gene", "global_x": "x", "global_y": "y", "global_z": "z"}

-MERFISH_CTRLS = ["^Blank"]
+MERSCOPE_CTRLS = ["^Blank"]
 """Patterns for Vizgen controls"""


-def read_MERFISH(
+def read_MERSCOPE(
     filepath: str | os.PathLike,
     z_scale: float = 1.5,
     *,
-    remove_genes: Collection[str] = MERFISH_CTRLS,
+    remove_genes: Collection[str] = MERSCOPE_CTRLS,
+    additional_columns: Collection[str] = [],
     n_threads: int | None = None,
 ) -> pl.DataFrame:
     """
@@ -125,12 +131,14 @@ def read_MERFISH(
     Parameters
     ----------
     filepath : os.PathLike or str
-        Path to the Vizgen transcripts file.
+        Path to the Vizgen transcripts file. Both .csv(.gz) and .parquet files are supported.
     z_scale : float
         Factor to scale z-plane index to um, i.e. distance between z-planes.
     remove_genes : collections.abc.Collection[str], optional
         List of regex patterns to filter the 'gene' column,
-        :py:attr:`ovrlpy.io.MERFISH_CTRLS` by default.
+        :py:attr:`ovrlpy.io.MERSCOPE_CTRLS` by default.
+    additional_columns : collections.abc.Collection[str], optional
+        Additional columns to load from the transcripts file.
     n_threads : int | None, optional
         Number of threads used for parsing the input file.
         If None, will default to number of available CPUs.
@@ -139,18 +147,38 @@ def read_MERFISH(
     -------
     polars.DataFrame
     """
+    filepath = Path(filepath)
+    columns = list(set(_MERSCOPE_COLUMNS.keys()) | set(additional_columns))

-    transcripts = pl.read_csv(
-        Path(filepath),
-        columns=list(_MERFISH_COLUMNS.keys()),
-        schema_overrides={"gene": pl.Categorical},
-        n_threads=n_threads,
-    ).rename(_MERFISH_COLUMNS)
+    if filepath.suffixes[-2:] == [".csv", ".gz"]:
+        transcripts = pl.read_csv(
+            filepath,
+            columns=columns,
+            schema_overrides={"gene": pl.Categorical},
+            n_threads=n_threads,
+        )

+    else:
+        if filepath.suffix == ".parquet":
+            transcripts = pl.scan_parquet(filepath)
+        elif filepath.suffix == ".csv":
+            transcripts = pl.scan_csv(filepath)
+        else:
+            raise ValueError(
+                "Unsupported file format; must be one of .csv(.gz) or .parquet"
+            )
+
+        with pl.StringCache():
+            transcripts = (
+                transcripts.select(columns)
+                .with_columns(pl.col("gene").cast(pl.String).cast(pl.Categorical))
+                .collect()
+            )
+
+    transcripts = transcripts.rename(_MERSCOPE_COLUMNS)
     transcripts = _filter_genes(transcripts, remove_genes)

     # convert plane to um
-
     transcripts = transcripts.with_columns(pl.col("z") * z_scale)

     return transcripts
@@ -168,6 +196,7 @@ def read_CosMx(
     scale: Mapping[str, float] = {"xy": 0.12028, "z": 0.8},
     *,
     remove_targets: Collection[str] = COSMX_CTRLS,
+    additional_columns: Collection[str] = [],
     n_threads: int | None = None,
 ) -> pl.DataFrame:
     """
@@ -182,6 +211,8 @@ def read_CosMx(
     remove_targets : collections.abc.Collection[str], optional
         List of regex patterns to filter the 'target' column,
         :py:attr:`ovrlpy.io.COSMX_CTRLS` by default.
+    additional_columns : collections.abc.Collection[str], optional
+        Additional columns to load from the transcripts file.
     n_threads : int | None, optional
         Number of threads used for parsing the input file.
         If None, will default to number of available CPUs.
@@ -193,7 +224,7 @@ def read_CosMx(

     transcripts = pl.read_csv(
         Path(filepath),
-        columns=list(_COSMX_COLUMNS.keys()),
+        columns=list(set(_COSMX_COLUMNS.keys()) | set(additional_columns)),
         schema_overrides={"target": pl.Categorical},
         n_threads=n_threads,
     ).rename(_COSMX_COLUMNS)