Commit
Merge pull request #77 from daniel-unyi-42/main
Update tutorial notebook
daniel-unyi-42 authored Jan 8, 2025
2 parents ed570a5 + 7e6b4d8 commit e7b89a4
Showing 9 changed files with 1,035 additions and 1,084 deletions.
6 changes: 1 addition & 5 deletions README.md
@@ -2,11 +2,8 @@

 [![pre-commit.ci status](https://results.pre-commit.ci/badge/github/EliHei2/segger_dev/main.svg)](https://results.pre-commit.ci/latest/github/EliHei2/segger_dev/main)

-
-**Important note (Dec 2024)**: As segger is currently under active development, we highly recommend installing segger directly from GitHub.
-
 **segger** is a cutting-edge tool for **cell segmentation** in **single-molecule spatial omics** datasets. By leveraging **graph neural networks (GNNs)** and heterogeneous graphs, segger offers unmatched accuracy and scalability.
# How segger Works
@@ -52,7 +49,7 @@ segger tackles these with a **graph-based approach**, achieving superior segment

 ---

-## Installation
+## Installation

 **Important note (Dec 2024)**: As segger is currently under active development, we highly recommend installing segger directly from GitHub.

@@ -78,7 +75,6 @@ pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -

 Afterwards choose the installation method that best suits your needs.

-
 ### GitHub Installation

 For a straightforward local installation from GitHub, clone the repository and install the package using `pip`:
1,848 changes: 914 additions & 934 deletions docs/notebooks/segger_tutorial.ipynb

Large diffs are not rendered by default.

74 changes: 32 additions & 42 deletions scripts/create_data_fast_sample.py
@@ -7,87 +7,77 @@
 import numpy as np
 from segger.data.parquet._utils import get_polygons_from_xy

-xenium_data_dir = Path('data_raw/breast_cancer/Xenium_FFPE_Human_Breast_Cancer_Rep1/outs/')
-segger_data_dir = Path('data_tidy/pyg_datasets/bc_rep1_emb')
+xenium_data_dir = Path("data_raw/breast_cancer/Xenium_FFPE_Human_Breast_Cancer_Rep1/outs/")
+segger_data_dir = Path("data_tidy/pyg_datasets/bc_rep1_emb")


-scrnaseq_file = Path('/omics/groups/OE0606/internal/tangy/tasks/schier/data/atals_filtered.h5ad')
-celltype_column = 'celltype_major'
-gene_celltype_abundance_embedding = calculate_gene_celltype_abundance_embedding(
-    sc.read(scrnaseq_file),
-    celltype_column
-)
+scrnaseq_file = Path("/omics/groups/OE0606/internal/tangy/tasks/schier/data/atals_filtered.h5ad")
+celltype_column = "celltype_major"
+gene_celltype_abundance_embedding = calculate_gene_celltype_abundance_embedding(sc.read(scrnaseq_file), celltype_column)

 sample = STSampleParquet(
     base_dir=xenium_data_dir,
     n_workers=4,
-    sample_type='xenium',
-    weights=gene_celltype_abundance_embedding, # uncomment if gene-celltype embeddings are available
+    sample_type="xenium",
+    weights=gene_celltype_abundance_embedding,  # uncomment if gene-celltype embeddings are available
 )

-transcripts = pd.read_parquet(
-    xenium_data_dir / 'transcripts.parquet',
-    filters=[[('overlaps_nucleus', '=', 1)]]
-)
-boundaries = pd.read_parquet(xenium_data_dir / 'nucleus_boundaries.parquet')
+transcripts = pd.read_parquet(xenium_data_dir / "transcripts.parquet", filters=[[("overlaps_nucleus", "=", 1)]])
+boundaries = pd.read_parquet(xenium_data_dir / "nucleus_boundaries.parquet")

-sizes = transcripts.groupby('cell_id').size()
-polygons = get_polygons_from_xy(boundaries, 'vertex_x', 'vertex_y', 'cell_id')
+sizes = transcripts.groupby("cell_id").size()
+polygons = get_polygons_from_xy(boundaries, "vertex_x", "vertex_y", "cell_id")
 densities = polygons[sizes.index].area / sizes
 bd_width = polygons.minimum_bounding_radius().median() * 2

 # 1/4 median boundary diameter
 dist_tx = bd_width / 4
 # 90th percentile density of bounding circle with radius=dist_tx
-k_tx = math.ceil(np.quantile(dist_tx ** 2 * np.pi * densities, 0.9))
+k_tx = math.ceil(np.quantile(dist_tx**2 * np.pi * densities, 0.9))

 print(k_tx)
 print(dist_tx)
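In the script above, the `filters` argument to `pd.read_parquet` prunes rows at the parquet layer before they reach memory, and the subsequent `groupby(...).size()` yields the nuclear transcript count per cell. A toy in-memory equivalent of that filter-then-count step (the data below is hypothetical, not from the Xenium sample):

```python
import pandas as pd

# Toy transcripts table with the two columns the script relies on
transcripts = pd.DataFrame({
    "cell_id": ["a", "a", "a", "b", "b", "c"],
    "overlaps_nucleus": [1, 1, 0, 1, 0, 1],
})

# In-memory equivalent of filters=[[("overlaps_nucleus", "=", 1)]]
nuclear = transcripts[transcripts["overlaps_nucleus"] == 1]

# Per-cell nuclear transcript counts, as in sizes = transcripts.groupby("cell_id").size()
sizes = nuclear.groupby("cell_id").size()
print(sizes.to_dict())  # → {'a': 2, 'b': 1, 'c': 1}
```

Pushing the predicate into `filters` matters at Xenium scale: the full `transcripts.parquet` holds tens of millions of rows, and row-group-level filtering avoids materializing the non-nuclear majority.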


 sample.save(
-    data_dir=segger_data_dir,
-    k_bd=3,
-    dist_bd=15.0,
-    k_tx=dist_tx,
-    dist_tx=k_tx,
-    tile_width=120,
-    tile_height=120,
-    neg_sampling_ratio=5.0,
-    frac=1.0,
-    val_prob=0.1,
-    test_prob=0.1,
+    data_dir=segger_data_dir,
+    k_bd=3,
+    dist_bd=15.0,
+    k_tx=dist_tx,
+    dist_tx=k_tx,
+    tile_width=120,
+    tile_height=120,
+    neg_sampling_ratio=5.0,
+    frac=1.0,
+    val_prob=0.1,
+    test_prob=0.1,
 )


-xenium_data_dir = Path('data_tidy/bc_5k')
-segger_data_dir = Path('data_tidy/pyg_datasets/bc_5k_emb')
-
+xenium_data_dir = Path("data_tidy/bc_5k")
+segger_data_dir = Path("data_tidy/pyg_datasets/bc_5k_emb")


 sample = STSampleParquet(
     base_dir=xenium_data_dir,
     n_workers=1,
-    sample_type='xenium',
-    weights=gene_celltype_abundance_embedding, # uncomment if gene-celltype embeddings are available
+    sample_type="xenium",
+    weights=gene_celltype_abundance_embedding,  # uncomment if gene-celltype embeddings are available
 )

-transcripts = pd.read_parquet(
-    xenium_data_dir / 'transcripts.parquet',
-    filters=[[('overlaps_nucleus', '=', 1)]]
-)
-boundaries = pd.read_parquet(xenium_data_dir / 'nucleus_boundaries.parquet')
+transcripts = pd.read_parquet(xenium_data_dir / "transcripts.parquet", filters=[[("overlaps_nucleus", "=", 1)]])
+boundaries = pd.read_parquet(xenium_data_dir / "nucleus_boundaries.parquet")

-sizes = transcripts.groupby('cell_id').size()
-polygons = get_polygons_from_xy(boundaries, 'vertex_x', 'vertex_y', 'cell_id')
+sizes = transcripts.groupby("cell_id").size()
+polygons = get_polygons_from_xy(boundaries, "vertex_x", "vertex_y", "cell_id")
 densities = polygons[sizes.index].area / sizes
 bd_width = polygons.minimum_bounding_radius().median() * 2

 # 1/4 median boundary diameter
 dist_tx = bd_width / 4
 # 90th percentile density of bounding circle with radius=dist_tx
-k_tx = math.ceil(np.quantile(dist_tx ** 2 * np.pi * densities, 0.9))
+k_tx = math.ceil(np.quantile(dist_tx**2 * np.pi * densities, 0.9))

 print(k_tx)
 print(dist_tx)
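The script derives its graph-construction parameters from the data itself: `dist_tx` is a quarter of the median nucleus bounding-circle diameter, and `k_tx` is the 90th-percentile transcript count expected inside a circle of radius `dist_tx`. A self-contained sketch of that heuristic with synthetic numbers (all values below are hypothetical, not from the Xenium sample). One caveat: the script computes `densities = polygons.area / sizes`, i.e. area *per* transcript; for the circle-area product to yield a count, the density must be transcripts per unit area, which is what this sketch uses — whether the inversion in the script is intentional is unclear:

```python
import math

import numpy as np

# Hypothetical per-cell statistics standing in for what the script derives
# from nucleus_boundaries.parquet and transcripts.parquet.
rng = np.random.default_rng(42)
cell_areas = rng.uniform(60.0, 150.0, size=200)   # nucleus polygon area per cell
tx_counts = rng.integers(30, 400, size=200)       # nuclear transcripts per cell
tx_per_area = tx_counts / cell_areas              # transcript density (count / area)
bd_diameters = rng.uniform(8.0, 14.0, size=200)   # bounding-circle diameters per cell

# 1/4 of the median boundary diameter -> neighbor search radius for transcripts
bd_width = float(np.median(bd_diameters))
dist_tx = bd_width / 4

# 90th-percentile expected transcript count inside a circle of radius dist_tx
k_tx = math.ceil(np.quantile(dist_tx**2 * np.pi * tx_per_area, 0.9))

print(dist_tx, k_tx)
```

A second observation: `sample.save(..., k_tx=dist_tx, dist_tx=k_tx, ...)` above passes the distance to the `k_tx` (neighbor count) argument and the count to `dist_tx`, which looks like a transposition of the two keyword arguments worth double-checking.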
6 changes: 2 additions & 4 deletions scripts/predict_model_sample.py
@@ -16,12 +16,11 @@
 import dask.dataframe as dd

-
 seg_tag = "bc_fast_data_emb_major"
 model_version = 1

-segger_data_dir = Path('data_tidy/pyg_datasets') / seg_tag
-models_dir = Path("./models") / seg_tag
+segger_data_dir = Path("data_tidy/pyg_datasets") / seg_tag
+models_dir = Path("./models") / seg_tag
 benchmarks_dir = Path("/dkfz/cluster/gpu/data/OE0606/elihei/segger_experiments/data_tidy/benchmarks/xe_rep1_bc")
 transcripts_file = "data_raw/xenium/Xenium_FFPE_Human_Breast_Cancer_Rep1/transcripts.parquet"
 # Initialize the Lightning data module
@@ -58,4 +57,3 @@
     gpu_ids=["0"],
     # client=client
 )
-
