5. Step D — Manual granule subtyping only (benchmark_subtyping.ipynb)

Automated rule-based subtyping (classify_granules) is not covered here; the benchmark notebook’s primary annotation path is manual.

Step D1 — Normalize for clustering

After the profile step, store the raw counts in layers["counts"], then apply sc.pp.normalize_total and sc.pp.log1p to X. If your notebook uses a different normalization, follow it exactly so that results stay consistent with the saved h5ad files.

Step D2 — Choose k and run MiniBatch k-means

Fix n_clusters (e.g. 15 in the benchmark), seed, batch_size, and n_init up front so the run is reproducible. Fit on the same matrix you normalized for clustering, converting X to a dense array first if it is sparse.

import numpy as np
import pandas as pd
from sklearn.cluster import MiniBatchKMeans

def run_manual_subtyping(granule_adata, n_clusters, seed, batch_size=5000, n_init=20, obs_key="granule_subtype_kmeans"):
    # Work on a dense copy of X so the original matrix is left untouched.
    data = granule_adata.X.copy()
    if hasattr(data, "toarray"):
        data = data.toarray()
    np.random.seed(seed)
    kmeans = MiniBatchKMeans(
        n_clusters=n_clusters,
        batch_size=batch_size,
        random_state=seed,
        n_init=n_init,
    )
    kmeans.fit(data)
    # Store labels as an ordered categorical of strings "0" .. str(n_clusters - 1).
    granule_adata.obs[obs_key] = pd.Categorical(
        kmeans.labels_.astype(str),
        categories=[str(i) for i in range(n_clusters)],
        ordered=True,
    )
    return granule_adata

Output: obs[obs_key], an ordered categorical of string cluster ids "0", "1", …, up to str(n_clusters - 1).
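Because the manual mapping in the later steps refers to cluster ids, it can help to confirm that a fixed seed gives the same labels on a rerun. The sketch below checks this on synthetic, well-separated blobs; the helper fit_labels and the toy data are assumptions for illustration only, and it works on a plain NumPy array rather than an AnnData object.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans


def fit_labels(data, n_clusters, seed, batch_size=5000, n_init=20):
    """Fit MiniBatchKMeans with a fixed random_state and return the labels."""
    kmeans = MiniBatchKMeans(
        n_clusters=n_clusters,
        batch_size=batch_size,
        random_state=seed,
        n_init=n_init,
    )
    return kmeans.fit(data).labels_


# Three well-separated synthetic blobs of 50 points each (toy data).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
toy = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

labels_a = fit_labels(toy, n_clusters=3, seed=42)
labels_b = fit_labels(toy, n_clusters=3, seed=42)
same_labels = bool(np.array_equal(labels_a, labels_b))
```

If same_labels is False on your own data, check that the input matrix, seed, batch_size, and n_init are identical between runs before reusing a mapping.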

Step D3 — Heatmap-driven biology

  1. Pick a reference gene list (e.g. synaptic markers overlapping var_names).

  2. Plot scanpy.pl.heatmap with groupby=obs_key, standard_scale="var", to see which clusters look pre-synaptic, post-synaptic, dendritic, mixed, etc.
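As a numeric companion to the heatmap, the following sketch filters a marker list to the genes that are present and tabulates per-cluster mean expression, min-max scaled per gene. This is not what sc.pl.heatmap computes internally (the heatmap scales per-cell values); it is only a summary table for reading off which clusters are high for which markers. The function name, the toy values, and the marker list are illustrative assumptions, not the benchmark's reference list. With AnnData, the expression frame could come from granule_adata.to_df().

```python
import pandas as pd


def cluster_marker_table(expr, clusters, marker_genes):
    """Per-cluster mean of each available marker, scaled to [0, 1] per gene."""
    # Keep only markers that exist as columns (the overlap with var_names).
    present = [g for g in marker_genes if g in expr.columns]
    means = expr[present].groupby(clusters.astype(str)).mean()
    # Avoid division by zero for genes that are constant across clusters.
    span = (means.max() - means.min()).replace(0, 1.0)
    return (means - means.min()) / span


# Toy expression matrix: 4 granules x 3 genes (illustrative values only).
expr = pd.DataFrame(
    {"Syp": [5, 6, 0, 1], "Dlg4": [0, 1, 4, 6], "Actb": [3, 3, 3, 3]},
    index=["g0", "g1", "g2", "g3"],
)
clusters = pd.Series(["0", "0", "1", "1"], index=expr.index)

# "Map2" is absent from the toy matrix and is silently dropped.
table = cluster_marker_table(expr, clusters, ["Syp", "Dlg4", "Map2"])
```

In this toy table, cluster "0" scores highest for Syp and cluster "1" for Dlg4, which is the kind of pattern you would then confirm visually on the heatmap.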

Step D4 — Manual mapping dictionary

Build a mapping from biological subtype names to lists of cluster id strings:

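def apply_manual_annotation(granule_adata, mapping, cluster_column="granule_subtype_kmeans"):
    # Invert {subtype: [cluster ids]} into {cluster id: subtype}.
    k2sub = {}
    for subtype, clusters in mapping.items():
        for c in clusters:
            k2sub[str(c)] = subtype  # accept int or str cluster ids
    # Clusters missing from the mapping stay NaN.
    granule_adata.obs["granule_subtype_manual"] = (
        granule_adata.obs[cluster_column].astype(str).map(k2sub)
    )
    # Collapse combined labels such as "pre & post" to "mixed"; keep NaN as NaN
    # instead of turning it into the string "nan".
    granule_adata.obs["granule_subtype_manual_simple"] = granule_adata.obs["granule_subtype_manual"].apply(
        lambda s: "mixed" if isinstance(s, str) and " & " in s else s
    )
    return granule_adata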

Convention: finer labels live in granule_subtype_manual (e.g. "pre & post"); granule_subtype_manual_simple collapses any label containing " & " to "mixed" for density and summaries.
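To make the mapping format concrete, here is a small sketch with a made-up mapping and a hypothetical helper, invert_mapping, that rejects clusters assigned twice and reports clusters left unassigned. The subtype names and cluster ids are placeholders; the real assignment comes from reading your own heatmap.

```python
def invert_mapping(mapping, n_clusters):
    """Turn {subtype: [cluster ids]} into {cluster id: subtype} with checks."""
    cluster_to_subtype = {}
    for subtype, clusters in mapping.items():
        for c in clusters:
            c = str(c)
            if c in cluster_to_subtype:
                raise ValueError(
                    f"cluster {c} is assigned to both "
                    f"{cluster_to_subtype[c]!r} and {subtype!r}"
                )
            cluster_to_subtype[c] = subtype
    # Clusters that no subtype claims.
    unmapped = [str(i) for i in range(n_clusters) if str(i) not in cluster_to_subtype]
    return cluster_to_subtype, unmapped


# Placeholder mapping for a 5-cluster run (illustrative only).
example_mapping = {
    "pre": ["0", "3"],
    "post": ["1"],
    "pre & post": ["2"],
}
lookup, unmapped = invert_mapping(example_mapping, n_clusters=5)

# Same collapsing rule as the convention above: any " & " label becomes "mixed".
simple = {c: ("mixed" if " & " in s else s) for c, s in lookup.items()}
```

Running a check like this before annotating makes it obvious when a cluster id was left out of the mapping or listed under two subtypes.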

Step D5 — Paired WT + AD objects (if applicable)

For cross-sample workflows, concatenate the WT and AD granule_adata objects, restrict them to their common genes, normalize, and run k-means once on the combined matrix. Then annotate with a single MANUAL_SUBTYPE_MAPPING keyed by filename or setting. The benchmark notebook distinguishes samples with obs["sample"] in ("WT", "AD") or with batch labels.

Next: Step E — WT vs AD density