# 5. Step D — Manual granule subtyping only (`benchmark_subtyping.ipynb`)

[← Tutorial index](./README.md)

Automated rule-based subtyping (`classify_granules`) is **not** covered here; the benchmark notebook’s primary annotation path is **manual**.

### Step D1 — Normalize for clustering

After **`profile`**, copy the raw counts into **`layers["counts"]`** before transforming anything, then apply **`sc.pp.normalize_total`** followed by **`sc.pp.log1p`** to **`X`**. If you work from saved `h5ad` files, follow your notebook's exact normalization so results stay consistent with them.
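To make the two transforms concrete, here is a minimal NumPy sketch of the arithmetic: scale every granule to a common total, then take `log(1 + x)`. The helper name `normalize_log1p`, the toy matrix, and the choice of the median non-zero total as the default target are assumptions for illustration; in the notebook the scanpy calls above do this work and their defaults may differ in detail.

```python
import numpy as np

def normalize_log1p(counts, target_sum=None):
    """Scale each row (granule) to a common total, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1)
    if target_sum is None:
        # Assumption: use the median total of the non-empty granules.
        target_sum = np.median(totals[totals > 0])
    # Leave all-zero rows untouched instead of dividing by zero.
    safe_totals = np.where(totals > 0, totals, 1.0)
    scaled = counts / safe_totals[:, None] * target_sum
    return np.log1p(scaled)

raw = np.array([[2, 0, 2], [10, 5, 5], [0, 0, 0]])
raw_backup = raw.copy()            # plays the role of layers["counts"]
normalized = normalize_log1p(raw)  # raw itself is not modified
```

Keeping the untouched copy first matters because the transforms are lossy in practice: once `X` is overwritten, the counts layer is the only place the original integers survive.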

### Step D2 — Choose k and run MiniBatch k-means

Fix **`n_clusters`** (e.g. 15 in the benchmark), **`seed`**, **`batch_size`**, and **`n_init`** up front so that runs are reproducible. Fit on the matrix used for clustering, converting `X` to a dense array first if it is sparse.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import MiniBatchKMeans

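def run_manual_subtyping(granule_adata, n_clusters, seed, batch_size=5000, n_init=20, obs_key="granule_subtype_kmeans"):
    """Cluster granules with MiniBatch k-means and store the labels in obs[obs_key]."""
    data = granule_adata.X.copy()
    # MiniBatchKMeans is fitted on a dense array here; densify sparse input.
    if hasattr(data, "toarray"):
        data = data.toarray()
    # random_state already makes the fit reproducible; the global seed
    # only covers any other NumPy calls made afterwards.
    np.random.seed(seed)
    kmeans = MiniBatchKMeans(
        n_clusters=n_clusters,
        batch_size=batch_size,
        random_state=seed,
        n_init=n_init,
    )
    kmeans.fit(data)
    # Store labels as an ordered categorical so "2" sorts before "10".
    granule_adata.obs[obs_key] = pd.Categorical(
        kmeans.labels_.astype(str),
        categories=[str(i) for i in range(n_clusters)],
        ordered=True,
    )
    return granule_adata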
```

**Output:** **`obs[obs_key]`**, an ordered categorical of string cluster ids `"0"`, `"1"`, …, up to `str(n_clusters - 1)`.
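The explicit category list is what keeps cluster ids in numeric order in plots and tables. The sketch below uses made-up labels (not benchmark output) to contrast plain string sorting with the ordered categorical, and to show a per-cluster size table.

```python
import pandas as pd

n_clusters = 12
# Toy labels standing in for kmeans.labels_.astype(str).
labels = pd.Series([str(i) for i in [0, 2, 10, 2, 11, 1, 10]])

# Plain strings sort lexicographically, so "10" lands before "2".
plain_sorted = sorted(labels.unique())

# An ordered categorical sorts by the category list instead.
ordered = pd.Categorical(
    labels,
    categories=[str(i) for i in range(n_clusters)],
    ordered=True,
)
numeric_sorted = list(pd.Series(ordered).sort_values().unique())

# One entry per category, including clusters with no granules.
sizes = pd.Series(ordered).value_counts(sort=False)
```

Empty clusters show up with a count of zero in `sizes`, which is a quick way to spot a k that is too large for the data.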

### Step D3 — Heatmap-driven biology

1. Pick a **reference gene list** (e.g. synaptic markers overlapping `var_names`).
2. Plot **`scanpy.pl.heatmap`** with **`groupby=obs_key`** and **`standard_scale="var"`** (each gene rescaled to the 0 to 1 range) to see which clusters look pre-synaptic, post-synaptic, dendritic, mixed, etc.
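As a numeric companion to the heatmap, the two steps above can be sketched in pandas: keep only reference genes present in the panel, average expression per cluster, and min-max scale each gene. The gene names, the random toy matrix, and the use of cluster means are assumptions for illustration; the scanpy heatmap itself draws individual granules rather than means.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows are granules, columns play the role of var_names.
rng = np.random.default_rng(0)
expr = pd.DataFrame(rng.random((6, 4)), columns=["Syp", "Dlg4", "Map2", "Actb"])
clusters = pd.Series(["0", "0", "1", "1", "2", "2"], name="granule_subtype_kmeans")

# Step 1: keep only reference genes that are actually in the panel.
reference_genes = ["Syp", "Dlg4", "Map2", "Gphn"]  # "Gphn" is absent here
genes = [g for g in reference_genes if g in expr.columns]

# Step 2: mean expression per cluster, then rescale each gene to [0, 1].
cluster_means = expr[genes].groupby(clusters).mean()
span = cluster_means.max() - cluster_means.min()
scaled = (cluster_means - cluster_means.min()) / span.replace(0, np.nan)
scaled = scaled.fillna(0.0)  # genes that are constant across clusters become 0
```

Reading `scaled` row by row gives a rough signature per cluster, for example a row that is high only for a pre-synaptic marker, which is the kind of evidence Step D4 turns into labels.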

### Step D4 — Manual mapping dictionary

Build a **`mapping`** from biological subtype names to **lists of cluster id strings**:

```python
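def apply_manual_annotation(granule_adata, mapping, cluster_column="granule_subtype_kmeans"):
    # Invert {subtype: [cluster ids]} into {cluster id: subtype}.
    k2sub = {}
    for subtype, clusters in mapping.items():
        for c in clusters:
            k2sub[str(c)] = subtype  # tolerate integer ids in the mapping
    # Clusters absent from the mapping stay missing (NaN).
    granule_adata.obs["granule_subtype_manual"] = (
        granule_adata.obs[cluster_column].astype(str).map(k2sub)
    )
    # Collapse combined labels such as "pre & post" to "mixed"; missing values
    # are passed through unchanged rather than turned into the string "nan".
    granule_adata.obs["granule_subtype_manual_simple"] = granule_adata.obs["granule_subtype_manual"].apply(
        lambda s: "mixed" if isinstance(s, str) and " & " in s else s
    )
    return granule_adata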
```

**Convention:** finer labels live in **`granule_subtype_manual`** (e.g. `"pre & post"`); **`granule_subtype_manual_simple`** collapses any label containing **`" & "`** to **`"mixed"`** for density and summaries.
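Because the mapping is inverted cluster by cluster, a cluster id listed under two subtypes silently takes whichever subtype comes last, and a cluster left out of the mapping ends up unlabeled. A small check before annotating can catch both cases. The helper `check_mapping` and the example subtype names are hypothetical, not part of the benchmark notebook.

```python
def check_mapping(mapping, cluster_ids):
    """Hypothetical helper: return (duplicated, unmapped) cluster ids as strings."""
    seen, duplicated = set(), set()
    for clusters in mapping.values():
        for c in map(str, clusters):
            if c in seen:
                duplicated.add(c)  # listed under more than one subtype
            else:
                seen.add(c)
    unmapped = [c for c in map(str, cluster_ids) if c not in seen]
    return sorted(duplicated), unmapped

mapping = {
    "pre-synaptic": ["0", "3"],
    "post-synaptic": ["1"],
    "pre & post": ["2", "3"],  # "3" is deliberately listed twice
}
duplicated, unmapped = check_mapping(mapping, cluster_ids=range(5))
```

Here `duplicated` flags cluster `"3"` and `unmapped` flags cluster `"4"`, so the mapping can be corrected before any labels are written to `obs`.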

### Step D5 — Paired WT + AD objects (if applicable)

For cross-sample workflows, concatenate the WT and AD **`granule_adata`** objects, restrict them to their common genes, normalize, and run k-means once on the combined matrix so that a given cluster id refers to the same cluster in both samples. Then annotate with a single **`MANUAL_SUBTYPE_MAPPING`** keyed by filename or setting. The benchmark notebook distinguishes samples through **`obs["sample"]`** (values `"WT"` and `"AD"`) or batch labels.
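The gene restriction and sample labeling can be sketched with plain DataFrames standing in for the two objects. The gene names and counts are invented; with real AnnData objects, `anndata.concat` with `join="inner"` is the likely counterpart, though the notebook's exact call is not shown here.

```python
import pandas as pd

# Toy stand-ins for the WT and AD matrices (rows: granules, columns: genes).
wt = pd.DataFrame([[1, 0, 3], [2, 2, 0]], columns=["Syp", "Dlg4", "Map2"])
ad = pd.DataFrame([[0, 4, 1]], columns=["Dlg4", "Map2", "Camk2a"])

# Keep only genes measured in both samples, in WT column order.
common_genes = [g for g in wt.columns if g in ad.columns]

# Stack the samples and record where each granule came from.
combined = pd.concat(
    [wt[common_genes], ad[common_genes]],
    keys=["WT", "AD"],
    names=["sample", "granule"],
)
sample = combined.index.get_level_values("sample")
```

Normalization and k-means would then run once on `combined`, with `sample` kept alongside the cluster labels for the WT versus AD comparison.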

**Next:** [Step E — WT vs AD density](./06_density_wt_ad.md)