In this notebook, we will:

  1. Take a folder of audio files (e.g., songs or stems).

  2. For each track, compute a compact acoustic embedding using:

    • Mel-spectrogram statistics (how energy is distributed across frequency)

    • MFCC statistics (a classic timbre representation)

    • Spectral features (centroid, bandwidth, rolloff, flatness, RMS, zero-crossing rate)

  3. Stack these embeddings into a feature matrix (one vector per track).

  4. Run two clustering methods:

    • k-means (you choose the number of clusters)

    • HDBSCAN (finds clusters and outliers automatically)

An acoustic embedding is a fixed-length numeric vector that summarizes the “sound fingerprint” of a track.
We will use these vectors as the inputs to both clustering methods (k-means and HDBSCAN).

Install dependencies

# --- Install dependencies ---
!pip install librosa hdbscan umap-learn plotly --quiet

Imports and configuration

Step 1 — Imports and configuration

We will:

  • Import librosa for audio feature extraction.

  • Import NumPy / pandas for numerical work and tables.

  • Import scikit-learn for k-means and scaling.

  • Import HDBSCAN for density-based clustering.

  • Set the folder that contains our audio files.

In Colab, you can either:

  • Upload audio files directly, or

  • Mount Google Drive and point AUDIO_FOLDER to a folder in your Drive.
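
For the Google Drive option, a minimal sketch might look like the following. `drive.mount` is Colab's standard helper; the Drive path is the example from this notebook's configuration cell, while the variable name `audio_folder_candidate` and the local fallback folder `audio_tracks` are hypothetical.

```python
from pathlib import Path

try:
    # google.colab is only importable inside a Colab runtime
    from google.colab import drive
    drive.mount("/content/drive")
    in_colab = True
except ImportError:
    in_colab = False

if in_colab:
    # Example Drive location; adjust to wherever your audio actually lives
    audio_folder_candidate = Path("/content/drive/MyDrive/LS100/audio_tracks")
else:
    # Hypothetical local folder, used only when not running in Colab
    audio_folder_candidate = Path("audio_tracks")

print("Running in Colab:", in_colab)
print("Candidate audio folder:", audio_folder_candidate)
```

Once the folder looks right, assign it to AUDIO_FOLDER in the configuration cell.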

# --- Cell 2: Imports and configuration ---
from pathlib import Path
import warnings
import math

import numpy as np
import pandas as pd
import librosa

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import hdbscan

print("librosa version:", librosa.__version__)

# === USER: Set this to your audio folder ===
# Example if using Google Drive (after mounting):
# AUDIO_FOLDER = Path("/content/drive/MyDrive/LS100/audio_tracks")
AUDIO_FOLDER = Path("/Users/souvikmandal/Documents/06_Teaching_Mentoring/LS100_comp_etho/2025/media/audio/music_tracks/Audio_Clips")  # <-- CHANGE THIS

SAMPLE_RATE_TARGET = 22050   # resample target for feature extraction (Hz)
N_MELS = 64                  # number of mel bands
N_MFCC = 20                  # number of MFCC coefficients

print("Audio folder:", AUDIO_FOLDER)
librosa version: 0.11.0
Audio folder: /Users/souvikmandal/Documents/06_Teaching_Mentoring/LS100_comp_etho/2025/media/audio/music_tracks/Audio_Clips

Helper: list files and quick check

We’ll search the folder (and subfolders) for common audio extensions and make sure we have files to work with.

# --- Cell 3: Helper function: list audio files ---

def list_audio_files(folder: Path):
    exts = [".wav", ".mp3", ".flac", ".ogg", ".m4a"]
    files = []
    for ext in exts:
        # "**/" matches the folder itself and every subfolder,
        # so a separate top-level glob is not needed
        files.extend(folder.glob(f"**/*{ext}"))
    files = sorted(set(files))
    return files

if not AUDIO_FOLDER.exists():
    raise FileNotFoundError(f"AUDIO_FOLDER does not exist: {AUDIO_FOLDER}")

audio_files = list_audio_files(AUDIO_FOLDER)
print(f"Found {len(audio_files)} audio files.")
for p in audio_files[:5]:
    print("  ", p.name)
Found 13 audio files.
   Mediu Zhiga.wav
   Ra Bacheeza.wav
   [SP] Alfonso Ortiz Tirado - TE QUIERO DIJISTE.mp3
   [SP] Alvaro Carrillo - Pinotepa Nacional.mp3
   [SP] Lagrimas Negras.mp3
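
One caveat: on Linux (including Colab), glob patterns are case-sensitive, so a file named `track.WAV` would not match `*.wav`. A possible case-insensitive variant is sketched below; the function name `list_audio_files_any_case` and the temporary demo files are hypothetical and are not part of the notebook.

```python
from pathlib import Path
import tempfile

AUDIO_EXTS = {".wav", ".mp3", ".flac", ".ogg", ".m4a"}

def list_audio_files_any_case(folder: Path) -> list:
    """Recursively list audio files, comparing extensions in lower case."""
    return sorted(
        p for p in folder.rglob("*")
        if p.is_file() and p.suffix.lower() in AUDIO_EXTS
    )

# Demo on a throwaway folder with mixed-case extensions
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ["a.wav", "b.MP3", "sub/c.Flac", "notes.txt"]:
        (root / name).touch()
    found = [p.name for p in list_audio_files_any_case(root)]

print(found)
```

The text file is skipped and the three audio files are found regardless of how their extensions are capitalized.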

Acoustic Embedding: Idea

For each audio file, we will compute a fixed-length feature vector that summarizes its sound:

  1. Mel-spectrogram statistics

    • Compute a Mel-spectrogram (frequency vs. time, with frequency on the perceptually motivated mel scale).

    • Convert to dB.

    • Take mean and standard deviation across time for each Mel band.
      → This captures overall spectral shape / timbre.

  2. MFCC statistics

    • Compute MFCCs (a compact representation of timbre).

    • Take mean and standard deviation over time for each MFCC coefficient.

  3. Spectral features (each with mean and std over time)

    • Spectral centroid (brightness)

    • Spectral bandwidth

    • Spectral rolloff (e.g., 85% energy)

    • Spectral flatness (tonal vs noise-like)

    • RMS energy (loudness)

    • Zero-crossing rate (noisiness)

All these are concatenated into a single embedding vector per track:

One row per track, one column per feature → ready for clustering.
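
The pooling idea above can be sketched with NumPy alone. The arrays here are random stand-ins for real per-frame features, and `n_frames` and `pool_over_time` are hypothetical names; only the shapes (64 mel bands, 20 MFCCs, 6 spectral features) follow this notebook's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames = 100  # hypothetical number of analysis frames in one track

# Stand-ins for per-frame features: rows = feature dimensions, columns = frames
mel_db = rng.normal(size=(64, n_frames))    # 64 mel bands
mfcc = rng.normal(size=(20, n_frames))      # 20 MFCC coefficients
spectral = rng.normal(size=(6, n_frames))   # centroid, bandwidth, rolloff, flatness, RMS, ZCR

def pool_over_time(frames: np.ndarray) -> np.ndarray:
    """Collapse the time axis into a mean and a std per feature dimension."""
    return np.concatenate([frames.mean(axis=1), frames.std(axis=1)])

embedding = np.concatenate([
    pool_over_time(mel_db),
    pool_over_time(mfcc),
    pool_over_time(spectral),
])

# 2 * 64 + 2 * 20 + 2 * 6 = 180 values, whatever the track length
print(embedding.shape)  # (180,)
```

Because the time axis is averaged away, tracks of different durations all map to vectors of the same length. The ordering of the spectral statistics in this sketch is simplified and may differ from the extraction function used in the notebook.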

Acoustic embedding function

# --- Cell 4: Acoustic embedding for one track ---

def extract_acoustic_embedding(
    path: Path,
    sr_target: int = SAMPLE_RATE_TARGET,
    n_mels: int = N_MELS,
    n_mfcc: int = N_MFCC,
) -> dict:
    """
    Load an audio file, compute mel + MFCC + spectral statistics,
    and return a dict with:
      - file_name, duration, sample_rate
      - embedding (1D numpy array)
      - feature_names (list of strings, same length as embedding)
    """
    # Load mono audio at target sample rate
    y, sr = librosa.load(path, sr=sr_target, mono=True)
    duration = len(y) / sr

    # Trim leading/trailing silence a bit (optional, helps avoid long tails of silence)
    y_trim, _ = librosa.effects.trim(y, top_db=30)
    if len(y_trim) < int(0.5 * sr):
        # too little audio after trimming → fall back to original
        y_trim = y

    # Mel-spectrogram
    S = librosa.feature.melspectrogram(
        y=y_trim,
        sr=sr,
        n_fft=2048,
        hop_length=512,
        n_mels=n_mels,
        power=2.0,
    )
    S_db = librosa.power_to_db(S, ref=np.max)

    mel_mean = np.mean(S_db, axis=1)
    mel_std  = np.std(S_db, axis=1)

    # MFCC from the mel-spectrogram
    mfcc = librosa.feature.mfcc(S=S_db, sr=sr, n_mfcc=n_mfcc)
    mfcc_mean = np.mean(mfcc, axis=1)
    mfcc_std  = np.std(mfcc, axis=1)

    # Spectral features (computed on trimmed waveform)
    spec_cent = librosa.feature.spectral_centroid(y=y_trim, sr=sr)[0]
    spec_bw   = librosa.feature.spectral_bandwidth(y=y_trim, sr=sr)[0]
    spec_roll = librosa.feature.spectral_rolloff(y=y_trim, sr=sr, roll_percent=0.85)[0]
    spec_flat = librosa.feature.spectral_flatness(y=y_trim)[0]
    rms       = librosa.feature.rms(y=y_trim)[0]
    zcr       = librosa.feature.zero_crossing_rate(y_trim)[0]

    def stats(x):
        return np.array([np.mean(x), np.std(x)], dtype=float)

    spec_stats = np.concatenate([
        stats(spec_cent),
        stats(spec_bw),
        stats(spec_roll),
        stats(spec_flat),
        stats(rms),
        stats(zcr),
    ])

    # Build feature names
    feat_names = []
    # mel
    for i in range(n_mels):
        feat_names.append(f"mel_mean_{i}")
    for i in range(n_mels):
        feat_names.append(f"mel_std_{i}")
    # mfcc
    for i in range(n_mfcc):
        feat_names.append(f"mfcc_mean_{i}")
    for i in range(n_mfcc):
        feat_names.append(f"mfcc_std_{i}")
    # spectral
    spec_labels = [
        "spec_cent", "spec_bw", "spec_roll", "spec_flat",
        "rms", "zcr"
    ]
    for name in spec_labels:
        feat_names.append(f"{name}_mean")
        feat_names.append(f"{name}_std")

    embedding = np.concatenate([mel_mean, mel_std, mfcc_mean, mfcc_std, spec_stats])

    assert len(embedding) == len(feat_names), "Feature length mismatch"

    return {
        "file_name": path.name,
        "file_path": str(path),
        "duration_sec": float(duration),
        "sample_rate": int(sr),
        "embedding": embedding,
        "feature_names": feat_names,
    }

# Quick sanity check on one file (if available)
if audio_files:
    test_meta = extract_acoustic_embedding(audio_files[0])
    print("One embedding size:", len(test_meta["embedding"]))
    print("First 5 feature names:", test_meta["feature_names"][:5])
One embedding size: 180
First 5 feature names: ['mel_mean_0', 'mel_mean_1', 'mel_mean_2', 'mel_mean_3', 'mel_mean_4']

Compute embeddings for all tracks

Now we will loop over all audio files in the folder and:

  • Compute the acoustic embedding for each track.

  • Store embeddings and basic metadata in a pandas DataFrame.

  • This DataFrame will be our feature matrix for clustering.

Compute embeddings

# --- Cell 5: Compute embeddings for all tracks ---

all_rows = []
feature_names = None

for i, path in enumerate(audio_files):
    print(f"[{i+1}/{len(audio_files)}] {path.name}")
    try:
        meta = extract_acoustic_embedding(path)
        if feature_names is None:
            feature_names = meta["feature_names"]
        row = {
            "file_name": meta["file_name"],
            "file_path": meta["file_path"],
            "duration_sec": meta["duration_sec"],
            "sample_rate": meta["sample_rate"],
        }
        # add embedding dimensions
        for fname, val in zip(feature_names, meta["embedding"]):
            row[fname] = float(val)
        all_rows.append(row)
    except Exception as e:
        print(f"  ⚠️ Error on {path.name}: {e}")

emb_df = pd.DataFrame(all_rows)
print("\nEmbedding DataFrame shape:", emb_df.shape)
emb_df.head()
[1/13] Mediu Zhiga.wav
[2/13] Ra Bacheeza.wav
[3/13] [SP] Alfonso Ortiz Tirado - TE QUIERO DIJISTE.mp3
[4/13] [SP] Alvaro Carrillo - Pinotepa Nacional.mp3
[5/13] [SP] Lagrimas Negras.mp3
[6/13] [SP] Los Panchos - Contigo.mp3
[7/13] [SP] Los Panchos - Jamas Jamas Jamas.mp3
[8/13] [SP] Los Panchos - Te Quiero Dijiste.mp3
[9/13] [SP] Soledad y el Mar - Natalia Lafourcade.mp3
[10/13] [ZAP] Binni Gula_za - Ni_bixi Dxi Zina.mp3
[11/13] [ZAP] Mediu Zhiga.mp3
[12/13] [ZAP] Ra Bacheeza.mp3
[13/13] [ZAP] Sabor a Mi - Trio Galenos Y Mario Carrillo.mp3

Embedding DataFrame shape: (13, 184)

Prepare feature matrix for clustering

To cluster tracks, we will:

  1. Extract only the numeric feature columns (embedding dimensions).

  2. Standardize them using z-score scaling (mean 0, std 1) so that:

    • Mel bands, MFCCs, and spectral features are comparable in scale.

  3. Keep file_name and duration_sec for interpreting results later.
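
As a small illustration of why scaling matters, the sketch below standardizes two made-up columns with very different ranges and checks that a manual z-score matches `StandardScaler`. The data and the name `X_demo` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two made-up features on very different scales
# (think spectral centroid in Hz vs. zero-crossing rate)
X_demo = np.column_stack([
    rng.normal(loc=2000.0, scale=500.0, size=13),
    rng.normal(loc=0.05, scale=0.01, size=13),
])

# Manual z-score: subtract the column mean, divide by the column std
z_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# StandardScaler performs the same per-column operation
z_sklearn = StandardScaler().fit_transform(X_demo)

print("Manual and sklearn agree:", np.allclose(z_manual, z_sklearn))
print("Column means after scaling:", z_sklearn.mean(axis=0).round(6))
print("Column stds after scaling:", z_sklearn.std(axis=0).round(6))
```

Without this step, Euclidean distances would be dominated by whichever features happen to have the largest numeric range.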

Build X and scale

# --- Cell 6: Build feature matrix X and standardize ---

if feature_names is None:
    raise RuntimeError("No embeddings were computed. Check earlier cells.")

# X will contain only the embedding dimensions
X = emb_df[feature_names].values

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Feature matrix shape (num_tracks, num_features):", X_scaled.shape)
Feature matrix shape (num_tracks, num_features): (13, 180)

k-means clustering

Step 6 — k-means clustering on acoustic embeddings

We first use k-means:

  • You choose the number of clusters k.

  • The algorithm partitions the tracks into k groups, assigning each track to the nearest cluster centroid in embedding space.

  • Each track gets a cluster ID: 0, 1, 2, …, k−1.

k-means assumes clusters are roughly spherical and of similar size.
It is simple and fast, but sometimes misses irregular or uneven clusters.

# --- Cell 7: k-means clustering ---

# === USER: choose the number of clusters ===
K = 2  # try 3, 4, 5, ... and compare

kmeans = KMeans(n_clusters=K, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(X_scaled)

emb_df["cluster_kmeans"] = kmeans_labels

print("k-means cluster counts:")
print(emb_df["cluster_kmeans"].value_counts().sort_index())

emb_df[["file_name", "cluster_kmeans"]].head(20)
k-means cluster counts:
cluster_kmeans
0    9
1    4
Name: count, dtype: int64
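
One common heuristic for comparing different values of K is the silhouette score (higher is better). This is not part of the original notebook; the sketch below runs on synthetic blobs (`X_demo`, a hypothetical stand-in) rather than on the real embeddings.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (stand-in for X_scaled)
X_demo, _ = make_blobs(
    n_samples=60,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")

best_k = max(scores, key=scores.get)
print("Best k by silhouette:", best_k)
```

To try this on the real data, pass X_scaled in place of `X_demo`. With only 13 tracks the scores will be noisy, so treat them as a rough guide rather than a definitive answer.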

HDBSCAN clustering

Next, we use HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise):

  • It can find clusters of different densities and shapes.

  • It does not require choosing k ahead of time.

  • It can label some tracks as noise/outliers with label -1.

We will:

  • Run HDBSCAN on the same standardized feature matrix.

  • Get a cluster label per track (cluster_hdbscan).

# --- Cell 8: HDBSCAN clustering ---

# === USER: tweak these if needed ===
MIN_CLUSTER_SIZE = 4   # minimum number of tracks to form a cluster
MIN_SAMPLES      = 2  # None → defaults to MIN_CLUSTER_SIZE; or set an int

hdbscan_clusterer = hdbscan.HDBSCAN(
    min_cluster_size=MIN_CLUSTER_SIZE,
    min_samples=MIN_SAMPLES,
    metric="euclidean",
    cluster_selection_method="eom"
)

hdbscan_labels = hdbscan_clusterer.fit_predict(X_scaled)
emb_df["cluster_hdbscan"] = hdbscan_labels

print("HDBSCAN cluster counts (including -1 = noise):")
print(emb_df["cluster_hdbscan"].value_counts().sort_index())

emb_df[["file_name", "cluster_hdbscan"]].head(10)
HDBSCAN cluster counts (including -1 = noise):
cluster_hdbscan
-1    13
Name: count, dtype: int64
/Users/souvikmandal/AudioVenv/lib/python3.11/site-packages/sklearn/utils/deprecation.py:132: FutureWarning: 'force_all_finite' was renamed to 'ensure_all_finite' in 1.6 and will be removed in 1.8.
  warnings.warn(
/Users/souvikmandal/AudioVenv/lib/python3.11/site-packages/sklearn/utils/deprecation.py:132: FutureWarning: 'force_all_finite' was renamed to 'ensure_all_finite' in 1.6 and will be removed in 1.8.
  warnings.warn(
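
In the run above, all 13 tracks received the label -1, which means HDBSCAN found no dense group at these settings. A small helper can make that explicit. The function name `summarize_density_labels` is hypothetical, and the second label list is invented for contrast.

```python
import numpy as np

def summarize_density_labels(labels) -> dict:
    """Count clusters and noise points in a label array where -1 means noise."""
    labels = np.asarray(labels)
    is_noise = labels == -1
    n_clusters = len(set(labels[~is_noise].tolist()))
    return {
        "n_clusters": n_clusters,
        "n_noise": int(is_noise.sum()),
        "noise_fraction": float(is_noise.mean()),
    }

# Labels matching the output above: every track is noise
all_noise = summarize_density_labels([-1] * 13)
print(all_noise)

# Invented labels for contrast: two clusters plus one outlier
mixed = summarize_density_labels([0, 0, 0, 0, 1, 1, 1, 1, -1])
print(mixed)
```

With so few tracks in a 180-dimensional space, this outcome is not surprising. Lowering MIN_CLUSTER_SIZE (for example to 2 or 3) may be worth trying, although it is not guaranteed to produce meaningful clusters.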

Save embeddings and cluster labels

Finally, we will save:

  • A CSV file with:

    • file_name, duration, embedding features, k-means cluster, HDBSCAN cluster.

  • (Optional) You can also save as JSON or pickle if you want to reload easily.

This CSV can then be used in a separate notebook for:

  • visualizations (e.g., 2D scatter plots using PCA/UMAP),

  • checking which songs fall into which cluster,

  • building recommendation or similarity tools.

# --- Cell 9: Save embeddings + cluster labels to CSV ---

OUTPUT_CSV = AUDIO_FOLDER / "acoustic_embeddings_with_clusters.csv"
emb_df.to_csv(OUTPUT_CSV, index=False)

print("Saved embeddings + clusters to:")
print(OUTPUT_CSV)
Saved embeddings + clusters to:
/Users/souvikmandal/Documents/06_Teaching_Mentoring/LS100_comp_etho/2025/media/audio/music_tracks/Audio_Clips/acoustic_embeddings_with_clusters.csv
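
Reloading the CSV in a separate notebook could look roughly like this. The tiny table, the file name `demo_embeddings.csv`, and the prefix-based column selection are assumptions for illustration; the real file has 180 feature columns.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny stand-in for emb_df with one feature column
demo_df = pd.DataFrame({
    "file_name": ["a.wav", "b.mp3"],
    "mel_mean_0": [-42.5, -37.25],
    "cluster_kmeans": [0, 1],
    "cluster_hdbscan": [-1, -1],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "demo_embeddings.csv"
    demo_df.to_csv(csv_path, index=False)   # same call as in the cell above
    reloaded = pd.read_csv(csv_path)

# Recover the feature matrix by selecting columns on their name prefix
feature_cols = [c for c in reloaded.columns if c.startswith("mel_")]
X_reloaded = reloaded[feature_cols].to_numpy()

print(reloaded)
print("Feature matrix shape:", X_reloaded.shape)
```

For the real CSV, the prefix test would need to cover every feature family produced by the extraction function (mel, MFCC, and the spectral statistics).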

Visualizing Clusters in 2D with PCA + Plotly - Setup

Our acoustic embeddings live in a high-dimensional space (180 features per track).
To visualize them, we’ll compress them down to 2 dimensions using PCA (Principal Component Analysis):

  • PCA finds directions (components) that capture the most variance in the data.

  • We’ll project each track’s embedding to (PC1, PC2) and make scatter plots.

We’ll then color points by:

  • k-means cluster ID

  • HDBSCAN cluster ID (with -1 = noise/outliers)

This will let us see how clusters are arranged in the acoustic space.

# --- PCA projection to 2D and Plotly setup ---

from sklearn.decomposition import PCA
import plotly.express as px

# Compute 2D PCA embedding from X_scaled
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)

emb_df["pca_x"] = X_pca[:, 0]
emb_df["pca_y"] = X_pca[:, 1]

print("Explained variance by PC1 and PC2:",
      pca.explained_variance_ratio_[0],
      pca.explained_variance_ratio_[1])
Explained variance by PC1 and PC2: 0.2872734305504054 0.23867184477917308
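
Together, PC1 and PC2 explain roughly 53% of the variance here, so the 2D plot hides a substantial part of the structure. The sketch below shows one way to check how many components would be needed for a given share of variance, and what the projection does numerically. It uses random stand-in data of the same shape (`X_demo`, hypothetical), so the printed numbers will not match the real embeddings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Random stand-in with the same shape as X_scaled (13 tracks, 180 features)
X_demo = StandardScaler().fit_transform(rng.normal(size=(13, 180)))

# Keep all components so the cumulative curve reaches 100%
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 90%
n_for_90 = int(np.searchsorted(cumulative, 0.90) + 1)
print("Components needed for 90% variance:", n_for_90)

# The 2D projection is the centered data multiplied by the first two components
manual_2d = (X_demo - pca_full.mean_) @ pca_full.components_[:2].T
print("Matches PCA.transform:", np.allclose(manual_2d, pca_full.transform(X_demo)[:, :2]))
```

Running the same check on X_scaled gives a sense of how much to trust the scatter plots that follow.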

PCA scatter plot colored by k-means clusters

Each point is one track:

  • Position = projection of its acoustic embedding to the first two principal components (PC1, PC2).

  • Color = k-means cluster ID.

  • Hover text = file name.

This gives a geometric picture of how k-means partitioned the acoustic space.

To change the dot colors, you can swap in any of Plotly's built-in qualitative palettes, for example:

px.colors.qualitative.Vivid
px.colors.qualitative.Dark24
px.colors.qualitative.Set1   # very bold, good for small number of clusters
px.colors.qualitative.Set3   # pastel but distinct
px.colors.qualitative.Alphabet  # huge palette

To use one, replace the palette named in this argument of the plotting call:

color_discrete_sequence=px.colors.qualitative.Bold

# --- k-means PCA scatter plot ---

# Ensure cluster labels are treated as categorical, not numeric
emb_df["cluster_kmeans_str"] = emb_df["cluster_kmeans"].astype(str)

fig_k = px.scatter(
    emb_df,
    x="pca_x",
    y="pca_y",
    color="cluster_kmeans_str",             # use string labels → categorical colors
    color_discrete_sequence=px.colors.qualitative.Bold,  # <-- HIGH CONTRAST
    hover_name="file_name",
    hover_data={
        "file_path": False,
        "duration_sec": True,
        "cluster_kmeans_str": True,
    },
    title="PCA Projection of Acoustic Embeddings — Colored by k-means Cluster",
    labels={"pca_x": "PC1", "pca_y": "PC2", "cluster_kmeans_str": "k-means cluster"},
)

fig_k.update_layout(
    legend_title_text="k-means cluster",
    width=800,
    height=500,
    plot_bgcolor="#F0F2F5",     # subtle grey for better contrast
)

fig_k.show()


Visualize HDBSCAN clusters

PCA scatter plot colored by HDBSCAN clusters

Now we color the same PCA projection by HDBSCAN cluster labels:

  • Each color = HDBSCAN cluster ID.

  • Label -1 means “noise” or “unclustered / outlier” points.

This can look quite different from k-means:

  • k-means forces every track into some cluster.

  • HDBSCAN is allowed to say, “these tracks don’t belong to any dense group.”

# --- Improved HDBSCAN PCA scatter plot with discrete colors ---

# Convert cluster labels to string categories
emb_df["cluster_hdbscan_str"] = emb_df["cluster_hdbscan"].astype(str)
emb_df.loc[emb_df["cluster_hdbscan"] == -1, "cluster_hdbscan_str"] = "noise (-1)"

# Prepare a high-contrast color palette
palette = px.colors.qualitative.Dark24.copy()

# Ensure noise has a consistent neutral color
noise_color = "#7f7f7f"  # medium grey

unique_clusters = emb_df["cluster_hdbscan_str"].unique().tolist()

# Assign colors: noise gets grey, others use Dark24 cycling
color_map = {}
palette_i = 0
for c in unique_clusters:
    if c == "noise (-1)":
        color_map[c] = noise_color
    else:
        color_map[c] = palette[palette_i % len(palette)]
        palette_i += 1

# Build the figure
fig_h = px.scatter(
    emb_df,
    x="pca_x",
    y="pca_y",
    color="cluster_hdbscan_str",
    color_discrete_map=color_map,             # <-- Force our custom mapping
    hover_name="file_name",
    hover_data={
        "file_path": False,
        "duration_sec": True,
        "cluster_hdbscan": True,
    },
    title="PCA Projection of Acoustic Embeddings — Colored by HDBSCAN Cluster",
    labels={"pca_x": "PC1", "pca_y": "PC2", "cluster_hdbscan_str": "HDBSCAN cluster"},
)

fig_h.update_layout(
    legend_title_text="HDBSCAN cluster",
    width=800,
    height=500,
    plot_bgcolor="#F0F2F5",   # subtle grey to help contrast
)

fig_h.show()

UMAP: Nonlinear 2D Embedding of Acoustic Space

PCA gave us a linear 2D summary of the embeddings.
Now we will use UMAP (Uniform Manifold Approximation and Projection):

  • UMAP is a nonlinear dimensionality reduction method.

  • It tries to preserve local neighborhoods: points that are close in high dimension tend to stay close in the 2D map.

  • This often reveals curved or irregular cluster structure that PCA misses.

We will:

  1. Compute a 2D UMAP projection of our standardized feature matrix X_scaled.

  2. Make two scatter plots (Plotly):

    • one colored by k-means cluster

    • one colored by HDBSCAN cluster

# --- Cell: Compute UMAP 2D projection ---

import umap

umap_model = umap.UMAP(
    n_neighbors=10,      # how many neighbors define "local"
    min_dist=0.1,        # how compact clusters are
    metric="euclidean",
    random_state=42,
)

X_umap = umap_model.fit_transform(X_scaled)

emb_df["umap_x"] = X_umap[:, 0]
emb_df["umap_y"] = X_umap[:, 1]

print("UMAP embedding shape:", X_umap.shape)
/Users/souvikmandal/AudioVenv/lib/python3.11/site-packages/umap/umap_.py:1952: UserWarning:

n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.

UMAP embedding shape: (13, 2)

UMAP + k-means visualization

# --- UMAP scatter plot for k-means clusters ---

# Ensure categorical labels
emb_df["cluster_kmeans_str"] = emb_df["cluster_kmeans"].astype(str)

fig_umap_k = px.scatter(
    emb_df,
    x="umap_x",
    y="umap_y",
    color="cluster_kmeans_str",
    color_discrete_sequence=px.colors.qualitative.Dark24,
    hover_name="file_name",
    hover_data={
        "file_path": False,
        "duration_sec": True,
        "cluster_kmeans_str": True,
    },
    title="UMAP of Acoustic Embeddings — Colored by k-means Cluster",
    labels={"umap_x": "UMAP-1", "umap_y": "UMAP-2", "cluster_kmeans_str": "k-means cluster"},
)

fig_umap_k.update_layout(
    legend_title_text="k-means cluster",
    width=800,
    height=500,
    plot_bgcolor="#F0F2F5",
)

fig_umap_k.show()

UMAP + HDBSCAN visualization

# --- UMAP scatter plot for HDBSCAN clusters ---

# Make sure we have the string labels with "noise (-1)"
emb_df["cluster_hdbscan_str"] = emb_df["cluster_hdbscan"].astype(str)
emb_df.loc[emb_df["cluster_hdbscan"] == -1, "cluster_hdbscan_str"] = "noise (-1)"

palette = px.colors.qualitative.Dark24.copy()
noise_color = "#7f7f7f"

unique_clusters = emb_df["cluster_hdbscan_str"].unique().tolist()
color_map = {}
palette_i = 0
for c in unique_clusters:
    if c == "noise (-1)":
        color_map[c] = noise_color
    else:
        color_map[c] = palette[palette_i % len(palette)]
        palette_i += 1

fig_umap_h = px.scatter(
    emb_df,
    x="umap_x",
    y="umap_y",
    color="cluster_hdbscan_str",
    color_discrete_map=color_map,
    hover_name="file_name",
    hover_data={
        "file_path": False,
        "duration_sec": True,
        "cluster_hdbscan": True,
    },
    title="UMAP of Acoustic Embeddings — Colored by HDBSCAN Cluster",
    labels={"umap_x": "UMAP-1", "umap_y": "UMAP-2", "cluster_hdbscan_str": "HDBSCAN cluster"},
)

fig_umap_h.update_layout(
    legend_title_text="HDBSCAN cluster",
    width=800,
    height=500,
    plot_bgcolor="#F0F2F5",
)

fig_umap_h.show()

Cluster Summaries

To interpret the clusters, we will compute simple summary statistics:

  • Number of tracks in each cluster

  • Average, minimum, and maximum track duration

We will do this for:

  • k-means clusters

  • HDBSCAN clusters (ignoring the noise cluster -1 for summaries)
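
The summary logic boils down to a pandas group-by with named aggregations. A minimal sketch on a made-up table (all file names, durations, and the `cluster` column are hypothetical):

```python
import pandas as pd

# Made-up tracks with a cluster label each
demo_df = pd.DataFrame({
    "file_name": ["a.wav", "b.wav", "c.mp3", "d.mp3", "e.mp3"],
    "duration_sec": [30.0, 45.0, 60.0, 20.0, 40.0],
    "cluster": [0, 0, 1, 1, 1],
})

# Named aggregation: output_column=(input_column, aggregation)
summary = (
    demo_df.groupby("cluster")
    .agg(
        num_tracks=("file_name", "count"),
        duration_mean=("duration_sec", "mean"),
        duration_min=("duration_sec", "min"),
        duration_max=("duration_sec", "max"),
    )
    .reset_index()
)

print(summary)
```

The cells below build the same kind of aggregation dynamically, so that any extra numeric metadata columns are summarized as well.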

Summaries for k-means

# --- Cluster summaries: k-means (all numeric metadata) ---

# Columns to exclude from "metadata" summarization
exclude_cols = set(feature_names) | {
    "cluster_kmeans",
    "cluster_kmeans_str",
    "cluster_hdbscan",
    "cluster_hdbscan_str",
    "pca_x", "pca_y",
    "umap_x", "umap_y",
}

# Metadata candidates = all other columns
metadata_cols = [c for c in emb_df.columns if c not in exclude_cols]

# Among those, pick only numeric columns for aggregation
numeric_meta_cols = emb_df[metadata_cols].select_dtypes(include="number").columns.tolist()

print("Numeric metadata columns being summarized (k-means):")
print(numeric_meta_cols)

# Build aggregation dict: for each numeric metadata column, compute mean/min/max
agg_dict = {"file_name": ("file_name", "count")}  # count = number of tracks
for col in numeric_meta_cols:
    agg_dict[f"{col}_mean"] = (col, "mean")
    agg_dict[f"{col}_min"]  = (col, "min")
    agg_dict[f"{col}_max"]  = (col, "max")

summary_k = (
    emb_df
    .groupby("cluster_kmeans")
    .agg(**agg_dict)
    .rename(columns={"file_name": "num_tracks"})
    .reset_index()
    .sort_values("cluster_kmeans")
)

summary_k

Numeric metadata columns being summarized (k-means):
['duration_sec', 'sample_rate']

Summaries for HDBSCAN (ignoring noise)

# --- Cluster summaries: HDBSCAN (all numeric metadata, non-noise only) ---

mask_non_noise = emb_df["cluster_hdbscan"] != -1
df_h = emb_df[mask_non_noise].copy()

if df_h.empty:
    print("No non-noise HDBSCAN clusters to summarize.")
else:
    # Reuse the same logic as before, but on df_h
    exclude_cols_h = set(feature_names) | {
        "cluster_kmeans",
        "cluster_kmeans_str",
        "cluster_hdbscan",
        "cluster_hdbscan_str",
        "pca_x", "pca_y",
        "umap_x", "umap_y",
    }

    metadata_cols_h = [c for c in df_h.columns if c not in exclude_cols_h]
    numeric_meta_cols_h = df_h[metadata_cols_h].select_dtypes(include="number").columns.tolist()

    print("Numeric metadata columns being summarized (HDBSCAN, non-noise):")
    print(numeric_meta_cols_h)

    agg_dict_h = {"file_name": ("file_name", "count")}
    for col in numeric_meta_cols_h:
        agg_dict_h[f"{col}_mean"] = (col, "mean")
        agg_dict_h[f"{col}_min"]  = (col, "min")
        agg_dict_h[f"{col}_max"]  = (col, "max")

    summary_h = (
        df_h
        .groupby("cluster_hdbscan")
        .agg(**agg_dict_h)
        .rename(columns={"file_name": "num_tracks"})
        .reset_index()
        .sort_values("cluster_hdbscan")
    )

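    # A bare expression inside an if/else block is not rendered by the notebook,
    # so show the table explicitly
    display(summary_h)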

No non-noise HDBSCAN clusters to summarize.