In this notebook, we will:
Take a folder of audio files (e.g., songs or stems).
For each track, compute a compact acoustic embedding using:
Mel-spectrogram statistics (how energy is distributed across frequency)
MFCC statistics (a classic timbre representation)
Spectral features (centroid, bandwidth, rolloff, flatness, RMS, zero-crossing rate)
Stack these embeddings into a feature matrix (one vector per track).
Run two clustering methods:
k-means (you choose the number of clusters)
HDBSCAN (finds clusters and outliers automatically)
The acoustic embedding is a fixed-length numeric vector that summarizes the “sound fingerprint” of a track.
We will use these vectors as inputs for clustering (k-means & HDBSCAN).
Install dependencies¶
# --- Install dependencies ---
!pip install librosa hdbscan umap-learn pandas plotly --quiet
Imports and configuration¶
Step 1 — Imports and configuration¶
We will:
Import librosa for audio feature extraction.
Import NumPy / pandas for numerical work and tables.
Import scikit-learn for k-means and scaling.
Import HDBSCAN for density-based clustering.
Set the folder that contains our audio files.
In Colab, you can either:
Upload audio files directly, or
Mount Google Drive and point AUDIO_FOLDER to a folder in your Drive.
# --- Cell 2: Imports and configuration ---
from pathlib import Path
import warnings
import math
import numpy as np
import pandas as pd
import librosa
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import hdbscan
print("librosa version:", librosa.__version__)
# === USER: Set this to your audio folder ===
# Example if using Google Drive (after mounting):
# AUDIO_FOLDER = Path("/content/drive/MyDrive/LS100/audio_tracks")
AUDIO_FOLDER = Path("/Users/souvikmandal/Documents/06_Teaching_Mentoring/LS100_comp_etho/2025/media/audio/music_tracks/Audio_Clips") # <-- CHANGE THIS
SAMPLE_RATE_TARGET = 22050 # resample target for feature extraction (Hz)
N_MELS = 64 # number of mel bands
N_MFCC = 20 # number of MFCC coefficients
print("Audio folder:", AUDIO_FOLDER)
librosa version: 0.11.0
Audio folder: /Users/souvikmandal/Documents/06_Teaching_Mentoring/LS100_comp_etho/2025/media/audio/music_tracks/Audio_Clips
Helper: list files and quick check¶
We’ll search the folder (and subfolders) for common audio extensions and make sure we have files to work with.
# --- Cell 3: Helper function: list audio files ---
def list_audio_files(folder: Path):
    exts = [".wav", ".mp3", ".flac", ".ogg", ".m4a"]
    files = []
    for ext in exts:
        files.extend(folder.glob(f"*{ext}"))
        files.extend(folder.glob(f"**/*{ext}")) # include subfolders
    files = sorted(set(files))
    return files

if not AUDIO_FOLDER.exists():
    raise FileNotFoundError(f"AUDIO_FOLDER does not exist: {AUDIO_FOLDER}")
audio_files = list_audio_files(AUDIO_FOLDER)
print(f"Found {len(audio_files)} audio files.")
for p in audio_files[:5]:
    print(" ", p.name)
Found 13 audio files.
Mediu Zhiga.wav
Ra Bacheeza.wav
[SP] Alfonso Ortiz Tirado - TE QUIERO DIJISTE.mp3
[SP] Alvaro Carrillo - Pinotepa Nacional.mp3
[SP] Lagrimas Negras.mp3
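One caveat with the file listing above: `Path.glob` patterns are case-sensitive by default on Linux (including Colab), so a file named `Track.WAV` would be skipped. Below is a minimal sketch of a case-insensitive variant. The names `list_audio_files_ci` and `AUDIO_EXTS` are hypothetical, introduced here for illustration only.

```python
from pathlib import Path

# Accepted extensions, compared in lowercase (same set as the helper cell)
AUDIO_EXTS = {".wav", ".mp3", ".flac", ".ogg", ".m4a"}

def list_audio_files_ci(folder: Path) -> list:
    """Recursively list audio files, matching extensions case-insensitively."""
    return sorted(
        p for p in folder.rglob("*")          # walk the folder and all subfolders
        if p.is_file() and p.suffix.lower() in AUDIO_EXTS
    )
```

Because `rglob("*")` visits each file once, no de-duplication step is needed.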
Acoustic Embedding: Idea¶
For each audio file, we will compute a fixed-length feature vector that summarizes its sound:
Mel-spectrogram statistics
Compute a Mel-spectrogram (frequency vs time on a “human” scale).
Convert to dB.
Take mean and standard deviation across time for each Mel band.
→ This captures overall spectral shape / timbre.
MFCC statistics
Compute MFCCs (a compact representation of timbre).
Take mean and standard deviation over time for each MFCC coefficient.
Spectral features (each with mean and std over time)
Spectral centroid (brightness)
Spectral bandwidth
Spectral rolloff (e.g., 85% energy)
Spectral flatness (tonal vs noise-like)
RMS energy (loudness)
Zero-crossing rate (noisiness)
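To make two of these spectral features concrete, here is a single-frame NumPy sketch of the spectral centroid and the zero-crossing rate on a synthetic 1 kHz tone. It illustrates the definitions only. librosa computes these per frame with its own windowing and padding, so its values will not match exactly. The test tone and all variable names are assumptions for the demo.

```python
import numpy as np

sr = 22050                                  # sample rate in Hz (same value as SAMPLE_RATE_TARGET)
t = np.arange(2048) / sr                    # time stamps for one 2048-sample frame
frame = np.sin(2 * np.pi * 1000.0 * t)      # synthetic 1 kHz sine (illustrative input)

# Spectral centroid: magnitude-weighted mean frequency of the frame
mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
centroid_hz = np.sum(freqs * mag) / np.sum(mag)

# Zero-crossing rate: fraction of adjacent sample pairs whose sign differs
signs = np.signbit(frame)
zcr = np.mean(signs[1:] != signs[:-1])

print(f"centroid ≈ {centroid_hz:.1f} Hz, zcr ≈ {zcr:.4f}")
```

For a pure tone the centroid sits close to the tone's frequency, and the zero-crossing rate is roughly twice the frequency divided by the sample rate. Brighter or noisier sounds push both values up.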
All these are concatenated into a single embedding vector per track:
One row per track, one column per feature → ready for clustering.
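The key property is that the embedding length does not depend on track duration, because the time axis is collapsed into a mean and a standard deviation. The sketch below uses random stand-in arrays (not real audio) to show how the sizes add up with 64 mel bands, 20 MFCCs, and 6 spectral features. The helper name `time_stats` is hypothetical, and the ordering of the spectral block here is simplified compared with the notebook's function.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames = 100                                 # arbitrary frame count; any value gives the same length

# Random stand-ins for per-frame features (only the shapes matter here)
mel_db = rng.normal(size=(64, n_frames))       # 64 mel bands, as N_MELS
mfcc = rng.normal(size=(20, n_frames))         # 20 coefficients, as N_MFCC
spectral = rng.normal(size=(6, n_frames))      # centroid, bandwidth, rolloff, flatness, RMS, ZCR

def time_stats(frames: np.ndarray) -> np.ndarray:
    """Collapse the time axis: per-row mean followed by per-row std."""
    return np.concatenate([frames.mean(axis=1), frames.std(axis=1)])

embedding = np.concatenate([time_stats(mel_db), time_stats(mfcc), time_stats(spectral)])
print(embedding.shape)   # (180,) = 2*64 + 2*20 + 2*6
```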
Acoustic embedding function¶
# --- Cell 4: Acoustic embedding for one track ---
def extract_acoustic_embedding(
    path: Path,
    sr_target: int = SAMPLE_RATE_TARGET,
    n_mels: int = N_MELS,
    n_mfcc: int = N_MFCC,
) -> dict:
    """
    Load an audio file, compute mel + MFCC + spectral statistics,
    and return a dict with:
    - file_name, file_path, duration_sec, sample_rate
    - embedding (1D numpy array)
    - feature_names (list of strings, same length as embedding)
    """
    # Load mono audio at target sample rate
    y, sr = librosa.load(path, sr=sr_target, mono=True)
    duration = len(y) / sr

    # Trim leading/trailing silence a bit (optional, helps avoid long tails of silence)
    y_trim, _ = librosa.effects.trim(y, top_db=30)
    if len(y_trim) < int(0.5 * sr):
        # too little audio after trimming → fall back to original
        y_trim = y

    # Mel-spectrogram
    S = librosa.feature.melspectrogram(
        y=y_trim,
        sr=sr,
        n_fft=2048,
        hop_length=512,
        n_mels=n_mels,
        power=2.0,
    )
    S_db = librosa.power_to_db(S, ref=np.max)
    mel_mean = np.mean(S_db, axis=1)
    mel_std = np.std(S_db, axis=1)

    # MFCC from the mel-spectrogram
    mfcc = librosa.feature.mfcc(S=S_db, sr=sr, n_mfcc=n_mfcc)
    mfcc_mean = np.mean(mfcc, axis=1)
    mfcc_std = np.std(mfcc, axis=1)

    # Spectral features (computed on trimmed waveform)
    spec_cent = librosa.feature.spectral_centroid(y=y_trim, sr=sr)[0]
    spec_bw = librosa.feature.spectral_bandwidth(y=y_trim, sr=sr)[0]
    spec_roll = librosa.feature.spectral_rolloff(y=y_trim, sr=sr, roll_percent=0.85)[0]
    spec_flat = librosa.feature.spectral_flatness(y=y_trim)[0]
    rms = librosa.feature.rms(y=y_trim)[0]
    zcr = librosa.feature.zero_crossing_rate(y_trim)[0]

    def stats(x):
        return np.array([np.mean(x), np.std(x)], dtype=float)

    spec_stats = np.concatenate([
        stats(spec_cent),
        stats(spec_bw),
        stats(spec_roll),
        stats(spec_flat),
        stats(rms),
        stats(zcr),
    ])

    # Build feature names
    feat_names = []
    # mel
    for i in range(n_mels):
        feat_names.append(f"mel_mean_{i}")
    for i in range(n_mels):
        feat_names.append(f"mel_std_{i}")
    # mfcc
    for i in range(n_mfcc):
        feat_names.append(f"mfcc_mean_{i}")
    for i in range(n_mfcc):
        feat_names.append(f"mfcc_std_{i}")
    # spectral
    spec_labels = [
        "spec_cent", "spec_bw", "spec_roll", "spec_flat",
        "rms", "zcr"
    ]
    for name in spec_labels:
        feat_names.append(f"{name}_mean")
        feat_names.append(f"{name}_std")

    embedding = np.concatenate([mel_mean, mel_std, mfcc_mean, mfcc_std, spec_stats])
    assert len(embedding) == len(feat_names), "Feature length mismatch"
    return {
        "file_name": path.name,
        "file_path": str(path),
        "duration_sec": float(duration),
        "sample_rate": int(sr),
        "embedding": embedding,
        "feature_names": feat_names,
    }

# Quick sanity check on one file (if available)
if audio_files:
    test_meta = extract_acoustic_embedding(audio_files[0])
    print("One embedding size:", len(test_meta["embedding"]))
    print("First 5 feature names:", test_meta["feature_names"][:5])
One embedding size: 180
First 5 feature names: ['mel_mean_0', 'mel_mean_1', 'mel_mean_2', 'mel_mean_3', 'mel_mean_4']
Compute embeddings for all tracks¶
Now we will loop over all audio files in the folder and:
Compute the acoustic embedding for each track.
Store embeddings and basic metadata in a pandas DataFrame.
This DataFrame will be our feature matrix for clustering.
Compute embeddings¶
# --- Cell 5: Compute embeddings for all tracks ---
all_rows = []
feature_names = None
for i, path in enumerate(audio_files):
    print(f"[{i+1}/{len(audio_files)}] {path.name}")
    try:
        meta = extract_acoustic_embedding(path)
        if feature_names is None:
            feature_names = meta["feature_names"]
        row = {
            "file_name": meta["file_name"],
            "file_path": meta["file_path"],
            "duration_sec": meta["duration_sec"],
            "sample_rate": meta["sample_rate"],
        }
        # add embedding dimensions
        for fname, val in zip(feature_names, meta["embedding"]):
            row[fname] = float(val)
        all_rows.append(row)
    except Exception as e:
        print(f" ⚠️ Error on {path.name}: {e}")
emb_df = pd.DataFrame(all_rows)
print("\nEmbedding DataFrame shape:", emb_df.shape)
emb_df.head()
[1/13] Mediu Zhiga.wav
[2/13] Ra Bacheeza.wav
[3/13] [SP] Alfonso Ortiz Tirado - TE QUIERO DIJISTE.mp3
[4/13] [SP] Alvaro Carrillo - Pinotepa Nacional.mp3
[5/13] [SP] Lagrimas Negras.mp3
[6/13] [SP] Los Panchos - Contigo.mp3
[7/13] [SP] Los Panchos - Jamas Jamas Jamas.mp3
[8/13] [SP] Los Panchos - Te Quiero Dijiste.mp3
[9/13] [SP] Soledad y el Mar - Natalia Lafourcade.mp3
[10/13] [ZAP] Binni Gula_za - Ni_bixi Dxi Zina.mp3
[11/13] [ZAP] Mediu Zhiga.mp3
[12/13] [ZAP] Ra Bacheeza.mp3
[13/13] [ZAP] Sabor a Mi - Trio Galenos Y Mario Carrillo.mp3
Embedding DataFrame shape: (13, 184)
Prepare feature matrix for clustering¶
To cluster tracks, we will:
Extract only the numeric feature columns (embedding dimensions).
Standardize them using z-score scaling (mean 0, std 1) so that:
Mel bands, MFCCs, and spectral features are comparable in scale.
Keep file_name and duration_sec for interpreting results later.
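To see why scaling matters, note that a spectral centroid is measured in thousands of Hz while spectral flatness lies between 0 and 1. Without scaling, Euclidean distances would be dominated by the centroid. The sketch below applies z-scoring by hand on a tiny made-up matrix. The function name `zscore_columns` and the demo values are hypothetical. It uses the population standard deviation and leaves constant columns at zero, which is our understanding of how `StandardScaler` behaves by default.

```python
import numpy as np

def zscore_columns(X: np.ndarray) -> np.ndarray:
    """Scale each column to mean 0 and std 1 (constant columns become all zeros)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)                        # population std (ddof=0)
    std_safe = np.where(std == 0, 1.0, std)    # avoid dividing by zero for constant columns
    return (X - mean) / std_safe

# Made-up values: a centroid-like column (Hz), a flatness-like column, and a constant column
X_demo = np.array([
    [1000.0, 0.01, 5.0],
    [2000.0, 0.03, 5.0],
    [3000.0, 0.02, 5.0],
])
X_demo_scaled = zscore_columns(X_demo)
print(X_demo_scaled.round(3))
```

After scaling, the first two columns span the same range even though their raw units differ by five orders of magnitude.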
Build X and scale¶
# --- Cell 6: Build feature matrix X and standardize ---
if feature_names is None:
    raise RuntimeError("No embeddings were computed. Check earlier cells.")
# X will contain only the embedding dimensions
X = emb_df[feature_names].values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("Feature matrix shape (num_tracks, num_features):", X_scaled.shape)
Feature matrix shape (num_tracks, num_features): (13, 180)
k-means clustering¶
Step 6 — k-means clustering on acoustic embeddings¶
We first use k-means:
You choose the number of clusters k.
The algorithm partitions the tracks into k groups based on their acoustic embeddings.
Each track gets a cluster ID: 0, 1, 2, …, k−1.
k-means assumes clusters are roughly spherical and of similar size.
It is simple and fast, but sometimes misses irregular or uneven clusters.
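For intuition, the core loop of k-means (Lloyd's algorithm) alternates between assigning each point to its nearest center and moving each center to the mean of its points. The following is a bare-bones NumPy sketch under simple assumptions (random initialization from the data, a fixed iteration cap). It is not scikit-learn's implementation, which adds k-means++ initialization and multiple restarts. The function name `kmeans_lloyd` and the demo points are hypothetical.

```python
import numpy as np

def kmeans_lloyd(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Initialize centers with k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)

    def assign(c):
        # Distance from every point to every center, then nearest center per point
        d = np.linalg.norm(X[:, None, :] - c[None, :, :], axis=2)
        return d.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centers)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:               # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):  # stop once the centers no longer move
            break
        centers = new_centers
    return assign(centers), centers

# Two well-separated toy groups of three points each
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels_demo, centers_demo = kmeans_lloyd(X_demo, k=2)
print(labels_demo)
```

Which group receives ID 0 depends on the random initialization, so cluster IDs are arbitrary names and carry no ordering.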
# --- Cell 7: k-means clustering ---
# === USER: choose the number of clusters ===
K = 2 # try 3, 4, 5, ... and compare
kmeans = KMeans(n_clusters=K, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(X_scaled)
emb_df["cluster_kmeans"] = kmeans_labels
print("k-means cluster counts:")
print(emb_df["cluster_kmeans"].value_counts().sort_index())
emb_df[["file_name", "cluster_kmeans"]].head(20)
k-means cluster counts:
cluster_kmeans
0 9
1 4
Name: count, dtype: int64
HDBSCAN clustering¶
Next, we use HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise):
It can find clusters of different densities and shapes.
It does not require choosing k ahead of time.
It can label some tracks as noise/outliers with label -1.
We will:
Run HDBSCAN on the same standardized feature matrix.
Get a cluster label per track (cluster_hdbscan).
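HDBSCAN's notion of density rests on two quantities: the core distance of a point (how far it must reach to find `min_samples` neighbors) and the mutual reachability distance between two points (the largest of their two core distances and their actual distance). The sketch below computes both with NumPy on a toy dataset. It covers only this first step, not the full algorithm, and it excludes the point itself when counting neighbors, a convention that can differ between implementations. The function name `mutual_reachability` is hypothetical.

```python
import numpy as np

def mutual_reachability(X, min_samples=2):
    """Return (core_distances, mutual_reachability_matrix) for the rows of X."""
    # Pairwise Euclidean distances
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Core distance: distance to the min_samples-th nearest *other* point
    # (column 0 of each sorted row is the point itself, at distance 0)
    core = np.sort(d, axis=1)[:, min_samples]
    # Mutual reachability: max of the two core distances and the raw distance
    mreach = np.maximum(d, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(mreach, 0.0)
    return core, mreach

# Three nearby points and one isolated point
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0]])
core_demo, mreach_demo = mutual_reachability(X_demo, min_samples=2)
print(core_demo.round(2))
```

The isolated point has a much larger core distance than the others, which is what makes it a candidate for the noise label. With few tracks spread over many dimensions, every point can look isolated in this sense.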
# --- Cell 8: HDBSCAN clustering ---
# === USER: tweak these if needed ===
MIN_CLUSTER_SIZE = 4 # minimum number of tracks to form a cluster
MIN_SAMPLES = 2 # None → defaults to MIN_CLUSTER_SIZE; or set an int
hdbscan_clusterer = hdbscan.HDBSCAN(
min_cluster_size=MIN_CLUSTER_SIZE,
min_samples=MIN_SAMPLES,
metric="euclidean",
cluster_selection_method="eom"
)
hdbscan_labels = hdbscan_clusterer.fit_predict(X_scaled)
emb_df["cluster_hdbscan"] = hdbscan_labels
print("HDBSCAN cluster counts (including -1 = noise):")
print(emb_df["cluster_hdbscan"].value_counts().sort_index())
emb_df[["file_name", "cluster_hdbscan"]].head(10)
HDBSCAN cluster counts (including -1 = noise):
cluster_hdbscan
-1 13
Name: count, dtype: int64
/Users/souvikmandal/AudioVenv/lib/python3.11/site-packages/sklearn/utils/deprecation.py:132: FutureWarning: 'force_all_finite' was renamed to 'ensure_all_finite' in 1.6 and will be removed in 1.8.
warnings.warn(
Save embeddings and cluster labels¶
Finally, we will save:
A CSV file with:
file_name, duration, embedding features, k-means cluster, HDBSCAN cluster.
(Optional) You can also save as JSON or pickle if you want to reload easily.
This CSV can then be used in a separate notebook for:
visualizations (e.g., 2D scatter plots using PCA/UMAP),
checking which songs fall into which cluster,
building recommendation or similarity tools.
# --- Cell 9: Save embeddings + cluster labels to CSV ---
OUTPUT_CSV = AUDIO_FOLDER / "acoustic_embeddings_with_clusters.csv"
emb_df.to_csv(OUTPUT_CSV, index=False)
print("Saved embeddings + clusters to:")
print(OUTPUT_CSV)
Saved embeddings + clusters to:
/Users/souvikmandal/Documents/06_Teaching_Mentoring/LS100_comp_etho/2025/media/audio/music_tracks/Audio_Clips/acoustic_embeddings_with_clusters.csv
Visualizing Clusters in 2D with PCA + Plotly - Setup¶
Our acoustic embeddings live in a high-dimensional space (180 features per track in this run).
To visualize them, we’ll compress them down to 2 dimensions using PCA (Principal Component Analysis):
PCA finds directions (components) that capture the most variance in the data.
We’ll project each track’s embedding to (PC1, PC2) and make scatter plots.
We’ll then color points by:
k-means cluster ID
HDBSCAN cluster ID (with -1 = noise/outliers)
This will let us see how clusters are arranged in the acoustic space.
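As a rough illustration of what PCA does under the hood, the sketch below centers the data, takes a singular value decomposition, and projects onto the top two directions. It runs on random stand-in data, and the function name `pca_2d` is hypothetical. The signs of the components may be flipped relative to scikit-learn's output, which does not change the geometry of the plot.

```python
import numpy as np

def pca_2d(X):
    """Project rows of X onto the first two principal components."""
    Xc = X - X.mean(axis=0)                           # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    coords = Xc @ Vt[:2].T                            # coordinates along PC1 and PC2
    var = S ** 2                                      # proportional to variance per component
    ratio = var[:2] / var.sum()                       # share of total variance for PC1, PC2
    return coords, ratio

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(13, 20))                    # stand-in: 13 "tracks", 20 features
coords_demo, ratio_demo = pca_2d(X_demo)
print(coords_demo.shape, ratio_demo.round(3))
```

With 13 tracks there are at most 12 directions with nonzero variance after centering, no matter how many features each track has. The two explained-variance ratios tell us how much of the total spread the 2D picture retains.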
# --- PCA projection to 2D and Plotly setup ---
from sklearn.decomposition import PCA
import plotly.express as px
# Compute 2D PCA embedding from X_scaled
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)
emb_df["pca_x"] = X_pca[:, 0]
emb_df["pca_y"] = X_pca[:, 1]
print("Explained variance by PC1 and PC2:",
pca.explained_variance_ratio_[0],
pca.explained_variance_ratio_[1])
Explained variance by PC1 and PC2: 0.2872734305504054 0.23867184477917308
PCA scatter plot colored by k-means clusters¶
Each point is one track:
Position = projection of its acoustic embedding to the first two principal components (PC1, PC2).
Color = k-means cluster ID.
Hover text = file name.
This gives a geometric picture of how k-means partitioned the acoustic space.
Below are the color options for the dots in the plot.¶
px.colors.qualitative.Vivid
px.colors.qualitative.Dark24
px.colors.qualitative.Set1 # very bold, good for small number of clusters
px.colors.qualitative.Set3 # pastel but distinct
px.colors.qualitative.Alphabet # huge palette
Just replace color_discrete_sequence=px.colors.qualitative.Bold with one of the palettes above.
# --- k-means PCA scatter plot ---
# Ensure cluster labels are treated as categorical, not numeric
emb_df["cluster_kmeans_str"] = emb_df["cluster_kmeans"].astype(str)
fig_k = px.scatter(
emb_df,
x="pca_x",
y="pca_y",
color="cluster_kmeans_str", # use string labels → categorical colors
color_discrete_sequence=px.colors.qualitative.Bold, # <-- HIGH CONTRAST
hover_name="file_name",
hover_data={
"file_path": False,
"duration_sec": True,
"cluster_kmeans_str": True,
},
title="PCA Projection of Acoustic Embeddings — Colored by k-means Cluster",
labels={"pca_x": "PC1", "pca_y": "PC2", "cluster_kmeans_str": "k-means cluster"},
)
fig_k.update_layout(
legend_title_text="k-means cluster",
width=800,
height=500,
plot_bgcolor="#F0F2F5", # subtle grey for better contrast
)
fig_k.show()
Visualize HDBSCAN clusters¶
PCA scatter plot colored by HDBSCAN clusters
Now we color the same PCA projection by HDBSCAN cluster labels:
Each color = HDBSCAN cluster ID.
Label -1 means “noise” or “unclustered / outlier” points.
This can look quite different from k-means:
k-means forces every track into some cluster.
HDBSCAN is allowed to say, “these tracks don’t belong to any dense group.”
# --- Improved HDBSCAN PCA scatter plot with discrete colors ---
# Convert cluster labels to string categories
emb_df["cluster_hdbscan_str"] = emb_df["cluster_hdbscan"].astype(str)
emb_df.loc[emb_df["cluster_hdbscan"] == -1, "cluster_hdbscan_str"] = "noise (-1)"
# Prepare a high-contrast color palette
palette = px.colors.qualitative.Dark24.copy()
# Ensure noise has a consistent neutral color
noise_color = "#7f7f7f" # medium grey
unique_clusters = emb_df["cluster_hdbscan_str"].unique().tolist()
# Assign colors: noise gets grey, others use Dark24 cycling
color_map = {}
palette_i = 0
for c in unique_clusters:
    if c == "noise (-1)":
        color_map[c] = noise_color
    else:
        color_map[c] = palette[palette_i % len(palette)]
        palette_i += 1
# Build the figure
fig_h = px.scatter(
emb_df,
x="pca_x",
y="pca_y",
color="cluster_hdbscan_str",
color_discrete_map=color_map, # <-- Force our custom mapping
hover_name="file_name",
hover_data={
"file_path": False,
"duration_sec": True,
"cluster_hdbscan": True,
},
title="PCA Projection of Acoustic Embeddings — Colored by HDBSCAN Cluster",
labels={"pca_x": "PC1", "pca_y": "PC2", "cluster_hdbscan_str": "HDBSCAN cluster"},
)
fig_h.update_layout(
legend_title_text="HDBSCAN cluster",
width=800,
height=500,
plot_bgcolor="#F0F2F5", # subtle grey to help contrast
)
fig_h.show()
UMAP: Nonlinear 2D Embedding of Acoustic Space¶
PCA gave us a linear 2D summary of the embeddings.
Now we will use UMAP (Uniform Manifold Approximation and Projection):
UMAP is a nonlinear dimensionality reduction method.
It tries to preserve local neighborhoods: points that are close in high dimension tend to stay close in the 2D map.
This often reveals curved or irregular cluster structure that PCA misses.
We will:
Compute a 2D UMAP projection of our standardized feature matrix X_scaled.
Make two scatter plots (Plotly):
one colored by k-means cluster
one colored by HDBSCAN cluster
# --- Cell: Compute UMAP 2D projection ---
import umap
umap_model = umap.UMAP(
n_neighbors=10, # how many neighbors define "local"
min_dist=0.1, # how compact clusters are
metric="euclidean",
random_state=42,
)
X_umap = umap_model.fit_transform(X_scaled)
emb_df["umap_x"] = X_umap[:, 0]
emb_df["umap_y"] = X_umap[:, 1]
print("UMAP embedding shape:", X_umap.shape)
/Users/souvikmandal/AudioVenv/lib/python3.11/site-packages/umap/umap_.py:1952: UserWarning:
n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.
UMAP embedding shape: (13, 2)
UMAP + k-means visualization¶
# --- UMAP scatter plot for k-means clusters ---
# Ensure categorical labels
emb_df["cluster_kmeans_str"] = emb_df["cluster_kmeans"].astype(str)
fig_umap_k = px.scatter(
emb_df,
x="umap_x",
y="umap_y",
color="cluster_kmeans_str",
color_discrete_sequence=px.colors.qualitative.Dark24,
hover_name="file_name",
hover_data={
"file_path": False,
"duration_sec": True,
"cluster_kmeans_str": True,
},
title="UMAP of Acoustic Embeddings — Colored by k-means Cluster",
labels={"umap_x": "UMAP-1", "umap_y": "UMAP-2", "cluster_kmeans_str": "k-means cluster"},
)
fig_umap_k.update_layout(
legend_title_text="k-means cluster",
width=800,
height=500,
plot_bgcolor="#F0F2F5",
)
fig_umap_k.show()
UMAP + HDBSCAN visualization¶
# --- UMAP scatter plot for HDBSCAN clusters ---
# Make sure we have the string labels with "noise (-1)"
emb_df["cluster_hdbscan_str"] = emb_df["cluster_hdbscan"].astype(str)
emb_df.loc[emb_df["cluster_hdbscan"] == -1, "cluster_hdbscan_str"] = "noise (-1)"
palette = px.colors.qualitative.Dark24.copy()
noise_color = "#7f7f7f"
unique_clusters = emb_df["cluster_hdbscan_str"].unique().tolist()
color_map = {}
palette_i = 0
for c in unique_clusters:
    if c == "noise (-1)":
        color_map[c] = noise_color
    else:
        color_map[c] = palette[palette_i % len(palette)]
        palette_i += 1
fig_umap_h = px.scatter(
emb_df,
x="umap_x",
y="umap_y",
color="cluster_hdbscan_str",
color_discrete_map=color_map,
hover_name="file_name",
hover_data={
"file_path": False,
"duration_sec": True,
"cluster_hdbscan": True,
},
title="UMAP of Acoustic Embeddings — Colored by HDBSCAN Cluster",
labels={"umap_x": "UMAP-1", "umap_y": "UMAP-2", "cluster_hdbscan_str": "HDBSCAN cluster"},
)
fig_umap_h.update_layout(
legend_title_text="HDBSCAN cluster",
width=800,
height=500,
plot_bgcolor="#F0F2F5",
)
fig_umap_h.show()
Cluster Summaries¶
To interpret the clusters, we will compute simple summary statistics:
Number of tracks in each cluster
Average, minimum, and maximum track duration
We will do this for:
k-means clusters
HDBSCAN clusters (ignoring the noise cluster -1 for summaries)
Summaries for k-means¶
# --- Cluster summaries: k-means (all numeric metadata) ---
# Columns to exclude from "metadata" summarization
exclude_cols = set(feature_names) | {
"cluster_kmeans",
"cluster_kmeans_str",
"cluster_hdbscan",
"cluster_hdbscan_str",
"pca_x", "pca_y",
"umap_x", "umap_y",
}
# Metadata candidates = all other columns
metadata_cols = [c for c in emb_df.columns if c not in exclude_cols]
# Among those, pick only numeric columns for aggregation
numeric_meta_cols = emb_df[metadata_cols].select_dtypes(include="number").columns.tolist()
print("Numeric metadata columns being summarized (k-means):")
print(numeric_meta_cols)
# Build aggregation dict: for each numeric metadata column, compute mean/min/max
agg_dict = {"file_name": ("file_name", "count")} # count = number of tracks
for col in numeric_meta_cols:
    agg_dict[f"{col}_mean"] = (col, "mean")
    agg_dict[f"{col}_min"] = (col, "min")
    agg_dict[f"{col}_max"] = (col, "max")
summary_k = (
emb_df
.groupby("cluster_kmeans")
.agg(**agg_dict)
.rename(columns={"file_name": "num_tracks"})
.reset_index()
.sort_values("cluster_kmeans")
)
summary_k
Numeric metadata columns being summarized (k-means):
['duration_sec', 'sample_rate']
Summaries for HDBSCAN (ignoring noise)¶
# --- Cluster summaries: HDBSCAN (all numeric metadata, non-noise only) ---
mask_non_noise = emb_df["cluster_hdbscan"] != -1
df_h = emb_df[mask_non_noise].copy()
if df_h.empty:
    print("No non-noise HDBSCAN clusters to summarize.")
else:
    # Reuse the same logic as before, but on df_h
    exclude_cols_h = set(feature_names) | {
        "cluster_kmeans",
        "cluster_kmeans_str",
        "cluster_hdbscan",
        "cluster_hdbscan_str",
        "pca_x", "pca_y",
        "umap_x", "umap_y",
    }
    metadata_cols_h = [c for c in df_h.columns if c not in exclude_cols_h]
    numeric_meta_cols_h = df_h[metadata_cols_h].select_dtypes(include="number").columns.tolist()
    print("Numeric metadata columns being summarized (HDBSCAN, non-noise):")
    print(numeric_meta_cols_h)
    agg_dict_h = {"file_name": ("file_name", "count")}
    for col in numeric_meta_cols_h:
        agg_dict_h[f"{col}_mean"] = (col, "mean")
        agg_dict_h[f"{col}_min"] = (col, "min")
        agg_dict_h[f"{col}_max"] = (col, "max")
    summary_h = (
        df_h
        .groupby("cluster_hdbscan")
        .agg(**agg_dict_h)
        .rename(columns={"file_name": "num_tracks"})
        .reset_index()
        .sort_values("cluster_hdbscan")
    )
    # A bare expression inside an if/else block is not rendered by Jupyter, so display it explicitly
    display(summary_h)
No non-noise HDBSCAN clusters to summarize.