Sequence-Based Classification using Machine Learning - LS100: Computational Behavioral Sciences

This notebook outlines a systematic methodology for developing a supervised machine learning model that classifies an event from sequences of temporal data (e.g., biomechanical angles, stock prices, or sensor readings).

1. Problem Formulation

Sequence classification requires a clear definition of the input-to-outcome relationship. The objective is to map a continuous stream of temporal data to a discrete categorical label.

Input ( $X$ ): A multivariate time-series representing an event of fixed duration.
Output ( $y$ ): A single discrete ground-truth label corresponding to the outcome of that event.

Example Case: Analyzing the joint angles of two participants during the 240 frames leading up to a strike to predict the winner.
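To make the input/output shapes concrete, here is a minimal sketch using random numbers in place of real joint angles. The number of events (4) and the label values are made up for illustration; only the 240-frame length and the idea of six joint-angle features come from the example case.

```python
import numpy as np

# Hypothetical dimensions, chosen to mirror the example case above.
n_events = 4      # number of recorded events (assumed for illustration)
n_frames = 240    # frames per event
n_features = 6    # e.g. elbow, hip and knee angles for two participants

rng = np.random.default_rng(seed=0)

# X: one multivariate time series per event -> (events, frames, features)
X = rng.uniform(0.0, 180.0, size=(n_events, n_frames, n_features))

# y: exactly one discrete label per event (made-up values)
y = np.array(["h1", "h2", "h2", "h1"])

print(X.shape)  # (4, 240, 6)
print(y.shape)  # (4,)
```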

2. Data Acquisition and Structure

Successful sequence modeling relies on raw data organized by discrete events. In our example, each distinct file (e.g., a CSV) serves as a single observation point, which is often the case in real-world scenarios.

Temporal Indexing: Each event must be anchored by a consistent temporal axis (e.g., frame number or ms timestamp).
Feature Columns (Predictors): The variables the model will learn from. These columns represent specific variables (features) observed at each timestep or throughout the entire event.
Target Column(s): The ground-truth label(s) mapped to the feature data in each row. These labels are often assigned by a human annotator or derived from metadata or boolean flags within the file.
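As an illustration of the three column roles, the sketch below parses a tiny, made-up event file from a string. The `frame` column name and all values are assumptions for the example; real event files in this notebook have 240 rows and more feature columns.

```python
import io
import pandas as pd

# Hypothetical contents of one event file (values are invented).
csv_text = """frame,h1_elbow,h2_elbow,h1_right_of_way,h2_right_of_way
0,92.1,101.4,0,0
1,93.0,100.8,0,0
2,94.2,99.5,1,0
"""

event_df = pd.read_csv(io.StringIO(csv_text))

temporal_index = "frame"                                  # temporal axis
feature_columns = ["h1_elbow", "h2_elbow"]                # predictors
target_columns = ["h1_right_of_way", "h2_right_of_way"]   # label source

print(event_df[[temporal_index] + feature_columns + target_columns])
```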

3. Data Preprocessing & Standardization

Machine learning models usually require clean, consistent data, so the raw data must undergo rigorous preprocessing to ensure quality. Below are the preprocessing steps used in this example.

A. Target Label Extraction

Sometimes, we need to convert raw status columns into a single “Class” or “Target” variable.

Process: Identify which column indicates the outcome and map it to a label.
Example: If h1_right_of_way > 0 at any frame of the event, the winner is h1.
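The mapping from status columns to a single label can be sketched as follows. The helper name `extract_winner` is hypothetical, and treating an event with both flags set as unlabeled is an assumption made here for illustration, not a rule stated in this notebook.

```python
import pandas as pd

def extract_winner(event_df):
    """Return "h1" or "h2", or None if the flags do not single out one winner."""
    h1_flagged = event_df["h1_right_of_way"].max() > 0
    h2_flagged = event_df["h2_right_of_way"].max() > 0
    if h1_flagged and not h2_flagged:
        return "h1"
    if h2_flagged and not h1_flagged:
        return "h2"
    # Assumption: no flag, or both flags set, means the event is unlabeled.
    return None

toy_event = pd.DataFrame({
    "h1_right_of_way": [0, 0, 1],
    "h2_right_of_way": [0, 0, 0],
})
print(extract_winner(toy_event))  # h1
```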

B. Temporal Normalization

Sequence models (such as LSTMs or CNNs) generally require a uniform input size.

Standardization: Establish a fixed length of data for the feature columns for all events.
Quality Filtering: Events that deviate from the expected duration (e.g., fewer than 240 frames) should be excluded to prevent the introduction of incomplete or biased sequences.
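A minimal sketch of the length filter, using two toy events held in a dictionary. The names `EXPECTED_FRAMES`, `has_expected_length`, `event_a`, and `event_b` are hypothetical; the 240-frame requirement is the one from the example.

```python
import pandas as pd

EXPECTED_FRAMES = 240  # fixed sequence length used in the example

def has_expected_length(event_df, expected=EXPECTED_FRAMES):
    """True only when the event has exactly the expected number of frames."""
    return len(event_df) == expected

# Toy events: one complete, one truncated.
events = {
    "event_a": pd.DataFrame({"h1_elbow": range(240)}),
    "event_b": pd.DataFrame({"h1_elbow": range(182)}),
}

kept = {name: df for name, df in events.items() if has_expected_length(df)}
print(sorted(kept))  # ['event_a']
```

Excluding short events is the simplest option; padding or cropping are alternatives, but they change what each timestep means and would need their own justification.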

C. Context-Aware Imputation

Missing data points (NaNs) caused by sensor occlusion or transmission loss can disrupt the model’s ability to learn temporal patterns.

Methodology: Use linear interpolation to estimate missing values.
Constraints: To preserve biological or physical realism, interpolation should only be performed if the gap is flanked by a sufficient number of valid, non-null observations. Domain knowledge and an understanding of the statistical distribution of the data are essential for choosing that number (e.g., 10 data points on each side). This ensures the imputed values are anchored in legitimate, empirical data.
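The effect of the flanking constraint is easiest to see on a toy series. The sketch below is a simplified illustration, not a definitive implementation: the helper name `fill_flanked_gaps` is hypothetical, and a flanking requirement of 2 is used only to keep the example short.

```python
import numpy as np
import pandas as pd

def fill_flanked_gaps(series, flanking=2):
    """Linearly fill NaN runs that have `flanking` valid points on both sides."""
    is_nan = series.isna().to_numpy()
    # Candidate values for every interior gap; copied back selectively below.
    candidate = series.interpolate(method="linear", limit_area="inside")
    result = series.copy()
    n = len(series)
    i = 0
    while i < n:
        if not is_nan[i]:
            i += 1
            continue
        start = i
        while i < n and is_nan[i]:
            i += 1
        end = i  # exclusive end of the NaN run
        before = is_nan[max(0, start - flanking):start]
        after = is_nan[end:end + flanking]
        enough = len(before) == flanking and len(after) == flanking
        if enough and not before.any() and not after.any():
            result.iloc[start:end] = candidate.iloc[start:end].to_numpy()
    return result

toy = pd.Series([1.0, 2.0, np.nan, np.nan, 5.0, 6.0, np.nan, 8.0])
result = fill_flanked_gaps(toy, flanking=2)
print(result.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, nan, 8.0]
```

The first gap has two valid points on each side and is filled; the second gap has only one valid point after it, so it is left as NaN.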

4. Feature Vectorization and Structural Transformation

To enable sequence learning, the data must be transformed from a “long” format (many rows per event) to a “wide” format (one row per event).

Vectorization: Each feature column is transformed into a vector/array containing the values for the entire duration and is treated as a sequence of length $N$.
Event Embedding: By packing these sequences into cells, each row in the master dataframe represents a complete event, maintaining the temporal order essential for the model to detect dynamic patterns.

Feature A (Array/Sequence)	Feature B (Array/Sequence)	...	Target Label
`[a_0, a_1, ..., a_{N-1}]`	`[b_0, b_1, ..., b_{N-1}]`	...	Class X
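The long-to-wide transformation can be sketched with a pandas group-by. This example assumes all events sit in one long table with a hypothetical `event_id` column; in this notebook each event instead lives in its own file, so the same packing is done per file. All values are invented.

```python
import pandas as pd

# Toy long-format table: three frames for each of two events.
long_df = pd.DataFrame({
    "event_id": ["e1"] * 3 + ["e2"] * 3,
    "frame": [0, 1, 2, 0, 1, 2],
    "h1_elbow": [90.0, 91.0, 92.0, 80.0, 81.0, 82.0],
    "h2_elbow": [100.0, 99.0, 98.0, 110.0, 109.0, 108.0],
    "winner": ["h1"] * 3 + ["h2"] * 3,
})

# Sort by time first so each packed list keeps its temporal order.
wide_df = (
    long_df.sort_values(["event_id", "frame"])
    .groupby("event_id")
    .agg(
        h1_elbow=("h1_elbow", list),
        h2_elbow=("h2_elbow", list),
        winner=("winner", "first"),
    )
    .reset_index()
)

print(wide_df)
```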

5. Model Selection & Architecture

Once the data is vectorized, it is prepared for deep learning architectures specifically designed for temporal analysis:

Long Short-Term Memory (LSTM): Specialized in retaining long-term dependencies within the sequence.
1D Convolutional Neural Networks (1D-CNN): Effective at identifying local temporal signatures or “motifs” within the movement trajectory.
Transformer-based Models: Utilize self-attention mechanisms to determine which specific segments of the event are most predictive of the outcome.

Guidance for Implementation: When adapting this methodology, researchers must justify their choice of sequence length ( $N$ ), imputation constraints, and model architecture based on the specific dynamics of the phenomenon under investigation.
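To give some intuition for what a 1D-CNN filter does, the sketch below slides a small kernel along a toy one-feature sequence using plain NumPy. It is only an illustration: the kernel here is hand-picked to respond to a sudden rise, whereas a trained network learns its kernels from data, and the signal values are invented.

```python
import numpy as np

# Toy sequence: flat, then a sudden rise between index 4 and index 5.
signal = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0])

# Hand-picked kernel (assumption): responds to an upward step.
kernel = np.array([-1.0, 1.0])

# Slide the kernel along the sequence; response[k] = signal[k + 1] - signal[k].
response = np.correlate(signal, kernel, mode="valid")

print(response.shape)     # (9,)
print(response.argmax())  # 4
```

The response is zero everywhere except at the position of the rise, which is how a convolutional filter localizes a motif in time.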

%pip install pyarrow -q



import os
import pandas as pd
import numpy as np

INPUT_FOLDER = "/Users/souvikmandal/Documents/S06_Teaching_Mentoring_Talks/LS100/2026_Sem01/media/Henry/Henrys-Compile-Angles-Outputs/trimmed_csv"
OUTPUT_FILE = os.path.join(INPUT_FOLDER, "processed_master_df.parquet")
FEATURE_COLUMNS = ["h1_elbow", "h1_hip", "h1_knee", "h2_elbow", "h2_hip", "h2_knee"]
FLANKING_POINTS = 2  # valid points required on each side of a gap (overrides the function default of 10)

def interpolate_with_flanking(series, flanking=10):
    """
    Interpolate NaN blocks only when there are enough valid points
    on both sides of the gap.
    """
    series = series.copy()
    mask = series.isna()
    if not mask.any():
        return series

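    # Positional indices of all NaNs; non-empty because mask.any() is True.
    nan_positions = np.where(mask)[0]

    # Group consecutive NaN positions into inclusive (start, end) blocks.
    blocks = []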
    block_start = nan_positions[0]
    for index in range(1, len(nan_positions)):
        if nan_positions[index] != nan_positions[index - 1] + 1:
            blocks.append((block_start, nan_positions[index - 1]))
            block_start = nan_positions[index]
    blocks.append((block_start, nan_positions[-1]))

    # Interpolate everything once, then copy back only the gaps that pass the flanking check.
    interpolated = series.interpolate(method="linear")

    for start, end in blocks:
        pre_idx = list(range(max(0, start - flanking), start))
        post_idx = list(range(end + 1, min(len(series), end + 1 + flanking)))

        has_enough_context = len(pre_idx) == flanking and len(post_idx) == flanking
        if not has_enough_context:
            continue

        if series.iloc[pre_idx].notna().all() and series.iloc[post_idx].notna().all():
            series.iloc[start : end + 1] = interpolated.iloc[start : end + 1]

    return series


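def process_event_file(filepath, feature_columns=None, flanking=10):
    """
    Read one event CSV and return a dict with the source file name, the winner
    label, and one list per feature column. Return None if the file is skipped.
    """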
    if feature_columns is None:
        feature_columns = FEATURE_COLUMNS

    try:
        df = pd.read_csv(filepath)
    except Exception as error:
        print(f"Skipping {os.path.basename(filepath)}: read error ({error})")
        return None

    if len(df) != 240:
        print(f"Skipping {os.path.basename(filepath)}: incorrect length ({len(df)} rows)")
        return None

    required_label_columns = ["h1_right_of_way", "h2_right_of_way"]
    required_columns = required_label_columns + feature_columns
    missing_columns = [col for col in required_columns if col not in df.columns]
    if missing_columns:
        print(f"Skipping {os.path.basename(filepath)}: missing columns {missing_columns}")
        return None

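    # A positive flag anywhere in the event marks the winner; h1 is checked
    # first, so h1 takes precedence if both columns contain a positive value.
    h1_val = df["h1_right_of_way"].max()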
    h2_val = df["h2_right_of_way"].max()

    if h1_val > 0:
        winner = "h1"
    elif h2_val > 0:
        winner = "h2"
    else:
        print(f"Skipping {os.path.basename(filepath)}: no winner identified")
        return None

    for col in feature_columns:
        df[col] = interpolate_with_flanking(df[col], flanking=flanking)

    if df[feature_columns].isna().any().any():
        print(f"Skipping {os.path.basename(filepath)}: remaining NaNs after interpolation")
        return None

    row_data = {"source_file": os.path.basename(filepath), "winner": winner}
    for col in feature_columns:
        row_data[col] = df[col].tolist()

    return row_data


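def build_processed_master_dataframe(input_folder, output_file=None, feature_columns=None, flanking=10):
    """
    Process every CSV in input_folder, collect the valid events into one
    dataframe (one row per event), save it as Parquet, and return it.
    """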
    if feature_columns is None:
        feature_columns = FEATURE_COLUMNS

    if output_file is None:
        output_file = os.path.join(input_folder, "processed_master_df.parquet")

    output_name = os.path.basename(output_file)
    all_events = []

    for filename in sorted(os.listdir(input_folder)):
        if not filename.endswith(".csv"):
            continue
        if filename == output_name:
            continue

        filepath = os.path.join(input_folder, filename)
        row_data = process_event_file(filepath, feature_columns=feature_columns, flanking=flanking)
        if row_data is not None:
            all_events.append(row_data)

    if not all_events:
        raise ValueError("No valid files found to process.")

    processed_master_df = pd.DataFrame(all_events)
    processed_master_df.to_parquet(output_file, index=False)
    print(f"Saved {len(processed_master_df)} processed events to {output_file}")
    return processed_master_df


processed_master_df = build_processed_master_dataframe(
    INPUT_FOLDER,
    output_file=OUTPUT_FILE,
    feature_columns=FEATURE_COLUMNS,
    flanking=FLANKING_POINTS,
)
processed_master_df.head()

Skipping LS100_FInalData_15_final_kinematics_healed_id__angles_training_sheet_trimed.csv: remaining NaNs after interpolation
Skipping LS100_FInalData_17_final_kinematics_healed_id__angles_training_sheet_trimed.csv: remaining NaNs after interpolation
Skipping LS100_FInalData_24_final_kinematics_healed_id__angles_training_sheet_trimed.csv: remaining NaNs after interpolation
Skipping LS100_FInalData_2_final_kinematics_healed_id__angles_training_sheet_trimed.csv: incorrect length (182 rows)
Skipping LS100_FInalData_6_final_kinematics_healed_id__angles_training_sheet_trimed.csv: incorrect length (226 rows)
Skipping LS100_FinalData_31_final_kinematics_healed_id__angles_training_sheet_trimed.csv: remaining NaNs after interpolation
Saved 26 processed events to /Users/souvikmandal/Documents/S06_Teaching_Mentoring_Talks/LS100/2026_Sem01/media/Henry/Henrys-Compile-Angles-Outputs/trimmed_csv/processed_master_df.parquet

processed_master_df.shape

(26, 8)
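The saved table has 26 rows and 8 columns: `source_file`, `winner`, and the six feature columns, each cell holding one sequence. Sequence models generally expect a numeric array rather than a dataframe of lists, so a conversion step is usually needed. The sketch below shows one possible conversion on a toy frame (the values and the three-frame length are invented); applied to the real table with all six feature columns, it would be expected to yield an array of shape (26, 240, 6).

```python
import numpy as np
import pandas as pd

# Toy stand-in for processed_master_df: one row per event, sequences in cells.
wide_df = pd.DataFrame({
    "h1_elbow": [[90.0, 91.0, 92.0], [80.0, 81.0, 82.0]],
    "h2_elbow": [[100.0, 99.0, 98.0], [110.0, 109.0, 108.0]],
    "winner": ["h1", "h2"],
})
feature_columns = ["h1_elbow", "h2_elbow"]

# Each column becomes an (events, timesteps) matrix; stacking on the last
# axis gives (events, timesteps, features). Works for list or array cells.
X = np.stack(
    [np.array(wide_df[col].tolist(), dtype=float) for col in feature_columns],
    axis=-1,
)

# Encode string labels as integers; class_names[i] is the label for code i.
class_names, y = np.unique(wide_df["winner"].to_numpy(), return_inverse=True)

print(X.shape)              # (2, 3, 2)
print(class_names.tolist()) # ['h1', 'h2']
print(y.tolist())           # [0, 1]
```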