Linear Mixed-Effects Models: Why We Need Them

In this notebook, we will analyze data from n number of athletes doing a repetitive exercise (e.g., push-ups) in 2 sessions:

Each athlete has:
- Session 1 (fewer repeats)
- Session 2 (more repeats)
In each session, the athlete performs many repeats, and every repeat has a correctness score between 0 and 1.

Our research question:

Do athletes show lower correctness scores in the session with a higher number of repeats? (i.e., does performance decrease when the volume of work is higher?)

Why a Simple t-test Is Not Enough¶

At first glance, this might look like a simple comparison of Session 1 vs. Session 2 correctness scores. However, there is an important complication:

The data are clustered within athletes.
- Repeats from the same athlete are likely more similar to each other than to repeats from another athlete.
- This violates the independence assumption of simple tests like the standard t-test or simple linear regression.

If we ignore the fact that multiple measurements come from the same person, we:

Over-count evidence (treat each repeat as if it were a completely independent person).
Underestimate variability.
Risk inflated Type I error (finding “significant” effects that are not actually reliable).

To handle this structure properly, we need a model that:

Lets us estimate the overall effect of session type (low vs. high volume), and
Accounts for the fact that each athlete has their own baseline performance.

This is exactly what a Linear Mixed-Effects Model (LMM) is designed to do.

What Is a Linear Mixed-Effects Model?¶

A Linear Mixed-Effects Model is like a regular linear regression model, but with two types of effects:

Fixed effects (Population-level)
- These are the effects we are primarily interested in.
- They describe how the average correctness changes with predictors like session type (low vs. high volume).
- Example: “On average, correctness is 0.08 lower in the high-volume session.”
Random effects (Individual-level)
- These capture individual-specific deviations from the overall average.
- In our case, each athlete may have a different baseline correctness level.
- Example: Athlete A is generally more accurate than Athlete B, regardless of session.

The random effects allow the model to say:

“All athletes share some common pattern (fixed effect), but each athlete is allowed to start at their own baseline level (random effect).”

This combination lets us borrow strength across athletes while still respecting that they are different.

Key assumptions of LMMs¶

Normal distribution of random effects: The random effects (which capture group-level variation) are assumed to follow a normal distribution with a mean of (0).
Normal distribution of residuals: The errors or residuals (the differences between the observed and predicted values) are assumed to be normally distributed with a mean of (0) and a constant variance.
Linear relationship: There is a linear relationship between the predictors and the response variable.
Constant variance: The variance of the errors is constant across all levels of the predictor variables (homoscedasticity).

Our Model for the Push-Up Scenario¶

We will model the correctness score for each repeat using a simple mixed-effects structure.

Let:

$i$ index athletes (1 to 4)
$j$ index sessions (1 or 2)
$k$ index repeats within a session

Let:

$\text{correctness}_{ijk}$ = correctness score of repeat $k$ , in session $j$ , for athlete $i$
$\text{sessionHigh}_{ij} = 0$ for the lower-volume session, and 1 for the higher-volume session

The simple linear mixed-effects model is:

\text{correctness}_{ijk} = \beta_0 + \beta_1 \cdot \text{sessionHigh}_{ij} + u_{0i} + \varepsilon_{ijk}

(1)

Where:

$\beta_0$ : The average baseline correctness in the low-volume session (across all athletes). (Fixed Intercept)
$\beta_1$ : The fixed effect of session type (high vs. low volume).
- If $\beta_1 < 0$ , then correctness tends to be lower in the high-volume session.
- This is the main effect we care about. (Fixed Slope)
$u_{0i}$ : The random intercept for athlete $i$ .
- This term allows each athlete to have their own starting level of correctness.
- We usually assume $u_{0i} \sim \mathcal{N}(0, \sigma_u^2)$ , meaning athletes vary around the global mean with some variance. (Random Intercept)
$\varepsilon_{ijk}$ : The residual error for that specific repeat.
- This captures trial-to-trial variability not explained by session type or athlete.
- We assume $\varepsilon_{ijk} \sim \mathcal{N}(0, \sigma^2)$ . (Residual Error)

Interpretation in Plain Language¶

The fixed effect $\beta_1$ answers our main question:
“After accounting for differences between athletes, is correctness lower in the high-volume session?”
The random intercepts $u_{0i}$ say:
“Some athletes are generally more accurate; some are less accurate. We let the model learn that instead of forcing everyone to be identical.”

This approach respects the hierarchical structure of the data: repeats are nested inside sessions, and sessions are nested inside athletes.

Simulating Data for 4 Athletes and 2 Sessions¶

We will simulate a small dataset to match our scenario:

4 athletes (ID 1–4)
Each athlete performs:
- Session 1 (low volume): 20 repeats
- Session 2 (high volume): 30 repeats
Each repeat has a correctness score between 0 and 1.

We will assume:

Each athlete has their own baseline correctness (random intercept).
The high-volume session has a negative effect on correctness (to mimic fatigue), the same on average for all athletes.
There is some random trial-to-trial noise.

This will give us a realistic dataset to fit our linear mixed-effects model.

import numpy as np
import pandas as pd

# For reproducibility
rng = np.random.default_rng(42)

# Named athletes
athletes = ["susan_01", "alex_02", "maya_03", "david_04"]

# Named sessions
sessions = ["session_20", "session_30"]  # 20 repeats vs 30 repeats
n_repeats = {"session_20": 20, "session_30": 30}

# Fixed effect: average change in correctness in the high-volume session
beta_session_high = -0.08  # on average, correctness drops by 0.08 in session_30

rows = []

for athlete in athletes:
    # Random intercept: each athlete has their own baseline correctness
    baseline_correctness = rng.normal(loc=0.85, scale=0.05)
    
    for session in sessions:
        # Indicator: 0 for low volume (session_20), 1 for high volume (session_30)
        session_high = 1 if session == "session_30" else 0
        
        # Number of repeats in this session
        n = n_repeats[session]
        
        for repeat in range(1, n + 1):
            # Mean correctness for this athlete & session
            mu = baseline_correctness + beta_session_high * session_high
            
            # Add trial-to-trial noise
            correctness = rng.normal(loc=mu, scale=0.05)
            
            # Clip to [0, 1] to keep it a valid score
            correctness = np.clip(correctness, 0, 1)
            
            rows.append({
                "athlete_id": athlete,
                "session_label": session,   # "session_20" or "session_30"
                "session_high": session_high,  # 0 = low volume, 1 = high volume
                "repeat_id": repeat,
                "correctness": correctness
            })

df = pd.DataFrame(rows)

df.head()

print("\nMean correctness by athlete and session:")

df.groupby(["athlete_id", "session_label"])["correctness"].mean().unstack()


Mean correctness by athlete and session:

Fitting a Linear Mixed-Effects Model in Python¶

Now that we have a simulated dataset, we will fit the following linear mixed-effects model:

\text{correctness}_{ijk} = \beta_0 + \beta_1 \cdot \text{session\_high}_{ij} + u_{0i} + \varepsilon_{ijk}

(2)

Understanding the Parameters¶

$\text{correctness}_{ijk}$ : correctness score of repeat $k$ in session $j$ for athlete $i$
$\text{session\_high}_{ij}$ : The Fixed Effect predictor (Dummy variable)
- 0 for the low-volume session ("session_20")
- 1 for the high-volume session ("session_30")
$\beta_0$ : The Fixed Intercept—average baseline correctness in the low-volume session (across all athletes).
$\beta_1$ : The Fixed Slope—average effect of being in the high-volume session.
- If $\beta_1 < 0$ , correctness tends to be lower in "session_30".
- This is the main effect we are testing.
$u_{0i}$ : The Random Intercept for athlete $i$ , capturing that some athletes are generally more (or less) accurate than the population average $\beta_0$ .
$\varepsilon_{ijk}$ : Residual Error—the leftover trial-to-trial noise not explained by athlete or session.

Model Implementation in Python¶

In Python, we can fit this model using the statsmodels package (specifically the MixedLM function).

We treat correctness as the response variable.
We use session_high as a fixed effect predictor.
We specify athlete_id as the grouping variable for the random intercepts.

Key Questions to Answer¶

After fitting the model, we will inspect the summary table to address our main questions:

Sign and size of $\beta_1$ :
- Is it negative?
- How big is the drop in correctness in the high-volume session?
Statistical significance of $\beta_1$ :
- Is the session effect reliably different from zero (check the p-value)?

We will fit the model and then inspect the summary table for the fixed effects and random effects variances ( $\sigma^2_{u0}$ and $\sigma^2$ ).

import statsmodels.formula.api as smf

# Fit a linear mixed-effects model:
# correctness ~ session_high  with random intercepts for athlete_id
model = smf.mixedlm(
    formula="correctness ~ session_high",
    data=df,
    groups=df["athlete_id"]
) # you can also add method = "lbfgs"/ "nm", etc. inside the mixedlm() function.

result = model.fit()
print(result.summary())

# For quick intuition: mean correctness by session_label
print("\nMean correctness by session_label:")
print(df.groupby("session_label")["correctness"].mean())

          Mixed Linear Model Regression Results
=========================================================
Model:            MixedLM Dependent Variable: correctness
No. Observations: 200     Method:             REML       
No. Groups:       4       Scale:              0.0019     
Min. group size:  50      Log-Likelihood:     328.5277   
Max. group size:  50      Converged:          Yes        
Mean group size:  50.0                                   
---------------------------------------------------------
              Coef.  Std.Err.    z    P>|z| [0.025 0.975]
---------------------------------------------------------
Intercept      0.846    0.013  67.146 0.000  0.821  0.871
session_high  -0.073    0.006 -11.422 0.000 -0.085 -0.060
Group Var      0.001    0.011                            
=========================================================


Mean correctness by session_label:
session_label
session_20    0.845846
session_30    0.773172
Name: correctness, dtype: float64

/Users/souvikmandal/StatEnv/lib/python3.12/site-packages/statsmodels/regression/mixed_linear_model.py:2237: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
  warnings.warn(msg, ConvergenceWarning)

Understanding the Mixed-Effects Model Output¶

Let’s go through the important pieces one by one.

1. Fixed Effects: `Intercept` and `session_high`¶

These are the main parameters of interest.

Intercept¶

What it is:
The estimated mean correctness when session_high = 0, i.e., in the low-volume session (session_20), averaged across athletes.
Here: Intercept = 0.846
- Interpretation:
  In session_20, the model estimates that the average correctness is about 0.846.

`session_high`¶

What it is:
The estimated change in correctness when we go from:
- session_high = 0 → session_20 (low volume)
  to
- session_high = 1 → session_30 (high volume).
Here: session_high = -0.073
- Interpretation:
  On average, correctness in session_30 is 0.073 units lower than in session_20, after accounting for differences between athletes.
Based on the estimated coefficients from the Linear Mixed-Effects Model (LMM), we can calculate the predicted average correctness score for each session type across all athletes:
- Predicted mean in session_20 (low volume): This is given by the model’s Fixed Intercept ( $\beta_0$ ), since $\text{session\_high} = 0$ .
  $\hat{\mu}_{20} = \hat{\beta}_0$
  (3)
  $\hat{\mu}_{20} = \text{Intercept} \approx 0.846$
  (4)
- Predicted mean in session_30 (high volume): This is given by the Fixed Intercept ( $\beta_0$ ) plus the Fixed Slope for $\text{session\_high}$ ( $\beta_1$ ), since $\text{session\_high} = 1$ .
  $\hat{\mu}_{30} = \hat{\beta}_0 + \hat{\beta}_1$
  (5)
  $\hat{\mu}_{30} = \text{Intercept} + \text{session\_high} \approx 0.846 - 0.073 = 0.773$
  (6)

2. `Std.Err.` (Standard Error)¶

What it is:
The estimated uncertainty of each coefficient.
Smaller standard errors → more precise estimates.
Examples:
- Intercept: Std.Err. = 0.013
- session_high: Std.Err. = 0.006

We use the standard error to build test statistics (like the z value) and confidence intervals.

3. `z` (z-statistic)¶

What it is:
The test statistic used to evaluate the null hypothesis ( $H_0$ ) that a fixed-effect coefficient (like $\beta_1$ ) is zero is the z-statistic.

The z-statistic measures how many standard errors the estimated coefficient is away from zero.

z = \frac{\text{Estimate}}{\text{Std.Err.}}

(7)

Examples:
- Intercept: z = 67.146 → extremely large in magnitude → the intercept is clearly not 0.
- session_high: z = -11.422 → large negative value → strong evidence that the session effect is not 0.

Intuition:
The bigger the absolute value of z (e.g., > 2 or < -2), the stronger the evidence against the null hypothesis that the true coefficient is 0.

4. `P>|z|` (p-value)¶

What it is:
The p-value for testing:

H_0: \text{coefficient} = 0

(8)

It answers:
“If the true effect were 0, how likely is it to see a z-statistic as extreme as the one we observed?”
In the output:
- Intercept: P>|z| = 0.000
- session_high: P>|z| = 0.000
(0.000 means “so small it rounds to 0 to three decimal places”.)
For session_high, this means:
There is very strong evidence that the session effect is not zero → the high-volume session truly has a different (lower) correctness.

5. `[0.025 0.975]` (95% Confidence Interval)¶

What it is:
A 95% confidence interval for each coefficient:
- [0.025 → lower bound
- 0.975] → upper bound
Example for session_high:
- CI: [-0.085, -0.060]
- Interpretation:
  We are (approximately) 95% confident that the true effect of the high-volume session is between -0.085 and -0.060.
  So correctness in session_30 is somewhere between 0.060 and 0.085 points lower than in session_20.
Important point:
The entire CI is below zero, which matches the very small p-value and supports the conclusion that correctness drops in the high-volume session.

6. `Group Var` (Variance of Random Intercepts)¶

What it is:
The estimated variance of the random intercepts for athlete_id.
Here: Group Var = 0.001
- Interpretation:
  Athletes differ slightly in their overall correctness.
  The size of this variance tells us how much athletes vary around the global average.
If Group Var were 0, that would mean:
- All athletes are essentially identical in baseline correctness.
A non-zero Group Var means:
- The model sees real, between-athlete differences in baseline performance, beyond random noise.

7. `Scale`¶

What it is:
The estimated residual variance ( $\sigma^2$ ), i.e., the variability of the correctness scores within athletes and sessions that is not explained by:
- session effect (fixed effect), or
- athlete differences (random effects).
Here: Scale = 0.0019
- Interpretation:
  Even after accounting for session and athlete, there is still some trial-to-trial variability in correctness.
  The model assumes this residual noise has variance ≈ 0.0019.

8. `Log-Likelihood`¶

What it is:
A measure of how well the model fits the data (higher is better, for the same data and model class).
Here: Log-Likelihood = 328.5277
On its own, this number is mostly useful when you:
- Compare different models (e.g., with and without a random effect, or with and without an extra fixed effect), using criteria like AIC or LRT (Likelihood Ratio Test).

For our purposes in this teaching example, you can treat it as a model-fit statistic that we don’t interpret directly in “correctness units”.

9. `Converged: Yes`¶

What it is:
Indicates whether the numerical optimization algorithm successfully found a set of parameter values that maximize the likelihood.
Here: Converged: Yes
- Interpretation:
  The model fitting procedure succeeded, and the results are trustworthy (at least numerically).
If it said No, we would need to:
- Change the optimizer settings,
- Check the model specification,
- Or simplify the model.

10. Putting It All Together (Plain Language Summary)¶

From this model, we can say:

The average correctness in the low-volume session (session_20) is about 0.846.
The high-volume session (session_30) has an estimated effect of -0.073:
- Correctness drops to about 0.773.
- This drop is statistically significant (very small p-value, CI below zero).
There is some variation between athletes (Group Var), and additional trial-to-trial noise (Scale).
The optimizer converged, so we can trust the numerical estimates.

So, in the context of our question:

Yes, the model provides strong evidence that correctness scores are lower in the session with more repeats (session_30), even after accounting for differences between athletes.

Why a Simple t-test Is Not Enough¶

What Is a Linear Mixed-Effects Model?¶

Key assumptions of LMMs¶

Our Model for the Push-Up Scenario¶

Interpretation in Plain Language¶

Simulating Data for 4 Athletes and 2 Sessions¶

Fitting a Linear Mixed-Effects Model in Python¶

Understanding the Parameters¶

Model Implementation in Python¶

Key Questions to Answer¶

Understanding the Mixed-Effects Model Output¶

1. Fixed Effects: Intercept and session_high¶

Intercept¶

session_high¶

2. Std.Err. (Standard Error)¶

3. z (z-statistic)¶

4. P>|z| (p-value)¶

5. [0.025 0.975] (95% Confidence Interval)¶

6. Group Var (Variance of Random Intercepts)¶

7. Scale¶

8. Log-Likelihood¶

9. Converged: Yes¶