Python Essentials for Research in Behavioral Sciences: Data Automation in Python

Harvard University

Welcome to Week 2. In the previous notebook, you learned how to represent research data in Python using variables, data structures, pandas, and NumPy. That gave you the raw materials. In this notebook, we will start using those materials more strategically so Python can do repeated work for us.

The big idea this week is automation. Researchers often apply the same rule to many records: checking eligibility, filtering data, labeling observations, converting messy values, or summarizing repeated measurements. If you can express a rule clearly in code, Python can apply it faster, more consistently, and with fewer errors than your sleep-deprived spreadsheet self.

Treat this notebook as training in both coding and scientific reasoning:

  1. Run cells in order.

  2. Before each code cell, predict what you expect to happen.

  3. After running a cell, compare your expectation with the output.

  4. At each checkpoint, write one sentence about what rule or pattern your code is actually using.

Weekly Goal: move from simply storing and inspecting data to automating repeated decisions and repeated operations.

Time Plan (4-6 hours a week): This notebook has four sections. Each section is designed to take about 30-40 minutes for beginners, plus time for the checkpoint tasks and the final weekly challenge.


Section 1: More Operations with pandas

In Week 1, you learned how pandas can store tabular data in a DataFrame. That was the first step: getting data into a structure that is easier to inspect than a list of dictionaries. Now we will push a bit further and ask a more practical question: once the data is in a DataFrame, what do we actually do with it?

1.1. Inspecting and Summarizing a DataFrame

Before filtering, modeling, or reporting anything, you should first inspect the dataset. That means checking what columns are present, how many rows you have, and whether the values look roughly reasonable. This is one of the most boring habits in research coding, which is exactly why it saves you from very exciting disasters later.
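
A minimal sketch of such an inspection pass, assuming a small made-up table (toy_df and its columns are illustrative and not part of the course dataset). It checks size, column types, missing values, and a quick numeric summary:

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy_df = pd.DataFrame({
    "id": ["A1", "A2", "A3", "A4"],
    "age": [21, 34, 19, 28],
    "rt_ms": [512.0, 430.5, None, 610.2],
})

n_rows, n_cols = toy_df.shape            # how many rows and columns?
column_types = toy_df.dtypes             # what type does each column hold?
missing_counts = toy_df.isna().sum()     # how many missing values per column?
age_summary = toy_df["age"].describe()   # count, mean, std, min, quartiles, max

print("Rows:", n_rows, "Columns:", n_cols)
print(column_types)
print(missing_counts)
print(age_summary)
```

If a column you expected to be numeric shows up with dtype object, or a count is lower than the number of rows, that is usually worth investigating before any filtering.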

1.2. Filtering Rows and Creating New Columns

Once you trust the basic structure of your DataFrame, the next step is usually to apply a rule. Maybe you want only adult participants, or only participants who completed enough sessions, or only observations above a threshold. pandas makes this efficient by letting you filter rows using boolean conditions.

You can also create new columns from existing ones. This is useful when you want to label observations, mark eligibility, or create a quick indicator that will be reused later in your analysis pipeline.
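
The two ideas above can be sketched on a made-up table (toy_df, sessions, and is_adult are hypothetical names chosen for this illustration):

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy_df = pd.DataFrame({
    "id": ["A1", "A2", "A3", "A4"],
    "age": [17, 34, 19, 28],
    "sessions": [5, 2, 4, 6],
})

# Each comparison produces a True/False Series; & combines them row by row.
# The parentheses are required because & binds more tightly than >=.
keep_mask = (toy_df["age"] >= 18) & (toy_df["sessions"] >= 4)

# Filtering: keep only the rows where the mask is True.
# .copy() makes the filtered table independent of the original.
adults_with_sessions = toy_df[keep_mask].copy()

# Derived column: store the result of a rule so it can be reused later
toy_df["is_adult"] = toy_df["age"] >= 18

print(adults_with_sessions)
print(toy_df)
```

Storing the mask in a named variable first makes the rule easy to read and easy to reuse.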

# Example 1: inspect a small research dataset with pandas
import pandas as pd

participant_df = pd.DataFrame([
    {"id": "P01", "name": "John Doe", "age": 25, "consent": True, "year_spent": 3, "score": [72, 81, 90]},
    {"id": "P02", "name": "Shanti Murmu", "age": 24, "consent": True, "year_spent": 3, "score": [91, 85, 78]},
    {"id": "P03", "name": "Bahar Shirazi", "age": 27, "consent": True, "year_spent": 5, "score": [82, 79, 77, 88, 95]},
    {"id": "P04", "name": " Kwame Adeyemi", "age": 26, "consent": True, "year_spent": 4, "score": [86, 74, 88, 99]},
    {"id": "P05", "name": "Olivia Silva", "age": 22, "consent": True, "year_spent": 1, "score": [77]}
])

print(participant_df.head())
print("\nShape:", participant_df.shape)
print("Columns:", participant_df.columns.tolist())
    id            name  age  consent  year_spent                 score
0  P01        John Doe   25     True           3          [72, 81, 90]
1  P02    Shanti Murmu   24     True           3          [91, 85, 78]
2  P03   Bahar Shirazi   27     True           5  [82, 79, 77, 88, 95]
3  P04   Kwame Adeyemi   26     True           4      [86, 74, 88, 99]
4  P05    Olivia Silva   22     True           1                  [77]

Shape: (5, 6)
Columns: ['id', 'name', 'age', 'consent', 'year_spent', 'score']

Finding the Mean of Lists Inside a Column

In this dataset, each value in the score column is a list of scores for one participant (for example, [72, 81, 90]). We cannot call .mean() directly on the full column and expect the participant-level averages to appear automatically.

We do it in two steps:

  1. Use .apply(...) to run the same function on each participant’s score list.

  2. From those participant-level means, calculate one overall mean.

lambda x: sum(x) / len(x) means: take one participant’s list x, add all scores, and divide by the number of scores.

In short, we are computing:

  • mean score per participant, then

  • mean of those participant means.

# Example 1.1: mean score per participant, then overall mean
participant_mean_scores = participant_df["score"].apply(lambda score_list: sum(score_list) / len(score_list))

overall_mean_score = round(participant_mean_scores.mean(), 2)

print("Participant-level mean scores:")
print(participant_mean_scores)
print("\nOverall mean score across participants:", overall_mean_score)
Participant-level mean scores:
0    81.000000
1    84.666667
2    84.200000
3    86.750000
4    77.000000
Name: score, dtype: float64

Overall mean score across participants: 82.72

We can rewrite the previous calculation in one compact line as shown below.

round(participant_df["score"].apply(lambda x: sum(x) / len(x)).mean(), 2)
np.float64(82.72)
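
One caution about this number: a mean of participant means weights every participant equally, even though P03 contributed five scores and P05 only one. A pooled mean over all individual scores weights every score equally and can give a different answer. The sketch below compares the two using the same score lists as Example 1; the variable names and the use of Series.explode() are choices made for this illustration, not part of the original notebook.

```python
import pandas as pd

# Same score lists as in Example 1, reduced to the one column we need
scores = pd.Series(
    [[72, 81, 90], [91, 85, 78], [82, 79, 77, 88, 95], [86, 74, 88, 99], [77]],
    index=["P01", "P02", "P03", "P04", "P05"],
)

# Mean of participant means: every participant counts equally
mean_of_means = scores.apply(lambda x: sum(x) / len(x)).mean()

# Pooled mean: every individual score counts equally.
# explode() turns each list element into its own row.
all_scores = scores.explode().astype(float)
pooled_mean = all_scores.mean()

print("Number of individual scores:", len(all_scores))
print("Mean of participant means:", round(mean_of_means, 2))
print("Pooled mean of all scores:", round(pooled_mean, 2))
```

Which of the two is appropriate depends on the research question, so it helps to state in your report which one you computed.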

Let’s see another lambda example

Here we will use lambda to create new numeric summaries from each participant’s list of scores. In Section 3, we will build on this and use conditional logic (if/elif/else) to create rule-based labels.

# Example 1.2: use lambda to derive numeric summary columns
participant_df["n_scores"] = participant_df["score"].apply(lambda score_list: len(score_list))
participant_df["mean_score"] = participant_df["score"].apply(
    lambda score_list: round(sum(score_list) / len(score_list), 2)
)

print(participant_df[["id", "score", "n_scores", "mean_score"]])
    id                 score  n_scores  mean_score
0  P01          [72, 81, 90]         3       81.00
1  P02          [91, 85, 78]         3       84.67
2  P03  [82, 79, 77, 88, 95]         5       84.20
3  P04      [86, 74, 88, 99]         4       86.75
4  P05                  [77]         1       77.00
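
When a lambda does nothing more than pass its argument to one existing function, the function itself can be handed to .apply(...). The sketch below is an optional shorthand, not a replacement for the version above; it uses a shortened copy of the score lists and statistics.mean from the standard library in place of the manual sum/len.

```python
import pandas as pd
from statistics import mean

# A shortened copy of the score lists, for illustration only
scores = pd.Series([[72, 81, 90], [91, 85, 78], [77]])

# Equivalent to: scores.apply(lambda score_list: len(score_list))
n_scores = scores.apply(len)

# Equivalent to: sum(score_list) / len(score_list), then rounded to 2 decimals
mean_score = scores.apply(mean).astype(float).round(2)

print(n_scores.tolist())
print(mean_score.tolist())
```

The lambda form is still the better choice when the rule needs more than one step, such as rounding inside the function.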

Uncomment the following lines and run them one at a time to see what len() and sum() return for a single participant's score list.

#len(participant_df["score"].iloc[1])
#sum(participant_df["score"].iloc[1])

Notice that we added two new columns to participant_df: one for the number of recorded scores (n_scores) and one for each participant’s mean score (mean_score).

In Section 3, we will go one step further and use these kinds of values inside explicit decision rules.

participant_df

The same participant-level means can also be computed with an explicit for loop, which previews the loop patterns covered in Section 3:

# Using a for loop to calculate mean scores per participant
participant_mean_scores = []
for score_list in participant_df["score"]:
    participant_mean_scores.append(round(sum(score_list) / len(score_list), 2))

participant_mean_scores
[81.0, 84.67, 84.2, 86.75, 77.0]
# Example 2: filter rows and create a useful new column
eligible_df = participant_df[
    (participant_df["age"] >= 18)
    & (participant_df["consent"])
    & (participant_df["year_spent"] >= 3)
].copy()

participant_df["above_average_score"] = (
    participant_df["mean_score"] > participant_df["mean_score"].mean()
)

print("Eligible participants:\n")
print(eligible_df[["id", "age", "year_spent", "score"]])
print("\nOriginal data with a derived column:\n")
print(participant_df[["id", "score", "above_average_score"]])
Eligible participants:

    id  age  year_spent                 score
0  P01   25           3          [72, 81, 90]
1  P02   24           3          [91, 85, 78]
2  P03   27           5  [82, 79, 77, 88, 95]
3  P04   26           4      [86, 74, 88, 99]

Original data with a derived column:

    id                 score  above_average_score
0  P01          [72, 81, 90]                False
1  P02          [91, 85, 78]                 True
2  P03  [82, 79, 77, 88, 95]                 True
3  P04      [86, 74, 88, 99]                 True
4  P05                  [77]                False

Checkpoint 1

Use participant_df to create a filtered DataFrame containing only participants who both consented and have a mean_score of at least 80.

Tasks:

  • Print the columns id, mean_score, and year_spent.

  • Create a new column called long_tenure using the rule year_spent >= 4.

  • Write one sentence explaining what your filtered output says about the sample.

Section 2: More Operations with NumPy

In Week 1, NumPy appeared as the fast numerical engine underneath many data operations. This week, we will use it more deliberately. The main question is not “Can NumPy do this?” The real question is “When does it help me think and compute more clearly than plain Python or even pandas?”

2.1. Arrays for Repeated Calculations

A NumPy array is especially useful when you want to apply the same numeric operation to many values at once. Instead of updating one score at a time, you can adjust the full set in one line. That is where automation starts feeling less like extra work and more like a tiny reward for writing clean code.

# Example 3: use NumPy arrays for repeated numeric operations
import numpy as np

scores_array = np.array([81, 76, 88, 79, 91, 74, 69, 85])
boosted_scores = scores_array + 3
score_deviation = scores_array - scores_array.mean()

print("Original scores:", scores_array)
print("Boosted scores:", boosted_scores)
print("Deviation from mean:", np.round(score_deviation, 2))
Original scores: [81 76 88 79 91 74 69 85]
Boosted scores: [84 79 91 82 94 77 72 88]
Deviation from mean: [  0.62  -4.38   7.62  -1.38  10.62  -6.38 -11.38   4.62]
# Example 4: create a boolean mask and summarize a subset
attendance_array = np.array([4, 5, 4, 2, 6, 3, 1, 5])
consistent_attendance = attendance_array >= 4

print("Attendance values:", attendance_array)
print("Consistent attendance mask:", consistent_attendance)
print("Scores for participants with attendance >= 4:", scores_array[consistent_attendance])
print("Mean score (all participants):", round(np.mean(scores_array), 2))
print("Standard deviation (all participants):", round(np.std(scores_array), 2))
Attendance values: [4 5 4 2 6 3 1 5]
Consistent attendance mask: [ True  True  True False  True False False  True]
Scores for participants with attendance >= 4: [81 76 88 91 85]
Mean score (all participants): 80.38
Standard deviation (all participants): 6.93

2.2. Boolean Masks and When NumPy Helps More Than pandas

A boolean mask is just a True/False pattern used to select values that meet a rule. NumPy is especially helpful when your data is mostly numeric and you want fast, repeated calculations. pandas is still better when the labels matter and you care about named columns or mixed data types.

So here is the practical rule of thumb: use pandas when you are reasoning about a table; use NumPy when you are reasoning about arrays of numbers. In the next section, we will build on that idea by writing explicit decision rules with conditionals and loops.
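
A minimal sketch of how the two libraries can work together, assuming a made-up table (toy_df, the cutoffs, and the "keep"/"review" labels are hypothetical). The numeric columns are pulled out as arrays, a combined mask is built in NumPy, and the result is written back to the table so it stays attached to the ids:

```python
import numpy as np
import pandas as pd

# Hypothetical table: ids and mixed types live comfortably in pandas
toy_df = pd.DataFrame({
    "id": ["A1", "A2", "A3", "A4"],
    "score": [81, 76, 88, 69],
    "attendance": [4, 5, 2, 6],
})

# Pull the numeric columns out as plain arrays for array-style work
scores = toy_df["score"].to_numpy()
attendance = toy_df["attendance"].to_numpy()

# Combine two rules into one mask with & (element-wise "and")
mask = (scores >= 75) & (attendance >= 4)

# np.where picks the first label where the mask is True, the second where it is False
labels = np.where(mask, "keep", "review")

# Write the result back into the table
toy_df["decision"] = labels

print(toy_df)
print("Mean score of kept rows:", scores[mask].mean())
```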

Section 3: Rules and Automation with Conditionals and Loops

So far, we have inspected data, filtered records, and applied repeated numeric operations. But real research workflows often require rules: include this participant, exclude that one, flag this record, or label a case for follow-up. That is where conditionals and loops become central.

3.1. Conditional Logic

Conditional logic means making a decision based on a rule. In Python, this usually appears with if, elif, and else. This is the language of inclusion criteria, exclusion criteria, quality flags, and many other research decisions.

3.2. Loops

A loop lets you apply the same pattern across many values or many records. This is one of the first places where code starts to feel like automation instead of organized typing.

This section focuses on three especially useful loop patterns:

  • range() when you want repeated steps tied to positions or counts

  • enumerate() when you want both the item and its position

  • zip() when you want to move through multiple lists together

You do not need to memorize every loop pattern today. The goal is to understand what each pattern is good for and why it can simplify a repeated research task.

# Example 5: write a decision rule with if, elif, and else
participant = {"id": "P04", "age": 21, "consent": True, "attendance": 2}

if participant["age"] < 18:
    status = "Exclude: under 18"
elif not participant["consent"]:
    status = "Exclude: no consent"
elif participant["attendance"] < 3:
    status = "Follow up: low attendance"
else:
    status = "Eligible"

print(participant["id"], "->", status)
P04 -> Follow up: low attendance
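
One detail of if/elif/else is easy to miss: Python checks the branches from top to bottom and runs only the first one whose condition is true. The sketch below uses a hypothetical record that fails two rules at once to show that only the earlier one is reported.

```python
# Hypothetical record that fails two rules: no consent AND low attendance
record = {"id": "X01", "age": 30, "consent": False, "attendance": 1}

if record["age"] < 18:
    reason = "under 18"
elif not record["consent"]:
    reason = "no consent"          # first true condition: this branch runs
elif record["attendance"] < 3:
    reason = "low attendance"      # also true, but never reached
else:
    reason = "none"

print(record["id"], "-> first failed rule:", reason)
```

The order of the branches is therefore part of the rule. If low attendance should take priority over missing consent, its branch has to come first.
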
# Example 6: compare range, enumerate, and zip for repeated work
participant_ids = ["P01", "P02", "P03"]
attendance_values = [4, 5, 2]
score_values = [81, 76, 88]

print("Using range()")
for index in range(len(participant_ids)):
    print(index, participant_ids[index], attendance_values[index])

print("\nUsing enumerate()")
for position, participant_id in enumerate(participant_ids, start=1):
    print(position, participant_id)

print("\nUsing zip()")
for participant_id, attendance, score in zip(participant_ids, attendance_values, score_values):
    print(participant_id, "-> attendance:", attendance, ", score:", score)
Using range()
0 P01 4
1 P02 5
2 P03 2

Using enumerate()
1 P01
2 P02
3 P03

Using zip()
P01 -> attendance: 4 , score: 81
P02 -> attendance: 5 , score: 76
P03 -> attendance: 2 , score: 88
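
Loops become most useful once a conditional sits inside them. The sketch below combines zip() with an if statement and two accumulators, a list and a counter. The data and the passing_threshold cutoff are hypothetical and chosen only for this example.

```python
participant_ids = ["P01", "P02", "P03", "P04"]
score_values = [81, 58, 88, 62]

passing_threshold = 65   # hypothetical cutoff, for illustration only
flagged_ids = []         # accumulator: grows as the loop finds matches
flagged_count = 0        # counter: must start at zero before the loop

for participant_id, score in zip(participant_ids, score_values):
    if score < passing_threshold:
        flagged_ids.append(participant_id)
        flagged_count += 1

print("Flagged:", flagged_ids)
print("Flagged total:", flagged_count)
```

The pattern is always the same: initialize the accumulator before the loop, update it inside the conditional, and report it after the loop ends.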

Checkpoint 3

Use conditionals and a loop to label each participant as "Eligible" or "Not eligible" using this rule:

  • age >= 18

  • consent == True

  • attendance >= 3

Tasks:

  • Print each participant’s status.

  • Count how many are eligible.

  • Print the final eligible count.

  • Bonus: if a participant is not eligible, print the first reason they fail the rule.

# Checkpoint 3: implement the rule and verify the final count
participants = [
    {"id": "S01", "age": 18, "consent": True,  "attendance": 4},
    {"id": "S02", "age": 16, "consent": True,  "attendance": 5},
    {"id": "S03", "age": 22, "consent": False, "attendance": 2},
    {"id": "S04", "age": 19, "consent": True,  "attendance": 3}
]

eligible_count = 0

# Write your loop and decision rule here

print("Eligible total:", eligible_count)
Eligible total: 0

Section 4: Reusable Research Helpers with Functions

By now, you have seen the same logic appear more than once: classify a score, check eligibility, convert a value, apply a rule across several records. When the same pattern appears repeatedly, it is usually a sign that a function can help.

4.1. Why Functions Help Automation

A function lets you package a rule into a named unit of code. That makes your workflow easier to test, easier to reuse, and much easier to revise when your criteria change. Instead of rewriting the same rule in five places, you update it once.
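
A small sketch of "update it once", assuming a hypothetical helper named meets_attendance with a minimum parameter (neither appears elsewhere in this notebook). The rule is written once, and a change in criteria becomes a change in one argument:

```python
def meets_attendance(attendance, minimum=3):
    """Return True when attendance reaches the required minimum."""
    return attendance >= minimum

attendance_values = [4, 5, 2, 3]

# The rule is written once and reused for every value
current_rule = [meets_attendance(a) for a in attendance_values]

# If the criterion changes, only the argument changes, not the logic
stricter_rule = [meets_attendance(a, minimum=4) for a in attendance_values]

print("minimum=3:", current_rule)
print("minimum=4:", stricter_rule)
```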

4.2. Functions with Inputs and Outputs

Good beginner functions usually do one clear thing: they take an input, apply a rule, and return an output. That sounds almost disappointingly simple, which is why it is so useful.

4.3. Defensive Helpers for Messy Data

Real data loves surprises. A value that should be numeric may arrive as text. A field may be missing. A label may contain something you did not expect. Defensive coding means writing helpers that fail gracefully instead of breaking the entire workflow.

Example 7, at the end of this section, shows both ideas: one function packages a classification rule, and another handles messy conversion more safely.

Checkpoint 4

Write a function eligibility_label(age, consent, attendance) that returns either "Eligible" or "Not eligible".

Then test at least four cases and briefly explain one edge case where writing the rule explicitly helps you avoid confusion.

# Example 7: package repeated logic into reusable functions
def classify_score(score):
    if score >= 90:
        return "High"
    elif score >= 75:
        return "Moderate"
    return "Needs support"

def to_int_safe(text):
    try:
        return int(text)
    except ValueError:
        return None

for score in [92, 81, 66]:
    print(score, "->", classify_score(score))

raw_values = ["18", "unknown", "21"]
converted_values = [to_int_safe(value) for value in raw_values]
print("Converted values:", converted_values)
92 -> High
81 -> Moderate
66 -> Needs support
Converted values: [18, None, 21]
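
to_int_safe catches ValueError, which covers text that is not a number. A missing value stored as None raises a different error, TypeError, and would still stop the workflow. The sketch below is one possible extension that catches both; the name to_int_or_none and the sample values are hypothetical.

```python
def to_int_or_none(value):
    """Convert value to int; return None when conversion is not possible."""
    try:
        return int(value)
    except (ValueError, TypeError):
        # ValueError: text such as "unknown" or ""
        # TypeError: values such as None
        return None

raw_values = ["18", None, "unknown", " 21 ", ""]
converted_values = [to_int_or_none(value) for value in raw_values]

# Counting the failures tells you how much data the conversion lost
n_failed = converted_values.count(None)

print("Converted values:", converted_values)
print("Could not convert:", n_failed, "of", len(raw_values))
```

Reporting the number of failed conversions is a useful habit, because a helper that fails silently can hide how many records dropped out of the analysis.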

Weekly Challenge: Build a Tiny Automation Pipeline

Now combine the main ideas from this notebook.

Your task is to build a small workflow that:

  • converts raw text values to usable numbers when needed,

  • applies an eligibility rule,

  • uses a loop to process many participant records,

  • uses at least one function so the logic is reusable,

  • prints a short summary report.

Keep the code clear enough that a collaborator could read it without needing a decoder ring from a secret programming society.

# Weekly Challenge starter: combine conversion, filtering, looping, and functions
research_question = "TODO"

participants = [
    {"id": "P01", "age": "19", "consent": True,  "attendance": "4", "score": "81"},
    {"id": "P02", "age": "17", "consent": True,  "attendance": "5", "score": "76"},
    {"id": "P03", "age": "22", "consent": False, "attendance": "4", "score": "88"},
    {"id": "P04", "age": "21", "consent": True,  "attendance": "2", "score": "79"},
    {"id": "P05", "age": "24", "consent": True,  "attendance": "6", "score": "91"}
]

# Define at least one helper function below, then use a loop to build your summary.
# Suggested outputs: eligible IDs, eligible count, and mean eligible score.