Python Essentials for Research in Behavioral Sciences: Introduction to Data in Python

Welcome to the first notebook of the course. In this course, you will work through several notebooks that help you understand concepts, pick up practical skills, and use Python to investigate your own research questions. Some notebooks will teach core ideas, and some will serve as blueprints that you can adapt to your own projects.

The goal is not only to learn syntax or run code to get outputs. The goal is to develop the knowledge, skills, and habits required to solve novel problems as a research programmer: collecting or generating data and making sense of it by following a data science workflow (import, clean, explore, analyze, visualize, report).

However, this is NOT a course on learning Python in detail. It teaches the essentials you need to use the vast possibilities of Python for research in Behavioral Sciences.

If you are a beginner and want a good grasp of Python, there are several excellent and free “Python for Beginners” courses available online. My current favorites are the 8-hour “Crash Course on Python” by Google and “Python for Beginners” by Microsoft. Another wonderful resource is learnpython.org.

How to Use This Notebook Well

Treat this notebook as training for scientific reasoning with code, not as a set of isolated coding drills.

Run cells in order.
Before each code cell, predict what you expect to happen.
After execution, compare expectation vs output.
At each checkpoint, write one sentence explaining what your code demonstrates.

These habits will help you debug faster, justify your analytic choices, and scale your workflow to larger datasets.

Bite-size Weekly Goals for LS100

Each notebook in our course has several sections. If you are taking LS100, I highly recommend progressing through one section a day and one notebook a week, especially in the first month. The first four notebooks are designed to build the foundation you will need for conducting large-scale, cutting-edge research that includes automation, data analysis, and machine learning workflows.

For this week, the goal is to start using Python through some examples, enabling you to create a summary pipeline that is clear and reusable.

Time Plan (4-6 hours a week)

This notebook has four sections, and each section is designed to take about 30-40 minutes for beginners. The notebook ends with a Weekly challenge and reflection that may take 60-75 minutes.

Before we start, let’s check the interpreter, the program that executes our Python code.

import sys
# The "sys" module gives access to system-specific parameters and functions.
# It is part of Python's *standard library* (built-in tools that come with Python)

# Check the version of Python. For our work, you would want a version 3.XY.Z.
print("Python Version:", sys.version) 

# sys.executable gives the full path to the Python program (interpreter) that is being used.
# This tells where Python is installed on your computer.
print("Python executable path:", sys.executable)

Python Version: 3.12.12 (main, Oct  9 2025, 11:07:00) [Clang 16.0.0 (clang-1600.0.26.6)]
Python executable path: /Users/souvikmandal/python_venv/StatEnv/bin/python

Check whether the Python version and the Python executable path match your expectation. If not, recheck which Python environment you are using.
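If you would rather let Python do this comparison than read the output by eye, one option is to compare `sys.version_info` against a minimum version. This is only a minimal sketch: the threshold `(3, 9)` and the variable names are assumptions for illustration, not a course requirement.

```python
import sys

# Hypothetical minimum version, chosen only for illustration.
minimum_version = (3, 9)

# sys.version_info behaves like a tuple: (major, minor, micro, ...).
current_version = tuple(sys.version_info[:3])

# Tuples are compared element by element, so this works like a version comparison.
version_ok = current_version >= minimum_version

print("Current version:", current_version)
print("Meets the minimum version:", version_ok)
```

If the second line prints False, that is a hint that the notebook is running in a different environment than you intended.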

Section 1: Variables, Built-In Functions, Data Types

While Python is used for many computational tasks, from data operations to web development to robotics, for our purposes it is ideal to begin learning Python from the viewpoint of handling and analyzing data.

Data is one of the most important materials for quantitative research. In this notebook, we will start understanding data in the “Pythonic way”: representing data, ideas, assumptions, and measurements as quantifiable entities using programming. Let’s start by understanding variables.

1.1. Concept: Variables

A variable is a named reference to a value, such as text, a number, or a boolean; the name acts as a pointer to an object stored in memory. Storing data as variables makes our work explicit, reproducible, and automatable. We can assign a value to a variable using = (for example: age_participant_01 = 21).

Key Characteristics:

Symbolic Labels: Think of a variable as a labeled box or tag attached to a value (like a number or a string).
Dynamic Typing: You don’t have to specify what kind of data (type) a variable will hold. Python automatically determines the type based on the value you give it, and this can change if you reassign it to something else later.
Case Sensitivity: Variable names are case-sensitive, meaning myVar and myvar are treated as two different variables.
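The two characteristics that tend to surprise beginners, dynamic typing and case sensitivity, can be seen in a few lines. This is a small sketch; the variable names `response`, `myVar`, and `myvar` are made up for illustration.

```python
# Dynamic typing: the same name can point to values of different types over time.
response = 21
type_before = type(response).__name__   # 'int'

response = "twenty-one"                 # reassign the same name to a string
type_after = type(response).__name__    # 'str'

print("Type before:", type_before)
print("Type after:", type_after)

# Case sensitivity: these are two separate variables.
myVar = 1
myvar = 2
print("myVar:", myVar, "| myvar:", myvar)
```

Reassigning a name to a different type is legal, but in research code it is usually safer to keep one meaning and one type per variable name.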

Practical Naming Habits:

Now that we have understood variables, the next step is to name variables in a way that keeps our code clear, readable, and reusable. Naming is not decoration; naming is part of scientific clarity.

Use descriptive names: sample_size is better than x.
Write variable names in lowercase and connect words with an underscore (_), such as sample_size or avg_study_hours.
Avoid vague names like data1, thing2, or value3.

If a collaborator opens this notebook after three months, they should still understand what each variable represents.

First code

Alright, I hope you are ready for the first few lines of code in Python. :)

In the next two code cells, you will create and inspect a simple study setup. As you run them, notice how clear variable names make the output from print() and type() easier to interpret.

# Example 1: encode a simple study setup as explicit variables
name_researcher = "Souvik Mandal"
research_question = "Is time in grad school positively correlated with expertise and negatively correlated with mental health?"
sample_size = 480
avg_years_spent = 5.75
pilot_complete = True

1.2. Concept: Built-In Functions

In simple words, functions are like equations: they take one or more inputs, follow a set of instructions or rules, and give outputs. Python comes with a set of built-in functions that are already written, tested, and ready to use.

These functions are part of Python’s core library. Using them saves us from writing everything from scratch; we simply call them by name with inputs.

Examples include print() for displaying values, len() for counting items, and type() for checking what kind of data a value stores.

Key Characteristics:

Pre-written and Ready to Use: You call a built-in function by writing its name followed by parentheses with inputs, such as print("Hello") or len([1, 2, 3]).
Consistent Behavior: Built-in functions are standardized, so they behave predictably. This consistency is crucial for reproducible research code.
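Beyond print(), len(), and type(), a handful of other built-in functions already cover simple descriptive summaries. The sketch below uses a made-up list called `quiz_scores`; the values are illustrative only.

```python
# Hypothetical quiz scores, used only for illustration.
quiz_scores = [72, 81, 90]

n_scores = len(quiz_scores)      # number of items in the list
total = sum(quiz_scores)         # add all items together
lowest = min(quiz_scores)        # smallest item
highest = max(quiz_scores)       # largest item
mean_rounded = round(total / n_scores, 1)   # mean, rounded to 1 decimal place

print("n:", n_scores)
print("total:", total)
print("range:", lowest, "to", highest)
print("mean:", mean_rounded)
```

Each call follows the same pattern: the function name, then the input inside parentheses.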

Accessing stored data - Understanding print()

As you write code to analyze data, you should pause and inspect your intermediate results. The print() function displays values in the output so you can check what your code is producing step by step.

Before trusting final results, verify intermediate values first. This habit helps you detect mistakes early and build reliable research code.

Now, let’s print the variables we just saved.

We can simply use the command print(variable_name)
Or, we can add a string describing what we are printing, written within single or double quotes.
We can also use \n to add a line break.
We can use f-string formatting as well.

print(name_researcher)
print("\n")
print("Name of researcher:", name_researcher)
print("Research question:", research_question)
print("Sample size:", sample_size)
print("Average years spent:", avg_years_spent)
print("Pilot complete:", pilot_complete, "\n")

print("-" * 50)
print("Now, let's do an f-string for better formatting.")
print("-" * 50, "\n")
print(f"Just to remind you, {name_researcher} is asking: \"{research_question}\" "
    f"with a sample size of {sample_size}, "
    f"an average of {avg_years_spent} years spent, "
    f"and pilot complete status of {pilot_complete}.")

Souvik Mandal


Name of researcher: Souvik Mandal
Research question: Is time in grad school positively correlated with expertise and negatively correlated with mental health?
Sample size: 480
Average years spent: 5.75
Pilot complete: True 

--------------------------------------------------
Now, let's do an f-string for better formatting.
-------------------------------------------------- 

Just to remind you, Souvik Mandal is asking: "Is time in grad school positively correlated with expertise and negatively correlated with mental health?" with a sample size of 480, an average of 5.75 years spent, and pilot complete status of True.

Accessing the length of the stored data using len()

Now, let’s check how long the name of the researcher and the research question are, using the built-in len() function; for a string variable, it counts the number of characters.

print("Length of the name of researcher: ", len(name_researcher))
print("Length of the research question: ", len(research_question))

Length of the name of researcher:  13
Length of the research question:  105

However, a variable containing one integer does not have any length, so running len(sample_size) will give a TypeError. Try it anyway.
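If you want to see the error without stopping the notebook, you can wrap the call in try/except (a construct we have not formally introduced yet, so treat this as a preview sketch). The value of `sample_size` is repeated here so the cell runs on its own.

```python
sample_size = 480  # same value as in the study setup above

try:
    len(sample_size)                 # an int has no length
except TypeError as error:
    error_message = str(error)
    print("TypeError:", error_message)

# If you really want the number of digits, convert to text first.
digits = len(str(sample_size))
print("Number of digits:", digits)
```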

But you can certainly check the length of a data structure with several integers.

test_data_structure = (1, 2, 3, 1)
len(test_data_structure)

4

1.3. Concept: Data Types

A data type tells Python what kind of value a variable stores.

Python has many data types. Core built-in types that we will commonly encounter include:

int (integer): whole numbers, such as 48
float (floating-point number): numbers with decimals, such as 5.75
str (string): text, such as "participant_01" or "Hello world!"
bool (boolean): True or False
bytes, bytearray, memoryview: binary data types
range: sequence of integers, often used in loops
NoneType: represents no value (None)

Python can also have user-defined data types (for example, custom classes).

In this notebook, we will focus mostly on int (integer), float (floating-point number), str (string), bool (boolean), list, and dict because these are the most common data types in behavioral research workflows.

If a value looks numeric but is actually stored as text, your analysis step can fail or produce misleading results.
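A quick way to see how misleading this can be is to ask for the largest value in a list of numbers stored as text. The two lists below are made-up examples: Python compares strings character by character, so the text "9" ranks above "100".

```python
# Hypothetical scores, once stored as text and once as numbers.
scores_as_text = ["9", "10", "100"]
scores_as_numbers = [9, 10, 100]

largest_text = max(scores_as_text)         # compared alphabetically
largest_number = max(scores_as_numbers)    # compared numerically

print("Largest (text):", largest_text)
print("Largest (numbers):", largest_number)
print("Sorted text:", sorted(scores_as_text))
```

No error is raised here, which is exactly why this kind of problem is easy to miss.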

Checking data type

# Checking the data type of one variable, say 'research_question'
type(research_question)

str

Now, let’s try different print methods to inspect types of the variables.

print(f"Data type for name of researcher: -> {type(name_researcher)}")
# To get just the type names, we can use the __name__ attribute of the type objects.
print(f"Data type for name of researcher: -> {type(name_researcher).__name__}")

print("\n")  # Add a blank line to separate the outputs visually.
print("Data type for sample size: ->", type(sample_size))
print("Data type for average years spent: ->", type(avg_years_spent))
print(f"Data type for pilot complete: -> {type(pilot_complete).__name__}")

Data type for name of researcher: -> <class 'str'>
Data type for name of researcher: -> str


Data type for sample size: -> <class 'int'>
Data type for average years spent: -> <class 'float'>
Data type for pilot complete: -> bool
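Besides type(), Python offers the built-in isinstance(), which answers a yes/no question about a value's type. The sketch below repeats three of the study variables so it runs on its own. One detail worth knowing: in Python, bool is a subclass of int, so isinstance(True, int) is True.

```python
sample_size = 480
avg_years_spent = 5.75
pilot_complete = True

is_int = isinstance(sample_size, int)
is_number = isinstance(avg_years_spent, (int, float))  # accepts either type
bool_counts_as_int = isinstance(pilot_complete, int)   # True, because bool subclasses int
is_exactly_bool = type(pilot_complete) is bool          # strict check

print("sample_size is int:", is_int)
print("avg_years_spent is a number:", is_number)
print("pilot_complete counts as int:", bool_counts_as_int)
print("pilot_complete is exactly bool:", is_exactly_bool)
```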

Mini Challenge 1

Create a one-cell research card using f-strings. Include topic, target sample, expected effect, and whether IRB review is required.

Think of this as your first automation for communicating study setup quickly and clearly to collaborators.

Checkpoint 1

Create four variables for your own project:

project_topic (str)
target_sample (int)
expected_effect (float)
irb_required (bool)

Then print each value and type.

# Checkpoint 1: replace placeholders with project-specific values
project_topic = "TODO"
target_sample = 0
expected_effect = 0.0
irb_required = False

print("project_topic ->", project_topic, type(project_topic).__name__)
print("target_sample ->", target_sample, type(target_sample).__name__)
print("expected_effect ->", expected_effect, type(expected_effect).__name__)
print("irb_required ->", irb_required, type(irb_required).__name__)

project_topic -> TODO str
target_sample -> 0 int
expected_effect -> 0.0 float
irb_required -> False bool

1.4. Data Type Conversion

Real research data often arrives as text, and working with multiple records is central to scaling your workflow. Type conversion means changing a value from one data type to another. For example, if a numeric value is stored as text, we can convert it using int() or float().

Reliable conversion is one of the first steps in any data-cleaning workflow.
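Conversion does not always succeed, and it helps to know the common failure cases early. The sketch below uses made-up text values; try/except is a preview of a construct covered later, used here only so the cell keeps running when a conversion fails.

```python
# float() tolerates surrounding whitespace.
years = float(" 6.25 ")

# int() rejects text that contains a decimal point...
try:
    int("91.0")
    int_direct_ok = True
except ValueError:
    int_direct_ok = False

# ...but converting through float() first works.
score = int(float("91.0"))

# Placeholders for missing data, such as "N/A", cannot be converted at all.
try:
    float("N/A")
    placeholder_ok = True
except ValueError:
    placeholder_ok = False

print("years:", years)
print("int('91.0') works directly:", int_direct_ok)
print("score via float:", score)
print("float('N/A') works:", placeholder_ok)
```

When a conversion fails, Python raises a ValueError instead of guessing, which is the behavior you want in a data-cleaning step.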

In the next code cells, let’s inspect raw text values, convert them to numeric form, and then compute a simple metric.

# Example 1: inspect raw values and confirm they are text before conversion
raw_years = str(6.25)
raw_score = "91"

print(f"Data for years: {raw_years}, Type: {type(raw_years).__name__}")
print(f"Data for score: {raw_score}, Type: {type(raw_score).__name__}")

Data for years: 6.25, Type: str
Data for score: 91, Type: str

# Example 2: convert raw text fields into analysis-ready numbers
raw_years = str(6.25)
raw_score = "91"

converted_years = float(raw_years)
converted_score = int(raw_score)

print(f"Converted years: {converted_years}, Type: {type(converted_years).__name__}")
print(f"Converted score: {converted_score}, Type: {type(converted_score).__name__}")

Converted years: 6.25, Type: float
Converted score: 91, Type: int

Now you are ready to do numeric calculations with the attributes, such as comparing the years in grad school across different participants or calculating the mean. See the example below.

records = [
    {"id": "P01", "age": 22, "consent": True, "attendance": 5, "score": "88"},
    {"id": "P02", "age": 19, "consent": True, "attendance": 4, "score": "91"},
    {"id": "P03", "age": 17, "consent": False, "attendance": 2, "score": "77"},
]

score_p01 = records[0]["score"]
score_p02 = records[1]["score"]

print("Data type of records:", type(records).__name__)
print(f"Raw score for P01: {score_p01}, Type: {type(score_p01).__name__}")

Data type of records: list
Raw score for P01: 88, Type: str

The “score” data is stored as a string. Let’s try to get the sum of the values.

sum_scores = score_p01 + score_p02
sum_scores

'8891'

Well, it just joined the scores one after another, treating them as strings. Let’s convert them to decimal numbers (float).

score_p01 = float(records[0]["score"])
score_p02 = float(records[1]["score"])
score_p03 = float(records[2]["score"])

sum_scores = score_p01 + score_p02 + score_p03
mean_score = sum_scores / len(records)
print(f"Sum of scores: {sum_scores}")
print(f"Mean score: {mean_score}")

# Example 3: round the mean score to 2 decimal places for better readability
print(f"Mean score rounded to 2 decimal places: {mean_score:.2f}")

Sum of scores: 256.0
Mean score: 85.33333333333333
Mean score rounded to 2 decimal places: 85.33

Mini Challenge 2

You receive participant fields as text values: age, stress_score, and sessions_attended.

Tasks:

Convert each field to the correct numeric type.
Print each value and its data type.
Print one short interpretation sentence using your converted values.

The output should look like below.

Participant summary: 
 age = 22, 
 stress_score = 38.6, 
 sessions attended = 9

Data types:
 age: (int), 
 stress_score: (float), 
 sessions_attended: (int)

This challenge strengthens the habit of proving your data format before analysis.

# Your code here.

I know it can feel a bit annoying to write similar lines again and again. Good news is that Python is very well-equipped to kill boring repetition. We will learn many cleaner solutions for repeated operations in the next notebook.

Section 2: Data Structures

In Section 1, you practiced storing values and checking their types with in-built functions. In this section, we will go one step further: organizing multiple values into structured collections.

2.1. Basic Data Structures

Data structures are containers that hold multiple values organized in a specific way. Structures we will commonly encounter are:

list: ordered, editable collection, such as [72, 81, 90]. Can store any type of data (numbers, strings, other lists, even a mix).
- Examples: [1, 2, 3, 1], ["apple", "banana", "apple"], [1, "two", [3, 4]]
tuple: ordered, non-editable collection, such as (72, 81, 90). Can store any type of data, but is immutable (cannot be changed after creation).
- Examples: (1, 2, 3), (True, "yes", 3.14), ([1, 2], "nested")
dict (dictionary): key-value mapping, such as {"id": "P01", "age": 22}. Keys are usually strings (or numbers), values can be any type (numbers, strings, lists, etc.).
- Examples: {"name": "Alice", "score": 95}, {1: [1, 2, 3], "valid": True}
set: unordered collection of unique values, such as {"A", "B"}. Can store any immutable type (numbers, strings, tuples).
- Examples: {1, 2, 3}, {"apple", "banana"}, {(1, 2), (3, 4)}

Structures let you group related values and pass them efficiently through your research pipeline.
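The practical differences between these four structures show up as soon as you try to change them. Here is a short sketch with made-up values; the try/except around the tuple is only there so the cell keeps running.

```python
# list: ordered and editable
scores = [72, 81, 90]
scores.append(85)                 # lists can grow after creation

# tuple: ordered but not editable
score_range = (0, 100)
try:
    score_range[0] = 10           # attempting to edit a tuple raises TypeError
    tuple_is_editable = True
except TypeError:
    tuple_is_editable = False

# dict: look up and update values by key
record = {"id": "P01", "age": 22}
record["age"] = 23

# set: duplicates are dropped automatically
study_groups = {"control", "intervention", "control"}

print("scores:", scores)
print("tuple editable:", tuple_is_editable)
print("record:", record)
print("number of unique groups:", len(study_groups))
```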

# Example: represent research information with common data structures

# Global variables - variables that are defined at the top level of a script and can be accessed throughout the code
sanity_level = ["sane", "borderline", "insane", "nirvana"]
study_groups = {"control", "intervention"}
score_range = (0, 100)

# Variables attached to one participant - these could be part of a larger data structure like a list of participants or a dictionary of participant records
name_participant_01 = "John Doe"
id_participant_01 = "P01"
age_participant_01 = 25
consent_participant_01 = True
year_spent = 3
scores_participant_01 = [72, 81, 90]

record_participant_01_manual = {"name": "John Doe", 
                         "id": "P01", 
                         "age": 25, 
                         "consent": True,
                         "year_spent": 3,
                         "scores": [72, 81, 90],
                         "sanity_level": "sane",
                         "study_groups": "control"
                         } 
# We don't need to enter each attribute in a separate line, but it can help with readability, especially when there are many attributes.

# Let's create a record by fetching values from previously defined variables.
record_participant_01_fetched = {"name": name_participant_01, 
                      "id": id_participant_01, 
                      "age": age_participant_01, 
                      "consent": consent_participant_01, 
                      "year_spent": year_spent,
                      "scores": scores_participant_01,
                      "sanity_level": sanity_level[0],  # Assuming John is still sane.
                      #"study_groups": study_groups[0] # We can't index a set, so this would be an error. We would need to convert it to a list first if we wanted to fetch an element by index.
                      }


print("Type check:\n")
print("Variable name: scores_participant_01", "\nData Type:", type(scores_participant_01).__name__,"\nData:", scores_participant_01, "\n")
print("Variable name: sanity_level", "\nData Type:",type(sanity_level).__name__, "\nData:", sanity_level, "\n")
print("Variable name: score_range", "\nData Type:",type(score_range).__name__, "\nData:", score_range, "\n")
print("Variable name: study_groups", "\nData Type:",type(study_groups).__name__, "\nData:", study_groups, "\n")

#Pay attention to the two records from the same participant but created manually vs. fetched. Both should yield the same data, but the second approach is more flexible and less error-prone, especially when dealing with larger datasets or when participant information may change.

print("Variable name: record_participant_01_manual", "\nData Type:",type(record_participant_01_manual).__name__, "\nData:", record_participant_01_manual, "\n")
print("Variable name: record_participant_01_fetched", "\nData Type:",type(record_participant_01_fetched).__name__, "\nData:", record_participant_01_fetched, "\n")

Type check:

Variable name: scores_participant_01 
Data Type: list 
Data: [72, 81, 90] 

Variable name: sanity_level 
Data Type: list 
Data: ['sane', 'borderline', 'insane', 'nirvana'] 

Variable name: score_range 
Data Type: tuple 
Data: (0, 100) 

Variable name: study_groups 
Data Type: set 
Data: {'control', 'intervention'} 

Variable name: record_participant_01_manual 
Data Type: dict 
Data: {'name': 'John Doe', 'id': 'P01', 'age': 25, 'consent': True, 'year_spent': 3, 'scores': [72, 81, 90], 'sanity_level': 'sane', 'study_groups': 'control'} 

Variable name: record_participant_01_fetched 
Data Type: dict 
Data: {'name': 'John Doe', 'id': 'P01', 'age': 25, 'consent': True, 'year_spent': 3, 'scores': [72, 81, 90], 'sanity_level': 'sane'}

2.2. Compound and Nested Data Structures

Welcome to the next stage, one step closer to real-world data: compound data structures! A common pattern in research studies is to store a single observation, or one unit of records, in one data structure, like a dictionary with key-value pairs, and then to store all records together in another, like a list. Such datasets often hold multiple attributes of different data types.

Maybe you are already thinking about Excel or CSV files, which have columns and rows. Each column contains a specific attribute (like age, score, or consent), and each row represents a single observation. Well, we are now thinking about 2-dimensional data, or tabular data.

Concept: 2D Lists

A 2D list is basically a list of lists. Let’s make a pseudo-dataset of participant records using a list of lists, a simple way to represent tabular data in Python. Pay attention to the scores: each participant’s scores are a list within a list, a nested data structure.

# Example: 2D list representing participant data
participants_records_step00 = [
    ["P01", "John Doe", 25, True, 3, [72, 81, 90]],
    ["P02", "Shanti Murmu", 24, True, 3, [91, 85, 78]],
    ["P03", "Bahar Shirazi", 27, False, 5, [82, 79, 77, 88, 95]],
]
participants_records_step00

[['P01', 'John Doe', 25, True, 3, [72, 81, 90]],
 ['P02', 'Shanti Murmu', 24, True, 3, [91, 85, 78]],
 ['P03', 'Bahar Shirazi', 27, False, 5, [82, 79, 77, 88, 95]]]

Suppose we want to access the record of the first participant, which is the first item in the list of records. In Python, list indexing starts at 0, so we use participants_records_step00[0] to access the first participant’s record, participants_records_step00[1] for the second participant’s record, and so on.

print("First participant record:", participants_records_step00[0])

First participant record: ['P01', 'John Doe', 25, True, 3, [72, 81, 90]]

Now, if we want to access one particular item from one of the lists, we simply add the index of that item in square brackets after the index of the list. See below.

print("Score of second participant:", participants_records_step00[1][5])

Score of second participant: [91, 85, 78]
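Indexing can be chained one level further to reach a single score inside the nested list. The sketch below repeats a shortened copy of the records (named `participants` here, an illustrative name) so it runs on its own.

```python
# Shortened copy of the 2D list above, so this cell is self-contained.
participants = [
    ["P01", "John Doe", 25, True, 3, [72, 81, 90]],
    ["P02", "Shanti Murmu", 24, True, 3, [91, 85, 78]],
]

second_scores = participants[1][5]       # the whole list of scores
first_score = participants[1][5][0]      # first score inside that list
mean_second = sum(second_scores) / len(second_scores)

print("All scores of P02:", second_scores)
print("First score of P02:", first_score)
print(f"Mean score of P02: {mean_second:.2f}")
```

Read the brackets from left to right: participant at position 1, then item at position 5 in that record, then item at position 0 in the scores.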

Now, there is a problem: we have to remember what kind of data is stored in which position of each list. Here, the dictionary comes to the rescue. We can store one participant’s record with its attributes, like ‘id’, ‘name’, ‘age’, ‘consent’, ‘year_spent’, and ‘score’, in one dictionary, and keep all dictionaries in one list.

participants_records_step01 = [
    {"id": "P01", "name": "John Doe", "age": 25, "consent": True, "year_spent": 3, "score": [72, 81, 90]},
    {"id": "P02", "name": "Shanti Murmu", "age": 24, "consent": True, "year_spent": 3, "score": [91, 85, 78]},
    {"id": "P03", "name": "Bahar Shirazi", "age": 27, "consent": True, "year_spent": 5, "score": [82, 79, 77, 88, 95]},
]
participants_records_step01

[{'id': 'P01',
  'name': 'John Doe',
  'age': 25,
  'consent': True,
  'year_spent': 3,
  'score': [72, 81, 90]},
 {'id': 'P02',
  'name': 'Shanti Murmu',
  'age': 24,
  'consent': True,
  'year_spent': 3,
  'score': [91, 85, 78]},
 {'id': 'P03',
  'name': 'Bahar Shirazi',
  'age': 27,
  'consent': True,
  'year_spent': 5,
  'score': [82, 79, 77, 88, 95]}]

Accessing all the attributes of the first participant’s record

participants_records_step01[0]

{'id': 'P01',
 'name': 'John Doe',
 'age': 25,
 'consent': True,
 'year_spent': 3,
 'score': [72, 81, 90]}

Accessing one attribute of a participant’s record.

participants_records_step01[0]["score"]

[72, 81, 90]

You can already find some similarity of an Excel file with the list. Each key in the dictionary acts like a column header (attribute name), whereas the values are the actual data for each participant. This structure allows you to organize, access, and analyze your data efficiently, just like you would in a spreadsheet, but with much more flexibility and automation in Python.
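To make the column analogy concrete, you can pull out one “column” by collecting the same key from every record. The sketch below uses a shortened copy of the records and a list comprehension, a compact loop syntax that we have not formally covered yet; `records` and `ages` are illustrative names.

```python
# Shortened copy of the list of dictionaries, so this cell is self-contained.
records = [
    {"id": "P01", "name": "John Doe", "age": 25},
    {"id": "P02", "name": "Shanti Murmu", "age": 24},
    {"id": "P03", "name": "Bahar Shirazi", "age": 27},
]

# Collect the value stored under "age" from every record: one "column".
ages = [record["age"] for record in records]
mean_age = sum(ages) / len(ages)

print("Ages:", ages)
print(f"Mean age: {mean_age:.2f}")
```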

However, one problem still exists: we have to remember which participant’s record is at which position (called the ‘index’) of our list. To save us from that headache, a “nested dictionary” lets us use a unique key, like the participant id, as the label for each record (itself stored in a dictionary). This enables us to retrieve a record just by using its key.

# Example: Nested dictionary for participant records
participants_records_step02 = {
    "P01": {"name": "John Doe","age": 25, "consent": True, "year_spent": 3, "score": [72, 81, 90]},
    "P02": {"name": "Shanti Murmu", "age": 24, "consent": True, "year_spent": 3, "score": [91, 85, 78]},
    "P03": {"name": "Bahar Shirazi", "age": 27, "consent": True, "year_spent": 5, "score": [82, 79, 77, 88, 95]},
}
participants_records_step02

{'P01': {'name': 'John Doe',
  'age': 25,
  'consent': True,
  'year_spent': 3,
  'score': [72, 81, 90]},
 'P02': {'name': 'Shanti Murmu',
  'age': 24,
  'consent': True,
  'year_spent': 3,
  'score': [91, 85, 78]},
 'P03': {'name': 'Bahar Shirazi',
  'age': 27,
  'consent': True,
  'year_spent': 5,
  'score': [82, 79, 77, 88, 95]}}

Now, we can check the whole record, or one particular attribute of the record, of a participant just by remembering their id.

query_participant_id = "P01"

print(f"query participant's id: {query_participant_id}")
print(f"{query_participant_id}'s record: {participants_records_step02[query_participant_id]}")
print(f"Name: {participants_records_step02[query_participant_id]['name']}")
print(f"Year spent: {participants_records_step02[query_participant_id]['year_spent']}")

query participant's id: P01
P01's record: {'name': 'John Doe', 'age': 25, 'consent': True, 'year_spent': 3, 'score': [72, 81, 90]}
Name: John Doe
Year spent: 3
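You do not have to retype the data to move from a list of dictionaries to a nested dictionary. One possible approach, sketched here with a shortened copy of the records and a for loop (covered properly in a later notebook), is to use each record’s id as the key; `records_list` and `records_by_id` are illustrative names.

```python
# Shortened copy of the list of dictionaries, so this cell is self-contained.
records_list = [
    {"id": "P01", "name": "John Doe", "age": 25},
    {"id": "P02", "name": "Shanti Murmu", "age": 24},
]

records_by_id = {}
for record in records_list:
    attributes = dict(record)               # copy, so the original record stays unchanged
    participant_id = attributes.pop("id")   # remove "id" from the copy and keep its value
    records_by_id[participant_id] = attributes

print(records_by_id)
print("Name of P02:", records_by_id["P02"]["name"])
```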

Let’s check whether the data type of the ‘consent’ field for the first participant is boolean. Pay extra attention to the second print statement, which evaluates the equality comparison between the data_type_consent variable and the string ‘bool’, and then prints the resulting boolean value.

data_type_consent = type(participants_records_step01[0]['consent']).__name__
print("Question: Is the data type of 'consent' for P01 a \"boolean\"?")
print(f"Answer: {data_type_consent == 'bool'}")

Question: Is the data type of 'consent' for P01 a "boolean"?
Answer: True

Section 2: Data Structures - Summary:

In this section, we explored how Python’s built-in data structures—like lists, tuples, dictionaries, sets, and their nested forms—let us organize and access research data in flexible ways. These structures are powerful for small- to medium-sized datasets and help us move beyond repetitive code. However, as our data grows in size and complexity (think: hundreds of participants, dozens of variables, messy real-world data), managing everything with just built-in structures can get unwieldy and error-prone.

This is where Python libraries like pandas come to the rescue! In the next section, we’ll see how pandas and its DataFrame structure make it dramatically easier to work with tabular data—just like a supercharged spreadsheet, but with all the power and automation of Python. Whether you’re cleaning, analyzing, or sharing research data, pandas will quickly become your new favorite tool in data science.

Section 3: Python Libraries - pandas and DataFrames

Python’s true power for research comes from its rich ecosystem of pre-built libraries, collections of code that make complex tasks much easier. One of the most important libraries for data analysis is pandas.

pandas (nope, not a typo. The lower case is the styling for pandas) is a Python library designed specifically for working with tabular data (think spreadsheets, CSV files, or survey results). Its main data structure is the DataFrame, which lets you organize, clean, analyze, and visualize data efficiently—without reinventing the wheel.

With pandas, you can:

Read data from CSV, Excel (xlsx), and many other formats (json, parquet, HDF5, etc)
Clean and transform messy datasets
Filter, group, and summarize data with just a few lines of code
Merge and join multiple datasets
Export your results for sharing or further analysis

In research, pandas is your go-to tool for turning raw data into meaningful insights.

Installation

We often need to install such libraries separately. There are many ways of doing that, but we will use pip, the standard package manager for Python. It is a command-line tool used to install and manage additional libraries and dependencies that are not included in the Python standard library. Since Python versions 3.4 and 2.7.9, pip is included by default with the Python installer.

If you do not have pandas installed, uncomment the following cell and run it.

#%pip install pandas

3.1. Converting a list to a pandas dataframe

Let’s put pandas to work by converting our list “participants_records_step01” to a pandas DataFrame. Remember the data structure. We have:

several key-value pairs in a dictionary representing data from one participant,
saved in a list containing several such dictionaries.

# First, we import the pandas library to this notebook kernel and give it an alias 'pd' for easier reference.
import pandas as pd
# Create a DataFrame from a list of dicts
df_step01 = pd.DataFrame(participants_records_step01)
df_step01

Notice how the keys became the column headings. Now, let’s convert “participants_records_step02” into a DataFrame. It has a different data structure: a dictionary of dictionaries, where the id of each participant is a key and the other attributes are stored in an inner dictionary. Let’s see how this difference changes the DataFrame.

df_step02 = pd.DataFrame(participants_records_step02)
df_step02

Think about what the main difference between df_step01 and df_step02 is; we will discuss this in our next meeting. See the cell below for a hint.

df_step02.transpose()

Note: When we create a DataFrame, we are creating an instance of the pandas.DataFrame class (I know I am talking gibberish; what is an instance and a class? We will learn about them in the fourth notebook).

For now, let’s move on with the understanding that a DataFrame comes bundled with two key things:

Attributes: These are properties that describe the data, like its shape or its column names (e.g., df.shape).
Methods: These are functions built into the object that allow it to perform actions on itself, like saving to a file or calculating a mean (e.g., df.to_csv())
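The visible difference between the two is the parentheses: attributes are read without them, methods are called with them. The sketch below builds a small DataFrame from made-up records (named `df_demo` for illustration) and uses only standard pandas attributes and methods.

```python
import pandas as pd

# Small illustrative DataFrame; the values mirror the participant records above.
df_demo = pd.DataFrame([
    {"id": "P01", "age": 25, "year_spent": 3},
    {"id": "P02", "age": 24, "year_spent": 3},
    {"id": "P03", "age": 27, "year_spent": 5},
])

# Attributes: properties of the DataFrame, read without parentheses.
n_rows, n_cols = df_demo.shape
column_names = list(df_demo.columns)

# Methods: actions the DataFrame (or one of its columns) can perform, called with parentheses.
mean_age = df_demo["age"].mean()

print("Rows and columns:", n_rows, n_cols)
print("Column names:", column_names)
print(f"Mean age: {mean_age:.2f}")
```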

3.2. More Functionalities of pandas - Saving Dataframes

pandas is a powerhouse for research data wrangling! While we will talk more about pandas in a future notebook dedicated to “Data wrangling”, here are some of its most useful functionalities:

Exporting data: Save your cleaned or analyzed data with df.to_csv(), df.to_excel(), etc.
Reading data: pd.read_csv(), pd.read_excel(), pd.read_json() let you import data from various formats.
Data inspection: df.head(), df.info(), df.describe() help you quickly understand your dataset.
Data selection: Use .loc[] and .iloc[] to select rows/columns by label or position.
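Here is a brief sketch of the inspection and selection tools from the list, applied to a small made-up DataFrame (`df_demo` is an illustrative name). Because no custom index is set, the row labels used by .loc are the default integers 0, 1, 2.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "id": ["P01", "P02", "P03"],
    "age": [25, 24, 27],
    "year_spent": [3, 3, 5],
})

# Inspection
first_rows = df_demo.head(2)      # first two rows
summary = df_demo.describe()      # summary statistics for the numeric columns

# Selection
row_p01 = df_demo.loc[0]                    # whole row by label, returned as a Series
age_by_label = df_demo.loc[1, "age"]        # row label 1, column "age"
age_by_position = df_demo.iloc[1, 1]        # second row, second column

print(first_rows)
print(summary)
print("Age of P02 (by label):", age_by_label)
print("Age of P02 (by position):", age_by_position)
```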

Saving a pandas dataframe as CSV file

The to_csv() method lets you save your DataFrame as a CSV file—a common format for sharing and archiving research data on your computer. Here are some handy parameters:

path_or_buf: The filename or file object to write to (e.g., 'results.csv').
sep: The separator to use (default is ','). For tab-delimited, use sep='\t'.
index: Whether to write row indices (default is True). Set index=False to skip them.
columns: Specify a list of columns to write (default is all columns).
header: Write out column names (default is True).
encoding: File encoding (e.g., 'utf-8', 'utf-8-sig' for Excel compatibility).
na_rep: How to represent missing values (default is empty string, e.g., na_rep='NA').
mode: File write mode ('w' for write, 'a' for append).

Example:

df.to_csv('cleaned_data.csv', index=False, sep=',', encoding='utf-8', na_rep='NA')

This saves your DataFrame to a CSV file, without row indices, using commas, and writing ‘NA’ for missing values.
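If you want to see what these parameters do without creating a file, one option is to call to_csv() without a filename, in which case pandas returns the CSV text as a string. The sketch below uses a made-up DataFrame (df_demo) and combines sep, columns, index, and na_rep.

```python
import pandas as pd

# Hypothetical toy data with one missing age
df_demo = pd.DataFrame({
    "id": ["P01", "P02"],
    "age": [22, None],
    "score": [88, 91],
})

# Tab-separated, no row index, only two columns, missing values written as 'NA'
tsv_text = df_demo.to_csv(sep="\t", index=False, columns=["id", "age"], na_rep="NA")
print(tsv_text)
```

Previewing the text like this is a cheap way to check your settings before you write the real file.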

pandas has many more tricks up its sleeve, so explore the docs or just try things out! Your future self (and your research collaborators) will thank you. Below is a minimal code snippet that saves “df_step01” as a .csv file twice, once with the row index and once without, in the same directory where this notebook is saved.

df_step01.to_csv("participants_records_step01_with_idx.csv", index=True)
df_step01.to_csv("participants_records_step01_no_idx.csv", index=False)

3.3. Importing Data from CSV¶

You can also load data from a CSV file (one of the most common formats for saving data files in behavioral research, online survey tools, etc.).

# Example: Read the CSV file that was saved with the row index in the previous section
df_csv = pd.read_csv("participants_records_step01_with_idx.csv")
df_csv.head()

df_csv = pd.read_csv("participants_records_step01_no_idx.csv")
df_csv.head()
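When a CSV was saved with index=True, reading it back with the default settings turns the saved index into an extra column named “Unnamed: 0”. The sketch below reproduces this with made-up data held in memory (io.StringIO stands in for a file on disk) and shows index_col=0 as one way to restore the index.

```python
import io
import pandas as pd

# Hypothetical toy data, written to CSV text with the row index included
df_demo = pd.DataFrame({"name": ["Ana", "Ben"], "age": [22, 19]})
csv_with_idx = df_demo.to_csv(index=True)

# Default read: the saved index comes back as an extra, unnamed column
df_plain = pd.read_csv(io.StringIO(csv_with_idx))
print(list(df_plain.columns))

# index_col=0 tells pandas to use the first column as the row index instead
df_restored = pd.read_csv(io.StringIO(csv_with_idx), index_col=0)
print(list(df_restored.columns))
```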

3.4. Accessing data from a dataframe¶

Accessing a Whole Row: You can use .loc (label-based) or .iloc (integer-position based) to access an entire row. This returns the row as a Series.
Accessing One Element: To get a single cell, you provide both the row and the column separated by a comma inside the brackets [row, column].
- By Position: df.iloc[row_pos, col_pos], like df.iloc[0, 1], or df.iat[0, 1]
- By Label: df.loc[row_label, col_label], like df.loc[0, "age"] or df.at[0, "age"]

Remember that both row and column positions start at 0.

# Let's say, we want to see the value in the first row.
df_csv.iloc[0]

# Or, we want to access the row with index label 0 (which is the default index in this case).
df_csv.loc[0]

# To get the value of a specific cell, we can use .iloc[row_index, column_index].
df_csv.iloc[0,1]

'John Doe'

df_csv.at[0, 'name']

'John Doe'

Getting a row by using a value in one of the cells

df_csv.loc[df_csv["name"] == "John Doe"]

# We can use string methods to filter rows based on partial matches in the 'name' column.
# Uncomment the line below to see the True/False mask that .str.contains() produces on its own.

#df_csv['name'].str.contains('John')
df_csv[df_csv['name'].str.contains('John')]

df_csv.loc[df_csv["year_spent"] >=0]
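The filters above each use a single condition. Conditions can also be combined with & (and) and | (or), as long as each condition is wrapped in parentheses. The sketch below uses a made-up DataFrame (df_demo, with hypothetical name and age columns) rather than the course data.

```python
import pandas as pd

# Hypothetical toy data
df_demo = pd.DataFrame({
    "name": ["John Doe", "Jane Roe", "Johnny Poe"],
    "age": [25, 31, 19],
})

# Both conditions must hold: name contains 'John' AND age is at least 21
mask = df_demo["name"].str.contains("John") & (df_demo["age"] >= 21)
print(mask.tolist())
print(df_demo[mask])

# Either condition may hold: younger than 20 OR older than 30
either = df_demo[(df_demo["age"] < 20) | (df_demo["age"] > 30)]
print(either)
```

Python's plain and/or keywords do not work on whole columns, which is why pandas uses & and | here.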

Section 4. NumPy and Multidimensional Arrays¶

NumPy is Python’s go-to library for fast, efficient math with large arrays of numbers. If pandas is your spreadsheet for labeled, tabular data, NumPy is your scientific calculator for crunching numbers—especially when you need speed or want to work with multi-dimensional data (like matrices or images).

You don’t need to master NumPy now, but it’s helpful to know:

NumPy arrays are like supercharged lists for numbers.
They’re much faster and use less memory than regular Python lists for big numeric data.
You can do math on whole arrays at once (vectorized operations), which is both efficient and concise.

Let’s see NumPy in action with a few simple examples. If you do not have NumPy installed, uncomment the line in the cell below and run it.

#%pip install numpy

# Import NumPy and create a simple array
import numpy as np
scores = np.array([88, 91, 77])
print("NumPy array:", scores)
print("Type:", type(scores))

NumPy array: [88 91 77]
Type: <class 'numpy.ndarray'>

NumPy vs. Lists and pandas: Why Arrays?

NumPy arrays are different from Python lists and pandas DataFrames:

Arrays are much faster for math and use less memory than lists.
You can do math on all elements at once (vectorized operations), instead of looping.
Use NumPy for large, numeric, or multi-dimensional data (like matrices or scientific data).
Use pandas for labeled, spreadsheet-like data (with column names and mixed types).

Let’s see a simple example of vectorized math with NumPy.

# Vectorized operation: add 5 to every score
boosted_scores = scores + 5
print("Boosted scores:", boosted_scores)

Boosted scores: [93 96 82]
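For comparison, here is a small sketch of the same task done with a plain Python list. It also shows that the * operator means something different for lists and arrays.

```python
import numpy as np

scores_list = [88, 91, 77]
scores_array = np.array(scores_list)

# With a list, we have to visit each element ourselves
boosted_loop = [s + 5 for s in scores_list]

# With an array, the operation applies to every element at once
boosted_vec = scores_array + 5

print(boosted_loop)
print(boosted_vec)

# Careful: '*' repeats a list, but multiplies each element of an array
print(scores_list * 2)
print(scores_array * 2)
```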

Working with Multi-Dimensional Data

NumPy really shines when you need to work with more than one dimension—like a table (matrix) or even a 3D block of data (think: images or time series).

Let’s create a 2D array (matrix) and do a simple operation.

# Create a 2D NumPy array (matrix) and compute the mean
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("2D array (matrix):\n", matrix)
print("Mean of all elements:", matrix.mean())

2D array (matrix):
 [[1 2 3]
 [4 5 6]
 [7 8 9]]
Mean of all elements: 5.0
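With a 2D array you can also compute statistics along one dimension at a time by passing the axis argument. The sketch below reuses the same 3×3 matrix; axis=0 works down the rows (one result per column) and axis=1 works across the columns (one result per row).

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print("Shape:", matrix.shape)  # (rows, columns)

col_means = matrix.mean(axis=0)  # one mean per column
row_means = matrix.mean(axis=1)  # one mean per row
print("Column means:", col_means)
print("Row means:", row_means)

# Single element: row position 1, column position 2 (both start at 0)
print("Element [1, 2]:", matrix[1, 2])
```

If rows were participants and columns were trials, axis=1 would give you one mean per participant.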

Summary: When to Use NumPy

Use NumPy when you need fast math on large, numeric, or multi-dimensional data (like arrays, matrices, or scientific data).
Use pandas when you want labeled, tabular data (like spreadsheets or survey results).

You don’t need to master NumPy now—just know it’s your go-to tool for efficient number crunching. We’ll explore more advanced features in future notebooks!

Section 5: Operators and Basic Operations for Research¶

Welcome to the world of operators! Here, you’ll learn how to add, subtract, compare, and generally boss your data around. This is where Python starts to feel less like a calculator and more like a research assistant who never sleeps (or complains about coffee).

5.1. Concept: Arithmetic, Comparison, and Logical Operators¶

Python supports all the usual suspects:

Arithmetic: +, -, *, /, //, %, **
Comparison: ==, !=, >, <, >=, <=
Logical: and, or, not

# Example: arithmetic and comparison
score1 = 88
score2 = 91
print("Sum:", score1 + score2)
print("Difference:", score2 - score1)
print("Are they equal?", score1 == score2)
print("Is score2 higher?", score2 > score1)

Sum: 179
Difference: 3
Are they equal? False
Is score2 higher? True
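The example above does not touch //, %, **, or the logical operators, so here is a short sketch of those. The variable names and values are made up for illustration.

```python
# Floor division and remainder: split a duration into hours and minutes
total_minutes = 135
hours = total_minutes // 60    # whole hours
minutes = total_minutes % 60   # leftover minutes
print(f"{total_minutes} minutes = {hours} h {minutes} min")

# Exponentiation
squared = 3 ** 2
print("3 squared:", squared)

# Logical operators combine True/False values
age = 20
consent = True
eligible = age >= 18 and consent
print("Eligible?", eligible)
print("Not eligible?", not eligible)
```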

5.2. Concept: Vectorized Operations with NumPy¶

NumPy lets you do math on whole arrays at once, which is both efficient and a great way to impress your friends at research parties.

import numpy as np
scores = np.array([88, 91, 77])
attendance = np.array([5, 4, 2])

# Add 5 points to everyone (grade inflation, anyone?)
boosted_scores = scores + 5
print("Boosted scores:", boosted_scores)
# Compute mean score
print("Mean score:", np.mean(scores))
# Logical selection: who scored above 85?
print("Above 85:", scores > 85)

Boosted scores: [93 96 82]
Mean score: 85.33333333333333
Above 85: [ True  True False]
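The True/False array from a comparison can be used to select values, a technique usually called boolean indexing. The sketch below applies it to the same scores array.

```python
import numpy as np

scores = np.array([88, 91, 77])

above_85 = scores > 85   # boolean array, one True/False per score

# Use the boolean array as a filter: keeps only the positions that are True
high_scores = scores[above_85]
print("High scores:", high_scores)

# True counts as 1 and False as 0, so sum() counts and mean() gives a proportion
print("How many above 85:", above_85.sum())
print("Proportion above 85:", above_85.mean())
```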

5.3. Type-Safe Operations and Common Pitfalls¶

Python is flexible, but sometimes too flexible. Mixing types (like adding a string to an int) will get you a TypeError and a gentle reminder to check your data types.

# Example: type safety
try:
    result = "88" + 5
except TypeError as e:
    print("Oops!", e)

Oops! can only concatenate str (not "int") to str
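One way to avoid this error is to convert the value to a number before doing math. The sketch below shows int() for a single value and pd.to_numeric() for a whole column; the raw_scores values, including the “n/a” entry, are made up for illustration.

```python
import pandas as pd

# Single value: convert the string to an integer first
result = int("88") + 5
print("Converted and added:", result)

# Whole column: survey exports often store numbers as text
raw_scores = pd.Series(["88", "91", "n/a"])

# errors="coerce" turns anything that cannot be parsed into NaN (missing)
clean_scores = pd.to_numeric(raw_scores, errors="coerce")
print(clean_scores)
print("Missing after conversion:", clean_scores.isna().sum())
```

Coercing to NaN keeps the pipeline running, but it is worth checking how many values went missing before you analyze anything.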

Mini Challenge 5¶

Given two NumPy arrays (scores and attendance), compute the mean, standard deviation, and a boolean array indicating who is eligible (score >= 85 and attendance >= 3). Print a summary using f-strings and keep the sanity_level variable alive and well.

Weekly Challenge: The Research Data Pipeline (Now with More Sanity)¶

It’s time to put your new skills to the test! This week’s challenge is to build a mini research data pipeline that would make any behavioral scientist (and their future self) proud.

Requirements¶

Load participant data from a CSV (or define a list of dicts if you’re feeling retro).
Convert all relevant fields to the correct types (no string scores allowed!).
Use pandas and/or NumPy to:
- Filter for eligible participants (age >= 18, consent True, attendance >= 3)
- Add a sanity_level column: “sane” if score >= 85, “borderline” otherwise
- Compute and print the mean, median, and standard deviation of scores for eligible participants
- Print a summary report using f-strings, with at least one joke about research or sanity
Bonus: Write a function that takes a DataFrame and returns a summary DataFrame with eligibility and sanity_level already computed.

Example Starter Code¶

import pandas as pd
import numpy as np
# Define participant records
records = [
    {"id": "P01", "age": "22", "consent": True, "attendance": "5", "score": "88"},
    {"id": "P02", "age": "19", "consent": True, "attendance": "4", "score": "91"},
    {"id": "P03", "age": "17", "consent": False, "attendance": "2", "score": "77"},
    {"id": "P04", "age": "21", "consent": True, "attendance": "3", "score": "84"},
    {"id": "P05", "age": "20", "consent": True, "attendance": "5", "score": "95"},
    {"id": "P06", "age": "18", "consent": False, "attendance": "1", "score": "80"},
    {"id": "P07", "age": "23", "consent": True, "attendance": "6", "score": "89"},
    {"id": "P08", "age": "19", "consent": True, "attendance": "4", "score": "90"},
]
df = pd.DataFrame(records)
# Convert types
for col in ["age", "attendance", "score"]:
    df[col] = df[col].astype(int)
# Eligibility
df["eligible"] = (df["age"] >= 18) & (df["consent"]) & (df["attendance"] >= 3)
# Sanity level
df["sanity_level"] = np.where(df["score"] >= 85, "sane", "borderline")
# Filter eligible
eligible_df = df[df["eligible"]]
# Summary stats
mean_score = eligible_df["score"].mean()
median_score = eligible_df["score"].median()
std_score = eligible_df["score"].std()
print(f"Eligible participants: {len(eligible_df)}")
print(f"Mean score: {mean_score:.2f}, Median: {median_score}, Std: {std_score:.2f}")
print(f"Sanity check: {eligible_df['sanity_level'].value_counts().to_dict()}")
print("If your code runs, your sanity_level is at least 'borderline'.")

# (Optional) Bonus: Function for summary pipeline

def summarize_participants(df):
    """Return a DataFrame with eligibility and sanity_level, and print summary stats."""
    df = df.copy()
    for col in ["age", "attendance", "score"]:
        df[col] = df[col].astype(int)
    df["eligible"] = (df["age"] >= 18) & (df["consent"]) & (df["attendance"] >= 3)
    df["sanity_level"] = np.where(df["score"] >= 85, "sane", "borderline")
    eligible_df = df[df["eligible"]]
    mean_score = eligible_df["score"].mean()
    median_score = eligible_df["score"].median()
    std_score = eligible_df["score"].std()
    print(f"Eligible participants: {len(eligible_df)}")
    print(f"Mean score: {mean_score:.2f}, Median: {median_score}, Std: {std_score:.2f}")
    print(f"Sanity check: {eligible_df['sanity_level'].value_counts().to_dict()}")
    print("If your code runs, your sanity_level is at least 'borderline'.")
    return eligible_df

# Example usage:
# eligible_df = summarize_participants(df)