The goal of the cleaning process was to prepare the teen mental health dataset for exploratory data analysis and visualization.
Input:
data/Teen_Mental_Health_Dataset.csv
Output:
data/Teen_Mental_Health_cleaned.csv
Dataset summary before cleaning:
| Metric | Value |
|---|---|
| Rows | 1,200 |
| Columns | 13 |
| Missing values | 0 |
| Duplicate rows | 0 |
Column names were standardized to a consistent snake_case style.
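The report does not show the renaming code itself. A minimal sketch of one way to produce snake_case column names is given here; the helper `to_snake_case` and the toy values are assumptions for illustration, not the code used in the project.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Lowercase a column name and join its words with underscores (hypothetical helper)."""
    # Collapse every run of non-alphanumeric characters into a single underscore.
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()


# Toy frame with made-up values, using three of the original column names.
raw = pd.DataFrame({
    "Daily Social Media Hours": [3.5],
    "Sleep Hours": [7.0],
    "Depression Label": [0],
})

# rename() accepts a callable and applies it to every column label.
renamed = raw.rename(columns=to_snake_case)
print(list(renamed.columns))
```

Passing a function to `rename` keeps the mapping rule in one place, so any column added later would follow the same convention.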
Examples:
- Daily Social Media Hours becomes daily_social_media_hours
- Sleep Hours becomes sleep_hours
- Depression Label becomes depression_label

Duplicate rows were checked with:
df.duplicated().sum()
Result:
0 duplicate rows
No duplicate rows needed to be removed.
Missing values were checked with:
df.isnull().sum()
Result:
0 missing values across all columns
No imputation was required for the teen dataset.
Text categories were standardized by stripping whitespace and converting values to lowercase.
Columns cleaned:
- gender
- platform_usage
- social_interaction_level

Example:
df["gender"] = df["gender"].str.strip().str.lower()
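The same normalization can be applied to all three text columns in one loop. The sketch below assumes the column names listed above; the sample values are invented to show the effect and do not come from the dataset.

```python
import pandas as pd

# Invented values with stray whitespace and mixed case.
sample = pd.DataFrame({
    "gender": [" Male", "female ", "FEMALE"],
    "platform_usage": ["Instagram ", " tiktok", "Both"],
    "social_interaction_level": ["High", " low ", "MEDIUM"],
})

text_columns = ["gender", "platform_usage", "social_interaction_level"]
for column in text_columns:
    # Strip surrounding whitespace, then lowercase, so that variants such as
    # " Male" and "male" collapse into one category.
    sample[column] = sample[column].str.strip().str.lower()

print(sample["gender"].unique())
```

After this step, `value_counts()` on each column gives a quick way to confirm that no near-duplicate categories remain.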
The following checks were performed:
df[df["age"] <= 0]
df[df["daily_social_media_hours"] <= 0]
df[df["sleep_hours"] <= 0]
df[~df["depression_label"].isin([0, 1])]
No invalid values were found based on these checks.
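For a repeatable version of these checks, the four filters could be wrapped in a function that counts failing rows per column. This is a sketch only: `count_invalid_rows` is a hypothetical helper and the sample rows are invented, with the second row deliberately invalid.

```python
import pandas as pd


def count_invalid_rows(df: pd.DataFrame) -> dict:
    """Count rows that fail each validity check (hypothetical helper)."""
    return {
        # The three numeric checks mirror the <= 0 filters used in the report.
        "age": int((df["age"] <= 0).sum()),
        "daily_social_media_hours": int((df["daily_social_media_hours"] <= 0).sum()),
        "sleep_hours": int((df["sleep_hours"] <= 0).sum()),
        # The label is expected to be binary: 0 or 1.
        "depression_label": int((~df["depression_label"].isin([0, 1])).sum()),
    }


# Invented rows: the second one has a negative sleep value and a label of 2.
sample = pd.DataFrame({
    "age": [15, 16],
    "daily_social_media_hours": [2.5, 4.0],
    "sleep_hours": [8.0, -1.0],
    "depression_label": [0, 2],
})

report = count_invalid_rows(sample)
print(report)
```

A result in which every count is 0 corresponds to the outcome reported for the teen dataset. Note that the `<= 0` comparison treats a value of exactly zero as invalid, which matches the checks as written.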
Dataset summary after cleaning:
| Metric | Value |
|---|---|
| Rows | 1,200 |
| Columns | 13 |
| Missing values | 0 |
| Duplicate rows | 0 |
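The four metrics in this table can be computed directly from a DataFrame. The sketch below uses a hypothetical `summarize` helper and a small invented frame; it illustrates the calculation and is not the code that produced the table.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Return the row, column, missing-value and duplicate-row counts (hypothetical helper)."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        # isnull().sum() gives per-column counts; the second sum() totals them.
        "missing_values": int(df.isnull().sum().sum()),
        # duplicated() marks every repeat of an earlier, fully identical row.
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Invented frame with one repeated row and one missing value.
sample = pd.DataFrame({"age": [15, 15, 17], "sleep_hours": [8.0, 8.0, None]})
summary = summarize(sample)
print(summary)
```

Running the same function before and after cleaning makes it easy to confirm that no rows or columns were lost along the way.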
The cleaned dataset was exported with:
df.to_csv("data/Teen_Mental_Health_cleaned.csv", index=False)
Using index=False prevents pandas from writing the DataFrame index as an extra CSV column.
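The effect of `index=False` can be seen by writing a small frame both ways and comparing the header lines. The frame below is invented, and the output is written to in-memory buffers rather than to the project's data directory.

```python
import io

import pandas as pd

sample = pd.DataFrame({"age": [15, 16], "sleep_hours": [8.0, 6.5]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
sample.to_csv(with_index)

# index=False: only the actual data columns are written.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(header_with)     # ,age,sleep_hours
print(header_without)  # age,sleep_hours
```

When a file written with the default setting is read back with `pd.read_csv`, the unnamed column typically reappears as `Unnamed: 0`, which is the extra column the export avoids.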
The dataset was already mostly clean. The main cleaning work involved standardizing text categories and validating that the data had no missing values, duplicate rows, or invalid numeric ranges.