After moving to Delaware and gaining back at least an hour of my life in commuting time, I have been working out more than I did last semester. Unfortunately, it seems a lot of other people have had the same idea this semester. Whether it's the cold and rainy weather, trying to get back in shape in time for spring break, or some other reason, the number of undergrads in the gym has intimidated me enough to avoid lifting on campus.
Luckily, my new apartment has a Peloton, which I like a lot but maybe not enough to spend $2000 plus $50 a month for my own. And I usually get a really good workout!
After a pretty intensive ride one day, I started to wonder if there was a relationship between the quality of these workouts and other factors in my life, like sleep quality. Maybe, I thought, if I sleep better at night then I feel better, push myself a bit harder, and end up with a better workout. But conversely, if I get more sleep, I might be more relaxed and would rather take it easy than push myself at the gym. So I’m not sure whether there is an effect, or which direction it goes!
In this post I’ll take you through the process of trying to answer this question using simple statistical concepts. The last few posts have been a bit more advanced, so I thought it would be good to provide a simpler example from my life using statistics.
Formulating the question
The question I would like to answer is: Does better sleep quality lead to better workouts for me?
Typically, a good research question should be specific enough to be feasible, but general enough to be applicable, new, and interesting. If I were to ask “how does the sleep quality on tempurpedic mattresses affect the calories burnt during workouts of white men aged 25 in northern Delaware,” the question might be so specific that you doubt it would extend to other people, or imply anything for anyone else. But you might also think “hey, if this holds for men, it probably holds for women too,” and so the specificity isn’t such a big deal. A broader study covering the sleep quality of lots of people from lots of places would be more powerful, but would not necessarily change the result!
For my purposes, I only cared about the outcome for me anyway, so I just used my own data.
Gathering data
Step 1 was to download my sleep and workout data recorded from my Apple Watch. You can do the same with your data from the Health app on your iPhone.
To approximate sleep quality, I take the number of hours slept, and to approximate workout quality, I use the number of active calories burnt during the workout.
Like many measurements in economics, these measurements are not perfect. Sleep quality might be how long you sleep but it might also be how deeply you sleep or how many times you are disturbed during the night. Each or a combination of these measurements might lead to a more accurate picture of sleep quality, but for simplicity I just used duration.
Analyzing the results
Plotting my Apple Watch data leads to the following graph:
The number of hours that I slept the night before is on the x-axis, the active calories I burnt during my workout are on the y-axis, and the colors represent different years.
Right off the bat you can see this data is pretty evenly distributed. There doesn’t seem to be a clear trend in the data, and I can have a good workout on little sleep or a bad one on more sleep or vice versa.
These qualitative observations are useful, but what we actually want to do is put a number on whether there is a relationship between sleep and workout quality. To do that, we can estimate the following simple model and conduct a t-test.
Calories Burnt = a + b * Hours Slept
Here Calories Burnt is the dependent variable, Hours Slept is the independent variable, “a” is the constant, and “b” is the coefficient.
Here, the t-test will come up with a number, t = (b - Beta) / sb. “b” is what we calculate from the model, Beta is the hypothesized value, and sb is the standard error of the estimate of b, or the degree of variability of the coefficient.
If there is a relationship between sleep and workouts, we would think that the coefficient, b, would not be zero. If more sleep leads to better workouts, then we expect it to be positive, and if it leads to worse workouts, we expect it to be negative. So to find out if sleep affects workouts, set Beta to be 0 so it drops out.
Then, if this statistic “t” is large enough in absolute value, we can say there is statistically significant evidence that the coefficient b is large relative to its variability, and so different from zero. So there is evidence of a relationship between the two variables. Note that this is not the same thing as saying there is a relationship, or a causal effect in the sense that better sleep causes better workouts. But it does provide evidence in favor of it!
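To make the t-test concrete, here is a minimal sketch that fits the model by least squares and computes t = (b - Beta) / sb by hand in plain Python. The numbers are made up for illustration (they are not my Apple Watch data), and the function name slope_t_statistic is just something I picked for this example.

```python
import math

def slope_t_statistic(x, y, beta=0.0):
    """Fit y = a + b*x by least squares; return (a, b, s_b, t)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    b = sxy / sxx              # estimated coefficient
    a = mean_y - b * mean_x    # estimated constant

    # Residual variance uses n - 2 degrees of freedom,
    # because two parameters (a and b) were estimated.
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_b = math.sqrt(rss / (n - 2) / sxx)   # standard error of b

    t = (b - beta) / s_b       # beta is the hypothesized value (0 here)
    return a, b, s_b, t

# Hypothetical data, invented for this example.
hours_slept = [5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5]
calories_burnt = [310, 295, 330, 300, 325, 290, 335]

a, b, s_b, t = slope_t_statistic(hours_slept, calories_burnt)
print(f"b = {b:.2f}, s_b = {s_b:.2f}, t = {t:.2f}")
```

In practice a library does this for you, but writing it out shows that t is nothing more than the coefficient measured in units of its own standard error.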
When you estimate the model and calculate the t-test, you find that the coefficient is small (3.8) and insignificant (t=0.4). Typically the threshold for a significant value of t is about 2 in absolute value, and since 0.4<2 we cannot say that b is different from 0. You can see this on the graph, where the line of best fit appears to be pretty flat.
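As a quick sanity check on those two numbers, we can back out the standard error they imply. Since Beta is 0, t = b / sb, so sb = b / t. The interval of plus or minus 2 standard errors below is my own rough illustration using the same rule-of-thumb threshold, not an exact confidence interval.

```python
b = 3.8   # estimated coefficient reported above
t = 0.4   # t-statistic reported above, computed with Beta = 0

# Rearranging t = (b - 0) / sb gives the implied standard error.
s_b = b / t

# Rough interval using the rule-of-thumb threshold of 2.
lower = b - 2 * s_b
upper = b + 2 * s_b

print(f"implied standard error: {s_b:.1f}")
print(f"rough interval for b: ({lower:.1f}, {upper:.1f})")
print("interval contains 0:", lower < 0 < upper)
```

The implied standard error is about 9.5, so the rough interval runs from about -15.2 to 22.8. It comfortably contains 0, which is another way of saying the data cannot tell a small positive effect from a small negative one.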
Interpretation
Through this exercise (pun intended) I was not able to find evidence that better sleep leads to better workouts. And even if I had, that wouldn’t necessarily mean that better sleep causes better workouts, only that there is a relationship between the two variables. There may be something else, itself influenced by better sleep, that is causing the better workouts. (Causality is a much stronger statement that I’ll need another UTF post to convince you of!)
But the result here also doesn’t mean that better sleep has no effect on workouts. It might be the case that there is something wrong with this data (measurement error from the Apple Watch), that something else is at play (some days when I don’t get much sleep I have an afternoon cup of coffee and so actually have more energy when I work out), or that there is reverse causality (better workouts lead to better sleep the following night as well!).
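To see how measurement error alone could hide a real effect, here is a small simulation sketch in plain Python. Everything in it is hypothetical: I assume a true effect of 20 calories per extra hour of sleep and a watch that records sleep with about an hour of random noise. These are invented numbers, not estimates from my data.

```python
import random

def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx

rng = random.Random(0)   # fixed seed so the run is repeatable
TRUE_SLOPE = 20.0        # hypothetical calories per extra hour of sleep

# True sleep drives calories, plus unrelated day-to-day noise.
true_sleep = [rng.gauss(7.0, 1.0) for _ in range(5000)]
calories = [250 + TRUE_SLOPE * s + rng.gauss(0, 40) for s in true_sleep]

# The watch only sees a noisy version of true sleep.
measured_sleep = [s + rng.gauss(0, 1.0) for s in true_sleep]

slope_true = ols_slope(true_sleep, calories)
slope_measured = ols_slope(measured_sleep, calories)
print(f"slope using true sleep:     {slope_true:.1f}")
print(f"slope using measured sleep: {slope_measured:.1f}")
```

With noise as large as the real variation in sleep, the estimated slope shrinks toward zero, to roughly half the true value in this setup. So a flat line of best fit is what you would expect to see if the watch were noisy enough, even when a real relationship exists.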
So we aren’t exactly closer to answering the research question, but we did end up with a lot of good ideas about what other factors might be causing the result that we saw. At least that is a start, which, in economics, might be about as good as we can do.
Note: if you want to analyze your own data, it exports as an XML file that needs to be cleaned up a bit before you can use it. You can adjust my Python code below to do so. There is lots of good data in the Health app, so you can answer your own research questions :)
import pandas as pd
import xmltodict

# Read the raw Apple Health export
with open("export.xml", 'r') as file:
    filedata = file.read()
# Converting xml to python dictionary (ordered dict)
data_dict = xmltodict.parse(filedata)
# record_list holds the health records other than workouts
record_list = [dict(x) for x in data_dict["HealthData"]["Record"]]
workout_list = [dict(x) for x in data_dict["HealthData"]["Workout"]]
health_data = pd.DataFrame(record_list)
workout_data = pd.DataFrame(workout_list)
# Keep only strength training workouts (fixed: the DataFrame is named workout_data)
lifting_data = workout_data[workout_data['@workoutActivityType']=='HKWorkoutActivityTypeTraditionalStrengthTraining']
workout_list = [dict(x[0]) for x in lifting_data['WorkoutStatistics']]
workouts_df =pd.DataFrame(workout_list)
#%%
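# Keep only sleep records (fixed: the DataFrame is named health_data)
sleep_data = health_data[health_data['@type']=='HKCategoryTypeIdentifierSleepAnalysis']
# Keep only the records where I was actually asleep; copy to avoid modifying a slice
asleep_values = [
    'HKCategoryValueSleepAnalysisAsleepCore',
    'HKCategoryValueSleepAnalysisAsleepDeep',
    'HKCategoryValueSleepAnalysisAsleepREM',
    'HKCategoryValueSleepAnalysisAsleepUnspecified',
]
sleep_data = sleep_data.loc[sleep_data['@value'].isin(asleep_values)].copy()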
# Trim the trailing time zone offset (the last 6 characters) from each timestamp
for col in ['@creationDate', '@startDate', '@endDate']:
    sleep_data[col] = sleep_data[col].str[:-6]
#convert timestamps to time data type
sleep_data[['@creationDate','@startDate','@endDate']] = sleep_data[['@creationDate','@startDate','@endDate']].apply(pd.to_datetime)
# Length of each sleep record, in hours
sleep_data['duration'] = (sleep_data['@endDate'] - sleep_data['@startDate']).dt.total_seconds() / 3600
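# Remove data collected from the iPhone (the original line discarded the filtered result)
sleep_data = sleep_data[sleep_data['@sourceName'] != "iPhone (6)"]
# Sum only the numeric duration column within each creation timestamp
hrs_sleep = sleep_data.groupby('@creationDate')[['duration']].sum()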
hrs_sleep['Date']= hrs_sleep.index.date
# %%
# Trim the time zone offset from the workout start times
workouts_df['@startDate'] = workouts_df['@startDate'].str[:-6]
workouts_df[['@startDate','@endDate']] = workouts_df[['@startDate','@endDate']].apply(pd.to_datetime)
# Calendar date of each workout, used as the merge key
workouts_df['Date'] = workouts_df['@startDate'].dt.date
# combine datasets
merged_data = workouts_df.merge(hrs_sleep,on='Date')
# Year of each workout, used to color the plot
merged_data['Year'] = [d.year for d in merged_data['Date']]
# rename the columns of interest
merged_data['Calories Burnt'] = merged_data['@sum'].astype(float)
merged_data['Year']= merged_data['Year'].astype(str)
merged_data['Hours Slept'] = merged_data['duration']
# plot data
import plotly.express as px
px.scatter(merged_data,x='Hours Slept',y='Calories Burnt',color='Year')
#estimate model
import statsmodels.api as sm
model = sm.OLS(merged_data['Calories Burnt'],sm.add_constant(merged_data['Hours Slept'])).fit()
#print model summary
model.summary()
#plot with trendline
px.scatter(merged_data,x='Hours Slept',y='Calories Burnt',trendline="ols")
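If you would rather not install xmltodict, here is a rough sketch of the sleep-duration step using only Python's standard library. The sample XML is a tiny hypothetical stand-in for export.xml: the attribute names match the ones my code above reads, but the records themselves are invented, and I am assuming timestamps look like "2023-02-01 23:00:00 -0500". The function name hours_asleep_by_date is made up for this example. Note that it groups by the calendar date of creationDate, which is a slightly different choice from the groupby above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from datetime import datetime

# Hypothetical sample in the same shape as the Health export.
SAMPLE = """<HealthData>
  <Record type="HKCategoryTypeIdentifierSleepAnalysis" sourceName="Watch"
          creationDate="2023-02-02 07:05:00 -0500"
          startDate="2023-02-01 23:00:00 -0500"
          endDate="2023-02-02 03:30:00 -0500"
          value="HKCategoryValueSleepAnalysisAsleepCore"/>
  <Record type="HKCategoryTypeIdentifierSleepAnalysis" sourceName="Watch"
          creationDate="2023-02-02 07:05:00 -0500"
          startDate="2023-02-02 03:30:00 -0500"
          endDate="2023-02-02 05:00:00 -0500"
          value="HKCategoryValueSleepAnalysisAsleepREM"/>
  <Record type="HKCategoryTypeIdentifierSleepAnalysis" sourceName="Watch"
          creationDate="2023-02-02 07:05:00 -0500"
          startDate="2023-02-01 22:30:00 -0500"
          endDate="2023-02-02 07:00:00 -0500"
          value="HKCategoryValueSleepAnalysisInBed"/>
</HealthData>"""

# Same "asleep" categories that the pandas code above keeps.
ASLEEP = {
    "HKCategoryValueSleepAnalysisAsleepCore",
    "HKCategoryValueSleepAnalysisAsleepDeep",
    "HKCategoryValueSleepAnalysisAsleepREM",
    "HKCategoryValueSleepAnalysisAsleepUnspecified",
}
FMT = "%Y-%m-%d %H:%M:%S %z"  # assumed timestamp layout

def hours_asleep_by_date(xml_text):
    """Total hours asleep, keyed by the date each record was created."""
    totals = defaultdict(float)
    root = ET.fromstring(xml_text)
    for rec in root.iter("Record"):
        if rec.get("type") != "HKCategoryTypeIdentifierSleepAnalysis":
            continue
        if rec.get("value") not in ASLEEP:
            continue  # skip "in bed" and other non-sleep categories
        start = datetime.strptime(rec.get("startDate"), FMT)
        end = datetime.strptime(rec.get("endDate"), FMT)
        day = datetime.strptime(rec.get("creationDate"), FMT).date()
        totals[day] += (end - start).total_seconds() / 3600
    return dict(totals)

for day, hours in hours_asleep_by_date(SAMPLE).items():
    print(day, f"{hours:.1f} hours asleep")
```

Parsing the offset with %z instead of trimming it keeps the time zone information, which may matter if you travel. For a real export you would read the file contents instead of SAMPLE, and for a very large file a streaming parser such as ET.iterparse might be easier on memory.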