The Case for a Data Driven Marathon
by Valentino Constantinou & Ashley Felber
Every year, tens of thousands of runners compete in the Chicago Marathon. Thousands more fail to finish. Roughly one thousand have injuries that vary widely in severity, from a simple blister to cardiac arrest.

The city of Chicago is forward-thinking with regard to marathon medicine and in driving open access to civic data. We argue that marathon organizers and the city could improve the safety of the Chicago Marathon by retrieving data from health trackers, such as those from Nike, Fitbit, and Garmin. Access to participants' health tracker data could provide an understanding of a runner's physical condition, for example by examining heart rate throughout the course and in the period before an injury.
Some Background
The first Chicago Marathon took place on September 23, 1905 with 20 registered participants. Today, that number has grown to 45,000 participants, with over 70,000 applicants [1, 2]. While thousands of participants run or compete in the marathon, thousands more come to watch the event from in and around Chicago, and even from other states. Last year, roughly 83% of marathon participants ran the entirety of the course. While certainly an achievement, this means that 7,818 did not finish, whether because they did not show up or because of exhaustion or injury. Most of the non-finishers simply overestimated their ability, but roughly a thousand are injured during the event every single year. What if we had a better idea of exactly where those injuries occurred, and under what physical conditions?
37,182 participants completed the 2015 Chicago Marathon [3].


A growing number of Chicago Marathon participants are using health tracking devices to record heart rate, calories burned, distance traveled, and the intensity of their physical activity. What if we leveraged that information to improve the safety and efficiency of the entire event? The City of Chicago has been a pioneer in open civic data in recent years [4, 5, 6]. Extending that trend to events such as the Chicago Marathon could improve the health and safety of participants through better operations. These are improvements that can only be made with a better understanding of participants' physical strain and speed as they progress through the course.
The Chicago Marathon's 26.219-mile course. The Chicago Marathon is one of six World Marathon Majors.
An Organized Trial
We decided to conduct a small trial using Fitbit devices in order to assess the validity and usability of the data in a marathon context. The trial took place on May 14th, 2016 in Evanston, Illinois, just south of the Northwestern University campus. Twelve volunteers completed a course of approximately 2.1 miles along Lake Michigan, with an average temperature of 46°F while the event was held. Many thanks to these wonderful volunteers. This proof of concept would not have become a reality without their time and sweat.

So, what did we gain from our trial run? After all, at 2.1 miles it represents just 8% of a full marathon distance (and just 16% of a half marathon). Even so, we were able to easily see differing trends in runner pace over each minute of the trial race. Some participants ran faster at the start and finish, while others held a fairly constant pace throughout. The visualization below clearly shows differences in pace across participants.
Individual runner pace over the course distance, animated by the minute. Note that clear differences can be seen between participants. On a larger scale, pace across groups of runners could be visualized.
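The per-minute pace behind a chart like this can be derived from minute-level distance readings. The following is a minimal sketch, not our production code: it assumes each runner's data has already been reduced to a list of miles covered in each one-minute interval, and the function names are illustrative.

```python
def pace_per_minute(distances_miles):
    """Pace (min/mile) for each one-minute sample.

    Assumes each value is the distance in miles covered during one minute.
    Minutes with no recorded movement yield None instead of an infinite pace.
    """
    return [round(1.0 / d, 2) if d > 0 else None for d in distances_miles]


def average_pace(distances_miles):
    """Overall pace (min/mile): elapsed minutes divided by total miles."""
    total = sum(distances_miles)
    if total <= 0:
        return None
    return len(distances_miles) / total


if __name__ == "__main__":
    # Hypothetical readings for one runner over five minutes.
    sample = [0.05, 0.08, 0.0, 0.08, 0.1]
    print(pace_per_minute(sample))
    print(average_pace(sample))
```

Returning None for idle minutes keeps gaps visible in a plot rather than drawing them as extreme spikes.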

On a larger scale with thousands of runners and in a marathon context, a similar visualization could be used to illustrate the density of runners over the entire course at any point in time. This density could then be split between groups in order to show differences or similarities between them. How strongly does skill level affect overall pace? Does age have an impact on pace in the early stages, or in later stages? How far in advance of an aid station did a participant's injury occur, and would altering the location of those aid stations provide a better level of care? These are all questions that could be answered with similar graphs. While the above visualization is modest, it can easily serve as the foundation for the analysis of large groups of runners, such as the hundreds or thousands with health trackers that participate in the marathon each year.
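As a rough illustration of the density idea, runner positions at a single instant can be binned into fixed-length course segments. This is a sketch under stated assumptions: positions are given as miles from the start, the half-mile bin width is arbitrary, and the function name is hypothetical.

```python
from collections import Counter


def runner_density(positions_miles, bin_width=0.5):
    """Count runners in each course segment at one moment in time.

    positions_miles: distance from the start line for each runner (miles).
    Returns a dict mapping the start of each occupied segment to a count.
    """
    counts = Counter(int(p // bin_width) for p in positions_miles)
    return {round(b * bin_width, 3): n for b, n in sorted(counts.items())}


if __name__ == "__main__":
    # Hypothetical positions of six runners ten minutes into a race.
    print(runner_density([0.6, 0.7, 0.9, 1.1, 1.2, 1.6]))
```

Running the same function separately on each group of interest (by age bracket or skill level, for example) would give the per-group comparison described above.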
What We Learned
Understanding runner pace over the length of the course was a great first step towards validating the data and showing how it could be used in a marathon context. Pace is certainly important, but we learned many other valuable lessons throughout the project that could aid a large-scale implementation, from data collection to storage to retrieval and analysis. We go into further detail in the following paragraphs, but one major thing we did not foresee is that the data lacks integrity near the start and finish. We also learned a decent number of fun facts, such as:
Avg Pace: 13.4 min/mile


So... it is clear that we don't have any record-setters in our sample population (the current marathon record holder, Dennis Kimetto, averaged roughly 4:40 per mile). Regardless, this is not hugely relevant in our case. The majority of marathon runners do not approach record-setting pace. Marathon runners will show a better overall pace than our sample, but the differences are not drastic enough to impede the exploration of the data's usefulness. In the adjacent heatmap, we show just one visualization possibility that could be directly applied on a larger scale. Do certain brands of shoes enhance or degrade pace? Could some cause more podiatric injuries?
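For readers who want to check the comparison, the arithmetic can be sketched as follows. The finishing time of 2:02:57 for Kimetto's record is taken from public race results rather than from our data, and the helper names are illustrative.

```python
MARATHON_MILES = 26.219  # course length used elsewhere in this article


def pace_min_per_mile(hours, minutes, seconds, miles=MARATHON_MILES):
    """Average pace in decimal minutes per mile for a finishing time."""
    total_minutes = hours * 60 + minutes + seconds / 60
    return total_minutes / miles


def format_pace(pace):
    """Format a decimal min/mile pace as M:SS."""
    whole = int(pace)
    secs = round((pace - whole) * 60)
    if secs == 60:  # guard against rounding up to a full minute
        whole, secs = whole + 1, 0
    return f"{whole}:{secs:02d}"


if __name__ == "__main__":
    record = pace_min_per_mile(2, 2, 57)  # assumed record time, 2:02:57
    print(format_pace(record))  # 4:41
    print(format_pace(13.4))    # 13:24, our trial average
```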

Many of the lessons we learned came early in the process, during framework development. While accessing the Fitbit API and retrieving data is relatively straightforward, by default Fitbit restricts the level of granularity an application can access. Fitbit regularly allows a deeper granularity of access by special request, and we were granted intraday granularity following a polite email in which we explained our research goals. This process took about 26 days from sending the email to being granted access.
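To give a sense of what an intraday request looks like once access is granted, here is a minimal sketch that only builds the request URL and headers and makes no network call. The URL pattern follows Fitbit's publicly documented intraday time series endpoint as we understand it and may have changed; the helper names and the token placeholder are hypothetical.

```python
API_ROOT = "https://api.fitbit.com/1/user"


def intraday_url(resource, date, start, end, detail="1min", user_id="-"):
    """Build an intraday activity URL for one day and time window.

    resource: e.g. "steps" or "distance"; date: "YYYY-MM-DD";
    start/end: "HH:MM"; user_id "-" refers to the authorized user.
    """
    return (
        f"{API_ROOT}/{user_id}/activities/{resource}"
        f"/date/{date}/1d/{detail}/time/{start}/{end}.json"
    )


def auth_headers(access_token):
    """OAuth 2.0 bearer header; the token comes from the user's sign-in."""
    return {"Authorization": f"Bearer {access_token}"}


if __name__ == "__main__":
    # Trial date from this article; the time window is an assumed example.
    print(intraday_url("steps", "2016-05-14", "09:00", "10:00"))
    print(auth_headers("<access-token>"))
```

An HTTP client of your choice would then issue a GET with these values; without the intraday grant described above, such a request should be expected to fail.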

It also became clear that achieving critical mass would only be attainable by migrating our scripts to the web. In order to be granted access to the data, we had to have each participant sign into their Fitbit account prior to the start of the race. While running the script for each user provided the minimum amount of functionality for our trial purposes, allowing would-be participants to sign up online would dramatically expand the potential audience. This is currently a goal we would like to pursue in our development plan, along with including new functionality to retrieve data from other types of devices.
We also learned lessons with regard to data retrieval and validity. Data retrieval is a post-race process, which involves querying the Fitbit API for a specific set of information using credentials received during the initial sign-up and login process. The amount of data we can access is partly a function of how much time has passed since the conclusion of the race. Since participants' wearables may sync with the Fitbit server at different intervals, at least several days may pass before a large share of the potential population can be retrieved. In our case, we queried the data four days after the event. Two users had syncing issues, leaving ten participants in our data set. On a larger scale, we'd suggest waiting several weeks between the event and data retrieval to give every device time to sync.

Data integrity was a key attribute we wanted to examine as part of our analysis. Upon visual examination of the data, such as in the adjacent heatmap, we discovered that the first few minutes of data could be empty or not representative of initial pace. It was also not immediately clear from the data itself when a runner completed the event. We were able to infer end times from our manual recordings on the day of the event, but translation to marathon scale should rely on officially recorded runner start and end times. Both the start and end of the data for each runner are highly skewed and should be disregarded in certain areas of analysis. The vast majority of the data, however, was representative of a particular participant's pace throughout the course. It's not surgical-quality data, but it's certainly enough to make robust inferences.
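One simple way to disregard the unreliable edges is to keep only samples that fall inside the officially recorded race window, less a small margin at each end. This is a sketch under assumptions: samples are dictionaries with a "time" string in HH:MM:SS form and a "value", the two-minute margin is arbitrary, and the function names are illustrative.

```python
def to_minutes(hhmmss):
    """Convert an "HH:MM:SS" string to minutes since midnight."""
    h, m, s = (int(part) for part in hhmmss.split(":"))
    return h * 60 + m + s / 60


def trim_to_race(samples, start, end, margin=2):
    """Keep samples between the official start and end times.

    margin: minutes dropped after the start and before the finish,
    where the readings were found to be unrepresentative.
    """
    lo = to_minutes(start) + margin
    hi = to_minutes(end) - margin
    return [s for s in samples if lo <= to_minutes(s["time"]) <= hi]


if __name__ == "__main__":
    # Hypothetical readings around an assumed 09:00 to 09:30 race window.
    readings = [
        {"time": "09:00:00", "value": 0.0},
        {"time": "09:03:00", "value": 0.08},
        {"time": "09:20:00", "value": 0.09},
        {"time": "09:29:00", "value": 0.01},
    ]
    print(trim_to_race(readings, "09:00:00", "09:30:00"))
```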
Total Steps: 34,496


Lastly, the data's validity was examined through the eyes of researchers and analysts. Beyond analyzing pace, it is possible to examine elevation changes, heart rates (depending on the device), cardio activity levels, stride lengths (walking and running), and the standard set of demographic variables, among other factors. While survey variables such as shoe brand were obtained through a participant survey, the Fitbit API provides the analyst with device-generated data, some of which is averaged over the lifetime of the user's activity (increasing accuracy). While we don't dive into a deeper analysis of the data in this article, it is entirely possible to adapt the frame of an analysis thanks to the breadth of accessible data.
Towards the Future
While the JavaScript D3 library was used for the visualizations in this article, we also used R and the RShiny package to construct a visualization dashboard more tailored towards individual participants. While researchers could use the data to improve the safety of the event, participants can also improve their own lives by having easier and better access to their personal data. Our RShiny package (linked below under "explore") provides an example of how access to data could be expanded for the greater public good.

Data collection was facilitated by the Fitbit API, which allowed us to build a framework using Python 3 and MongoDB to complete the end-to-end process, from requesting data through storage to formal analysis. Our framework includes functions that request user access, store unique access credentials, and request user-specific information on physical activity as daily summaries or at the intraday level (one minute being the most granular). Going forward, we would like to bring the framework online as an open invitation to any marathon participant who may be interested in sharing their marathon data, partly by expanding its use beyond Fitbit devices to others such as Nike+. Development is open source and free to anyone interested in helping (simply click "develop" below).
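The credential-storage step might look something like the sketch below. It is not the framework's actual code: the field names and function names are assumptions, and the collection argument is expected to be a PyMongo collection (or any object exposing a compatible update_one method).

```python
from datetime import datetime, timezone


def credential_document(user_id, access_token, refresh_token):
    """Shape of one stored record; field names are illustrative."""
    return {
        "_id": user_id,  # Fitbit user id doubles as the primary key
        "access_token": access_token,
        "refresh_token": refresh_token,
        "updated_at": datetime.now(timezone.utc),
    }


def store_credentials(collection, user_id, access_token, refresh_token):
    """Insert or refresh one participant's credentials.

    Uses an upsert so that signing in again overwrites the old tokens
    rather than creating a second record for the same participant.
    """
    doc = credential_document(user_id, access_token, refresh_token)
    fields = {k: v for k, v in doc.items() if k != "_id"}
    collection.update_one({"_id": user_id}, {"$set": fields}, upsert=True)
    return doc
```

With PyMongo installed, a call could look like store_credentials(MongoClient()["marathon"]["credentials"], user_id, access, refresh), where the database and collection names are again placeholders.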

About the Data Scientists
Valentino Constantinou is a student at Northwestern University's Master of Science in Analytics program. He is a graduate of the University of Tennessee’s Haslam College of Business, where he studied Economics. In his spare time, Valentino enjoys reading, spending time outdoors, and working on extracurricular projects such as analysis of medical data for the Chicago Marathon. He has accepted a 2016 summer internship position at the Jet Propulsion Laboratory, a joint facility between NASA and Caltech in Pasadena, California, as an IT Data Scientist. Valentino will be graduating from the MSiA program in December 2016.

Ashley Felber is also a student at Northwestern's Master of Science in Analytics program. She is a graduate of the University of Michigan with a degree in Economics and a minor in Mathematics, and will be interning at Amazon in Seattle for her 2016 summer internship. In her free time, Ashley enjoys spending time outdoors, hiking, and savoring some delicious wine and cheese. She will also be graduating from the MSiA program in December 2016.