During the Dragon’s Back Race runners cross Wales in a five-stage race. Every day a certain number of checkpoints need to be visited, but between those checkpoints the route is free. Most of the time the best route is rather obvious, but on occasion multiple options are possible. The different options taken are a big talking point. Which one is the best? Which one is the fastest? Which one makes sure you don’t get lost?
In 2015, following the race became easier for armchair fans: every runner carried a GPS tracker. So you could see where they were during the day. And now we can know where everyone went during the race. Can we draw any conclusions from those data about route choices?
Day by Day
How and such
For those interested, I give a short methodological overview. First of all, the input data. Simple. There are two kinds of data used in the analysis.
One is the tracks as they were available via the live tracking on the website. Theoretically there is one point every 1.5 minutes, which theoretically should be correct within a few meters. Reality is always a bit more involved, but in general the tracks show rather clearly where the runners went. Some tracks are missing or partly missing. I assume this is due to a malfunction of some kind. There are also some spurious points, strange artefacts etcetera, but nothing you wouldn’t expect from GPS tracks. The 1.5 minute time interval of course reduces the precision of the data. Generally, runners can travel about 100 meters in that time. The main implication is that close parallel options are nearly impossible to distinguish.
The second kind of data are the intermediate times, as they are given by the race results. Those come from SI-dibbers, which are in general flawless. There are a few exceptions with missing data points and spurious values, but in general those times are very reliable.
My main objective was to find the differences in time between different route choices, and also to see which choices the runners made. Maybe I overlooked an interesting option during my race that was spotted by others. The first question to answer is how you define a fast option. If you merely look at the times run, the fastest option will be the option chosen by the fastest runners. Which is not necessarily interesting.
OK. First I’ll give some definitions. For me the race is divided into five stages. So, a stage is a day of running from camp to camp. Times for a stage vary from 7 to 16+ hours. Each stage is divided into multiple legs. For me, a leg is the stretch between two checkpoints. So, each of the five stages is divided into 9 to 23 legs (equalling the number of intermediate checkpoints of the day plus one). Times for each leg can vary from a few minutes to many hours. Just keep this in mind. It is the terminology that I will use everywhere.
My take on the problem was to convert all the leg times to a percentage of the stage time. The assumption is of course that this percentage is stable among all the runners. This assumption turned out to be rather correct. If you plot the percentages for each leg, you find a very nice flat line (with some exceptions, but I’ll mention that when we get there). This gave me a good way to work with the intermediate times.
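To make the conversion concrete, here is a minimal sketch in Python. The checkpoint times are invented for illustration and the helper name leg_percentages is hypothetical; it only assumes that the results give cumulative times in minutes since the start of the stage, with the last value being the stage time.

```python
def leg_percentages(split_times):
    """Convert cumulative checkpoint times (minutes since the stage start)
    into the percentage of the stage time spent on each leg."""
    stage_time = split_times[-1]  # the last split is the finish of the stage
    legs = []
    previous = 0.0
    for t in split_times:
        legs.append(t - previous)  # time spent between two consecutive checkpoints
        previous = t
    return [100.0 * leg / stage_time for leg in legs]


# Hypothetical runner: three intermediate checkpoints plus the finish, so four legs.
splits = [60.0, 150.0, 240.0, 300.0]
percentages = leg_percentages(splits)
print(percentages)  # [20.0, 30.0, 30.0, 20.0]
```

A runner who is uniformly twice as slow would have every split doubled and would end up with exactly the same percentages, which is the property the analysis relies on.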
Obviously, there is a very high correlation among those percentages: the sum of all percentages of one runner must be 100%. In normal language, if you make a stupid mistake at some point you will have a high percentage for the leg in question. But this implies that the percentages for the other legs get pushed down because the sum has to be 100%. So it looks as if you were running relatively fast, while this was possibly not the case. I ignored this correlation and treated the percentages as random variables. This is obviously not correct, but good enough for our purposes. And it simplifies everything enormously. The error made due to this will be the biggest in long legs of stages with few legs.
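A small numerical illustration of this effect, with invented leg times: the same hypothetical runner is shown once with a clean run and once with 30 minutes lost on the second leg, while the pace on all other legs is unchanged.

```python
def percentages(leg_times):
    """Percentage of the stage time spent on each leg."""
    total = sum(leg_times)
    return [100.0 * t / total for t in leg_times]


clean = [60.0, 90.0, 90.0, 60.0]     # hypothetical leg times in minutes
mistake = [60.0, 120.0, 90.0, 60.0]  # same runner, 30 minutes lost on leg 2

clean_pct = percentages(clean)
mistake_pct = percentages(mistake)
for leg, (a, b) in enumerate(zip(clean_pct, mistake_pct), start=1):
    print(f"leg {leg}: {a:.1f}% -> {b:.1f}%")
```

Leg 2 goes up from 30.0% to 36.4%, but leg 1 drops from 20.0% to 18.2% although it was run in exactly the same 60 minutes. With only four legs the distortion is large; in a stage with twenty legs the same mistake would be spread much thinner.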
For the route options, the best starting point is just looking at the tracks on a map. Visual inspection reveals if there was variation in the ways taken. If everybody followed the same obvious path it makes no sense to make a further analysis. If there were different options, I identified those manually. For each track I manually decided to which option it belonged. Mistakes made in the classification are entirely my own. I’ve also experimented with a kNN-classifier, but in the end making the classification manually turned out to be both quicker and more accurate.
The implicit assumption is that all runners need the same percentage of their stage time to cover a specific leg. Then, if for all runners taking a specific route choice this percentage is higher than for the runners taking another route choice, we can assume that the former route was simply slower than the latter route. I want to stress that correlation does not imply causality. The causality might be the other way around. Maybe tired, slowing runners tend to prefer a route which looks easier (e.g. less elevation change). The choice of the route can certainly depend on how you feel at a particular moment. It is also useful to stress that the optimal choice will depend on the circumstances, especially the weather. Under different circumstances, and they will definitely be different during a next edition of the race, other choices might be better than the optimal choices of 2015. Furthermore, there is absolutely no guarantee that any part of the 2015 race will be part of future editions. Though some parts are of course very likely to be a part of all Dragon’s Back Races.
It has been announced that from 2017 on the map will contain an advised line. The data that you find here might help you spot the places where it is interesting to sneak to an alternative option.
The analysis is always limited to runners that finished the stage. Otherwise it doesn’t make a lot of sense to calculate the percentage of the stage time. All times are given in minutes.
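The comparison between route options described above could be sketched as follows. This is only an illustration under stated assumptions: the records, the option names "ridge" and "valley" and the helper mean_percentage_by_option are all invented, and taking a plain mean per option is one simple way to summarise the percentages, not necessarily the exact calculation behind the figures in this article.

```python
from statistics import mean

# Hypothetical records for one leg: (route option, leg time, stage time), in minutes.
# A stage time of None marks a runner who did not finish the stage.
records = [
    ("ridge", 48.0, 600.0),
    ("ridge", 66.0, 800.0),
    ("valley", 55.0, 500.0),
    ("valley", 84.0, 700.0),
    ("valley", 70.0, None),
]


def mean_percentage_by_option(records):
    """Mean percentage of the stage time spent on the leg, per route option."""
    by_option = {}
    for option, leg_time, stage_time in records:
        if stage_time is None:
            continue  # no stage time, so no meaningful percentage
        by_option.setdefault(option, []).append(100.0 * leg_time / stage_time)
    return {option: mean(values) for option, values in by_option.items()}


means = mean_percentage_by_option(records)
difference = means["valley"] - means["ridge"]  # in percentage points
# Translate the gap into minutes for a runner with a 600-minute stage.
minutes_on_600 = difference / 100.0 * 600.0
print(f"difference: {difference:.1f} percentage points")
```

With these made-up numbers the valley option costs roughly 3.4 percentage points more, which is about 20 minutes on a 600-minute stage. The caveat about causality applies to the sketch just as much as to the real data.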
Distribution over stages
To get started, I take a look at what the distribution over the different stages looks like. Obviously, I won’t make any attempt to link this to route choices, but it might give an idea of the stages where people lost time.
Only 65 runners are ranked in the final standings. For those 65, these are histograms of how much time they spent on each of the stages.
The average percentage of finish time needed for each of the subsequent stages is 18.9%, 21.1%, 21.6%, 20.4% and 18.1%. As all participants probably figured out, the longest stages are the second and third stages. The first and last stages are the shortest ones. The first one feels a bit longer than it really is because you are still fresh, it has lots of elevation change and you start later than during the other stages. The spread is clearly larger on the last stage than on the other stages. I’m not sure if this is because some people were struggling to the finish line, or if it has to do with the weather conditions. During this last stage we got some foggy conditions, making the navigation a lot more tricky than during the rest of the race. It might simply be the difference between people nailing the navigation and people losing lots of time when they get lost or spending a lot of time to avoid getting lost.
Next, let’s see if this distribution over the stages changes with finish time.
For each of the stages, this is a scatter plot showing the percentage of time spent on the stage as a function of the total time. To consider differences between faster and slower runners, I’ve added a linear regression line. In order to read this you have to realise that for each runner there are five dots: one on each plot, and those five dots are on a vertical line. We see some nice examples of the correlation between those percentages. The runners that are very bad (very high percentage) on one stage are usually relatively good (low percentage) during another stage. I’ll give two examples.
There is the runner who spent almost 25% of his race on stage 2. This is the rightmost bar in the histogram for stage 2. We see that the same runner spent (relatively) the least time on stage 5 of all runners. So this same runner is the leftmost bar in the histogram for stage 5. It is an example of someone who had a rough day during the second stage and made a big dive in the ranking on that day. Afterwards, he recovered and started moving up again in the rankings. I know because it’s me.
Another nice example is the runner who ran slightly more than 3500 minutes over the entire race. I won’t say who he is, but let’s call him Ally. In the first two stages he ran very fast for someone with his total time. You see that both times his dots are very low. During the last two stages, on the other hand, his dots are very high, indicating that he was very slow for someone with his total time. It is an example of a fast runner who got physical issues halfway through the race and ended up struggling to the finish line with the back-of-the-packers in the last stages.
If you find yourself in the plots, the stages where you are above the line will be the stages where you had the most difficulties, while the stages where you are below the line are the ones where you had a blast. Don’t be disappointed if you are not below the line all the time. That is impossible.
A logical question to ask is if there is a difference between faster and slower runners. We see that during stages 1 and 3 the regression line has a negative slope, while during stages 4 and 5 it has a positive slope. This means that faster runners spent relatively longer on stages 1 and 3 and less time on stages 4 and 5. Put another way, the faster runners are better at keeping up the pace over the five stages, while slower runners slow down as the race progresses. For the more mathematically inclined readers I give the slopes of the regression lines with a 95% confidence interval for each of the five stages: -6.2e-06 (-1.2e-05; -7.9e-07), -3.0e-06 (-8.4e-06; 2.4e-06), -6.7e-06 (-1.1e-05; -2.9e-06), 8.8e-06 (4.8e-06; 1.3e-05) and 7.1e-06 (5.3e-07; 1.4e-05). Only for stage 2 does the interval include zero.
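For readers who want to reproduce this kind of number, here is a hedged sketch of an ordinary least squares slope with an approximate 95% confidence interval, using only the Python standard library. Two assumptions to flag: the data points are invented, and the interval uses the normal quantile (about 1.96) instead of Student's t, so it comes out slightly too narrow for small samples. The helper name slope_with_interval is hypothetical, and the stage share is expressed here as a fraction of the total time, which is a guess at the units behind the slopes quoted above.

```python
from statistics import NormalDist, mean


def slope_with_interval(x, y, confidence=0.95):
    """Least squares slope of y on x with an approximate confidence interval.

    Assumption: the normal quantile replaces Student's t, which is only
    a rough approximation when there are few data points.
    """
    n = len(x)
    x_mean, y_mean = mean(x), mean(y)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    residual_ss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    standard_error = (residual_ss / (n - 2) / sxx) ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return slope, slope - z * standard_error, slope + z * standard_error


# Invented data: total race time in minutes and the fraction of it spent on one stage.
total_minutes = [2500.0, 3000.0, 3500.0, 4000.0, 4500.0]
stage_share = [0.180, 0.185, 0.183, 0.190, 0.192]

slope, low, high = slope_with_interval(total_minutes, stage_share)
print(f"slope {slope:.1e} ({low:.1e}; {high:.1e})")
```

A positive slope whose interval stays above zero, as in this toy example, is the pattern of stages 4 and 5: the slower the runner, the larger the share of the race spent on that stage.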