There is a lot of real data out there that can be used for learning about statistics. It’s important, though, to choose data with variables students can understand and connect with. I was really inspired by a talk Rob Gould gave at the NZAMT conference in July 2015 about professional versus modern data (you can read more in Rob’s paper Statistics and the Modern Student), and it made me think about how we expect students to connect with data that has come from a study. If we give students data that was collected for a study with its own purpose, why should we expect or want them to develop their own purpose for investigating that same data? I do think students can be really interested in data from studies, so I am not discouraging their use, but perhaps that is what the purpose should be framed around: what am I personally interested in finding out using this data?


The Auckland marathon is held each year and nearly 12 000 people enter the different events of the marathon. The reason that the Auckland marathon appeals to me as an example is how some of the data is collected for each runner: through a chip interfacing with different sensors placed at different points in the running courses. So we have “modern” data in terms of using sensors but it is intentionally collected so that runners can be awarded prizes. In this case, because of technology, we can get accurate data on quite a large number of runners. This data is combined with data that runners would have provided when entering the competition through an entry form, which is more like “professional” data in that this entry form was designed.

This is also an example of a well-defined population (all the runners entered in the Auckland marathon) which we could use to learn about sample-to-population inference. Before anybody starts to worry that we already have all the data, so why would we take a sample, note that in the previous sentence I used the word “learn”, and that is the important word here. For students to learn about sample-to-population inference, we need to be able to demonstrate the relationship(s) between a population and samples from this population, and to do this you need to have all of the “data” for a population. The most important thing about setting up students to sample from a population is that students get that they are learning about sample-to-population inference: that they learn what they can and can’t say about a population (parameter) when they only have some of the data from that population. If the focus is on this aspect of learning, then students do get why they are only using a sample of the population data for their investigation.
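To make the relationship between a population and its samples concrete, here is a minimal Python sketch. It assumes a made-up population of finish times (generated in the code, not real Auckland marathon results), and the names `population` and `sample_medians` are mine, for illustration only. It takes many random samples and shows how the sample medians vary around the population median.

```python
import random
import statistics

rng = random.Random(2015)  # fixed seed so the run is repeatable

# Hypothetical population: made-up finish times in minutes, NOT real race results.
population = [round(rng.gauss(130, 25), 1) for _ in range(5000)]
population_median = statistics.median(population)


def sample_medians(population, sample_size, repeats, rng):
    """Take `repeats` simple random samples and return the median of each."""
    return [
        statistics.median(rng.sample(population, sample_size))
        for _ in range(repeats)
    ]


medians = sample_medians(population, sample_size=30, repeats=200, rng=rng)

print("Population median:", population_median)
print("Smallest sample median:", min(medians))
print("Largest sample median:", max(medians))
```

Students only ever see one sample in a real investigation, so seeing the spread of many sample medians next to the one population median is what having the whole population makes possible.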


When I first started teaching, we got students to use a random number generator on their calculator to select members of a population list (and so their data) for a sample. There is no reason why students couldn’t still do this; procedurally it is no different from using a population bag …


… or using an application/script to select a random sample from a hidden population (database).
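As a rough idea of what such a script might look like, here is a small Python sketch. The class `HiddenPopulation`, its methods and the example records are all hypothetical (this is not the application students actually use). The point is simply that the full population is stored but only a random sample is ever handed back.

```python
import random


class HiddenPopulation:
    """Stores every record but only ever gives out random samples."""

    def __init__(self, records):
        self._records = list(records)  # kept private: students never see all of it

    def size(self):
        """How many members the population has."""
        return len(self._records)

    def draw_sample(self, n, seed=None):
        """Return n records chosen at random, without replacement."""
        rng = random.Random(seed)
        return rng.sample(self._records, n)


# Made-up records for illustration only (not real runners).
records = [{"bib": 100 + i, "minutes": 90 + (i * 7) % 120} for i in range(500)]
hidden = HiddenPopulation(records)

my_sample = hidden.draw_sample(30, seed=1)
print(len(my_sample), "runners sampled from a population of", hidden.size())
```

Passing a seed is optional. It is only there so a teacher can reproduce a particular student's sample if needed.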


Whether students see all the data in a spreadsheet, see all the data cards in a population bag, or use a population database, they know that in this learning environment all of the data exists (in that the variables have already been defined and measured for each member of the population) and that they are only going to have access to some of it. Students should be learning about what is involved in creating data through sampling: not just the difficulties of defining sampling frames and minimising non-sampling errors like non-response bias, but also defining the variables to be measured. However, we also need to balance different priorities for learning in statistics. We want to make connections between understandings, but we also need to focus on some ideas more than others at different points in students’ learning progression, so there should be no issue with using “ready-made” population data for learning about sample-to-population inference.

Although the data is in the database sitting behind the website, if you really want students to experience the “pain” of sampling, you could give them the range of bib numbers for the 2015 Auckland marathon (20 to 35951, although not all numbers in this range are used) and get them to generate random numbers on their calculator to select members of their sample. They can then go to the race results website to look up each runner in their sample and record the information needed for an investigation. If you would like the whole population data set for the Auckland marathon 2015 you can access that here.
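The calculator procedure could also be sketched in Python. The only details taken from above are the bib range (20 to 35951) and the fact that some numbers in it are unused. The function `draw_bibs` and the `is_used` check are hypothetical stand-ins for a student looking a number up on the results website and redrawing when no runner has that bib.

```python
import random

BIB_MIN, BIB_MAX = 20, 35951  # bib range quoted above; not every number is used


def draw_bibs(n, is_used, rng=None):
    """Draw random bib numbers until n different used bibs are found.

    `is_used` is a hypothetical stand-in for checking the results website:
    it takes a bib number and returns True if a runner has that bib.
    """
    rng = rng or random.Random()
    total = BIB_MAX - BIB_MIN + 1
    tried = set()
    chosen = []
    while len(chosen) < n and len(tried) < total:
        bib = rng.randint(BIB_MIN, BIB_MAX)  # both endpoints are included
        if bib in tried:
            continue  # already drawn this number, so draw again
        tried.add(bib)
        if is_used(bib):
            chosen.append(bib)
    if len(chosen) < n:
        raise ValueError("fewer than n used bib numbers in the range")
    return chosen


# Illustration only: pretend that even-numbered bibs are the ones in use.
def pretend_lookup(bib):
    return bib % 2 == 0


sample_bibs = draw_bibs(30, pretend_lookup, random.Random(1))
print(sorted(sample_bibs))
```

Skipping unused numbers and redrawing still gives each actual runner the same chance of selection, which is why the gaps in the bib range do not matter for the sample.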

This post is based on a plenary I did for the Christchurch Mathematical Association (CMA) Statistics Day in November 2015 where I presented 10 ways to embrace the awesomeness that is our statistics curriculum. You can find all the posts related to this plenary in one place here as they are written.