Helping students to estimate mean and standard deviation

Estimating the mean and standard deviation of a discrete random variable is something we expect NZ students to be able to do by the time they finish Year 13 (Grade 12). The idea is that students estimate these properties of a distribution using visual features of a display (e.g. a dot plot) and, ideally, these measures are visually and conceptually attached to a real data distribution with a context and not treated entirely as mathematical concepts.

At the start of this year I went looking for an interactive dot plot to use when reviewing mean and standard deviation with my intro-level statistics students. Initially, I wanted something where I could drag dots around on a dot plot and show what happens to the mean, standard deviation etc. as I do this. Then I wanted something where you could drag dots on and off the dot plot, rather than having an initial starting dot plot, so students could build dot plots based on various situations. I came across a few examples of interactive-ish dot plots out there in Google-land but none quite did what I wanted (or kept the focus on what I wanted), so I decided to write my own. [Note: CODAP would have been my choice if I had just wanted to drag dots around. Extra note: CODAP is pretty awesome for many many reasons].

In my head as I developed the app was an activity I’ve used in the past to introduce standard deviation as a measure – Exploring statistical measures by estimating the ages of famous people – as well as a workshop by the awesome Christine Franklin. For NZ-based teachers (or teachers who want to come to beautiful New Zealand for our national mathematics teachers conference), Chris is one of the keynote speakers at the NZAMT 2017 conference and is running a workshop at this conference called Conceptualizing Variation from the Mean: Evolving from ‘Number of Steps’ to the ‘SAD’ to the ‘MAD’ to the ‘Standard Deviation’  which you should get along to if you can. Also in my head was the idea of the mean of a distribution being like the “balancing point”, and other activities I have used in the past based on this analogy and also see-saws! My teaching colleague Liza Bolton was also super helpful at listening to my ideas, suggesting awesome ones of her own, and testing the app throughout its various versions.

dots – an interactive dot plot

You can access dots at this address: but you might want to keep reading to find out a little more about how it works 🙂 Below is a screenshot of the app, with some brief descriptions of how things are supposed to work. Current limitations for dots are that no more than 35 dots will be displayed, the axis is fixed between 0 and 34, and that dots can only be placed on whole numbers. I had played around with making these aspects of the app more flexible, but then decided not to pursue this as I’m not trying to re-create graphing/statistical software with this interactive.

Since I’ve got the It’s raining cats and dogs (hopefully) project running, I thought I’d use some of the data collected so far to show a few examples of how to use dots. [Note: The data collection phase of the cats and dogs data cards project is still running, so you can get your students involved]. Here are 15 randomly selected cats from the data cards created so far, with the age of each cat removed.

Once you get past how cute these cats are, what do you think the mean age of these cats is (in years)? Can you tell which cat is the oldest? How much variation do you think there is between the ages of these cats?

Dragging dots onto the dot plot

A dot plot can be created by dragging dots on to the plot (don’t forget to add a label for the axis like I did!)


Sending data to the dot plot

You can also add the data and the label to the URL so that the plot is ready to go. Use the structure shown below to do this, and then click on the link to see the ages of these cats on the interactive dot plot.,1,12,16,4,2,11,8,4,9,5,2,3,1,17&label=ages_of_cats_in_years
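If you wanted to generate links like this for several data sets, a small script could do it. This is only a sketch: the base address and the name of the data parameter below are placeholders I have made up for illustration (only the label parameter is visible in the link above), and the ages are invented rather than taken from the cat data cards.

```python
from urllib.parse import urlencode

# Hypothetical base address and "data" parameter name: placeholders only,
# not the real address or parameter used by dots.
BASE_URL = "https://example.org/dots/"


def build_dots_url(values, label, base_url=BASE_URL):
    """Return a link with comma-separated whole numbers and an axis label."""
    # dots shows no more than 35 dots, on whole numbers from 0 to 34
    if len(values) > 35:
        raise ValueError("dots displays no more than 35 dots")
    if any(v != int(v) or not 0 <= v <= 34 for v in values):
        raise ValueError("values must be whole numbers between 0 and 34")
    query = urlencode(
        {"data": ",".join(str(int(v)) for v in values), "label": label},
        safe=",",  # keep the commas readable instead of percent-encoding them
    )
    return f"{base_url}?{query}"


made_up_ages = [3, 1, 12, 7, 4]  # invented values, not the real cats
print(build_dots_url(made_up_ages, "ages_of_cats_in_years"))
```

The checks at the top mirror the current limitations of dots (35 dots at most, whole numbers from 0 to 34), so a link that would not display properly is caught before it is shared with students.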

Turns out China is the oldest cat in this sample.

Exploring the balance point

You can click below the dots on the axis to indicate your estimate for the mean. You could do a couple of things after this. You could click the Mean button to show the mean, and check how this compares to your estimated mean. Or you could click the Balance test button to turn it on (green), and see how well the dots balance on the point you have estimated as the mean (or both like I did).
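The balance idea can also be checked with a few lines of code. This is a minimal sketch of the arithmetic behind the analogy, not how dots itself is implemented: the signed distances from each dot to the mean add up to zero, so the plot "balances" there and tips if the estimate is too low or too high. The dot values are made up for illustration.

```python
def balance(dots, point):
    """Sum of signed distances from the point to each dot (0 means balanced)."""
    return sum(x - point for x in dots)


made_up_dots = [1, 2, 2, 3, 5, 8, 14]  # illustrative values only
mean = sum(made_up_dots) / len(made_up_dots)

print(mean)                     # the balance point
print(balance(made_up_dots, mean))  # zero: the dots balance here
print(balance(made_up_dots, 3))     # positive: more "weight" to the right, estimate too low
```

For this right-skewed set of dots, an estimate at the median (3) leaves the plot tipping to the right, which is the kind of adjustment students can reason about before clicking the Mean button.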


Estimating standard deviation

Estimating standard deviation is hard. I try not to use “rules” that only work with Normally distributed-ish data (like take the range and divide by six) and aren’t based on what the standard deviation is a measure of. Visualising standard deviation is also a tricky thing. In the video below I’ve gone with two approaches: one uses a Chrome extension, Web Paint, to draw on the plot what I think is the average distance of each dot from the mean, and one uses the absolute deviations.
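To see how the "average distance from the mean" idea relates to the standard deviation, here is a small sketch that computes both for a made-up set of dots. It uses Python's statistics module; which version of the standard deviation dots reports (dividing by n or by n - 1) is not something I'm assuming here, so both are shown.

```python
import statistics

made_up_dots = [1, 2, 2, 3, 5, 8, 14]  # illustrative values only

mean = statistics.mean(made_up_dots)

# Mean absolute deviation: the average distance of each dot from the mean
abs_deviations = [abs(x - mean) for x in made_up_dots]
mad = sum(abs_deviations) / len(made_up_dots)

# Standard deviation: square the distances, average them, then take the square root
sd_divide_by_n = statistics.pstdev(made_up_dots)
sd_divide_by_n_minus_1 = statistics.stdev(made_up_dots)

print(f"mean = {mean}")
print(f"mean absolute deviation = {mad:.2f}")
print(f"standard deviation (divide by n) = {sd_divide_by_n:.2f}")
print(f"standard deviation (divide by n - 1) = {sd_divide_by_n_minus_1:.2f}")
```

The standard deviation is never smaller than the mean absolute deviation, because squaring gives the dots furthest from the mean more influence. That makes the average distance a reasonable starting point for a visual estimate, which can then be nudged upwards when there are dots a long way from the mean.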


Using “random distribution”

This is the option I have used the most when working with students individually. Yes, there is no context when using this option, but in my conversations with students when talking about the mean and standard deviation I’m not sure the lack of context makes it a non-concept-building activity. The short video below shows using the median as a starting point for the estimate of the mean, and then adjusting from here depending on other features of the distribution (e.g. shape). The video ends by dragging a dot around to see what happens to the different measures, since that was the starting point for developing dots 🙂


Other ideas for using dots?

Share them below the related Facebook post, on Twitter, or wherever – I’d be super keen to hear whether you find this interactive dot plot useful for teaching students how to estimate mean and standard deviation 🙂

PS no cats were harmed in the making of this GIF

It’s raining cats and dogs (hopefully)

In April 2017, I presented an ASA K-12 statistics education webinar: Statistical reasoning with data cards (webinar). Towards the end of the webinar, I encouraged teachers to get students to make their own data cards about their cats. A few days later, I then thought that this could be something to get NZ teachers and students involved with. Imagine a huge collection of real data cards about dogs and cats? Real data that comes from NZ teachers and students? Like Census At School but for pets 🙂 I persuaded a few of my teacher friends to create data cards for their pets (dogs or cats) and to get their students involved, to see whether this project could work. Below is a small selection of the data cards that were initially created (beware of potential cuteness overload!)

The project then expanded to include more teachers and students across NZ, and even the US, and I’ve now decided to keep the data card generator (and collection) page open so that the set of data cards can grow over time. Please use the steps below to get students creating and sharing data cards about their pets.

Creating and sharing data cards about dogs and cats

Inevitably, there will be submissions made that are “fake”, silly or offensive (see below).

Data cards submitted to the project won’t automatically be added to any public sets of data cards, and will be checked first. Just like with any surveying process that is based on self-selection, is internet based and relies on humans to give honest and accurate answers, there is the potential for non-sampling errors. To help reduce the quantity of “fake” data cards, if you are keen to have your students involved with this project it would be great if you could do the following:

1. Talk to your students about the project and explain that the data cards will be shared with other students. They will be sharing information about their pet and need to be OK with this (and don’t have to!). The data will be displayed with a picture of their pet, so participation is not strictly anonymous. All of this is important to discuss with students as we need to educate students about data privacy 🙂

2. When students submit their data, they are given the finished data card which they can save. Set up a system where students need to share the data card they have created with you e.g. by saving into a shared Google drive or Dropbox, or by emailing the data card to you. The advantage for you of setting up this system is that you get your class/school set of data cards to use however you want. The advantage for me is that this level of “watching” might discourage silly data cards being created.

3. Share this link with your students and let the rain of cats and dogs begin!

Pet data cards

The data collection period for this set of data cards was 1 May 17 to 19 May 17.

The diagram below shows the data included on each data card:

Additional data that could be used from each data card includes:

  • Whether the pet photo was taken inside or outside
  • Whether the pet photo is rotated (and the angle of rotation)
  • The number of letters in the pet name
  • The number of syllables in the pet name

PDF of all data cards: click to download


Which one doesn’t belong …. for stats?

If you haven’t heard of the activity Which one doesn’t belong? (WODB), it involves showing students four “things” and asking them to describe/argue which one doesn’t belong. There are heaps of examples of Which one doesn’t belong? in action for math(s) on the web, Twitter, and even in a book. From what I’ve seen, for math(s) I think the activity is pretty cool. In terms of whether WODB works for stats, however, I’m not so sure. Perhaps for definitions, facts, static pieces of knowledge it could work (?), but in terms of making comparisons involving data and its various representations (including graphs/displays), I need more convincing. There’s something different between comparing properties of shapes (for example), which remain fixed, and comparing data about something/someone, which could vary.

For example, What cat doesn’t belong? for the four “stats cats” data cards shown below.

To make comparisons between the four cats means to reason with data, but if I am considering only the data provided in these four data cards then these comparisons are made without uncertainty. For example, I can say definitively, for these four cats, that:

  • Elliot is the only cat with a name that has three syllables,
  • Molly is the only female cat,
  • Joey is the only cat that is both an inside and outside cat,
  • Classic is the only cat that uses a cat door.

I could argue many different cases for which cat (or photo) does not belong. This is all cool, but doesn’t feel like statistics to me. Statistics is all about using data to make decisions in the face of uncertainty, by appreciating different sources of variation and considering how to deal with these. In particular, inferential reasoning involves going beyond the data at hand, thinking about generalisability, considering the quality and quantity of data available, and appreciating/communicating the possibility of being wrong no matter how “right” the methodology.

So while I appreciate that WODB allows for “not just one correct answer” and the development of argumentation skills, I’d be happier if this kind of activity within statistics teaching led to the posing of statistical investigative questions (SIQ): WODB->SIQ. Why? We need more data and more of an idea of where the data came from to really answer the really interesting questions that comparing these four cats might provoke us to consider. We need students to feel the uncertainty that comes from thinking and reasoning statistically and to help students find ways to deal with this uncertainty. We also need students to care about the questions being asked of the data – my worry here is that otherwise the question students might ask when using WODB is Who cares which one doesn’t belong? 🙂

Questions I have when looking at these stats cats data cards, and which are interesting to me, are: I wonder …. How many syllables do cats’ names have? Do most cats have two syllable names? Is Elliot (my cat!) an unusual name for this reason? Do I spend too much on cat food ($NZD30 per week)? Or maybe black cats are more expensive to feed? I won’t be able to get definitive answers to these questions, but by collecting more data and investigating these questions using statistical methods I can get a better understanding of what could be plausible answers.

PS Want some of these data cards? Head here –> It’s raining cats and dogs (hopefully)

Statistical reasoning with data cards (webinar)

UPDATE: The video of the webinar is now available here.

I’m super excited to be presenting the next ASA K-12 Statistics Education Webinar. The webinar is based on one of my sessions from last year’s Meeting Within a Meeting (MWM) and will be all about using data cards featuring NZ data/contexts. I’ll also be using the digital data cards featured in my post Initial adventures in Stickland if you’d like to see these in “teaching action”.

The webinar is scheduled for Thursday April 20 9:30am New Zealand Time (Wednesday April 19 at 5:30 pm Eastern Time, 2:30 pm Pacific), but if you can’t watch it live a video of the webinar will be made available after the live presentation 🙂

Here are all the details about the webinar:

Title: Statistical Reasoning with Data Cards

Presenter: Anna-Marie Fergusson, University of Auckland

Abstract: Using data cards in the teaching of statistics can be a powerful way to build students’ statistical reasoning. Important understandings related to working with multivariate data, posing statistical questions, recognizing sampling variation and thinking about models can be developed. The use of real-life data cards involves hands-on and visual-based activities. This talk will present material from the Meeting Within a Meeting (MWM) Statistics Workshop held at JSM Chicago (2016) which can be used in classrooms to support teaching within the Common Core State Standards for Mathematics. Key teaching and learning ideas that underpin the activities will also be discussed.

To RSVP to participate in the live webinar, please use the following link:

The ASA is offering this webinar without charge and only internet and telephone access are necessary to participate. This webinar series was developed as part of the follow-up activities to the Meeting Within a Meeting (MWM) Workshop for Math and Science teachers held in conjunction with the Joint Statistical Meetings ( MWM will be held again in Baltimore, MD on August 1-2, 2017.  For those unavailable to participate in the live webinar, ASA will record this webinar and make it available after the live presentation. Previous webinar recordings are available at

A stats cat in a square?

On Twitter a couple of days ago, I saw a tweet suggesting that if you mark out a square on your floor, your cat will sit in it.


Since I happen to have a floor, a cat, and tape I thought I’d give it a go. You can see the result at the top of this post 🙂 Amazing right?

Well, no, not really. I marked out the square two days ago, and our cat Elliot only sat in the square today.

Given that:

  • our cat often sits on the floor
  • our cat often sits on different parts of said floor
  • that we have a limited amount of floor
  • I marked out the square in an area that he likes to sit
  • that we were paying attention to where on the floor our cat sat

… and a whole lot of other conditions, it actually isn’t as amazing as Twitter thinks. Also, my hunch is that people who do witness their cat sitting in the square post this on Twitter more often than those who give up waiting for the cat to sit in the square.

Below is a little simulation based on our floor size and the square size we used, taking into account our cat’s disposition for lying down in places. It’s just a bit of fun, but the point is that with random moving and stopping within a fixed area, if you watch long enough the cat will sit in the square 🙂
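For anyone who wants to play with the idea in code, here is a much simpler sketch than the simulation described above. Everything in it is an assumption for illustration: the floor and square sizes are made up, the cat is treated as a single point that stops at uniformly random places, and there is no attempt to model a preference for particular spots.

```python
import random


def cat_sits_in_square(n_stops, floor=(4.0, 3.0), square=(1.0, 1.0, 0.5), rng=random):
    """True if any of n_stops random stopping points lands inside the square.

    floor is (width, depth) in metres; square is (x, y, side) of the taped square.
    All of these sizes are made up for illustration.
    """
    width, depth = floor
    sx, sy, side = square
    for _ in range(n_stops):
        x, y = rng.uniform(0, width), rng.uniform(0, depth)
        if sx <= x <= sx + side and sy <= y <= sy + side:
            return True
    return False


def estimate_chance(n_stops, trials=5_000, seed=1):
    """Proportion of simulated 'watching periods' where the cat sat in the square."""
    rng = random.Random(seed)
    hits = sum(cat_sits_in_square(n_stops, rng=rng) for _ in range(trials))
    return hits / trials


for n_stops in (5, 50, 200):
    print(n_stops, "stops:", estimate_chance(n_stops))
```

Even with a square that covers only about 2% of this made-up floor, the estimated chance of seeing the cat in the square at least once climbs steadily as the number of stops increases, which is the "watch long enough" point.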

PS The cat image is by Lucie Parker. And yes, the cat only has to be partially in the square when it stops but I figured that was close enough 🙂

Using data and simulation to teach probability modelling

This post provides the notes and resources for a workshop I ran for the Auckland Mathematical Association (AMA) on using data and simulation to teach probability modelling (specifically AS91585/AS91586). This post also includes notes about a workshop I ran for the AMA Statistics Teachers’ Day 2016 about my research into this area.

Using data in different ways

The workshop began by looking at three different questions from the AS91585 2015 paper. What was similar about all three questions was that they involved data; however, how this data was used with a probability model was different for each question.

For the first question (A), we have data on a particular shipment of cars: we know the proportion of cars with the petrol cap on the left-hand side of the car and the percentage of cars that are silver. We are then told that one of the cars is selected at random, which means that we do not need to go beyond this data to solve the problem. In this situation, the “truth” is the same as the “model”. Therefore, we are finding the probability.

For the second question (B), we have data on 10 cars getting petrol: we know the proportion of cars with petrol caps on the left-hand side of the car. However, we are asked to go beyond this data and generalise about all cars in NZ, in terms of their likelihood of having petrol caps on the left-hand side of the cars. This requires developing a model for the situation. In this situation, the “truth” is not necessarily the same as the “model”, and we need to take into account the nature of the data (amount and representativeness) and consider assumptions for the model (the conditions, the model applies IF…..). Therefore, when we use this model we are finding an estimate for the probability.

For the third question (C), we have data on 20 cars being sold: we know the number of cars that have 0 for the last digit of the odometer reading (six). What we don’t know is if observing six cars with odometer readings that end in 0 is unusual (and possibly indicative of something dodgy). This requires developing a model to test the observed data (proportion), basing this model on an assumption that the last digit of an odometer reading should just be explained by chance alone (equally likely for each digit). Therefore, when we use this model, we generate data from the model (through simulation) and use this simulated data to estimate the chance of observing 6 (or more) cars out of 20 with odometer readings that end in 0. If this “tail proportion” is small (less than 5%), we conclude that chance was not acting alone.
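The simulation for question (C) can be sketched in a few lines. This is one possible way to do it rather than the method expected in the exam: each simulated car has a 1 in 10 chance of an odometer reading ending in 0, the number of trials and the seed are arbitrary choices, and the exact binomial calculation is included only as a check on the simulated tail proportion.

```python
import random
from math import comb


def simulated_tail(n_cars=20, observed=6, p=0.1, trials=10_000, seed=2015):
    """Proportion of simulated samples with at least `observed` readings ending in 0."""
    rng = random.Random(seed)
    at_least_observed = 0
    for _ in range(trials):
        # Chance alone: each last digit is equally likely, so P(ends in 0) = 0.1
        zeros = sum(rng.random() < p for _ in range(n_cars))
        if zeros >= observed:
            at_least_observed += 1
    return at_least_observed / trials


def exact_tail(n_cars=20, observed=6, p=0.1):
    """Exact binomial probability of `observed` or more, for comparison."""
    return sum(
        comb(n_cars, k) * p**k * (1 - p) ** (n_cars - k)
        for k in range(observed, n_cars + 1)
    )


print(f"simulated tail proportion: {simulated_tail():.4f}")
print(f"exact binomial tail:       {exact_tail():.4f}")
```

Under the chance-alone model the tail proportion is a little over 1%, well below the 5% guideline, so six or more cars out of 20 with readings ending in 0 would be unusual if chance were acting alone.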

There are a lot of ideas to get your head around! Sitting in there are ideas around what probability models are and what simulations are (see the slides for more about this) and as I discovered during my research last year with teachers and probability distribution modelling, these ideas may need a little more care when defining and using with students. The main reason I think we need to be careful using data when teaching probability modelling is because it matters whether you are using data from a real situation, where you do not know the true probability, or whether you are using data that you have generated from a model through simulation. Each type of data tells you something different and is used in a different way in the modelling process. In my research, this led to the development of the statistical modelling framework shown below:

All models are wrong but some are more wrong than others: Informally testing the fit of a probability distribution model

At the end of 2016, I presented a workshop at the AMA Statistics Teachers’ Day based on my research into probability distribution modelling (AS91586). This 2016 workshop also went into more detail about the framework for statistical modelling I’m developing. The video for this workshop is available here on Census At School NZ.

We have a clear learning progression for how “to make a call” when making comparisons, but how do we make a call about whether a probability distribution model is a good model? As we place a greater emphasis on the use of real data in our statistical investigations, we need to build on sampling variation ideas and use these within our teaching of probability in ways that allow for key concepts to be linked but not confused. Last year I undertook research into teachers’ knowledge of probability distribution modelling. At this workshop, I shared what I learned from this research, and also shared a new free online tool and activities I developed that allows students to informally test the fit of probability distribution models.

During the workshop, I showed a live traffic camera from Wellington (, which was the context for a question developed and used (the starter question AKA counting cars). Before the workshop, I recorded five minutes of the traffic and then set up a special html file that pauses the video every five seconds. This was so teachers at the workshop (and students) could count the number of cars passing different points on the motorway (marked with different coloured lines) every five seconds. To use this html file, you need to download both of these files into the same folder – traffic.html and traffic.mp4. I’ve only tested my files using the Chrome browser 🙂

If you don’t want to count the cars yourself, you can head straight to the modelling tool I developed as part of my research: In the dropdown box under “The situation” there are options for the different coloured points/lines on the motorway. The idea behind getting teachers and students to actually count the cars was to try to develop a greater awareness of the complexity of the situation being modelled, to reinforce the idea that “all models are wrong” – that they are approximations of reality but not the truth. Also, I wanted to encourage some deeper thinking about limitations of models. For example, in this situation, looking at five second periods, there is an upper limit on how many cars you can count due to speed restrictions and following distances. We also need to get students to think more about the model in terms of sample space (the set of possible outcomes) and the shape of the distribution (which is linked to the probabilities of each of these outcomes), not just the conditions for applying the probability distribution 🙂
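To give a feel for what informally testing the fit of a model could look like in code, here is a rough sketch. It is not how the modelling tool works internally. It assumes a Poisson model has been chosen for the number of cars per five-second interval, uses made-up counts rather than the real traffic data, and compares each observed proportion with the middle 95% of proportions seen in samples simulated from the model.

```python
import math
import random


def poisson_draw(lam, rng):
    """One Poisson(lam) value using Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def proportion_bands(lam, sample_size, outcomes, n_samples=1000, seed=7):
    """Middle 95% of simulated proportions for each outcome under Poisson(lam)."""
    rng = random.Random(seed)
    simulated = {k: [] for k in outcomes}
    for _ in range(n_samples):
        sample = [poisson_draw(lam, rng) for _ in range(sample_size)]
        for k in simulated:
            simulated[k].append(sample.count(k) / sample_size)
    bands = {}
    for k, props in simulated.items():
        props.sort()
        bands[k] = (props[int(0.025 * n_samples)], props[int(0.975 * n_samples) - 1])
    return bands


# Made-up data: {number of cars counted: number of five-second intervals}
observed = {0: 4, 1: 11, 2: 17, 3: 14, 4: 9, 5: 5}
n_intervals = sum(observed.values())
mean_count = sum(cars * freq for cars, freq in observed.items()) / n_intervals

bands = proportion_bands(mean_count, n_intervals, observed.keys())
for cars, freq in observed.items():
    low, high = bands[cars]
    prop = freq / n_intervals
    print(f"{cars} cars: observed {prop:.2f}, simulated {low:.2f} to {high:.2f}, "
          f"inside: {low <= prop <= high}")
```

Observed proportions that sit outside the simulated range are a prompt to question the model, not a formal test. The sketch also shows one way the model is "wrong": a Poisson model gives some probability to every possible count, however large, while the motorway has an upper limit on how many cars can pass in five seconds.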

In terms of the modelling tool, I developed a set of teaching notes early last year, which you can access in the Google drive below. This includes some videos I made demonstrating the tool in action 🙂 I also started developing a virtual world (stickland but this is still a work in progress. Once you have collected data on either the birds or the stick people, you can copy and paste it into the modelling tool. There will be more variables to collect data on in the future for a wider range of possible probability distributions (including situations where none is applicable).

Slides from IASC-ARS/NZSA 2017 talk

Resources for workshop (via Google Drive)

Developing learning and formative assessment tasks for evaluating statistically-based reports

This post provides the notes and resources for a workshop I ran for the Auckland Mathematical Association (AMA) on developing learning and formative assessment tasks for evaluating statistically-based reports (specifically AS91584).

Notes for workshop

The starter task for this workshop was based around a marketing leaflet I received in my letterbox for a local school back in 2014. I was instantly skeptical about the claims being made by the school and went straight to sources of public data to check the claims. As was often the case, this personal experience turned into an activity I used with my Scholarship Statistics students to help them develop their critical evaluation skills. The task, public data I used, and my attempt at answers (from my past self in 2014) are provided at the bottom of this post. My overall conclusion was that most of the claims check out until around 2011, but not so much for 2012 – 2013, leading me to speculate that the school had not updated their marketing leaflet. The starter task is all about claims and data, and not so much about statistical processes, study design, or inferential reasoning – all of which are required for students to engage with the evaluation of statistically-based reports. However, I used this task to set the focus of the workshop, which was to focus on the claims that are being made, and whether they can be supported or not, and why.

The questions used for the external assessment tasks for AS91584 (available here) are designed to help scaffold students to critique the report in terms of the claims, statements or conclusions made within the report. Students need to draw on what has been described in the report and relevant contextual and statistical knowledge to write concise and clear discussion points that show statistical insight and answer the questions posed. This is hard for students. Students find it easy to write very creative, verbose and vague responses, but harder to write responses that are not based only on speculation or that are not rote learned. We see this difficulty with internally assessed tasks as well, so it’s not that surprising that students struggle to write concise, clear, and statistically insightful discussion points under exam pressure.

Teachers who I have spoken to who have taught this standard (which includes me) really enjoy teaching statistical reports to students. In reflections and conversations with teachers on how we could further improve the awesome teaching of statistical reports, a few ideas or suggestions emerged:

  • Perhaps we focus our teaching too much on content, keeping aspects such as margins of error and confidence intervals, observational studies vs experiments, and non-sampling errors too separate?
  • Perhaps we focus too much on “good answers” to questions about statistical reports, rather than “good questions” to ask of statistical reports?

Great ideas for teaching statistical reports can be sourced from Census at School NZ or from conversations with “statistical friends” (see the slides for more details). These include ideas such as: experiencing the study design first and then critiquing a statistical report that used a similar design, using matching cards to build confidence with different ideas, keeping a focus on the statistical inquiry cycle, teaching statistical reports through the whole year rather than in one block, and teaching statistical reports alongside other topics such as time series, bivariate analysis, and bootstrapping confidence intervals. I quite like the idea of the “seven deadly sins” of statistical reports, but didn’t quite have enough time to develop what these could be before the workshop – feel free to let me know if you come up with a good set! [Update: Maybe these work or could be modified?]

When I taught statistical reports in 2013 (the first year of the new achievement standard/exam), I was gutted when I got my students’ results back at the start of 2014. I reflected on my teaching and preparation of students for the exam and realised I had been too casual about teaching students how to respond to questions. In particular, I had expected my “good” students would gain excellence (the highest grade – showing statistical insight) because they had gained excellences for the internally assessed standards or were strong contenders to get a Scholarship in Statistics. So, a bit later in 2014, when the assessment schedules came out, I looked carefully at what had been written as expected responses. To me, it seemed that a good discussion point had to address three questions: What? Why? How? Depending on the question being asked, the whats, whys and hows were a bit different, but at the time (only having one exam and schedule to go with!) it seemed to make sense. At least, in my teaching that year with students, I felt that using this simple structure allowed me to teach and mark discussion points more confidently. You can see more details for this “discussion point” structure in the slides.

The last part of the workshop involved providing teachers with one of three statistical reports (all around the theme of coffee of course!) and asking them, in groups, to develop a formative assessment task. After identifying one or two key claims made in the report, they had to select three or four questions from previous years’ exams that would be relevant for questioning the report in front of them (relevant to the conclusions made in the report). We didn’t quite get this finished in the workshop – the goal was to create three formative assessment tasks that could be shared! However, perhaps some of the teachers who attended the workshop will go on to develop formative assessment tasks and email these to me to share at a later date. I do feel strongly that all teachers of statistics should feel confident to write their own formative or practice assessment tasks for whatever they are teaching – if you’re not sure about what understanding you are trying to assess and what questions to ask to assess that understanding, how do you feel confident with what to teach? I’m hoping to launch a project next term to help support statistics teachers to feel more confident with writing formative assessment tasks, so watch this space 🙂

Resources for workshop (via Google Drive)

Ideas for using technology to design and carry out experiments online

This post provides the notes for a workshop I ran at the Otago Mathematics Association (OMA) conference about using technology to design and carry out experiments online.

Actually, at the moment this post only provides a PDF of the slides I used for the workshop – I will update this post with more detail later this year 🙂 Links and documents referred to in the slides are at the bottom of this page.

Associated links/documents

Initial adventures in Stickland


This post provides the notes for a workshop I ran at the Otago Mathematics Association (OMA) Conference about using data challenges to encourage statistical thinking.

Until last week, I had never re-presented or adapted a workshop that I had developed in a previous year. So it was really interesting to take this workshop on data challenges, which I had presented at the AMA and CMA stats days last year, and work through it again with a new bunch of awesome teachers in Dunedin. I wrote notes about this workshop last year – Using data challenges to encourage statistical thinking – so this post will just share a few things I tweaked the second time around, including an activity we tried in Stickland 🙂

Some changes and additions

To show an example of a predictive model in action, we used one of a few online tools which attempt to predict your age using your name (based on US data) e.g. I also demonstrated another online tool that attempts to predict your gender based on writing ( by using my abstract for this workshop (it did correctly predict, based on the writing being formal, that it was written by a female). For the actual data challenge itself using the celebrity data, I purposefully removed Dr Dre from the training data set to make it easier to explore the data without worrying about how to handle his extremely high earnings for 2014 (new link here).

Testing Stickland

Another thing I changed about the workshop this time around was that rather than use physical data cards (these Census at school stick people data cards), we tried out my new digital data cards in the virtual world of Stickland. I’ve already shared a little bit about the ideas behind Stickland – see the Welcome to stickland! post – so what follows is an example of how we used Stickland in the workshop. (Just a quick reminder that the data cards are real students from the NZ Census At School 2015 data, the names being the only variable that is not real).


The activity starts with the idea of wanting to predict whether a stick person chosen at random from Stickland uses Facebook or not. If you head to Stickland, the first thing you could do is select a sample of stick people and see what proportion of them use Facebook. I got the teachers in this workshop to select 20 stick people and then let them play with moving the data cards around in the grey screen below (click or touch a card to drag it somewhere else on the screen, e.g. to sort the cards into Facebook users and non-Facebook users).


For the sample shown above, there are equal numbers of Facebook users and non-Facebook users, but of course this will vary from sample to sample. I then told the teachers that this particular stick person is a Snapchat user, and asked them if this changes their prediction of whether they are a Facebook user or not. One way to explore this is to create a two-way table with the cards (see below) and then reason with this.


Most of the different samples showed a similar story to the sample above: Of the Snapchat users, most were Facebook users and of the non-Snapchat users, most were non-Facebook users. I then suggested (if we had time) we could also explore whether knowing the gender and age of the stick person would help us build a better model for predicting Facebook usage. At this stage (considering multiple variables/factors) I would want the students to move into software that allows them to explore the data more deeply (more about how that is possible is discussed in the Welcome to stickland! post). We didn’t do this in the workshop and the teachers had to leave Stickland perhaps before they wanted to 🙂
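The two-way table reasoning described above can also be sketched in a few lines of Python. This is only a rough illustration: the 20 cards below are invented for the example (they are not real Stickland data cards), and the function names are my own. The idea is to compare the overall proportion of Facebook users with the proportion among Snapchat users and among non-Snapchat users.

```python
from collections import Counter

# Hypothetical sample of 20 data cards as (uses_snapchat, uses_facebook).
# These values are invented for illustration, not taken from Stickland.
cards = (
    [(True, True)] * 8      # Snapchat and Facebook
    + [(True, False)] * 2   # Snapchat only
    + [(False, True)] * 2   # Facebook only
    + [(False, False)] * 8  # neither
)


def two_way_table(cards):
    """Count the cards in each (snapchat, facebook) cell."""
    return Counter(cards)


def facebook_proportion(cards, snapchat=None):
    """Proportion of Facebook users, optionally restricted by Snapchat use.

    Assumes the chosen subset is not empty.
    """
    subset = [fb for sc, fb in cards if snapchat is None or sc == snapchat]
    return sum(subset) / len(subset)


table = two_way_table(cards)
overall = facebook_proportion(cards)
given_snapchat = facebook_proportion(cards, snapchat=True)
given_no_snapchat = facebook_proportion(cards, snapchat=False)

print("Cell counts:", dict(table))
print("P(Facebook):", overall)
print("P(Facebook | Snapchat):", given_snapchat)
print("P(Facebook | no Snapchat):", given_no_snapchat)
```

With this made-up sample, knowing that the stick person uses Snapchat moves the prediction from a coin toss to "probably a Facebook user", which is the kind of reasoning the cards are meant to prompt.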

Where to next?

Stickland is just in “proof of concept” form at the moment and will no doubt have lots of bugs and weird features. In the Welcome to stickland! post, I discuss the influence of others in developing these digital data cards, in particular Pip Arnold and her work with statistical investigations and data cards that stretches back to at least 2005 (if not earlier!). Feel free to have a play and to let me know what you think about the concept, but this is definitely a possible project for 2017 and not intended to be a fully featured product yet.

Mind the stats?


Have you noticed how Google sometimes gives the top page in your search results a little summary box? For example, if you Google “how to plan a honeymoon”, you get this:


Since I didn’t do number two on this list, my job for tonight was to check out trains for our travel in the UK leg of our honeymoon. After my first Google search, I got a little distracted and consequently typed up this short post 🙂 I realised part way through that “mind the gap” is more of a London Underground thing than a UK train travel thing, but it’s late so hopefully the reference still makes sense.

My first (and only) search tonight was for a train from London to Cambridge. Before even clicking through to the website listed, I got to read this little “statistical report” 🙂


The first two sentences got me questioning what “fastest journey time” means, since how can the “average journey time” be lower than the shortest journey time? The third sentence made me shake my head at the misuse of our special stats word “average”, and I automatically re-worded that sentence in my head to “on weekdays there are, on average, 96 trains per day…”

So not only because I actually needed to find out about trains from London to Cambridge, but also because I was curious to find out what “fastest journey time” means, I clicked through to the website.

When you scroll down to the bottom you get this nice table:


This gives some immediate answers to my confusion about the Google search summary – I think. “Slowest route” actually means the minimum time, and “Fastest route” means the maximum time. At least now the average journey time of one hour sits between these two numbers, but did you notice when you scrolled down the page that there were some routes listed with times greater than 63 minutes, the supposed “fastest route”?

Me too, so I went through all routes for the next 24 hours (starting from 8:44am London time) and listed their times:


There’s bound to be a few mistakes in there when I was converting from hours to minutes 🙂 But to finish this short critique, let’s look at the data:


For this particular 24-hour period (from Monday 21st November 8:44am) there were 76 trains from London to Cambridge, with a mean journey time of around 64 minutes (based on the advertised times). If I wanted to check out the claims about the average number of trains per weekday and the average journey time, I’d need a better sampling method and more “weekdays” of data. But this sample does offer evidence to contradict the claims about “shortest” and “fastest” journey times.

Unless those terms still don’t mean what I think they mean, even when I reverse them 🙂
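For anyone who wants to repeat this kind of check without converting hours to minutes by hand, here is a minimal Python sketch. The five durations are invented for illustration (they are not the 76 times I listed), the function names are my own, and the claimed minimum of 45 minutes is a made-up placeholder. Only the 63-minute figure comes from the table discussed above.

```python
import re
from statistics import mean


def to_minutes(text):
    """Convert a duration such as '1h 23m', '1h' or '48m' to whole minutes."""
    match = re.fullmatch(r"\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*", text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes


def summarise(times, claimed_min, claimed_max):
    """Summarise journey times and list any that fall outside the claims."""
    return {
        "count": len(times),
        "mean": mean(times),
        "below_claimed_min": [t for t in times if t < claimed_min],
        "above_claimed_max": [t for t in times if t > claimed_max],
    }


# Hypothetical advertised durations, invented for this example.
advertised = ["48m", "1h 3m", "1h 23m", "52m", "1h 10m"]
times = [to_minutes(t) for t in advertised]

# claimed_min=45 is a placeholder; claimed_max=63 is the figure from the table.
summary = summarise(times, claimed_min=45, claimed_max=63)
print(summary)
```

Any times that show up in `above_claimed_max` are evidence against the claim, which is the same check I did by scrolling through the routes.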