Um ….. here’s a new tool for exploring probability distributions!

Actually, it’s not a new tool exactly, more a re-working of the existing modelling tool I’ve already shared on this blog, but with a new name and web location – the probability distribution explorer!

I developed the probability distribution explorer as part of my Masters research into teaching probability distribution modelling. The proposed teaching framework and the tool were developed in response to the use of data for distribution modelling for AS91586, in particular the need for students to demonstrate use of methods related to the distribution of true probabilities versus the distribution of model estimates of probabilities versus the distribution of experimental estimates of probabilities.

The tool was developed primarily to support comparisons of the “distribution of experimental estimates of probabilities” and “distribution of model estimates of probabilities”. When reviewing research literature, I found limited examples of how to teach this comparison using an informal approach i.e. not using a Chi-square goodness-of-fit test. Consequently, I also found a lack of statistically sound criteria to enable drawing of conclusions in such resources as textbooks, workbooks and assessment exemplars.

This led to my research, which involved a small group of New Zealand high school statistics teachers. Focusing on the Poisson distribution, I investigated the criteria used by ten Grade 12 teachers for informally testing the fit of a probability distribution model. I found that the criteria currently used by the teachers were unreliable because they could not correctly assess model fit; in particular, sample size was not taken into account.

After exploring goodness-of-fit using my visual inference tool, teachers reported a deeper understanding of model fit. In particular, they reported that the tool had allowed them to take sample size into account when testing the fit of the probability distribution model, through the visualisation of expected distributional shape variation. I’ve re-developed the tool this year to support NZQA as they explore opportunities for assessment within a digital environment. A team of teachers is developing prototype assessment activities for AS91586 and these will be trialled with students in schools later in the year.
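To make the idea of expected distributional shape variation a little more concrete, here is a minimal Python sketch. It is not the code behind the tool. It assumes a Poisson model with a made-up rate of 2 and follows just one feature of the distribution, the proportion of zero counts, across repeated simulated samples. The function names `poisson_draw` and `spread_of_zero_proportions` are hypothetical.

```python
import math
import random


def poisson_draw(rate, rng):
    """Draw one Poisson count using Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def spread_of_zero_proportions(rate, sample_size, repeats, rng):
    """Simulate `repeats` samples; return the (min, max) proportion of zeros."""
    proportions = []
    for _ in range(repeats):
        sample = [poisson_draw(rate, rng) for _ in range(sample_size)]
        proportions.append(sample.count(0) / sample_size)
    return min(proportions), max(proportions)


rng = random.Random(2017)  # fixed seed so the run is repeatable
assumed_rate = 2           # hypothetical rate, not taken from any data in this post
for n in (20, 500):
    low, high = spread_of_zero_proportions(assumed_rate, n, repeats=200, rng=rng)
    print(f"n = {n}: proportion of zeros ranged from {low:.2f} to {high:.2f}")
```

With the small sample size, the simulated proportions should range much more widely than with the large one, which is why a fixed "how close is close enough" rule that ignores sample size is unreliable.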

The video below gives a general introduction to the tool, using data on how many times I say “um” when I’m teaching. The video itself provides another source of data because, um … well, you’ll see if you watch!

More videos, teaching notes and related resources can be found here: stat.auckland.ac.nz/~fergusson/prob_dist_explorer/teachers/

A stats cat in a square?

On Twitter a couple of days ago, I saw a tweet suggesting that if you mark out a square on your floor, your cat will sit in it.

 

Since I happen to have a floor, a cat, and tape, I thought I’d give it a go. You can see the result at the top of this post 🙂 Amazing, right?

Well, no, not really. I marked out the square two days ago, and our cat Elliot only sat in the square today.

Given that:

  • our cat often sits on the floor
  • our cat often sits on different parts of said floor
  • we have a limited amount of floor
  • I marked out the square in an area that he likes to sit
  • we were paying attention to where on the floor our cat sat

… and a whole lot of other conditions, it actually isn’t as amazing as Twitter thinks. Also, my hunch is that people who do witness their cat sitting in the square post this on Twitter more often than those who give up waiting for the cat to sit in the square.

Below is a little simulation based on our floor size and the square size we used, taking into account our cat’s disposition for lying down in places. It’s just a bit of fun, but the point is that with random moving and stopping within a fixed area, if you watch long enough the cat will sit in the square 🙂

PS The cat image is by Lucie Parker. And yes, the cat only has to be partially in the square when it stops, but I figured that was close enough 🙂
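For anyone who wants to play with the idea in code, here is a rough Python sketch of "random stopping within a fixed area". It is not the simulation embedded in this post: the floor size (4 m by 3 m), the square (0.5 m sides) and the assumption that the cat stops at uniformly random points are all made up for illustration, and `stops_until_in_square` is a hypothetical name.

```python
import random


def stops_until_in_square(floor_w, floor_h, square, rng, max_stops=100_000):
    """Count random stops until one lands inside the taped square.

    `square` is (x, y, side) for the bottom-left corner and side length.
    Returns None if the cat never stops in the square within `max_stops`.
    """
    x0, y0, side = square
    for stop in range(1, max_stops + 1):
        x = rng.uniform(0, floor_w)
        y = rng.uniform(0, floor_h)
        if x0 <= x <= x0 + side and y0 <= y <= y0 + side:
            return stop
    return None


rng = random.Random(42)        # fixed seed so the run is repeatable
floor = (4.0, 3.0)             # assumed floor size in metres
square = (1.0, 1.0, 0.5)       # assumed square position and side length
waits = [stops_until_in_square(*floor, square, rng) for _ in range(2000)]
print("Average number of stops before sitting in the square:",
      round(sum(waits) / len(waits), 1))
```

Under these assumptions the square covers only a small fraction of the floor, yet every simulated cat ends up in it eventually. Waiting long enough is doing all the work.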

Using data and simulation to teach probability modelling

This post provides the notes and resources for a workshop I ran for the Auckland Mathematical Association (AMA) on using data and simulation to teach probability modelling (specifically AS91585/AS91586). This post also includes notes about a workshop I ran for the AMA Statistics Teachers’ Day 2016 about my research into this area.

Using data in different ways

The workshop began by looking at three different questions from the AS91585 2015 paper. What was similar about all three questions was that they involved data. However, how this data was used with a probability model was different for each question.

For the first question (A), we have data on a particular shipment of cars: we know the proportion of cars with the petrol cap on the left-hand side of the car and the percentage of cars that are silver. We are then told that one of the cars is selected at random, which means that we do not need to go beyond this data to solve the problem. In this situation, the “truth” is the same as the “model”. Therefore, we are finding the probability.

For the second question (B), we have data on 10 cars getting petrol: we know the proportion of cars with petrol caps on the left-hand side of the car. However, we are asked to go beyond this data and generalise about all cars in NZ, in terms of their likelihood of having petrol caps on the left-hand side of the cars. This requires developing a model for the situation. In this situation, the “truth” is not necessarily the same as the “model”, and we need to take into account the nature of the data (amount and representativeness) and consider assumptions for the model (the conditions, the model applies IF…..). Therefore, when we use this model we are finding an estimate for the probability.

For the third question (C), we have data on 20 cars being sold: we know the proportion of cars that have 0 for the last digit of the odometer reading (six out of 20). What we don’t know is if observing six cars with odometer readings that end in 0 is unusual (and possibly indicative of something dodgy). This requires developing a model to test the observed data (proportion), basing this model on an assumption that the last digit of an odometer reading should just be explained by chance alone (equally likely for each digit). Therefore, when we use this model, we generate data from the model (through simulation) and use this simulated data to estimate the chance of observing 6 (or more) cars out of 20 with odometer readings that end in 0. If this “tail proportion” is small (less than 5%), we conclude that chance was not acting alone.
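The simulation for question C can be sketched in a few lines of Python. This is only an illustration of the approach described above, assuming each last digit (0 to 9) is equally likely and the 20 cars are independent; the function names are hypothetical. The exact binomial calculation is included as a check on the simulation.

```python
import random
from math import comb


def simulated_tail_proportion(n_cars=20, observed=6, trials=10_000, seed=1):
    """Estimate P(at least `observed` cars end in 0) under the chance-alone model."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(trials):
        # Each car's last digit is 0 with probability 1/10 under the model.
        zeros = sum(1 for _ in range(n_cars) if rng.randrange(10) == 0)
        if zeros >= observed:
            extreme += 1
    return extreme / trials


def exact_tail_probability(n_cars=20, observed=6, p=0.1):
    """Exact binomial tail probability P(X >= observed)."""
    return sum(comb(n_cars, k) * p**k * (1 - p) ** (n_cars - k)
               for k in range(observed, n_cars + 1))


print("Simulated tail proportion:", simulated_tail_proportion())
print("Exact binomial tail probability:", round(exact_tail_probability(), 4))
```

Both values should come out at roughly 1%, well under the 5% guideline, so under these assumptions six or more cars out of 20 would be unusual if chance were acting alone.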

There are a lot of ideas to get your head around! Sitting in there are ideas around what probability models are and what simulations are (see the slides for more about this) and, as I discovered during my research last year with teachers and probability distribution modelling, these ideas may need a little more care when they are defined and used with students. The main reason I think we need to be careful using data when teaching probability modelling is that it matters whether you are using data from a real situation, where you do not know the true probability, or whether you are using data that you have generated from a model through simulation. Each type of data tells you something different and is used in a different way in the modelling process. In my research, this led to the development of the statistical modelling framework shown below:

All models are wrong but some are more wrong than others: Informally testing the fit of a probability distribution model

At the end of 2016, I presented a workshop at the AMA Statistics Teachers’ Day based on my research into probability distribution modelling (AS91586). This 2016 workshop also went into more detail about the framework for statistical modelling I’m developing. The video for this workshop is available here on Census At School NZ.

We have a clear learning progression for how “to make a call” when making comparisons, but how do we make a call about whether a probability distribution model is a good model? As we place a greater emphasis on the use of real data in our statistical investigations, we need to build on sampling variation ideas and use these within our teaching of probability in ways that allow for key concepts to be linked but not confused. Last year I undertook research into teachers’ knowledge of probability distribution modelling. At this workshop, I shared what I learned from this research, and also shared a new free online tool and activities I developed that allow students to informally test the fit of probability distribution models.

During the workshop, I showed a live traffic camera from Wellington (http://wixcam.citylink.co.nz/nph-webcam.cgi/terrace-north), which was the context for a question developed and used (the starter question AKA counting cars). Before the workshop, I recorded five minutes of the traffic and then set up a special html file that pauses the video every five seconds. This was so teachers at the workshop (and students) could count the number of cars passing different points on the motorway (marked with different coloured lines) every five seconds. To use this html file, you need to download both of these files into the same folder – traffic.html and traffic.mp4. I’ve only tested my files using the Chrome browser 🙂

If you don’t want to count the cars yourself, you can head straight to the modelling tool I developed as part of my research: http://learning.statistics-is-awesome.org/modelling-tool/. In the dropdown box under “The situation” there are options for the different coloured points/lines on the motorway. The idea behind getting teachers and students to actually count the cars was to try to develop a greater awareness of the complexity of the situation being modelled, to reinforce the idea that “all models are wrong” – that they are approximations of reality but not the truth. Also, I wanted to encourage some deeper thinking about limitations of models. For example, in this situation, looking at five second periods, there is an upper limit on how many cars you can count due to speed restrictions and following distances. We also need to get students to think more about the model in terms of sample space (the set of possible outcomes) and the shape of the distribution (which is linked to the probabilities of each of these outcomes), not just the conditions for applying the probability distribution 🙂

In terms of the modelling tool, I developed a set of teaching notes early last year, which you can access in the Google Drive below. This includes some videos I made demonstrating the tool in action 🙂 I also started developing a virtual world (stickland http://learning.statistics-is-awesome.org/stickland-modelling/) but this is still a work in progress. Once you have collected data on either the birds or the stick people, you can copy and paste it into the modelling tool. There will be more variables to collect data on in the future for a wider range of possible probability distributions (including situations where none is applicable).

Slides from IASC-ARS/NZSA 2017 talk

https://goo.gl/dfA9MF

Resources for workshop (via Google Drive)

How many of my emails will get rolled up this week?

all_rolled_up
At the start of the year I started using a service called unroll me with my gmail account. It allows you to wrap up regular or subscription emails into one daily email digest. It takes a number of months to set up the service to capture all your regular or subscription emails, but I have found it helpful in reducing the clutter in my email, so it has been worth the minimal effort.

I noticed – as you do when you’re a stats teacher – that the number of emails that are rolled up per day varies. I wondered if there was anything going on – any patterns, trends etc. –  so went back over the last couple of months and recorded how many emails were wrapped up per day.

So here’s a little challenge for your students 🙂

Using the data on the number of my emails wrapped per day for the last few months, can they predict how many of my emails will be wrapped up over the next four days (Tuesday, Wednesday, Thursday and Friday)?

Here’s the data…….

Jump with the data into iNZight lite

Download the data as a CSV

Link for data: http://statistics-is-awesome.org/rolled_up_emails.csv

Raw data as ordered counts (first count is a Monday)

14,11,25,24,24,36,21,12,13,23,28,19,27,8,15,14,19,24,26,24,7,21,19,32,26,25,25,12,14,21,16,27,25,23,12,13,24,22,19,21,25,10,19,16,18,32,24,23,10,14,22,30,24,25,24,15,15,21,27,22,32,26,11,18,23,28,32,18,32,13,18,26,26,35,23,22,13,14,18,22,30,26,26,9,21,16,27,21,25,20,10,17,22,31,15,27,25,10,16,20,17,27,24,22,15,22

Not sure how to get the students started?

Here are some ideas you could give to students:

  • Graph the data in Excel or another spreadsheet and use “your eyes” and/or a sketch to make the prediction
  • Import the data into iNZight (or equivalent) and try to use a time series model to make the predictions
  • Find the mean number of emails rolled up for each day of the week and use these to make the predictions
  • Use a probability distribution to model the number of emails rolled up each day and generate four random outcomes from this model to make the predictions
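As one possible starting point, the third idea (day-of-week means) could be sketched in Python like this. The counts are the raw data listed above, with the first count taken to be a Monday as stated; the function name `day_of_week_means` is hypothetical, and using the mean as the prediction is just one simple choice.

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Raw counts from this post, in order; the first count is a Monday.
counts = [
    14, 11, 25, 24, 24, 36, 21, 12, 13, 23, 28, 19, 27, 8,
    15, 14, 19, 24, 26, 24, 7, 21, 19, 32, 26, 25, 25, 12,
    14, 21, 16, 27, 25, 23, 12, 13, 24, 22, 19, 21, 25, 10,
    19, 16, 18, 32, 24, 23, 10, 14, 22, 30, 24, 25, 24, 15,
    15, 21, 27, 22, 32, 26, 11, 18, 23, 28, 32, 18, 32, 13,
    18, 26, 26, 35, 23, 22, 13, 14, 18, 22, 30, 26, 26, 9,
    21, 16, 27, 21, 25, 20, 10, 17, 22, 31, 15, 27, 25, 10,
    16, 20, 17, 27, 24, 22, 15, 22,
]


def day_of_week_means(values):
    """Group the counts by day of the week and return the mean for each day."""
    by_day = {day: [] for day in DAYS}
    for index, value in enumerate(values):
        by_day[DAYS[index % 7]].append(value)
    return {day: sum(v) / len(v) for day, v in by_day.items() if v}


means = day_of_week_means(counts)

# Work out which four days follow the last recorded count.
last_day_index = (len(counts) - 1) % 7
next_days = [DAYS[(last_day_index + k) % 7] for k in range(1, 5)]

for day in next_days:
    print(f"{day}: predicted about {means[day]:.0f} emails")
```

Students could then compare these predictions with the ones they get from the other approaches in the list.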

So how many emails did I get?

Move your mouse over the grey box below to see 🙂

Tuesday: 22

Wednesday: 29

Thursday: 30

Friday: 33

Probability teaching ideas using simulation

probsim

This post provides some teaching examples for using an online probability simulation tool. It’s a supplement to the workshop I offered for the NZAMT 2015 conference.

Probability simulation tool

I recently developed a very basic online probability simulation tool. I wanted a simulation tool that would run online without using applets or Flash (so it is tablet compatible). I also wanted to be able to animate repeated simulations in a loop – in the past, to get this effect, I had to either make animated GIFs or set up slides in PowerPoint to transition automatically. I did a quick search for online simulation tools and couldn’t find what I wanted, so I adapted some code I had written previously.

An example of an animated looped simulation from the probability simulation tool

It’s very much designed “fit for a specific purpose” (more about that in part 2) so I know it has lots of limitations 🙂 But what I like about the feature being demonstrated above is that it will keep running automatically, freeing me up to ask the students questions about what they are seeing and why they are seeing this.

Small samples – lots of variation

One of the activities I presented in the workshop involved teachers trying to work out who my siblings were based on photos. I presented five sets of four photos. Within each set, one photo was of one of my siblings and the rest were photos of unrelated people. There were around 30 teachers present in the workshop. The basic idea (with lots of assumptions) is that the distribution for the number of correct selections IF teachers were guessing can be modelled by a binomial distribution with n = 5 and p = 0.25.

After “marking” the teachers’ selections of my siblings, I created a dot plot of the 30 individual results. One of the questions put to the teachers at the workshop was “Do these results look like what we’d see if each of you was guessing which person was my sibling?”

class_results_siblings

To build up a simulated distribution based on guessing, each teacher then used five different hands-on simulations to make new sibling selections for each set of photos (see the resources link at the end of this post). I then created another dot plot from these simulated selections and asked teachers to compare the features of the two plots e.g. centre, spread, shape, unusual.

class_results_siblings_simulated

For this workshop, the two distributions actually came out to look pretty similar. But this won’t necessarily happen. To demonstrate the amount of variation between repeated simulations (of 30 students guessing across five sets of possible siblings), I set up the probability simulation tool with the options shown in the screen grab below:

probsim2

So that the axis does not resize for each simulation, I fixed the axis between 0 and 5. To stop the dots from automatically resizing, I fixed the dot size to the smallest option. I then pressed “Start animation” and let the simulations run over and over again. This gives the following animation:


This animation could then be used to ask questions like:

  • “What would be an unlikely number of correct siblings if someone was guessing?”
  • “How many correct siblings would you expect to see if someone was guessing – between where and where?”
  • “What looks similar for each animation?” – “What looks different?”
  • “What variation are we seeing?” – “Why are we seeing it?”
  • “What does one dot on the graph represent?”
  • “How is the simulated data being generated?”
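If you would like to reproduce the repeated simulations outside the online tool, a minimal Python sketch might look like the following. It is not the tool’s code. It assumes 30 guessers, five sets of photos and a 0.25 chance of a correct guess per set (the binomial model described above), and prints a rough text version of each dot plot; `simulate_class` is a hypothetical name.

```python
import random
from collections import Counter


def simulate_class(n_guessers=30, n_sets=5, p_correct=0.25, rng=None):
    """Simulate one class of guessers; return a Counter of scores 0..n_sets."""
    rng = rng or random.Random()
    scores = [
        sum(1 for _ in range(n_sets) if rng.random() < p_correct)
        for _ in range(n_guessers)
    ]
    return Counter(scores)


rng = random.Random(2015)  # fixed seed so the run is repeatable
for repeat in range(1, 4):
    tally = simulate_class(rng=rng)
    print(f"Simulation {repeat}")
    for score in range(6):
        # One "o" per simulated guesser with this number of correct selections.
        print(f"  {score} correct: {'o' * tally[score]}")
```

Each "o" is one simulated guesser, so the printed rows play the role of the dots in the animation and can be used with the same questions.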

Want to read/see more?

Wild, C. Animations of sampling variation

Wild, C. VIT – Visual inference tools

NZ Senior secondary guide – Lateness: Choice or chance

Resources

Workshop materials – stimulating simulations NZAMT 2015

Online probability simulation tool

Statistics lesson starter: Is this really surprising?

dominoes

A supermarket is running a promotion. For every $20 you spend, you will receive one domino. There are 50 dominoes to collect. I received 10 dominoes for my last shop and was surprised to find that all 10 dominoes were different. Should I have been surprised? Explain 🙂
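For teachers who want to check their own thinking before using this starter, here is a short Python sketch. It assumes (and this is an assumption, not something the promotion states) that each of the 50 dominoes is equally likely and that the 10 dominoes are handed out independently. The function names are hypothetical.

```python
import random


def exact_all_different(n_types=50, n_received=10):
    """P(all received dominoes are different) under the equally-likely model."""
    probability = 1.0
    for already_seen in range(n_received):
        # The next domino must differ from every domino received so far.
        probability *= (n_types - already_seen) / n_types
    return probability


def simulated_all_different(n_types=50, n_received=10, trials=20_000, seed=1):
    """Estimate the same probability by simulating many shopping trips."""
    rng = random.Random(seed)
    all_different = 0
    for _ in range(trials):
        dominoes = [rng.randrange(n_types) for _ in range(n_received)]
        if len(set(dominoes)) == n_received:
            all_different += 1
    return all_different / trials


print("Exact probability:", round(exact_all_different(), 3))
print("Simulated estimate:", simulated_all_different())
```

Under these assumptions the probability is a bit under 40%, so getting 10 different dominoes is not especially surprising.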

 

Update after some more shopping…

dominoes_update

Statistics teaching ideas based on ….. the alphabet!

letters

This post focuses on randomness, simulations and probability.

10 quick ideas……

  1. Choose five letters (e.g. A, B, D, N, U) and display these together. For the rest of these ideas to work, choose letters that can go together to make three letter words (avoid certain words!). Ask students to randomly select one of the letters and write this down.
  2. Ask students to share honestly how they selected their letter – you should find they do use a reason e.g. the first letter of their name, or they choose the one they think no one else will select. Discuss the difference between selecting something and randomly selecting something, and get students to come up with examples for each e.g. selecting which lolly to eat based on which one you like vs putting your hand into a bag and choosing a lolly without looking.
  3. You could discuss more how humans are not that great at generating or accepting randomness. There are some great YouTube videos and websites with ideas for activities to explore this. A nice example is this decision by Spotify to change their algorithm for shuffling songs – their article includes some nice visualisations to support their discussion. You could also explore how the word or concept of random is used in everyday language, or in particular, in design (like my example below).
  4. Display the class results as a dot plot (with the letters along the horizontal axis). So what are we looking for in the plot? Ask the students – are these results what you expect? Some students may discuss expecting to see an equal number of selections for the five letters, others may expect to see uneven results because “it’s random”, others may have other ideas based on not trusting that other students selected their letters randomly. Try to get as much out of your students as possible so you know what they are thinking 🙂
  5. We can’t use the results to prove that students selected their letters randomly or not, but we can see if the results look like what we’d get if a random process was used. Students may not know what they are looking for, and for small samples like a class, we actually expect quite a bit of variation. Use a simulation tool like this one to simulate randomly selecting n letters with replacement from the five letters you used (n being the size of your class). Discuss with the class whether their results look similar or different to the simulated results.
  6. Make five large cards with each of the five letters on them. Select three students from the class (randomly or not!) and use a shuffling process to allocate each student one of the five cards. Get your students to stand in a line facing the class with their letters hidden. Ask the class how likely they think it is that the three letters will make a word when they are shown. Then get the students, one by one from left to right, to reveal their letters.
  7. Get students to generate three letter “words” by randomly selecting three letters without replacement from the five letters you used (they could work in groups with their own set of five cards). This will require students to decide if a word is real or not. If you want to help students spot correct words, you could do a round of “Boggle” and get the class to create as many valid three letter words as possible from the five letters without repeating letters. Depending on previous learning, you may need to discuss the concept of probability estimates (AKA experimental probability), before getting students to generate 20 “words” (or more if you like!), counting how many of these “words” were real, and determining an estimate for the probability.
  8. Discuss how a simulation could be set up using a computer to run thousands of trials to check randomly created words from the five letters against a list of three letter words that are “real” to determine a closer estimate of the model probability (AKA theoretical probability). This process of checking words against a list of “true words” could be compared to processes around checking whether an email address submitted to an online signup form is “real” or not. We need to keep linking what we do in the classroom with the real world 🙂
  9. You could explore the model probability by considering the total number of “words” that could be created by randomly selecting three letters from the five without replacement (e.g. 5 x 4 x 3 = 60) and the total number of real words found by systematically trying out all permutations or by using a Scrabble tool like this one (e.g. for my five letters it’s eleven real words).
  10. You could finish by looking at the “Infinite Monkey Theorem”. This will require a bit more of a theoretical focus and understanding of complementary events and the usefulness of finding P(X = 0) when you need to find P(X ≥ 1). This kind of thinking can be referred to whenever a new animal is found to be awesome at predicting the results of sports games e.g. Paul the Octopus or Richie McCow.
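Ideas 8 and 9 can be sketched in Python as follows. The letters are the example letters from idea 1. The word list is a short illustrative set I have assumed for this sketch, not the list behind the “eleven real words” mentioned above, so the count and probability here will differ depending on which dictionary you use. The function names are hypothetical.

```python
import random
from itertools import permutations

LETTERS = "ABDNU"

# Illustrative word list only (an assumption); swap in your own dictionary.
ILLUSTRATIVE_WORDS = {"AND", "BAD", "BAN", "BUD", "BUN", "DAB", "DUB", "NAB", "NUB"}


def count_real_words(letters, words):
    """Try every ordered selection of three letters without replacement."""
    candidates = ["".join(p) for p in permutations(letters, 3)]
    real = [word for word in candidates if word in words]
    return len(real), len(candidates)


def estimate_probability(letters, words, trials, rng):
    """Estimate P(real word) by randomly drawing three letters many times."""
    hits = 0
    for _ in range(trials):
        word = "".join(rng.sample(list(letters), 3))  # without replacement
        if word in words:
            hits += 1
    return hits / trials


real, total = count_real_words(LETTERS, ILLUSTRATIVE_WORDS)
print(f"Model probability with this word list: {real}/{total}")

rng = random.Random(26)  # fixed seed so the run is repeatable
for trials in (20, 5000):
    estimate = estimate_probability(LETTERS, ILLUSTRATIVE_WORDS, trials, rng)
    print(f"Estimate from {trials} trials: {estimate:.3f}")
```

Comparing the estimate from 20 trials (what a group of students might do by hand) with the estimate from thousands of trials is a nice way to show why more trials give a closer estimate of the model probability.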

Want to read more?

Kaplan et al. (2014) Exploiting lexical ambiguity to help students understand the meaning of random

Resources

TWIG resources for infinite monkey theorem