Using data and simulation to teach probability modelling

This post provides the notes and resources for a workshop I ran for the Auckland Mathematical Association (AMA) on using data and simulation to teach probability modelling (specifically AS91585/AS91586). This post also includes notes about a workshop I ran for the AMA Statistics Teachers’ Day 2016 about my research into this area.

Using data in different ways

The workshop began by looking at three different questions from the AS91585 2015 paper. All three questions involved data, but how this data was used with a probability model was different for each question.

For the first question (A), we have data on a particular shipment of cars: we know the proportion of cars with the petrol cap on the left-hand side of the car and the percentage of cars that are silver. We are then told that one of the cars is selected at random, which means that we do not need to go beyond this data to solve the problem. In this situation, the “truth” is the same as the “model”. Therefore, we are finding the probability.

For the second question (B), we have data on 10 cars getting petrol: we know the proportion of cars with petrol caps on the left-hand side of the car. However, we are asked to go beyond this data and generalise about all cars in NZ, in terms of their likelihood of having petrol caps on the left-hand side of the cars. This requires developing a model for the situation. In this situation, the “truth” is not necessarily the same as the “model”, and we need to take into account the nature of the data (amount and representativeness) and consider assumptions for the model (the conditions, the model applies IF…..). Therefore, when we use this model we are finding an estimate for the probability.

For the third question (C), we have data on 20 cars being sold: we know how many of these cars have 0 for the last digit of the odometer reading (six, a proportion of 0.3). What we don’t know is whether observing six cars with odometer readings that end in 0 is unusual (and possibly indicative of something dodgy). This requires developing a model to test the observed data (proportion), basing this model on an assumption that the last digit of an odometer reading should be explained by chance alone (equally likely for each digit). Therefore, when we use this model, we generate data from the model (through simulation) and use this simulated data to estimate the chance of observing 6 (or more) cars out of 20 with odometer readings that end in 0. If this “tail proportion” is small (less than 5%), we conclude that chance was not acting alone.
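The simulation for question (C) could be sketched in Python along the following lines. This is a minimal sketch, not the method used in the exam or the workshop: the function name `tail_proportion`, the number of simulations and the fixed seed are all assumptions made for illustration. The model assumes each of the ten digits is equally likely, so the chance that any one odometer reading ends in 0 is 0.1.

```python
import random


def tail_proportion(n_cars=20, p_zero=0.1, observed=6, n_sims=10_000, seed=1):
    """Estimate the chance of seeing `observed` or more readings ending in 0.

    Assumes (as the model does) that each last digit is equally likely,
    so each car independently ends in 0 with probability p_zero.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    at_least_observed = 0
    for _ in range(n_sims):
        # One simulated set of 20 cars: count how many end in 0.
        zeros = sum(rng.random() < p_zero for _ in range(n_cars))
        if zeros >= observed:
            at_least_observed += 1
    return at_least_observed / n_sims


if __name__ == "__main__":
    print("Estimated tail proportion:", tail_proportion())
```

The returned value is the “tail proportion”: the fraction of simulated sets of 20 cars that had six or more readings ending in 0. Comparing it with 5% gives the call described above.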

There are a lot of ideas to get your head around! Sitting in there are ideas around what probability models are and what simulations are (see the slides for more about this) and, as I discovered during my research last year with teachers and probability distribution modelling, these ideas may need a little more care when defining and using them with students. The main reason I think we need to be careful using data when teaching probability modelling is that it matters whether you are using data from a real situation, where you do not know the true probability, or whether you are using data that you have generated from a model through simulation. Each type of data tells you something different and is used in a different way in the modelling process. In my research, this led to the development of the statistical modelling framework shown below:

All models are wrong but some are more wrong than others: Informally testing the fit of a probability distribution model

At the end of 2016, I presented a workshop at the AMA Statistics Teachers’ Day based on my research into probability distribution modelling (AS91586). This 2016 workshop also went into more detail about the framework for statistical modelling I’m developing. The video for this workshop is available here on Census At School NZ.

We have a clear learning progression for how “to make a call” when making comparisons, but how do we make a call about whether a probability distribution model is a good model? As we place a greater emphasis on the use of real data in our statistical investigations, we need to build on sampling variation ideas and use these within our teaching of probability in ways that allow for key concepts to be linked but not confused. Last year I undertook research into teachers’ knowledge of probability distribution modelling. At this workshop, I shared what I learned from this research, and also shared a new free online tool and activities I developed that allow students to informally test the fit of probability distribution models.

During the workshop, I showed a live traffic camera from Wellington (http://wixcam.citylink.co.nz/nph-webcam.cgi/terrace-north), which was the context for a question developed and used (the starter question AKA counting cars). Before the workshop, I recorded five minutes of the traffic and then set up a special html file that pauses the video every five seconds. This was so teachers at the workshop (and students) could count the number of cars passing different points on the motorway (marked with different coloured lines) every five seconds. To use this html file, you need to download both of these files into the same folder – traffic.html and traffic.mp4. I’ve only tested my files using the Chrome browser 🙂

If you don’t want to count the cars yourself, you can head straight to the modelling tool I developed as part of my research: http://learning.statistics-is-awesome.org/modelling-tool/. In the dropdown box under “The situation” there are options for the different coloured points/lines on the motorway. The idea behind getting teachers and students to actually count the cars was to try to develop a greater awareness of the complexity of the situation being modelled, to reinforce the idea that “all models are wrong” – that they are approximations of reality but not the truth. Also, I wanted to encourage some deeper thinking about the limitations of models. For example, in this situation, looking at five second periods, there is an upper limit on how many cars you can count due to speed restrictions and following distances. We also need to get students to think more about the model in terms of sample space (the set of possible outcomes) and the shape of the distribution (which is linked to the probabilities of each of these outcomes), not just the conditions for applying the probability distribution 🙂

In terms of the modelling tool, I developed a set of teaching notes early last year, which you can access in the Google drive below. This includes some videos I made demonstrating the tool in action 🙂 I also started developing a virtual world (stickland http://learning.statistics-is-awesome.org/stickland-modelling/) but this is still a work in progress. Once you have collected data on either the birds or the stick people, you can copy and paste it into the modelling tool. There will be more variables to collect data on in the future for a wider range of possible probability distributions (including situations where none is applicable).

Slides from IASC-ARS/NZSA 2017 talk

https://goo.gl/dfA9MF

Resources for workshop (via Google Drive)

How long does it take a student to submit a swear word into a text analysis tool?

Update on the predictive text challenge

I haven’t heard anything from anyone with any problems, and there seems to be a bit of traffic to the challenge page, so hopefully this is going well. I’ll allow checking of the first list of reserved words tomorrow. Students should put in what they predict the readability score will be for each word. These predicted scores will be checked against the actual readability scores and students will be given an overall result e.g. 85%. Oh, and just because you’re a teacher too you’ll get this idea for an investigative question/problem……. How long does it take a student to submit a swear word into a text analysis tool?

Related “reading themed” statistical investigation ideas

Check out http://josephrocca.com/randomsentence/ where you can generate “random” sentences from books that are no longer under U.S.A. copyright restrictions – so books generally published before the early 20th century. You could compare the process for random sampling sentences from digital books to processes for random sampling sentences from physical books (so much here with different sampling methods). You could give students an actual physical book and challenge them to estimate the total word count (check using the digital version!), or get students to devise a way to compare the “readability” of two books, or….?

So what was so surprising?

Recap: I got 10 dominoes from a supermarket recently and was surprised to find that all 10 were different (there are 50 different dominoes to collect). Ok, so on the face of it this may look like a familiar (and not super awesome) starter. Collecting cereal cards, ice block sticks, seed packets…….. But I was surprised to see this because I was thinking that a random process like this would mean I should expect to see at least one double up e.g. like seeing runs of heads when you flip a coin. When I thought about it more, I realised I wasn’t taking into account there were 50 dominoes – this makes a difference.
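One way to check this reasoning is to calculate, and then simulate, the chance that 10 dominoes are all different. The sketch below is illustrative only and rests on an assumption the post does not confirm: that each of the 50 dominoes is equally likely to be handed out, independently each time. The function names are made up for this example.

```python
import random
from math import prod


def prob_all_different(n_types=50, n_collected=10):
    """Model probability that every collected domino is different.

    Assumes each of the n_types dominoes is equally likely on every draw.
    The first domino is always new, the second is new with chance 49/50, etc.
    """
    return prod((n_types - i) / n_types for i in range(n_collected))


def simulate_all_different(n_types=50, n_collected=10, n_sims=10_000, seed=1):
    """Estimate the same probability by simulating many shopping trips."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    all_different = 0
    for _ in range(n_sims):
        dominoes = [rng.randrange(n_types) for _ in range(n_collected)]
        if len(set(dominoes)) == n_collected:  # no double ups
            all_different += 1
    return all_different / n_sims


if __name__ == "__main__":
    print("Model probability:  ", round(prob_all_different(), 4))
    print("Simulated estimate: ", simulate_all_different())
```

Under these assumptions the model probability works out to about 0.38, so getting 10 different dominoes is not particularly unusual. Changing `n_types` to a much smaller number shows how much difference the 50 makes.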

SOLO stands for the Structure of the Observed Learning Outcome. It’s a model/taxonomy for defining different levels of understanding or thinking and was developed by J. Biggs and K. Collis in 1982. I’ve been using SOLO in my teaching of statistics since around 2006 and think it’s awesome. It fits so well with building conceptual understandings of statistics rather than just procedural ones. I use SOLO in (at least) two ways: (1) to structure good questions for students to use when working with data, questions to make them think at different levels and (2) to plan my teaching of a topic e.g. what are the key ideas (not skills)?

The prices increased from Jan to Feb and then decreased from Mar to May and then increased again…..

I think I like this answer on Quora re how to explain over-fitting of models. Some of the language is a bit off – I think if you swap the word “hypothesis” for “model” and replace “experiments” with “observations” it reads better. But I like the idea of how to explain to students that a model is not about getting a perfect fit to the observed data and that simpler can be better (e.g. go for the minimum number of trend lines possible that tell the general story of what is happening……).

Probability teaching ideas using simulation

This post provides some teaching examples for using an online probability simulation tool. It’s a supplement to the workshop I offered for the NZAMT 2015 conference.

Probability simulation tool

I recently developed a very basic online probability simulation tool. I wanted a simulation tool that would run online without using applets or Flash (tablet compatible). I also wanted to be able to animate repeated simulations in a loop – in the past to get this effect, I had to either make animated GIFs or set up slides in PowerPoint to transition automatically. I did a quick search for online simulation tools and couldn’t find what I wanted so I adapted some code I had written previously to get what I wanted.

An example of an animated looped simulation from the probability simulation tool

It’s very much designed “fit for a specific purpose” (more about that in part 2) so I know it has lots of limitations 🙂 But what I like about the feature being demonstrated above is that it will keep running automatically, freeing me up to ask the students questions about what they are seeing and why they are seeing this.

Small samples – lots of variation

One of the activities I presented in the workshop involved teachers trying to work out who my siblings were based on photos. I presented five sets of four photos. Within each set, one photo was of one of my siblings and the rest were photos of other non-related people. There were around 30 teachers present in the workshop. The basic idea (with lots of assumptions) is that the distribution for the number of correct selections IF teachers were guessing can be modelled by a binomial distribution with n = 5 and p = 0.25.
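The model probabilities for this binomial distribution can be worked out with a few lines of Python. This is a minimal sketch using only the standard library; the function name `binomial_pmf` is chosen for illustration.

```python
from math import comb


def binomial_pmf(k, n=5, p=0.25):
    """P(X = k) for a binomial model: n sets of photos, chance p of a correct guess."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


if __name__ == "__main__":
    # Probability of each possible number of correct sibling selections.
    for k in range(6):
        print(f"P(X = {k}) = {binomial_pmf(k):.4f}")
```

Under the guessing model, getting none or one correct accounts for most of the probability, and getting four or five correct would be rare.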

After “marking” the teachers’ selections of my siblings, I created a dot plot of the 30 individual results. One of the questions put to the teachers at the workshop was “Do these results look like what we’d see if each of you was guessing which person was my sibling?”

To build up a simulated distribution based on guessing, each teacher then used five different hands-on simulations to make new sibling selections for each set of photos (see the resources link at the end of this post). I then created another dot plot from these simulated selections and asked teachers to compare the features of the two plots e.g. centre, spread, shape, unusual.

For this workshop, the two distributions actually came out looking pretty similar. But this won’t necessarily happen. To demonstrate the amount of variation between repeated simulations (of 30 students guessing across five sets of possible siblings), I set up the probability simulation tool with the options shown in the screen grab below:

So that the axis does not resize for each simulation, I fixed the axis between 0 and 5. To stop the dots from automatically resizing, I fixed the dot size to the smallest option. I then pressed “Start animation” and let the simulations run over and over again. This gives the following animation:

This animation could then be used to ask questions like:

• “What would be an unlikely number of correct siblings if someone was guessing?”
• “How many correct siblings would you expect to see if someone was guessing – between where and where?”
• “What looks similar for each animation?” – “What looks different?”
• “What variation are we seeing?” – “Why are we seeing it?”
• “What does one dot on the graph represent?”
• “How is the simulated data being generated?”
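For anyone wanting to see how this kind of simulated data could be generated outside the online tool, here is a rough Python sketch. It is not the code behind the probability simulation tool; the function name `simulate_class`, the seed and the text-based dot plot are assumptions made for illustration. Each run simulates 30 people guessing across five sets of photos, with a 0.25 chance of a correct guess each time.

```python
import random
from collections import Counter


def simulate_class(n_people=30, n_sets=5, p_correct=0.25, rng=random):
    """Simulate one class of guessers; return counts of each score (0 to n_sets)."""
    scores = [
        sum(rng.random() < p_correct for _ in range(n_sets))
        for _ in range(n_people)
    ]
    return Counter(scores)


if __name__ == "__main__":
    rng = random.Random(2015)  # fixed seed so the sketch is repeatable
    for run in range(1, 4):
        counts = simulate_class(rng=rng)
        print(f"Simulation {run}")
        # One "o" per person, on a fixed axis from 0 to 5 correct selections.
        for score in range(6):
            print(f"  {score} | " + "o" * counts[score])
```

Running it a few times (or removing the seed) shows how much the shape of a plot of only 30 results can change from one simulation to the next.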

Wild, C. Animations of sampling variation

Wild, C. VIT – Visual inference tools

NZ Senior secondary guide – Lateness: Choice or chance

Resources

Workshop materials – stimulating simulations NZAMT 2015

Online probability simulation tool

Statistics lesson starter: Is this really surprising?

A supermarket is running a promotion. For every \$20 you spend, you will receive one domino. There are 50 dominoes to collect. I received 10 dominoes for my last shop and was surprised to find that all 10 dominoes were different. Should I have been surprised? Explain 🙂

Update after some more shopping…

Statistics teaching ideas based on ….. the alphabet!

This post focuses on randomness, simulations and probability.

10 quick ideas……

1. Choose five letters (e.g. A, B, D, N, U) and display these together. For the rest of these ideas to work, choose letters that can go together to make three letter words (avoid certain words!). Ask students to randomly select one of the letters and write this down.
2. Ask students to share honestly how they selected their letter – you should find they do use a reason e.g. the first letter of their name, or they choose the one they think no one else will select. Discuss the difference between selecting something and randomly selecting something, and get students to come up with examples for each e.g. selecting which lolly to eat based on which one you like vs putting your hand into a bag and choosing a lolly without looking.
3. You could discuss more how humans are not that great at generating or accepting randomness. There are some great YouTube videos and websites with ideas for activities to explore this. A nice example is this decision by Spotify to change their algorithm for shuffling songs – their article includes some nice visualisations to support their discussion. You could also explore how the word or concept of random is used in everyday language, or in particular, in design (like my example below).
4. Display the class results as a dot plot (with the letters along the horizontal axis). So what are we looking for in the plot? Ask the students – are these results what you expect? Some students may discuss expecting to see an equal number of selections for the five letters, others may expect to see uneven results because “it’s random”, others may have other ideas based on not trusting that other students selected their letters randomly. Try to get as much out of your students as possible so you know what they are thinking 🙂
5. We can’t use the results to prove that students selected their letters randomly or not, but we can see if the results look like what we’d get if a random process was used. Students may not know what they are looking for, and for small samples like a class, we actually expect quite a bit of variation. Use a simulation tool like this one to simulate randomly selecting n letters with replacement from the five letters you used (n being the size of your class). Discuss with the class whether their results look similar or different to the simulated results.
6. Make five large cards with each of the five letters on them. Select three students from the class (randomly or not!) and use a shuffling process to allocate each student one of the five cards. Get your students to stand in a line facing the class with their letters hidden. Ask the class how likely they think it is that the three letters will make a word when they are shown. Then get the students one by one from left to right to reveal their letters.
7. Get students to generate three letter “words” by randomly selecting three letters without replacement from the five letters you used (they could work in groups with their own set of five cards). This will require students to decide if a word is real or not. If you want to help students spot correct words, you could do a round of “Boggle” and get the class to create as many valid three letter words as possible from the five letters without repeating letters. Depending on previous learning, you may need to discuss the concept of probability estimates (AKA experimental probability), before getting students to generate 20 “words” (or more if you like!), counting how many of these “words” were real, and determining an estimate for the probability.
8. Discuss how a simulation could be set up using a computer to run thousands of trials to check randomly created words from the five letters against a list of three letter words that are “real” to determine a closer estimate of the model probability (AKA theoretical probability). This process of checking words against a list of “true words” could be compared to processes around checking whether an email address submitted to an online signup form is “real” or not. We need to keep linking what we do in the classroom with the real world 🙂
9. You could explore the model probability by considering the total number of “words” that could be created by randomly selecting three letters from the five without replacement (e.g. 5 x 4 x 3 = 60) and the total number of real words found by systematically trying out all permutations or by using a Scrabble tool like this one (e.g. for my five letters it’s eleven real words).
10. You could finish by looking at the “Infinite Monkey Theorem”. This will require a bit more of a theoretical focus and understanding of complementary events and the usefulness of finding P(X = 0) when you need to find P(X ≥ 1). This kind of thinking can be referred to whenever a new animal is found to be awesome at predicting the results of sports games e.g. Paul the Octopus or Richie McCow.
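Ideas 7 to 9 above can be sketched in Python: list every three letter “word” that can be made without replacement, check each against a list of real words to get the model probability, and then run thousands of random trials to get an estimate. This is only a sketch under stated assumptions. The word list `REAL_WORDS` is an illustrative list made up for this example (whether entries such as DAN or DUN count depends on the dictionary you use), and the function names are hypothetical.

```python
import random
from itertools import permutations

LETTERS = "ABDNU"  # the five example letters from idea 1

# Illustrative word list (an assumption): swap in your own dictionary or Scrabble list.
REAL_WORDS = {"AND", "BAD", "BAN", "BUD", "BUN", "DAB",
              "DAN", "DUB", "DUN", "NAB", "NUB"}


def all_words(letters=LETTERS, length=3):
    """Every ordered selection of `length` letters without replacement."""
    return ["".join(p) for p in permutations(letters, length)]


def model_probability(letters=LETTERS, real_words=REAL_WORDS):
    """Model (theoretical) probability that a random three letter 'word' is real."""
    words = all_words(letters)
    return sum(word in real_words for word in words) / len(words)


def estimate_probability(letters=LETTERS, real_words=REAL_WORDS,
                         n_trials=10_000, seed=1):
    """Estimate the same probability by simulating many random 'words'."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    real = 0
    for _ in range(n_trials):
        word = "".join(rng.sample(list(letters), 3))  # without replacement
        if word in real_words:
            real += 1
    return real / n_trials


if __name__ == "__main__":
    print("Possible 'words':   ", len(all_words()))  # 5 x 4 x 3 = 60
    print("Model probability:  ", round(model_probability(), 3))
    print("Simulated estimate: ", estimate_probability())
```

Comparing the estimate from 20 trials (set `n_trials=20`) with the estimate from 10,000 trials is one way to show students why more trials give a closer estimate of the model probability.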