This post provides the notes for an Ignite presentation I ran at the Christchurch Mathematical Association (CMA) 2015 Statistics Day on dealing with outliers. If you are not familiar with an Ignite presentation, it is 20 slides that auto-advance every 15 seconds for a total presentation of five minutes!
I’ve finally decided to post my slides and notes from this presentation (nearly four years later) as it gave me an opportunity to try out a text-to-voice tool for creating videos. Play the recreation of my ignite talk below! It is actually less than five minutes as “robo” me does not waffle and does not need 15 seconds per slide 🙂
This is a short post about exploring data from Census at school NZ using the online version of iNZight.
Have you used the random sampler lately on Census at School? Did you know that it now links your random sample through to iNZight lite?
Head to http://new.censusatschool.org.nz/explore and click on the button that says “get a random sample”. Follow the instructions to get your random sample (I selected the CensusAtSchool NZ 2015 Database, no Subpopulation, and Total sample size of 200) and you’ll get a link to iNZight lite that will include the data you just got from your random sample.
Click on that, follow the steps to load up iNZight lite and start exploring that data 🙂
The iNZight lite tool is still under active development so if you come across anything weird or want to suggest improvements, just send feedback through to the iNZight team via the iNZight website.
If you are comfortable with playing around with URLs, you can also set up links to iNZight lite that include the csv file already loaded, as demonstrated below (using the Australian Institute of Sport athletes data set). The part to change is in bold, which you can replace with any web-hosted csv file.
This is a post about my new tool to help write report comments for statistics (and mathematics) students. You can find it under the new “Tools” menu, but you might want to read the stuff below first to find out how it works 🙂
A very brief history of my approach to writing report comments….
A good friend of mine (also a teacher) once commented that I spent more time thinking about and creating tools to write report comments than the actual time it would take to just write them by hand. I did write my report comments “by hand” for the first couple of years of teaching, but then I realised that, while every student is different, I wasn’t necessarily using completely different comments about each student. There was a structure to what I was writing and some key things I was commenting on regarding what the student was doing (or not doing) in terms of their learning. So I started developing tools to help me (and others) write report comments.
Developing the new report writing tool…
My goal with this new version of the report writing tool was to make the process more natural and for comments to be “suggested” in the writing process (rather than using drop-down boxes), similar to how Google suggests search terms for you as you type. For this suggestion process to work well, I had to analyse all the comments I had used before to identify words that were useful for identifying comments and words that were not. For example, I found that most words less than four letters long were not that useful for selecting comments, and words like “have”, “take” etc. also were not useful for presenting reasonable options for comments. I ran some of my previously written comments through some text analysis tools to check out aspects such as readability and word length.
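The tool itself isn’t shown here, but the keyword-based suggestion idea can be sketched in a few lines. This is a minimal, made-up illustration (the comment bank, stop list, and function names are all assumptions, not the actual tool’s code):

```python
# A minimal sketch of keyword-based comment suggestion (not the actual
# tool's code). The comment bank and stop list below are illustrative.
STOP_WORDS = {"have", "take", "this", "that", "with", "from"}

COMMENT_BANK = [
    "needs to complete homework tasks more consistently",
    "shows excellent understanding of statistical concepts",
    "should ask questions in class when unsure",
]

def keywords(text):
    """Keep only words useful for matching: 4+ letters and not stop words."""
    return {w for w in text.lower().split() if len(w) >= 4 and w not in STOP_WORDS}

def suggest(typed, bank=COMMENT_BANK):
    """Rank comments by how many keywords they share with the typed text."""
    typed_kw = keywords(typed)
    scored = [(len(typed_kw & keywords(c)), c) for c in bank]
    return [c for score, c in sorted(scored, reverse=True) if score > 0]

print(suggest("homework tasks"))
```

As you type more words, the keyword overlap re-ranks the bank, which is roughly the “suggested as you type” feel described above.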
Want to see it in action?
I’ve made this video to explain how to use the tool because I think it’s easier this way 🙂
I haven’t heard anything from anyone with any problems, and there seems to be a bit of traffic to the challenge page, so hopefully this is going well. I’ll allow checking of the first list of reserved words tomorrow. Students should enter what they predict the readability score will be for each word. These predicted scores will be checked against the actual readability scores and students will be given an overall result e.g. 85%. Oh, and because you’re a teacher too, you’ll appreciate this idea for an investigative question/problem: how long does it take a student to submit a swear word into a text analysis tool?
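An overall result like “85%” could be computed in many ways; here is one hypothetical sketch (the actual scoring rule isn’t described above, so the tolerance-based approach and the function name are my assumptions):

```python
# Hypothetical sketch of scoring predicted readability against actual scores.
# Assumes a prediction counts as correct when it is within a small tolerance
# of the actual score — the real challenge's rule may differ.

def overall_result(predicted, actual, tolerance=0.5):
    """Percentage of predictions within `tolerance` of the actual score."""
    correct = sum(abs(p - a) <= tolerance for p, a in zip(predicted, actual))
    return round(100 * correct / len(actual))

print(overall_result([3.0, 5.5, 7.0, 2.0], [3.2, 6.5, 7.1, 2.4]))  # → 75
```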
Related “reading themed” statistical investigation ideas
Check out http://josephrocca.com/randomsentence/ where you can generate “random” sentences from books that are no longer under U.S.A. copyright restrictions – so books generally published before the early 20th century. You could compare the process for random sampling sentences from digital books to processes for random sampling sentences from physical books (so much here with different sampling methods). You could give students an actual physical book and challenge them to estimate the total word count (check using the digital version!), or get students to devise a way to compare the “readability” of two books, or….?
So what was so surprising?
Recap: I got 10 dominoes from a supermarket recently and was surprised to find that all 10 were different (there are 50 different dominoes to collect). Ok, so on the face of it this may look like a familiar (and not super awesome) starter. Collecting cereal cards, ice block sticks, seed packets…….. But I was surprised to see this because I was thinking that a random process like this would mean I should expect to see at least one double up e.g. like seeing runs of heads when you flip a coin. When I thought about it more, I realised I wasn’t taking into account there were 50 dominoes – this makes a difference.
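The intuition can be checked with a birthday-problem style calculation. Assuming each domino is an independent, equally likely draw from the 50 designs (an assumption — supermarket packs may not work this way), the chance all 10 are different works out to be not that small:

```python
import math
import random

# Exact probability that 10 independent uniform draws from 50 designs
# are all different (birthday-problem style product).
p_all_different = math.prod((50 - i) / 50 for i in range(10))
print(round(p_all_different, 3))  # → 0.382, so "all different" isn't that rare

# Quick simulation check under the same equal-likelihood assumption.
random.seed(1)
trials = 100_000
hits = sum(len(set(random.randrange(50) for _ in range(10))) == 10
           for _ in range(trials))
print(round(hits / trials, 2))
```

So with 50 designs, seeing no double-ups in 10 dominoes happens nearly 4 times in 10 — which matches the realisation above that the number of dominoes to collect makes a difference.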
More about SOLO
SOLO stands for the Structure of the Observed Learning Outcomes. It’s a model/taxonomy for defining different levels of understanding or thinking and was developed by J. Biggs and K. Collis in 1982. I’ve been using SOLO in my teaching of statistics since around 2006 and think it’s awesome. It fits so well with building conceptual understandings of statistics rather than just procedural ones. I use SOLO in (at least) two ways: (1) to structure good questions for students to use when working with data, questions to make them think at different levels and (2) to plan my teaching of a topic e.g. what are the key ideas (not skills)?
The prices increased from Jan to Feb and then decreased from Mar to May and then increased again…..
I think I like this answer on Quora re how to explain over-fitting of models. Some of the language is a bit off – I think if you swap the word “hypothesis” for “model” and replace “experiments” with “observations” it reads better. But I like the idea of how to explain to students that a model is not about getting a perfect fit to the observed data and that simpler can be better (e.g. go for as few trend lines as possible that still tell the general story of what is happening……).
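The “simpler can be better” point can be demonstrated with a small sketch: fit both a straight line and a wiggly high-degree polynomial to noisy data, then compare each against the underlying trend. The data and degrees below are made up for illustration:

```python
# Illustrative sketch of over-fitting: a high-degree polynomial can chase
# the noise in the observed points yet sit further from the underlying
# trend than a simple straight line. Data here is simulated.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 12)
y = 2 * x + rng.normal(0, 0.2, size=x.size)   # underlying trend is linear

simple = np.polyfit(x, y, 1)    # one trend line
wiggly = np.polyfit(x, y, 9)    # chases the noise

x_new = np.linspace(0, 1, 100)
y_true = 2 * x_new              # what new observations centre on

err_simple = np.mean((np.polyval(simple, x_new) - y_true) ** 2)
err_wiggly = np.mean((np.polyval(wiggly, x_new) - y_true) ** 2)
print(err_simple < err_wiggly)
```

The simple fit “misses” individual points but tracks the general story; the degree-9 fit hugs the observed points and pays for it on new data.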
This post provides some teaching examples for using an online probability simulation tool. It’s a supplement to the workshop I offered for the NZAMT 2015 conference.
Probability simulation tool
I recently developed a very basic online probability simulation tool. I wanted a simulation tool that would run online without using applets or flash (tablet compatible). I also wanted to be able to animate repeated simulations in a loop – in the past to get this effect, I had to either make animated GIFs or set up slides in Powerpoint to transition automatically. I did a quick search for online simulation tools and couldn’t find what I wanted, so I adapted some code I had written previously to get what I wanted.
An example of an animated looped simulation from the probability simulation tool
It’s very much designed “fit for a specific purpose” (more about that in part 2) so I know it has lots of limitations 🙂 But what I like about the feature being demonstrated above is that it will keep running automatically, freeing me up to ask the students questions about what they are seeing and why they are seeing this.
Small samples – lots of variation
One of the activities I presented in the workshop involved teachers trying to work out who my siblings were based on photos. I presented five sets of four photos. Within each set, one photo was of one of my siblings; the rest were photos of other non-related people. In the workshop there were around 30 teachers present. The basic idea (with lots of assumptions) is that the distribution of the number of correct selections IF teachers were guessing can be modelled by a binomial distribution with n = 5 and p = 0.25.
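The guessing model can be simulated directly. This sketch (class size and random seed are arbitrary) generates one simulated “class” of 30 teachers each guessing on five sets of four photos:

```python
import random

random.seed(42)

def guess_score(n_sets=5, p_correct=0.25):
    """Number of correct picks if a teacher guesses on each set of four photos."""
    return sum(random.random() < p_correct for _ in range(n_sets))

# One simulated "class" of 30 teachers, all guessing.
results = [guess_score() for _ in range(30)]

# Text dot plot of the simulated distribution (0 to 5 correct per teacher).
for k in range(6):
    print(k, "●" * results.count(k))
```

Re-running this a few times is a quick way to preview how much the simulated dot plot changes from class to class.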
After “marking” the teachers’ selections of my siblings, I created a dot plot of the 30 individual results. One of the questions put to the teachers at the workshop was “Do these results look like what we’d see if each of you was guessing which person was my sibling?”
To build up a simulated distribution based on guessing, each teacher then used five different hands-on simulations to make new sibling selections for each set of photos (see the resources link at the end of this post). I then created another dot plot from these simulated selections and asked teachers to compare the features of the two plots e.g. centre, spread, shape, anything unusual.
For this workshop, the two distributions actually came out to look pretty similar. But this won’t necessarily happen. To demonstrate the amount of variation between repeated simulations (of 30 students guessing across five sets of possible siblings), I set up the probability simulation tool with the options shown in the screen grab below:
So that the axis does not resize for each simulation, I fixed the axis between 0 and 5. To stop the dots from automatically resizing, I fixed the dot size to the smallest option. I then pressed “Start animation” and let the simulations run over and over again. This gives the following animation:
This animation could then be used to ask questions like:
“What would be an unlikely number of correct siblings if someone was guessing?”
“How many correct siblings would you expect to see if someone was guessing – between where and where?”
“What looks similar for each animation?”- “What looks different?”
“What variation are we seeing?” – “Why are we seeing it?”
A supermarket is running a promotion. For every $20 you spend, you will receive one domino. There are 50 dominoes to collect. I received 10 dominoes for my last shop and was surprised to find that all 10 dominoes were different. Should I have been surprised? Explain 🙂
This lesson focuses on developing student understanding of statistical measures. It is a re-working of a workshop I offered for the NZAMT 2009 conference. This lesson is awesome because students drive the development of a measure for determining the best “age estimator”.
I was introduced to the activity when I was at teachers’ college, and have used it in my teaching pretty much every year since then in some way or another. I adapted it to focus on creating a variety of measures (not just the standard deviation) during some early work on the SOLO taxonomy, as a way of demonstrating progressions in thinking.
This post focuses on randomness, simulations and probability.
10 quick ideas……
Choose five letters (e.g. A, B, D, N, U) and display these together. For the rest of these ideas to work, choose letters that can go together to make three letter words (avoid certain words!). Ask students to randomly select one of the letters and write this down.
Ask students to share honestly how they selected their letter – you should find they do use a reason e.g. the first letter of their name, or they choose the one they think no one else will select. Discuss the difference between selecting something and randomly selecting something, and get students to come up with examples for each e.g. selecting which lolly to eat based on which one you like vs putting your hand into a bag and choosing a lolly without looking.
You could discuss more how humans are not that great at generating or accepting randomness. There are some great YouTube videos and websites with ideas for activities to explore this. A nice example is this decision by Spotify to change their algorithm for shuffling songs – their article includes some nice visualisations to support their discussion. You could also explore how the word or concept of random is used in everyday language, or in particular, in design (like my example below).
Display the class results as a dot plot (with the letters along the horizontal axis). So what are we looking for in the plot? Ask the students – are these results what you expect? Some students may discuss expecting to see an equal number of selections for the five letters, others may expect to see uneven results because “it’s random”, others may have other ideas based on not trusting that other students selected their letters randomly. Try to get as much out of your students as possible so you know what they are thinking 🙂
We can’t use the results to prove that students selected their letters randomly or not, but we can see if the results look like what we’d get if a random process was used. Students may not know what they are looking for, and for small samples like a class, we actually expect quite a bit of variation. Use a simulation tool like this one to simulate randomly selecting n letters with replacement from the five letters you used (n being the size of your class). Discuss with the class whether their results look similar or different to the simulated results.
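If you want to run this simulation yourself rather than through a website, here is a minimal sketch (the five letters and the class size of 30 are just examples):

```python
import random
from collections import Counter

# Sketch of simulating a class of n students each randomly selecting one
# of five letters with replacement. Letters and class size are examples.
letters = ["A", "B", "D", "N", "U"]
n = 30  # class size

random.seed(7)
selections = [random.choice(letters) for _ in range(n)]
counts = Counter(selections)

# Simple text "dot plot" — change the seed and re-run a few times to see
# how much the counts vary from sample to sample.
for letter in letters:
    print(letter, "●" * counts[letter])
```

Running it repeatedly shows the point above: even when every selection really is random, small samples rarely come out perfectly even.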
Make five large cards with each of the five letters on them. Select three students from the class (randomly or not!) and use a shuffling process to allocate each student one of the five cards. Get your students to stand in a line facing the class with their letters hidden. Ask the class how likely they think it is that when the three letters are shown that the three letters will make a word. Then get the students, one by one from left to right, to reveal their letters.
Get students to generate three letter “words” by randomly selecting three letters without replacement from the five letters you used (they could work in groups with their own set of five cards). This will require students to decide if a word is real or not. If you want to help students spot correct words, you could do a round of “Boggle” and get the class to create as many valid three letter words as possible from the five letters without repeating letters. Depending on previous learning, you may need to discuss the concept of probability estimates (AKA experimental probability), before getting students to generate 20 “words” (or more if you like!), counting how many of these “words” were real, and determining an estimate for the probability.
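The 20-trial estimate can be mirrored in code. A minimal sketch (the word list below is illustrative for the letters A, B, D, N, U, not a complete or authoritative list):

```python
import random

# Sketch of estimating the probability that three letters drawn without
# replacement form a real word. The word list is illustrative only.
letters = ["A", "B", "D", "N", "U"]
real_words = {"AND", "BAD", "BAN", "BUD", "BUN", "DAB", "DUB", "DUN", "NAB", "NUB"}

random.seed(3)
trials = 20
hits = sum("".join(random.sample(letters, 3)) in real_words
           for _ in range(trials))
print(f"estimated probability: {hits}/{trials} = {hits / trials}")
```

With only 20 trials the estimate bounces around a lot between runs, which is itself a useful talking point with students.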
Discuss how a simulation could be set up using a computer to run thousands of trials to check randomly created words from the five letters against a list of three letter words that are “real” to determine a closer estimate of the model probability (AKA theoretical probability). This process of checking words against a list of “true words” could be compared to processes around checking whether an email address submitted to an online signup form is “real” or not. We need to keep linking what we do in the classroom with the real world 🙂
You could explore the model probability by considering the total number of “words” that could be created by randomly selecting three letters from the five without replacement (e.g. 5 x 4 x 3 = 60) and the total number of real words found by systematically trying out all permutations or by using a Scrabble tool like this one (e.g. for my five letters it’s eleven real words).
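The 5 x 4 x 3 = 60 count can be verified by enumerating the permutations directly. Again, the word list here is illustrative (check your own letters with a dictionary or a Scrabble tool — the post found eleven real words for its five letters):

```python
from itertools import permutations

letters = ["A", "B", "D", "N", "U"]
# Illustrative word list for these letters — not claimed to be complete.
real_words = {"AND", "BAD", "BAN", "BUD", "BUN", "DAB", "DUB", "DUN", "NAB", "NUB"}

# Every ordered selection of 3 letters from the 5, without replacement.
all_words = ["".join(p) for p in permutations(letters, 3)]
print(len(all_words))  # → 60, matching 5 x 4 x 3

hits = sum(w in real_words for w in all_words)
print(f"model probability ≈ {hits}/{len(all_words)}")
```

Comparing this model probability against the class’s 20-trial estimates makes the experimental vs theoretical distinction concrete.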
You could finish by looking at the “Infinite Monkey Theorem”. This will require a bit more of a theoretical focus and understanding of complementary events and the usefulness of finding P(X = 0) when you need to find P(X ≥ 1). This kind of thinking can be referred to whenever a new animal is found to be awesome at predicting the results of sports games e.g. Paul the Octopus, Richie McCow
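A quick worked example of the complement trick, reusing the sibling-guessing numbers from earlier (five sets, a one-in-four chance per set if guessing):

```python
# The chance of "at least one" success is 1 minus the chance of none.
# Numbers reuse the guessing example: five sets, p = 0.25 per set.
p, n = 0.25, 5
p_none = (1 - p) ** n          # P(X = 0) = 0.75^5
p_at_least_one = 1 - p_none    # P(X >= 1)
print(round(p_at_least_one, 3))  # → 0.763
```

So even a pure guesser gets at least one “prediction” right about three times out of four — worth remembering next time an octopus goes on a winning streak.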