Using data challenges to encourage statistical thinking
This post provides the notes for a workshop I ran at the Christchurch Mathematical Association (CMA) and Auckland Mathematical Association (AMA) 2015 Statistics Days about using data challenges to encourage statistical thinking.
What is a data challenge? These are just the words I am using to describe a competition that involves (big) data. Some good examples are the ASA datafest and the Hudson Data Jam. Closer to home (NZ) we have the ISLP statistical poster competition and New Zealand’s Next Top Engineering Scientist. The key ingredients are an interesting dataset that can be explored to find stories and the use of presentations or visualisations of this data to tell stories and communicate insight.
There are also opportunities for other competitions to involve statistical thinking, for example, the annual New Zealand’s Next Top Engineering Scientist competition. The question posed in 2014 was “If Mt Taranaki erupted, how much would it cost the aviation industry?” and it is easy to list the reasons why answering this question needs statistics: data, models, prediction, estimates, uncertainty, variation … and decision making! Think about forming a four-person team for the 2016 competition that includes a statistics student 🙂
The basic idea behind a predictive model is that, based on some information, we can predict that something is likely to happen. For example, based on your name, I could form a model to predict your age using NZ baby name popularity data and survival rates. Another example is the well-known case of Target predicting a teenager was pregnant using information about items she had purchased. Kaggle offers lots of competitions with data; however, their competitions require formal modelling methods and skills beyond exploratory data analysis (such as coding in R or other programming languages). The approach I am taking for this workshop is more informal, with the main focus on exploratory data analysis and probabilistic thinking 🙂
For this workshop, I devised a very structured activity where participants are guided through the exploratory data analysis aspect of the competition. I am modelling the kinds of questions I would ask of the data and how I would use software like iNZight lite to visualise the answers to these questions. I am also “sowing the seeds” of predictive modelling by encouraging “tendency” and “likely” language. You can remove all scaffolding and just give students the data and the final challenge questions and let them go for it.
For the AMA Statistics Day, I used a quick task with my Census at School stick people data cards first. Each group was given 50 different data cards from the same population of stick people. I randomly selected another stick person from the same population. I asked the teachers to predict if my person was on Facebook using their cards. The teachers at the workshop sorted their cards into Facebook and non-Facebook stick people. At this stage we didn’t have a model, we were just trying to predict whether my stick person used Facebook or not. I then asked the teachers to predict whether my stick person used Facebook or not based on the knowledge/information that my stick person uses Snapchat. This involved teachers splitting their Facebook stick people group into Snapchat and non-Snapchat people, and the same with their non-Facebook people (see below).
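For anyone who would like to see the card-sorting logic written out, here is a minimal Python sketch. The counts are invented purely for illustration (they are not the real stick people data), and the function name `predict_facebook` is hypothetical. It predicts the more common Facebook answer, first using all 50 cards and then using only the cards for stick people who use Snapchat.

```python
from collections import Counter

# Invented data cards as (uses_facebook, uses_snapchat) pairs.
# These 50 cards are made up for illustration only.
cards = (
    [("yes", "yes")] * 18
    + [("yes", "no")] * 12
    + [("no", "yes")] * 6
    + [("no", "no")] * 14
)

def predict_facebook(cards, snapchat=None):
    """Return the most common Facebook answer and its proportion.

    If `snapchat` is given, only cards with that Snapchat answer are used,
    which mirrors splitting the piles of cards by Snapchat use.
    """
    subset = [fb for fb, sc in cards if snapchat is None or sc == snapchat]
    answer, count = Counter(subset).most_common(1)[0]
    return answer, count / len(subset)

overall = predict_facebook(cards)                        # no extra information
given_snapchat = predict_facebook(cards, snapchat="yes")  # knowing Snapchat use
```

With these made-up counts the prediction is "yes" both times, but the proportion behind the prediction is higher once the Snapchat information is used, which is the kind of "more likely" language the card task is trying to draw out.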
This data challenge is based on the list of the top 100 most powerful celebrities as determined by Forbes magazine. If you want to try the data challenge out while reading this post, DO NOT Google that list now 🙂 I obtained the information from the Forbes website and then merged this with information from Twitter about followers and number of tweets for those celebrities with Twitter accounts (this Twitter information was correct when I first created the data set – it will be out of date now!). I then randomly selected five celebrities to remove from the data set; the remaining 95 celebrities formed a sort of “training” set. We may already know that Beyonce was determined the most powerful celebrity in 2014 (“Who runs the world? Girls do!”), but what does it take to get on to this list? Money, job, social media presence? Let’s explore the data to see what we can find out!
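If you want to set up a similar challenge with your own data, the hold-out step can be sketched in a few lines of Python. This is only one way of doing it, not necessarily how the celebrity data set was split; the function name `split_holdout`, the seed and the placeholder names are all assumptions made for the example.

```python
import random

def split_holdout(rows, n_holdout=5, seed=2014):
    """Randomly set aside `n_holdout` rows; the rest form the "training" set.

    The seed is an arbitrary choice so the split can be reproduced.
    """
    rng = random.Random(seed)
    holdout_idx = set(rng.sample(range(len(rows)), n_holdout))
    training = [row for i, row in enumerate(rows) if i not in holdout_idx]
    holdout = [row for i, row in enumerate(rows) if i in holdout_idx]
    return training, holdout

# Placeholder names standing in for the 100 rows of a real data set.
celebrities = [f"celebrity_{i}" for i in range(1, 101)]
training, holdout = split_holdout(celebrities)
```

Fixing the seed means the same five rows are removed each time, which is handy if the challenge questions have already been written for those five.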
Download the csv file celebrities2014training.csv then load it up in either iNZight desktop or iNZight lite (available from the iNZight website), OR use this link to iNZight lite with the data loaded: http://lite.docker.stat.auckland.ac.nz/?url=http://teaching.statistics-is-awesome.org/celebrities2014training.csv&land=visualize
Ignoring the fact that we are not comparing any of this data with celebrities who did not make the list, this is a good question to get familiar with the data set. For this whole activity we are not going to be making inferences for celebrities outside of the top 100, just for the five that I have removed from the data set. The questions on this slide allow for a structured and guided approach to exploring the data (see slide image above). You may need to help students understand how bar graphs are constructed in iNZight, and I have ordered the questions so that you look at one variable by itself first before comparing this variable across another variable. This initial exploration ends with a hint of how this kind of modelling will work, by asking students to predict the kind of celebrity someone from this top 100 list is if they are a male in their thirties. There is no “correct” answer here 🙂 The teachers at both workshops were happy with predicting “Athletes”.
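The “male in their thirties” prediction is really just subsetting and then picking the most common category. A rough Python sketch of that idea follows. The rows are invented for illustration (they are not taken from the celebrity data set), the category names other than “Athletes” are made up, and `most_common_category` is a hypothetical helper.

```python
from collections import Counter

# Invented rows as (gender, age, category); not the real data.
rows = [
    ("male", 31, "Athletes"),
    ("male", 36, "Athletes"),
    ("male", 38, "Musicians"),
    ("male", 45, "Actors"),
    ("female", 33, "Musicians"),
    ("female", 29, "Actors"),
    ("male", 34, "Athletes"),
]

def most_common_category(rows, gender, age_min, age_max):
    """Return the most common category in the subset, and the subset size."""
    matches = [
        category
        for g, age, category in rows
        if g == gender and age_min <= age <= age_max
    ]
    if not matches:
        return None, 0  # nothing to base a prediction on
    category, _ = Counter(matches).most_common(1)[0]
    return category, len(matches)

prediction, group_size = most_common_category(rows, "male", 30, 39)
```

Returning the group size alongside the prediction is deliberate: it keeps in view how few people the prediction is based on once you start subsetting.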
For this set of questions, we need to learn about how to deal with an “outlier” – Dr Dre and his unusually high income (because he sold Beats by Dre to Apple during 2014). One approach is to change the axis limits in iNZight lite (under the advanced options), which is a better approach than removing him from the data set, as you would also be removing the other information about him which could be useful for our model. These questions allow for looking at the relationship between two numerical variables via scatter plots, and this context allows for a meaningful discussion of other sources of variation, for example, when discussing why the relationship between earnings and ranking is not a perfect relationship. You can encourage students to try combinations of variables when exploring celebrity earnings, but also draw their attention to how small the groups can get when they start subsetting (especially if they have four variables!).
This last set of questions gives students an opportunity to explore the associated Twitter data for these celebrities. It’s pretty interesting – especially the lack of relationship between number of followers and number of tweets! There is also an opportunity in the second set of questions to demonstrate adding a third variable to a scatter plot using a colour gradient (by plotting number of Twitter followers vs number of tweets and adding ranking through “code more variables” – an advanced option in iNZight lite).
On to the challenge!
The idea with each of these challenge questions is to give the information to students, then give them some time to explore that data and discuss which type of celebrity (category) they think each person is. If they have worked through the previous questions/explorations, they should have some ideas of what to check or look for, but also one of the aims of this learning activity is that they do realise they still might not get their prediction correct despite using the data 🙂 Why not?
Warm up question – who are we?
Challenge question 1 – what type of celebrity am I?
Challenge question 2 – what type of celebrity am I?
Challenge question 3 – what type of celebrity am I?
Challenge question 4 – what type of celebrity am I?
Challenge question 5 – what type of celebrity am I?
Did you get these all correct? Remember that you were predicting for celebrities not in the data set (unseen data) so there is no guarantee that what is the case for the 95 celebrities is also the case for each of the five celebrities under question. We also used a very small “training” set to create an informal model. Lastly, just because you got one particular prediction wrong doesn’t mean your model won’t perform well in the long run (well, not quite for this particular example since we are only concerned with the top 100 but hopefully you get my point).
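To put a rough number on that last point, here is a small Python sketch. It assumes, purely for illustration, an informal model that gets each prediction right 75% of the time, and that the five predictions are independent of each other. Neither figure comes from the celebrity data.

```python
def chance_all_correct(p_correct, n_predictions):
    """Probability that every one of n independent predictions is correct."""
    return p_correct ** n_predictions

# Assumed accuracy of 0.75 per prediction; five challenge questions.
p_all_five = chance_all_correct(0.75, 5)
p_at_least_one_wrong = 1 - p_all_five
```

Under these assumptions, a model that is right three times out of four would still get at least one of the five challenge questions wrong about three quarters of the time, so a miss or two is not evidence of a poor model.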