Follow the data!

Last week I was down in Wellington for the VUW NZCER NZAMT16 Mathematics & Statistics Education Research Symposium, as well as for the NZAMT16 teacher conference. It was a huge privilege to be one of the keynote speakers and my keynote focused on teaching data science at the school level. I used the example of following music data from the New Zealand Top 40 charts to explore what new ways of thinking about data our students would need to learn (I use “new” here to mean “not currently taught/emphasised”).

It was awesome to be back in Wellington, as not only did I complete a BMus/BSc double degree at Victoria University, I actually taught music at Hutt Valley High School (the venue for the conference) while I was training to become a high school teacher (in maths/stats and music). I didn’t talk much in my keynote about the relationship between music and data analysis, but I did describe my thoughts a few years ago (see below):

All music has some sort of structure sitting behind it, but the beauty of music is in the variation. When you learn music, you learn about key ideas and structures, but then you get to hear how these same key ideas and structures can be used to produce so many different-sounding works of art. This is how I think we need to help students learn statistics – minimal structure, optimal transfer, maximal experience. Imagine how boring it would be if students learning music only ever listened to Bach.

Due to some unforeseen factors, I ended up Zooming my slides from one laptop at the front of the hall to another laptop in the back room that was connected to the data projector. Since I was using Zoom anyway, I decided to record my talk. However, the recording is not super awesome because I didn’t really think about the audio side of things (ironically). If you want to try watching the video, I’ve embedded it below:

You can also view the slides here. I’m not sure they make a whole lot of sense by themselves, so here’s a quick summary of some of what I talked about:

  • Currently, we pretty much choose data to match the type of analysis we want to teach, and then “back fit” the investigative problem to this analysis. This is not totally a bad thing: we do it in the hope that when students are out there in the real world, they think about all the analytical methods they’ve learned and choose the one that makes sense for the thing they don’t know and the data they have to learn from. But there’s a whole lot of data out there that we don’t currently teach students how to learn from – data that comes from the computational world our students live in. If we “follow the data” that students are interacting with, what “new” ways of thinking will our students need to make sense of it?
  • Album covers are a form of data, but how do we take something we can see visually and turn it into “data”? For the album covers I used from one week of 1975 and one week of 2019, we can see that the 1975 covers are not as bright and vibrant as the 2019 ones, and that people’s faces feature more in the 1975 covers. We could take the image data for each album cover, extract some overall measure of colour and use this to compare 1975 and 2019. But what measure should we use? What are luminosity, saturation, hue, etc.? How could we overfit a model to predict the year of an album cover by creating lots of super-specific rules? What pre-trained models can we use for detecting faces? How are they developed? How well do they work? What’s this thing called a “confusion matrix”?
  • An intended theme across my talk was to compare what humans can do (and to start with this) with what we could try to get computers to do, and also to emphasise how important human thinking is. I showed a video of Joy Buolamwini talking about her Gender Shades project and algorithmic bias, and tried to emphasise that we can’t teach about fun things we can do with machine learning etc. without talking about bias, data ethics, data ownership, data privacy and data responsibility. In her video, Joy uses faces of members of parliament – did she need permission to use these people’s faces for her research project since they were already public on websites? What if our students start using photos of our faces for their data projects?
  • I played the song that was number one the week I was born (tragedy!) as a way to highlight the calendar feature of the nztop40 website – as long as you were born after 1975, you can look up your song too. Getting students to notice the URL and how it changes as you navigate a web page is a useful skill – in this case, if you navigate to different chart weeks, you can notice that the “chart id” number changes. We could “hack” the URL to get the chart data for different weeks of the years available. If the website terms and conditions allow us, we could also use “web scraping” to automate the collection of chart data across a number of weeks. We could also set up a “scheduler” to copy the chart data as it appears each week. But then we need to think about what each row in our super data set represents and what visualisations might make sense to communicate trends, features, patterns etc. I gave an example of a visualisation of all the singles that reached number one during 2018, and we discussed things I had decided to do (e.g. reversing the y-axis scale) and how the visualisation could be improved [data visualisation could be a whole talk in itself!!!].
  • There are common ways we analyse music – things like key signature, time signature, tempo (speed), genre/style, instrumentation etc. – but I used one that I thought would not be too hard to teach during the talk: whether a song is in a major or minor key. However, listening to music first was really just a fun “gateway” to learning more about how the Spotify API provides “audio features” for the songs in its database, and in particular about supervised machine learning. According to Spotify, the Ed Sheeran song Beautiful People is in a minor key, but both my ear and the guitar chords published online say that it’s in a major key. What’s the lesson here? We can’t just take data that comes from a model as being the truth.
  • I also wanted to talk more about how songs make us feel, to extend thinking about the modality of the song (major = happy, minor = sad) to the lyrics used in the song as well. How can we take a set of lyrics for a song and analyse these in terms of overall sentiment – positive or negative? There are lots of approaches, but a common one is to treat each word independently (“bag of words”) and to use a pre-existing lexicon. The slides show the different ways I introduce this type of analysis, but the important point is how common it is to transfer a model trained within one data context (for the Bing lexicon, customer reviews online) and use it for a different data context (in this case, music lyrics). There might just be some issues with doing this though!
  • Overall, what I tried to do in this talk was not to showcase computer programming (coding) and mathematics, since often we make these things the “star attraction” in talks about data science education. The talk I gave was totally “powered by code” but do we need to start with code in our teaching? When I teach statistics, I don’t start with pulling out my calculator! We start with the data context. I wanted to give real examples of ways that I have engaged and supported all students to participate in learning data science: by focusing on what humans think, feel and see in the modern world first, then bringing in (new) ways of thinking statistically and computationally, and then teaching the new skills/knowledge needed to support this thinking.
  • We have an opportunity to introduce data science in a real and meaningful way at the school level, and we HAVE to do this in a way that allows ALL students to participate – not just those in enrichment/extension classes, coding clubs, and schools with access to flash technology and gadgets. While my focus is the senior levels (Years 11 to 13), the modern world of data gives so many opportunities for integrating statistical and computational thinking to learn from data across all levels. We need teachers who are confident with exploring and learning from modern data, and we need new pedagogical approaches that build on the effective ones crafted for statistics education. We need to introduce computational thinking and computer programming/coding (which are not the same things!) in ways that support and enrich statistical thinking.
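The “overall measure of colour” idea from the album-cover bullet above can be sketched without any special tools. In a real activity you would load actual cover images (e.g. with a library like Pillow); here the “covers” are just small hand-made lists of RGB pixels, so the sketch stays self-contained and uses only Python’s standard-library `colorsys` module.

```python
import colorsys

def mean_brightness_saturation(pixels):
    """Average HSV 'value' (brightness) and saturation over a list of RGB pixels (0-255)."""
    total_v = total_s = 0.0
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        total_v += v
        total_s += s
    n = len(pixels)
    return total_v / n, total_s / n

# Toy "album covers": a muted 1975-style palette vs a vibrant 2019-style one
cover_1975 = [(90, 80, 70), (60, 60, 55), (110, 100, 90)]
cover_2019 = [(255, 40, 180), (30, 220, 255), (250, 230, 40)]

v75, s75 = mean_brightness_saturation(cover_1975)
v19, s19 = mean_brightness_saturation(cover_2019)
print(v19 > v75 and s19 > s75)  # True: the 2019-style cover is brighter and more saturated
```

The point for students is that “brightness” is not one thing – a different choice of measure (mean hue, proportion of saturated pixels, etc.) could tell a different story.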
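The “hack the URL” idea from the chart-week bullet can be sketched in a few lines. The domain and query-string pattern below are assumptions for illustration only – students would discover the real pattern by watching how the address bar changes as they navigate between chart weeks.

```python
# Assumed URL pattern for illustration - check the real site's address bar
# (and its terms and conditions!) before automating anything.
BASE = "https://nztop40.co.nz/chart/singles?chart_id={}"

def chart_urls(first_id, n_weeks):
    """Build URLs for n_weeks consecutive chart ids by 'hacking' the id parameter."""
    return [BASE.format(first_id + week) for week in range(n_weeks)]

urls = chart_urls(4500, 3)
print(urls[0])
```

Fetching each URL and extracting the chart entries is the “web scraping” step; running this on a timer each week is the “scheduler” step.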
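The Spotify audio-features bullet can be made concrete with a tiny interpreter for the response. Spotify’s audio features really do include an integer key (a pitch class, 0 = C) and a mode (1 = major, 0 = minor), but the dict below is hand-written for illustration – its key/mode values are not the real values for any particular song, and a real request needs authentication against the Spotify Web API.

```python
PITCH_CLASSES = ["C", "C#/Db", "D", "D#/Eb", "E", "F", "F#/Gb",
                 "G", "G#/Ab", "A", "A#/Bb", "B"]

def describe_key(features):
    """Turn Spotify's integer key (0-11 pitch class) and mode (1 = major, 0 = minor) into words."""
    tonic = PITCH_CLASSES[features["key"]]
    quality = "major" if features["mode"] == 1 else "minor"
    return f"{tonic} {quality}"

# Illustrative response fragment (values made up, field names match the API)
features = {"key": 1, "mode": 0}
print(describe_key(features))  # "C#/Db minor"
```

This is where the “model output is not the truth” discussion starts: the mode value is itself a prediction from a supervised model, not a fact.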
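The “bag of words” sentiment idea from the lyrics bullet fits in a few lines. The lexicon below contains only a handful of made-up entries – a real analysis would use a full lexicon such as Bing Liu’s.

```python
# Tiny illustrative lexicon: +1 for positive words, -1 for negative words
LEXICON = {"beautiful": 1, "love": 1, "happy": 1,
           "sad": -1, "lonely": -1, "tragedy": -1}

def sentiment_score(lyrics):
    """Sum word scores, treating each word independently (no context, no negation)."""
    words = lyrics.lower().split()
    return sum(LEXICON.get(w, 0) for w in words)

print(sentiment_score("we are beautiful people"))  # 1
print(sentiment_score("sad and lonely tragedy"))   # -3
```

The independence assumption is exactly where the “issues” come in: “not happy” scores as positive, and a lexicon trained on customer reviews may score song lyrics very strangely.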

If you are a NZ-based teacher, and you are interested in learning more about teaching data science, then please use the “sign-up” form at (the “password” is datascience4everyone). I’ll be sending out some emails soon, probably starting with learning more about APIs (for an API in action, check out ).

What’s going on, what’s going on?

For many high school teachers here in New Zealand, the teaching year is over and it’s now a six-week summer break before school starts again next year. Despite the well-deserved break, some teachers are already thinking about ideas for next year. I’ve been amazed (and inspired) by the teachers who have signed up to spend a day with Liza and me on Friday 15th December to learn more about working with modern data (more details here). We are both really looking forward to the full-day workshop 🙂 One of the tools we’ll be working with at the workshop is the platform IFTTT (If This Then That). It’s basically a way to connect devices and online accounts using APIs (application programming interfaces) without using code.

I used IFTTT recently to collect data on New York Times articles. One of the reasons why I started collecting data on New York Times articles was because of their free, online feature “What’s Going On in This Graph?”. On Tuesday, December 12 and every second Tuesday of the month through the US school year, The New York Times Learning Network, in partnership with the American Statistical Association, hosts a live online discussion about a timely graph like the one shown below.

One of the super interesting graphs featured in “What’s Going On in This Graph?”

Students from around the world “read” the graph by posting comments about what they notice and wonder in an online forum. Teachers live-moderate by responding to the comments in real time and encouraging students to go deeper. All releases are archived so that teachers can use previous graphs anytime (read this introductory post to learn more). I used “What’s Going On in This Graph?” when I was teaching our Lies, Damned Lies and Statistics course, and it is such an awesome resource for helping build statistical literacy and thinking.

So, inspired by the New York Times graphs, about two months ago I created an “applet” on IFTTT that creates a new row in a Google spreadsheet every time a new article is posted to the New York Times website. It stopped working for some reason at the end of November – check out the “raw” data here: 

So what’s going on with the data I collected? Your first thought on viewing the data might be – huh? You call this data? The only variable that is “graph ready” is which section each of the nearly 6000 articles was published in. But there are so many variables in data sets just like this one waiting to be defined and explored. After our workshop on Friday, I’ll post an “after” version of this same data set 🙂
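“Defining” new variables from a raw row like this is mostly string and date wrangling. The column layout below (a date string, a headline, a section) is an assumption about what the IFTTT applet logged, not the actual spreadsheet schema, and the headline is invented for illustration.

```python
from datetime import datetime

# One assumed raw row from the spreadsheet: (published, headline, section)
row = ("December 12, 2017 at 09:30AM", "Example headline about a timely graph", "Learning")

published, headline, section = row
when = datetime.strptime(published, "%B %d, %Y at %I:%M%p")

derived = {
    "weekday": when.strftime("%A"),           # day of week the article appeared
    "hour": when.hour,                        # hour of day it was posted
    "headline_words": len(headline.split()),  # headline length as a new variable
    "section": section,
}
print(derived)
```

Each derived value is a new “graph ready” variable: posting time across the day, headline length by section, busiest weekdays, and so on.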

Using data challenges to encourage statistical thinking

This post provides the notes for a workshop I ran at the Christchurch Mathematical Association (CMA) and Auckland Mathematical Association (AMA) 2015 Statistics Days about using data challenges to encourage statistical thinking.

What is a data challenge? These are just the words I am using to describe a competition that involves (big) data. Some good examples are the ASA datafest and the Hudson Data Jam. Closer to home (NZ) we have the ISLP statistical poster competition and New Zealand’s Next Top Engineering Scientist. The key ingredients are an interesting dataset that can be explored to find stories and the use of presentations or visualisations of this data to tell stories and communicate insight.
Other competitions also offer opportunities to involve statistical thinking. The question posed in the 2014 NZ’s Next Top Engineering Scientist competition was “If Mt Taranaki erupted, how much would it cost the aviation industry?” and it is easy to list the reasons why answering this question needs statistics: data, models, prediction, estimation, uncertainty, variation…… and decision making! Think about forming a four-person team for the 2016 competition that includes a statistics student 🙂
The basic idea behind a predictive model is that, based on some information, we can predict what is likely to happen. For example, based on your name, I could form a model to predict your age using NZ baby name popularity data and survival rates. Another example is the well-known case of Target predicting a teenager was pregnant using information about items she had purchased. Kaggle offers lots of competitions with data; however, their competitions require skills beyond exploratory data analysis (such as coding in R or other programming languages) and formal modelling methods. The approach I am taking for this workshop is more informal, with the main focus on exploratory data analysis and probabilistic thinking 🙂
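The name-to-age idea above can be sketched as a lookup model. The popularity table below is made up for illustration – a real version would be built from actual NZ baby-name records.

```python
# Toy "model": the decade in which each name peaked (invented data)
PEAK_DECADE = {"Gladys": 1920, "Susan": 1950, "Sarah": 1980, "Aria": 2010}

def predict_birth_decade(name, default=1980):
    """Guess a birth decade from when the name was most popular (toy data)."""
    return PEAK_DECADE.get(name, default)

print(predict_birth_decade("Susan"))  # 1950
```

Even this toy version surfaces the key modelling questions: what do we predict for names we have never seen, and how wrong are we likely to be?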
There will be a longer post about the teacher gender predictor investigation/tool I developed soon. The short story is that I had a go at building a predictive model by trying to predict what gender a teacher was based on their answers to some questions. I learned a lot from doing this investigation – the biggest thing was learning first-hand the perils of over-fitting a model 🙂 The current “success rate” for the model is 68%, which sounds good (if you compare it to a fair coin predicting gender), but actually around 70% of the teachers trying out the tool are female. So, yeah, I could get a higher success rate by just predicting that every teacher is female (and only offend the male teachers!)
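The 68%-versus-70% point above is worth putting in numbers: a “majority class” rule that always predicts the most common group sets the baseline any model has to beat. The sample of ten teachers below is invented to match the roughly 70% female figure.

```python
def accuracy(predictions, actuals):
    """Fraction of predictions that match the actual values."""
    return sum(p == a for p, a in zip(predictions, actuals)) / len(actuals)

# Illustrative sample: 7 female, 3 male teachers (matching the ~70% figure)
actuals = ["F"] * 7 + ["M"] * 3
baseline = ["F"] * 10              # the "predict everyone is female" rule

print(accuracy(baseline, actuals))  # 0.7 - beats the 0.68 model
```

Comparing to this baseline, rather than to a fair coin, is the honest way to judge whether the model has actually learned anything.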

For this workshop, I devised a very structured activity where participants are guided through the exploratory data analysis aspect of the competition. I am modelling the kinds of questions I would ask of the data and how I would use software like iNZight lite to visualise the answers to these questions. I am also “sowing the seeds” of predictive modelling by encouraging “tendency” and “likely” language. You can remove all scaffolding and just give students the data and the final challenge questions and let them go for it.

For the AMA Statistics Day, I used a quick task with my Census at School stick people data cards first. Each group was given 50 different data cards from the same population of stick people. I randomly selected another stick person from the same population. I asked the teachers to predict if my person was on Facebook using their cards. The teachers at the workshop sorted their cards into Facebook and non-Facebook stick people. At this stage we didn’t have a model, we were just trying to predict whether my stick person used Facebook or not. I then asked the teachers to predict whether my stick person used Facebook or not based on the knowledge/information that my stick person uses Snapchat. This involved teachers splitting their Facebook stick people group into Snapchat and non-Snapchat people, and the same with their non-Facebook people (see below).
Yes, this kind of classification and sorting with data cards can also be used for building understanding of conditional probability and two way tables 🙂 Most groups of teachers were happy that, based on the features of their sample of stick people, a stick person who used Snapchat was more likely to use Facebook than not.
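The card sort above is exactly a two-way table, and the teachers’ prediction is a conditional probability read off it. The 50 stick-person counts below are simulated for illustration, not the actual Census at School data.

```python
from collections import Counter

# Simulated 50 data cards: (Facebook use, Snapchat use) for each stick person
cards = ([("FB", "Snap")] * 18 + [("FB", "NoSnap")] * 12 +
         [("NoFB", "Snap")] * 7 + [("NoFB", "NoSnap")] * 13)

table = Counter(cards)  # the two-way table the physical sort produces

# P(Facebook | Snapchat): among Snapchat users, what fraction use Facebook?
snap_total = table[("FB", "Snap")] + table[("NoFB", "Snap")]
p_fb_given_snap = table[("FB", "Snap")] / snap_total
print(round(p_fb_given_snap, 2))  # 0.72 with these simulated counts
```

Sorting the physical cards into four piles and counting is the same computation – the code just makes the conditioning step explicit.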


This data challenge is based on the list of the top 100 most powerful celebrities as determined by Forbes magazine. If you want to try the data challenge out while reading this post, DO NOT Google that list now 🙂 I obtained the information from the Forbes website and then merged this with information from Twitter about followers and number of tweets for those celebrities with Twitter accounts (this Twitter information being correct when I first created the data set – it will be out of date now!). I then randomly selected five celebrities to remove from the data set; the remaining 95 celebrities formed a sort of “training” set. We may already know that Beyoncé was determined the most powerful celebrity in 2014 (“Who runs the world? Girls do!”), but what does it take to get on to this list? Money, job, social media presence? Let’s explore the data to see what we can find out!


Download the csv file celebrities2014training.csv, then load it up in either iNZight desktop or iNZight lite (available from the iNZight website), or use this link to iNZight lite with the data already loaded.

Ignoring the fact that we are not comparing any of this data with celebrities who did not make the list, this is a good question to get familiar with the data set. For this whole activity we are not going to be making inferences for celebrities outside of the top 100, just for the five that I have removed from the data set. The questions on this slide allow for a structured and guided approach to exploring the data (see slide image above). You may need to help students understand how bar graphs are constructed in iNZight, and I have ordered the questions so that you look at one variable by itself first before comparing this variable across another variable. This initial exploration ends with a hint of how this kind of modelling will work, by asking students to predict the kind of celebrity someone from this top 100 list is if they are a male in their thirties. There is no “correct” answer here 🙂 The teachers at both workshops were happy with predicting “Athletes”.

For this set of questions, we need to learn about how to deal with an “outlier” – Dr Dre and his unusually high income (because he sold Beats by Dre to Apple during 2014). One approach is to change the axis limits in iNZight lite (under the advanced options), which is a better approach than removing him from the data set, as removing him would also remove the other information about him that could be useful for our model. These questions allow for looking at the relationship between two numerical variables via scatter plots, and this context allows for a meaningful discussion of other sources of variation, for example, when discussing why the relationship between earnings and ranking is not a perfect relationship. You can encourage students to try combinations of variables when exploring celebrity earnings, but also draw their attention to how small the groups can get when they start subsetting (especially if they have four variables!).

This last set of questions gives students an opportunity to explore the associated Twitter data for these celebrities. It’s pretty interesting – especially the lack of relationship between number of followers and number of tweets! There is also an opportunity in the second set of questions to demonstrate adding a third variable to a scatter plot using a colour gradient (by plotting number of Twitter followers vs number of tweets and adding ranking through “code more variables” – an advanced option in iNZight lite).
On to the challenge!

The idea with each of these challenge questions is to give the information to students, then give them some time to explore that data and discuss which type of celebrity (category) they think each person is. If they have worked through the previous questions/explorations, they should have some ideas of what to check or look for, but also one of the aims of this learning activity is that they do realise they still might not get their prediction correct despite using the data 🙂 Why not?

Warm up question – who are we?
We tend to be ranked around the middle of the top 100. Our median earnings for 2014 were around 60 million. We are more likely to be male than female.
A  Directors/Producers
B  Models/Personalities
C  Authors
D  Actors/actresses


Challenge question 1 – what type of celebrity am I?
I am a male. I am 26 years old and earned 32 million dollars in 2014. I was ranked the 34th most powerful celebrity. I am on twitter, had 9560330 followers and made 22347 tweets.


Challenge question 2 – what type of celebrity am I?
I am a female. I am 34 years old and earned 47 million dollars in 2014. I was ranked the 56th most powerful celebrity. I am on twitter, had 2524261 followers and made 3268 tweets.


Challenge question 3 – what type of celebrity am I?
I am a male. I am 45 years old and earned 60 million dollars in 2014. I was ranked the 6th most powerful celebrity. I am on twitter, had 3120088 followers and made 213 tweets.


Challenge question 4 – what type of celebrity am I?
I am a male. I am 68 years old and earned 100 million dollars in 2014. I was ranked the 11th most powerful celebrity. I am not on twitter!


Challenge question 5 – what type of celebrity am I?
I am a male. I am 39 years old and earned 61 million dollars in 2014. I was ranked the 22nd most powerful celebrity. I am on twitter, had 4293463 followers and made 401 tweets.


Did you get these all correct? Remember that you were predicting for celebrities not in the data set (unseen data) so there is no guarantee that what is the case for the 95 celebrities is also the case for each of the five celebrities under question. We also used a very small “training” set to create an informal model. Lastly, just because you got one particular prediction wrong doesn’t mean your model won’t perform well in the long run (well, not quite for this particular example since we are only concerned with the top 100 but hopefully you get my point).