I’m pretty excited about the talks and workshops I’m doing over the next month or so! Below are the summaries or abstracts for each talk/workshop and when I get a chance I’ll write up some of the ideas presented in separate posts.

**Keynote: Searching for meaningful sampling in apple orchards, YouTube videos, and many other places!** (AMA, Auckland, September 14, 2019)

In this talk, I shared some of my ideas and adventures with developing more meaningful learning tasks for sampling. Using the “Apple orchard” exemplar task, I presented some ideas for “renovating” existing tasks and then introduced some new opportunities for teaching sample-to-population inference in the context of modern data and associated technologies. I shared a simple online version of the apple orchard and also talked about how my binge watching of DIY YouTube videos led to my personal (and meaningful) reason to sample and compare YouTube videos.

**Workshop: Expanding your toolkit for teaching statistics** (AMA, Auckland, September 14, 2019)

In this workshop, we explored some tools and apps that I’ve developed to support students’ statistical understanding. Examples were: an interactive dot plot for building understanding of mean and standard deviation, a modelling tool for building understanding of distributional variation, tools for carrying out experiments online and some new tools for collecting data through sampling.

The slides for both the keynote and workshop are embedded below:

**Talk: Introducing high school statistics teachers to code-driven tools for statistical modelling** (VUW/NZCER, Wellington, September 30, 2019)

**Abstract:** The advent of data science has led to statistics education researchers re-thinking and expanding their ideas about tools for teaching and learning statistical modelling. Algorithmic methods for statistical inference, such as the randomisation test, are typically taught within NZ high school classrooms using GUI-driven tools such as VIT. A teaching experiment was conducted over three five-hour workshops with six high school statistics teachers, using new tasks designed to blend the use of both GUI-driven and code-driven tools for learning statistical modelling. Our findings from this exploratory study indicate that teachers began to enrich and expand their ideas about statistical modelling through the complementary experiences of using both GUI-driven and code-driven tools.

**Keynote: Follow the data** (NZAMT, Wellington, October 3, 2019)

**Abstract:** Data science is transforming the statistics curriculum. The amount, availability, diversity and complexity of data that are now available in our modern world requires us to broaden our definitions and understandings of what data is, how we can get data, how data can be structured and what it means to teach students how to learn from data. In particular, students will need to integrate statistical and computational thinking and to develop a broader awareness of, and practical skills with, digital technologies. In this talk I will demonstrate how we can *follow the data* to develop new learning tasks for data science that are inclusive, engaging, effective, and build on existing statistics pedagogy.

**Workshop: Just hit like! Data science for everyone, including cats (and maybe dogs)** (NZAMT, Wellington, October 2, 2019)

**Abstract:** Data science is all about integrating statistical and computational thinking with data. In this hands-on workshop we will explore a collection of learning tasks I have designed to introduce students to the exciting world of image data, measures of popularity on the web, machine learning, algorithms, and APIs. We’ll explore questions such as “Are photos of cats or dogs more popular on the web?”, “What makes a good black and white photo?”, “How can we sort photos into a particular order?”, “How can I make a cat selfie?” and many more. We’ll use familiar statistics tools and approaches, such as data cards, collaborative group tasks and sampling activities, and also try out some new computational tools for learning from data. Statistical concepts covered include features of data distributions, informal inference, exploratory data analysis and predictive modelling. We’ll also discuss how each task can be extended or adapted to focus on specific aspects and levels of the statistics curriculum. Please bring along a laptop to the workshop.

I’m also presenting a workshop at NZAMT with Christine Franklin on what makes a good statistical task. I’ve been assisting Maxine Pfannkuch and members of the NZSA education committee to set up a new teaching journal, which we will be launching at the workshop!!

Since I already had a tool that creates data cards from the *Quick, Draw!* data set, I’ve created a prototype for the kind of tool that would support this approach using the same data set.

I’ve written about the Quick, Draw! data set already:

- http://teaching.statistics-is-awesome.org/cat-and-whisker-plots-sampling-from-the-quick-draw-dataset/
- http://teaching.statistics-is-awesome.org/you-say-data-i-say-data-cards/
- http://teaching.statistics-is-awesome.org/the-power-of-pixels-modelling-with-images/

For this new tool, called different strokes, users sort drawings into two or more groups based on something visible in the drawing itself. Since you have to drag the drawings around to manually “classify” them, the larger the sample you take, the longer it will take you.

There’s also the novelty and creativity of being able to create your own rules for classifying drawings. I’ll use cats for the example below, but from a teaching and assessment perspective there are SO many drawings of so many things and so many variables with so many opportunities to compare and contrast what can be learned about how people draw in the *Quick, Draw!* game.

Here’s a precis of the kinds of questions I might ask myself to explore the general question **What can we learn from the data about how people draw cats in the Quick, Draw! game?**

- Are drawings of cats more likely to be heads only or the whole body? [I can take a sample of cat drawings, and then sort the drawings into heads vs bodies. From here, I could bootstrap a confidence interval for the population proportion].
- Is how someone draws a cat linked to the game time? [I can use the same data as above, but compare game times by the two groups I’ve created – head vs bodies. I could bootstrap a confidence interval for the difference of two population means/medians]
- Is there a relationship between the number of strokes and the pause time for cat drawings? [And what do these two variables actually measure – I’ll need some contextual knowledge!]
- Do people draw dogs similarly to cats in the *Quick, Draw!* game? [I could grab new samples of cat and dog drawings, sort all drawings into “heads” or “bodies”, and then bootstrap a confidence interval for the difference of two population proportions]
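To make the “bootstrap a confidence interval” idea in the first question concrete, here is a minimal sketch in Python. It is not the code behind the tool; the function name `bootstrap_proportion_ci`, the made-up sample of 40 sorted cat drawings, and the choice of a percentile interval are all my assumptions for illustration.

```python
import random

def bootstrap_proportion_ci(labels, target, n_boot=1000, level=0.95, seed=1):
    """Percentile bootstrap interval for the proportion of `target` in `labels`."""
    rng = random.Random(seed)
    n = len(labels)
    props = []
    for _ in range(n_boot):
        # Re-sample n labels from the original sample, with replacement.
        resample = rng.choices(labels, k=n)
        props.append(resample.count(target) / n)
    props.sort()
    tail = (1 - level) / 2
    lower = props[round(tail * n_boot)]
    upper = props[round((1 - tail) * n_boot) - 1]
    return lower, upper

# Made-up sorting result: 40 cat drawings, 28 sorted as "head" and 12 as "body".
sample = ["head"] * 28 + ["body"] * 12
lower, upper = bootstrap_proportion_ci(sample, "head")
print(f"Sample proportion of heads: {28 / 40:.2f}")
print(f"Bootstrap interval: {lower:.2f} to {upper:.2f}")
```

The same re-sampling loop could be adapted for the other questions by swapping the proportion for a difference in means, medians or proportions between two groups.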

**Check out the tool and explore for yourself here:** http://learning.statistics-is-awesome.org/different_strokes/

A little demo of the tool in action!

Here’s a scenario. You buy a jumbo bag of marshmallows that contains a mix of pink and white colours. Of the 120 in the bag, 51 are pink, which makes you unhappy because you prefer the taste of pink marshmallows.

Time to write a letter of complaint to the company manufacturing the marshmallows?

The thing we work so hard to get our statistics students to believe is that there’s this crazy little thing called chance, and it’s something we’d like them to consider for situations where random sampling (or something like that) is involved.

For example, let’s assume the manufacturing process overall puts equal proportions of pink and white marshmallows in each jumbo bag. This is not a perfect process, there will be variation, so we wouldn’t expect exactly half pink and half white for any one jumbo bag. But how much variation could we expect? We could get students to flip coins, with each flip representing a marshmallow, and heads representing white and tails representing pink. We then can collate the results for 120 marshmallows/flips – maybe the first time we get 55 pink – and discuss the need to do this process again to build up a collection of results. Often we move to a computer-based tool to get more results, faster. Then we compare what we observed – 51 pink – to what we have simulated.
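As a rough illustration of moving from coin flips to a computer-based tool, here is a minimal Python sketch of the simulation described above. The function name, the 1000 simulated bags and the random seed are my own choices for the example, not part of any particular teaching tool.

```python
import random

def simulate_pink_counts(n_marshmallows=120, p_pink=0.5, n_bags=1000, seed=2019):
    """Simulate the number of pink marshmallows in each of n_bags jumbo bags."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_bags):
        # Each "flip" is one marshmallow: pink with probability p_pink.
        pink = sum(1 for _ in range(n_marshmallows) if rng.random() < p_pink)
        counts.append(pink)
    return counts

counts = simulate_pink_counts()
observed = 51
# Proportion of simulated bags with 51 or fewer pink marshmallows.
at_most_observed = sum(c <= observed for c in counts) / len(counts)
print(f"Smallest and largest simulated counts: {min(counts)}, {max(counts)}")
print(f"Proportion of simulated bags with {observed} or fewer pink: {at_most_observed:.3f}")
```

Comparing the observed 51 pink to the spread of simulated counts is the computer version of collating lots of rounds of 120 coin flips.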

I use these kinds of activities with my students, but I wanted something more so I made a very simple app earlier this year. You can find it here: learning.statistics-is-awesome.org/threethings/. You can only do three things with it (in terms of user interactions) but in terms of learning, you can do way more than three things. Have a play!

In particular, you can show that models other than 50% (for the proportion of pink marshmallows) can also generate data (simulated proportions) consistent with the observed proportion. So, not being able to reject the model used for the test (50% pink) doesn’t mean the 50% model is the **one true thing**. There are others. Like I told my class – just because my husband and I are compatible (and I didn’t reject him), doesn’t mean I couldn’t find another husband similarly compatible.
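To see the “there are others” idea outside the app, here is a small Python sketch (an assumption-laden illustration, not how the app works) that tries several models for the percentage of pink marshmallows and checks whether the observed 51 out of 120 falls inside the middle 95% of counts simulated from each model. The function name `middle_95` and the list of models tried are mine.

```python
import random

def middle_95(p_pink, n=120, n_sims=1000, seed=1):
    """Return the limits of the middle 95% of simulated pink counts under a model."""
    rng = random.Random(seed)
    counts = sorted(
        sum(1 for _ in range(n) if rng.random() < p_pink) for _ in range(n_sims)
    )
    # With 1000 sorted results, positions 25 and 974 cut off about 2.5% in each tail.
    return counts[25], counts[974]

observed = 51
for percent in range(30, 65, 5):
    low, high = middle_95(percent / 100)
    verdict = "consistent" if low <= observed <= high else "not consistent"
    print(f"Model {percent}% pink: middle 95% is {low} to {high}, so observed {observed} is {verdict}")
```

More than one model should come out as consistent with the data, which is the point: not rejecting the 50% model does not single it out as the only plausible one.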

*Note: The app is in terms of percentages, because that aligns to our approach with NZ high school students when using and interpreting survey/poll results. However, I first use counts for any introductory activities before moving to percentages, as demonstrated with this marshmallow example. The app rounds percentages to the closest 1% to keep the focus on key concepts rather than focusing on (misleading) notions of precision. I didn’t design it to be a tool for conducting formal tests or constructing confidence intervals, more to support the reasoning that goes with those approaches.*

If you’ve been keeping track of my various talks & workshops over the last year or so, you will have noticed that I’ve become a little obsessed with analysing images (see power of pixels and/or read more here). As part of my PhD research, I’ve been using images to broaden students’ awareness of what is data, and data science, and it’s been so much fun!

If you’re in the Auckland area next week, you could come along to a workshop I’m running for R-Ladies and have some fun yourself using the statistical programming language R to explore images. The details for the workshop and how to sign up are here: https://www.meetup.com/rladies-auckland/events/255112995/

This is not a teaching-focused workshop, it’s more about learning fun and cool things you can do with images, like making GIFs like the one below….

…. and other cool things, like classifying photos as cats or dogs, or finding the most similar drawing of a duck!

It will be at an introductory level, and you don’t need to be a “lady” to come along, just supportive of gender diversity in the R community (or more broadly, data science)! **If you’ve never used R before, don’t worry – just bring yourself along with a laptop and we’ll look after you.**

This long weekend (in Auckland anyway!), I spent some time updating the Quick, Draw! sampling tool (read more about it here: Cat and whisker plots: sampling from the Quick, Draw! dataset). You may need to clear your browser cache/data to see the most recent version of the sampling tool.

One of the motivations for doing so was a visit to my favourite kind of store – a stationery store – where I saw (and bought!) this lovely gadget:

It’s a circle punch with a 2″/5 cm diameter. When I saw it, my first thought was “oh cool I can make dot-shaped data cards”, like a normal person right?

Using data cards to make physical plots is not a new idea – see censusatschool.org.nz/resource/growing-scatterplots/ by Pip Arnold for one example:

But I haven’t seen dot-shaped ones yet, so this led me to re-develop the Quick, Draw! sampling tool to be able to create some.

I was also motivated to work some more on the tool after the fantastic Wendy Gibbs asked me at the NZAMT (New Zealand Association of Mathematics Teachers) writing camp if I could include variables related to the times involved with each drawing. I suspect she has read this super cool post by Jim Vallandingham (while you’re at his site, check out some of his other cool posts and visualisations) which came out after I first released the sampling tool and compares strokes and drawing/pause times for different words/concepts – including cats and dogs!

So, with the Quick, Draw! sampling tool you can now get the following variables for each drawing in the sample:

The drawing and pause times are in seconds. The drawing time captures the time taken for each stroke from beginning to end and the pause time captures all the time between strokes. If you add these two times together, you will get the total time the person spent drawing the word/concept before either the 20 seconds was up, or Google tried to identify the word/concept. Below the word/concept drawn is whether the drawing was correctly recognised (true) or not (false).

I also added three ways to use the data cards once they have been generated using the sampling tool (scroll down to below the data cards). You can now:

- download a PDF version of the data cards, with circles the same size as the circle punch shown above (2″/5cm)
- download the CSV file for the sample data
- show the sample data as a HTML table (which makes it easy to copy and paste into a Google sheet for example)

In terms of options (2) and (3) above, I had resisted making the data this accessible in the previous version of the sampling tool. One of the reasons for this is that I wanted the drawings themselves to be considered as data, and as a human would be involved in developing this variable, there was a need to work with just a sample of all the millions of drawings. I still feel this way, so I encourage you to get students to develop at least one new variable for their sample data that is based on a feature of the drawing. For example, whether the drawing of a cat is the face only, or includes the body too.

There are other cool things possible to expand the variables provided. Students could create a new variable by adding **drawing_time** and **pause_time** together. They could also create a variable which compares the **number_strokes** to the **drawing_time**, e.g. average time per stroke. Students could also use the **day_sketched** variable to classify sketches as weekday or weekend drawings. Students should soon find the **hemisphere** is not that useful for comparisons, so could explore another country-related classification like continent. More advanced manipulations could involve working with the time stamps, which are given for all drawings using UTC time. This has consequences for the variable **day_sketched** as many countries (and places within countries) will be behind or ahead of the UTC time.
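Here is a minimal Python sketch of creating those new variables from a downloaded CSV. The three rows of data are made up, and I’m assuming the CSV uses the column names **number_strokes**, **drawing_time**, **pause_time** and **day_sketched** with day names like “Saturday”; the `word` column and the new variable names are hypothetical, so check them against your own download.

```python
import csv
import io

# Made-up rows standing in for a CSV downloaded from the sampling tool.
sample_csv = """word,number_strokes,drawing_time,pause_time,day_sketched
cat,9,6.2,4.1,Saturday
cat,5,3.4,2.0,Tuesday
dog,7,5.0,3.5,Sunday
"""

def add_variables(row):
    """Add total time, average time per stroke and weekday/weekend to one row."""
    drawing = float(row["drawing_time"])
    pause = float(row["pause_time"])
    strokes = int(row["number_strokes"])
    row["total_time"] = round(drawing + pause, 2)
    row["time_per_stroke"] = round(drawing / strokes, 2)
    is_weekend = row["day_sketched"] in ("Saturday", "Sunday")
    row["day_type"] = "weekend" if is_weekend else "weekday"
    return row

rows = [add_variables(row) for row in csv.DictReader(io.StringIO(sample_csv))]
for row in rows:
    print(row["word"], row["total_time"], row["time_per_stroke"], row["day_type"])
```

Note that the weekday/weekend classification here inherits the UTC issue mentioned above, since it is based on **day_sketched** as given.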

**If you’ve made it this far in the post…. why not play with a little R**

I wonder which common household pet Quick!drawers tend to use the most strokes to draw? Cats, dogs, or fish?

Have a go at modifying the R code below, using the iNZightPlots package by Tom Elliott and my [very-much-in-its-initial-stages-of-development] iNZightR package, to see what we can learn from the data. If you’re feeling extra adventurous, why not try modifying the code to explore the relationship between number of strokes and drawing time!

Simulation-based inference is taught as part of the New Zealand curriculum for Statistics at school level, specifically the randomisation test and bootstrap confidence intervals. Some of the reasons for promoting and using simulation-based inference for testing and for constructing confidence intervals are that:

- students are working with data (rather than abstracting to theoretical sampling distributions)
- students can see the re-randomisation/re-sampling process as it happens
- the “numbers” that are used (e.g. tail proportion or limits for confidence interval) are linked to this process.
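To show how the “numbers” link back to the re-randomisation process, here is a minimal Python sketch of a randomisation test for a difference in means. It is not how VIT works internally; the function name, the made-up measurements and the choice of a two-sided tail proportion are my assumptions for illustration.

```python
import random
from statistics import mean

def randomisation_test(group_a, group_b, n_shuffles=1000, seed=1):
    """Return the observed difference in means and a two-sided tail proportion."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_big = 0
    for _ in range(n_shuffles):
        # Re-randomise: shuffle all values, then split back into two groups.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            at_least_as_big += 1
    return observed, at_least_as_big / n_shuffles

# Made-up measurements for two groups of six.
group_a = [5.5, 6.0, 4.8, 6.2, 5.9, 5.1]
group_b = [4.0, 4.6, 3.9, 5.0, 4.2, 4.4]
observed, tail = randomisation_test(group_a, group_b)
print(f"Observed difference in means: {observed:.2f}")
print(f"Tail proportion: {tail:.3f}")
```

The tail proportion is just a count from the loop divided by the number of shuffles, which is why seeing the process matters more than the final number.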

If we work with the output only, for example the final histogram/dot plot of re-sampled/bootstrap differences, in my opinion, we might as well just use a graphics calculator to get the values for the confidence interval.

In our intro stats course, we use the suite of VIT (Visual Inference Tools) designed and developed by Chris Wild to construct bootstrap confidence intervals and perform randomisation tests. Below is an example of the randomisation test “in action” using VIT:

Last year, VIT was made available as a web-based app thanks to ongoing work by Ben Halsted! So, in this short post I’ll show how to use VIT Online with Google sheets – my two favourite tools for teaching simulation-based inference.

**1. **Create a rectangular data set using a Google sheet. If you’re stuck for data, you can make a copy of this Google sheet which contains giraffe height estimates (see this Facebook post for context – read the comments!)

**2. **Under File –> Publish to web, choose the following settings (this will temporarily make your Google sheet “public” – just “unpublish” once you have the data in VIT Online)

Be careful to select “Sheet1” or whatever the sheet you have your data in, not “Entire document”. Then, select “Comma-separated values (.csv)” for the type of file. Directly below is the link to your published data which you need to copy for step 3.

**3. **Head to VIT online –> https://www.stat.auckland.ac.nz/~wild/VITonline/index.html. Choose “Randomisation test” and copy the link from step 2 into the first text box. Then press the “Data from URL” button.

**4. **At this point, your data is in VIT online, so you can go back and unpublish your Google sheet by going back to File –> Publish to web, and pressing the button that says “Stop publishing”.

The same steps work to get data from a Google spreadsheet into VIT online for the other modules (bootstrapping etc.).

[Actually, the steps are pretty similar for getting data from a Google spreadsheet into iNZight lite. Copy the published sheet link from step 2 in the appropriately named “paste/enter URL” text box under the File –> Import dataset menu option.]

In terms of how to use VIT online to conduct the randomisation test, I’ll leave you with some videos by Chris Wild to take a look at (scroll down). Before I do, just a couple of differences between the VIT Chris uses and VIT Online and a couple of hints for using VIT Online with students.

You will need to hold down ctrl to select more than one variable before pressing the “Analyse” button e.g. to select both the “Prompt” and “Height estimate in metres” variables in the giraffe data set.

Also, to define the statistic to be tested, in VIT Online you need to press the button that says “Precalculate Display” rather than “Record my choices” as shown in the videos.

Lastly, a really cool thing about VIT Online is that once you have copied over the URL for your published Google sheet, as long as you keep your Google sheet published, you can grab the URL from VIT Online to share with students e.g. https://www.stat.auckland.ac.nz/~wild/VITonline/randomisationTest/RVar.html?file=https://docs.google.com/spreadsheets/d/e/2PACX-1vTcaGSrAbGSntbrUoifNv8g048KJwEnBI–Rmmxqu1N0rb0VRUHoUkIeT-8xo3O9eqTUqZIML_EH523/pub?gid=0&single=true&output=csv&var=%20Prompt,c&var=Height%20estimate%20in%20metres,n. Sure, it’s not the nicest looking URL in the world, so use a URL shortener like bit.ly, goo.gl, tiny.cc etc. if sharing with students to type into their devices.

Note: VIT Online is not optimised to work on small screen devices, due to the nature of the visualisations. For example, it’s important that students can see all three panels at the same time during the process, and can see what is happening!

Now, here are those videos I promised

Want to make some awesome gift tags/labels for Christmas or holiday-related presents? Here’s a fun little statistical art project. Write whatever words you want in the app below, create some secret snowflakes (the secret part being no one else will know what words you used unless of course you choose to display them), play around with colours if you want (uncheck the option to use random colours), freeze the snowflakes when you get something you like, download your masterpiece and use in some way.

Oh yeah, the snowflakes are made by rotating each letter in the words in a magical statistical way (i.e. randomness).

To make our gift labels, I made the first colour white (the background #ffffff), made the other two colours black (#000000), and then printed on to adhesive sticker paper I had left over from our wedding.

Enjoy and have a great holiday break!

**Secret snowflakes app should be shown below** (otherwise here is the link) – works best using a Chrome browser

For many high school teachers here in New Zealand, the teaching year is over and it’s now a six-week summer break before school starts again next year. Despite the well-deserved break, some teachers are already thinking about ideas for next year. I’ve been amazed (and inspired) by the teachers who have signed up to spend a day with Liza and me on Friday 15th December to learn more about working with modern data (more details here). We are both really looking forward to the full-day workshop. One of the tools we’ll be working with at the workshop is the platform IFTTT (If This Then That). It’s basically a way to connect devices and online accounts using APIs (application programming interfaces) without using code.

I used IFTTT recently to collect data on New York Times articles. One of the reasons why I started collecting data on New York Times articles was because of their free, online feature “*What’s Going On in This Graph?*”. On Tuesday, December 12 and every second Tuesday of the month through the US school year, *The New York Times Learning Network*, in partnership with the American Statistical Association, hosts a live online discussion about a timely graph like the one shown below.

Students from around the world “read” the graph by posting comments about what they notice and wonder in an online forum. Teachers live-moderate by responding to the comments in real time and encouraging students to go deeper. All releases are archived so that teachers can use previous graphs anytime (read this introductory post to learn more). I used “*What’s Going On in This Graph?*” when I was teaching our Lies, Damned lies and Statistics course, and it is such an awesome resource for helping build statistical literacy and thinking.

So, inspired by the New York Times graphs, about two months ago I created an “applet” on IFTTT that creates a new row in a Google spreadsheet every time a new article is posted to the New York Times website. It stopped working for some reason at the end of November – check out the “raw” data here: https://docs.google.com/spreadsheets/d/1PXGh0xBrJbmrfWq3nRylH5GBqzVd4SYWWiXQj3v9tdQ/edit?usp=sharing

So what’s going on with the data I collected? Your first thought on viewing the data might be – huh? You call this data? The only variable that is “graph ready” is which section each of the nearly 6000 articles was published in. But there are so many variables in data sets just like this one waiting to be defined and explored. After our workshop on Friday, I’ll post an “after” version of this same data set.

This post is the second in a series of posts where I’m going to share some strategies for getting real data to use for statistical investigations that require sample to population inference. As I write them, you will be able to find them all on this page.

**What’s your favourite board game?**

I read an article posted on fivethirtyeight about the worst board games ever invented and it got me thinking about the board games I like to play. The *Game of life* has a low average rating on the online database of games referred to in this article but I remember kind of enjoying playing it as a kid. boardgamegeek.com features user-submitted information about hundreds of thousands of games (not just board games) and is constantly being updated. While there are some data sets out there that already feature data from this website (e.g. from kaggle datasets), I am purposely demonstrating a non-programming approach to getting this data that maximises the participation of teachers and students in the data collection process.

To end up with data that can be used as part of a sample to population inference task:

1. *You need a clearly defined and nameable population* (in this case, all board games listed on boardgamegeek.com)
2. *You need a sampling frame that is a very close match to your population.*
3. *You need to select from your sampling frame using a random sampling method to obtain the members of your sample.*
4. *You need to define and measure variables from each member of the sample/population so the resulting data is multivariate.*

boardgamegeek.com actually provides a link that you can use to select one of the games on their site at random (https://boardgamegeek.com/boardgame/random), so using this “random” link (hopefully) takes care of (2) and (3). For (4), there are so many potential variables that could be defined and measured. To decide on what variables to measure, I spent some time exploring the content of the webpages for a few different games to get a feel for what might make for good variables. I decided to stick to variables that are measured directly for each game, rather than ones that were based on user polls, and went with these variables:

- Millennium the game was released (1000, 2000, all others)
- Number of words in game title
- Minimum number of players
- Maximum number of players
- Playing time in minutes (if a range was provided, the average of the limits was used)
- Minimum age in years
- Game type (strategy or war, family or children’s, other)
- Game available in multiple languages (yes or no)
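If the data ends up in a spreadsheet or script, a few of these variables can be derived rather than typed in by hand. Here is a minimal Python sketch under my own assumptions: the game record is invented, the field names are hypothetical, and I’m reading “1000” as years 1000 to 1999 and “2000” as years 2000 to 2999.

```python
def millennium(year):
    """Classify a release year as '1000', '2000' or 'other' (assumed cut-offs)."""
    if 1000 <= year <= 1999:
        return "1000"
    if 2000 <= year <= 2999:
        return "2000"
    return "other"

def title_word_count(title):
    """Count the words in a game title, splitting on whitespace."""
    return len(title.split())

def playing_time(min_minutes, max_minutes):
    """Use the average of the limits when a range of playing times is given."""
    return (min_minutes + max_minutes) / 2

# An invented game record, not taken from boardgamegeek.com.
game = {"title": "Sheep on a Hill", "year": 1987, "min_time": 30, "max_time": 60}

print(millennium(game["year"]))
print(title_word_count(game["title"]))
print(playing_time(game["min_time"], game["max_time"]))
```

Writing the rules down like this also forces decisions about edge cases (for example, whether “Rings:” counts as one word), which is a useful discussion to have with students when defining variables.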

**Time to play!**

I’ve set up a Google form with instructions of how you can help create a random sample of games from boardgamegeek.com at this link: https://goo.gl/forms/8yBqryGTzrZGhEVx2. As people play along, the sample data will be added here: https://docs.google.com/spreadsheets/d/e/2PACX-1vSzR_VSVzaaeWpCvYbAQCUewaM3Tad2zfTBO7AWuDgFFTj5Jaq2TBo6N-gQGCe5e5t_qKW7Knuq6-pr/pub?gid=552938859&single=true&output=csv . The URL to the game is included so that the data can be checked. Feel free to copy and adapt however you want, but do keep in mind the nature of the variables you use. In particular, be very careful about using any of the aggregate ratings measures (and another great article by fivethirtyeight about movie ratings explains some of the reasons why.)

**Bonus round**

I wrote a post recently – Just Google it – which featured real data distributions. boardgamegeek.com also provides simple graphs of the ratings for each game, so we can play a similar matching game. You could also try estimating the mean and standard deviation of the ratings from the graph, with the added game feature of reverse ordering!
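To check an estimate of the mean and standard deviation made from a ratings graph, the calculation can be sketched in Python as below. The counts are made up (not read from any real game’s graph), the function name is mine, and I’ve divided by the total count rather than the count minus one when calculating the standard deviation.

```python
from math import sqrt

def mean_sd_from_counts(counts):
    """Mean and standard deviation of ratings, given a count for each rating."""
    total = sum(counts.values())
    mean = sum(rating * count for rating, count in counts.items()) / total
    variance = sum(count * (rating - mean) ** 2 for rating, count in counts.items()) / total
    return mean, sqrt(variance)

# Made-up counts for ratings 1 to 10, as if read off a bar graph.
ratings = {1: 5, 2: 10, 3: 20, 4: 40, 5: 70, 6: 90, 7: 60, 8: 30, 9: 10, 10: 5}
mean_rating, sd_rating = mean_sd_from_counts(ratings)
print(f"Mean rating: {mean_rating:.2f}, standard deviation: {sd_rating:.2f}")
```

Students could make their estimates by eye first, then read approximate counts off the graph to see how close they were.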

Which games do you think match which ratings graphs?

- Monopoly
- The Lord of the Rings: The Card Game
- Risk
- Tic-tac-toe

I couldn’t find a game that had a clear bi-modal distribution for its ratings but I reckon there must be games out there that people either love or hate. Let me know if you find one! To get students familiar with boardgamegeek.com, you could ask them to first search for their favourite game and then explore what information and ratings have been provided for this on the site. Let the games begin!

Here’s a really quick idea for a matching activity, totally building off Pip Arnold’s excellent work on shape.

At the bottom of this post are six “Popular times” graphs generated today by Google when searching for the following places of interest:

- Cafe
- Shopping mall
- Library
- Swimming pool
- Gym
- Supermarket

Can you match which graphs go with which places?

[you can find the answers at the bottom]

