Upcoming workshop: Using R to explore and exploit features of images

If you’ve been keeping track of my various talks & workshops over the last year or so, you will have noticed that I’ve become a little obsessed with analysing images (see power of pixels and/or read more here). As part of my PhD research, I’ve been using images to broaden students’ awareness of what data and data science are, and it’s been so much fun!

If you’re in the Auckland area next week, you could come along to a workshop I’m running for R-Ladies and have some fun yourself using the statistical programming language R to explore images. The details for the workshop and how to sign up are here: https://www.meetup.com/rladies-auckland/events/255112995/

The power of pixels: Using R to explore and exploit features of images

Thursday, Oct 18, 2018, 6:00 PM

G15, Science Building 303, University of Auckland
38 Princes Street Auckland, NZ

Kia ora koutou Anna Fergusson, one of our R-ladies Auckland co-organisers, will be the speaker at this meetup. We’ll explore a range of techniques and R packages for working with images, all at an introductory level. Time: 6:00 arrival for a 6:30pm start. What to bring: Laptops with R installed, arrive early if you are a beginner and would like hel…

This is not a teaching-focused workshop; it’s more about learning fun and cool things you can do with images, such as making GIFs like the one below….
Of course it’s a cat!
…. and other cool things, like classifying photos as cats or dogs, or finding the most similar drawing of a duck!

Duck duck stats!
It will be at an introductory level, and you don’t need to be a “lady” to come along, just supportive of gender diversity in the R community (or more broadly, data science)! If you’ve never used R before, don’t worry – just bring yourself along with a laptop and we’ll look after you 🙂

The power of pixels: Modelling with images

This post provides the notes for the plenary I gave for the Auckland Mathematical Association (AMA) about using images as a source of data for teaching statistical investigations.

You might be disappointed to find out that my talk (and this post) is not about the movie Pixels, as my husband initially thought it was. It’s probably a good thing I decided to focus on pixels in terms of data about a computer or digital image, as the box office data about Pixels the movie suggests that the movie didn’t perform so well 🙂 Instead, for this talk I presented some examples of using images as part of statistical investigations that (hopefully) demonstrated how the different combinations of humans, digital technologies, and modelling can lead to some pretty interesting data. The abstract for the talk is below:

How are photos of cats different from photos of dogs? How could someone determine where you come from based on how you draw a circle? How could the human job of counting cars at an intersection be cheaply replaced by technology? I will share some examples of simple models that I and others have developed to answer these kinds of questions through statistical investigations involving the analysis of both static and dynamic images. We will also discuss how the process of creating these models utilises statistical, mathematical and computational thinking.

As I was using a photo of my cat Elliot to explain the different ways we can use images to collect data, a really funny thing happened (see the embedded tweet below).

Yes, an actual real #statscat appeared in the room! What are the chances of that? 🙂

Pixels are the squares of colour that make up computer or digital (raster) images. Each image has a certain number of pixels e.g. an image that is 41 pixels in width and 15 pixels in height contains 615 pixels, which is an obvious link to concepts of area. The 615 pixels are stored in an ordered list, so the computer knows how to display them, and each pixel contains information about colour. Using RGB colour values (other systems exist), each pixel contains information about the amounts of red, green and blue on a scale of 0 to 255 inclusive. To get at the information about the pixels is going to require some knowledge of digital technologies, and so the use of images within statistical investigations can be a nice way to teach objectives from across the different curriculum learning areas.
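
To make the idea of an ordered list of pixels concrete, here is a minimal Python sketch. It is an illustration only, not code from the talk: the all-grey image and the helper name `pixel_at` are made up for the example, and real image files are usually read with an image library rather than built by hand.

```python
# A tiny hypothetical image: 41 pixels wide and 15 pixels high, all mid-grey.
WIDTH, HEIGHT = 41, 15

# The pixels are stored as one ordered list, row by row.
# Each pixel is an (R, G, B) tuple with values from 0 to 255 inclusive.
pixels = [(128, 128, 128)] * (WIDTH * HEIGHT)

def pixel_at(x, y):
    """Return the colour of the pixel in column x, row y (both start at 0)."""
    return pixels[y * WIDTH + x]

total_pixels = len(pixels)  # width x height, the "area" of the image
print(total_pixels)
print(pixel_at(40, 14))  # the bottom-right pixel
```

Because the list is ordered row by row, the position of any pixel can be worked out from its column and row, which is how the computer knows where to display each colour.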

Using images as a source of data can happen on at least three levels. Using the aforementioned photo of my cat Elliot, humans could extract data from the image by focusing on things they can see, for example, that the image is a black and white photo and not in colour, that there are two cats in the photo, and that Elliot does not appear to be smiling. Data that is also available about the image using digital tech includes variables such as the number of pixels, the file type and the file size. Data that can be generated using models related to this image could be identifying the most prominent shade of grey, the likelihood this photo will get more than 100 likes on Instagram and what the photo is of (cat vs dog for example, a popular machine learning task).

Static images

The first example used the data, in particular the photos, collected as part of the ongoing data collection project I have running about cats and dogs (the current set of pet data cards can be downloaded here). As humans, we can look at images, notice things that are different and these features can be used to create variables. For example, if you look at some of the photos submitted: some pets are outside while others are inside; some pets are looking at the camera while others are looking away from the camera; and some are “close ups” while others are taken from a distance.

These potential variables are all initially categorical, but by using digital technologies, numerical variables are also possible. To create a measure of whether a photo is a “close up” shot of a pet, the area the pet takes up of the photo can be measured. This is where pixels are super helpful. I used paint.net, free image editing software, to show that if I trace around the dog in this photo using the lasso tool that the dog makes up about 61 000 pixels. If you compare this figure to the total number of pixels in the image (90 000), you can calculate the percentage the dog makes up of the photo.
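
The calculation is a simple ratio. As a small Python sketch (the function name `percentage_of_photo` is made up for this example), using the two pixel counts quoted above:

```python
def percentage_of_photo(subject_pixels, total_pixels):
    """Percentage of the image taken up by the traced subject."""
    return 100 * subject_pixels / total_pixels

# About 61 000 pixels traced around the dog, in a 90 000 pixel photo.
dog_percentage = percentage_of_photo(61_000, 90_000)
print(f"{dog_percentage:.1f}%")
```

So the dog makes up roughly 68% of this photo.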

For the current set of pet data cards, each photo now has this percentage displayed. Based on this very small sample of six pets, it kind of looks like maybe cats typically make up a larger percentage of the photo than dogs, but I will leave this up to you to investigate using appropriate statistical modelling 🙂

For a pretty cool example of using static images, humans, digital technologies and models, you should take a look at how-old.net. As humans, we can look at photos of people and estimate their age and compare our estimates to people’s actual ages. What how-old.net has done is used machine learning to train a model to predict someone’s age based on the features of the photo submitted. I asked teachers at the talk to select which of the three photos they thought I looked the youngest in (most said B), which is the same photo that the how-old.net model predicted I looked the youngest in. A good teaching point about the model used by how-old.net is that it does get updated, as new data is used to refine its predictions.

You can also demonstrate how models can be evaluated by comparing what the model predicts to the actual value (if known). Fortunately I have a large number of siblings and so a handy (and frequently used) range of different aged people to test the how-old.net model. Students could use public figures, such as athletes, politicians, media personalities or celebrities, to compare each person’s actual age to what the model predicts (since it’s likely that both photos and ages are available on the internet).

There is also the possibility of setting up an activity around comparing humans vs models – for the same set of photos, are humans better at predicting ages than how-old.net? Students could be asked to consider how they could set up this kind of activity, what photos they could use, and how they would decide who was better – humans or models?
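
One possible way to "decide who was better" is to compare the average size of the prediction errors. The Python sketch below is only one option, and every number in it is invented for illustration: these are not real ages, real human guesses, or real how-old.net predictions.

```python
# Hypothetical data for five photos (all values made up for this example).
actual_ages = [23, 35, 41, 52, 67]
human_guesses = [25, 30, 45, 50, 60]
model_guesses = [28, 33, 38, 58, 70]

def mean_absolute_error(guesses, actual):
    """Average distance (in years) between each guess and the actual age."""
    return sum(abs(g - a) for g, a in zip(guesses, actual)) / len(actual)

human_mae = mean_absolute_error(human_guesses, actual_ages)
model_mae = mean_absolute_error(model_guesses, actual_ages)
print(f"Humans: off by {human_mae:.1f} years on average")
print(f"Model:  off by {model_mae:.1f} years on average")
```

Students might argue for other measures, for example the proportion of guesses within five years of the actual age, and comparing the measures is a useful discussion in itself.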

Drawings

The next example used the set of drawings Google has made available from their Quick! Draw! game and artificial intelligence experiment. I’ve already written a post about this data set, so have a read of that post if you haven’t already 🙂 In this talk, I asked teachers to draw a quick sketch of a cat and then asked them to tell me whether they drew just the face, or the body as well (most drew the face and body – I’m not sure if the appearance of an actual cat during the talk influenced this at all!) I also asked them to think about how many times they lifted their pen off the paper. I probably forgot to say this at the time, but for some things humans are pretty good at providing data but for others, digital technologies are better. In the case of drawing and thinking about how many strokes you made while drawing, we would get more accurate data if we could measure this using a mouse, stylus or touchscreen than asking people to remember.

Using the random sampler tool that I have set up that allows you to choose one of the objects players have been asked to draw for Quick! Draw!, I generated a random sample of 200 of the drawings made when asked to draw a cat. The data that can be used from each drawing is a combination of what humans and digital technologies can measure. The drawing itself (similar to the photos of pets in the first example) can be used to create different variables, for example whether the sketch is of the face only, or the face and body. Other variables are also provided, such as the timestamp and country code, both examples of data that is captured from players of the game without them necessarily realising (e.g. digital traces).

After manually reviewing all 200 drawings and recording data about the variables, I used iNZight VIT to construct bootstrap confidence intervals for two things: the proportion of all cat drawings in the Quick! Draw! dataset that were only of faces, and the difference between the mean number of strokes for cat drawings with bodies and the mean number of strokes for cat drawings of faces only. Interestingly, while the teachers at the talk mostly drew sketches of cats with bodies, most players of Quick! Draw! only sketch the faces of cats. This could be due to the 20 second time limit enforced when playing the game. It makes sense that, on average, Quick! Draw! players use more strokes to draw cats with bodies versus cats with just faces. I wished at the time that I had also recorded information about the other variables provided for each drawing, as it would have been good to further explore how the drawings compare in terms of whether the game correctly identified more of the face-only drawings of cats than the body drawings.
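
For readers who want to see the resampling idea behind a bootstrap confidence interval written out, here is a rough Python sketch for a proportion. It is not how iNZight VIT is implemented, it uses a simple percentile interval, and the sample (130 face-only drawings out of 200) is invented for illustration rather than taken from my data.

```python
import random

def bootstrap_proportion_ci(sample, resamples=1000, level=0.95, seed=1):
    """Percentile bootstrap interval for a proportion.

    `sample` is a list of 1s (face only) and 0s (face and body).
    """
    rng = random.Random(seed)
    n = len(sample)
    proportions = []
    for _ in range(resamples):
        # Resample n drawings from the original sample, with replacement.
        resample = rng.choices(sample, k=n)
        proportions.append(sum(resample) / n)
    proportions.sort()
    # Keep the middle 95% (by default) of the resampled proportions.
    lower_index = int((1 - level) / 2 * resamples)
    upper_index = int((1 + level) / 2 * resamples) - 1
    return proportions[lower_index], proportions[upper_index]

# Hypothetical sample: 130 of 200 cat drawings were of the face only.
face_only = [1] * 130 + [0] * 70
lower, upper = bootstrap_proportion_ci(face_only)
print(f"Bootstrap interval for the proportion: {lower:.3f} to {upper:.3f}")
```

The same resampling loop works for a difference between two means: resample each group separately and record the difference each time.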

What is also really interesting is the artificial intelligence aspect of the game. The video below explains this pretty well, but basically the model that is used to guess what object is being drawn is trained on what previous players of the game have drawn.

From a maths teacher’s perspective, this is a good example of what can go wrong with technology and modelling. For example, players are asked to draw a square, and because the model is trained on how they draw the object, players who draw four lines that are roughly perpendicular behave similarly from the machine’s perspective because the technology is looking for commonalities between the drawings. What the technology is not detecting is that some players do not know what a square is, or think squares and rectangles are the same thing. So the data being used to train the model is biased. The consequence of this bias is that the model will now reinforce players’ misunderstanding that a rectangle is a square by “correctly” predicting they are drawing a square when they draw a rectangle! An interesting investigation I haven’t done yet would be to estimate what percentage of drawings made for squares are rectangles 🙂 I would also suggest checking out some of the other “shape” objects to see other examples e.g. octagons.

Using a more complex form of the Google Quick! Draw! dataset, Thu-Huong Ha and Nikhil Sonnad analysed over 100 000 of the drawings made of circles to show how language and culture influence sketches. For example, they found that 86% of the circles drawn by players in the US were drawn counterclockwise, while 80% of the circles drawn by players in Japan were drawn clockwise. To me, this is really fascinating stuff, and really cool examples of how using images as a source of data can result in really meaningful investigations about the world.
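
I don’t know exactly how Ha and Sonnad classified each circle, but one common way to work out the direction of a drawn loop from its ordered points is the sign of the "shoelace" area. The Python sketch below assumes the stroke is available as a list of (x, y) points in the order they were drawn, and that y increases down the screen (an assumption about the coordinate system, switchable with the made-up `y_axis_down` parameter).

```python
def signed_area(points):
    """Shoelace formula: positive for counterclockwise loops when y points up."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        area += x1 * y2 - x2 * y1
    return area / 2

def drawing_direction(points, y_axis_down=True):
    """Classify a closed stroke as clockwise or counterclockwise."""
    area = signed_area(points)
    if area == 0:
        return "undetermined"
    counterclockwise = area > 0
    if y_axis_down:
        # On a screen, y grows downwards, which flips the apparent direction.
        counterclockwise = not counterclockwise
    return "counterclockwise" if counterclockwise else "clockwise"

# A rough four-point "circle": right, then down the screen, then left.
stroke = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(drawing_direction(stroke))
```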

Animation

The last example I used was about using videos as a source of data for probability distribution modelling activities. I’ve presented some workshops before where I used a video (traffic.mp4) from a live streaming traffic camera positioned above a section of the motorway in Wellington. Focusing on the lane of traffic closest to the front of the screen, I got teachers to count how many cars arrived at a fixed point in that lane every five seconds. This gave us a nice set of data which we could then use to test the suitability of a Poisson distribution as a model.
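
As an informal illustration of what testing the suitability of a Poisson model can look like, the Python sketch below compares the observed proportion of each count with the probability from a Poisson distribution whose rate is the sample mean. The counts are invented for this example and are not the data collected in the workshops.

```python
from collections import Counter
from math import exp, factorial

def poisson_probability(k, rate):
    """P(X = k) for a Poisson distribution with the given rate."""
    return exp(-rate) * rate**k / factorial(k)

# Hypothetical counts of cars arriving in 20 five-second intervals.
counts = [0, 1, 1, 2, 0, 3, 1, 2, 1, 0, 2, 1, 4, 1, 2, 0, 1, 3, 2, 1]

rate = sum(counts) / len(counts)  # mean cars per five seconds
observed = Counter(counts)

print(f"Estimated rate: {rate:.2f} cars per five seconds")
print("cars  observed  model")
for k in range(max(counts) + 1):
    observed_proportion = observed[k] / len(counts)
    model_probability = poisson_probability(k, rate)
    print(f"{k:>4}  {observed_proportion:>8.2f}  {model_probability:>5.2f}")
```

Large gaps between the two columns, beyond what sampling variation alone would produce, are a hint that the Poisson model may not suit the situation.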

For this talk, I wanted to demonstrate how humans could be replaced (potentially) by digital technologies and models. Since the video is a collection of images shown quickly (around 50 frames per second), we can use pixels, or potentially just a single pixel, in the images to measure various attributes of the cars. About a year ago, I set myself the challenge of exploring whether it would be possible to glean information about car counts, car colours etc. and shared my progress with this personal project at the end of the talk.

So, yes there does exist pretty fancy video analysis software out there that I could use to extract the data I want, but I wanted to investigate whether I could use a combination of statistical, mathematical and computational thinking to create my own model to generate the data. As part of my PhD, I’m interested in finding out what activities could help introduce students to the modern art and science of learning from data, and what is nice about this example is that the idea of how the model could count how many cars are arriving every five seconds at a fixed point on the motorway is actually pretty simple and so potentially a good entry point for students.

The basic idea behind the model is that when there are no cars at the point on the motorway, the pixel I am tracking is a certain colour. This colour becomes my reference colour for the model. Using the RGB colour system, for each frame/image in the traffic video, I can compare the current colour of the pixel e.g. rgb(100, 250, 141) to the reference colour e.g. rgb(162, 158, 162). As soon as the colour changes from the reference colour, I can infer this means a car has arrived at the point on the motorway. And as soon as the colour changes back to the reference colour, I can infer that the car has left the point on the motorway. While the car is moving past the point, I can also collect data on the colour of the pixel from each frame, and use this to determine the colour of the car.
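
The logic described above can be sketched in a few lines of Python. This is a simplified illustration and not the model I built in CODAP: it assumes the colour of the tracked pixel has already been extracted for each frame, and the `tolerance` threshold, the function names and the frame colours (apart from the two rgb values quoted above) are all made up for the example.

```python
def colour_distance(a, b):
    """Straight-line distance between two RGB colours."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def detect_cars(pixel_colours, reference, tolerance=30):
    """Group consecutive non-reference frames into cars.

    Returns one list of pixel colours per detected car.
    """
    cars = []
    current = None  # colours recorded for the car currently passing, if any
    for colour in pixel_colours:
        car_present = colour_distance(colour, reference) > tolerance
        if car_present:
            if current is None:
                current = []  # the colour has just changed: a car has arrived
            current.append(colour)
        elif current is not None:
            cars.append(current)  # back to the reference colour: the car has left
            current = None
    if current is not None:
        cars.append(current)  # a car was still passing when the video ended
    return cars

def average_colour(colours):
    """Average each of the red, green and blue values separately."""
    n = len(colours)
    return tuple(round(sum(c[i] for c in colours) / n) for i in range(3))

reference = (162, 158, 162)
frames = [
    reference, reference,
    (100, 250, 141), (104, 246, 139),      # first car passes
    reference,
    (20, 20, 25), (22, 18, 27), (24, 22, 23),  # second car passes
    reference,
]

cars = detect_cars(frames, reference)
print(len(cars))
print([average_colour(car) for car in cars])
```

A tolerance is used because the road colour is rarely identical from frame to frame; choosing it is part of the modelling, and a poor choice is one way cars can be counted twice or missed.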

I’m still working on the model (in that I haven’t actually modified it since I first played around with the idea last year) and the video below shows where my project within CODAP (Common Online Data Analysis Platform) is currently at. When I get some time, I will share the link to this CODAP data interactive so you and your students can play around with choosing different pixels to track and changing other parameters of the model I’ve developed 🙂

You might notice by watching this video that the model needs some work. The colours being recorded for each car are not always that good (average colour is an interesting concept in itself, and I’ve learned a lot more about how to work with colour since I developed the model) and some cars end up being recorded twice or not at all. But now that I’ve developed an initial model to count the cars that arrive every five seconds, I can compare the data generated from the model to the data generated by humans to see how well my model performed.

You can see at the moment, that the data looks very different when comparing what the humans counted and what the digital tech + model counted. So maybe the job of traffic counter (my job during university!) is still safe – for now 🙂

Going crackers

I didn’t get time in the talk to show an example of a statistical investigation that used images (photos of animal crackers or biscuits) to create an informal prediction model. I’ll write about this in another post soon – watch this space!

Statistical reasoning with data cards (webinar)

UPDATE: The video of the webinar is now available here.

I’m super excited to be presenting the next ASA K-12 Statistics Education Webinar. The webinar is based on one of my sessions from last year’s Meeting Within a Meeting (MWM) and will be all about using data cards featuring NZ data/contexts. I’ll also be using the digital data cards featured in my post Initial adventures in Stickland if you’d like to see these in “teaching action”.

The webinar is scheduled for Thursday April 20 9:30am New Zealand Time (Wednesday April 19 at 5:30 pm Eastern Time, 2:30 pm Pacific), but if you can’t watch it live a video of the webinar will be made available after the live presentation 🙂

Here are all the details about the webinar:

Title: Statistical Reasoning with Data Cards

Presenter: Anna-Marie Fergusson, University of Auckland

Abstract: Using data cards in the teaching of statistics can be a powerful way to build students’ statistical reasoning. Important understandings related to working with multivariate data, posing statistical questions, recognizing sampling variation and thinking about models can be developed. The use of real-life data cards involves hands-on and visual-based activities. This talk will present material from the Meeting Within a Meeting (MWM) Statistics Workshop held at JSM Chicago (2016) which can be used in classrooms to support teaching within the Common Core State Standards for Mathematics. Key teaching and learning ideas that underpin the activities will also be discussed.

To RSVP to participate in the live webinar, please use the following link: https://goo.gl/forms/pQ5taydWwOZy2WOJ3

The ASA is offering this webinar without charge and only internet and telephone access are necessary to participate. This webinar series was developed as part of the follow-up activities to the Meeting Within a Meeting (MWM) Workshop for Math and Science teachers held in conjunction with the Joint Statistical Meetings (www.amstat.org/education/mwm). MWM will be held again in Baltimore, MD on August 1-2, 2017.  For those unavailable to participate in the live webinar, ASA will record this webinar and make it available after the live presentation. Previous webinar recordings are available at http://www.amstat.org/asa/education/K-12-Statistics-Education-Webinars.aspx.

Using data and simulation to teach probability modelling

This post provides the notes and resources for a workshop I ran for the Auckland Mathematical Association (AMA) on using data and simulation to teach probability modelling (specifically AS91585/AS91586). This post also includes notes about a workshop I ran for the AMA Statistics Teachers’ Day 2016 about my research into this area.

Using data in different ways

The workshop began by looking at three different questions from the AS91585 2015 paper. What was similar about all three questions was that they involved data, however, how this data was used with a probability model was different for each question.

For the first question (A), we have data on a particular shipment of cars: we know the proportion of cars with petrol caps on the left-hand side of the car and the percentage of cars that are silver. We are then told that one of the cars is selected at random, which means that we do not need to go beyond this data to solve the problem. In this situation, the “truth” is the same as the “model”. Therefore, we are finding the probability.

For the second question (B), we have data on 10 cars getting petrol: we know the proportion of cars with petrol caps on the left-hand side of the car. However, we are asked to go beyond this data and generalise about all cars in NZ, in terms of their likelihood of having petrol caps on the left-hand side of the cars. This requires developing a model for the situation. In this situation, the “truth” is not necessarily the same as the “model”, and we need to take into account the nature of the data (amount and representativeness) and consider assumptions for the model (the conditions, the model applies IF…..). Therefore, when we use this model we are finding an estimate for the probability.

For the third question (C), we have data on 20 cars being sold: we know the number of cars that have 0 for the last digit of the odometer reading (six). What we don’t know is if observing six cars with odometer readings that end in 0 is unusual (and possibly indicative of something dodgy). This requires developing a model to test the observed data (proportion), basing this model on an assumption that the last digit of an odometer reading should just be explained by chance alone (equally likely for each digit). Therefore, when we use this model, we generate data from the model (through simulation) and use this simulated data to estimate the chance of observing 6 (or more) cars out of 20 with odometer readings that end in 0. If this “tail proportion” is small (less than 5%), we conclude that chance was not acting alone.
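
The simulation for question (C) can be sketched in Python as follows. This is an illustration under the stated model (each last digit equally likely, so a probability of 0.1 of ending in 0, with cars independent of each other); the function names, the number of trials and the seed are choices made for the example. The exact binomial calculation is included only as a check on the simulation.

```python
import random
from math import comb

def simulate_tail_proportion(n_cars=20, observed=6, p_zero=0.1,
                             trials=10_000, seed=2015):
    """Proportion of simulated samples with `observed` or more readings ending in 0."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(trials):
        # Generate one sample of cars from the chance-alone model.
        zeros = sum(rng.random() < p_zero for _ in range(n_cars))
        if zeros >= observed:
            extreme += 1
    return extreme / trials

def exact_tail_probability(n_cars=20, observed=6, p_zero=0.1):
    """Exact binomial probability of `observed` or more, for comparison."""
    return sum(
        comb(n_cars, k) * p_zero**k * (1 - p_zero) ** (n_cars - k)
        for k in range(observed, n_cars + 1)
    )

tail_proportion = simulate_tail_proportion()
exact = exact_tail_probability()
print(f"Simulated tail proportion: {tail_proportion:.3f}")
print(f"Exact binomial tail probability: {exact:.3f}")
```

Under this model the tail probability is roughly 1%, well below the 5% guideline, so the conclusion would be that chance was not acting alone.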

There are a lot of ideas to get your head around! Sitting in there are ideas around what probability models are and what simulations are (see the slides for more about this) and as I discovered during my research last year with teachers and probability distribution modelling, these ideas may need a little more care when defining and using with students. The main reason I think we need to be careful using data when teaching probability modelling is because it matters whether you are using data from a real situation, where you do not know the true probability, or whether you are using data that you have generated from a model through simulation. Each type of data tells you something different and is used in a different way in the modelling process. In my research, this led to the development of the statistical modelling framework shown below:

All models are wrong but some are more wrong than others: Informally testing the fit of a probability distribution model

At the end of 2016, I presented a workshop at the AMA Statistics Teachers’ Day based on my research into probability distribution modelling (AS91586). This 2016 workshop also went into more detail about the framework for statistical modelling I’m developing. The video for this workshop is available here on Census At School NZ.

We have a clear learning progression for how “to make a call” when making comparisons, but how do we make a call about whether a probability distribution model is a good model? As we place a greater emphasis on the use of real data in our statistical investigations, we need to build on sampling variation ideas and use these within our teaching of probability in ways that allow for key concepts to be linked but not confused. Last year I undertook research into teachers’ knowledge of probability distribution modelling. At this workshop, I shared what I learned from this research, and also shared a new free online tool and activities I developed that allows students to informally test the fit of probability distribution models.

During the workshop, I showed a live traffic camera from Wellington (http://wixcam.citylink.co.nz/nph-webcam.cgi/terrace-north), which was the context for a question developed and used (the starter question AKA counting cars). Before the workshop, I recorded five minutes of the traffic and then set up a special html file that pauses the video every five seconds. This was so teachers at the workshop (and students) could count the number of cars passing different points on the motorway (marked with different coloured lines) every five seconds. To use this html file, you need to download both of these files into the same folder – traffic.html and traffic.mp4. I’ve only tested my files using the Chrome browser 🙂

If you don’t want to count the cars yourself, you can head straight to the modelling tool I developed as part of my research: http://learning.statistics-is-awesome.org/modelling-tool/. In the dropdown box under “The situation” there are options for the different coloured points/lines on the motorway. The idea behind getting teachers and students to actually count the cars was to try to develop a greater awareness of the complexity of the situation being modelled, to reinforce the idea that “all models are wrong” – that they are approximations of reality but not the truth. Also, I wanted to encourage some deeper thinking about limitations of models. For example, in this situation, looking at five second periods, there is an upper limit on how many cars you can count due to speed restrictions and following distances. We also need to get students to think more about the model in terms of sample space (the set of possible outcomes) and the shape of the distribution (which is linked to the probabilities of each of these outcomes), not just the conditions for applying the probability distribution 🙂

In terms of the modelling tool, I developed a set of teaching notes early last year, which you can access in the Google drive below. This includes some videos I made demonstrating the tool in action 🙂 I also started developing a virtual world (stickland http://learning.statistics-is-awesome.org/stickland-modelling/) but this is still a work in progress. Once you have collected data on either the birds or the stick people, you can copy and paste it into the modelling tool. There will be more variables to collect data on in the future for a wider range of possible probability distributions (including situations where none is applicable).

Slides from IASC-ARS/NZSA 2017 talk

https://goo.gl/dfA9MF

Resources for workshop (via Google Drive)

Developing learning and formative assessment tasks for evaluating statistically-based reports

This post provides the notes and resources for a workshop I ran for the Auckland Mathematical Association (AMA) on developing learning and formative assessment tasks for evaluating statistically-based reports (specifically AS91584).

Notes for workshop

The starter task for this workshop was based around a marketing leaflet I received in my letterbox for a local school back in 2014. I was instantly skeptical about the claims being made by the school and went straight to sources of public data to check the claims. As was often the case, this personal experience turned into an activity I used with my Scholarship Statistics students to help them develop their critical evaluation skills. The task, public data I used, and my attempt at answers (from my past self in 2014) are provided at the bottom of this post. My overall conclusion was that most of the claims check out until around 2011, but not so much for 2012 – 2013, leading me to speculate that the school had not updated their marketing leaflet. The starter task is all about claims and data, and not so much about statistical processes, study design, or inferential reasoning – all of which are required for students to engage with the evaluation of statistically-based reports. However, I used this task to set the focus of the workshop, which was to focus on the claims that are being made, and whether they can be supported or not, and why.

The questions used for the external assessment tasks for AS91584 (available here) are designed to help scaffold students to critique the report in terms of the claims, statements or conclusions made within the report. Students need to draw on what has been described in the report and relevant contextual and statistical knowledge to write concise and clear discussion points that show statistical insight and answer the questions posed. This is hard for students. Students find it easy to write very creative, verbose and vague responses, but harder to write responses that are not based only on speculation or that are not rote learned. We see this difficulty with internally assessed tasks as well, so it’s not that surprising that students struggle to write concise, clear, and statistically insightful discussion points under exam pressure.

Teachers who I have spoken to who have taught this standard (which includes me) really enjoy teaching statistical reports to students. In reflections and conversations with teachers on how we could further improve the awesome teaching of statistical reports, a few ideas or suggestions emerged:

  • Perhaps we focus our teaching too much on content, keeping aspects such as margins of error and confidence intervals, observational studies vs experiments, and non-sampling errors too separate?
  • Perhaps we focus too much on “good answers” to questions about statistical reports, rather than “good questions” to ask of statistical reports?

Great ideas for teaching statistical reports can be sourced from Census at School NZ or from conversations with “statistical friends” (see the slides for more details). These include ideas such as: experiencing the study design first and then critiquing a statistical report that used a similar design, using matching cards to build confidence with different ideas, keeping a focus on the statistical inquiry cycle, teaching statistical reports through the whole year rather than in one block, and teaching statistical reports alongside other topics such as time series, bivariate analysis, and bootstrapping confidence intervals. I quite like the idea of the “seven deadly sins” of statistical reports, but didn’t quite have enough time to develop what these could be before the workshop – feel free to let me know if you come up with a good set! [Update: Maybe these work or could be modified?]

When I taught statistical reports in 2013 (the first year of the new achievement standard/exam), I was gutted when I got my students’ results back at the start of 2014. I reflected on my teaching and preparation of students for the exam and realised I had been too casual about teaching students how to respond to questions. In particular, I had expected my “good” students would gain excellence (the highest grade – showing statistical insight) because they had gained excellences for the internally-assessed standards or were strong contenders to get a Scholarship in Statistics. So, a bit later in 2014, when the assessment schedules came out, I looked carefully at what had been written as expected responses. To me, it seemed that a good discussion point had to address three questions: What? Why? How? Depending on the question being asked, the whats, whys and hows were a bit different, but at the time (only having one exam and schedule to go with!) it seemed to make sense. At least, in my teaching that year with students, I felt that using this simple structure allowed me to teach and mark discussion points more confidently. You can see more details for this “discussion point” structure in the slides.

The last part of the workshop involved providing teachers with one of three statistical reports (all around the theme of coffee of course!) and asking them, in groups, to develop a formative assessment task. After identifying one or two key claims made in the report, they had to select three or four questions from previous years’ exams that would be relevant for questioning the report in front of them (relevant to the conclusions made in the report). We didn’t quite get this finished in the workshop – the goal was to create three formative assessment tasks that could be shared! However, perhaps some of the teachers who attended the workshop will go on to develop formative assessment tasks and email these to me to share at a later date. I do feel strongly that all teachers of statistics should feel confident to write their own formative or practice assessment tasks for whatever they are teaching – if you’re not sure about what understanding you are trying to assess and what questions to ask to assess that understanding, how can you feel confident about what to teach? I’m hoping to launch a project next term to help support statistics teachers to feel more confident with writing formative assessment tasks, so watch this space 🙂

Resources for workshop (via Google Drive)

Ideas for using technology to design and carry out experiments online

This post provides the notes for a workshop I ran at the Otago Mathematics Association (OMA) conference about using technology to design and carry out experiments online.

Actually, at the moment this post only provides a PDF of the slides I used for the workshop – I will update this post with more detail later this year 🙂 Links and documents referred to in the slides are at the bottom of this page.

Associated links/documents

Initial adventures in Stickland

stickland_adventure

This post provides the notes for a workshop I ran at the Otago Mathematics Association (OMA) Conference about using data challenges to encourage statistical thinking.

Until last week, I had never re-presented or adapted a workshop that I had developed in a previous year. So it was really interesting to take this workshop on data challenges, which I had presented at the AMA and CMA stats days last year, and work through it again with a new bunch of awesome teachers in Dunedin. I wrote notes about this workshop last year – Using data challenges to encourage statistical thinking – so this post will just share a few things I tweaked the second time around, including an activity we tried in Stickland 🙂

Some changes and additions

To show an example of a predictive model in action, we used one of a few online tools which attempt to predict your age using your name (based on US data), e.g. rhiever.github.io/name-age-calculator/index.html. I also demonstrated another online tool that attempts to predict your gender based on your writing (hackerfactor.com/GenderGuesser.php) by using my abstract for this workshop (it did correctly predict, based on the writing being formal, that it was written by a female). For the actual data challenge itself using the celebrity data, I purposefully removed Dr Dre from the training data set to make it easier to explore the data without worrying about how to handle his extremely high earnings for 2014 (new link here).

Testing Stickland

Another thing I changed about the workshop this time around was that rather than use physical data cards (these Census At School stick people data cards), we tried out my new digital data cards in the virtual world of Stickland. I’ve already shared a little bit about the ideas behind Stickland – see the Welcome to stickland! post – so what follows is an example of how we used Stickland in the workshop. (Just a quick reminder that the data cards represent real students from the NZ Census At School 2015 data, the names being the only variable that is not real).

data_card

The activity starts with the idea of wanting to predict whether a stick person chosen at random from Stickland uses Facebook or not. If you head to learning.statistics-is-awesome.org/stickland, the first thing you could do is select a sample of stick people and see what proportion of them use Facebook. I got the teachers in this workshop to select 20 stick people and then let them play with moving the data cards around in the grey screen below (click or touch the card to drag the card to somewhere else on the screen e.g. to sort the cards into Facebook users and non-Facebook users).

stick-layout

For the sample shown above, as many stick people are Facebook users as are not, but of course this will vary from sample to sample. I then told the teachers that the stick person we are trying to predict for is a Snapchat user, and asked them if this changes their prediction of whether that person is a Facebook user or not. One way to explore this is to create a two-way table with the cards (see below) and then reason with this.

stick-sort

Most of the different samples showed a similar story to the sample above: Of the Snapchat users, most were Facebook users and of the non-Snapchat users, most were non-Facebook users. I then suggested (if we had time) we could also explore whether knowing the gender and age of the stick person would help us build a better model for predicting Facebook usage. At this stage (considering multiple variables/factors) I would want the students to move into software that allows them to explore the data more deeply (more about how that is possible is discussed in the Welcome to stickland! post). We didn’t do this in the workshop and the teachers had to leave Stickland perhaps before they wanted to 🙂
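The two-way table reasoning described above can also be sketched in code. The following is a minimal Python illustration, not part of the workshop or of Stickland itself: the sample of 20 cards is invented for the example (it is not Census At School data), and the function names `two_way_table` and `facebook_proportion` are hypothetical helpers. It compares the overall proportion of Facebook users with the proportion among Snapchat users and among non-Snapchat users.

```python
from collections import Counter

# Hypothetical sample of 20 stick people as (uses_snapchat, uses_facebook).
# These values are made up for illustration; they are NOT real data cards.
sample = (
    [(True, True)] * 8      # Snapchat and Facebook
    + [(True, False)] * 2   # Snapchat only
    + [(False, True)] * 2   # Facebook only
    + [(False, False)] * 8  # neither
)


def two_way_table(cards):
    """Count the cards that fall in each (snapchat, facebook) cell."""
    return Counter(cards)


def facebook_proportion(cards, snapchat=None):
    """Proportion of Facebook users, optionally restricted to
    Snapchat users (snapchat=True) or non-users (snapchat=False)."""
    if snapchat is not None:
        cards = [card for card in cards if card[0] == snapchat]
    if not cards:
        raise ValueError("no cards match the condition")
    return sum(1 for _, facebook in cards if facebook) / len(cards)


table = two_way_table(sample)
overall = facebook_proportion(sample)
given_snapchat = facebook_proportion(sample, snapchat=True)
given_no_snapchat = facebook_proportion(sample, snapchat=False)

print("Two-way table (snapchat, facebook):", dict(table))
print(f"Facebook users overall:              {overall:.2f}")
print(f"Facebook users among Snapchat users: {given_snapchat:.2f}")
print(f"Facebook users among non-Snapchat:   {given_no_snapchat:.2f}")
```

With this invented sample, knowing only that someone lives in Stickland gives no reason to favour either prediction, while knowing their Snapchat status shifts the prediction noticeably. A different random sample would give different counts, which is the same sample-to-sample variation the teachers saw with their cards.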

Where to next?

Stickland is just in “proof of concept” form at the moment and will no doubt have lots of bugs and weird features. In the Welcome to stickland! post, I discuss the influence of others in developing these digital data cards, in particular Pip Arnold and her work with statistical investigations and data cards that stretches back to at least 2005 (if not earlier!). Feel free to have a play and to let me know what you think about the concept, but bear in mind that this is a possible project for 2017 and not intended to be a fully featured product yet.