Thinking about what it means to explore data, and how to teach students to explore data, has been a passion of mine ever since I started exploring teaching with a wider range of data. It started back in 2015¹, when I worked on rewriting a set of lectures for our very large introductory statistics course at the University of Auckland for a new chapter called “Exploring data”. A paper I wrote and presented at the International Conference on Teaching Statistics (ICOTS) describes one example from this data exploration chapter involving social media data. I tell the story of using Instagram data to learn more about people who visited the Eiffel Tower, motivated by my observation of people taking photographs at the Eiffel Tower and wondering how similar these photos were.

An important aspect of how we currently teach data exploration in our intro stats course is what we focus on: different sources of data, how features of data can be visualised and summarised, how other sources of data can be used and combined, how new variables can be developed and how questions – many, many questions – drive the exploration of data. We do not provide a formal structure to the explorations for students. Instead, we provide examples of different explorations that start with questions like: Are the Olympic Games a “game for all ages”? Crucially, we do not teach or assess sample-to-population ideas for this chapter. Instead, we focus on welcoming students into the exciting world of data, a world of creativity, discoveries and possibilities. We try to get the point across that our data exploration is guided and informed by a range of knowledge beyond the statistical, including personal, data-contextual, cultural, social, ethical, computational, and so much more.

Or at least, that is the vision! Of course, to teach these things is harder than writing them down in a paper or describing them in this post. But one thing that has helped so much in my teaching of data explorations is that I do, and continue to do, data explorations of my own. This helps me really experience and reflect on what it means to explore data, and also means I can share these experiences with students. Two paragraphs from the R for data science book written by Hadley Wickham and Garrett Grolemund nicely describe my feelings when I am exploring data, or doing exploratory data analysis (EDA):

EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others … EDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data—and develop a set of thought-provoking questions—if you follow up each question with a new question based on what you find.
Wickham, H., & Grolemund, G. (2016). R for data science: import, tidy, transform, visualize, and model data.

When I first read this chapter from the R for data science book back in 2018, I made the following tweet:

But now that I have been really teaching data exploration for the last few years, I don’t think high school statistics teachers – or statistics teachers at any level – should be fearful! That’s not the same as me saying that teaching exploratory data analysis is easy. But I believe the initial fears that there will not be enough structure for students will be quickly resolved if teachers are really given an opportunity to explore data with their students and see the kinds of awesome ways of reasoning and thinking with data that are possible for all students with a careful balance of structure and variation (which is what statistics is all about!).

First, a point of clarification. At the moment in our senior school levels I believe we have an impoverished view of EDA. We have limited and confined it to merely something we do on the way to “making a call” based on box plots, constructing a confidence interval or conducting a randomisation test. That is, we are not using it for exploration in itself, but to “check” data as part of a focused investigation involving a pre-planned form of analysis. We should, of course, always look at data when we “make calls” or similar. But let’s look again at one sentence from the earlier paragraphs from Wickham and Grolemund: “EDA is fundamentally a creative process.” Exploratory data analysis can be, and should be, so much more. For example, think of what I could learn about you if I explored your last two years of work emails. Or your last month of credit card transactions. Exploration does not need to be constrained by sample versus population ideas, important as they are. We can learn so much from immersing ourselves in a rich multivariate data set and following the data to make new discoveries or to develop fuzzy ideas that could be made clearer using appropriate modelling and visualisation approaches.

Second, teaching exploratory data analysis is hard! It’s three things combined – teaching how to explore, teaching about different forms of data, and teaching different ways to analyse these data – across and within multiple contexts that need integration, and using tools that support effective data visualisation. So you have to think carefully about what it is about doing EDA that you value the most for student learning. You can’t teach all the things within any one learning task or within one year/semester. For my teaching, I care the most about providing learning opportunities where students can follow their curiosity to learn from data through visualisations and interpretations within personally relevant contexts. I want them to be both creative and skeptical about how and what they learn. I’m willing and prepared to put in the mahi to help my students learn these skills, by getting them to explore and write frequently and by reading their work and providing personal feedback.

Fortunately, there is some great work happening in the adjacent and complementary fields of data science education and data journalism. To pick just one example from many awesome people working in this area, Sara Stoudt and Deb Nolan have written a book called Communicating with data: The art of writing for data science. They also provided a wonderful poster and video for this year’s USCOTS conference with some practical ideas for teaching. Sara is giving a talk at the upcoming IASE satellite conference on Captions: The unsung heroes of data communication, which also promises to provide pedagogical approaches for effective data visualisation.

I am also going to try to help by starting to share more examples of explorations in variation through this blog. These will be a mix of data explorations I have done just for fun and personal learning, data explorations I have used with students, explorations shared by other data scientists and other ideas for explorations I come across.

I’ve recently shared the approach of Exploring data landscapes, which is one very important part of how I teach and assess data exploration. Last weekend I presented a workshop where I took one of these data landscapes (food prices) and went through how this data landscape could be used to explore time series data, categorical data developed from visual assessments of photos, and recipe data (a mix of numeric and categorical variables) created by crowdsourcing data collection within a class. Below is the recording from this workshop and here is a link to the slides I used for the talk.

¹ Actually, that’s not entirely true. I wrote this document about exploratory data analysis to support the New Zealand curriculum back in 2010, with much help and support from Maxine Pfannkuch and Pip Arnold. It’s just that I didn’t get to teach EDA in a serious way until I moved to the university level.