Back in June, I gave a five-minute talk as part of the opening session of USCOTS – the U.S. Conference On Teaching Statistics. We were warned to practice our talks to make sure we keep to our time limit, which made me wonder how many words I could actually fit into a five-minute talk. Since my lectures are recorded, I had the idea of exploring the number of words I use when teaching by analysing my lecture recordings.

I was able to obtain automatically generated captions for each of my lectures via the YouTube Data API and an R package called {tuber}. In total, for my STATS 100 (Concepts In Statistics) course, this ended up being 33 lectures, or around 28 hours of “lecture talk”. However, I don’t actually talk for all of this time, since I try to make the lectures as interactive as possible :-).
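For anyone wanting to try this, here’s a rough sketch of what fetching the captions could look like with {tuber}. The credentials and video ID below are placeholders, and note that the YouTube Data API only lets you download caption tracks for videos your own account owns:

```r
# A sketch of fetching auto-generated captions with {tuber}
# (placeholder credentials and video ID; requires a Google API project)
library(tuber)

yt_oauth(app_id = "MY_APP_ID", app_secret = "MY_APP_SECRET")

video_id <- "abc123xyz"                               # placeholder lecture video ID
tracks   <- list_caption_tracks(video_id = video_id)  # one row per caption track

# Download the first caption track; get_captions() returns raw bytes by default
raw_captions <- get_captions(id = tracks$id[1])
caption_text <- rawToChar(raw_captions)
```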

The generated captions provide the start time and end time as well as the words spoken. Remember that the captions are automatically generated by YouTube’s algorithm, so they are not 100% accurate. Below are the first eight rows from the caption file for my first lecture of the semester.

The first eight lines of the caption file from my first lecture

You may have noticed me talking about “lockdown” and the use of positive words like “happy”, “amazing”, and “fantastic”. Noticing these features of my text planted the seed of an idea to explore my “teaching vocabulary” later in the exploration. Which is how EDA works, right? As you look at the data and start to work with it, more questions for the data emerge.

I was initially confused by the end_time variable/attribute. If you compare the end_time for one row with the start_time for the row below it, you can see that they often overlap. After some searching on the internet, I discovered that this is because the timings describe when the caption appears (start_time) and how long it stays visible (end_time) as the video is played – which was like, duh, that’s why they are called captions, Anna! I had forgotten that the data wasn’t created by a human transcribing my lectures just so I could do this exploration; it was created for another purpose.

Take a look at lines 6, 7 and 8 – interesting, right? At this point I went back and watched that part of my video to understand the data better: to watch the captions appear and match this behaviour back to my data.

A screenshot from the first lecture video, showing the captions from lines 2 and 3 (partial)

It then all made sense because, at most, two lines are shown for the captions. There’s this scrolling down movement, so the bottom line moves to the top when the next line appears on the bottom, which explains the overlap of end_time and start_time. (Go watch a YouTube video now and turn on the captions to see what I mean!)

But then I noticed that the words also appear individually on each line as I say them, but there’s no timing information for that in the subtitle data I had. So I went back to the YouTube Data API documentation and discovered that there are five different formats available for the captions. None of these formats contains information about when individual words are timed to appear, so I guess that’s just YouTube algorithmic “magic”.

Again, this is also what happens in data exploration, right? You need to constantly find and use contextual knowledge to make sense of your data, not that I even have “data” yet to visualise. For this kind of data – captions – I need to spend some time getting to know it in its current form so I can make a plan to create variables/attributes I can use for exploration. I can’t always define variables first before I’ve even spent time familiarising myself with the data source.

At this point, I had to spend some time thinking about what I wanted to find out, to guide how I try to extract meaning from the caption data. As I described earlier, even though my initial motivation was about how many words I could speak in five minutes, I also became interested in exploring what words I use when teaching, e.g. am I a “positive” teacher? Are there certain words I say a lot? Obviously, for the “speaking rate” focus I need to take time into account, but for the “teaching vocab” focus maybe I don’t – oh wait, except if I want to track something like whether my teaching language changes as the course progresses, I will need the lecture number.

Thinking about what questions I have for the data made me think about how I will need to merge together the caption data from across all 33 lectures. I needed to add another variable recording which lecture each caption belonged to, so I could compare between and across different lectures. I’ve done lots of merging like this before, so this kind of behaviour is almost automatic. Past experiences with exploring data help me to know what is possible for future explorations.
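When I get to that step, the merge could look something like this sketch, assuming each lecture’s captions have been saved as CSV files that sort in lecture order (the folder and file names here are made up):

```r
# A sketch of combining per-lecture caption files into one data set,
# adding a lecture number as we go (paths are hypothetical)
library(readr)
library(purrr)
library(dplyr)

caption_files <- list.files("captions", pattern = "\\.csv$", full.names = TRUE)

all_captions <- caption_files %>%
  set_names(seq_along(caption_files)) %>%   # assumes files sort in lecture order
  map_dfr(read_csv, .id = "lecture") %>%    # stack the files, keeping a lecture id
  mutate(lecture = as.integer(lecture))
```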

But before I scaled things up – merging all the caption files together, along with other data manipulations – I decided to try some things out first with just the first lecture: a small-scale exploration. I honestly didn’t know where to start with my original “how many words can I say in five minutes” motivation! So, to get started, I thought about just counting how many words are in the caption file and how long the recording was.

Here’s the last line in the caption file:

The last line of the caption file from my first lecture

Oh, that’s right! I definitely went over time in those first few lectures “online”. Lectures are supposed to be 50 minutes long, and this one was around 64 minutes. Hmmm… interesting, now I have another idea for what I could explore with this data! I could create a data set where each row is a different lecture, and one of the variables measures how many minutes passed between my first and last spoken words, as a proxy for the length of my lecture.
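As a sketch of that idea, assuming the merged all_captions data from earlier, with caption times in seconds:

```r
# One row per lecture: time between the first and last caption, in minutes,
# as a rough proxy for lecture length (assumes times are in seconds)
library(dplyr)

lecture_lengths <- all_captions %>%
  group_by(lecture) %>%
  summarise(length_mins = (max(end_time) - min(start_time)) / 60)
```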

But back to the word count. According to the caption file, I managed to get 59 710 words spoken in those 64 minutes! Wait, really? I immediately tried to relate that number to something personal to me – my PhD dissertation had a maximum of 100 000 words. So 59 710 words can’t possibly be right! Turns out it wasn’t – I was counting the number of characters/letters, not the number of words. When I corrected my code, the total number of words came out to be 12 073. Which then made me wonder about the length and complexity of the words I use when I lecture. Comparing 59 710 letters/characters to 12 073 words, it seems my words are on average just under five letters long, but I know a single summary statistic doesn’t reveal the distribution of word lengths.
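In miniature, the bug and the fix look something like this (with toy captions, not my real ones):

```r
# nchar() counts characters, which is what I accidentally summed first;
# counting whitespace-separated tokens gives the word count I wanted
library(stringr)

text <- c("hello everyone and welcome", "to the first lecture")

sum(nchar(text))              # characters (the first, wrong, count)
sum(str_count(text, "\\S+"))  # words (what I actually wanted)
```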

I don’t mean to overstate the point, but it is a really important one. Exploring data, data exploration, exploratory data analysis – any combination of the words “explore” and “data” – is not just: ask one question, get some data, make some plots, write something about the plots, in that order and that direction. Already in this post, you can see how I am shuttling between questions, data and context, using both statistical and computational thinking to support this journey of discovery. I hope you can also see that I am invested in this exploration because it’s about me and something that I’m really interested in. This helps to “keep me going” even though I haven’t visualised anything yet 🙂

Now let’s talk about data structures. A rectangular (or tidy) data set is one where each row is a different entity, each column is a different variable/attribute about that entity, and each cell is the value of that variable/attribute for that entity. From the caption data, I can create different data sets, not just one – it totally depends on what questions I am asking of the data. I’ll stick to the current focus on individual words and use the caption data to create a data set where each word is on its own row.

First 12 rows of a data set created based on individual words
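A sketch of how a data set like this could be created, using the {tidytext} package and assuming the caption data stores the spoken words in a text column (by default, unnest_tokens() also lowercases each word):

```r
# Split each caption into one word per row; the caption's other columns
# (like start_time) are carried along with each word
library(tidytext)
library(dplyr)

words_data <- all_captions %>%
  unnest_tokens(word, text)
```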

In this form, I can develop more variables, e.g. word length or the sentiment of each word – oh wait, we haven’t talked about sentiment analysis yet! I’ll come back to that later (Edit: Nope, not in this post!). How about a plot of word length?
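Something like this sketch could produce it, adding a word_length variable and plotting with {ggplot2}:

```r
# Count letters per word, then plot the distribution of word lengths
library(dplyr)
library(ggplot2)

words_data %>%
  mutate(word_length = nchar(word)) %>%
  ggplot(aes(x = word_length)) +
  geom_bar() +
  labs(x = "Number of letters in word", y = "Count of words")
```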

Plot of length of words spoken

Yes, my eyes focused on features that summarise the data (median 4 letters, positive skew, middle 50% between 3 and 5 letters), but I was also drawn to the 14- and 15-letter words, because it’s my data – what words were these? I filtered my data set to find words that were 14 letters or longer to produce the subset shown in the table below.
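In code, that filter might look something like this, reusing the one-word-per-row data from above:

```r
# Keep only the longest words (14+ letters), along with when they were said
library(dplyr)

words_data %>%
  mutate(word_length = nchar(word)) %>%
  filter(word_length >= 14) %>%
  select(start_time, word, word_length)
```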

Longest words spoken in the first lecture

Yeah, those are totally words I would have said, particularly the combination of “troubleshooting” and “computationally” with respect to learning to code 🙂 I wondered why I was talking about this only eight minutes into my lecture (my approach is to sneak in coding stuff, not start the lecture with it), but then I remembered that we had issues using one of the apps I had developed, so I was talking about ways to help students resolve them. It makes sense in this context to talk about these individual cases.

At this point in the exploration, I stopped and thought about where to go next. I have the data I need to answer my initial question (how many words can I say in five minutes?), right? I “know” the lecture was 64 minutes and 8 seconds long. I “know” that I said 12 073 words during this time. Using some maths, that’s around 188.3 words per minute, or 941.3 words per five minutes.
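Spelling out the maths:

```r
# Speaking-rate arithmetic: words per minute, then per five minutes
n_words <- 12073
mins    <- 64 + 8 / 60     # 64 minutes and 8 seconds

n_words / mins             # ~188.3 words per minute
5 * n_words / mins         # ~941.3 words per five minutes
```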

But I’ve made some HUGE assumptions with what I said I “know” as the basis of this calculation. For example, I know that I didn’t speak continuously for all 64 minutes of this lecture, so can I find a way to identify the “non speaking” times? Can the caption times help with this? I also know that the YouTube algorithm for creating the captions includes filler words such as “um” and “uh” – should these be included in my analysis? The captions may not have captured all the words I spoke, or may have added extra ones. And of course, I have only explored one of my lectures – the first one of the semester – and how I spoke in this lecture may be different from the others.
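One way the caption times might help with the “non speaking” question is to look for gaps between consecutive captions, something like this sketch. Remembering that end_time reflects how long a caption stays on screen, this would be a rough proxy at best, and the 2-second threshold is arbitrary:

```r
# Flag gaps where the next caption starts well after the current one ends;
# overlapping captions give negative gaps, which the filter drops
library(dplyr)

pauses <- all_captions %>%
  arrange(lecture, start_time) %>%
  group_by(lecture) %>%
  mutate(gap = lead(start_time) - end_time) %>%
  filter(gap > 2)
```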

And there’s just so much else I could explore. Can I find a way to measure how “interactive” my lectures are, e.g. by using “silence” time (assuming that when I am not talking, students are doing something!)? What are the key words I use most often? Do I keep saying the same annoying phrase to my students? How positive are my lectures in terms of the words I use? Do I tend to speak at a similar rate of words per minute, or do I speed up and slow down sometimes, e.g. at the end of the lecture when I realise I am running out of time?

I’ll finish this post by showing you the top two “words” I spoke during the first lecture, after removing stop words like “a”, “the”, “of”, etc., because they seem kind of appropriate for this post: Um ……. data!
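A sketch of that final step, using the stop_words table bundled with {tidytext}:

```r
# Remove common stop words ("a", "the", "of", ...) then count what's left
library(tidytext)
library(dplyr)

words_data %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 2)
```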

Top two words used in the first lecture

Did you like this partial example of an exploration with data? I’m planning on sharing more! It’s a core part of my teaching preparation – to keep doing explorations myself, sharing these with students, and then supporting them to do their own. It’s really hard to teach something if you haven’t experienced it for yourself, and each time I work through a data exploration it reminds me just how many things you have to consider along the way. I didn’t end up including all the ideas I had for exploring this data but hopefully I provided enough to get you thinking. You can read more about my current focus on explorations in variations here.