Part 2: Turning Zoom Downloads into Datasets

In the second part of this guide, you will learn how to turn downloadable files from Zoom recordings into datasets that you can analyze in R using zoomGroupStats. Again, because zoomGroupStats relies on Zoom Cloud recording features, this guide will focus specifically on the files that you download from the Zoom Cloud. However, the functions in zoomGroupStats can also be used on meetings that you record locally.

This guide will progress from instructions on which files to download from Zoom to how to convert those files into usable datasets. In its most basic form, a research dataset using virtual meetings can be structured as individual meeting participants who are nested within virtual meetings. Individual participants can attend multiple virtual meetings; indeed, individual participants could attend multiple virtual meetings at the same time. Individuals can further be members of multiple other organizational units (e.g., groups, teams, departments, organizations). To build a basic, clean dataset, though, what you must know is which individual human being was logged into which virtual meeting.

How to Download Files from Zoom

Depending on your research objectives, you will need different output files from Zoom. The components that are currently supported and used in this package are:

A participants file. This is a .csv file that contains meta-data about your meeting. This is an essential file that should be downloaded for each meeting that you wish to process. To download it:
- Log into your Zoom account through a web browser
- Navigate to the Reports
- Click Usage Reports
- Scroll to the Participants column for your focal meeting
- Click the linked number of participants
- Check “export with meeting data” and “show unique users”
- Click Export
A transcript file. This is a .vtt file that contains Zoom’s cloud transcription of the spoken audio during the recorded meeting.
- Navigate to the Recordings page
- Click the link indicating the number of items
- Download the Audio Transcript file
A chat file. This is a.txt file that contains the record of public chat messages posted during the recorded meeting. Download it from the same page where you accessed the transcript.
An audio file. This is an .mp4 file that contains the recorded audio during the session. Download it from the same age as the items above.
Several different video file options. Zoom’s Help Center describes The nature of these different formats. For analyzing facial expressions, one of the most useful is the gallery style video.

Naming the Downloaded Files from Zoom

zoomGroupStats is designed to batch proces virtual meetings. Batch processing simply means that you are combining several meetings into a single dataset. Even if you are only analyzing a single meeting, though, it is useful to treat that single meeting as a batch of one. This will help in creating an extensible dataset that you could add to in the future.

To organize your data for a batch run, you must rename the files in a systematic way and save them in a single directory. To name the files, imagine that each file will have a “prefix” and a “suffix”. The prefix will be an identifier for what meeting it came from and the suffix will be an identifier for what data source is in the file. Whereas you can use whatever prefix you choose, the suffix must be standardized as follows:

Suffix	Description
_chat.txt	Used for the chat file for meetings
_transcript.vtt	Used for the transcript file for meetings
_participants.csv	Used for the participants file for meetings
_video.mp4	Used for a video file for meetings

Imagine a sample batch run, then, which includes all four elements for three meetings. The prefixes for the meetings could be “meeting001”, “meeting002”, and “meeting003”. This would yield a total of 12 files, named as follows:

meeting001_chat.txt
meeting001_transcript.vtt
meeting001_participants.csv
meeting001_video.mp4
meeting002_chat.txt
meeting002_transcript.vtt
meeting002_participants.csv
meeting002_video.mp4
meeting003_chat.txt
meeting003_transcript.vtt
meeting003_participants.csv
meeting003_video.mp4

All 8 files should be saved in a single directory.

Prepare a Batch Spreadsheet

Next, you will prepare a spreadsheet in .xlsx format and save it within the same directory where your Zoom downloads are. This spreadsheet tells zoomGroupStats how to find your files and what information to process. Practically, this spreadsheet is also helpful for organizing your data and keeping track of what you have downloaded.

You must use the following structure and column headers for your spreadsheet:

batchMeetingId	fileRoot	participants	transcript	chat	sessionStartTime	recordingStartDateTime
00000000001	/myMeetings/meeting001	1	1	1	2020-09-04 15:00:00	2020-09-04 15:03:30
00000000002	/myMeetings/meeting002	1	1	1	2020-09-05 15:00:00	2020-09-04 15:03:04
00000000003	/myMeetings/meeting003	1	1	1	2020-09-06 15:00:00	2020-09-04 15:03:07

The first row in the file is the header, which should mirror the example above. Each subsequent row provides the information for a single meeting. Here is a sample that you could use, replacing any rows after the header with your own information.

Column Name	Description
batchMeetingId	A string identifier for this particular meeting
fileRoot	A string that gives the full path and prefix where the files from this meeting can be found. The final part of this string (e.g., meeting001 above) is how you have named the files downloaded for this meeting.
participants	Binary - 0 if you did not download the participants file, 1 if you did
transcript	Binary - 0 if you did not download the transcript file, 1 if you did
chat	Binary - 0 if you did not download the chat file, 1 if you did
video	Binary - 0 if you did not download a video file, 1 if you did
sessionStartDateTime	A string giving the timestamp for when the meeting began as YYYY-MM-DD HH:MM:SS
recordingStartDateTime	A string giving the timestamp for when the recording of the meeting began as YYYY-MM-DD HH:MM:SS

Process your Batch

If you have followed the steps above, turning your Zoom sessions into data should be straightforward using the zoomGroupStats package.You should begin by installing the latest version of zoomGroupStats. Currently this is best done through the install_github function from the devtools package:

# Install zoomGroupStats from CRAN
install.packages("zoomGroupStats")

# Install the development version of zoomGroupStats
devtools::install_github("andrewpknight/zoomGroupStats")

After you have installed the package, you can then load it into your environment as usual:

library(zoomGroupStats)

The first step in turning downloads into data is to process your batch using the batchProcessZoomOutput function.

batchOut = batchProcessZoomOutput(batchInput="./myMeetingsBatch.xlsx")

This function will iterate through the meetings listed in the batchInput file that you created above. For each meeting, the function will detect any of the named files described above (except the video file) and do an initial processing of them. The function currently ignores the video file because video processing can take a considerable amount of time. Processing video data will be covered in Part 4.

The output, stored in this example in batchOut, will be a multi-item list that contains any data that were available according to the batch instructions. Currently, the following items can be included, if raw files are available:

batchInfo

This is information about the batch, drawn from the input batch file that you have supplied. It is helpful to have this information stored for later analysis of the different components of a Zoom recording.

str(batchOut$batchInfo)
#> 'data.frame':    3 obs. of  13 variables:
#>  $ batchMeetingId        : chr  "00000000001" "00000000002" "00000000003"
#>  $ fileRoot              : chr  "meeting001" "meeting002" "meeting003"
#>  $ participants          : num  1 1 1
#>  $ transcript            : num  1 1 1
#>  $ chat                  : num  1 1 1
#>  $ video                 : num  1 0 1
#>  $ sessionStartDateTime  : chr  "2020-09-04 15:00:00" "2020-09-05 15:00:00" "2020-09-06 15:00:00"
#>  $ recordingStartDateTime: chr  "2020-09-04 15:00:00" "2020-09-05 15:03:15" "2020-09-06 15:20:00"
#>  $ participants_processed: num  1 1 1
#>  $ transcript_processed  : num  1 1 1
#>  $ chat_processed        : num  1 1 1
#>  $ video_processed       : num  0 0 0
#>  $ dirRoot               : chr  "/private/var/folders/hg/rsmy7v891ts4zm81674_mq100000gr/T/RtmpQRVS3w/temp_libpath16772380d70cf/zoomGroupStats/extdata" "/private/var/folders/hg/rsmy7v891ts4zm81674_mq100000gr/T/RtmpQRVS3w/temp_libpath16772380d70cf/zoomGroupStats/extdata" "/private/var/folders/hg/rsmy7v891ts4zm81674_mq100000gr/T/RtmpQRVS3w/temp_libpath16772380d70cf/zoomGroupStats/extdata"

meetInfo & partInfo

These are based on information extracted from the participants file downloaded from Zoom. If a meeting has a *_participants.csv file, these will be included. They provide meta-data about the full set of virtual meetings. These are useful files for nailing down the structure of your full dataset (in terms of individuals nested within meetings) and for creating a unique individual identifier for the people in your dataset.

meetInfo is a data.frame that gives information pulled about the meetings from the participants file downloaded from Zoom. Each row in this file is a meeting:

str(batchOut$meetInfo)
#> 'data.frame':    3 obs. of  8 variables:
#>  $ meetingId       : chr  "00000000001" "00000000002" "00000000003"
#>  $ meetingTopic    : chr  "Meeting about Reunion" "Discuss Concert Tour" "Show at the Ed Sullivan"
#>  $ meetingStartTime: chr  "2020-09-24 11:35:16" "2020-09-27 11:35:16" "2020-09-30 11:35:16"
#>  $ meetingEndTime  : chr  "2020-09-24 12:49:20" "2020-09-27 12:49:20" "2020-09-30 12:49:20"
#>  $ userEmail       : chr  "knightap@wustl.edu" "knightap@wustl.edu" "knightap@wustl.edu"
#>  $ meetingDuration : num  75 75 75
#>  $ numParticipants : num  170 170 170
#>  $ batchMeetingId  : chr  "00000000001" "00000000002" "00000000003"

partInfo is a data.frame that gives the participants in each meeting and any information available in the participants file downloaded from Zoom

str(batchOut$partInfo)
#> 'data.frame':    30 obs. of  5 variables:
#>  $ userName      : chr  "Andrew Knight" "George Harrison" "John Lennon" "Paul McCartney (Wings)" ...
#>  $ userEmail     : chr  "knightap@wustl.edu" "george.harrison@gmail.com" "" "paul@gmail.com" ...
#>  $ userDuration  : int  75 72 70 73 53 60 58 54 57 57 ...
#>  $ userGuest     : logi  NA NA NA NA NA NA ...
#>  $ batchMeetingId: chr  "00000000001" "00000000001" "00000000001" "00000000001" ...

transcript & chat

These two items provide parsed text data from the transcribed audio spoken during the meeting and the text-based chat in the meeting. These files are the basis of any further text-based analysis. One important thing to note in processing transcripts is that Zoom timestamps the transcript file anchored on the start of the recording–not at the start of the session itself. The issue here is that a recording may not begin until a period of time after the launch of the session. This means that you could have a transcript file out of sync with the chat file, which begins its timestamp at the start of the session. Resolving this issue requires careful records of when the recording was started or setting up your session to automatically record.

transcript is a data.frame that is the parsed audio transcript file. Each row represents a single marked “utterance” using Zoom’s cloud-based transcription algorithm. Utterances are marked as a function of pauses in speech and/or speaker changes.

str(batchOut$transcript)
#> 'data.frame':    900 obs. of  10 variables:
#>  $ utteranceId          : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ utteranceStartSeconds: num  4.86 9.54 15.63 19.98 24 ...
#>  $ utteranceStartTime   : POSIXct, format: "2020-09-04 15:00:04" "2020-09-04 15:00:09" ...
#>  $ utteranceEndSeconds  : num  7.41 12.99 17.76 21.24 26.07 ...
#>  $ utteranceEndTime     : POSIXct, format: "2020-09-04 15:00:07" "2020-09-04 15:00:12" ...
#>  $ utteranceTimeWindow  : num  2.55 3.45 2.13 1.26 2.07 ...
#>  $ userName             : chr  "Andrew Knight" "Andrew Knight" "Andrew Knight" "Andrew Knight" ...
#>  $ utteranceMessage     : chr  "Okay, so we're recording. We're streaming" "We have that setup. Okay." "It's like Mary Kate's here. I'll let her in." "Gonna make her a host to" ...
#>  $ utteranceLanguage    : chr  "en" "en" "en" "en" ...
#>  $ batchMeetingId       : chr  "00000000001" "00000000001" "00000000001" "00000000001" ...

chat is a data.frame that is the parsed text-based chat file. Each row represents a single chat message submitted by a user. Non-ASCII characters will not be correctly rendered in the message text.

str(batchOut$chat)
#> 'data.frame':    30 obs. of  7 variables:
#>  $ messageId      : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ messageSeconds : num  1274 1295 1321 1476 1802 ...
#>  $ messageTime    : POSIXct, format: "2020-09-04 15:21:14" "2020-09-04 15:21:35" ...
#>  $ userName       : chr  "Andrew Knight" "Andrew Knight" "Ringo Starr" "Andrew Knight" ...
#>  $ message        : chr  "Hello Everyone - It’s great to see folks dropping into the session. We’ll get started here in just about 5 minutes." "Please don’t hesitate to use the chat to say hello to friends and colleagues" "Also drop in questions and comments throughout the discussion in that chat and I’ll respond as we go thanks!" "Wow, Joe - that looks like a nice spot for joining a webinar" ...
#>  $ messageLanguage: chr  "en" "en" "en" "en" ...
#>  $ batchMeetingId : chr  "00000000001" "00000000001" "00000000001" "00000000001" ...

rosetta

The final element is called “rosetta” because it will help you deal with one of the most vexing challenges in analyzing virtual meetings with a large number of participants: The lack of a persistent and unique individual identifier for participants. Because the transcript and chat files rely on people’s Zoom display names–and because people can change their display names–you will frequently encounter duplicate names and/or multiple identifiers for the same person.

From a data structure perspective, what is most important to know is which individual person logged into which virtual meeting. Unfortunately, complications in how Zoom identifies individuals require careful human attention to this issue. To illustrate, consider the following exaggerated example. Arun Jah is an attendee in Meeting A. Arun logs in on his laptop through his corporate account. But, he also logs in on his iPad through his family account to the same meeting. After a few minutes, Arun realizes that his iPad account has his child’s name displayed (the child was using it for virtual school). So, he changes his display name. Halfway through the meeting, Arun needs to get in his car to go pick up his child. So, he logs into the meeting on his smartphone. In a raw Zoom dataset, the person “Arun” will show up as four different names–(1) his corporate name that was not changed; (2) his child’s name on his iPad; (3) the name he changed his iPad to; and, (4) his mobile phone. Clearly, to properly study human behavior, all of the actions associated with these four names should be attached to “Arun”.

To address this problem, the rosetta file compiles every unique display name (by meeting) encountered across the participants, chat, and transcript files.

str(batchOut$rosetta)
#> 'data.frame':    39 obs. of  3 variables:
#>  $ userName      : chr  "Andrew Knight" "Clapton" "Eric Clapton" "Eric's iPad" ...
#>  $ userEmail     : chr  "knightap@wustl.edu" "" "shredder@cream.org" "" ...
#>  $ batchMeetingId: chr  "00000000001" "00000000001" "00000000001" "00000000001" ...

Add a Unique Individual Identifier to All Elements

I have found that by exporting the rosetta file and manually attaching a unique individual identifier (e.g., number) is a necessary process to ensure that the right data are attached to the right people. In terms of process, here is what I do:

Export the rosetta file. You can do this in your initial batchProcessZoomOutput call as follows:

batchOut = batchProcessZoomOutput(batchInput="./myMeetingsBatch.xlsx", exportZoomRosetta="./myMeetings_rosetta_original.xlsx")

Copy the file that you have exported and save it with a new name (e.g., myMeetings_rosetta_edited.xlsx). This will prevent you from losing your manual work if you re-run the command above and overwrite the file.
Add a new column to the new file that contains a unique user identifier. Name the column something like ‘indivId’. In this column you will enter a persistent individual identifier for individuals in your dataset.
Manually review the file, entering the unique identifier that is correct for each Zoom display name. I tend to sort the file by the Zoom display name column first. This increases the chances of catching the same person who shows up in multiple meetings or slightly different names (e.g., Ringo, Ringo Starr). For unique identifier, I recommend either using a master numeric identifier or email address that connects to other data streams that you might have.
Import the reviewed and edited rosetta file. The following command will import the file and add your new identifier to each of the elements that are included in batchOut. Going forward, you should use the new individual identifier (e.g., indivId) in your analyses. Together with the meeting identifier (e.g., batchMeetingId), you will be able to link records to other relevant data that you have collected (e.g., survey data).

batchOutIds = importZoomRosetta(zoomOutput=batchOut, zoomRosetta="./myEditedRosetta.xlsx", 
meetingId="batchMeetingId")

Running the importZoomRosetta command will attach the new unique identifier that you have created to the datasets that you created before for any of the available files in your directory.

Next Steps

Following the process above, you should have a single R object with a tremendous amount of information. Part 3 of this guide will cover how to analyze the conversational data in this R object.

Andrew P. Knight

2021-05-11