part02-process-zoom-files.Rmd
In the second part of this guide, you will learn how to turn downloadable files from Zoom recordings into datasets that you can analyze in R
using zoomGroupStats
. Again, because zoomGroupStats
relies on Zoom Cloud recording features, this guide will focus specifically on the files that you download from the Zoom Cloud. However, the functions in zoomGroupStats
can also be used on meetings that you record locally.
This guide will progress from instructions on which files to download from Zoom to how to convert those files into usable datasets. In its most basic form, a research dataset using virtual meetings can be structured as individual meeting participants who are nested within virtual meetings. Individual participants can attend multiple virtual meetings; indeed, individual participants could attend multiple virtual meetings at the same time. Individuals can further be members of multiple other organizational units (e.g., groups, teams, departments, organizations). To build a basic, clean dataset, though, what you must know is which individual human being was logged into which virtual meeting.
Depending on your research objectives, you will need different output files from Zoom. The components that are currently supported and used in this package are:
zoomGroupStats
is designed to batch proces virtual meetings. Batch processing simply means that you are combining several meetings into a single dataset. Even if you are only analyzing a single meeting, though, it is useful to treat that single meeting as a batch of one. This will help in creating an extensible dataset that you could add to in the future.
To organize your data for a batch run, you must rename the files in a systematic way and save them in a single directory. To name the files, imagine that each file will have a “prefix” and a “suffix”. The prefix will be an identifier for what meeting it came from and the suffix will be an identifier for what data source is in the file. Whereas you can use whatever prefix you choose, the suffix must be standardized as follows:
Suffix | Description |
---|---|
_chat.txt | Used for the chat file for meetings |
_transcript.vtt | Used for the transcript file for meetings |
_participants.csv | Used for the participants file for meetings |
_video.mp4 | Used for a video file for meetings |
Imagine a sample batch run, then, which includes all four elements for three meetings. The prefixes for the meetings could be “meeting001”, “meeting002”, and “meeting003”. This would yield a total of 12 files, named as follows:
All 8 files should be saved in a single directory.
Next, you will prepare a spreadsheet in .xlsx format and save it within the same directory where your Zoom downloads are. This spreadsheet tells zoomGroupStats
how to find your files and what information to process. Practically, this spreadsheet is also helpful for organizing your data and keeping track of what you have downloaded.
You must use the following structure and column headers for your spreadsheet:
batchMeetingId | fileRoot | participants | transcript | chat | video | sessionStartTime | recordingStartDateTime |
---|---|---|---|---|---|---|---|
00000000001 | /myMeetings/meeting001 | 1 | 1 | 1 | 0 | 2020-09-04 15:00:00 | 2020-09-04 15:03:30 |
00000000002 | /myMeetings/meeting002 | 1 | 1 | 1 | 0 | 2020-09-05 15:00:00 | 2020-09-04 15:03:04 |
00000000003 | /myMeetings/meeting003 | 1 | 1 | 1 | 0 | 2020-09-06 15:00:00 | 2020-09-04 15:03:07 |
The first row in the file is the header, which should mirror the example above. Each subsequent row provides the information for a single meeting. Here is a sample that you could use, replacing any rows after the header with your own information.
Column Name | Description |
---|---|
batchMeetingId | A string identifier for this particular meeting |
fileRoot | A string that gives the full path and prefix where the files from this meeting can be found. The final part of this string (e.g., meeting001 above) is how you have named the files downloaded for this meeting. |
participants | Binary - 0 if you did not download the participants file, 1 if you did |
transcript | Binary - 0 if you did not download the transcript file, 1 if you did |
chat | Binary - 0 if you did not download the chat file, 1 if you did |
video | Binary - 0 if you did not download a video file, 1 if you did |
sessionStartDateTime | A string giving the timestamp for when the meeting began as YYYY-MM-DD HH:MM:SS |
recordingStartDateTime | A string giving the timestamp for when the recording of the meeting began as YYYY-MM-DD HH:MM:SS |
If you have followed the steps above, turning your Zoom sessions into data should be straightforward using the zoomGroupStats
package.You should begin by installing the latest version of zoomGroupStats
. Currently this is best done through the install_github
function from the devtools
package:
# Install zoomGroupStats from CRAN
install.packages("zoomGroupStats")
# Install the development version of zoomGroupStats
devtools::install_github("andrewpknight/zoomGroupStats")
After you have installed the package, you can then load it into your environment as usual:
The first step in turning downloads into data is to process your batch using the batchProcessZoomOutput
function.
batchOut = batchProcessZoomOutput(batchInput="./myMeetingsBatch.xlsx")
This function will iterate through the meetings listed in the batchInput
file that you created above. For each meeting, the function will detect any of the named files described above (except the video file) and do an initial processing of them. The function currently ignores the video file because video processing can take a considerable amount of time. Processing video data will be covered in Part 4.
The output, stored in this example in batchOut
, will be a multi-item list that contains any data that were available according to the batch instructions. Currently, the following items can be included, if raw files are available:
This is information about the batch, drawn from the input batch file that you have supplied. It is helpful to have this information stored for later analysis of the different components of a Zoom recording.
str(batchOut$batchInfo)
#> 'data.frame': 3 obs. of 13 variables:
#> $ batchMeetingId : chr "00000000001" "00000000002" "00000000003"
#> $ fileRoot : chr "meeting001" "meeting002" "meeting003"
#> $ participants : num 1 1 1
#> $ transcript : num 1 1 1
#> $ chat : num 1 1 1
#> $ video : num 1 0 1
#> $ sessionStartDateTime : chr "2020-09-04 15:00:00" "2020-09-05 15:00:00" "2020-09-06 15:00:00"
#> $ recordingStartDateTime: chr "2020-09-04 15:00:00" "2020-09-05 15:03:15" "2020-09-06 15:20:00"
#> $ participants_processed: num 1 1 1
#> $ transcript_processed : num 1 1 1
#> $ chat_processed : num 1 1 1
#> $ video_processed : num 0 0 0
#> $ dirRoot : chr "/private/var/folders/hg/rsmy7v891ts4zm81674_mq100000gr/T/RtmpQRVS3w/temp_libpath16772380d70cf/zoomGroupStats/extdata" "/private/var/folders/hg/rsmy7v891ts4zm81674_mq100000gr/T/RtmpQRVS3w/temp_libpath16772380d70cf/zoomGroupStats/extdata" "/private/var/folders/hg/rsmy7v891ts4zm81674_mq100000gr/T/RtmpQRVS3w/temp_libpath16772380d70cf/zoomGroupStats/extdata"
These are based on information extracted from the participants file downloaded from Zoom. If a meeting has a *_participants.csv file, these will be included. They provide meta-data about the full set of virtual meetings. These are useful files for nailing down the structure of your full dataset (in terms of individuals nested within meetings) and for creating a unique individual identifier for the people in your dataset.
meetInfo
is a data.frame that gives information pulled about the meetings from the participants file downloaded from Zoom. Each row in this file is a meeting:
str(batchOut$meetInfo)
#> 'data.frame': 3 obs. of 8 variables:
#> $ meetingId : chr "00000000001" "00000000002" "00000000003"
#> $ meetingTopic : chr "Meeting about Reunion" "Discuss Concert Tour" "Show at the Ed Sullivan"
#> $ meetingStartTime: chr "2020-09-24 11:35:16" "2020-09-27 11:35:16" "2020-09-30 11:35:16"
#> $ meetingEndTime : chr "2020-09-24 12:49:20" "2020-09-27 12:49:20" "2020-09-30 12:49:20"
#> $ userEmail : chr "knightap@wustl.edu" "knightap@wustl.edu" "knightap@wustl.edu"
#> $ meetingDuration : num 75 75 75
#> $ numParticipants : num 170 170 170
#> $ batchMeetingId : chr "00000000001" "00000000002" "00000000003"
partInfo
is a data.frame that gives the participants in each meeting and any information available in the participants file downloaded from Zoom
str(batchOut$partInfo)
#> 'data.frame': 30 obs. of 5 variables:
#> $ userName : chr "Andrew Knight" "George Harrison" "John Lennon" "Paul McCartney (Wings)" ...
#> $ userEmail : chr "knightap@wustl.edu" "george.harrison@gmail.com" "" "paul@gmail.com" ...
#> $ userDuration : int 75 72 70 73 53 60 58 54 57 57 ...
#> $ userGuest : logi NA NA NA NA NA NA ...
#> $ batchMeetingId: chr "00000000001" "00000000001" "00000000001" "00000000001" ...
These two items provide parsed text data from the transcribed audio spoken during the meeting and the text-based chat in the meeting. These files are the basis of any further text-based analysis. One important thing to note in processing transcripts is that Zoom timestamps the transcript file anchored on the start of the recording–not at the start of the session itself. The issue here is that a recording may not begin until a period of time after the launch of the session. This means that you could have a transcript file out of sync with the chat file, which begins its timestamp at the start of the session. Resolving this issue requires careful records of when the recording was started or setting up your session to automatically record.
transcript
is a data.frame that is the parsed audio transcript file. Each row represents a single marked “utterance” using Zoom’s cloud-based transcription algorithm. Utterances are marked as a function of pauses in speech and/or speaker changes.
str(batchOut$transcript)
#> 'data.frame': 900 obs. of 10 variables:
#> $ utteranceId : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ utteranceStartSeconds: num 4.86 9.54 15.63 19.98 24 ...
#> $ utteranceStartTime : POSIXct, format: "2020-09-04 15:00:04" "2020-09-04 15:00:09" ...
#> $ utteranceEndSeconds : num 7.41 12.99 17.76 21.24 26.07 ...
#> $ utteranceEndTime : POSIXct, format: "2020-09-04 15:00:07" "2020-09-04 15:00:12" ...
#> $ utteranceTimeWindow : num 2.55 3.45 2.13 1.26 2.07 ...
#> $ userName : chr "Andrew Knight" "Andrew Knight" "Andrew Knight" "Andrew Knight" ...
#> $ utteranceMessage : chr "Okay, so we're recording. We're streaming" "We have that setup. Okay." "It's like Mary Kate's here. I'll let her in." "Gonna make her a host to" ...
#> $ utteranceLanguage : chr "en" "en" "en" "en" ...
#> $ batchMeetingId : chr "00000000001" "00000000001" "00000000001" "00000000001" ...
chat
is a data.frame that is the parsed text-based chat file. Each row represents a single chat message submitted by a user. Non-ASCII characters will not be correctly rendered in the message text.
str(batchOut$chat)
#> 'data.frame': 30 obs. of 7 variables:
#> $ messageId : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ messageSeconds : num 1274 1295 1321 1476 1802 ...
#> $ messageTime : POSIXct, format: "2020-09-04 15:21:14" "2020-09-04 15:21:35" ...
#> $ userName : chr "Andrew Knight" "Andrew Knight" "Ringo Starr" "Andrew Knight" ...
#> $ message : chr "Hello Everyone - It’s great to see folks dropping into the session. We’ll get started here in just about 5 minutes." "Please don’t hesitate to use the chat to say hello to friends and colleagues" "Also drop in questions and comments throughout the discussion in that chat and I’ll respond as we go thanks!" "Wow, Joe - that looks like a nice spot for joining a webinar" ...
#> $ messageLanguage: chr "en" "en" "en" "en" ...
#> $ batchMeetingId : chr "00000000001" "00000000001" "00000000001" "00000000001" ...
The final element is called “rosetta” because it will help you deal with one of the most vexing challenges in analyzing virtual meetings with a large number of participants: The lack of a persistent and unique individual identifier for participants. Because the transcript and chat files rely on people’s Zoom display names–and because people can change their display names–you will frequently encounter duplicate names and/or multiple identifiers for the same person.
From a data structure perspective, what is most important to know is which individual person logged into which virtual meeting. Unfortunately, complications in how Zoom identifies individuals require careful human attention to this issue. To illustrate, consider the following exaggerated example. Arun Jah is an attendee in Meeting A. Arun logs in on his laptop through his corporate account. But, he also logs in on his iPad through his family account to the same meeting. After a few minutes, Arun realizes that his iPad account has his child’s name displayed (the child was using it for virtual school). So, he changes his display name. Halfway through the meeting, Arun needs to get in his car to go pick up his child. So, he logs into the meeting on his smartphone. In a raw Zoom dataset, the person “Arun” will show up as four different names–(1) his corporate name that was not changed; (2) his child’s name on his iPad; (3) the name he changed his iPad to; and, (4) his mobile phone. Clearly, to properly study human behavior, all of the actions associated with these four names should be attached to “Arun”.
To address this problem, the rosetta
file compiles every unique display name (by meeting) encountered across the participants
, chat
, and transcript
files.
str(batchOut$rosetta)
#> 'data.frame': 39 obs. of 3 variables:
#> $ userName : chr "Andrew Knight" "Clapton" "Eric Clapton" "Eric's iPad" ...
#> $ userEmail : chr "knightap@wustl.edu" "" "shredder@cream.org" "" ...
#> $ batchMeetingId: chr "00000000001" "00000000001" "00000000001" "00000000001" ...
I have found that by exporting the rosetta
file and manually attaching a unique individual identifier (e.g., number) is a necessary process to ensure that the right data are attached to the right people. In terms of process, here is what I do:
rosetta
file. You can do this in your initial batchProcessZoomOutput
call as follows:
batchOut = batchProcessZoomOutput(batchInput="./myMeetingsBatch.xlsx", exportZoomRosetta="./myMeetings_rosetta_original.xlsx")
Copy the file that you have exported and save it with a new name (e.g., myMeetings_rosetta_edited.xlsx). This will prevent you from losing your manual work if you re-run the command above and overwrite the file.
Add a new column to the new file that contains a unique user identifier. Name the column something like ‘indivId’. In this column you will enter a persistent individual identifier for individuals in your dataset.
Manually review the file, entering the unique identifier that is correct for each Zoom display name. I tend to sort the file by the Zoom display name column first. This increases the chances of catching the same person who shows up in multiple meetings or slightly different names (e.g., Ringo, Ringo Starr). For unique identifier, I recommend either using a master numeric identifier or email address that connects to other data streams that you might have.
Import the reviewed and edited rosetta file. The following command will import the file and add your new identifier to each of the elements that are included in batchOut
. Going forward, you should use the new individual identifier (e.g., indivId
) in your analyses. Together with the meeting identifier (e.g., batchMeetingId
), you will be able to link records to other relevant data that you have collected (e.g., survey data).
batchOutIds = importZoomRosetta(zoomOutput=batchOut, zoomRosetta="./myEditedRosetta.xlsx",
meetingId="batchMeetingId")
Running the importZoomRosetta command will attach the new unique identifier that you have created to the datasets that you created before for any of the available files in your directory.
Following the process above, you should have a single R object with a tremendous amount of information. Part 3 of this guide will cover how to analyze the conversational data in this R object.