A Data Analysis of My Discord Messages

In recent years, I have noticed a rise in what is best described as per-user activity reports in the style of the annual "Spotify Wrapped." I usually enjoy these: it's fascinating to see my stats for an app, and fun to compare them with my friends' results.

This year, the multimedia instant messaging and video calling platform Discord created its own Wrapped-style reports, called "Discord Checkpoints". I use the app frequently, so I was interested to see my results. However, when I went to view my report, this screen appeared:

The Discord logo above the words "Checkpoint Unavailable."

Because I had some advertising trackers turned off in the app's settings, Discord said it could not show me the info. So, during winter break, I decided to make my own analysis of my Discord usage instead. However, rather than constraining myself to a single year, I used all of my Discord data, totaling nearly 7 years, from January 2019 to mid-November 2025 (around a month before I started most of my work on the analysis).

Due to privacy laws, Discord legally has to send you all of the data it has gathered on you when you request it. I could not definitively figure out which law requires this, but from some research it seems that the data request system most likely exists to comply with the European Union's General Data Protection Regulation (GDPR), which gives people the right to make subject access requests (SARs).

Shortly after requesting my data, I got an email with a large .zip file containing several subfolders. These included an "Account" folder with data such as my email, IP address, and account settings; a "Messages" folder with a JSON file for each channel I had ever typed in, each listing the messages I sent; and a folder containing info on "Reporting" and "Trust and Safety," which totaled 1.8 gigabytes across two text files holding nearly 2 billion characters of data.
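A minimal sketch of how a "Messages" folder like this could be read into Python. The folder layout (one subfolder per channel, each holding a `messages.json`) and the field names `Timestamp`, `Contents`, and `Attachments` are assumptions for illustration and may not match a real export; `load_messages` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path


def load_messages(root):
    """Collect message dicts from every messages.json one level under root.

    Assumed (hypothetical) layout: <root>/<channel folder>/messages.json,
    where each file holds a JSON list of message objects.
    """
    messages = []
    for path in sorted(Path(root).glob("*/messages.json")):
        with open(path, encoding="utf-8") as f:
            for msg in json.load(f):
                # Remember which channel folder the message came from.
                msg["channel"] = path.parent.name
                messages.append(msg)
    return messages


# Demo with a tiny made-up export so the sketch runs without real data.
with tempfile.TemporaryDirectory() as tmp:
    channel = Path(tmp) / "c123"
    channel.mkdir()
    sample = [
        {"Timestamp": "2020-03-14 18:30:00", "Contents": "hello", "Attachments": ""},
        {"Timestamp": "2020-03-14 18:31:00", "Contents": "", "Attachments": "cat.png"},
    ]
    (channel / "messages.json").write_text(json.dumps(sample), encoding="utf-8")
    demo = load_messages(tmp)

print(len(demo), demo[0]["channel"])
```

Flattening everything into one list of dicts keeps the later counting and grouping steps independent of how the files are organized on disk.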

I processed this data using Python in Google Colab alongside numerous libraries, including Pandas and NumPy for data analysis, regular expressions for processing text data, and Matplotlib for visualizations.

After processing the data, the most straightforward figure to get was the number of messages I have sent:

102,556 total messages

If I go with a conservative estimate of about 15 seconds per message, sending them all would have taken around 427 hours, and that does not account for time spent reading others' messages. That's roughly equal to driving coast-to-coast across the US ten times without stopping, or watching the film The Neverending Story 272 times.
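The arithmetic behind that estimate can be checked in a few lines. The 94-minute film runtime is an assumption (it is the value that reproduces the 272 figure); the other numbers come from the text.

```python
TOTAL_MESSAGES = 102_556     # total messages from the analysis
SECONDS_PER_MESSAGE = 15     # the conservative per-message estimate
FILM_MINUTES = 94            # assumed runtime of the film

total_hours = TOTAL_MESSAGES * SECONDS_PER_MESSAGE / 3600
film_viewings = total_hours * 60 / FILM_MINUTES

print(f"{total_hours:.1f} hours")           # 427.3 hours
print(f"{int(film_viewings)} full viewings")  # 272 full viewings
```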

Of these messages, 9,574 had attachments such as images or files, leaving 92,982 without.

Those messages were written using:

3,844,800 total characters of text

That's roughly equal to 1.2 Bibles' worth! If everything I typed were printed in a single line of 12pt Courier New (a very common monospaced font), it would stretch for just over 6 miles, which is the length of 106.8 football fields.
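A quick check of the length estimate. It assumes that 12pt Courier New prints 10 characters per inch and that a football field is 100 yards long (end zones excluded); both are assumptions chosen because they reproduce the figures in the text.

```python
TOTAL_CHARACTERS = 3_844_800
CHARS_PER_INCH = 10          # assumption: 12pt Courier New at 10 characters per inch
INCHES_PER_MILE = 63_360
FIELD_INCHES = 100 * 36      # assumption: 100-yard field, end zones excluded

inches = TOTAL_CHARACTERS / CHARS_PER_INCH
miles = inches / INCHES_PER_MILE
fields = inches / FIELD_INCHES

print(f"{miles:.2f} miles")    # 6.07 miles
print(f"{fields:.1f} fields")  # 106.8 fields
```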

Using my extracted data, I can also analyze things like how many times I have used each letter:

Letter Usage Amounts

This does not show much on its own, but it can also be compared to another data source to show how my letter usage differs from the average English document:

Letter Frequency Comparison

Control data is from Cryptological Mathematics by Robert Edward Lewand
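The letter counts and the relative frequencies used for a comparison like this could be computed roughly as follows. This is a sketch using only the standard library; `letter_counts` and `letter_frequencies` are hypothetical helper names, and the sample messages are made up.

```python
from collections import Counter
import string


def letter_counts(texts):
    """Count occurrences of a-z across all texts, ignoring case."""
    counts = Counter()
    for text in texts:
        counts.update(ch for ch in text.lower() if ch in string.ascii_lowercase)
    return counts


def letter_frequencies(counts):
    """Convert raw counts into each letter's share of all counted letters."""
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.items()}


sample = ["Hello there", "see you later"]  # made-up messages
counts = letter_counts(sample)
freqs = letter_frequencies(counts)

print(counts["e"], round(freqs["e"], 3))
```

Working with shares rather than raw counts is what makes the result comparable to a reference table built from a different amount of text.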

The data includes a timestamp for each message. By using this, I can compare my messages over time:

Messages by Month

As you can probably see, I (unsurprisingly) started using Discord much more when the Covid-19 shutdown began.
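Counting messages per month could look something like the sketch below. The timestamp format is an assumption and would need adjusting to match the real data; `messages_per_month` is a hypothetical helper.

```python
from collections import Counter
from datetime import datetime


def messages_per_month(timestamps):
    """Count messages per calendar month, keyed as 'YYYY-MM'."""
    counts = Counter()
    for ts in timestamps:
        # Assumed timestamp format; adjust the pattern to match the real data.
        sent = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        counts[sent.strftime("%Y-%m")] += 1
    # Sort by key so the months come out in chronological order.
    return dict(sorted(counts.items()))


sample = ["2020-03-14 18:30:00", "2020-03-20 09:05:00", "2020-04-01 23:59:59"]
monthly = messages_per_month(sample)
print(monthly)  # {'2020-03': 2, '2020-04': 1}
```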

I can instead group messages by the hours they were sent (using EDT) to see when I use Discord the most:

Messages by Hour

Or I can split this further by year. As you can see, I did not use Discord much during 2019 so the numbers are a bit strange.

Messages by Hour, Split by Year
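A sketch of the hour-of-day grouping, including the per-year split. It assumes the timestamps are stored in UTC in the format shown and converts them with a fixed UTC-4 offset to match the EDT choice above (so it ignores the seasonal switch to EST); `hourly_counts_by_year` is a hypothetical helper.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EDT = timezone(timedelta(hours=-4), "EDT")  # fixed UTC-4 offset, no daylight-saving switch


def hourly_counts_by_year(utc_timestamps):
    """Return {year: [messages sent in hour 0..23]} using EDT clock time."""
    table = defaultdict(lambda: [0] * 24)
    for ts in utc_timestamps:
        # Assumption: timestamps are UTC strings in this format.
        sent = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
        local = sent.astimezone(EDT)
        table[local.year][local.hour] += 1
    return dict(table)


sample = ["2020-03-14 02:30:00", "2020-03-14 22:10:00", "2021-01-01 02:45:00"]
by_year = hourly_counts_by_year(sample)

# Summing across years gives the overall messages-by-hour totals.
overall = [sum(hours[h] for hours in by_year.values()) for h in range(24)]
print(by_year[2020][22], overall[18])  # 2 1
```

Note that the last sample timestamp falls on New Year's Day in UTC but on December 31, 2020 in EDT, so converting before grouping affects the year as well as the hour.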

Using grouping methods similar to those I used to show message frequency, I can also analyze the length of my messages in characters:

Message Lengths by Month

Message Lengths by Hour

Message Lengths by Hour, Split by Month
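The same grouping idea works for lengths by averaging instead of counting. The sketch below assumes each message is a (timestamp, text) pair with a 'YYYY-MM-DD HH:MM:SS' timestamp; `mean_length_by` is a hypothetical helper and the sample data is made up.

```python
from collections import defaultdict
from statistics import mean


def mean_length_by(messages, key):
    """Average message length in characters for each group returned by key()."""
    groups = defaultdict(list)
    for timestamp, text in messages:
        groups[key(timestamp)].append(len(text))
    return {group: mean(lengths) for group, lengths in sorted(groups.items())}


sample = [
    ("2020-03-14 18:30:00", "hello"),
    ("2020-03-20 09:05:00", "on my way"),
    ("2020-04-01 23:59:59", "ok"),
]

# Slicing the timestamp string is enough to pick out the month or the hour.
by_month = mean_length_by(sample, key=lambda ts: ts[:7])
by_hour = mean_length_by(sample, key=lambda ts: int(ts[11:13]))
print(by_month)
print(by_hour)
```

Passing the grouping rule in as `key` lets one function produce the by-month, by-hour, and combined views.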

The obvious next step after analyzing letters was analyzing the words they form. However, deciding what does or does not constitute a word is easier said than done. While experimenting, I came up with three possible definitions for how my code should recognize words:

Loose Definition

Any characters surrounded by whitespace (spaces and line breaks)

"example:" is valid (including the colon as part of the word)

RegEx:
\S+

Strict Definition

Any chain of letters

"wasn't" becomes "wasn" and "t"

RegEx:
[a-zA-Z]+

Goldilocks Definition

A chain of letters and apostrophes, and any dashes that are "sandwiched" between other letters

Avoids both of the issues shown above: "example:" becomes "example" and "wasn't" stays whole

RegEx:
(?:[a-zA-Z']|(?<=[a-zA-Z])-(?=[a-zA-Z]))+
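To see how the three definitions differ in practice, the patterns above can be run against the same made-up sentence with Python's `re` module. The sample string is an assumption chosen to hit each edge case.

```python
import re

LOOSE = re.compile(r"\S+")
STRICT = re.compile(r"[a-zA-Z]+")
GOLDILOCKS = re.compile(r"(?:[a-zA-Z']|(?<=[a-zA-Z])-(?=[a-zA-Z]))+")

sample = "example: it wasn't well-known - really"

loose = LOOSE.findall(sample)
strict = STRICT.findall(sample)
goldilocks = GOLDILOCKS.findall(sample)

print(loose)       # ['example:', 'it', "wasn't", 'well-known', '-', 'really']
print(strict)      # ['example', 'it', 'wasn', 't', 'well', 'known', 'really']
print(goldilocks)  # ['example', 'it', "wasn't", 'well-known', 'really']
```

The lookbehind and lookahead around the dash are what let "well-known" survive while the free-standing dash is dropped. One trade-off is that apostrophes are accepted anywhere, so a word wrapped in single quotes keeps them as part of the match.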

I decided to use my third definition, which let me determine my total:

720,259 words

I can pretty consistently type at 70 words per minute, so if I had to retype every Discord message it would take about 171.5 hours, or a bit over four 40-hour workweeks. Additionally, assuming a single-sided page fits 250 words, a printed copy of my messages would run to roughly 2,900 pages and weigh about 29 pounds on standard 20 lb letter paper (a 500-sheet ream of which weighs about 5 pounds).
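The typing-time and page-count figures follow directly from the word total. A short check, using only the numbers given in the text:

```python
import math

TOTAL_WORDS = 720_259
WORDS_PER_MINUTE = 70
WORDS_PER_PAGE = 250         # single-sided page, as assumed in the text

typing_hours = TOTAL_WORDS / WORDS_PER_MINUTE / 60
workweeks = typing_hours / 40
pages = math.ceil(TOTAL_WORDS / WORDS_PER_PAGE)  # a partly filled page still counts

print(f"{typing_hours:.1f} hours, {workweeks:.2f} workweeks, {pages} pages")
```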

Most of my top words were not particularly surprising:

Word Rank Uses
the 1 22,245
i 2 21,407
a 3 17,234
to 4 14,371
and 5 12,239
it 6 11,266
of 7 10,693
you 8 10,267
is 9 9,088
that 10 7,309
for 11 7,253
in 12 7,032
nan 13 6,268
just 14 6,009
this 15 5,285
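A ranking like this can be produced with `collections.Counter`. The sketch below lowercases each message and applies the third word pattern from above; `top_words` is a hypothetical helper and the sample messages are made up.

```python
import re
from collections import Counter

# The "Goldilocks" word pattern from the definitions above.
WORD = re.compile(r"(?:[a-zA-Z']|(?<=[a-zA-Z])-(?=[a-zA-Z]))+")


def top_words(texts, n=15):
    """Return the n most common words as (word, count) pairs, ignoring case."""
    counts = Counter()
    for text in texts:
        counts.update(WORD.findall(text.lower()))
    return counts.most_common(n)


sample = ["The cat saw the dog", "the dog ran"]
ranking = top_words(sample, n=2)
print(ranking)  # [('the', 3), ('dog', 2)]
```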

I was initially confused about what "nan" meant, but when I looked back through the messages I realized it was the lowercase version of NaN (Not a Number), the placeholder for a missing value, which appears in my data every time I sent a message with an attachment but no text.
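The mechanism can be reproduced in isolation: once a missing value is converted to a string and lowercased, it looks like any other word. This is an illustration of the effect with made-up data, not the original processing code, and it uses a plain float NaN to stand in for the missing text field.

```python
import math

contents = ["Hello", float("nan"), "nice"]  # NaN stands in for a missing text field

# Converting every entry to a lowercase string turns the missing value into "nan".
as_words = [str(value).lower() for value in contents]

# Dropping missing values first keeps the placeholder out of the word counts.
cleaned = [
    value.lower()
    for value in contents
    if not (isinstance(value, float) and math.isnan(value))
]

print(as_words)  # ['hello', 'nan', 'nice']
print(cleaned)   # ['hello', 'nice']
```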

The final visualization I made was a representation of how my usage of my 15 most-used words has changed over time:
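The counting behind a chart like this could be sketched as follows: for each month, tally how often each tracked word appears. It assumes (timestamp, text) pairs with timestamps that start with 'YYYY-MM'; `word_usage_by_month` is a hypothetical helper and the sample data is made up.

```python
import re
from collections import Counter, defaultdict

# The "Goldilocks" word pattern from the definitions above.
WORD = re.compile(r"(?:[a-zA-Z']|(?<=[a-zA-Z])-(?=[a-zA-Z]))+")


def word_usage_by_month(messages, tracked):
    """Return {month: Counter of tracked words} for (timestamp, text) pairs."""
    tracked = set(tracked)
    usage = defaultdict(Counter)
    for timestamp, text in messages:
        month = timestamp[:7]  # assumes timestamps start with 'YYYY-MM'
        for word in WORD.findall(text.lower()):
            if word in tracked:
                usage[month][word] += 1
    return dict(usage)


sample = [
    ("2020-03-14 18:30:00", "the cat and the dog"),
    ("2020-04-01 23:59:59", "I saw the dog"),
]
usage = word_usage_by_month(sample, tracked=["the", "i"])
print(usage["2020-03"]["the"], usage["2020-04"]["i"])  # 2 1
```

Dividing each count by that month's total word count would show changes in how often a word is used relative to overall activity, rather than changes in activity itself.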

Overall, this project was both fun and interesting to create, and I plan to expand it or do more like it in the future.