What Tinder users' demographics are around me?
Summing up my year with a tutorial on scraping and analyzing data on Tinder users around me, based on my preference settings.
A small introduction
Tinder, as we know it, is a matchmaking app in which both parties have to swipe right on each other to match. Over the years, Tinder has evolved and is now also used unofficially for hookups or simply for finding friends to socialize with.
Our goal today is to find out what people around me are really looking for on Tinder, what the demographics of users around me are, and whether my preference settings influence the results. My current preference settings are female users aged 20 to 60 within a 60 km radius. Please note that I am not paying for a service such as Tinder Gold, so the results may be skewed, as Tinder may recommend more relevant results to paying users.
At the end of this article, we will produce a dashboard on the Tinder user demographic.
The sections you can swipe right to:
- Getting the data
- Storing and creating a view
- Connecting to Google Datastudio
- Building the dashboard
- Creating a word cloud of users' bios
- What does my analysis say?
- What is next
Getting the data
The first step of every data analysis project is getting the data. For this project, I want data on the people around me on Tinder at this point in time, so existing online datasets are not relevant. After setting up a Tinder account, the next step is to understand how Tinder works in the browser. There are a few ways to scrape the data: using a tool like Selenium to log in and parse the pages, or calling an API. However, Tinder does not appear to provide an official API (for obvious reasons, such as bots swiping), so I will not go into detail about the unofficial API I used from their website.
But for some context, the API returns around 20 users each time the endpoint is called, along with each user's photo and some information about that user's preferences, in JSON format.
Since I am planning to use BigQuery to store the data and visualize it with Google Datastudio, there is no need to preprocess the JSON response from the API and flatten the nested fields as BigQuery has native support for nested fields.
As BigQuery requires the JSONL (JSON Lines) format instead of a regular JSON document, we have to export the data from the API as a JSONL file. The difference is that JSONL stores each record as a single JSON object on its own line, as shown in the screenshot above.
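As a rough illustration of that conversion, here is a minimal Python sketch. The shape of the response (a `results` list of user records) and the field names are assumptions made for the example, not the actual payload.

```python
import json


def to_jsonl(records):
    """Serialize a list of dicts as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(record, ensure_ascii=False) + "\n" for record in records)


# Hypothetical, trimmed-down API response; the real payload has more fields.
response = {
    "results": [
        {"user": {"name": "A", "bio": "Coffee lover"}},
        {"user": {"name": "B", "bio": ""}},
    ]
}

jsonl_text = to_jsonl(response["results"])
print(jsonl_text, end="")
```

Writing `jsonl_text` to a file with a `.jsonl` extension gives a file with one record per line. Nested fields such as `user` stay nested, so no flattening is needed.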
After loading the JSONL into BigQuery using the UI, we will get a table similar to the screenshot above. Nested fields such as user information are stored as records on BigQuery.
Storing and creating a view
We will be creating a view to query the table for a few reasons:
- We do not require all the columns, so selecting only the required columns is a cost-effective method.
- In the event that I share my dataset or data source, I do not want to expose user-identifiable information from my raw table.
- Some basic data transformation can be done in the view itself rather than doing it in Datastudio.
Filtering users' bios
Running LENGTH in BigQuery returns the number of characters rather than the number of words. Fortunately, filtering by character length is good enough, as some users put only their Instagram or Telegram handle (IG:blabla) in their bio, which counts as just one word.
IF(LENGTH(user.bio) > 3, 1, 0) AS hasBio
The expression above returns 1 if the user has a bio longer than 3 characters, and 0 otherwise. I chose 3 characters because a quick Google search tells me that the average English word is about 4.7 characters long.
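To sanity-check the rule outside BigQuery, the same logic can be sketched in Python. The function name `has_bio` and the `min_chars` parameter are made up for this example. Treating a missing bio as 0 is meant to mirror how BigQuery's IF falls through to the else branch when the condition is NULL.

```python
def has_bio(bio, min_chars=3):
    """Return 1 when the bio has more than `min_chars` characters, else 0."""
    if bio is None:
        # A missing bio counts as "no bio".
        return 0
    return 1 if len(bio) > min_chars else 0


print(has_bio("IG:blabla"))  # 1: a single "word", but 9 characters
print(has_bio("hi"))         # 0: too short to count as a bio
print(has_bio(None))         # 0: no bio at all
```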
Other data standardization
Based on the screenshot above, the two “student” entries look the same to us but are different in the eyes of a computer. One way to handle this so our demographics are cleaner is to convert the text to a single case, either uppercase or lowercase.
LOWER(jobs.name) lowercases all the job names, so the “student” entries are grouped together with 17 occurrences. Most user-input fields need some form of standardization, so it is always advisable to explore the data and figure out which fields require it.
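The effect of lowercasing before grouping can be sketched in Python with a small, made-up list of job names. Stripping surrounding whitespace is an extra step added here for illustration. It is not part of the LOWER call above.

```python
from collections import Counter


def count_jobs(job_names):
    """Count job names after lowercasing and trimming; skip empty values."""
    normalized = (
        name.strip().lower() for name in job_names if name and name.strip()
    )
    return Counter(normalized)


# Made-up sample values, not data from the real table.
jobs = ["Student", "student", "STUDENT ", "Engineer", None, ""]
counts = count_jobs(jobs)
print(counts.most_common())
```

Without the normalization step, the three spellings of "student" would be counted as three separate jobs.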
After narrowing down to the required columns and applying the data transformations, the following is the view we will be using for the dashboard.
The row number (row_n) is used to remove duplicate records.
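The idea behind row_n is to number the records that share the same identifier and keep only the first of each group. A rough Python equivalent is sketched below. The `_id` key is a hypothetical identifier field, and this is an illustration of the technique rather than the view's actual SQL.

```python
def deduplicate(records, key="_id"):
    """Keep the first record seen for each key value, like keeping row_n = 1."""
    seen = set()
    unique = []
    for record in records:
        identifier = record.get(key)
        if identifier is None:
            # Records without an identifier cannot be compared, so keep them.
            unique.append(record)
            continue
        if identifier in seen:
            continue
        seen.add(identifier)
        unique.append(record)
    return unique


# Made-up records: the same user returned by two separate API calls.
records = [
    {"_id": "u1", "name": "A"},
    {"_id": "u2", "name": "B"},
    {"_id": "u1", "name": "A again"},
]
unique_records = deduplicate(records)
print(len(unique_records))
```

Duplicates are likely because the endpoint is called repeatedly and can return a user more than once.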
The view schema will look similar to the screenshot above.
Connecting to Google Datastudio
Google Datastudio is a great tool for quickly pulling in data from BigQuery and visualizing it. Google provides a quick start guide on creating a report and connecting it to a data source. After connecting to the BigQuery view we created earlier, we can start designing and setting up charts.
Building the dashboard
The data points I have are good enough for a general understanding of the Tinder user demographic around me, such as users' jobs, the universities they attended or are still attending, their age range, and even their interests, if they added any.
In Datastudio, we create a field that subtracts the user's birth year from the current year to get their age, then drag it into a simple pie chart. We then add the built-in Record Count metric, which counts the number of rows for each value of the dimension, and sort it in descending order.
I have shared my final dashboard on the Tinder users around me dashboard page of my website. It is interactive, so you can play around with the filters and charts.
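The age calculation is a year difference. Here is a minimal Python sketch of it, assuming the birthday arrives as an ISO 8601 string that starts with a four-digit year. That format is an assumption about the API response.

```python
from datetime import date


def age_from_birth_date(birth_date, today=None):
    """Approximate age as the current year minus the birth year."""
    today = today or date.today()
    birth_year = int(birth_date[:4])  # assumes a leading "YYYY"
    return today.year - birth_year


print(age_from_birth_date("1995-06-15T00:00:00.000Z", today=date(2021, 12, 31)))  # 26
```

Because only the years are compared, the result can be one year too high for users whose birthday has not yet occurred in the current year. That is acceptable for a rough age distribution.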
Creating a word cloud of users' bios
Apart from creating a dashboard, we can also create a word cloud from users' bios. Andreas Mueller created WordCloud for Python, a library that is simple to use yet produces beautiful results, and it is what I used for the Tinder bio word cloud shown above.
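A word cloud is a picture of word frequencies, so the counting step can be sketched with the standard library alone. The stopword list and sample bios below are made up for illustration, and this is not necessarily how the word cloud above was produced.

```python
import re
from collections import Counter

# A tiny, illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "to", "i", "of", "in", "for", "my", "me", "is"}


def bio_word_frequencies(bios):
    """Count lowercase words across all bios, skipping stopwords and empty bios."""
    counts = Counter()
    for bio in bios:
        if not bio:
            continue
        words = re.findall(r"[a-z]+(?:'[a-z]+)?", bio.lower())
        counts.update(word for word in words if word not in STOPWORDS)
    return counts


# Made-up bios, not real user data.
bios = ["Looking for friends and coffee", "Coffee, travel, and Netflix", None]
frequencies = bio_word_frequencies(bios)
print(frequencies.most_common(3))

# With the wordcloud package installed, the counts could be rendered with
# WordCloud().generate_from_frequencies(frequencies).
```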
What does my analysis say?
I have uploaded my dashboard at “What Tinder users’ demographics are around me?”.
After all the work, it is time to understand what my data is showing. These are just my assumptions based on the data I collected and are not to be taken personally or seriously.
Based on a treemap of the top jobs of Tinder users around me, students make up the majority. This could largely be because Tinder tries to recommend people around my age, most of whom would still be university students. The next largest group is users working in STEM fields.
Combining the word cloud with the treemap, we can assume that the majority of Tinder users are young adults looking for love or just friendship. The top universities that Tinder users attend say something about the surrounding area; however, we will not delve too much into that as it is specific to my country.
Interestingly, none of the users who listed IT as their job put Travel as an interest; instead, they selected music, coffee, gamer, and Netflix.
This is data that should not be shared with the general public. Therefore, a few steps have to be taken to ensure that the data I am sharing with you cannot be traced back to any single user. One necessary consideration is ensuring that the view I created, and anything I may share from it in the future, does not contain any IDs or user-identifiable data. I have also applied the principle of least privilege to access on my Google Cloud project and BigQuery.
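One simple way to enforce this is to share only an explicit allowlist of fields. The Python sketch below shows the idea. All field names in it are hypothetical and do not come from the real view.

```python
# Hypothetical allowlist of non-identifying fields that are safe to share.
SHAREABLE_FIELDS = ("age", "job", "school", "interests", "hasBio")


def strip_identifiers(record, allowed=SHAREABLE_FIELDS):
    """Return a copy of the record containing only allowlisted fields."""
    return {field: record[field] for field in allowed if field in record}


# Made-up raw record containing identifying fields.
raw = {"_id": "u1", "name": "A", "age": 25, "job": "student", "hasBio": 1}
shared = strip_identifiers(raw)
print(shared)
```

An allowlist fails safe: a new identifying field added to the raw data stays private until someone deliberately adds it to the list.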
What is next
There is so much more to do with this information, but I am publishing as I progress. With the fields for users who connected their Instagram, we could build another demographic view of users with public or private Instagram profiles, which groups usually connect Tinder with Instagram, or whether there are any “boomer” users.
A next step could be opening up my preference settings to be as diverse as possible and collecting data on people of all genders around me, so we can see the demographics better. And with the data I have, there is potential to train models for understanding intentions from a user's bio, keyword extraction, image recognition to tell whether a user uploaded a selfie or an image of a cat, and so on.
Periodically running the script to extract and load the data may be in the pipeline, to understand whether the demographics change over time, but that is not my priority as of now.
Perhaps I could also expand my region to understand how the demographics vary across Southeast Asia and where Tinder could enhance its services for target regions (hinting at Tinder to hire me). Jokes aside, there is so much we can do with the data we have now, and it will be interesting to see how Tinder enhances its service as the pandemic eases and social gatherings increase.