To access the Jupyter notebook of this study, click here: https://awneto-basic.github.io/estonian_citizens_around_the_world/
This dataset lists the number of Estonian citizens living outside of Estonia, grouped by country, as of 21st February 2022.
I asked the Ministry of Interior of Estonia for the relevant data, and they sent me a spreadsheet listing countries and territories around the world along with the number of Estonian citizens living in each of them. To be clear, the data was anonymised: the database contained only the number of Estonians, not their identities.
It is important to note that this data does not necessarily describe the worldwide distribution of people who identify as ethnic Estonians, but rather the number of Estonian citizens (i.e. people bearing an Estonian passport) living outside of Estonia.
With the dataset in hand, I decided to undertake an Exploratory Data Analysis exercise to see if I could find patterns and insights in the data.
I started the exercise by adding features I considered relevant. I enriched the dataset with the following features:
- ISO alpha-3 code - a three-letter string that identifies the country and is used as an input for some geographic plotting tools.
- Continent - for grouping the data into regions
- Sub-regions - as defined by the United Nations' Standard Country or Area Codes for Statistical Use, for grouping the data into finer-grained regions.
- Capital city and its coordinates - to set the locations for the datapoints related to each country on the geographic scatter plot.
- Distance between the country's capital city and Tallinn - to assess if there was any correlation between how far the country is from Estonia and how many Estonians have chosen to live there.
- Population (2020) - as estimated in the United Nations' "2019 Revision of World Population Prospects". Some countries had missing entries, which I filled in from other datasets. Note that these estimates were made in 2019, before the COVID-19 pandemic, so actual populations may have diverged from them by the time the target data (i.e., the number of Estonian citizens living in each country) was generated. Thus, any correlation between this feature and the target feature may not reflect the current reality.
- GDP PPP per capita (2022) - extracted from the IMF's "World Economic Outlook Database 2022", published in October 2021. The majority of the GDP PPP figures in the dataset are estimates from prior years.
- Former member of the USSR? - manually populated. I was curious to see if I could get any insights from assessing the number of Estonian citizens living in former members of the Soviet Union (apart from Estonia).
- Sovereignty - manually populated. This categorical feature indicates if the target is a sovereign state.
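The text does not say how the distance to Tallinn was computed. One common option is the great-circle (haversine) distance between two latitude/longitude pairs, sketched below. The Tallinn coordinates, the function names and the use of a spherical Earth with a mean radius of 6371 km are assumptions made for illustration, not details taken from the original analysis.

```python
import math

# Approximate coordinates of Tallinn in decimal degrees (assumed for illustration).
TALLINN = (59.437, 24.7536)
# Mean Earth radius in kilometres (spherical approximation).
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_to_tallinn_km(lat, lon):
    """Distance in km from a capital city's coordinates to Tallinn."""
    return haversine_km(lat, lon, *TALLINN)
```

Applied to each row's capital coordinates, a function like `distance_to_tallinn_km` would yield the distance feature. The spherical approximation can be off by a fraction of a percent compared with an ellipsoidal model, which should be negligible for a country-level correlation.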
There were at least three relevant features that were not added to the dataset:
- Percentage of Russian speakers in each country: I found it difficult to find consistent datasets with the percentage of Russian speakers worldwide.
- Number of English speakers in each country: absent from this dataset for the same reason as above.
- Number of speakers of Finno-Ugric languages: within the Finno-Ugric language group, Hungarian, Finnish and Estonian, the national languages of Hungary, Finland and Estonia respectively, are the top three by number of speakers. I chose not to add this feature because it was difficult to find consistent data on the number of speakers of the other Finno-Ugric languages.