About this repository

This is a public repository providing an example of basic Exploratory Data Analysis (EDA) using R and RStudio. After loading a dataset, EDA is often the next step of data analysis. The basic idea of EDA is to quickly analyse and investigate data sets and summarize their main characteristics, often employing data visualization methods. As presented by IBM, EDA helps

data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions. EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis testing task and provides a better understanding of data set variables and the relationships between them. It can also help determine if the statistical techniques you are considering for data analysis are appropriate. The main purpose of EDA is to help look at data before making any assumptions. It can help identify obvious errors, as well as better understand patterns within the data, detect outliers or anomalous events, find interesting relations among the variables.¹
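As a rough illustration of the tasks named in the description above, the following sketch uses only base R and the built-in airquality dataset. The dataset choice and the variable names are assumptions made for this example; they are not part of the repository's report.

```r
# Built-in dataset from the 'datasets' package, used here only for illustration
data(airquality)

# Summarise main characteristics: quartiles, mean, and NA counts per column
print(summary(airquality))

# Spot obvious gaps: number of missing values in each variable
missing_per_column <- colSums(is.na(airquality))
print(missing_per_column)

# Detect outliers in one variable: boxplot.stats() returns the points that
# lie beyond the whiskers (by default about 1.5 times the box length)
ozone_outliers <- boxplot.stats(airquality$Ozone)$out
print(ozone_outliers)

# Relations among variables: pairwise correlations using complete rows only
correlations <- cor(airquality, use = "complete.obs")
print(round(correlations, 2))

# Visual check, only drawn in an interactive session
if (interactive()) boxplot(Ozone ~ Month, data = airquality)
```

Each call covers one of the goals in the quoted description: summarising, spotting errors or gaps, detecting outliers, and finding relations among variables.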

We documented the following steps in an R Markdown document, which you can clone and explore further. We published the example EDA report produced here on RPubs, where it is available at http://movimentar.co/eda_wdi.

This is just a very basic example for beginners, and there are plenty of other methods that can be used for a full EDA. Have fun and keep coding!

Steps for basic EDA from scratch

1. Install R (see https://cran.r-project.org/) and RStudio (see https://www.rstudio.com) on your laptop or desktop.
2. Install the tidyverse package for data manipulation by typing install.packages("tidyverse") in the R console. This is required for later steps.
3. Use a raw dataset in any format you prefer, such as CSV (comma-separated values) or XLSX. You may download any free dataset listed at https://r-dir.com/reference/datasets.html or https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html, or connect to the World Bank Development Indicators as shown at https://www.r-project.org/nosvn/pandoc/WDI.html. You may also connect to a remote or local database such as MySQL or MongoDB, if you prefer. R is very flexible in terms of connections to data sources.
4. Load the dataset as a data frame and give it a clear and short name (e.g. povertydata). We recommend using a consistent naming convention, such as all lower case or camel case (see https://en.wikipedia.org/wiki/Camel_case).
5. View the dataset by typing View(povertydata) in the R console.
6. Do some basic EDA of your dataset as shown at https://blog.datascienceheroes.com/exploratory-data-analysis-in-r-intro/ and https://www.r-bloggers.com/2018/11/explore-your-dataset-in-r/.
7. Install the DataExplorer package by typing install.packages("DataExplorer") in the R console, then run DataExplorer::create_report(povertydata).
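The steps above can be sketched roughly as follows. The small CSV file written here is a hypothetical stand-in for your own download, and its column names (country, year, poverty_rate) are invented for illustration. The sketch uses base R for loading and inspection so that it runs without extra packages; the DataExplorer call only runs in an interactive session where that package is installed.

```r
# Hypothetical sample file standing in for your own CSV download
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "country,year,poverty_rate",
  "A,2015,12.4",
  "A,2019,10.1",
  "B,2015,33.0",
  "B,2019,NA"
), csv_path)

# Load the dataset as a data frame with a short, lower-case name
povertydata <- read.csv(csv_path)

# View() opens a spreadsheet-style viewer, so only call it interactively
if (interactive()) View(povertydata)

# Basic EDA: structure, summary statistics, and missing values per column
str(povertydata)
print(summary(povertydata))
missing_counts <- colSums(is.na(povertydata))
print(missing_counts)

# Automated report, only if DataExplorer is installed and the session is interactive
if (interactive() && requireNamespace("DataExplorer", quietly = TRUE)) {
  DataExplorer::create_report(povertydata)
}
```

To try this with your own data, replace csv_path with the path to your file; for XLSX files you would need a reader from an additional package instead of read.csv().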

¹ IBM (2021) Exploratory Data Analysis. Available at: https://www.ibm.com/cloud/learn/exploratory-data-analysis.
