EDA (exploratory data analysis) is a necessary step in data analysis. It is used to inspect the statistical characteristics of variables, which can then serve as a basis for feature engineering. This time I am sharing three EDA tools. Each of them has been covered before; here the three toolkits are summarized and introduced together.
1. Pandas_Profiling
This is the most lightweight and simplest of the three. It can quickly generate reports and provide an overview of variables. First, we need to install the package.
# Enable the Jupyter widgets extension
jupyter nbextension enable --py widgetsnbextension
# Or install it via conda
conda create -n pandas-profiling
conda activate pandas-profiling
conda install -c conda-forge pandas-profiling
# Or install directly from the source address
pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
After successful installation, you can import the data and generate reports directly.
import pandas as pd
import seaborn as sns
mpg = sns.load_dataset("mpg")
mpg.head()
from pandas_profiling import ProfileReport
profile = ProfileReport(mpg, title="MPG Pandas Profiling Report", explorative=True)
profile
Pandas Profiling quickly generates a report with rich visualizations. The report is displayed directly in the notebook rather than opened in a separate file.
In total, six sections are provided: overview, variables, interactions, correlations, missing values, and samples.
The variables section of Pandas Profiling is comprehensive, and it generates a detailed report for each variable.
As you can see from the figure above, a great deal of information is available for just one variable, including descriptive statistics and quantiles.
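To get a feel for the kind of descriptive and quantile information listed for a variable, here is a minimal pandas sketch. It is not how Pandas Profiling computes its report; the values are made-up stand-ins for the `mpg` column, and the chosen percentiles are an assumption.

```python
import pandas as pd

# Toy values standing in for the mpg column; illustrative only, not the real dataset.
mpg_values = pd.Series([18.0, 15.0, 26.0, 32.0, 24.0], name="mpg")

# describe() returns count, mean, std, min, the requested quantiles, and max.
summary = mpg_values.describe(percentiles=[0.05, 0.25, 0.5, 0.75, 0.95])
print(summary)
```

The report adds histograms and further statistics on top of numbers like these, but the plain summary is a handy sanity check.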
Interaction
The interaction section allows us to obtain a scatter plot between two numerical variables.
Correlation
Information about the relationship between two variables can be obtained.
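The numbers behind a correlation view can be approximated in plain pandas. This is only a sketch on made-up data, assuming Pearson and Spearman coefficients are the ones of interest; it is not the report's own implementation.

```python
import pandas as pd

# Toy rows shaped like the mpg dataset; the values are illustrative only.
df = pd.DataFrame({
    "mpg": [18.0, 15.0, 26.0, 32.0, 24.0],
    "weight": [3504, 3693, 2234, 1985, 2600],
    "origin": ["usa", "usa", "europe", "japan", "japan"],
})

# Correlation coefficients are only defined for numeric columns here,
# so the text column "origin" is dropped first.
numeric = df.select_dtypes(include="number")

pearson = numeric.corr(method="pearson")    # linear relationship
spearman = numeric.corr(method="spearman")  # rank-based (monotonic) relationship

print(pearson)
print(spearman)
```

In this toy data heavier cars always have lower mpg, so both coefficients between `mpg` and `weight` come out negative.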
Missing values
You can get information about the missing value count for each variable.
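The same count can be reproduced in plain pandas, which is useful for checking the report. A minimal sketch on made-up data (the gaps in `horsepower` are invented for illustration):

```python
import pandas as pd

# Toy rows with two missing horsepower values; illustrative only.
df = pd.DataFrame({
    "mpg": [18.0, 15.0, 26.0, 32.0],
    "horsepower": [130.0, None, 95.0, None],
    "origin": ["usa", "usa", "europe", "japan"],
})

missing_count = df.isna().sum()   # number of missing values per column
missing_share = df.isna().mean()  # fraction of missing values per column

print(missing_count)
print(missing_share)
```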
Samples
You can display the sample rows in the data set for understanding the data.
2. Sweetviz
Sweetviz is another open source Python package that generates beautiful EDA reports with just one line of code. Unlike Pandas Profiling, it outputs a completely standalone HTML application.
Install the package using pip
pip install sweetviz
Once it is installed, we can generate reports using Sweetviz. Try it out below.
import sweetviz as sv
# A target feature can optionally be specified
my_report = sv.analyze(mpg, target_feat="mpg")
my_report.show_html()
As you can see from the above image, Sweetviz report generation is similar to the previous Pandas Profiling, but with a different UI.
Sweetviz can not only show the distribution and statistical properties of a single variable; it can also take a target feature and analyze the other variables in relation to it. On the far right of the report above, it shows the numerical and categorical associations for all available variables.
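One way to picture analysis against a target: for a categorical feature, look at the average target value within each category. The following is a rough pandas sketch of that idea on made-up data, not Sweetviz's implementation.

```python
import pandas as pd

# Toy rows shaped like the mpg dataset; the values are illustrative only.
df = pd.DataFrame({
    "mpg": [18.0, 16.0, 26.0, 30.0, 34.0],
    "origin": ["usa", "usa", "europe", "japan", "japan"],
})

# For each category of "origin", count the rows and average the target "mpg".
target_by_origin = df.groupby("origin")["mpg"].agg(["count", "mean"])
print(target_by_origin)
```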
The advantage of Sweetviz does not lie in the EDA reports on individual datasets, but in the comparison of the datasets.
Datasets can be compared in two ways: by comparing two separate datasets (e.g. training and testing sets), or by using a filter to split a single dataset into two segments.
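For the first way, Sweetviz provides `sv.compare`, which takes two DataFrames with display names. Here is a hedged sketch: the toy data, the 80/20 split, the helper name `compare_train_test`, and the output file name are all assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the mpg data; the values are illustrative, not the real dataset.
df = pd.DataFrame({
    "mpg": [18.0, 15.0, 26.0, 32.0, 24.0, 21.0, 28.0, 17.0, 36.0, 23.0],
    "weight": [3504, 3693, 2234, 1985, 2600, 2900, 2300, 3400, 1800, 2700],
})

# Hold out 20% of the rows as a test set; random_state fixes the shuffle.
train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)


def compare_train_test(train_df, test_df, target="mpg"):
    """Build a Sweetviz report comparing the two splits (hypothetical helper)."""
    import sweetviz as sv  # imported here so the split above runs without Sweetviz

    report = sv.compare([train_df, "Train"], [test_df, "Test"], target_feat=target)
    report.show_html("train_vs_test.html")  # the file name is an arbitrary choice


# With Sweetviz installed and a realistically sized dataset:
# compare_train_test(train, test)
```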
For the filter-based approach, the example below splits the data into two subsets, USA and NOT-USA.
# Split the data with a boolean condition and name the two subsets
my_report = sv.compare_intra(mpg, mpg["origin"] == "usa", ["USA", "NOT-USA"], target_feat="mpg")
my_report.show_html()
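Conceptually, the boolean condition divides the rows into the ones that match and the ones that do not, and the report describes both groups side by side. A minimal pandas sketch of that split on made-up data (not Sweetviz's own code):

```python
import pandas as pd

# Toy rows shaped like the mpg dataset; the values are illustrative only.
df = pd.DataFrame({
    "mpg": [18.0, 16.0, 26.0, 30.0, 34.0],
    "origin": ["usa", "usa", "europe", "japan", "japan"],
})

# The same kind of boolean condition that is passed to compare_intra.
is_usa = df["origin"] == "usa"
usa, not_usa = df[is_usa], df[~is_usa]

# Summarize the target in both subsets side by side.
comparison = pd.DataFrame({
    "USA": usa["mpg"].describe(),
    "NOT-USA": not_usa["mpg"].describe(),
})
print(comparison)
```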
This allows us to analyze these variables quickly without writing much code, which greatly reduces the work in the EDA stage and leaves more time for variable analysis and selection.
Some of the advantages of Sweetviz are:
Ability to analyze a dataset against a target variable
Ability to compare two datasets
However, there are some disadvantages:
No visualization between variables, such as scatter plots
Reports are opened in a separate tab
Personally, I like Sweetviz better.
3. PandasGUI
PandasGUI differs from the previous two in that it does not generate a report. Instead, it opens a GUI (graphical user interface) around the DataFrame, which we can use to analyze the data in more detail.
First, install PandasGUI.
# pip install
pip install pandasgui
# or download via source
pip install git+https://github.com/adamerose/pandasgui.git
Then, run a few lines of code to try it out.
from pandasgui import show
# Open the dataset in the GUI
gui = show(mpg)
There are many things you can do in this GUI, such as filtering, viewing summary statistics, creating charts between variables, and reshaping data. These operations are available as tabs, and most of them work by drag and drop.
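As far as I know, the filter box in PandasGUI accepts expressions in the style of pandas' `DataFrame.query`; treat that as an assumption and check the PandasGUI documentation. Under that assumption, the same filter can be prototyped in plain pandas on made-up data:

```python
import pandas as pd

# Toy rows shaped like the mpg dataset; the values are illustrative only.
df = pd.DataFrame({
    "mpg": [18.0, 16.0, 26.0, 30.0, 34.0],
    "cylinders": [8, 8, 4, 4, 4],
    "origin": ["usa", "usa", "europe", "japan", "japan"],
})

# A query expression refers to column names directly and combines conditions with and/or.
efficient_fours = df.query("cylinders == 4 and mpg > 28")
print(efficient_fours)
```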
For example, statistics like the one below.
The most impressive feature is the plotter. Its drag-and-drop operation feels much like working in Excel, so there is almost no learning curve.
Data can also be reshaped by creating a new pivot table or by melting the dataset.
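For readers who have not used these two reshaping operations, here is what they look like in plain pandas. This is a sketch on made-up data; the column choices are assumptions for illustration and do not reflect anything specific to the GUI.

```python
import pandas as pd

# Toy rows shaped like the mpg dataset; the values are illustrative only.
df = pd.DataFrame({
    "origin": ["usa", "usa", "japan", "japan"],
    "cylinders": [8, 6, 4, 4],
    "mpg": [16.0, 20.0, 30.0, 34.0],
    "weight": [3600, 3000, 2100, 1900],
})

# Pivot: one row per origin, one column per cylinder count, cells hold the mean mpg.
pivot = df.pivot_table(index="origin", columns="cylinders", values="mpg", aggfunc="mean")

# Melt: go from wide to long, with one row per (origin, variable) observation.
long = df.melt(
    id_vars=["origin"],
    value_vars=["mpg", "weight"],
    var_name="variable",
    value_name="value",
)

print(pivot)
print(long)
```

Combinations that never occur in the data (for example a 4-cylinder car from the usa in this toy table) show up as NaN in the pivot.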
The processed datasets can then be exported directly to CSV.
Some of the advantages of PandasGUI are:
Ability to drag and drop
Quickly filter data
Fast plotting
The disadvantages are:
No complete statistical information
No report generation
4. Conclusion
Pandas Profiling, Sweetviz, and PandasGUI are all great tools designed to simplify EDA. Each has its own strengths and suits different workflows, as summarized below.
- Pandas Profiling is suitable for quickly generating analyses of individual variables.
- Sweetviz is suitable for comparing datasets and for analyzing variables against a target variable.
- PandasGUI is suitable for deep analysis with manual drag-and-drop functionality.