DataAnalyzer is a Python library designed for statistical data analysis and automated test selection. It supports a range of statistical tests—from normality tests and t-tests to non-parametric, correlation, regression, and variance tests. Additionally, DataAnalyzer can be extended to use a machine learning model to reason about which test to run based on extracted data features.
DataAnalyzer helps you select a statistical test. Concretely, it fits into the analysis (statistical inference) phase of a project:
- After you have collected, cleaned, and explored the data (pre-processing).
- Before or in parallel with any deep learning training, if you wish to test statistical hypotheses (e.g. comparing two groups, assessing the presence of a confounding variable, etc.).
Typically, you use it when you want to test whether differences between groups or variables (means, distributions, dependencies) are significant and need to determine the most suitable statistical procedure (parametric or non-parametric test, number of samples, paired or independent data, etc.).
- Statistical Tests:
  - Normality: Shapiro-Wilk test.
  - Comparative Tests: Student's t-test, Welch's t-test, Wilcoxon signed-rank test, Mann-Whitney U test.
  - Correlation Tests: Pearson and Spearman correlations.
  - Regression: Linear and non-linear regression.
  - Variance Tests: Levene's test.
  - Multiple Group Comparison: ANOVA and Kruskal-Wallis tests.
  - Contingency Analysis: Chi-square test.
  - ANCOVA: Analysis of covariance using a DataFrame.
- Flexible Data Input:
  - Accepts one or more datasets.
  - Validates data to ensure it meets the requirements for each test.
- Model Reasoning:
  - Optionally load a pre-trained model (e.g., a pickled scikit-learn estimator) that predicts the appropriate statistical test based on a rich set of extracted features.
  - Robust feature extraction including mean, standard deviation, skewness, kurtosis, sample size, and normality flag.
- Centralized Logging:
  - Detailed logging for debugging and tracking test execution.
- Graphical User Interface (GUI):
  - Enhanced visualization using Seaborn with custom colors, labels, and pointer control.
  - Multi-selection support for statistical tests and graphical representation.
  - User-friendly design for intuitive data analysis.
To install the required dependencies, use pip:
pip install numpy pandas scipy statsmodels seaborn matplotlib
If you intend to use a machine learning model for test reasoning, make sure you have installed the dependencies for your model (for example, scikit-learn):
pip install scikit-learn
Below is an example of how to use DataAnalyzer:
import numpy as np
import pandas as pd
from data_analyzer import DataAnalyzer # assuming the module is named data_analyzer.py
# Example datasets
data1 = [1, 2, 3, 4, 5]
data2 = [2, 4, 6, 8, 10]
# Initialize the analyzer with two datasets
analyzer = DataAnalyzer(data1, data2)
# Run a default comparison test (automatically selects t-test or non-parametric equivalent)
comparison_result = analyzer.run_test("comparison")
print("Comparison test result:", comparison_result)
# Run a chi-square test for independence
chi_result = analyzer.run_test("independence")
print("Chi-square result:", chi_result)
# Run a correlation test
correlation_result = analyzer.run_test("correlation")
print("Correlation test result:", correlation_result)
# Run linear regression
regression_result = analyzer.run_test("regression")
print("Linear regression result:", regression_result)
# Run Levene's test for equality of variances
levene_result = analyzer.run_test("variance")
print("Levene's test result:", levene_result)
# Run ANCOVA test with a DataFrame
df = pd.DataFrame({
"y": [1, 2, 3, 4],
"x": [1, 2, 3, 4],
"cov": [2, 4, 6, 8]
})
analyzer_with_df = DataAnalyzer(data1, data2, df=df)
ancova_result = analyzer_with_df.run_test("ancova", formula="y ~ x + cov")
print("ANCOVA result:\n", ancova_result)
# Non-linear regression example
x_data = np.array([0, 1, 2, 3])
y_data = np.array([1, 2.7, 7.4, 20.1])
func = lambda x, a, b: a * np.exp(b * x)
analyzer_nl = DataAnalyzer(x_data, y_data)
non_linear_result = analyzer_nl.run_test("non_linear", func=func, p0=[1, 1])
print("Non-linear regression result:", non_linear_result)
# Multiple groups example (ANOVA or Kruskal-Wallis will be selected automatically)
group1 = [1, 2, 3, 4, 5]
group2 = [2, 3, 4, 5, 6]
group3 = [1.5, 2.5, 3.5, 4.5, 5.5]
analyzer_multi = DataAnalyzer(group1, group2, group3)
multi_result = analyzer_multi.run_test("comparison")
print("Multiple groups comparison test result:", multi_result)
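The automatic selection mentioned in the comparison example can be pictured with plain SciPy calls. The sketch below is not DataAnalyzer's internal logic; it is one common way to choose between Student's t-test, Welch's t-test, and the Mann-Whitney U test for two independent samples. The function name `choose_two_sample_test` and the `alpha` threshold are assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def choose_two_sample_test(a, b, alpha=0.05):
    """Pick and run a test for two independent samples (illustrative only)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)

    # Shapiro-Wilk: a p-value above alpha means normality is not rejected.
    normal = all(stats.shapiro(x).pvalue > alpha for x in (a, b))
    if not normal:
        res = stats.mannwhitneyu(a, b, alternative="two-sided")
        return "mann_whitney_u", res.statistic, res.pvalue

    # Levene's test decides between the pooled-variance t-test and Welch's t-test.
    equal_var = stats.levene(a, b).pvalue > alpha
    res = stats.ttest_ind(a, b, equal_var=equal_var)
    name = "student_t" if equal_var else "welch_t"
    return name, res.statistic, res.pvalue


if __name__ == "__main__":
    name, statistic, p_value = choose_two_sample_test([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
    print(name, statistic, p_value)
```

With very small samples the Shapiro-Wilk test has little power, so a rule like this tends to default to the parametric branch; treat the result as a starting point rather than a verdict.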
DataAnalyzer can optionally load a pre-trained model to reason about which statistical test to run based on the characteristics of the input data. The model is expected to implement a `predict` method and accept a feature vector that includes:
- Mean
- Standard deviation
- Skewness
- Kurtosis
- Sample size
- Normality flag
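As a rough illustration of such a feature vector, the sketch below computes the six listed quantities for a single dataset with NumPy and SciPy. The function name `extract_features`, the ordering of the entries, the use of the sample standard deviation, and the `alpha` threshold for the normality flag are all assumptions; the library's own extraction may differ, for example in how it combines several datasets.

```python
import numpy as np
from scipy import stats


def extract_features(data, alpha=0.05):
    """Return [mean, std, skewness, kurtosis, n, normality flag] for one dataset."""
    x = np.asarray(data, dtype=float)
    return np.array([
        x.mean(),
        x.std(ddof=1),                            # sample standard deviation
        stats.skew(x),
        stats.kurtosis(x),                        # excess kurtosis (0 for a normal distribution)
        x.size,
        float(stats.shapiro(x).pvalue > alpha),   # 1.0 if normality is not rejected
    ])


if __name__ == "__main__":
    print(extract_features([1, 2, 3, 4, 5]))
```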
You can load a model from a file (e.g., a pickled scikit-learn model):
analyzer.load_model("path_to_model.pkl")
Once a model is loaded, you can use it to determine the test to run:
model_test_result = analyzer.model_reason("comparison")
print("Model reasoning test result:", model_test_result)
The `model_reason` function extracts a robust feature set from the datasets, verifies that the model is valid, and falls back to the default behavior if any error occurs.
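The validate-then-fall-back pattern described here can be sketched in a few lines. This is not the library's code: `reason_with_fallback`, `ConstantModel`, and the test-name strings are hypothetical, and the only assumption about the model is a scikit-learn style `predict` that takes one row per sample.

```python
import logging

logger = logging.getLogger("model_reason_sketch")


def reason_with_fallback(model, features, default_test):
    """Return the model's suggested test name, or default_test on any problem."""
    # Reject objects that do not expose a callable predict method.
    if model is None or not callable(getattr(model, "predict", None)):
        logger.warning("No usable model; falling back to %s", default_test)
        return default_test
    try:
        # scikit-learn style: predict takes a 2-D array-like, one row per sample.
        prediction = model.predict([list(features)])
        return str(prediction[0])
    except Exception:
        logger.exception("Prediction failed; falling back to %s", default_test)
        return default_test


class ConstantModel:
    """Hypothetical stand-in for a trained estimator."""

    def predict(self, rows):
        return ["mann_whitney_u" for _ in rows]


if __name__ == "__main__":
    print(reason_with_fallback(ConstantModel(), [3.0, 1.58, 0.0, -1.3, 5, 1.0], "t_test"))
    print(reason_with_fallback(None, [], "t_test"))
```

Catching every exception is a deliberate choice in this sketch: a broken or incompatible model should never prevent the default test from running, and the log entry keeps the failure visible.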
The GUI enables users to easily select and analyze data using a visually appealing interface:
- Multi-selection for statistical tests.
- Graph customization: Colors, labels, and pointer control for enhanced clarity.
- Seaborn integration for high-quality visualizations.
- User-friendly interface for data input and visualization selection.
To launch the GUI, use:
streamlit run gui_app.py
Contributions are welcome! Please feel free to submit issues or pull requests for improvements and additional features.
- Implement the complete workflow below:
flowchart TD
%% Styling for improved readability
classDef startend fill:#F5EBFF,stroke:#BE8FED,stroke-width:2px,color:black;
classDef decision fill:#FFF6CC,stroke:#FFBC52,stroke-width:2px,color:black;
classDef process fill:#E5F6FF,stroke:#45AEE6,stroke-width:2px,color:black;
classDef z fill:#D4EDDA,stroke:#28A745,stroke-width:2px,color:black;
%% Start Node
A[Start]:::startend --> B{Are the data normally distributed?}:::decision
%% Path for normally distributed data %%
B -- Yes --> C{Is there a difference in mean values?}:::decision
B -- No --> Q{Are there more than two samples?}:::decision
%% Decision tree for normal data
C -- Yes --> E{Are the data independent?}:::decision
C -- No --> F[Paired t-test]:::process
E -- Yes --> G[Independent t-test: Comparison of two means]:::process
E -- No --> F
%% More than two groups?
C -- Yes --> K{Are there more than two groups?}:::decision
K -- Yes --> L{Is the experimental design with independent groups?}:::decision
L -- No --> M[Two-way ANOVA: Analysis of Variance with two factors]:::process
L -- Yes --> N[One-way ANOVA: Analysis of Variance with one factor]:::process
%% Covariable: ANCOVA %%
B -- Yes --> O{Is there a confounding variable?}:::decision
O -- Yes --> P[ANCOVA: Analysis of Covariance]:::process
O -- No --> K
%% Path for non-normally distributed data %%
B -- No --> Q
Q -- Yes --> R{Is the experimental design with dependent samples?}:::decision
Q -- No --> S[Friedman Test: Dependent Samples]:::process
R -- Yes --> S
R -- No --> T[Kruskal-Wallis Test: Independent Samples]:::process
%% Handling non-parametric correlation %%
B -- No --> D{Is there a difference in mean values?}:::decision
D -- Yes --> H{Is there a nonlinear relationship?}:::decision
D -- No --> U[Mann-Whitney U test: Two samples]:::process
H -- Yes --> J[Spearman’s Correlation: Monotonic Relationship]:::process
H -- No --> U
%% End states for tests
G -->|Done| Z[End of Test]:::z
F -->|Done| Z
M -->|Done| Z
N -->|Done| Z
P -->|Done| Z
T -->|Done| Z
S -->|Done| Z
J -->|Done| Z
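One way to start on this workflow is to encode the main branches as a small pure function before wiring them into the library. The sketch below follows the usual textbook mapping (normality, number of groups, paired or independent design, presence of a covariate) rather than reproducing every edge of the chart; the function name `select_test`, its parameters, and the returned labels are hypothetical.

```python
def select_test(normal, n_groups, paired, has_covariate=False):
    """Map basic study-design facts to the name of a statistical test."""
    if n_groups < 2:
        raise ValueError("at least two groups are required")

    if normal:
        # A covariate to adjust for points to ANCOVA.
        if has_covariate:
            return "ancova"
        if n_groups == 2:
            return "paired_t_test" if paired else "independent_t_test"
        return "repeated_measures_anova" if paired else "one_way_anova"

    # Non-parametric counterparts for non-normal data.
    if n_groups == 2:
        return "wilcoxon_signed_rank" if paired else "mann_whitney_u"
    return "friedman" if paired else "kruskal_wallis"


if __name__ == "__main__":
    print(select_test(normal=True, n_groups=2, paired=False))
    print(select_test(normal=False, n_groups=3, paired=True))
```

Keeping the decision logic separate from the code that runs the tests makes each branch of the chart easy to unit-test on its own.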