
Beta 0.3.0

@ryanurbs ryanurbs released this 06 Aug 06:15
· 87 commits to main since this release

The current version of STREAMLINE builds on our initial STREAMLINE project release, Beta 0.2.5, and has since undergone a major refactoring of STREAMLINE's codebase. Many functionalities have been reorganized and extended.

Major Updates ----------------------------------------

  • Extended STREAMLINE to run in parallel on 7 different types of HPC clusters using dask_jobqueue
  • Extended Phase 1 (previously EDA) to include numerical data encoding, automated data cleaning, feature engineering, and a second round of EDA:
    • Added numerical encoding for any binary, text-valued features; a map file, Numerical_Encoding_Map.csv, is output to document the numerical mapping of the original text values
    • Added quantitative_feature_path parameter in addition to categorical_feature_path allowing users to indicate which features to treat as categorical vs. quantitative (or specify one list and all other features will be treated as the other type). New .csv output files are also generated to identify what features were treated as one feature type or the other after data processing.
    • Added automated feature engineering of 'missingness' features to evaluate missingness as being predictive (assuming MNAR) along with featureeng_missingness parameter to control this function. Missingness_Engineered_Features.csv is output to document what features were added to the processed dataset as a result.
    • Added automated cleaning of features with high 'missingness'; with cleaning_missingness parameter added to control this function. Missingness_Feature_Cleaning.csv is output to document what features were removed from the processed dataset as a result.
    • Added automated cleaning of instances with high 'missingness'; with cleaning_missingness parameter added to control this function.
    • Added automated one-hot-encoding of all numerical and text-valued categorical features (with 3 or more values) so that they are treated as categorical throughout all STREAMLINE phases.
    • Added automated cleaning of highly correlated features (one feature randomly removed out of a highly correlated feature pair); with correlation_removal_threshold parameter added to control this function. correlation_feature_cleaning.csv is output to document what features were removed in this way.
    • Added DataProcessSummary.csv output file to document changes in feature, feature type, instance, class, and missing value counts during each new cleaning/engineering step.
    • Added a secondary EDA applied to the processed dataset, saved as output files separate from those of the 'initial' EDA.
  • Adapted the 'replication' phase of STREAMLINE to process the replication data in the same way as the initial 'target dataset' ensuring that the same features are present. This accounts for any new 'as-of-yet' unseen values for categorical features that had previously been one-hot-encoded.
  • Added ability to run the whole pipeline as a single command in the different command line run modes (i.e. from the command line locally or on an HPC). This includes the addition of a variety of new command-line specific run parameters.
  • Added support for running STREAMLINE from the command line using a configuration file (in addition to command-line parameters)
  • Modularized all ML modeling algorithms within classes, which allows users to (relatively easily) add other scikit-learn-compatible classification modeling algorithms to the STREAMLINE codebase by creating a Python file in streamline/models/ based on the base model template. This lets code-savvy users easily add other algorithms we have not yet included, including their own.
  • As a demonstration of the ability to add new ML algorithms in this way, we've added Elastic Net (EN) as the 16th ML algorithm included within STREAMLINE.
  • Extended Google Colab Notebook to (1) automatically download the latest version of STREAMLINE, (2) offer separate 'Easy' and 'Manual' run modes for users to apply the notebook to their own data, where 'Easy' mode uses a prompt to gather essential run parameter information including a file navigation window to select the target dataset folder, (3) automatically download the output experiment folder and open the PDF summary reports on their screen (with user permission).
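As an illustration of the binary text-value encoding step described above, here is a minimal sketch; the function name and data layout are hypothetical stand-ins, not STREAMLINE's actual API, though the output mirrors the role of Numerical_Encoding_Map.csv:

```python
def encode_binary_text_features(rows, features):
    """Map each binary, text-valued feature to 0/1 and record the mapping.

    `rows` is a list of dicts (feature name -> value). Returns the encoded
    rows plus a mapping of the kind documented in a file like
    Numerical_Encoding_Map.csv.
    """
    mapping = {}
    for feat in features:
        values = sorted({r[feat] for r in rows})
        # Only binary, text-valued features are numerically encoded.
        if len(values) == 2 and all(isinstance(v, str) for v in values):
            mapping[feat] = {values[0]: 0, values[1]: 1}
    encoded = [
        {f: (mapping[f][v] if f in mapping else v) for f, v in r.items()}
        for r in rows
    ]
    return encoded, mapping

rows = [{"sex": "F", "age": 34}, {"sex": "M", "age": 51}]
encoded, mapping = encode_binary_text_features(rows, ["sex", "age"])
print(encoded)  # [{'sex': 0, 'age': 34}, {'sex': 1, 'age': 51}]
print(mapping)  # {'sex': {'F': 0, 'M': 1}}
```

Note that `age` is left untouched: it has two values but they are not text, so no encoding is needed.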
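The two missingness steps (engineering indicator features, then cleaning high-missingness features) might be sketched as below; the function name, thresholds, and defaults are illustrative stand-ins for the behavior controlled by the featureeng_missingness and cleaning_missingness parameters:

```python
def engineer_and_clean_missingness(rows, add_threshold=0.0, drop_threshold=0.5):
    """Add a binary '<feature>_missing' indicator for features with any
    missingness, then drop features whose missing fraction exceeds
    drop_threshold. Sketch only; thresholds are illustrative."""
    features = list(rows[0])
    n = len(rows)
    miss_frac = {f: sum(r[f] is None for r in rows) / n for f in features}
    out = []
    for r in rows:
        new = {}
        for f in features:
            if miss_frac[f] > drop_threshold:
                continue  # cleaned: too much missingness to keep the feature
            new[f] = r[f]
            if miss_frac[f] > add_threshold:
                new[f + "_missing"] = int(r[f] is None)  # engineered feature
        out.append(new)
    return out, miss_frac

rows = [{"age": 34, "albumin": None}, {"age": 51, "albumin": 3.2}]
cleaned, frac = engineer_and_clean_missingness(rows)
print(cleaned[0])  # {'age': 34, 'albumin': None, 'albumin_missing': 1}
```

In STREAMLINE the engineered and removed features are documented in Missingness_Engineered_Features.csv and Missingness_Feature_Cleaning.csv, respectively; here they would be derived from the returned mapping.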
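A minimal sketch of the one-hot-encoding rule (applied only when a categorical feature has 3 or more distinct values); the helper is hypothetical, not STREAMLINE's internal code:

```python
def one_hot_encode(rows, feature):
    """One-hot encode `feature` when it has 3 or more distinct values;
    binary features are left for the 0/1 numerical-encoding step instead."""
    values = sorted({r[feature] for r in rows})
    if len(values) < 3:
        return rows
    encoded = []
    for r in rows:
        new = {k: v for k, v in r.items() if k != feature}
        for v in values:
            new[f"{feature}_{v}"] = int(r[feature] == v)  # one column per value
        encoded.append(new)
    return encoded

rows = [{"stage": "I", "x": 1}, {"stage": "II", "x": 2}, {"stage": "III", "x": 3}]
print(one_hot_encode(rows, "stage")[0])
# {'x': 1, 'stage_I': 1, 'stage_II': 0, 'stage_III': 0}
```

Encoding every multi-valued categorical feature this way is what lets later phases (and replication data processing) rely on a fixed, purely numeric column set.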
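The correlated-feature cleaning can be sketched in a few lines. Note that STREAMLINE removes one feature of a highly correlated pair at random; this illustrative version deterministically drops the later feature so its output is reproducible:

```python
def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no external libraries."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def drop_correlated(columns, threshold=0.95):
    """Drop one feature from each pair whose |r| exceeds the threshold
    (a stand-in for the correlation_removal_threshold parameter)."""
    names = list(columns)
    dropped = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if abs(pearson(columns[a], columns[b])) > threshold:
                dropped.add(b)  # STREAMLINE would pick one of the pair randomly
    kept = {f: v for f, v in columns.items() if f not in dropped}
    return kept, sorted(dropped)

cols = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [1, 0, 2, 1]}
kept, dropped = drop_correlated(cols)
print(dropped)  # ['y']  (y is perfectly correlated with x)
```

The dropped-feature list plays the role of correlation_feature_cleaning.csv in the real pipeline.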

Minor Updates --------------------------------------------

  • Reverted to using the mean (rather than the median) to present and sort model feature importances in plots (the median was introduced in Beta 0.2.4). This prevents confusion when running the notebook demos on the demonstration datasets: with 3-fold CV, the median of all decision tree model feature importance scores is 0, which confounds picking and sorting the top features for plotting and eliminates decision trees from the composite feature importance plots. A hard-coded option to revert to median ranking remains available in the fi_stats() function within statistics.py.
  • Updated the repository folder hierarchy, filenames, and some output file names.
  • Updated STREAMLINE phase groupings/numberings.
  • Updated the STREAMLINE schematic figure to reflect all major changes and new phase grouping.
  • Updated the feature correlation heatmap outputs: (1) a clearer color scheme, (2) showing the non-redundant triangle rather than the full square, and (3) scaling the feature names to avoid overlap, and hiding names entirely when there are so many features that they would be unreadable.
  • Feature correlation results are now also documented within FeatureCorrelations.csv.
  • Reformatted the PDF output summary files to (1) add and re-organize all run parameters on the first page, (2) indicate the STREAMLINE version on the bottom of the page, and (3) include the new data processing/counts summary.
  • Univariate analysis output files now include the statistical test applied and its test statistic in addition to p-values.
  • Updated the STREAMLINE Jupyter Notebook and other 'Useful Notebooks' to function with this new code framework.
  • Created a new hcc_data_custom.csv dataset for the demo that adds simulated features and instances to hcc_data.csv to explicitly test (and demonstrate the functionality of) the new automatic data cleaning and engineering steps in STREAMLINE Phase 1. Similarly, created a replication dataset, hcc_data_custom_rep.csv, which adds some noise to hcc_data_custom.csv along with some other custom additions to demonstrate replication functionality. The code to generate these 'custom' datasets from hcc_data.csv is included in the data folder as the notebook Generate_expanded_HCC_Dataset.
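The mean-vs-median ranking issue mentioned above is easy to reproduce: with 3-fold CV, a feature that receives a nonzero decision tree importance in only one fold has a median of 0 across folds even though its mean is positive, so median ranking silently discards it. A minimal illustration (the scores are made up):

```python
from statistics import mean, median

# Importance scores for one feature across 3 CV folds: nonzero in one fold only.
fold_importances = [0.0, 0.0, 0.42]

print(median(fold_importances))  # 0.0  -> feature vanishes from median-ranked plots
print(mean(fold_importances))    # ~0.14 -> feature is retained under mean ranking
```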