A set of examples demonstrating Ignite Machine Learning capabilities.
In most insurance claims processing applications, each claim is submitted after a service has been provided. In many verticals, such as healthcare, the service comes first and some kind of payment (reimbursement) is expected afterwards; the claim transaction is the information format for requesting that payment. Claims processing is usually a high-volume operation, so un-reimbursed claims arrive in large batches. From a workflow-management perspective, a claims process captures the most value soonest if it can sort out the "higher value" claims and prioritize them for processing ahead of "lower value" claims.

This highest/medium/lowest value segmentation can be calculated as a statistical output, based on the reimbursement amount plus the expected likelihood of obtaining that amount in a timely manner. The segmentation can be estimated from the input fields of each transaction, if those fields are properly selected and formatted. In this use case, historical claims data containing the input values and the actual reimbursed value of each claim can be used, with ML processing, to predict the expected value of new incoming claims. The claims can then be prioritized for staff attention in order of financial value (again, think of this as the maximum amount expected in the shortest time), so that the maximum financial value is obtained from the claims adjudication process.
In this demo, we use simulated health claims training data, with inputs and statistically generated output values (dollar amounts as the labels). These can be generated in bulk (you can set the number of records to create benchmarking jobs that run on Apache Ignite clusters). The demo uses a statistical regression algorithm in order to (A) simulate a Random Forest parallel preprocessing and training workload, and then (B) use the trained Random Forest model to return the expected value of each new transaction in real time as it arrives.
If you've worked with ML training and preprocessing before, you know that preprocessing and training an ML model is a highly iterative process. This demo package makes many (over)simplifying assumptions about the training and evaluation process, and instead focuses on an example of how the various Apache Ignite ML components can be orchestrated together as key members of a more comprehensive enterprise ML ecosystem. For example, training takes lots of data: the 1000 or so records used here as a default (so that the training stage runs quickly on a laptop) would normally be far too few. If you have a cluster of machines, you can set the number of rows for the training data much higher. In addition, the code shows the Apache Ignite classes to use for predictions, evaluation of prediction accuracy, and even retraining of a model; in the real world, evaluation and retraining would take much more "think time" and detailed evaluation by a data science team, using data discovery tools with a suitable UI and statistical analysis tools such as those from the large Python ecosystem for data science (see this toolset from GridGain as one example: https://www.gridgain.com/docs/latest/getting-started/quick-start/python ).
The best way to think about the Apache Ignite ML library is as a high-performance, massively parallel platform (an "ML pipeline chassis," if you will) that works underneath third-party AutoML workflow tools to greatly accelerate ML workflows. Apache Ignite ML provides parallel processing on large clusters to speed up preprocessing, training, and predictive transaction processing.
There are 2 packages:

- org.gridgain.demo.interactive.* - this package runs the ML model preprocessing and training, using generated data. You then run a second program that acts like an actual client transaction engine, sending in new transactions that request predictions as outputs from the trained model; the model runs as a service and can be shared by multiple clients.
- org.gridgain.demo.batch - this one runs the entire ML pipeline as a single run, from start (generating the data) to finish (predicting outputs for new transactions).
At the end of this README, in the Configuration Details section, there are some configuration options (for example, to change the number of rows generated, or the data split between train and test) that you can review once you are familiar with how to run the project.
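For reference, the train/test split controlled by DATA_SPLIT can be sketched in plain Java. This is an illustration only (the class and method names here are hypothetical, not the demo's actual code):

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    /**
     * Splits rows into a train portion and a test portion by a DATA_SPLIT
     * fraction (e.g. 0.78 means 78% of rows go to training).
     */
    static <T> List<List<T>> split(List<T> rows, double trainFraction) {
        // Mirrors the documented constraint: DATA_SPLIT must be >0.0 and <1.0.
        if (trainFraction <= 0.0 || trainFraction >= 1.0)
            throw new IllegalArgumentException("DATA_SPLIT must be >0.0 and <1.0");
        int cut = (int) Math.round(rows.size() * trainFraction);
        List<List<T>> parts = new ArrayList<>();
        parts.add(rows.subList(0, cut));            // training portion
        parts.add(rows.subList(cut, rows.size()));  // test portion
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 100; i++)
            rows.add(i);
        List<List<Integer>> parts = split(rows, 0.78);
        System.out.println(parts.get(0).size() + " train / " + parts.get(1).size() + " test");
    }
}
```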
The code in this package runs the data generation and model building steps of the ML pipeline in the background; when you get to the predictive step, it stops and asks you to enter the number of "new" transactions that need predicted values assigned. For this first version of the demo, it simply saves both the predicted value and the "actual" value (which in this case is just a hidden value already provided with the generated data and ignored until this point).
NOTE: this code outputs a single large status file with all of the steps and timestamps saved. The file prefix can be specified in the [config/MLPLProperties.txt] file.
- Step0RunTestCacheNode.java
  You manually start this node as an optional cache server node, for example if you want to run inside your IDE. Run this when you want to test the ML pipeline within your IDE and don't have access to an existing cache cluster in which to deploy your training dataset.
- Step1DataProviderNode.java
  You manually start one instance (only one) to perform the data generation steps. No CSV files are output; the data is sent directly to the cache. No data generation is performed until the applicable method is called on the contained Service Grid proxy for data generation services, when you run [A0_Build_Model.java].
- Step2RFModelNode.java
  This node is run just once to perform preprocessing and training on the training dataset, supplied on demand by the data provider. No preprocessing or training is performed until the applicable operations are called on the Service Grid proxy for model services, when you run [A0_Build_Model.java].
- A0_Build_Model.java
  You run this when steps 0-2 are up and running; this code acts as an orchestrator that calls the services for data generation and model building (preprocessing and training).
- A1_Perform_Predictions.java
  When A0_Build_Model has run and the model service is ready, run this to first get some new synthetic transactions (the schema of the transactions is defined in the data provider service).
  NOTE: These are not real transactions coming from an actual client system; they are simulated, with the same shape as the training data, but their labels are "hidden" and ignored at first. The predicted values from the Random Forest model service and the now-revealed actual values are saved together and used to calculate the predictive accuracy of the Random Forest model. If the accuracy does not meet the hard-coded MAE threshold, the model is updated with this "new" data.
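The accuracy check described above comes down to a mean absolute error comparison against a hard-coded threshold. A minimal, self-contained sketch of that logic in plain Java (the class and method names are hypothetical, not the demo's actual code):

```java
public class MaeCheck {
    /** Mean absolute error between predicted and actual claim values. */
    static double mae(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0)
            throw new IllegalArgumentException("arrays must have the same non-zero length");
        double sum = 0.0;
        for (int i = 0; i < predicted.length; i++)
            sum += Math.abs(predicted[i] - actual[i]);
        return sum / predicted.length;
    }

    /** True when the error exceeds the threshold, i.e. the model should be retrained. */
    static boolean shouldRetrain(double[] predicted, double[] actual, double maeThreshold) {
        return mae(predicted, actual) > maeThreshold;
    }

    public static void main(String[] args) {
        double[] predicted = { 100.0, 250.0, 400.0 };
        double[] actual    = { 110.0, 240.0, 430.0 };
        // MAE here is (10 + 10 + 30) / 3, roughly 16.67.
        System.out.println("MAE = " + mae(predicted, actual));
        System.out.println("retrain at threshold 10.0? " + shouldRetrain(predicted, actual, 10.0));
    }
}
```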
1. cd <path>\ml-demo-interactive (the directory with pom.xml)
2. mvn clean install
3. mvn exec:java -Dexec.mainClass="org.gridgain.demo.interactive.<java class you want to run>"
- there are 3 services and 2 programs you need to run; see the "To Run" instructions below
- Rename the {your file prefix}-interactive-status.txt file if you want to save the output separately for each different run. You may need to delete it to save space.
- Step0RunTestCacheNode.java -> start the test cache server in your IDE (one or more) if you don't have cache servers up already
- Step1DataProviderNode.java -> start one node only; can be done in parallel with the other service nodes
- Step2RFModelNode.java -> start one Random Forest model node only; can be done in parallel with steps 0 and 1
- A0_Build_Model.java -> initializes the training dataset (see the properties file) and preprocesses / trains the Random Forest model. Once it completes, you can do predictions with simulated transactions.
- A1_Perform_Predictions.java -> makes calls on model methods to predict outputs for new transactions. When this runs, a loop asks you to enter one of these inputs at the console command line:
  a. Enter to default to 10 new transactions
  b. any number you want, say 100, to generate 100 new transactions
  c. "0" to quit entering new transactions and let the process continue on to the accuracy and retrain logic
- Once the transactions are run and predicted upon, A1_Perform_Predictions.java runs a simple MAE (mean absolute error) check; if this error exceeds the MAE you hard-coded, it sends the new transaction cache as input to the RF model for retraining.
The code in this package runs the entire Random Forest pipeline in one straight-through batch run. It performs these steps in order:
- "Step0" just starts an optional cache server node, for example if you want to run inside your IDE.
- Generates a sample training dataset of configurable size (based on claims processing data).
  NOTE: this writes the training dataset to a CSV file (turn this off if you plan to generate a large dataset). The CSV is only useful if you want to reuse the data as input to other training sets; it is not needed here, since the generated dataset is fed directly into a cache. NOTE: the CSV loader is not yet implemented in this code.
- The training dataset is preprocessed into a vectorized dataset and then sent to the Random Forest training model.
- A simple batch-oriented prediction cycle is run: first the predicted value is obtained from the Random Forest model, then both the predicted and actual values are saved in a map.
  NOTE: the same data generator is used to create the "live" transactions, so the label is ignored at first and then called up later to simulate an actual output value.
- A process compares the actuals to the predictions and, if a target MAE (mean absolute error) is exceeded, triggers an update to the Random Forest model and then points to the latest version.
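The preprocessing step above flattens each claim record into a numeric feature vector plus a label. A minimal illustration in plain Java, using a hypothetical claim schema (the real schema is defined in the demo's data provider service):

```java
import java.util.Arrays;

public class ClaimVectorizer {
    // Hypothetical claim fields for illustration only; the actual fields
    // are defined in the demo's data provider service.
    static class Claim {
        final double billedAmount;
        final int serviceCode;
        final int daysSinceService;
        final double reimbursedAmount; // the label, "hidden" for live transactions

        Claim(double billedAmount, int serviceCode, int daysSinceService, double reimbursedAmount) {
            this.billedAmount = billedAmount;
            this.serviceCode = serviceCode;
            this.daysSinceService = daysSinceService;
            this.reimbursedAmount = reimbursedAmount;
        }
    }

    /** Flattens a claim into the numeric feature array a regression trainer expects. */
    static double[] features(Claim c) {
        return new double[] { c.billedAmount, c.serviceCode, c.daysSinceService };
    }

    /** The training label: the amount actually reimbursed. */
    static double label(Claim c) {
        return c.reimbursedAmount;
    }

    public static void main(String[] args) {
        Claim c = new Claim(1200.0, 42, 14, 950.0);
        System.out.println(Arrays.toString(features(c)) + " -> label " + label(c));
    }
}
```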
1. cd <path>\ml-demo-interactive (i.e. the directory with pom.xml)
2. mvn clean install
3. mvn exec:java -Dexec.mainClass="org.gridgain.demo.batch.A0_Run_Steps0to5"
(1) Programs in package "org.gridgain.demo.batch" whose names start with A0* are parent programs that call the individual pipeline steps.
- One variant calls Steps 0-5 to start the cache server first, then goes on to generate synthetic data, preprocess, train, and then do predictions.
- The other variant assumes you have already started a cache cluster, so no cache server is started; it moves directly to generating synthetic data, preprocessing, training, and then predictions.
(2) Optional thin client UI: Step9RemoteUI.java
An optional thin client that connects to the cache when done, to see the number of entries. Edit it if you want to connect remotely to a Kubernetes cluster; right now it connects to localhost.
(3) Step0RunTestCacheNode is the test server (not used when you have already started cache servers externally).
(You can use any of these approaches to change settings: MLPLProperties.txt, environment variables, or edits in ConfigPipeLineSettings.java)
Properties file /config/MLPLProperties.txt :
ROWS=1212
CONFIG_FILE=ignite-client.xml
DATA_SPLIT=0.78
OUTPUT_DIR=
TEST_SERVER=
FILEPREFIX=200608-5-23
Environment variables :
CONFIG_FILE=ignite-client.xml <only used for local test, blank otherwise>
ROWS=1234 <data rows to be generated; if not set, the default is used>
DATA_SPLIT=.75 <if not set, the default is used; has to be >0.0 AND <1.0>
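The settings lookup can be sketched as a small resolver in plain Java. The precedence shown here (environment variable over properties file over built-in default) is an assumption for illustration; check ConfigPipeLineSettings.java for the demo's actual behavior:

```java
import java.util.Properties;

public class ConfigResolver {
    /**
     * Resolves one setting. Assumed precedence (hypothetical; the real order
     * is in ConfigPipeLineSettings.java): environment variable, then the
     * properties file, then a built-in default.
     */
    static String resolve(String key, Properties fileProps, String defaultValue) {
        String env = System.getenv(key);
        if (env != null && !env.isEmpty())
            return env;
        String fromFile = fileProps.getProperty(key);
        if (fromFile != null && !fromFile.isEmpty())
            return fromFile;
        return defaultValue;
    }

    public static void main(String[] args) {
        // Simulate the contents of config/MLPLProperties.txt.
        Properties props = new Properties();
        props.setProperty("ROWS", "1212");
        props.setProperty("DATA_SPLIT", "0.78");

        System.out.println("ROWS = " + resolve("ROWS", props, "1000"));
        System.out.println("OUTPUT_DIR = " + resolve("OUTPUT_DIR", props, "."));
    }
}
```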