NaturallyDrifted
NaturallyDrifted is a drift detection library that focuses on drift detection in text data.

Table of Contents

  1. Installation and Usage
  2. Drift Detector Fundamentals

Importing Code directly into Google Drive

Step 1: Loading packages and setting up the environment

  • Import the Drift Detection folder onto Google Drive. You can clone the repo or download the folder from the following Github repo. If you are new to Github, this resource might come in useful.
  • Launch a new Google Colab notebook. If you are new to Colab, this tutorial might come in useful.
  • Connect to Google Drive using the following commands:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
  • Specify the path to Drift Detection folder in Google Drive
filepath = '[path to Drift Detection folder]'
# Ex: filepath = '/content/gdrive/MyDrive/DriftDetection'
  • Load the relevant packages
!pip install -r {filepath}/requirements.txt
  • Load additional python functions and classes
import sys
filepath = str(filepath)
sys.path.insert(0, filepath) # very important
from fileImports import imports 
samplingData, baseModels, embedding, distributions, detectors, AlibiDetectors = imports.run()

Step 2: Loading and processing the Data

  • Load and process data. One example with the IMDB dataset is as follows:
import nlp
import numpy as np

# load a dataset with the `nlp` package and return texts and labels as arrays
def load_dataset(dataset: str, split: str = 'test'):
    data = nlp.load_dataset(dataset)
    X, y = [], []
    for x in data[split]:
        X.append(x['text'])
        y.append(x['label'])
    X = np.array(X)
    y = np.array(y)
    return X, y

X, y = load_dataset('imdb', split='train')
print(X.shape, y.shape)
  • Split it into the different pieces that will act as the reference and deployment data
X1 = X[:round(X.shape[0]*.4)] # data_ref, data_h0
X2 = X[round(X.shape[0]*.4):] # data_h1

Step 3: Drift Detection

  • Check for drift. An example using SBERT embeddings with an online MMD test is given below
# define variables/parameters
sample_size = 500
windows = 10

test = "MMD"
drift_type = "Online"
embedding_model = 'SBERT'
SBERT_model = 'bert-base-uncased'

ert = 50
n_runs = 20
window_size = 20


# initialize the detector class with the above parameters
# (named `detector` so it does not shadow the imported `detectors` module)
detector = allDetectors(
                #### Step 1: data/sampling related parameters
                data_ref = X1, data_h0 = X1, data_h1 = X2, sample_size = sample_size,

                #### Step 2: text embedding related parameters
                embedding_model = embedding_model, SBERT_model = SBERT_model,

                #### Step 3: drift detection test and drift type related parameters
                test = test, drift_type = drift_type,

                #### Step 4: selected drift detector related parameters
                ert = ert, n_runs = n_runs, window_size = window_size
                )

# run the code to get the detector results
result = detector.run()

[To be completed]

Step 1: Loading packages and setting up the environment

  • Set up your virtual environment and cd into it. For users with M1-chip Macs, setting up the environment might be a little more involved. There are online resources that can help you set up Tensorflow/Transformers on M1 Macs, such as this article
cd [path to venv]
conda activate [name of venv]
  • Load the Drift Detection folder/cloned repo into the same folder as your virtual environment.
  • Launch a jupyter notebook
jupyter notebook
  • cd into the Drift Detection folder
  • Load the relevant packages
pip install -r requirements.txt
  • Load additional python functions and classes (same imports as in the Google Colab section above)

Steps 2 and 3 are the same as the ones in the Google Colab section.

To detect drifts, we need to look at the "reference data" as well as the comparison data. A convenient (but not the only) way to divide our data for these analyses is as follows:

data_ref: np.ndarray, list

  • This is the dataset that is used as the reference/baseline when detecting drifts. For instance, if our test of choice is KL divergence, then we declare a possible drift based on whether other data is close in distribution to data_ref.
  • Generally, the goal is to have all future datasets stay close (in embeddings and distributions) to data_ref, which is how we conclude that there is no drift in the dataset.
  • data_ref is typically sampled from the "training data". In real-world applications, this is the data on which the test is modelled, because it would generally be the only data the user has access to at that point in time.

data_h0: np.ndarray, list (optional)

  • This is generally the same dataset as data_ref (or a stream that arrives soon after). We use the absence of drift in data_h0 (with data_ref as our reference) as a necessary condition for the robustness of the drift detection method.
  • If the method ends up detecting a drift in data_h0 itself, we know it is most likely not doing a good job, because both data_ref and data_h0 are expected to come from the same source and should hence result in similar embeddings and distributions. If the user is confident in the efficacy of their drift detection method, it would be worthwhile to consider changing the sizes of data_ref and data_h0 and re-evaluating detector performance before proceeding to data_h1.

data_h1: np.ndarray, list

  • This is the primary dataset on which we can expect to possibly detect a drift. In the real world, this would usually be the dataset we get post model deployment. To test detectors, a convenient (but not necessarily the best) practice is to take the test data and use it as our proxy for the deployed dataset.
  • Multiple research papers and libraries also tend to use "perturbed" data as their choice of data_h1. Perturbations can include corruptions in images (vision data) or the introduction of unnecessary words and phrases (text data). This is generally the first step in testing the efficacy of a drift detection method. Once again, if the detector fails to detect a drift on manually perturbed data, it is quite likely it will not be able to detect drifts in the real, deployed data either.
  • Therefore, for our purposes, we have tried to minimize the use of artificially perturbed data and instead rely on test data/data from far-away time periods as our data_h1 source.

sample_size: int

Decides the number of samples from each of the above three datasets that we work with. For instance, if the entire training set has 100K sentences, we can set sample_size = 500 to randomly sample 500 of those sentences.
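A minimal sketch of such sampling with NumPy (the array names follow the earlier IMDB example and are purely illustrative):

import numpy as np

# draw sample_size sentences at random, without replacement
sample_size = 500
idx = np.random.choice(len(X1), size=sample_size, replace=False)
X1_sample = X1[idx]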

windows: int (optional)

This parameter is relevant for gradual drifts and helps break down the data into a certain number of buckets. These buckets can act like “batches” or “data streams”. The idea behind this approach is that we are trying to localize drifts to a certain time frame and check for consistencies (or lack thereof) in detection. If data_h1 has 100K data points and we wish to detect drifts gradually over time, a proxy approach would be to break the data into sets of 5K points and then randomly sample from each set separately.
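A short sketch of this bucketing, assuming data_h1 is a NumPy array (the names X2, windows and sample_size echo the earlier example and are illustrative):

import numpy as np

# split data_h1 into `windows` sequential buckets and sample from each,
# so drift can be checked per bucket/time frame
windows, sample_size = 10, 500
buckets = np.array_split(X2, windows)   # X2 plays the role of data_h1
samples = [b[np.random.choice(len(b), size=min(sample_size, len(b)), replace=False)]
           for b in buckets]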

embedding_model: str

This parameter decides the kind of embedding the text goes through. The embeddings we consider thus far are:

  • SBERT: a Python framework for state-of-the-art sentence, text and image embeddings.
  • Universal Sentence Encoders (USE): encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks.
  • Doc2Vec: a generalization of Word2Vec, which in turn is an algorithm that uses a neural network model to learn word associations from a large corpus of text.

SBERT_model: str

This parameter is specific to the SBERT embedding models. If we choose to work with SBERT, we can specify the type of SBERT embedding out here. Ex. 'bert-base-uncased'

transformation: str

Embeddings produce high-dimensional vector spaces. For instance, USE results in 512 dimensions and 'bert-base-uncased' in 768. We can thus use a neural network technique, such as an autoencoder, to reconstruct the data in reduced dimensions; Alibi detectors use Untrained Autoencoders. For feature-level tests such as KLD or JSD, such a large dimension might not be feasible to analyse, so we can instead reduce the dimensionality by selecting the most important components using methods such as PCA and SVD.
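An illustrative sketch of the PCA route with scikit-learn (the embedding array is a random stand-in for real sentence embeddings):

import numpy as np
from sklearn.decomposition import PCA

# reduce 768-dim embeddings to a handful of components before
# running feature-level tests such as KL or JS divergence
embeddings = np.random.rand(500, 768)     # stand-in for real embeddings
pca = PCA(n_components=25)
reduced = pca.fit_transform(embeddings)   # shape: (500, 25)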

Step 3: Specifying Drift Detection test and related parameters

test: str

Specify the kind of drift detection test we want: "KS", "KL", "JS", "MMD", "LSDD" (discussed below).

drift_type: str

Specify the drift type we are looking for, based on time/frequency: "Sudden", "Gradual", or "Online" (discussed below).

ert: int

The expected run time before a drift is detected. Alibi Detect uses this calibration for its online drift detectors: in the absence of drift, the detector runs for ert instances on average before raising a false alarm. If the average run time on the reference data is significantly higher than the average run time on the drifted data, that might indicate a possible drift.

window_size: int

This parameter is used within Alibi's online detectors. 
It specifies the number of datapoints to include in one window.

n_runs: int

This parameter is used within Alibi's online detectors and specifies the number of runs the detector must perform before we can get an average ERT.

n_bootstraps: int

This parameter is used within Alibi's online detectors. It sets the number of bootstrap simulations used to estimate the detection threshold for the requested ERT (see the sketch below).
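A hedged sketch of how ert, window_size and n_bootstraps feed Alibi Detect's online MMD detector; the embedding arrays here are random stand-ins for the real reference and deployment embeddings produced in Step 2:

import numpy as np
from alibi_detect.cd import MMDDriftOnline

x_ref = np.random.rand(500, 768).astype('float32')
cd = MMDDriftOnline(x_ref, ert=50, window_size=20, n_bootstraps=2500,
                    backend='tensorflow')

# feed instances one at a time; the detector flags drift once the test
# statistic crosses the threshold calibrated for the requested ERT
for x in np.random.rand(100, 768).astype('float32'):
    pred = cd.predict(x)
    if pred['data']['is_drift']:
        print('Drift detected at t =', cd.t)
        break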

context_type: str

Context that we wish to ignore when testing for drift:

  • sub-population: if we wish to ignore a relative change in the sub-populations of certain classes

iterations: int

We can run through multiple iterations of the embeddings to make our drift detection test more robust. For instance, if we only detect a drift on 1 out of 10 iterations, then we might be better off not flagging a drift at all.
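An illustrative majority-vote sketch (run_detector_once is a hypothetical single-run helper, not part of the library):

iterations = 10
flags = []
for _ in range(iterations):
    result = run_detector_once()       # hypothetical helper: one detector run
    flags.append(result['is_drift'])
drift_flag = sum(flags) > iterations / 2   # flag only if most runs agree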


Based on Data (What kind of drift took place (features, labels, model/concept)?)


Covariate Drifts

When the input data drifts: P(X) != Pref(X) even though P(Y|X) = Pref(Y|X). Such drifts happen when the distribution of the input data, or of some of the features, drifts. The drift can happen gradually, or right after deployment (discussed in the next section). For further reading, please refer to this Seldon article

Prior Drifts

When the output data drifts: P(Y) != Pref(Y) even though P(X|Y) = Pref(X|Y). For instance, let's say we are trying to predict whether people with certain symptoms have COVID-19. If we pick a training dataset from before the pandemic and test our model on a dataset from during the pandemic, our label distributions will be vastly different. The distribution of features of those people (e.g. age, health parameters, location) would be exactly the same, but the test data would have far more COVID-19-positive labels.

Concept Drifts

When the process generating Y from X drifts: P(Y|X) != Pref(Y|X). Concept drift happens when the relationship between the input data (X) and the outputs (Y) changes. For further reading, please refer to the Alibi Detect documentation

Based on time/frequency (When did the drift happen?)


Sudden

These drifts generally happen quite instantaneously, right after deployment, likely because of a very immediate change in some external factor. Ex. a sudden drift in labels from "News" to "Sports" right after an election and right before the Olympics.

Gradual

These drifts, as the name suggests, happen more gradually over time. They might not be very obvious right away, and depending on our thresholds there is a possibility we might not catch them, especially in the earlier time periods. One example could be the gradual drift in labels as we go from the peak of the pandemic all the way to its trough. Often such outbreaks reduce in intensity over time, and hence the drift in labels might be more gradual.

Incremental

Incremental drifts are similar to gradual drifts, with the added property of consistency: the drift increases consistently over time, without the occasional drops seen in gradual drifts.

Recurrent

Recurrent drifts are drifts for which the model requires perpetual retraining. These drifts can be challenging to work with, as they can be hard to identify and will often require both our reference and comparison data to be updated after certain time intervals. For the previous three drift types, the reference dataset stayed constant and we tested for sudden or gradual drifts on all the data that came after it. For recurrent drifts, however, we cannot test against all data into eternity and will have to keep updating our reference as we move forward in time.

Feature Level

Kolmogorov–Smirnov (2-sample) test


A nonparametric test of the equality of continuous distributions. It quantifies the distance between the empirical distributions of two samples. For further reading, a possible resource you can refer to is this article
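A minimal sketch using SciPy's two-sample KS test, run per embedding dimension (the arrays are random stand-ins for reference and deployment embeddings):

import numpy as np
from scipy.stats import ks_2samp

ref_emb = np.random.rand(500, 25)
new_emb = np.random.rand(500, 25)
# one KS test per dimension; count dimensions that reject equality at p < 0.05
p_vals = np.array([ks_2samp(ref_emb[:, i], new_emb[:, i]).pvalue
                   for i in range(ref_emb.shape[1])])
print('dimensions flagged:', int((p_vals < 0.05).sum()))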

Kullback–Leibler divergence


A distribution-dependent test that calculates the divergence between two distributions. This resource gives a good overview of how we can implement KL divergence in Python.
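A minimal sketch of one way to compute KL divergence on binned data with SciPy (the samples are random stand-ins; the epsilon guards against empty bins):

import numpy as np
from scipy.stats import entropy

# histogram one feature from each dataset into shared bins
p_counts, bins = np.histogram(np.random.rand(500), bins=20, range=(0, 1))
q_counts, _ = np.histogram(np.random.rand(500), bins=bins)
eps = 1e-10
p = p_counts / p_counts.sum() + eps
q = q_counts / q_counts.sum() + eps
print('KL divergence:', entropy(p, q))   # entropy(p, q) computes KL(p || q)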

Jensen–Shannon divergence

A symmetric version of KL divergence. It essentially measures the average KL divergence of each distribution from their mixture, which makes it symmetric and bounded, unlike plain KL divergence.
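A minimal sketch with SciPy (note that SciPy returns the Jensen-Shannon distance, the square root of the divergence; the samples are random stand-ins):

import numpy as np
from scipy.spatial.distance import jensenshannon

p_counts, bins = np.histogram(np.random.rand(500), bins=20, range=(0, 1))
q_counts, _ = np.histogram(np.random.rand(500), bins=bins)
p = p_counts / p_counts.sum()
q = q_counts / q_counts.sum()
print('JS divergence:', jensenshannon(p, q) ** 2)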

Data Level

Maximum Mean Discrepancy (MMD)


A kernel-based statistical test used to determine whether two given distributions are the same, first proposed in the paper "A kernel two-sample test". MMD quantifies the distance between the mean embeddings (distribution mappings into a Reproducing Kernel Hilbert Space) of the distributions. This feature map reduces the distributions to simpler mathematical values.
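A hedged sketch of an offline MMD test with Alibi Detect (the arrays are random stand-ins for real reference and deployment embeddings):

import numpy as np
from alibi_detect.cd import MMDDrift

x_ref = np.random.rand(500, 768).astype('float32')
x_new = np.random.rand(500, 768).astype('float32')
cd = MMDDrift(x_ref, backend='tensorflow', p_val=0.05)
preds = cd.predict(x_new)
print('drift?', bool(preds['data']['is_drift']))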

Least Squares Density Difference (LSDD)

LSDD is a method grounded in least squares that estimates the difference between the distributions of a data pair without computing the density distribution of each dataset independently. It was first proposed in the paper Density Difference Estimation

Learned Kernel

Learned kernel drift detectors are very similar to MMD detectors, but the "kernel" in the MMD is replaced by a "deep kernel", in an attempt to create more complex mappings for more complex distributions.

Prior drifts refer to a drift in the labels of the dataset, even when the input feature distributions remain the same. For instance, let's say we want to classify, through a diagnostic test, whether a patient has a certain allergy. If the frequency of the allergy dramatically changes, that would be an example of a prior drift.

A concept drift occurs when there is a fundamental change in disease presentation (same inputs, different output). For instance, in the same problem as above, if the severity of the allergic response now varies across populations (possibly because of differing immunities in different climates), that would be an example of a concept drift.

Sentence Transformers (SBERT)

SBERT is a Python framework for state-of-the-art sentence, text and image embeddings. The initial work is described in the paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. This framework can be used to compute sentence/text embeddings for more than 100 languages.
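A minimal sketch with the sentence-transformers package; any model name available on the Hugging Face hub (such as 'bert-base-uncased', used earlier) can be passed in:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-uncased')
embeddings = model.encode(['The patient reported mild symptoms.',
                           'No adverse reactions were observed.'])
print(embeddings.shape)   # (2, 768) for bert-base-uncased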

Universal Sentence Encoders (USE)

USE encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks. For further reading, refer to the Tensorflow Hub documentation
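A minimal sketch loading USE from TF Hub (this is the standard public module URL):

import tensorflow_hub as hub

embed = hub.load('https://tfhub.dev/google/universal-sentence-encoder/4')
vectors = embed(['The patient reported mild symptoms.',
                 'No adverse reactions were observed.'])
print(vectors.shape)   # (2, 512)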

Document Embeddings (Doc2Vec/ Word2Vec, Glove)

Word2Vec

An algorithm that uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. For further reading, please refer to the paper Efficient Estimation of Word Representations in Vector Space

Doc2Vec

An NLP tool for representing documents as vectors; it is a generalization of the Word2Vec method. For further usage guidance, please refer to the Gensim documentation
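A minimal Gensim sketch: train a tiny Doc2Vec model on a toy corpus and infer a vector for an unseen document (the corpus is an illustrative stand-in):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(words=doc.split(), tags=[i])
          for i, doc in enumerate(['patient shows mild symptoms',
                                   'no adverse reactions observed',
                                   'severe allergic response recorded'])]
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=20)
vec = model.infer_vector('new patient note'.split())
print(vec.shape)   # (50,)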

About

Drift Detection Pipeline with focus on Clinical Text
