Commit 93417d4

Merge pull request #266 from loglabs/shreyashankar/docs

Writing documentation

2 parents b44545d + a87f880 · commit 93417d4

11 files changed: +183 −124 lines

README.md (2 additions, 3 deletions)

@@ -5,10 +5,9 @@
 ![PyPI](https://img.shields.io/pypi/v/mltrace)
 [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

-`mltrace` tracks data flow through various components in ML pipelines and
-contains a UI and API to show a trace of steps in an ML pipeline that produces
-an output. It offers the following:
+`mltrace` is a lightweight, open-source Python tool to get "bolt-on" observability in ML pipelines. It offers the following:

+- interface to define data and ML tests for components in pipelines
 - coarse-grained lineage and tracing
 - Python API to log versions of data and pipeline components
 - database to store information about component runs

docs/requirements.txt (2 additions, 2 deletions)

@@ -2,10 +2,10 @@ furo
 Sphinx>=3.5.4
 sphinxcontrib-applehelp==1.0.2
 sphinxcontrib-devhelp==1.0.2
-sphinxcontrib-htmlhelp==1.0.3
+sphinxcontrib-htmlhelp>=1.0.3
 sphinxcontrib-jsmath==1.0.1
 sphinxcontrib-qthelp==1.0.3
-sphinxcontrib-serializinghtml==1.1.4
+sphinxcontrib-serializinghtml>=1.1.4
 attrs==20.3.0
 autopep8==1.5.6
 click==7.1.2

docs/source/changelog.rst (9 additions, 0 deletions)

@@ -2,6 +2,15 @@
 Changelog
 =========

+- :release:`0.17 <2021-11-04>`
+- :support:`-` Added ability to create tests and execute them before and after components are run. Also, the web app has a React Router refactor, thanks to `@Boyuan-Deng`.
+
+  .. warning::
+
+     This change requires a DB migration. You can follow the documentation to perform the migration_ if you are upgrading from a release prior to this one.
+
+- :feature:`226` Adds functionality to run triggers before and after components are run. Thanks `@aditim1359` for taking this on!
+
 - :release:`0.16 <2021-07-08>`
 - :support:`-` Added the review feature to aid in debugging erroneous outputs and functionality to log git tags to integrate with DVC.
docs/source/concepts.rst (65 additions, 4 deletions)

@@ -14,9 +14,31 @@ Knowing data flow is a precursor to debugging issues in data pipelines. ``mltrace``
 Data model
 ^^^^^^^^^^

-The two prominent client-facing abstractions are the :py:class:`~mltrace.entities.Component` and :py:class:`~mltrace.entities.ComponentRun` abstractions.
+The two prominent client-facing abstractions are the :py:class:`~mltrace.Component` and :py:class:`~mltrace.ComponentRun` abstractions.

-:py:class:`mltrace.entities.Component`
+:py:class:`~mltrace.Test`
+"""""""""
+
+The ``Test`` abstraction represents some reusable computation to perform on component inputs and outputs. Defining a ``Test`` is similar to writing a unit test:
+
+.. code-block:: python
+
+    from mltrace import Test
+
+    class OutliersTest(Test):
+        def __init__(self):
+            super().__init__(name='outliers')
+
+        def testSomething(self, df: pd.DataFrame):
+            ...
+
+        def testSomethingElse(self, df: pd.DataFrame):
+            ...
+
+Tests can be defined and passed to components as arguments, as described in the section below.
+
+:py:class:`mltrace.Component`
 """""""""

@@ -25,10 +47,49 @@ The ``Component`` abstraction represents a stage in a pipeline and its static metadata, such as:
 * description
 * owner
 * tags (optional list of string values to reference the component by)
+* tests

 Tags are generally useful when you have multiple components in a higher-level stage. For example, ETL computation could consist of different components such as "cleaning" or "feature generation." You could create the "cleaning" and "feature generation" components with the tag ``etl`` and then easily query component runs with the ``etl`` tag in the UI.

-:py:class:`mltrace.entities.ComponentRun`
+Components have a life-cycle:
+
+* ``c = Component(...)``: construction of the component object
+* ``c.beforeTests``: a list of ``Tests`` to run before the component is run
+* ``c.run``: a decorator for a user-defined function that represents the component's computation
+* ``c.afterTests``: a list of ``Tests`` to run after the component is run
+
+Putting it all together, we can define our own component:
+
+.. code-block:: python
+
+    from mltrace import Component
+
+    class Featuregen(Component):
+        def __init__(self, beforeTests=[], afterTests=[OutliersTest]):
+            super().__init__(
+                name="featuregen",
+                owner="spark-gymnast",
+                description="Generates features for high tip prediction problem",
+                tags=["nyc-taxicab"],
+                beforeTests=beforeTests,
+                afterTests=afterTests,
+            )
+
+And in our main application code, we can decorate any feature generation function:
+
+.. code-block:: python
+
+    @Featuregen().run
+    def generateFeatures(df: pd.DataFrame):
+        # Generate features
+        df = ...
+        return df
+
+See the next page for a more in-depth tutorial on instrumenting a pipeline.
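The before/run/after life-cycle described above can be illustrated with a small self-contained sketch. This is plain Python with made-up names (``MiniComponent``), not mltrace's actual implementation: a component wraps a function so that its before-hooks see the inputs and its after-hooks see the result.

```python
from typing import Callable, List, Optional

class MiniComponent:
    """Toy stand-in for the life-cycle above: run beforeTests, then the
    decorated function, then afterTests. Illustrative only, not mltrace code."""

    def __init__(self, name: str,
                 beforeTests: Optional[List[Callable]] = None,
                 afterTests: Optional[List[Callable]] = None):
        self.name = name
        self.beforeTests = beforeTests or []
        self.afterTests = afterTests or []

    def run(self, func: Callable) -> Callable:
        def wrapper(*args, **kwargs):
            for test in self.beforeTests:   # hooks see the component's inputs
                test(*args, **kwargs)
            result = func(*args, **kwargs)
            for test in self.afterTests:    # hooks see the computed result
                test(result)
            return result
        return wrapper

calls = []
c = MiniComponent("cleaning",
                  beforeTests=[lambda x: calls.append("before")],
                  afterTests=[lambda r: calls.append("after")])

@c.run
def double(x):
    calls.append("run")
    return 2 * x

result = double(21)  # calls == ["before", "run", "after"], result == 42
```

The decorator pattern is why ``c.run`` can capture dynamic metadata (and run tests) every time the wrapped function executes, without the function body changing.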
+:py:class:`mltrace.ComponentRun`
 """""""""

 The ``ComponentRun`` abstraction represents an instance of a ``Component`` being run. Think of a ``ComponentRun`` instance as an object storing *dynamic* metadata for a ``Component``, such as:
@@ -41,7 +102,7 @@ The ``ComponentRun`` abstraction represents an instance of a ``Component`` being
 * source code
 * dependencies (you do not need to manually declare)

-If you dig into the codebase, you will find another abstraction, the :py:class:`~mltrace.entities.IOPointer`. Inputs and outputs to a ``ComponentRun`` are stored as ``IOPointer`` objects. You do not need to explicitly create an ``IOPointer`` -- the abstraction exists so that ``mltrace`` can easily find and store dependencies between ``ComponentRun`` objects.
+If you dig into the codebase, you will find another abstraction, the :py:class:`~mltrace.IOPointer`. Inputs and outputs to a ``ComponentRun`` are stored as ``IOPointer`` objects. You do not need to explicitly create an ``IOPointer`` -- the abstraction exists so that ``mltrace`` can easily find and store dependencies between ``ComponentRun`` objects.

 You will not need to explicitly define all of these variables, nor do you have to create instances of a ``ComponentRun`` yourself. See the next section for logging functions and an example.
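As a rough mental model of how ``IOPointer``-style bookkeeping lets dependencies be inferred (a simplified, hypothetical sketch, not mltrace's actual algorithm): each run records pointers to its inputs and outputs, and a run's upstream dependencies are whichever earlier runs produced its inputs.

```python
def infer_upstream(runs):
    """runs: list of (component_name, inputs, outputs) tuples in execution order,
    where inputs/outputs are lists of pointer strings (e.g. file paths).
    Returns {component_name: set of upstream component names}."""
    produced_by = {}  # pointer -> name of the run that produced it
    deps = {}
    for name, inputs, outputs in runs:
        # A run depends on whichever earlier run produced each of its inputs.
        deps[name] = {produced_by[p] for p in inputs if p in produced_by}
        for p in outputs:
            produced_by[p] = name
    return deps

runs = [
    ("cleaning",   ["raw.csv"],      ["clean.csv"]),
    ("featuregen", ["clean.csv"],    ["features.csv"]),
    ("training",   ["features.csv"], ["model.pkl"]),
]
deps = infer_upstream(runs)
# deps["featuregen"] == {"cleaning"}; deps["training"] == {"featuregen"}
```

This is why you never declare dependencies by hand: matching output pointers to input pointers is enough to recover the coarse-grained lineage graph.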

docs/source/index.rst (4 additions, 2 deletions)

@@ -1,9 +1,9 @@
 mltrace documentation
 ===================================

-mltrace_ is an open-source Python tool to track data flow through various
-components and diagnose failure modes in ML pipelines. It offers the following:
+mltrace_ is a lightweight, open-source Python tool to get "bolt-on" observability in ML pipelines. It offers the following:

+- interface to define data and ML tests for components in pipelines
 - coarse-grained lineage and tracing
 - Python API to log versions of data and pipeline components
 - database to store information about component runs

@@ -32,6 +32,8 @@ Roadmap

 We are actively working on the following:

+- Component input and output monitoring
+- Stateful testing (i.e., being able to use historical component inputs and outputs in testing and monitoring)
 - API to log from any type of file, not just a Python file
 - Prometheus integrations to monitor component output distributions
 - Support for finer-grained lineage (at the record level)

docs/source/logging.rst (97 additions, 65 deletions)

@@ -17,21 +17,30 @@ For this example, we will add logging functions to a hypothetical ``cleaning.py``

 where ``SERVER_IP_ADDRESS`` is your server's IP address or "localhost" if you are running locally. You can also call ``mltrace.set_address(SERVER_IP_ADDRESS)`` in your Python script instead if you do not want to set the environment variable.

+If you plan to use the auto logging functionality for component run inputs and outputs (turned off by default), you will need to set the environment variable ``SAVE_DIR`` to the directory where you want to save versions of your inputs and outputs. The default is ``.mltrace`` in the user directory.
+
 Component creation
 ^^^^^^^^^^^^^^^^^^

-For runs of components to be logged, you must first create the components themselves using :py:func:`mltrace.create_component`. For example:
+For runs of components to be logged, you must first create the components themselves using :py:class:`mltrace.Component`. You can subclass the main ``Component`` class if you want to make a custom component, for example:

 .. code-block:: python

-    mltrace.create_component(
-        name="cleaning",
-        description="Removes records with data out of bounds",
-        owner="shreya",
-        tags=["etl"],
-    )
+    from mltrace import Component
+
+    class Cleaning(Component):
+        def __init__(self, name, owner, tags=[], beforeTests=[], afterTests=[]):
+            super().__init__(
+                name="cleaning_" + name,
+                owner=owner,
+                description="Basic component to clean raw data",
+                tags=tags,
+                beforeTests=beforeTests,
+                afterTests=afterTests,
+            )

-You only need to do this once; however, nothing happens if you run this code snippet more than once. It is fine to leave it in your Python file to run every time this file is run. If the component hasn't been created, you cannot have any runs of this component name. This is to enforce users to enter static metadata about a component, such as the description and owner, to better facilitate collaboration.
+Components are intended to be defined once and reused throughout your application. You can define them in a separate file or folder and import them into your main Python application. If you do not want a custom component, you can also just use the default ``Component`` class, as shown below.

 Logging runs
 ^^^^^^^^^^^^
@@ -43,49 +52,45 @@ Suppose we have a function ``clean`` in our ``cleaning.py`` file:

 .. code-block:: python

-    from datetime import datetime
     import pandas as pd

-    def clean_data(filename: str) -> str:
-        df = pd.read_csv(filename)
+    def clean_data(df: pd.DataFrame) -> str:
         # Do some cleaning
-        ...
-        # Save cleaned dataframe
-        clean_version = filename + '_clean_{datetime.utcnow().strftime("%m%d%Y%H%M%S")}.csv'
-        df.to_csv(clean_version)
-        return clean_version
+        clean_df = ...
+        return clean_df

-We can include the :py:func:`~mltrace.register` decorator such that every time this function is run, dynamic information is logged:
+We can include the :py:func:`~mltrace.Component.run` decorator such that every time this function is run, dynamic information is logged:

 .. code-block:: python

-    from datetime import datetime
-    from mltrace import register
+    from mltrace import Component
     import pandas as pd

-    @register(
-        component_name="cleaning", input_vars=["filename"], output_vars=["clean_version"]
+    c = Component(
+        name="cleaning",
+        owner="plumber",
+        description="Cleans raw NYC taxicab data",
     )
-    def clean_data(filename: str) -> str:
-        df = pd.read_csv(filename)
+
+    @c.run(auto_log=True)
+    def clean_data(df: pd.DataFrame) -> str:
         # Do some cleaning
-        ...
-        # Save cleaned dataframe
-        clean_version = filename + '_clean_{datetime.utcnow().strftime("%m%d%Y%H%M%S")}.csv'
-        df.to_csv(clean_version)
-        return clean_version
+        clean_df = ...
+        return clean_df
+
+We will refer to ``clean_data`` as the decorated component run function. The ``auto_log`` parameter is False by default, but you can set it to True to automatically log inputs and outputs. If ``auto_log`` is True, ``mltrace`` will save and log paths to any dataframes, variables with "data" or "model" in their names, and any other variables greater than 1 MB. ``mltrace`` will save to the directory defined by the environment variable ``SAVE_DIR``. If ``SAVE_DIR`` is not set, ``mltrace`` will save to a ``.mltrace`` folder in the user directory.

-Note that ``input_vars`` and ``output_vars`` correspond to variables in the function. Their values at the time of return are logged. The start and end times, git hash, and source code snapshots are automatically captured. The dependencies are also automatically captured based on the values of the input variables.
+If you do not set ``auto_log`` to True, then you will need to manually define your input and output variables in the :py:func:`~mltrace.Component.run` function. Note that ``input_vars`` and ``output_vars`` correspond to variables in the function. Their values at the time of return are logged. The start and end times, git hash, and source code snapshots are automatically captured. The dependencies are also automatically captured based on the values of the input variables.

 Python approach
 """""""""

-You can also create an instance of a :py:class:`~mltrace.entities.ComponentRun` and log it using :py:func:`mltrace.log_component_run` yourself for greater flexibility. An example of this is as follows:
+You can also create an instance of a :py:class:`~mltrace.ComponentRun` and log it using :py:func:`mltrace.log_component_run` yourself for greater flexibility. An example of this is as follows:

 .. code-block:: python

     from datetime import datetime
-    from mltrace.entities import ComponentRun
+    from mltrace import ComponentRun
     from mltrace import get_git_hash, log_component_run
     import pandas as pd
@@ -110,51 +115,78 @@ You can also create an instance of a :py:class:`~mltrace.entities.ComponentRun`

     return clean_version

-Note that in :py:func:`~mltrace.log_component_run`, ``set_dependencies_from_inputs`` is set to ``True`` by default. You can set it to False if you want to manually specify the names of the components that this component run depends on. To manually specify a dependency, you can call :py:func:`~mltrace.entities.ComponentRun.set_upstream` with the dependent component name or list of component names before you call :py:func:`~mltrace.log_component_run`.
+Note that in :py:func:`~mltrace.log_component_run`, ``set_dependencies_from_inputs`` is set to ``True`` by default. You can set it to False if you want to manually specify the names of the components that this component run depends on. To manually specify a dependency, you can call :py:func:`~mltrace.ComponentRun.set_upstream` with the dependent component name or list of component names before you call :py:func:`~mltrace.log_component_run`.

-End-to-end example
-^^^^^^^^^^^^^^^^^^
+Testing
+^^^^^^^

-To put it all together, here's an end to end example of ``cleaning.py``:
+You can define Tests, or reusable blocks of computation, to run before and after components are run. To define a test, you need to subclass the :py:class:`~mltrace.Test` class. Defining a test is similar to defining a Python unittest, for example:

 .. code-block:: python

-    """
-    cleaning.py
+    from mltrace import Test
+
+    class OutliersTest(Test):
+        def __init__(self):
+            super().__init__(name='outliers')
+
+        def testComputeStats(self, df: pd.DataFrame):
+            # Get numerical columns
+            num_df = df.select_dtypes(include=["number"])
+
+            # Compute stats
+            stats = num_df.describe()
+            print("Dataframe statistics:")
+            print(stats)
+
+        def testZScore(
+            self,
+            df: pd.DataFrame,
+            stdev_cutoff: float = 5.0,
+            threshold: float = 0.05,
+        ):
+            """
+            Checks to make sure there are no outliers using z score cutoff.
+            """
+            # Get numerical columns
+            num_df = df.select_dtypes(include=["number"])
+
+            z_scores = (
+                (num_df - num_df.mean(axis=0, skipna=True))
+                / num_df.std(axis=0, skipna=True)
+            ).abs()
+
+            if (z_scores > stdev_cutoff).to_numpy().sum() > threshold * len(df):
+                print(
+                    f"Number of outliers: {(z_scores > stdev_cutoff).to_numpy().sum()}"
+                )
+                print(f"Outlier threshold: {threshold * len(df)}")
+                raise Exception("There are outlier values!")
+
+Any function you expect to execute as a test must be prefixed with ``test`` in lowercase, like ``testSomething``. Arguments to test functions must be defined in the decorated component run function signature if the tests will be run before the component run function; otherwise, the arguments to test functions must be defined as variables somewhere in the decorated component run function. You can integrate the tests into components in the constructor:

-    File that cleans data.
-    """
+.. code-block:: python

-    from datetime import datetime
-    from mltrace import create_component, register
+    from mltrace import Component
     import pandas as pd

-    @register(
-        component_name="cleaning", input_vars=["filename"], output_vars=["clean_version"]
+    c = Component(
+        name="cleaning",
+        owner="plumber",
+        description="Cleans raw NYC taxicab data",
+        beforeTests=[OutliersTest],
     )
-    def clean_data(filename: str) -> str:
-        df = pd.read_csv(filename)
-        # Do some cleaning
-        ...
-        # Save cleaned dataframe
-        clean_version = filename + '_clean_{datetime.utcnow().strftime("%m%d%Y%H%M%S")}.csv'
-        df.to_csv(clean_version)
-        return clean_version
-
-    if __name__ == "__main__":
-        # Optional: set hostname if you have not set DB_SERVER env var: mltrace.set_address("localhost")
-
-        # Create component
-        create_component(
-            name="cleaning",
-            description="Removes records with data out of bounds",
-            owner="shreya",
-            tags=["etl"],
-        )
+    @c.run(auto_log=True)
+    def clean_data(df: pd.DataFrame) -> str:
+        # Do some cleaning
+        clean_df = ...
+        return clean_df

-        # Run cleaning function
-        clean_data("raw_data.csv")
+At runtime, the ``OutliersTest`` test functions will run before the ``clean_data`` function. Note that all arguments to the test functions executed in ``beforeTests`` must be arguments to ``clean_data``. All arguments to the test functions executed in ``afterTests`` must be variables defined somewhere in ``clean_data``.
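The z-score rule in ``testZScore`` above can be exercised on its own. Here is a small self-contained sketch using numpy arrays instead of a DataFrame (the function name is made up for illustration; the arithmetic mirrors the check above):

```python
import numpy as np

def count_zscore_outliers(values, stdev_cutoff=5.0):
    """Count entries whose absolute z-score exceeds stdev_cutoff,
    mirroring the testZScore check (numpy instead of pandas)."""
    arr = np.asarray(values, dtype=float)
    z = np.abs((arr - arr.mean()) / arr.std())
    return int((z > stdev_cutoff).sum())

# One extreme value among 99 tightly clustered points trips the cutoff:
data = [10.0] * 99 + [1000.0]
n_outliers = count_zscore_outliers(data)     # 1
would_fail = n_outliers > 0.05 * len(data)   # False: below the 5% threshold
```

Note the interplay of the two parameters: a single outlier exceeds ``stdev_cutoff`` but stays under the 5% ``threshold``, so the test above would not raise for this data.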

-That's it! Now, every time this file is run, a new run for the cleaning component is logged.
+End-to-end example
+^^^^^^^^^^^^^^^^^^

-To see an example of ``mltrace`` integrated in a toy ML pipeline, check out the ``db`` branch of [this repo](https://github.com/shreyashankar/toy-ml-pipeline/tree/shreyashankar/db). The next step will demonstrate how to query and use the UI.
+To see an example of ``mltrace`` integrated into a Python pipeline, check out this `tutorial <https://github.com/loglabs/mltrace-demo>`_. The full pipeline with ``mltrace`` integrations is defined in ``solutions/main.py``.
