Commit 93417d4

Merge pull request #266 from loglabs/shreyashankar/docs

Writing documentation

2 parents b44545d + a87f880 · commit 93417d4

11 files changed: +183 −124 lines

README.md (2 additions, 3 deletions)

@@ -5,10 +5,9 @@
 ![PyPI](https://img.shields.io/pypi/v/mltrace)
 [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

-`mltrace` tracks data flow through various components in ML pipelines and
-contains a UI and API to show a trace of steps in an ML pipeline that produces
-an output. It offers the following:
+`mltrace` is a lightweight, open-source Python tool to get "bolt-on" observability in ML pipelines. It offers the following:

+- interface to define data and ML tests for components in pipelines
 - coarse-grained lineage and tracing
 - Python API to log versions of data and pipeline components
 - database to store information about component runs

docs/requirements.txt (2 additions, 2 deletions)

@@ -2,10 +2,10 @@ furo
 Sphinx>=3.5.4
 sphinxcontrib-applehelp==1.0.2
 sphinxcontrib-devhelp==1.0.2
-sphinxcontrib-htmlhelp==1.0.3
+sphinxcontrib-htmlhelp>=1.0.3
 sphinxcontrib-jsmath==1.0.1
 sphinxcontrib-qthelp==1.0.3
-sphinxcontrib-serializinghtml==1.1.4
+sphinxcontrib-serializinghtml>=1.1.4
 attrs==20.3.0
 autopep8==1.5.6
 click==7.1.2

docs/source/changelog.rst (9 additions, 0 deletions)

@@ -2,6 +2,15 @@
 Changelog
 =========

+- :release:`0.17 <2021-11-04>`
+- :support:`-` Added ability to create tests and execute them before and after components are run. Also, the web app has a React Router refactor, thanks to `@Boyuan-Deng`.
+
+  .. warning::
+
+     This change requires a DB migration. You can follow the documentation to perform the migration_ if you are upgrading from a release prior to this one.
+
+- :feature:`226` Adds functionality to run triggers before and after components are run. Thanks `@aditim1359` for taking this on!
+
 - :release:`0.16 <2021-07-08>`
 - :support:`-` Added the review feature to aid in debugging erroneous outputs and functionality to log git tags to integrate with DVC.
docs/source/concepts.rst (65 additions, 4 deletions)

@@ -14,9 +14,31 @@ Knowing data flow is a precursor to debugging issues in data pipelines. ``mltrace``
 Data model
 ^^^^^^^^^^

-The two prominent client-facing abstractions are the :py:class:`~mltrace.entities.Component` and :py:class:`~mltrace.entities.ComponentRun` abstractions.
+The two prominent client-facing abstractions are the :py:class:`~mltrace.Component` and :py:class:`~mltrace.ComponentRun` abstractions.

-:py:class:`mltrace.entities.Component`
+:py:class:`~mltrace.Test`
+"""""""""
+
+The ``Test`` abstraction represents some reusable computation to perform on component inputs and outputs. Defining a ``Test`` is similar to writing a unit test:
+
+.. code-block:: python
+
+    from mltrace import Test
+
+    class OutliersTest(Test):
+        def __init__(self):
+            super().__init__(name='outliers')
+
+        def testSomething(self, df: pd.DataFrame):
+            ...
+
+        def testSomethingElse(self, df: pd.DataFrame):
+            ...
+
+Tests can be defined and passed to components as arguments, as described in the section below.
+
+:py:class:`mltrace.Component`
 """""""""

@@ -25,10 +47,49 @@ The ``Component`` abstraction represents a stage in a pipeline and its static metadata, such as:
 * description
 * owner
 * tags (optional list of string values to reference the component by)
+* tests

 Tags are generally useful when you have multiple components in a higher-level stage. For example, ETL computation could consist of different components such as "cleaning" or "feature generation." You could create the "cleaning" and "feature generation" components with the tag ``etl`` and then easily query component runs with the ``etl`` tag in the UI.

-:py:class:`mltrace.entities.ComponentRun`
+Components have a life-cycle:
+
+* ``c = Component(...)``: construction of the component object
+* ``c.beforeTests``: a list of ``Tests`` to run before the component is run
+* ``c.run``: a decorator for a user-defined function that represents the component's computation
+* ``c.afterTests``: a list of ``Tests`` to run after the component is run
+
+Putting it all together, we can define our own component:
+
+.. code-block:: python
+
+    from mltrace import Component
+
+    class Featuregen(Component):
+        def __init__(self, beforeTests=[], afterTests=[OutliersTest]):
+            super().__init__(
+                name="featuregen",
+                owner="spark-gymnast",
+                description="Generates features for high tip prediction problem",
+                tags=["nyc-taxicab"],
+                beforeTests=beforeTests,
+                afterTests=afterTests,
+            )
+
+And in our main application code, we can decorate any feature generation function:
+
+.. code-block:: python
+
+    @Featuregen().run
+    def generateFeatures(df: pd.DataFrame):
+        # Generate features
+        df = ...
+        return df
+
+See the next page for a more in-depth tutorial on instrumenting a pipeline.
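The before/run/after life-cycle described above can be illustrated with a small self-contained sketch. This is plain Python with made-up names (``MiniComponent``), not mltrace's actual implementation: a component wraps a function so that its before-hooks see the inputs and its after-hooks see the result.

```python
from typing import Callable, List, Optional

class MiniComponent:
    """Toy stand-in for the life-cycle above: run beforeTests, then the
    decorated function, then afterTests. Illustrative only, not mltrace code."""

    def __init__(self, name: str,
                 beforeTests: Optional[List[Callable]] = None,
                 afterTests: Optional[List[Callable]] = None):
        self.name = name
        self.beforeTests = beforeTests or []
        self.afterTests = afterTests or []

    def run(self, func: Callable) -> Callable:
        def wrapper(*args, **kwargs):
            for test in self.beforeTests:   # hooks see the component's inputs
                test(*args, **kwargs)
            result = func(*args, **kwargs)
            for test in self.afterTests:    # hooks see the computed result
                test(result)
            return result
        return wrapper

calls = []
c = MiniComponent("cleaning",
                  beforeTests=[lambda x: calls.append("before")],
                  afterTests=[lambda r: calls.append("after")])

@c.run
def double(x):
    calls.append("run")
    return 2 * x

result = double(21)  # calls == ["before", "run", "after"], result == 42
```

The decorator pattern is why ``c.run`` can capture dynamic metadata (and run tests) every time the wrapped function executes, without the function body changing.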
+:py:class:`mltrace.ComponentRun`
 """""""""

 The ``ComponentRun`` abstraction represents an instance of a ``Component`` being run. Think of a ``ComponentRun`` instance as an object storing *dynamic* metadata for a ``Component``, such as:
@@ -41,7 +102,7 @@ The ``ComponentRun`` abstraction represents an instance of a ``Component`` being
 * source code
 * dependencies (you do not need to manually declare)

-If you dig into the codebase, you will find another abstraction, the :py:class:`~mltrace.entities.IOPointer`. Inputs and outputs to a ``ComponentRun`` are stored as ``IOPointer`` objects. You do not need to explicitly create an ``IOPointer`` -- the abstraction exists so that ``mltrace`` can easily find and store dependencies between ``ComponentRun`` objects.
+If you dig into the codebase, you will find another abstraction, the :py:class:`~mltrace.IOPointer`. Inputs and outputs to a ``ComponentRun`` are stored as ``IOPointer`` objects. You do not need to explicitly create an ``IOPointer`` -- the abstraction exists so that ``mltrace`` can easily find and store dependencies between ``ComponentRun`` objects.

 You will not need to explicitly define all of these variables, nor do you have to create instances of a ``ComponentRun`` yourself. See the next section for logging functions and an example.
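As a rough mental model of how ``IOPointer``-style bookkeeping lets dependencies be inferred (a simplified, hypothetical sketch, not mltrace's actual algorithm): each run records pointers to its inputs and outputs, and a run's upstream dependencies are whichever earlier runs produced its inputs.

```python
def infer_upstream(runs):
    """runs: list of (component_name, inputs, outputs) tuples in execution order,
    where inputs/outputs are lists of pointer strings (e.g. file paths).
    Returns {component_name: set of upstream component names}."""
    produced_by = {}  # pointer -> name of the run that produced it
    deps = {}
    for name, inputs, outputs in runs:
        # A run depends on whichever earlier run produced each of its inputs.
        deps[name] = {produced_by[p] for p in inputs if p in produced_by}
        for p in outputs:
            produced_by[p] = name
    return deps

runs = [
    ("cleaning",   ["raw.csv"],      ["clean.csv"]),
    ("featuregen", ["clean.csv"],    ["features.csv"]),
    ("training",   ["features.csv"], ["model.pkl"]),
]
deps = infer_upstream(runs)
# deps["featuregen"] == {"cleaning"}; deps["training"] == {"featuregen"}
```

This is why you never declare dependencies by hand: matching output pointers to input pointers is enough to recover the coarse-grained lineage graph.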

docs/source/index.rst (4 additions, 2 deletions)

@@ -1,9 +1,9 @@
 mltrace documentation
 ===================================

-mltrace_ is an open-source Python tool to track data flow through various
-components and diagnose failure modes in ML pipelines. It offers the following:
+mltrace_ is a lightweight, open-source Python tool to get "bolt-on" observability in ML pipelines. It offers the following:

+- interface to define data and ML tests for components in pipelines
 - coarse-grained lineage and tracing
 - Python API to log versions of data and pipeline components
 - database to store information about component runs

@@ -32,6 +32,8 @@ Roadmap

 We are actively working on the following:

+- Component input and output monitoring
+- Stateful testing (i.e., being able to use historical component inputs and outputs in testing and monitoring)
 - API to log from any type of file, not just a Python file
 - Prometheus integrations to monitor component output distributions
 - Support for finer-grained lineage (at the record level)

docs/source/logging.rst (97 additions, 65 deletions)

@@ -17,21 +17,30 @@ For this example, we will add logging functions to a hypothetical ``cleaning.py``

 where ``SERVER_IP_ADDRESS`` is your server's IP address or "localhost" if you are running locally. You can also call ``mltrace.set_address(SERVER_IP_ADDRESS)`` in your Python script instead if you do not want to set the environment variable.

+If you plan to use the auto logging functionality for component run inputs and outputs (turned off by default), you will need to set the environment variable ``SAVE_DIR`` to the directory where you want to save versions of your inputs and outputs. The default is ``.mltrace`` in the user directory.
+
 Component creation
 ^^^^^^^^^^^^^^^^^^

-For runs of components to be logged, you must first create the components themselves using :py:func:`mltrace.create_component`. For example:
+For runs of components to be logged, you must first create the components themselves using :py:class:`mltrace.Component`. You can subclass the main ``Component`` class if you want to make a custom component, for example:

 .. code-block:: python

-    mltrace.create_component(
-        name="cleaning",
-        description="Removes records with data out of bounds",
-        owner="shreya",
-        tags=["etl"],
-    )
+    from mltrace import Component
+
+    class Cleaning(Component):
+        def __init__(self, name, owner, tags=[], beforeTests=[], afterTests=[]):
+            super().__init__(
+                name="cleaning_" + name,
+                owner=owner,
+                description="Basic component to clean raw data",
+                tags=tags,
+                beforeTests=beforeTests,
+                afterTests=afterTests,
+            )

-You only need to do this once; however, nothing happens if you run this code snippet more than once. It is fine to leave it in your Python file to run every time this file is run. If the component hasn't been created, you cannot have any runs of this component name. This is to enforce users to enter static metadata about a component, such as the description and owner, to better facilitate collaboration.
+Components are intended to be defined once and reused throughout your application. You can define them in a separate file or folder and import them into your main Python application. If you do not want a custom component, you can also just use the default ``Component`` class, as shown below.

 Logging runs
 ^^^^^^^^^^^^
@@ -43,49 +52,45 @@ Suppose we have a function ``clean`` in our ``cleaning.py`` file:

 .. code-block:: python

-    from datetime import datetime
     import pandas as pd

-    def clean_data(filename: str) -> str:
-        df = pd.read_csv(filename)
+    def clean_data(df: pd.DataFrame) -> str:
         # Do some cleaning
-        ...
-        # Save cleaned dataframe
-        clean_version = filename + '_clean_{datetime.utcnow().strftime("%m%d%Y%H%M%S")}.csv'
-        df.to_csv(clean_version)
-        return clean_version
+        clean_df = ...
+        return clean_df

-We can include the :py:func:`~mltrace.register` decorator such that every time this function is run, dynamic information is logged:
+We can include the :py:func:`~mltrace.Component.run` decorator such that every time this function is run, dynamic information is logged:

 .. code-block:: python

-    from datetime import datetime
-    from mltrace import register
+    from mltrace import Component
     import pandas as pd

-    @register(
-        component_name="cleaning", input_vars=["filename"], output_vars=["clean_version"]
+    c = Component(
+        name="cleaning",
+        owner="plumber",
+        description="Cleans raw NYC taxicab data",
     )
-    def clean_data(filename: str) -> str:
-        df = pd.read_csv(filename)
+
+    @c.run(auto_log=True)
+    def clean_data(df: pd.DataFrame) -> str:
         # Do some cleaning
-        ...
-        # Save cleaned dataframe
-        clean_version = filename + '_clean_{datetime.utcnow().strftime("%m%d%Y%H%M%S")}.csv'
-        df.to_csv(clean_version)
-        return clean_version
+        clean_df = ...
+        return clean_df
+
+We will refer to ``clean_data`` as the decorated component run function. The ``auto_log`` parameter is False by default, but you can set it to True to automatically log inputs and outputs. If ``auto_log`` is True, ``mltrace`` will save and log paths to any dataframes, variables with "data" or "model" in their names, and any other variables greater than 1 MB. ``mltrace`` will save to the directory defined by the environment variable ``SAVE_DIR``. If ``SAVE_DIR`` is not set, ``mltrace`` will save to a ``.mltrace`` folder in the user directory.

-Note that ``input_vars`` and ``output_vars`` correspond to variables in the function. Their values at the time of return are logged. The start and end times, git hash, and source code snapshots are automatically captured. The dependencies are also automatically captured based on the values of the input variables.
+If you do not set ``auto_log`` to True, then you will need to manually define your input and output variables in the :py:func:`~mltrace.Component.run` function. Note that ``input_vars`` and ``output_vars`` correspond to variables in the function. Their values at the time of return are logged. The start and end times, git hash, and source code snapshots are automatically captured. The dependencies are also automatically captured based on the values of the input variables.

 Python approach
 """""""""

-You can also create an instance of a :py:class:`~mltrace.entities.ComponentRun` and log it using :py:func:`mltrace.log_component_run` yourself for greater flexibility. An example of this is as follows:
+You can also create an instance of a :py:class:`~mltrace.ComponentRun` and log it using :py:func:`mltrace.log_component_run` yourself for greater flexibility. An example of this is as follows:

 .. code-block:: python

     from datetime import datetime
-    from mltrace.entities import ComponentRun
+    from mltrace import ComponentRun
     from mltrace import get_git_hash, log_component_run
     import pandas as pd
@@ -110,51 +115,78 @@ You can also create an instance of a :py:class:`~mltrace.entities.ComponentRun`

     return clean_version

-Note that in :py:func:`~mltrace.log_component_run`, ``set_dependencies_from_inputs`` is set to ``True`` by default. You can set it to False if you want to manually specify the names of the components that this component run depends on. To manually specify a dependency, you can call :py:func:`~mltrace.entities.ComponentRun.set_upstream` with the dependent component name or list of component names before you call :py:func:`~mltrace.log_component_run`.
+Note that in :py:func:`~mltrace.log_component_run`, ``set_dependencies_from_inputs`` is set to ``True`` by default. You can set it to False if you want to manually specify the names of the components that this component run depends on. To manually specify a dependency, you can call :py:func:`~mltrace.ComponentRun.set_upstream` with the dependent component name or list of component names before you call :py:func:`~mltrace.log_component_run`.

-End-to-end example
-^^^^^^^^^^^^^^^^^^
+Testing
+^^^^^^^

-To put it all together, here's an end to end example of ``cleaning.py``:
+You can define Tests, or reusable blocks of computation, to run before and after components are run. To define a test, you need to subclass the :py:class:`~mltrace.Test` class. Defining a test is similar to defining a Python unittest, for example:

 .. code-block:: python

-    """
-    cleaning.py
+    from mltrace import Test
+
+    class OutliersTest(Test):
+        def __init__(self):
+            super().__init__(name='outliers')
+
+        def testComputeStats(self, df: pd.DataFrame):
+            # Get numerical columns
+            num_df = df.select_dtypes(include=["number"])
+
+            # Compute stats
+            stats = num_df.describe()
+            print("Dataframe statistics:")
+            print(stats)
+
+        def testZScore(
+            self,
+            df: pd.DataFrame,
+            stdev_cutoff: float = 5.0,
+            threshold: float = 0.05,
+        ):
+            """
+            Checks to make sure there are no outliers using z score cutoff.
+            """
+            # Get numerical columns
+            num_df = df.select_dtypes(include=["number"])
+
+            z_scores = (
+                (num_df - num_df.mean(axis=0, skipna=True))
+                / num_df.std(axis=0, skipna=True)
+            ).abs()
+
+            if (z_scores > stdev_cutoff).to_numpy().sum() > threshold * len(df):
+                print(
+                    f"Number of outliers: {(z_scores > stdev_cutoff).to_numpy().sum()}"
+                )
+                print(f"Outlier threshold: {threshold * len(df)}")
+                raise Exception("There are outlier values!")
+
+Any function you expect to execute as a test must be prefixed with ``test`` in lowercase, like ``testSomething``. Arguments to test functions must be defined in the decorated component run function signature if the tests will be run before the component run function; otherwise, the arguments to test functions must be defined as variables somewhere in the decorated component run function. You can integrate the tests into components in the constructor:

-    File that cleans data.
-    """
+.. code-block:: python

-    from datetime import datetime
-    from mltrace import create_component, register
+    from mltrace import Component
     import pandas as pd

-    @register(
-        component_name="cleaning", input_vars=["filename"], output_vars=["clean_version"]
+    c = Component(
+        name="cleaning",
+        owner="plumber",
+        description="Cleans raw NYC taxicab data",
+        beforeTests=[OutliersTest],
     )
-    def clean_data(filename: str) -> str:
-        df = pd.read_csv(filename)
-        # Do some cleaning
-        ...
-        # Save cleaned dataframe
-        clean_version = filename + '_clean_{datetime.utcnow().strftime("%m%d%Y%H%M%S")}.csv'
-        df.to_csv(clean_version)
-        return clean_version
-
-    if __name__ == "__main__":
-        # Optional: set hostname if you have not set DB_SERVER env var: mltrace.set_address("localhost")
-
-        # Create component
-        create_component(
-            name="cleaning",
-            description="Removes records with data out of bounds",
-            owner="shreya",
-            tags=["etl"],
-        )
+    @c.run(auto_log=True)
+    def clean_data(df: pd.DataFrame) -> str:
+        # Do some cleaning
+        clean_df = ...
+        return clean_df

-        # Run cleaning function
-        clean_data("raw_data.csv")
+At runtime, the ``OutliersTest`` test functions will run before the ``clean_data`` function. Note that all arguments to the test functions executed in ``beforeTests`` must be arguments to ``clean_data``. All arguments to the test functions executed in ``afterTests`` must be variables defined somewhere in ``clean_data``.
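The z-score rule in ``testZScore`` above can be exercised on its own. Here is a small self-contained sketch using numpy arrays instead of a DataFrame (the function name is made up for illustration; the arithmetic mirrors the check above):

```python
import numpy as np

def count_zscore_outliers(values, stdev_cutoff=5.0):
    """Count entries whose absolute z-score exceeds stdev_cutoff,
    mirroring the testZScore check (numpy instead of pandas)."""
    arr = np.asarray(values, dtype=float)
    z = np.abs((arr - arr.mean()) / arr.std())
    return int((z > stdev_cutoff).sum())

# One extreme value among 99 tightly clustered points trips the cutoff:
data = [10.0] * 99 + [1000.0]
n_outliers = count_zscore_outliers(data)     # 1
would_fail = n_outliers > 0.05 * len(data)   # False: below the 5% threshold
```

Note the interplay of the two parameters: a single outlier exceeds ``stdev_cutoff`` but stays under the 5% ``threshold``, so the test above would not raise for this data.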

-That's it! Now, every time this file is run, a new run for the cleaning component is logged.
+End-to-end example
+^^^^^^^^^^^^^^^^^^

-To see an example of ``mltrace`` integrated in a toy ML pipeline, check out the ``db`` branch of [this repo](https://github.com/shreyashankar/toy-ml-pipeline/tree/shreyashankar/db). The next step will demonstrate how to query and use the UI.
+To see an example of ``mltrace`` integrated into a Python pipeline, check out this `tutorial <https://github.com/loglabs/mltrace-demo>`_. The full pipeline with ``mltrace`` integrations is defined in ``solutions/main.py``.
