docs/source/changelog.rst
+9 Lines changed: 9 additions & 0 deletions
Original file line number
Diff line number
Diff line change
@@ -2,6 +2,15 @@
2
2
Changelog
3
3
=========
4
4
5
+
- :release:`0.17 <2021-11-04>`
6
+
- :support:`-` Added ability to create tests and execute them before and after components are run. Also, the web app has a React Router refactor, thanks to `@Boyuan-Deng`.
7
+
8
+
.. warning::
9
+
This change requires a DB migration. You can follow the documentation to perform the migration_ if you are using a release prior to this one.
10
+
11
+
- :feature:`226` Adds functionality to run triggers before and after components are run. Thanks `@aditim1359` for taking this on!
12
+
13
+
5
14
- :release:`0.16 <2021-07-08>`
6
15
- :support:`-` Added the review feature to aid in debugging erroneous outputs and functionality to log git tags to integrate with DVC.
docs/source/concepts.rst
+65 -4 Lines changed: 65 additions & 4 deletions
@@ -14,9 +14,31 @@ Knowing data flow is a precursor to debugging issues in data pipelines. ``mltrac
14
14
Data model
15
15
^^^^^^^^^^
16
16
17
-
The two prominent client-facing abstractions are the :py:class:`~mltrace.entities.Component` and :py:class:`~mltrace.entities.ComponentRun` abstractions.
17
+
The two prominent client-facing abstractions are the :py:class:`~mltrace.Component` and :py:class:`~mltrace.ComponentRun` abstractions.
18
18
19
-
:py:class:`mltrace.entities.Component`
19
+
:py:class:`~mltrace.Test`
20
+
"""""""""
21
+
22
+
The ``Test`` abstraction represents some reusable computation to perform on component inputs and outputs. Defining a ``Test`` is similar to writing a unit test:
23
+
24
+
.. code-block :: python
25
+
26
+
import pandas as pd
from mltrace import Test
27
+
28
+
class OutliersTest(Test):
29
+
def __init__(self):
30
+
super().__init__(name='outliers')
31
+
32
+
def testSomething(self, df: pd.DataFrame):
33
+
...
34
+
35
+
def testSomethingElse(self, df: pd.DataFrame):
36
+
...
37
+
38
+
39
+
Tests can be defined and passed to components as arguments, as described in the section below.
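The class above follows a unittest-like naming pattern, where each check is a method whose name begins with ``test``. As a rough illustration of how such methods could be collected, here is a minimal sketch using only the standard library. ``ToyTest`` and ``discover_tests`` are hypothetical names invented for this example; they are not part of ``mltrace``, and the library's own discovery logic may differ.

```python
import inspect


class ToyTest:
    """Hypothetical stand-in for a test class; this is not mltrace's Test."""

    def __init__(self, name: str):
        self.name = name

    def testPositive(self, value: int) -> None:
        assert value > 0, "value must be positive"

    def testSmall(self, value: int) -> None:
        assert value < 100, "value must be below 100"

    def helper(self) -> None:
        # Not prefixed with "test", so it is not collected below.
        pass


def discover_tests(obj) -> list:
    """Return the sorted names of bound methods on obj that start with 'test'."""
    return sorted(
        name
        for name, _ in inspect.getmembers(obj, predicate=inspect.ismethod)
        if name.startswith("test")
    )


test_names = discover_tests(ToyTest(name="toy"))
print(test_names)
```

Only the two ``test``-prefixed methods are returned; ``helper`` and ``__init__`` are skipped.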
40
+
41
+
:py:class:`mltrace.Component`
20
42
"""""""""
21
43
22
44
The ``Component`` abstraction represents a stage in a pipeline and its static metadata, such as:
@@ -25,10 +47,49 @@ The ``Component`` abstraction represents a stage in a pipeline and its static me
25
47
* description
26
48
* owner
27
49
* tags (optional list of string values to reference the component by)
50
+
* tests
28
51
29
52
Tags are generally useful when you have multiple components in a higher-level stage. For example, ETL computation could consist of different components such as "cleaning" or "feature generation." You could create the "cleaning" and "feature generation" components with the tag ``etl`` and then easily query component runs with the ``etl`` tag in the UI.
30
53
31
-
:py:class:`mltrace.entities.ComponentRun`
54
+
Components have a life-cycle:
55
+
56
+
* ``c = Component(...)``: construction of the component object
57
+
* ``c.beforeTests``: a list of ``Tests`` to run before the component is run
58
+
* ``c.run``: a decorator for a user-defined function that represents the component's computation
59
+
* ``c.afterTests``: a list of ``Tests`` to run after the component is run
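
The ordering in this life-cycle (before-tests, then the computation, then after-tests) can be sketched with a plain decorator. This is an illustration under stated assumptions, not ``mltrace``'s implementation: ``run_with_hooks``, ``events``, and ``compute`` are hypothetical names, and the hooks here take no arguments for simplicity.

```python
import functools
from typing import Callable, List

# Records the order in which each stage executes (illustration only).
events: List[str] = []


def run_with_hooks(
    before: List[Callable[[], None]],
    after: List[Callable[[], None]],
):
    """Hypothetical decorator: run `before` hooks, the function, then `after` hooks."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for hook in before:
                hook()
            result = func(*args, **kwargs)
            for hook in after:
                hook()
            return result

        return wrapper

    return decorator


@run_with_hooks(
    before=[lambda: events.append("before")],
    after=[lambda: events.append("after")],
)
def compute(x: int) -> int:
    events.append("run")
    return x * 2


result = compute(21)
print(events, result)
```

If a before-hook raised an exception in this sketch, the decorated function would never run, which is the behavior one would typically want from a precondition check.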
60
+
61
+
Putting it all together, we can define our own component:
description="Generates features for high tip prediction problem",
74
+
tags=["nyc-taxicab"],
75
+
beforeTests=beforeTests,
76
+
afterTests=afterTests,
77
+
)
78
+
79
+
80
+
And in our main application code, we can decorate any feature generation function:
81
+
82
+
.. code-block :: python
83
+
84
+
@Featuregen().run
85
+
def generateFeatures(df: pd.DataFrame):
86
+
# Generate features
87
+
df = ...
88
+
return df
89
+
90
+
See the next page for a more in-depth tutorial on instrumenting a pipeline.
91
+
92
+
:py:class:`mltrace.ComponentRun`
32
93
"""""""""
33
94
34
95
The ``ComponentRun`` abstraction represents an instance of a ``Component`` being run. Think of a ``ComponentRun`` instance as an object storing *dynamic* metadata for a ``Component``, such as:
@@ -41,7 +102,7 @@ The ``ComponentRun`` abstraction represents an instance of a ``Component`` being
41
102
* source code
42
103
* dependencies (you do not need to manually declare)
43
104
44
-
If you dig into the codebase, you will find another abstraction, the :py:class:`~mltrace.entities.IOPointer`. Inputs and outputs to a ``ComponentRun`` are stored as ``IOPointer`` objects. You do not need to explicitly create an ``IOPointer`` -- the abstraction exists so that ``mltrace`` can easily find and store dependencies between ``ComponentRun`` objects.
105
+
If you dig into the codebase, you will find another abstraction, the :py:class:`~mltrace.IOPointer`. Inputs and outputs to a ``ComponentRun`` are stored as ``IOPointer`` objects. You do not need to explicitly create an ``IOPointer`` -- the abstraction exists so that ``mltrace`` can easily find and store dependencies between ``ComponentRun`` objects.
45
106
46
107
You will not need to explicitly define all of these variables, nor do you have to create instances of a ``ComponentRun`` yourself. See the next section for logging functions and an example.
docs/source/logging.rst
+97 -65 Lines changed: 97 additions & 65 deletions
@@ -17,21 +17,30 @@ For this example, we will add logging functions to a hypothetical ``cleaning.py`
17
17
18
18
where ``SERVER_IP_ADDRESS`` is your server's IP address or "localhost" if you are running locally. You can also call ``mltrace.set_address(SERVER_IP_ADDRESS)`` in your Python script instead if you do not want to set the environment variable.
19
19
20
+
If you plan to use auto-logging for component run inputs and outputs (turned off by default), set the environment variable ``SAVE_DIR`` to the directory where versions of your inputs and outputs should be saved. The default is ``.mltrace`` in the user directory.
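
The fallback described above can be sketched as follows. ``resolve_save_dir`` is a hypothetical helper written for this example, not an ``mltrace`` function. Treating an empty ``SAVE_DIR`` as unset, and taking "user directory" to mean the home directory, are both assumptions.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_save_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return SAVE_DIR if set, otherwise a .mltrace folder in the home directory.

    Hypothetical helper for illustration; mltrace's own lookup may differ.
    """
    env = os.environ if env is None else env
    configured = env.get("SAVE_DIR")
    if configured:  # assumption: an empty string counts as "not set"
        return Path(configured)
    return Path.home() / ".mltrace"


print(resolve_save_dir({"SAVE_DIR": "/tmp/mltrace-artifacts"}))
print(resolve_save_dir({}))
```

Passing the environment as an argument keeps the helper easy to test without mutating ``os.environ``.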
21
+
20
22
Component creation
21
23
^^^^^^^^^^^^^^^^^^
22
24
23
-
For runs of components to be logged, you must first create the components themselves using :py:func:`mltrace.create_component`. For example:
25
+
For runs of components to be logged, you must first create the components themselves using :py:class:`mltrace.Component`. You can subclass the main Component class if you want to make a custom Component, for example:
24
26
25
27
.. code-block :: python
26
28
27
-
mltrace.create_component(
28
-
name="cleaning",
29
-
description="Removes records with data out of bounds",
You only need to do this once; however nothing happens if you run this code snippet more than once. It is fine to leave it in your Python file to run every time this file is run. If the component hasn't been created, you cannot have any runs of this component name. This is to enforce users to enter static metadata about a component, such as the description and owner, to better facilitate collaboration.
43
+
Components are intended to be defined once and reused throughout your application. You can define them in a separate file or folder and import them into your main Python application. If you do not want a custom component, you can also just use the default Component class, as shown below.
35
44
36
45
Logging runs
37
46
^^^^^^^^^^^^
@@ -43,49 +52,45 @@ Suppose we have a function ``clean`` in our ``cleaning.py`` file:
We will refer to ``clean_data`` as the decorated component run function. The ``auto_log`` parameter is set to False by default, but you can set it to True to automatically log inputs and outputs. If ``auto_log`` is True, ``mltrace`` will save and log paths to any dataframes, variables with "data" or "model" in their names, and any other variables greater than 1MB. ``mltrace`` will save to the directory defined by the environment variable ``SAVE_DIR``. If ``SAVE_DIR`` is not set, ``mltrace`` will save to a ``.mltrace`` folder in the user directory.
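
To make the selection rule concrete, here is a rough sketch of a predicate that picks variables by the three criteria named above. It is not ``mltrace``'s code: ``should_auto_log`` and ``ONE_MB`` are hypothetical, the name match is assumed to be case-insensitive, the dataframe check is duck-typed on the class name to avoid importing pandas, and ``sys.getsizeof`` only measures shallow size.

```python
import sys

ONE_MB = 1_000_000  # assumed byte threshold for "greater than 1MB"


def should_auto_log(name: str, value: object) -> bool:
    """Hypothetical predicate mirroring the auto_log criteria described above."""
    lowered = name.lower()  # assumption: matching ignores case
    if "data" in lowered or "model" in lowered:
        return True
    # Duck-typed dataframe check so this sketch does not depend on pandas.
    if type(value).__name__ == "DataFrame":
        return True
    # Shallow size only; nested containers may be larger than reported.
    return sys.getsizeof(value) > ONE_MB


print(should_auto_log("clean_data", 1))
print(should_auto_log("count", 3))
print(should_auto_log("blob", bytes(2_000_000)))
```

A real implementation would likely need a deep size estimate, since a list of large objects reports only the size of the list itself here.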
77
82
78
-
Note that ``input_vars`` and ``output_vars`` correspond to variables in the function. Their values at the time of return are logged. The start and end times, git hash, and source code snapshots are automatically captured. The dependencies are also automatically captured based on the values of the input variables.
83
+
If you do not set ``auto_log`` to True, then you will need to manually define your input and output variables in the :py:func:`~mltrace.Component.run` function. Note that ``input_vars`` and ``output_vars`` correspond to variables in the function. Their values at the time of return are logged. The start and end times, git hash, and source code snapshots are automatically captured. The dependencies are also automatically captured based on the values of the input variables.
79
84
80
85
Python approach
81
86
"""""""""
82
87
83
-
You can also create an instance of a :py:class:`~mltrace.entities.ComponentRun` and log it using :py:func:`mltrace.log_component_run` yourself for greater flexibility. An example of this is as follows:
88
+
You can also create an instance of a :py:class:`~mltrace.ComponentRun` and log it using :py:func:`mltrace.log_component_run` yourself for greater flexibility. An example of this is as follows:
84
89
85
90
.. code-block :: python
86
91
87
92
from datetime import datetime
88
-
from mltrace.entities import ComponentRun
93
+
from mltrace import ComponentRun
89
94
from mltrace import get_git_hash, log_component_run
90
95
import pandas as pd
91
96
@@ -110,51 +115,78 @@ You can also create an instance of a :py:class:`~mltrace.entities.ComponentRun`
110
115
111
116
return clean_version
112
117
113
-
Note that in :py:func:`~mltrace.log_component_run`, ``set_dependencies_from_inputs`` is set to ``True`` by default. You can set it to False if you want to manually specify the names of the components that this component run depends on. To manually specify a dependency, you can call :py:func:`~mltrace.entities.ComponentRun.set_upstream` with the dependent component name or list of component names before you call :py:func:`~mltrace.log_component_run`.
118
+
Note that in :py:func:`~mltrace.log_component_run`, ``set_dependencies_from_inputs`` is set to ``True`` by default. You can set it to False if you want to manually specify the names of the components that this component run depends on. To manually specify a dependency, you can call :py:func:`~mltrace.ComponentRun.set_upstream` with the dependent component name or list of component names before you call :py:func:`~mltrace.log_component_run`.
114
119
115
-
End-to-end example
116
-
^^^^^^^^^^^^^^^^^^
120
+
Testing
121
+
^^^^^^^
117
122
118
-
To put it all together, here's an end to end example of ``cleaning.py``:
123
+
You can define Tests, or reusable blocks of computation, to run before and after components are run. To define a test, you need to subclass the :py:class:`~mltrace.Test` class. Defining a test is similar to defining a Python unittest, for example:
119
124
120
125
.. code-block :: python
121
126
122
-
"""
123
-
cleaning.py
127
+
import pandas as pd
from mltrace import Test
128
+
129
+
class OutliersTest(Test):
130
+
def __init__(self):
131
+
super().__init__(name='outliers')
132
+
133
+
def testComputeStats(self, df: pd.DataFrame):
134
+
# Get numerical columns
135
+
num_df = df.select_dtypes(include=["number"])
136
+
137
+
# Compute stats
138
+
stats = num_df.describe()
139
+
print("Dataframe statistics:")
140
+
print(stats)
141
+
142
+
def testZScore(
143
+
self,
144
+
df: pd.DataFrame,
145
+
stdev_cutoff: float = 5.0,
146
+
threshold: float = 0.05,
147
+
):
148
+
"""
149
+
Checks to make sure there are no outliers using z score cutoff.
150
+
"""
151
+
# Get numerical columns
152
+
num_df = df.select_dtypes(include=["number"])
153
+
154
+
z_scores = (
155
+
(num_df - num_df.mean(axis=0, skipna=True))
156
+
/ num_df.std(axis=0, skipna=True)
157
+
).abs()
158
+
159
+
if (z_scores > stdev_cutoff).to_numpy().sum() > threshold * len(df):
160
+
print(
161
+
f"Number of outliers: {(z_scores > stdev_cutoff).to_numpy().sum()}"
Any function you expect to execute as a test must be prefixed with the name ``test`` in lowercase, like ``testSomething``. Arguments to test functions must be defined in the decorated component run function signature if the tests will be run before the component run function; otherwise the arguments to test functions must be defined as variables somewhere in the decorated component run function. You can integrate the tests into components in the constructor:
# Optionally set the hostname if you have not set the DB_SERVER env var: mltrace.set_address("localhost")
146
180
147
-
# Create component
148
-
create_component(
149
-
name="cleaning",
150
-
description="Removes records with data out of bounds",
151
-
owner="shreya",
152
-
tags=["etl"],
153
-
)
181
+
@c.run(auto_log=True)
182
+
def clean_data(df: pd.DataFrame) -> pd.DataFrame:
183
+
# Do some cleaning
184
+
clean_df = ...
185
+
return clean_df
154
186
155
-
# Run cleaning function
156
-
clean_data("raw_data.csv")
187
+
At runtime, the ``OutliersTest`` test functions will run before the ``clean_data`` function. Note that all arguments to the test functions executed in ``beforeTests`` must be arguments to ``clean_data``. All arguments to the test functions executed in ``afterTests`` must be variables defined somewhere in ``clean_data``.
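
The requirement that before-test arguments be arguments of the decorated function can be illustrated by matching parameter names with ``inspect.signature``. The sketch below covers only the before-test case and is an assumption about how such matching could work, not a description of ``mltrace`` internals. ``run_before_tests``, ``check_non_empty``, and ``clean`` are hypothetical names.

```python
import functools
import inspect
from typing import Any, Callable, Dict, List


def run_before_tests(tests: List[Callable[..., None]]):
    """Hypothetical decorator: call each test with same-named arguments of func."""

    def decorator(func):
        func_sig = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Map the call's positional and keyword arguments to parameter names.
            bound = func_sig.bind(*args, **kwargs)
            bound.apply_defaults()
            available: Dict[str, Any] = dict(bound.arguments)

            for test in tests:
                params = inspect.signature(test).parameters
                missing = [
                    name
                    for name, spec in params.items()
                    if name not in available
                    and spec.default is inspect.Parameter.empty
                ]
                if missing:
                    raise TypeError(
                        f"{test.__name__} needs arguments not passed to "
                        f"{func.__name__}: {missing}"
                    )
                test(**{n: available[n] for n in params if n in available})

            return func(*args, **kwargs)

        return wrapper

    return decorator


def check_non_empty(rows: list) -> None:
    assert len(rows) > 0, "rows must not be empty"


@run_before_tests([check_non_empty])
def clean(rows: list) -> list:
    return [r for r in rows if r is not None]


cleaned = clean([1, None, 2])
print(cleaned)
```

Calling ``clean([])`` in this sketch raises an ``AssertionError`` from ``check_non_empty`` before any cleaning happens. After-tests would additionally need access to variables local to the function body, which this sketch does not attempt.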
157
188
158
-
That's it! Now, every time this file is run, a new run for the cleaning component is logged.
189
+
End-to-end example
190
+
^^^^^^^^^^^^^^^^^^
159
191
160
-
To see an example of ``mltrace`` integrated in a toy ML pipeline, check out the ``db`` branch of [this repo](https://github.com/shreyashankar/toy-ml-pipeline/tree/shreyashankar/db). The next step will demonstrate how to query and use the UI.
192
+
To see an example of ``mltrace`` integrated into a Python pipeline, check out this `tutorial <https://github.com/loglabs/mltrace-demo>`_. The full pipeline with ``mltrace`` integrations is defined in ``solutions/main.py``.