Commit 58dd44d

Added doctest helpers to namespace and used ruff for linting / pre-commit over other older tools. (#179)
* Added doctest helpers to namespace and used ruff for linting
* Use yaml instead of yml and remove unneeded exclusions from pre-commit config.
* Update ruff
* Adding back in erroneously deleted lines.
* Adding back in erroneously deleted lines.
* Correct admonition syntax
* Correcting numbering mismatch
* Removed more no-longer-needed imports in doctests
* Removed unnecessary variable definitions in doctests
* More small stylistic cleanups for config doctests
* Removed DTZ001 lint rule and UTC stuff to avoid 3.10 version issue
* Removed unnecessary noqa comment
* Added coverage exclusion configs
1 parent eb45f80 commit 58dd44d

29 files changed: +327 -421 lines changed

.github/workflows/tests.yml renamed to .github/workflows/tests.yaml

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@ jobs:

       - name: Install packages
         run: |
-          pip install .[dev]
+          pip install -e .[dev,tests]

       #----------------------------------------------
       # run test suite

.pre-commit-config.yaml

Lines changed: 25 additions & 67 deletions
@@ -1,7 +1,5 @@
 default_language_version:
-  python: python3
-
-exclude: "to_organize"
+  python: python3.12

 repos:
   - repo: https://github.com/pre-commit/pre-commit-hooks
@@ -20,32 +18,15 @@ repos:
       - id: check-added-large-files
         args: [--maxkb, "800"]

-  # python code formatting
-  - repo: https://github.com/psf/black
-    rev: 23.7.0
-    hooks:
-      - id: black
-        args: [--line-length, "110"]
-
-  # python import sorting
-  - repo: https://github.com/PyCQA/isort
-    rev: 5.12.0
-    hooks:
-      - id: isort
-        args: ["--profile", "black", "--filter-files", "-o", "wandb"]
-
-  - repo: https://github.com/PyCQA/autoflake
-    rev: v2.2.0
-    hooks:
-      - id: autoflake
-        args: [--in-place, --remove-all-unused-imports]
-
-  # python upgrading syntax to newer version
-  - repo: https://github.com/asottile/pyupgrade
-    rev: v3.10.1
+  # python code formatting, linting, and import sorting using ruff
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    rev: v0.11.7
     hooks:
-      - id: pyupgrade
-        args: [--py310-plus]
+      # Run the formatter
+      - id: ruff-format
+      # Run the linter
+      - id: ruff
+        args: ["--fix", "--exit-non-zero-on-fix"]

   # python docstring formatting
   - repo: https://github.com/myint/docformatter
@@ -54,78 +35,55 @@ repos:
       - id: docformatter
         args: [--in-place, --wrap-summaries=110, --wrap-descriptions=110]

-  # python check (PEP8), programming errors and code complexity
-  - repo: https://github.com/PyCQA/flake8
-    rev: 6.1.0
-    hooks:
-      - id: flake8
-        args:
-          [
-            "--max-complexity=10",
-            "--extend-ignore",
-            "E402,E701,E251,E226,E302,W504,E704,E402,E401,C901,E203",
-            "--max-line-length=110",
-            "--exclude",
-            "logs/*,data/*",
-            "--per-file-ignores",
-            "__init__.py:F401",
-          ]
-
   # yaml formatting
   - repo: https://github.com/pre-commit/mirrors-prettier
-    rev: v3.0.3
+    rev: v4.0.0-alpha.8
     hooks:
       - id: prettier
         types: [yaml]
-        exclude: "environment.yaml"

   # shell scripts linter
   - repo: https://github.com/shellcheck-py/shellcheck-py
-    rev: v0.9.0.5
+    rev: v0.10.0.1
     hooks:
       - id: shellcheck

   # md formatting
   - repo: https://github.com/executablebooks/mdformat
-    rev: 0.7.17
+    rev: 0.7.22
     hooks:
       - id: mdformat
         args: ["--number"]
         additional_dependencies:
+          - mdformat-ruff
           - mdformat-gfm
+          - mdformat-gfm-alerts
           - mdformat-tables
           - mdformat_frontmatter
-          - mdformat-myst
-          - mdformat-black
           - mdformat-config
-          - mdformat-shfmt
+          - mdformat-myst
+          - mdformat-toc

   # word spelling linter
   - repo: https://github.com/codespell-project/codespell
-    rev: v2.2.5
+    rev: v2.4.1
     hooks:
       - id: codespell
         args:
-          - --skip=logs/**,data/**,*.ipynb,*.bib,env.yml,env_cpu.yml,*.svg,poetry.lock
-          - --ignore-words-list=ehr,nd
+          - --skip=*.ipynb,*.bib,*.svg,pyproject.toml,docs/source/usage.md
+          - --ignore-words-list=ehr,crate

   # jupyter notebook cell output clearing
   - repo: https://github.com/kynan/nbstripout
-    rev: 0.6.1
+    rev: 0.8.1
     hooks:
       - id: nbstripout

-  # jupyter notebook linting
+  # jupyter notebook linting with ruff
   - repo: https://github.com/nbQA-dev/nbQA
-    rev: 1.7.0
+    rev: 1.9.1
     hooks:
-      - id: nbqa-black
+      - id: nbqa-ruff
+        args: ["--fix"]
+      - id: nbqa-ruff-format
         args: ["--line-length=110"]
-      - id: nbqa-isort
-        args: ["--profile=black"]
-      - id: nbqa-flake8
-        args:
-          [
-            "--extend-ignore=E203,E402,E501,F401,F841",
-            "--exclude=logs/*,data/*",
-          ]

README.md

Lines changed: 18 additions & 11 deletions
@@ -7,7 +7,7 @@
   <a href="https://pypi.org/project/es-aces/"><img alt="PyPI" src="https://img.shields.io/pypi/v/es-aces"></a>
   <a href="https://hydra.cc/"><img alt="Hydra" src="https://img.shields.io/badge/Config-Hydra_1.3-89b8cd"></a>
   <a href="https://codecov.io/gh/justin13601/ACES"><img alt="Codecov" src="https://codecov.io/gh/justin13601/ACES/graph/badge.svg?token=6EA84VFXOV"></a>
-  <a href="https://github.com/justin13601/ACES/actions/workflows/tests.yml"><img alt="Tests" src="https://github.com/justin13601/ACES/actions/workflows/tests.yml/badge.svg"></a>
+  <a href="https://github.com/justin13601/ACES/actions/workflows/tests.yaml"><img alt="Tests" src="https://github.com/justin13601/ACES/actions/workflows/tests.yaml/badge.svg"></a>
   <a href="https://github.com/justin13601/ACES/actions/workflows/code-quality-main.yaml"><img alt="Code Quality" src="https://github.com/justin13601/ACES/actions/workflows/code-quality-main.yaml/badge.svg"></a>
   <a href="https://eventstreamaces.readthedocs.io/en/latest/?badge=latest"><img alt="Documentation" src="https://readthedocs.org/projects/eventstreamaces/badge/?version=latest"/></a>
   <a href="https://github.com/justin13601/ACES/graphs/contributors"><img alt="Contributors" src="https://img.shields.io/github/contributors/justin13601/ACES.svg"></a>
@@ -19,14 +19,14 @@

 **Updates**

-- **\[2025-01-22\]** ACES accepted to ICLR'25!
-- **\[2024-12-10\]** Latest `polars` version (`1.17.1`) is now supported.
-- **\[2024-10-28\]** Nested derived predicates and derived predicates between static variables and plain predicates can now be defined.
-- **\[2024-09-01\]** Predicates can now be defined in a configuration file separate to task criteria files.
-- **\[2024-08-29\]** Latest `MEDS` version (`0.3.3`) is now supported.
-- **\[2024-08-10\]** Expanded predicates configuration language to support regular expressions, multi-column constraints, and multi-value constraints.
-- **\[2024-07-30\]** Added ability to place constraints on static variables, such as patient demographics.
-- **\[2024-06-28\]** Paper available at [arXiv:2406.19653](https://arxiv.org/abs/2406.19653).
+- **[2025-01-22]** ACES accepted to ICLR'25!
+- **[2024-12-10]** Latest `polars` version (`1.17.1`) is now supported.
+- **[2024-10-28]** Nested derived predicates and derived predicates between static variables and plain predicates can now be defined.
+- **[2024-09-01]** Predicates can now be defined in a configuration file separate to task criteria files.
+- **[2024-08-29]** Latest `MEDS` version (`0.3.3`) is now supported.
+- **[2024-08-10]** Expanded predicates configuration language to support regular expressions, multi-column constraints, and multi-value constraints.
+- **[2024-07-30]** Added ability to place constraints on static variables, such as patient demographics.
+- **[2024-06-28]** Paper available at [arXiv:2406.19653](https://arxiv.org/abs/2406.19653).

 Automatic Cohort Extraction System (ACES) is a library that streamlines the extraction of task-specific cohorts from time series datasets formatted as event-streams, such as Electronic Health Records (EHR). ACES is designed to query these EHR datasets for valid subjects, guided by various constraints and requirements defined in a YAML task configuration file. This offers a powerful and user-friendly solution to researchers and developers. The use of a human-readable YAML configuration file also eliminates the need for users to be proficient in complex dataframe querying, making the extraction process accessible to a broader audience.

@@ -60,7 +60,9 @@ Install with dependencies from the root directory of the cloned repo:
 pip install -e .
 ```

-**Note**: To avoid potential dependency conflicts, please install ESGPT first before installing ACES. This ensures compatibility with the `polars` version required by ACES.
+> [!NOTE]
+> To avoid potential dependency conflicts, please install ESGPT first before installing ACES. This ensures
+> compatibility with the `polars` version required by ACES.

 ## Instructions for Use

@@ -229,7 +231,12 @@ Fields for a "plain" predicate:
 - `value_max_inclusive` (optional): Must be a boolean specifying whether `value_max` is inclusive or not.
 - `other_cols` (optional): Must be a 1-to-1 dictionary of column name and column value, which places additional constraints on further columns.

-**Note**: For memory optimization, we strongly recommend using either the List of Values or Regular Expression formats whenever possible, especially when needing to match multiple values. Defining each code as an individual string will increase memory usage significantly, as each code generates a separate predicate column. Using a list or regex consolidates multiple matching codes under a single column, reducing the overall memory footprint.
+> [!NOTE]
+> For memory optimization, we strongly recommend using either the List of Values or Regular Expression formats
+> whenever possible, especially when needing to match multiple values. Defining each code as an individual
+> string will increase memory usage significantly, as each code generates a separate predicate column. Using a
+> list or regex consolidates multiple matching codes under a single column, reducing the overall memory
+> footprint.

 #### Derived Predicates
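
The memory note in this README hunk can be illustrated with a rough sketch. Everything below is an assumption made for illustration: the predicate names and codes are hypothetical, the `regex` key name is assumed (only the `any` key appears in this diff), and "one column per named predicate" is a simplified model of the note, not ACES code.

```python
import re

# Hypothetical codes; not taken from any real dataset.
codes = ["DX//A01", "DX//A02", "DX//A03"]

# One predicate per code: per the note, each would generate its own column.
individual = {f"dx_{i}": {"code": code} for i, code in enumerate(codes)}

# List-of-values format: a single predicate that matches any listed code.
as_list = {"dx": {"code": {"any": codes}}}

# Regular-expression format (key name assumed): also a single predicate.
as_regex = {"dx": {"code": {"regex": r"^DX//A0[1-3]$"}}}


def n_predicate_columns(predicates: dict) -> int:
    """Simplified model: one predicate column per named predicate."""
    return len(predicates)


# The regex covers exactly the same codes as the explicit list.
pattern = re.compile(as_regex["dx"]["code"]["regex"])
regex_matches_all = all(pattern.match(code) for code in codes)

print(n_predicate_columns(individual), n_predicate_columns(as_list), n_predicate_columns(as_regex))
```

Under this simplified model, the three individual definitions yield three columns while the list and regex forms each yield one.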

conftest.py

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+"""Test set-up and fixtures code."""
+
+import json
+import sys
+import tempfile
+from datetime import datetime, timedelta
+from pathlib import Path
+from typing import Any
+from unittest.mock import MagicMock, patch
+
+import polars as pl
+import pytest
+import yaml
+
+
+@pytest.fixture(autouse=True)
+def _setup_doctest_namespace(doctest_namespace: dict[str, Any], caplog: pytest.LogCaptureFixture) -> None:
+    doctest_namespace.update(
+        {
+            "caplog": caplog,
+            "MagicMock": MagicMock,
+            "sys": sys,
+            "Path": Path,
+            "patch": patch,
+            "json": json,
+            "pl": pl,
+            "datetime": datetime,
+            "timedelta": timedelta,
+            "tempfile": tempfile,
+            "yaml": yaml,
+        }
+    )

docs/source/conf.py

Lines changed: 2 additions & 2 deletions
@@ -4,6 +4,8 @@
 import sys
 from pathlib import Path

+from sphinx.ext import apidoc
+
 # Configuration file for the Sphinx documentation builder.
 #
 # This file only contains a selection of the most common options. For a full
@@ -55,8 +57,6 @@ def ensure_pandoc_installed(_):

 # TODO: use https://github.com/sphinx-extensions2/sphinx-autodoc2

-from sphinx.ext import apidoc
-
 output_dir = __location__ / "api"
 module_dir = __src__ / "src/aces"
 if output_dir.is_dir():

docs/source/configuration.md

Lines changed: 30 additions & 24 deletions
@@ -50,9 +50,10 @@ These configs consist of the following four fields:
   expression (satisfied if the regular expression evaluates to True), or a `any` key and the value being a
   list of strings (satisfied if there is an occurrence for any code in the list).

-  **Note**: Each individual definition of `PlainPredicateConfig` and `code` will generate a separate predicate
-  column. Thus, for memory optimization, it is strongly recommended to match multiple values using either the
-  List of Values or Regular Expression formats whenever possible.
+  > [!NOTE]
+  > Each individual definition of `PlainPredicateConfig` and `code` will generate a separate predicate
+  > column. Thus, for memory optimization, it is strongly recommended to match multiple values using either
+  > the List of Values or Regular Expression formats whenever possible.

 - `value_min`: If specified, an observation will only satisfy this predicate if the occurrence of the
   underlying `code` with a reported numerical value that is either greater than or greater than or equal to
@@ -82,10 +83,11 @@ on its source format.
    (recommended), then the `code` will be checked directly against MEDS' `code` field and the `value_min`
    and `value_max` constraints will be compared against MEDS' `numeric_value` field.

-   **Note**: This syntax does not currently support defining predicates that also rely on matching other,
-   optional fields in the MEDS syntax; if this is a desired feature for you, please let us know by filing a
-   GitHub issue or pull request or upvoting any existing issue/PR that requests/implements this feature,
-   and we will add support for this capability.
+   > [!NOTE]
+   > This syntax does not currently support defining predicates that also rely on matching other, optional
+   > fields in the MEDS syntax; if this is a desired feature for you, please let us know by filing a GitHub
+   > issue or pull request or upvoting any existing issue/PR that requests/implements this feature, and we
+   > will add support for this capability.

 2. If the source data is in [ESGPT](https://eventstreamml.readthedocs.io/en/latest/) format, then the
    `code` will be interpreted in the following manner:
@@ -109,8 +111,9 @@ accepted operations that can be applied to other predicates, containing precisel
 - `and(pred_1_name, pred_2_name, ...)`: Asserts that all of the specified predicates must be true.
 - `or(pred_1_name, pred_2_name, ...)`: Asserts that any of the specified predicates must be true.

-**Note**: Currently, `and`'s and `or`'s cannot be nested. Upon user request, we may support further advanced
-analytic operations over predicates.
+> [!NOTE]
+> Currently, `and`'s and `or`'s cannot be nested. Upon user request, we may support further advanced
+> analytic operations over predicates.

 ______________________________________________________________________
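
The `and(...)` and `or(...)` operations described in this hunk can be sketched as follows. This models each predicate as a list of per-event booleans, which is an assumption made for illustration rather than how ACES stores or evaluates predicates. The predicate names are hypothetical.

```python
def derived_and(*predicates):
    """True for an event only if every input predicate is true for it."""
    return [all(row) for row in zip(*predicates)]


def derived_or(*predicates):
    """True for an event if at least one input predicate is true for it."""
    return [any(row) for row in zip(*predicates)]


# Hypothetical per-event values for two plain predicates.
high_lab = [True, False, True, False]
admission = [True, True, False, False]

both = derived_and(high_lab, admission)
either = derived_or(high_lab, admission)
print(both)
print(either)
```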

@@ -153,20 +156,22 @@ following rules:
    exactly `$TIME_DELTA` either after or before the event being referenced (either the external event or the
    end or start of the window).

-   **Note**: If `$REFERENCED` is the `start` field, then `$TIME_DELTA` must be positive, and if
-   `$REFERENCED` is the `end` field, then `$TIME_DELTA` must be negative to preserve the time ordering of
-   the window fields.
+   > [!NOTE]
+   > If `$REFERENCED` is the `start` field, then `$TIME_DELTA` must be positive, and if
+   > `$REFERENCED` is the `end` field, then `$TIME_DELTA` must be negative to preserve the time ordering of
+   > the window fields.

 2. `$REFERENCING = $REFERENCED -> $PREDICATE`, `$REFERENCING = $REFERENCED <- $PREDICATE`
    In this case, the referencing event will be defined as the next or previous event satisfying the
    predicate, `$PREDICATE`.

-   **Note**: If the `$REFERENCED` is the `start` field, then the "next predicate
-   ordering" (`$REFERENCED -> $PREDICATE`) must be used, and if the `$REFERENCED` is the `end` field, then the
-   "previous predicate ordering" (`$REFERENCED <- $PREDICATE`) must be used to preserve the time ordering of
-   the window fields. These forms can lead to windows being defined as single point events, if the
-   `$REFERENCED` event itself satisfies `$PREDICATE` and the appropriate constraints are satisfied and
-   inclusive values are set.
+   > [!NOTE]
+   > If the `$REFERENCED` is the `start` field, then the "next predicate
+   > ordering" (`$REFERENCED -> $PREDICATE`) must be used, and if the `$REFERENCED` is the `end` field, then
+   > the "previous predicate ordering" (`$REFERENCED <- $PREDICATE`) must be used to preserve the time
+   > ordering of the window fields. These forms can lead to windows being defined as single point events, if
+   > the `$REFERENCED` event itself satisfies `$PREDICATE` and the appropriate constraints are satisfied and
+   > inclusive values are set.

 3. `$REFERENCING = $REFERENCED`
    In this case, the referencing event will be defined as the same event as the referenced event.
@@ -196,9 +201,10 @@ that define the valid range the count of observations of the named predicate tha
 for it to be considered valid. Either `min_valid` or `max_valid` constraints can be `None`, in which case
 those endpoints are left unconstrained. Likewise, unreferenced predicates are also left unconstrained.

-**Note**: As predicate counts are always integral, this specification does not need an additional
-inclusive/exclusive endpoint field, as one can simply increment the bound by one in the appropriate direction
-to achieve the result. Instead, this bound is always interpreted to be inclusive, so a window would satisfy
-the constraint for predicate `name` with constraint `name: (1, 2)` if the count of observations of predicate
-`name` in a window was either 1 or 2. All constraints in the dictionary must be satisfied on a window for it
-to be included.
+> [!NOTE]
+> As predicate counts are always integral, this specification does not need an additional
+> inclusive/exclusive endpoint field, as one can simply increment the bound by one in the appropriate direction
+> to achieve the result. Instead, this bound is always interpreted to be inclusive, so a window would satisfy
+> the constraint for predicate `name` with constraint `name: (1, 2)` if the count of observations of predicate
+> `name` in a window was either 1 or 2. All constraints in the dictionary must be satisfied on a window for it
+> to be included.
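
The inclusive-bounds rule in this note can be sketched as a small check. The function names are hypothetical and this is not the ACES implementation; it encodes only what the note states: both bounds are inclusive, `None` leaves an endpoint unconstrained, unreferenced predicates are unconstrained, and every listed constraint must hold.

```python
from typing import Optional


def count_in_bounds(count: int, min_valid: Optional[int], max_valid: Optional[int]) -> bool:
    """Inclusive check; a bound of None leaves that endpoint unconstrained."""
    if min_valid is not None and count < min_valid:
        return False
    if max_valid is not None and count > max_valid:
        return False
    return True


def window_is_valid(counts: dict, constraints: dict) -> bool:
    """A window is kept only if every constrained predicate is within bounds.

    Predicates absent from `constraints` are never checked. A predicate that
    is constrained but missing from `counts` is treated as a count of 0
    (an assumption of this sketch).
    """
    return all(
        count_in_bounds(counts.get(name, 0), min_valid, max_valid)
        for name, (min_valid, max_valid) in constraints.items()
    )


# The note's example: `name: (1, 2)` accepts counts of exactly 1 or 2.
constraints = {"name": (1, 2)}
accepted = [n for n in range(5) if window_is_valid({"name": n}, constraints)]
print(accepted)
```

Because counts are integers, an exclusive bound such as "fewer than 3" is written as the inclusive pair `(None, 2)`.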

docs/source/notebooks/examples.ipynb

Lines changed: 0 additions & 3 deletions
@@ -33,9 +33,6 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "import json\n",
-    "\n",
-    "import yaml\n",
     "from bigtree import print_tree\n",
     "\n",
     "from aces import config"

docs/source/notebooks/tutorial_esgpt.ipynb

Lines changed: 1 addition & 1 deletion
@@ -86,7 +86,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "with open(config_path, \"r\") as stream:\n",
+    "with open(config_path) as stream:\n",
     "    data_loaded = yaml.safe_load(stream)\n",
     "    print(json.dumps(data_loaded, indent=4))"
    ]

docs/source/notebooks/tutorial_meds.ipynb

Lines changed: 1 addition & 2 deletions
@@ -30,7 +30,6 @@
    "outputs": [],
    "source": [
     "import json\n",
-    "from pathlib import Path\n",
     "\n",
     "import pandas as pd\n",
     "import yaml\n",
@@ -88,7 +87,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "with open(config_path, \"r\") as stream:\n",
+    "with open(config_path) as stream:\n",
     "    data_loaded = yaml.safe_load(stream)\n",
     "    print(json.dumps(data_loaded, indent=4))"
    ]
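
Both tutorial notebooks drop the explicit `"r"` argument, which changes nothing at runtime because text-mode reading is the default for `open`. The commit does not say which lint rule prompted this; a redundant-mode check such as ruff's `UP015` is one likely candidate. A small sketch using a hypothetical JSON file in place of the notebooks' YAML config:

```python
import json
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical stand-in for the notebooks' `config_path`.
    config_path = Path(tmp_dir) / "config.json"
    config_path.write_text(json.dumps({"predicates": {}}))

    # Open the same file with and without the explicit mode.
    with open(config_path) as implicit, open(config_path, "r") as explicit:
        modes = (implicit.mode, explicit.mode)
        same_content = json.load(implicit) == json.load(explicit)

print(modes, same_content)
```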
