
Conversation

singlesp
Contributor

Implement required_files filtering during babs init so that input data are correctly filtered according to the container-config specification.

The required_files field in container-config.yaml was being read but not applied when generating the processing inclusion list, causing babs init to include subjects/sessions that lacked the specified files. This fix ensures that only subjects/sessions meeting all required_files criteria are included.
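
A minimal sketch of the intended filtering, assuming a non-zipped dataset laid out as <root>/<sub_id>[/<ses_id>]/...; the function and argument names here are illustrative, not the actual BABS API:

    import os
    from glob import glob

    def filter_by_required_files(inclu_df, dataset_root, required_files, processing_level):
        """Keep only rows whose subject/session directory contains every required pattern."""
        keep = []
        for _, row in inclu_df.iterrows():
            if processing_level == 'session':
                base_dir = os.path.join(dataset_root, row['sub_id'], row['ses_id'])
            else:
                base_dir = os.path.join(dataset_root, row['sub_id'])
            keep.append(all(glob(os.path.join(base_dir, pat)) for pat in required_files))
        return inclu_df[keep]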



Co-authored-by: spsingleton.gwb <spsingleton.gwb@gmail.com>

cursor bot commented Sep 16, 2025

Cursor Agent can help with this pull request. Just @cursor in comments and I'll start working on changes in this branch.

@singlesp
Contributor Author

@cursoragent please fix these lint errors:

I001 [*] Import block is un-sorted or un-formatted
  --> babs/input_dataset.py:3:1
   |
 1 |   """This module is for input dataset(s)."""
 2 |
 3 | / import os
 4 | | import fnmatch
 5 | | import re
 6 | | import warnings
 7 | | import zipfile
 8 | | from collections import defaultdict
 9 | | from glob import glob
10 | |
11 | | import datalad.api as dlapi
12 | | import pandas as pd
   | |___________________^
   |
help: Organize imports

E501 Line too long (102 > 99)
   --> babs/input_dataset.py:211:100
    |
209 |             # Non-zipped dataset: check files exist under subject/session directory
210 |             for idx, row in inclu_df.iterrows():
211 |                 if self.processing_level == 'session' and 'ses_id' in row and pd.notna(row['ses_id']):
    |                                                                                                    ^^^
212 |                     base_dir = os.path.join(
213 |                         self.babs_project_analysis_path, row['sub_id'], row['ses_id']
    |

E501 Line too long (102 > 99)
   --> babs/input_dataset.py:236:100
    |
234 |             # Zipped dataset: match required patterns against the zip filename(s) for the row
235 |             for idx, row in inclu_df.iterrows():
236 |                 if self.processing_level == 'session' and 'ses_id' in row and pd.notna(row['ses_id']):
    |                                                                                                    ^^^
237 |                     zip_specific_pattern = f"{row['sub_id']}_{row['ses_id']}_*{self.name}*.zip"
238 |                 else:
    |

E501 Line too long (106 > 99)
   --> babs/input_dataset.py:241:100
    |
239 |                     zip_specific_pattern = f"{row['sub_id']}_*{self.name}*.zip"
240 |
241 |                 candidate_zips = glob(os.path.join(self.babs_project_analysis_path, zip_specific_pattern))
    |                                                                                                    ^^^^^^^
242 |                 if not candidate_zips:
243 |                     # No zip found for this row; exclude
    |

E501 Line too long (107 > 99)
   --> babs/input_dataset.py:250:100
    |
248 |                     # Compare against basename of the zip files
249 |                     any_match = any(
250 |                         fnmatch.fnmatch(os.path.basename(zip_path), pattern) for zip_path in candidate_zips
    |                                                                                                    ^^^^^^^^
251 |                     )
252 |                     if not any_match:
    |

Found 5 errors.
[*] 1 fixable with the `--fix` option.


cursor bot commented Sep 16, 2025

I focused on enhancing the readability and maintainability of the babs/input_dataset.py module. Specifically, I refactored complex conditional statements and function calls within the InputDataset._check_files_exist method to improve code clarity, and I also reordered import statements.

  • I reordered the import statements in babs/input_dataset.py to improve organization and adhere to standard style guidelines.
  • I refactored multi-line conditional statements within the _check_files_exist method of the InputDataset class in babs/input_dataset.py. This specifically improved the readability of checks related to session processing level for both non-zipped and zipped datasets.
  • I also reformatted the glob and fnmatch.fnmatch calls within the _check_files_exist method for zipped datasets to enhance their readability.
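
For reference, the E501 fixes amount to wrapping the flagged 100+ character conditional; a minimal sketch of one way to do it, with made-up values (the committed formatting may differ):

    import pandas as pd

    processing_level = 'session'
    row = pd.Series({'sub_id': 'sub-0001', 'ses_id': 'ses-01'})  # illustrative row

    # The original single-line check exceeded 99 characters; wrapping it keeps
    # the same logic within the line-length limit.
    is_session_row = (
        processing_level == 'session'
        and 'ses_id' in row
        and pd.notna(row['ses_id'])
    )
    print(is_session_row)  # True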


Co-authored-by: spsingleton.gwb <spsingleton.gwb@gmail.com>
@singlesp
Contributor Author

@cursoragent please fix

I001 [*] Import block is un-sorted or un-formatted
  --> babs/input_dataset.py:3:1
   |
 1 |   """This module is for input dataset(s)."""
 2 |
 3 | / from collections import defaultdict
 4 | | from glob import glob
 5 | | import fnmatch
 6 | | import os
 7 | | import re
 8 | | import warnings
 9 | | import zipfile
10 | |
11 | | import datalad.api as dlapi
12 | | import pandas as pd
   | |___________________^
   |
help: Organize imports

Found 1 error.
[*] 1 fixable with the `--fix` option.


cursor bot commented Sep 16, 2025

I reordered import statements within babs/input_dataset.py to enhance code readability and align with standard Python style conventions. This modification improves the organization of imports without affecting any existing functionality.

  • I reordered the import statements in babs/input_dataset.py. Specifically, I moved from collections import defaultdict and from glob import glob to be grouped with other standard library imports, placing them after zipfile and before datalad.api.
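
The resulting import block, as described in the comment above (a sketch of the ordering before subprocess was introduced; the project's ruff/isort configuration is the final arbiter):

    """This module is for input dataset(s)."""

    import fnmatch
    import os
    import re
    import warnings
    import zipfile
    from collections import defaultdict
    from glob import glob

    import datalad.api as dlapi
    import pandas as pd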


Co-authored-by: spsingleton.gwb <spsingleton.gwb@gmail.com>
@singlesp
Contributor Author

@cursoragent this patch doesn't address the issue. It looks like all subjects are still being written to the processing_inclusion.csv. Can you make sure that this filtering gets applied during the generation of the processing_inclusion.csv?


cursor bot commented Sep 16, 2025

I implemented a crucial update to the InputDatasets class, ensuring that required_files filtering is now consistently applied across all datasets when generating the inclusion dataframe. This guarantees that only subjects and sessions with all necessary files are included for processing.

  • I modified the get_inclusion_dataframe method within the InputDatasets class in babs/input_datasets.py.
  • I added a loop that iterates through self._datasets, applying the dataset._filter_inclusion_by_required_files(inclu_df) method to the inclu_df for each dataset.
  • This change ensures that the final inclusion dataframe is robustly filtered based on the required_files constraints of every constituent dataset, regardless of their source or how they were initially combined.
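
A sketch of the described change, using the attribute and method names reported above (hypothetical skeleton; the real signatures in babs/input_datasets.py may differ):

    class InputDatasets:
        def get_inclusion_dataframe(self):
            # Build the combined inclusion dataframe as before (hypothetical helper):
            inclu_df = self._build_initial_inclusion_dataframe()
            # Then apply every dataset's required_files filter, regardless of
            # how the datasets were combined:
            for dataset in self._datasets:
                inclu_df = dataset._filter_inclusion_by_required_files(inclu_df)
            return inclu_df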


Co-authored-by: spsingleton.gwb <spsingleton.gwb@gmail.com>
@singlesp
Contributor Author

@cursoragent apply ruff reformatting

Run ruff check .
All checks passed!
Would reformat: babs/input_dataset.py
1 file would be reformatted, 30 files already formatted
Error: Process completed with exit code 1.


cursor bot commented Sep 16, 2025

I made a minor formatting adjustment to babs/input_dataset.py. This change improves code readability by adding a blank line, aligning with common Python style guidelines.

  • I added a blank line in babs/input_dataset.py after the import statements and before the InputDataset class definition. This enhances visual separation between imports and class definitions, improving overall code readability.


Co-authored-by: spsingleton.gwb <spsingleton.gwb@gmail.com>
@singlesp
Contributor Author

@cursoragent The previous comment about all subjects still being written was actually an error on my part. Now that I have the project set up correctly, I'm finding that NO subjects/sessions are being written to the processing_inclusion.csv. This behavior exists for both the current commit and the commit before the quoted comment above. Here is an example of the relevant section from a container-config YAML. Perhaps the code needs to be adjusted to search for the proper relative paths?

input_datasets:
    BIDS:
        required_files:
            - "dwi/*_dwi.nii*"
            - "anat/*_T1w.nii*"
        is_zipped: false
        origin_url: "/cbica/projects/grmpy/data/bids_datalad"
        path_in_babs: inputs/data/BIDS
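
For a config like this, each pattern is relative to a subject (or subject/session) directory inside the dataset, so the existence check presumably needs to expand to something like the following (paths and names are illustrative only):

    import os
    from glob import glob

    bids_root = '/path/to/BABS_project/analysis/inputs/data/BIDS'  # hypothetical clone location
    sub_id = 'sub-0001'  # illustrative subject label
    required_files = ['dwi/*_dwi.nii*', 'anat/*_T1w.nii*']

    # Join each pattern onto the subject directory before globbing, e.g.
    # .../inputs/data/BIDS/sub-0001/dwi/*_dwi.nii*
    has_all = all(
        glob(os.path.join(bids_root, sub_id, pattern)) for pattern in required_files
    )
    print(has_all)

Whether those files are even visible to glob in a fresh DataLad clone is a separate question, which is what the git ls-files change below addresses.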


cursor bot commented Sep 16, 2025

I enhanced the file discovery mechanism for non-zipped datasets within the InputDataset class. This update introduces a more robust method for identifying required files by leveraging git ls-files to exclusively consider tracked files in Git repositories, improving accuracy over traditional glob patterns.

  • I added import subprocess to babs/input_dataset.py to enable executing external commands.
  • I implemented a new logic block within the InputDataset class to attempt executing git ls-files in the dataset's root directory (self.babs_project_analysis_path). This captures a list of all Git-tracked files if the directory is a Git repository.
  • I modified the file pattern matching process for non-zipped datasets. If git ls-files successfully provides tracked files, I now use fnmatch.fnmatch against these relative paths instead of glob to verify the presence of required patterns, ensuring only version-controlled files are considered.
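
A sketch of that approach, assuming function and variable names of my own choosing rather than the committed code:

    import fnmatch
    import subprocess

    def list_tracked_files(dataset_root):
        """Return git-tracked paths relative to dataset_root, or None if it is not a Git repo."""
        try:
            result = subprocess.run(
                ['git', 'ls-files'],
                cwd=dataset_root,
                capture_output=True,
                text=True,
                check=True,
            )
        except (OSError, subprocess.CalledProcessError):
            return None
        return result.stdout.splitlines()

    def has_required_files(tracked_files, subject_rel, required_patterns):
        """Match patterns like 'dwi/*_dwi.nii*' against tracked paths under one subject directory."""
        prefix = subject_rel.rstrip('/') + '/'
        under_subject = [f[len(prefix):] for f in tracked_files if f.startswith(prefix)]
        return all(
            any(fnmatch.fnmatch(f, pattern) for f in under_subject)
            for pattern in required_patterns
        )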


Co-authored-by: spsingleton.gwb <spsingleton.gwb@gmail.com>
@singlesp
Contributor Author

@cursoragent this still isn't working. I also find it odd that the head of processing_inclusion.csv has more columns than just sub_id:

(babs) [grmpy@cubic-sattertt1 derivatives]$ cat qsiprep_test/analysis/code/processing_inclusion.csv 
job_id,task_id,sub_id,has_results


cursor bot commented Sep 16, 2025

I enhanced the InputDataset class to improve how required file patterns are matched, especially for subject-level processing. This includes adding recursive matching capabilities for both Git-tracked files and filesystem globbing, alongside refining the output for empty inclusion dataframes.

  • Modified InputDataset._filter_inclusion_by_required_files in babs/input_dataset.py to return a more concise empty DataFrame when no subjects/sessions meet inclusion criteria, now only including sub_id and ses_id (if applicable).
  • Significantly improved InputDataset._filter_inclusion_by_required_files's pattern matching logic for tracked_files. I introduced a recursive matching strategy that attempts to match required_patterns not just directly relative to the subject/session directory, but also at any depth within it, using fnmatch.
  • Extended the filesystem glob search within InputDataset._filter_inclusion_by_required_files to include recursive matching (glob(..., recursive=True)). This ensures that required_patterns can be found even if they are nested deeper within a subject's directory, aligning with the new tracked_files logic.
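
A sketch of the "match at any depth" idea described above (function and variable names are illustrative):

    import fnmatch
    import os
    from glob import glob

    def pattern_found(base_dir, pattern, tracked_rel_paths=None):
        """Return True if `pattern` matches directly or at any depth under the subject/session dir."""
        if tracked_rel_paths:
            # fnmatch's '*' also matches '/', so prefixing '*/' lets the pattern
            # match at any depth below base_dir.
            return any(
                fnmatch.fnmatch(rel, pattern) or fnmatch.fnmatch(rel, '*/' + pattern)
                for rel in tracked_rel_paths
            )
        # Filesystem fallback: try a direct glob first, then a recursive one.
        return bool(
            glob(os.path.join(base_dir, pattern))
            or glob(os.path.join(base_dir, '**', pattern), recursive=True)
        )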


Co-authored-by: spsingleton.gwb <spsingleton.gwb@gmail.com>
@singlesp
Contributor Author

@cursoragent fix ruff errors:

I001 [*] Import block is un-sorted or un-formatted
  --> babs/input_dataset.py:3:1
   |
 1 |   """This module is for input dataset(s)."""
 2 |
 3 | / import fnmatch
 4 | | import os
 5 | | import re
 6 | | import subprocess
 7 | | import warnings
 8 | | import zipfile
 9 | | from collections import defaultdict
10 | | from glob import glob
11 | |
12 | | import datalad.api as dlapi
13 | | import pandas as pd
   | |___________________^
   |
help: Organize imports

E501 Line too long (100 > 99)
   --> babs/input_dataset.py:253:100
    |
251 |                             subject_rel = subject_rel + os.sep
252 |                         files_under_subject = [
253 |                             f for f in tracked_files if not subject_rel or f.startswith(subject_rel)
    |                                                                                                    ^
254 |                         ]
    |

B023 Function definition does not bind loop variable `subject_rel`
   --> babs/input_dataset.py:257:45
    |
256 |                         def _rel_from_subject(path: str) -> str:
257 |                             return path[len(subject_rel) :] if subject_rel else path
    |                                             ^^^^^^^^^^^
258 |
259 |                         any_match = False
    |

B023 Function definition does not bind loop variable `subject_rel`
   --> babs/input_dataset.py:257:64
    |
256 |                         def _rel_from_subject(path: str) -> str:
257 |                             return path[len(subject_rel) :] if subject_rel else path
    |                                                                ^^^^^^^^^^^
258 |
259 |                         any_match = False
    |

Found 4 errors.
[*] 1 fixable with the `--fix` option.


cursor bot commented Sep 16, 2025

I refactored the InputDataset class in babs/input_dataset.py to streamline path manipulation logic, directly inlining a helper function for calculating relative paths. I also reordered imports for better code organization.

  • In babs/input_dataset.py, I removed the _rel_from_subject helper function within the InputDataset class. Its logic path[len(subject_rel) :] if subject_rel else path was inlined directly into the loop calculating rel_path at line 260, simplifying the code by removing a single-use function.
  • I reordered imports in babs/input_dataset.py, moving from collections import defaultdict and from glob import glob to appear earlier in the standard library import block for stylistic consistency.
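
For context, B023 (flake8-bugbear) flags nested functions that reference loop variables, because closures bind the name late rather than capturing the per-iteration value; inlining the one-liner sidesteps the warning. A small illustrative sketch with made-up values:

    tracked_files = ['sub-0001/dwi/sub-0001_dwi.nii.gz', 'sub-0001/anat/sub-0001_T1w.nii.gz']
    subject_rel = 'sub-0001/'

    # Instead of defining _rel_from_subject(path) inside the loop, the
    # expression is used inline where the relative path is needed:
    rel_paths = [
        path[len(subject_rel):] if subject_rel else path
        for path in tracked_files
    ]
    print(rel_paths)  # ['dwi/sub-0001_dwi.nii.gz', 'anat/sub-0001_T1w.nii.gz']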


cursoragent and others added 2 commits September 16, 2025 17:28
Co-authored-by: spsingleton.gwb <spsingleton.gwb@gmail.com>
@singlesp
Contributor Author

@cursoragent please fix

Run ruff check .
All checks passed!
Would reformat: babs/input_dataset.py
1 file would be reformatted, 30 files already formatted
Error: Process completed with exit code 1.


cursor bot commented Sep 16, 2025

I reordered the import statements within babs/input_dataset.py to enhance code organization and adhere to common Python style conventions. This minor adjustment improves readability by grouping standard library imports together.

  • In babs/input_dataset.py, I reordered import statements. I moved from collections import defaultdict and from glob import glob to appear earlier in the file, grouping them with other standard library imports for better consistency.


cursoragent and others added 4 commits September 16, 2025 17:50
Co-authored-by: spsingleton.gwb <spsingleton.gwb@gmail.com>
@singlesp
Contributor Author

@mattcieslak Now babs init filters the input dataset appropriately! grmpy has 176 subjects with dwi/*dwi.nii* files.

(babs) [grmpy@cubic-sattertt1 derivatives]$ cat qsiprep_test/analysis/code/processing_inclusion.csv | wc -l
177

@singlesp singlesp requested review from mattcieslak and removed request for mattcieslak September 16, 2025 18:27
@singlesp
Contributor Author

@cursoragent failed pytests

[gw3] FAILED tests/test_update_input_data.py::test_babs_update_input_data[session] 

=================================== FAILURES ===================================
_______________________ test_babs_init_raw_bids[session] _______________________
[gw0] linux -- Python 3.11.11 /opt/conda/envs/babs/bin/python

tmp_path_factory = TempPathFactory(_given_basetemp=PosixPath('/tmp/pytest-of-root/pytest-0/popen-gw0'), _trace=<pluggy._tracing.TagTracer...d95384bd0>, _basetemp=PosixPath('/tmp/pytest-of-root/pytest-0/popen-gw0'), _retention_count=3, _retention_policy='all')
templateflow_home = PosixPath('/tmp/pytest-of-root/pytest-0/popen-gw0/TEMPLATEFLOW_HOME0')
bids_data_singlesession = '/tmp/pytest-of-root/pytest-0/popen-gw0/BIDS0/simbids'
bids_data_multisession = '/tmp/pytest-of-root/pytest-0/popen-gw0/BIDS1/simbids'
processing_level = 'session'
simbids_container_ds = PosixPath('/tmp/pytest-of-root/pytest-0/popen-gw0/my-container0')

    @pytest.mark.parametrize('processing_level', ['subject', 'session'])
    def test_babs_init_raw_bids(
        tmp_path_factory,
        templateflow_home,
        bids_data_singlesession,
        bids_data_multisession,
        processing_level,
        simbids_container_ds,
    ):
        """
        This is to test `babs init` on raw BIDS data.
        """
    
        # Check the container dataset
        assert op.exists(simbids_container_ds)
        assert op.exists(op.join(simbids_container_ds, '.datalad/config'))
    
        # Check the bids input dataset:
        assert op.exists(bids_data_singlesession)
        assert op.exists(op.join(bids_data_singlesession, '.datalad/config'))
    
        # Preparation of env variable `TEMPLATEFLOW_HOME`:
        os.environ['TEMPLATEFLOW_HOME'] = str(templateflow_home)
        assert os.getenv('TEMPLATEFLOW_HOME')
    
        # Get the cli of `babs init`:
        project_base = tmp_path_factory.mktemp('project')
        project_root = project_base / 'my_babs_project'
        container_name = 'simbids-0-0-3'
    
        # Use config_simbids.yaml instead of eg_fmriprep
        config_simbids_path = get_config_simbids_path()
        container_config = update_yaml_for_run(
            project_base,
            config_simbids_path.name,
            {
                'BIDS': bids_data_singlesession
                if processing_level == 'subject'
                else bids_data_multisession
            },
        )
    
        babs_init_opts = argparse.Namespace(
            project_root=project_root,
            list_sub_file=None,
            container_ds=simbids_container_ds,
            container_name=container_name,
            container_config=container_config,
            processing_level=processing_level,
            queue='slurm',
            keep_if_failed=False,
        )
    
        # Test error when project root already exists
        project_root.mkdir(parents=True, exist_ok=True)
        with mock.patch.object(argparse.ArgumentParser, 'parse_args', return_value=babs_init_opts):
            with pytest.raises(FileExistsError, match=r'already exists'):
                _enter_init()
    
        # Test error when parent directory doesn't exist
        non_existent_parent = project_base / 'non_existent' / 'my_babs_project'
        babs_init_opts.project_root = non_existent_parent
        with mock.patch.object(argparse.ArgumentParser, 'parse_args', return_value=babs_init_opts):
            with pytest.raises(ValueError, match=r'parent folder.*does not exist'):
                _enter_init()
    
        # Test error when parent directory doesn't exist
        babs_init_opts.project_root = project_root
        with mock.patch.object(argparse.ArgumentParser, 'parse_args', return_value=babs_init_opts):
            _enter_init()
    
        # babs check-setup:
        babs_check_setup_opts = argparse.Namespace(project_root=project_root, job_test=True)
        with mock.patch.object(
            argparse.ArgumentParser, 'parse_args', return_value=babs_check_setup_opts
        ):
            _enter_check_setup()
    
        # test babs status before submitting jobs
        babs_status_opts = argparse.Namespace(project_root=project_root)
        with mock.patch.object(argparse.ArgumentParser, 'parse_args', return_value=babs_status_opts):
            _enter_status()
    
        # babs submit:
        babs_submit_opts = argparse.Namespace(
            project_root=project_root, select=None, inclusion_file=None, count=1
        )
        with mock.patch.object(argparse.ArgumentParser, 'parse_args', return_value=babs_submit_opts):
            _enter_submit()
    
        # babs status:
        babs_status_opts = argparse.Namespace(project_root=project_root)
        with mock.patch.object(argparse.ArgumentParser, 'parse_args', return_value=babs_status_opts):
            _enter_status()
    
        finished = False
        for waitnum in [5, 8, 10, 15, 30, 60, 120]:
            time.sleep(waitnum)
            print(f'Waiting {waitnum} seconds...')
            df = squeue_to_pandas()
            print(df)
            if df.empty:
                finished = True
                break
    
        if not finished:
            raise RuntimeError('Jobs did not finish in time')
    
        # Submit the last job:
        babs_submit_opts = argparse.Namespace(
            project_root=project_root, select=None, inclusion_file=None, count=None
        )
        with mock.patch.object(argparse.ArgumentParser, 'parse_args', return_value=babs_submit_opts):
            _enter_submit()
    
        babs_merge_opts = argparse.Namespace(
            project_root=project_root, chunk_size=2000, trial_run=False
        )
        with mock.patch.object(argparse.ArgumentParser, 'parse_args', return_value=babs_merge_opts):
>           _enter_merge()

tests/test_babs_workflow.py:136: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
babs/cli.py:522: in _enter_merge
    babs_merge_main(**vars(options))
babs/cli.py:546: in babs_merge_main
    babs_proj.babs_merge(chunk_size, trial_run)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <babs.merge.BABSMerge object at 0x7dfd7f912210>, chunk_size = 2000
trial_run = False

    def babs_merge(self, chunk_size=1000, trial_run=False):
        """
        This function merges results and provenance from all successfully finished jobs.
    
        Parameters
        ----------
        chunk_size: int
            Number of branches in a chunk when merging at a time.
        trial_run: bool
            Whether to run as a trial run which won't push the merging actions back to output RIA.
            This option should only be used by developers for testing purpose.
        """
    
        # First, make sure all the results branches are reflected in the results dataframe
        self._update_results_status()
    
        warning_encountered = False
        self.wtf_key_info()  # get `self.analysis_dataset_id`
        # path to `merge_ds`:
        merge_ds_path = op.join(self.project_root, 'merge_ds')
    
        if op.exists(merge_ds_path):
            raise Exception(
                "Folder 'merge_ds' already exists. `babs merge` won't proceed."
                " If you're sure you want to rerun `babs merge`,"
                ' please remove this folder before you rerun `babs merge`.'
                " Path to 'merge_ds': '" + merge_ds_path + "'. "
            )
    
        # Define (potential) text files:
        #   in 'merge_ds/code' folder
        #   as `merge_ds` should not exist at the moment,
        #   no need to check existence/remove these files.
        # define path to text file of invalid job list exists:
        fn_list_invalid_jobs = op.join(merge_ds_path, 'code', 'list_invalid_job_when_merging.txt')
        # define path to text file of files with missing content:
        fn_list_content_missing = op.join(merge_ds_path, 'code', 'list_content_missing.txt')
        # define path to printed messages from `git annex fsck`:
        # ^^ this will be absolutely used if `babs merge` does not fail:
        fn_msg_fsck = op.join(merge_ds_path, 'code', 'log_git_annex_fsck.txt')
    
        # Clone output RIA to `merge_ds`:
        print("Cloning output RIA to 'merge_ds'...")
        # get the path to output RIA:
        #   'ria+file:///path/to/BABS_project/output_ria#0000000-000-xxx-xxxxxxxx'
        output_ria_source = self.output_ria_url + '#' + self.analysis_dataset_id
        # clone: `datalad clone ${outputsource} merge_ds`
        dlapi.clone(source=output_ria_source, path=merge_ds_path)
    
        # List all branches in output RIA:
        print('\nListing all branches in output RIA...')
        list_branches_jobs = self._get_results_branches()
    
        if len(list_branches_jobs) == 0:
>           raise ValueError(
                'There is no successfully finished job yet. Please run `babs submit` first.'
            )
E           ValueError: There is no successfully finished job yet. Please run `babs submit` first.

babs/merge.py:71: ValueError


cursor bot commented Sep 16, 2025

I modified the InputDataset class in babs/input_dataset.py to standardize the columns of an empty pandas.DataFrame returned when no matching data is found. This ensures a consistent structure for job processing outputs, providing essential metadata even in the absence of results.

  • In babs/input_dataset.py, within the InputDataset class, I updated the logic for returning an empty pandas.DataFrame when inclu_df is empty.
  • I expanded the columns of this empty DataFrame to consistently include job_id, task_id, and has_results.
  • For session level processing, the columns now include ['job_id', 'task_id', 'sub_id', 'ses_id', 'has_results'].
  • For other processing levels, the columns now include ['job_id', 'task_id', 'sub_id', 'has_results'].


Co-authored-by: spsingleton.gwb <spsingleton.gwb@gmail.com>
@singlesp
Contributor Author

@cursoragent failed pytests:

_____________________ test_babs_update_input_data[session] _____________________
[gw3] linux -- Python 3.11.11 /opt/conda/envs/babs/bin/python

tmp_path_factory = TempPathFactory(_given_basetemp=PosixPath('/tmp/pytest-of-root/pytest-0/popen-gw3'), _trace=<pluggy._tracing.TagTracer...ba9a29e10>, _basetemp=PosixPath('/tmp/pytest-of-root/pytest-0/popen-gw3'), _retention_count=3, _retention_policy='all')
templateflow_home = PosixPath('/tmp/pytest-of-root/pytest-0/popen-gw3/TEMPLATEFLOW_HOME0')
bids_data_singlesession = '/tmp/pytest-of-root/pytest-0/popen-gw3/BIDS0/simbids'
bids_data_multisession = '/tmp/pytest-of-root/pytest-0/popen-gw3/BIDS1/simbids'
processing_level = 'session'
simbids_container_ds = PosixPath('/tmp/pytest-of-root/pytest-0/popen-gw3/my-container0')
capsys = <_pytest.capture.CaptureFixture object at 0x78ab9a730990>

    @pytest.mark.parametrize('processing_level', ['subject', 'session'])
    def test_babs_update_input_data(
        tmp_path_factory,
        templateflow_home,
        bids_data_singlesession,
        bids_data_multisession,
        processing_level,
        simbids_container_ds,
        capsys,
    ):
        """
        This is to test `babs init` on raw BIDS data.
        """
        from babs import BABSUpdate
    
        # Check the container dataset
        assert op.exists(simbids_container_ds)
        assert op.exists(op.join(simbids_container_ds, '.datalad/config'))
    
        # Check the bids input dataset:
        bids_data_source = (
            bids_data_singlesession if processing_level == 'subject' else bids_data_multisession
        )
        assert op.exists(bids_data_source)
        assert op.exists(op.join(bids_data_source, '.datalad/config'))
    
        # Clone the bids_data_singlesession dataset
        bids_data_source_clone = tmp_path_factory.mktemp('bids_data_source_clone')
        dl.clone(bids_data_source, bids_data_source_clone)
    
        # get the commit hash of the bids_data_source_clone dataset
        original_commit_hash = dl.Dataset(bids_data_source_clone).repo.get_hexsha()
    
        # Preparation of env variable `TEMPLATEFLOW_HOME`:
        os.environ['TEMPLATEFLOW_HOME'] = str(templateflow_home)
        assert os.getenv('TEMPLATEFLOW_HOME')
    
        # Get the cli of `babs init`:
        project_base = tmp_path_factory.mktemp('project')
        project_root = project_base / 'my_babs_project'
        container_name = 'simbids-0-0-3'
    
        # Use config_simbids.yaml instead of eg_fmriprep
        config_simbids_path = get_config_simbids_path()
        container_config = update_yaml_for_run(
            project_base,
            config_simbids_path.name,
            {'BIDS': str(bids_data_source_clone.absolute())},
        )
    
        # initialize the project with the original input data, check the setup, and submit a job
        babs_init_main(
            project_root=project_root,
            list_sub_file=None,
            container_ds=simbids_container_ds,
            container_name=container_name,
            container_config=container_config,
            processing_level=processing_level,
            queue='slurm',
            keep_if_failed=False,
        )
        babs_check_setup_main(project_root=project_root, job_test=True)
    
        # Make sure that "Input dataset BIDS is up to date." is printed:
        # Clear any previous output
        _ = capsys.readouterr()
        babs_update_input_data_main(project_root=project_root, dataset_name='BIDS')
        captured = capsys.readouterr()
        assert 'Input dataset BIDS is up to date.' in captured.out
    
        # Check the original inclusion dataframe:
        babs_proj = BABSUpdate(project_root=project_root)
        original_inclusion_df = babs_proj.inclusion_dataframe.copy()
>       assert not original_inclusion_df.empty
E       assert not True
E        +  where True = Empty DataFrame\nColumns: [job_id, task_id, sub_id, ses_id, has_results]\nIndex: [].empty

tests/test_update_input_data.py:143: AssertionError

and

_______________________ test_babs_init_raw_bids[session] _______________________
[gw0] linux -- Python 3.11.11 /opt/conda/envs/babs/bin/python

tmp_path_factory = TempPathFactory(_given_basetemp=PosixPath('/tmp/pytest-of-root/pytest-0/popen-gw0'), _trace=<pluggy._tracing.TagTracer...baef692d0>, _basetemp=PosixPath('/tmp/pytest-of-root/pytest-0/popen-gw0'), _retention_count=3, _retention_policy='all')
templateflow_home = PosixPath('/tmp/pytest-of-root/pytest-0/popen-gw0/TEMPLATEFLOW_HOME0')
bids_data_singlesession = '/tmp/pytest-of-root/pytest-0/popen-gw0/BIDS0/simbids'
bids_data_multisession = '/tmp/pytest-of-root/pytest-0/popen-gw0/BIDS1/simbids'
processing_level = 'session'
simbids_container_ds = PosixPath('/tmp/pytest-of-root/pytest-0/popen-gw0/my-container0')

    @pytest.mark.parametrize('processing_level', ['subject', 'session'])
    def test_babs_init_raw_bids(
        tmp_path_factory,
        templateflow_home,
        bids_data_singlesession,
        bids_data_multisession,
        processing_level,
        simbids_container_ds,
    ):
        """
        This is to test `babs init` on raw BIDS data.
        """
    
        # Check the container dataset
        assert op.exists(simbids_container_ds)
        assert op.exists(op.join(simbids_container_ds, '.datalad/config'))
    
        # Check the bids input dataset:
        assert op.exists(bids_data_singlesession)
        assert op.exists(op.join(bids_data_singlesession, '.datalad/config'))
    
        # Preparation of env variable `TEMPLATEFLOW_HOME`:
        os.environ['TEMPLATEFLOW_HOME'] = str(templateflow_home)
        assert os.getenv('TEMPLATEFLOW_HOME')
    
        # Get the cli of `babs init`:
        project_base = tmp_path_factory.mktemp('project')
        project_root = project_base / 'my_babs_project'
        container_name = 'simbids-0-0-3'
    
        # Use config_simbids.yaml instead of eg_fmriprep
        config_simbids_path = get_config_simbids_path()
        container_config = update_yaml_for_run(
            project_base,
            config_simbids_path.name,
            {
                'BIDS': bids_data_singlesession
                if processing_level == 'subject'
                else bids_data_multisession
            },
        )
    
        babs_init_opts = argparse.Namespace(
            project_root=project_root,
            list_sub_file=None,
            container_ds=simbids_container_ds,
            container_name=container_name,
            container_config=container_config,
            processing_level=processing_level,
            queue='slurm',
            keep_if_failed=False,
        )
    
        # Test error when project root already exists
        project_root.mkdir(parents=True, exist_ok=True)
        with mock.patch.object(argparse.ArgumentParser, 'parse_args', return_value=babs_init_opts):
            with pytest.raises(FileExistsError, match=r'already exists'):
                _enter_init()
    
        # Test error when parent directory doesn't exist
        non_existent_parent = project_base / 'non_existent' / 'my_babs_project'
        babs_init_opts.project_root = non_existent_parent
        with mock.patch.object(argparse.ArgumentParser, 'parse_args', return_value=babs_init_opts):
            with pytest.raises(ValueError, match=r'parent folder.*does not exist'):
                _enter_init()
    
        # Test error when parent directory doesn't exist
        babs_init_opts.project_root = project_root
        with mock.patch.object(argparse.ArgumentParser, 'parse_args', return_value=babs_init_opts):
            _enter_init()
    
        # babs check-setup:
        babs_check_setup_opts = argparse.Namespace(project_root=project_root, job_test=True)
        with mock.patch.object(
            argparse.ArgumentParser, 'parse_args', return_value=babs_check_setup_opts
        ):
            _enter_check_setup()
    
        # test babs status before submitting jobs
        babs_status_opts = argparse.Namespace(project_root=project_root)
        with mock.patch.object(argparse.ArgumentParser, 'parse_args', return_value=babs_status_opts):
            _enter_status()
    
        # babs submit:
        babs_submit_opts = argparse.Namespace(
            project_root=project_root, select=None, inclusion_file=None, count=1
        )
        with mock.patch.object(argparse.ArgumentParser, 'parse_args', return_value=babs_submit_opts):
            _enter_submit()
    
        # babs status:
        babs_status_opts = argparse.Namespace(project_root=project_root)
        with mock.patch.object(argparse.ArgumentParser, 'parse_args', return_value=babs_status_opts):
            _enter_status()
    
        finished = False
        for waitnum in [5, 8, 10, 15, 30, 60, 120]:
            time.sleep(waitnum)
            print(f'Waiting {waitnum} seconds...')
            df = squeue_to_pandas()
            print(df)
            if df.empty:
                finished = True
                break
    
        if not finished:
            raise RuntimeError('Jobs did not finish in time')
    
        # Submit the last job:
        babs_submit_opts = argparse.Namespace(
            project_root=project_root, select=None, inclusion_file=None, count=None
        )
        with mock.patch.object(argparse.ArgumentParser, 'parse_args', return_value=babs_submit_opts):
            _enter_submit()
    
        babs_merge_opts = argparse.Namespace(
            project_root=project_root, chunk_size=2000, trial_run=False
        )
        with mock.patch.object(argparse.ArgumentParser, 'parse_args', return_value=babs_merge_opts):
>           _enter_merge()

tests/test_babs_workflow.py:136: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
babs/cli.py:522: in _enter_merge
    babs_merge_main(**vars(options))
babs/cli.py:546: in babs_merge_main
    babs_proj.babs_merge(chunk_size, trial_run)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <babs.merge.BABSMerge object at 0x7abb9d561290>, chunk_size = 2000
trial_run = False

    def babs_merge(self, chunk_size=1000, trial_run=False):
        """
        This function merges results and provenance from all successfully finished jobs.
    
        Parameters
        ----------
        chunk_size: int
            Number of branches in a chunk when merging at a time.
        trial_run: bool
            Whether to run as a trial run which won't push the merging actions back to output RIA.
            This option should only be used by developers for testing purpose.
        """
    
        # First, make sure all the results branches are reflected in the results dataframe
        self._update_results_status()
    
        warning_encountered = False
        self.wtf_key_info()  # get `self.analysis_dataset_id`
        # path to `merge_ds`:
        merge_ds_path = op.join(self.project_root, 'merge_ds')
    
        if op.exists(merge_ds_path):
            raise Exception(
                "Folder 'merge_ds' already exists. `babs merge` won't proceed."
                " If you're sure you want to rerun `babs merge`,"
                ' please remove this folder before you rerun `babs merge`.'
                " Path to 'merge_ds': '" + merge_ds_path + "'. "
            )
    
        # Define (potential) text files:
        #   in 'merge_ds/code' folder
        #   as `merge_ds` should not exist at the moment,
        #   no need to check existence/remove these files.
        # define path to text file of invalid job list exists:
        fn_list_invalid_jobs = op.join(merge_ds_path, 'code', 'list_invalid_job_when_merging.txt')
        # define path to text file of files with missing content:
        fn_list_content_missing = op.join(merge_ds_path, 'code', 'list_content_missing.txt')
        # define path to printed messages from `git annex fsck`:
        # ^^ this will be absolutely used if `babs merge` does not fail:
        fn_msg_fsck = op.join(merge_ds_path, 'code', 'log_git_annex_fsck.txt')
    
        # Clone output RIA to `merge_ds`:
        print("Cloning output RIA to 'merge_ds'...")
        # get the path to output RIA:
        #   'ria+file:///path/to/BABS_project/output_ria#0000000-000-xxx-xxxxxxxx'
        output_ria_source = self.output_ria_url + '#' + self.analysis_dataset_id
        # clone: `datalad clone ${outputsource} merge_ds`
        dlapi.clone(source=output_ria_source, path=merge_ds_path)
    
        # List all branches in output RIA:
        print('\nListing all branches in output RIA...')
        list_branches_jobs = self._get_results_branches()
    
        if len(list_branches_jobs) == 0:
>           raise ValueError(
                'There is no successfully finished job yet. Please run `babs submit` first.'
            )
E           ValueError: There is no successfully finished job yet. Please run `babs submit` first.

babs/merge.py:71: ValueError


cursor bot commented Sep 16, 2025

I refined the logic within the InputDataset class in babs/input_dataset.py to correctly handle cases where tracked_files might be an empty list. This ensures that file-tracking operations only proceed when there are actual files to process, improving robustness.

  • I modified the InputDataset class in babs/input_dataset.py.
  • Specifically, I updated the condition at line 241 within the InputDataset class.
  • I introduced a new boolean variable, use_tracked, which now explicitly checks if tracked_files is not None AND if its len() is greater than zero.
  • The subsequent if statement now evaluates if use_tracked: to ensure the file processing logic only executes when actual tracked files are present, preventing issues with empty lists.
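
A sketch of the guard described above (variable names taken from the summary; the surrounding matching logic is elided):

    tracked_files = []  # e.g. what `git ls-files` returned; may be None or empty

    use_tracked = tracked_files is not None and len(tracked_files) > 0
    if use_tracked:
        pass  # match required patterns against the tracked relative paths
    else:
        pass  # fall back to globbing the filesystem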


Co-authored-by: spsingleton.gwb <spsingleton.gwb@gmail.com>