Consider using os.walk or os.scandir rather than pathlib #116

@jarshwah

Trailrunner uses pathlib to iteratively walk the tree. Internally pathlib uses os.scandir but, crucially, pathlib throws away all of the extra metadata and just returns the Path objects.

That means that calls to child.is_file() and child.is_dir() will perform extra syscalls.
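To make the difference concrete, here is a minimal sketch (not Trailrunner's code) that classifies the children of one directory both ways. The function names are hypothetical. The pathlib version typically needs a stat per type check, while the os.scandir version can usually answer from the type information returned with the directory listing, depending on platform and filesystem.

```python
import os
from pathlib import Path


def classify_with_pathlib(directory):
    """Classify children via Path objects; is_dir()/is_file() typically stat each path."""
    files, dirs = [], []
    for child in Path(directory).iterdir():
        if child.is_dir():
            dirs.append(child.name)
        elif child.is_file():
            files.append(child.name)
    return sorted(files), sorted(dirs)


def classify_with_scandir(directory):
    """Classify children via DirEntry, which can reuse type info from the directory scan."""
    files, dirs = [], []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_dir():
                dirs.append(entry.name)
            elif entry.is_file():
                files.append(entry.name)
    return sorted(files), sorted(dirs)
```

Both functions should return the same result; only the number of syscalls behind them is expected to differ.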

If we switch to using os.walk then we can eliminate the additional is_file()/is_dir() syscalls, but we still incur the cost of an additional syscall to check whether a file is a symlink (2 per file).

If we switch to using os.scandir directly and ignore symlinks altogether, then we can get down to 1 syscall per file.
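One way to make the "ignore symlinks" choice explicit is to pass follow_symlinks=False to the DirEntry checks. This is a sketch under that assumption, not the variant that was benchmarked in this issue, and the function name is hypothetical.

```python
import os


def iter_py_files_skip_symlinks(root_path):
    """Yield paths of .py files under root_path, never following symlinks."""
    stack = [root_path]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                # With follow_symlinks=False a symlink reports False for both
                # checks, so it is skipped without a separate is_symlink() call.
                if entry.is_file(follow_symlinks=False):
                    if entry.name.endswith(".py"):
                        yield entry.path
                elif entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
```

The explicit stack replaces recursion, so the traversal order differs from a recursive walk, but the set of files found should be the same.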

With my testing on a tree containing 41111 Python files (in nested directories 6 deep), only returning Python files, I get the following results (note that the exact number of syscalls is arbitrary; the relative magnitude is what matters):

  1. Using Path.iterdir: 0.38s (user) @ 399980 file system syscalls
  2. Using os.walk: 0.21s @ 104018 file system syscalls
  3. Using os.scandir (ignoring symlinks!): 0.10s @ 64877 file system syscalls

To measure syscalls on macOS I'm using: sudo ktrace trace -S -f C3 -c python <thetest>.py | wc -l
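For the timing side, a small harness along these lines could be used. This is only a sketch: the name measure is hypothetical, it reports user CPU time and wall-clock time for one run, and it does not count syscalls (that still needs an external tracer such as the ktrace command above).

```python
import os
import time


def measure(walker, root_path):
    """Exhaust walker(root_path) and return (match_count, user_cpu_seconds, wall_seconds)."""
    cpu_before = os.times().user
    wall_before = time.perf_counter()
    # Consume the generator fully so the whole tree is actually walked.
    count = sum(1 for _ in walker(root_path))
    wall_elapsed = time.perf_counter() - wall_before
    cpu_elapsed = os.times().user - cpu_before
    return count, cpu_elapsed, wall_elapsed
```

Any callable that takes a root path and returns an iterable of matches can be passed as walker. A single run is noisy, so repeating the measurement and warming the filesystem cache first is probably worthwhile.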

Ultimately, does this matter? Probably not for most people. Switching to scandir is somewhere between 3x and 4x faster, but unless you're on a system where syscalls are expensive (remote filesystems or network shares) and you have a large directory structure, 0.3 of a second isn't much to write home about.

The test functions themselves:

import os
from pathlib import Path

def pathlib_iter(root_path):
    root = Path(root_path).resolve()
    def gen(children):
        for child in children:
            if child.is_file():
                if child.suffix == ".py":
                    yield child
            elif child.is_dir():
                yield from gen(child.iterdir())
    yield from gen([root])

def os_walk(root_path):
    for root, dirs, files in os.walk(root_path):
        for file in files:
            if file.endswith(".py"):
                yield os.path.join(root, file)
    
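def os_scandir(root_path):
    scandir = os.scandir
    def scan(directories):
        for directory in directories:
            subdirs = []
            # Use the context manager so the scandir iterator is closed promptly.
            with scandir(directory) as scanner:
                for dir_entry in scanner:
                    # No is_symlink() check here: symlinks get no special handling.
                    if dir_entry.is_file():
                        if dir_entry.name.endswith(".py"):
                            yield dir_entry.path
                    elif dir_entry.is_dir():
                        subdirs.append(dir_entry.path)
            yield from scan(subdirs)
    yield from scan([root_path])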
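
Note that the three functions above do not return identical values: pathlib_iter yields Path objects under a resolved root, while the other two yield strings built from root_path as given. Before comparing timings it may be worth checking that they find the same files. A possible sketch, with hypothetical helper names:

```python
import os


def normalized(paths):
    """Convert str or Path results into a set of canonical absolute path strings."""
    return {os.path.realpath(os.fspath(p)) for p in paths}


def same_results(root_path, *walkers):
    """Return True if every walker yields the same set of files for root_path."""
    results = [normalized(walker(root_path)) for walker in walkers]
    return all(result == results[0] for result in results)
```

For example, same_results(root, pathlib_iter, os_walk, os_scandir) should be True on a tree without symlinks. With symlinks present the sets may legitimately differ, since realpath collapses links and the walkers treat them differently.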
