Katachi is a Python framework for validating and processing hierarchical directory structures using YAML-based schemas. It ensures your folders and files follow expected shapes, naming rules, and relationships—before any processing begins. Use it to enforce structure, catch issues early, and keep your data pipelines reliable.

katachi

Katachi is a Python package for validating, processing, and parsing directory structures against defined schemas.

Note: Katachi is currently under active development and should be considered a work in progress. APIs may change in future releases.

Features

  • 📐 Schema-based validation - Define expected directory structures using YAML
  • 🧩 Extensible architecture - Create custom validators and actions
  • 🔄 Relationship validation - Validate relationships between files (e.g., paired image/metadata files)
  • 🚀 Command-line interface - Easy to use CLI with rich formatting
  • 📋 Detailed reports - Get comprehensive validation reports

Installation

Install from PyPI:

pip install katachi

For development:

git clone https://github.com/nmicovic/katachi.git
cd katachi
make install

Quick Start

Define a schema (schema.yaml)

semantical_name: data
type: directory
pattern_name: data
children:
  - semantical_name: image
    pattern_name: "img\\d+"
    type: file
    extension: .jpg
    description: "Image files with numeric identifiers"
  - semantical_name: metadata
    pattern_name: "img\\d+"
    type: file
    extension: .json
    description: "Metadata for image files"
  - semantical_name: file_pairs_check
    type: predicate
    predicate_type: pair_comparison
    description: "Check if images have matching metadata files"
    elements:
      - image
      - metadata
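
The `pattern_name` fields are regular expressions matched against file names, with the extension handled separately by the `extension` field. As a quick sanity check with plain Python `re` (independent of Katachi, and assuming the whole file stem must match):

```python
import re

# The schema's pattern_name for image files, as a plain regex.
pattern = re.compile(r"img\d+")

# fullmatch requires the entire stem to match the pattern.
print(bool(pattern.fullmatch("img001")))    # True
print(bool(pattern.fullmatch("photo001")))  # False
```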

Validate a directory structure

katachi validate schema.yaml target_directory
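
For a quick local try-out, a tree that should satisfy the schema above can be generated with plain `pathlib`. This is an illustrative sketch, not part of Katachi; it assumes the validation target is the `data` directory itself, with file names chosen to match the `img\d+` patterns:

```python
from pathlib import Path

# Build a "data" directory of paired image/metadata files, matching
# the img\d+ pattern and the .jpg/.json extensions in the schema.
root = Path("data")
root.mkdir(exist_ok=True)
for i in range(1, 4):
    (root / f"img{i:03d}.jpg").touch()            # image file
    (root / f"img{i:03d}.json").write_text("{}")  # paired metadata

```

Because every `.jpg` gets a matching `.json`, the `pair_comparison` predicate in the schema should pass for this tree.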

Command-Line Examples

Validate a simple directory structure:

katachi validate "tests/schema_tests/test_sanity/schema.yaml" "tests/schema_tests/test_sanity/dataset"

Validate a nested directory structure:

katachi validate "tests/schema_tests/test_depth_1/schema.yaml" "tests/schema_tests/test_depth_1/dataset"

Validate paired files (e.g., ensure each .jpg has a matching .json file):

katachi validate "tests/schema_tests/test_paired_files/schema.yaml" "tests/schema_tests/test_paired_files/data"

Validate Azure Blob Storage:

# Set Azure credentials in environment variables
export AZURE_STORAGE_ACCOUNT="your_storage_account"
export AZURE_STORAGE_ACCESS_KEY="your_access_key"
# Or use SAS token
export AZURE_STORAGE_SAS_TOKEN="your_sas_token"

# Validate local schema against Azure Blob Storage
katachi validate "schema.yaml" "abfs://container/path"

# Validate schema in Azure Blob Storage against another Azure Blob Storage path
katachi validate "abfs://container/schema.yaml" "abfs://container/path"

Python API

from pathlib import Path
from katachi.schema.importer import load_yaml
from katachi.schema.validate import validate_schema

# Load schema from YAML
schema = load_yaml(Path("schema.yaml"), Path("data_directory"))

# Validate directory against schema
report = validate_schema(schema, Path("data_directory"))

# Check if validation passed
if report.is_valid():
    print("Validation successful!")
else:
    print("Validation failed with the following issues:")
    for result in report.results:
        if not result.is_valid:
            print(f"- {result.path}: {result.message}")

Using Azure Blob Storage

import os
from katachi.schema.importer import load_yaml
from katachi.schema.validate import validate_schema
from katachi.utils.fs_utils import get_filesystem

# Set Azure credentials
os.environ["AZURE_STORAGE_ACCOUNT"] = "your_storage_account"
os.environ["AZURE_STORAGE_ACCESS_KEY"] = "your_access_key"
# Or use SAS token
# os.environ["AZURE_STORAGE_SAS_TOKEN"] = "your_sas_token"

# Get filesystem for Azure Blob Storage
target_fs = get_filesystem("abfs://container/path")
schema_fs = get_filesystem("abfs://container/schema.yaml")

# Load schema from Azure Blob Storage
schema = load_yaml("schema.yaml", "path", schema_fs, target_fs)

# Validate Azure Blob Storage path against schema
report = validate_schema(schema, "path", target_fs)

# Check validation results
if report.is_valid():
    print("Validation successful!")
else:
    print("Validation failed with the following issues:")
    for result in report.results:
        if not result.is_valid:
            print(f"- {result.path}: {result.message}")

Extending Katachi

Custom validators

from pathlib import Path
from katachi.schema.schema_node import SchemaNode
from katachi.validation.core import ValidationResult, ValidatorRegistry

def my_custom_validator(node: SchemaNode, path: Path) -> ValidationResult:
    # Custom validation logic
    return ValidationResult(
        is_valid=True,
        message="Custom validation passed",
        path=path,
        validator_name="custom_validator"
    )

# Register the validator
ValidatorRegistry.register("custom_validator", my_custom_validator)

Custom file processing

from pathlib import Path
from typing import Any
from katachi.schema.actions import register_action, NodeContext
from katachi.schema.schema_node import SchemaNode

def process_image(node: SchemaNode, path: Path, parent_contexts: list[NodeContext], context: dict[str, Any]) -> None:
    # Custom image processing logic
    print(f"Processing image: {path}")
    # Access parent context if needed
    for parent_node, parent_path in parent_contexts:
        if parent_node.semantical_name == "timestamp":
            print(f"Image from date: {parent_path.name}")
            break

# Register the action
register_action("image", process_image)

Contributing

Contributions are welcome! See CONTRIBUTING.md for details.

License

This project is licensed under the terms of the MIT License.
