
entiny

entiny takes larger-than-memory data and makes it small.

entiny is a subset selection package that implements the Information-Based Optimal Subdata Selection (IBOSS) algorithm.

Features

  • 🐘 Larger-than-memory implementation suitable for large datasets
  • 🐰 Automatic detection and handling of stratification variables
  • 🥗 Support for both CSV and Parquet file formats
  • 🤖 Command-line interface for easy usage

Installation

# Install the package with pip
pip install entiny

# The CLI command 'entiny' will be automatically installed
# Verify the installation
entiny --help

Installing the package also adds the entiny command to your PATH; run entiny --help to see the available options.

Quick Start

import polars as pl
from entiny import entiny

a = pl.int_range(1, 30, eager=True)
df = pl.DataFrame({"a": a})

b = df.select(pl.col("a").shuffle(seed=1))
c = df.select(pl.col("a").shuffle(seed=2))

df = df.with_columns(
    b=b.to_series(),
    c=c.to_series()
)

print(df)

# "1" will select the row with the largest and smallest value from each column.
# The height of the final dataframe will be n * 2 * number of columns
df_entiny= entiny(df, n=1).collect()

print(df_entiny)

Here, with three numeric columns and n=1, the result has 1 * 2 * 3 = 6 rows. A second example, this time with a stratification column:

import polars as pl
import numpy as np
from entiny import entiny

# Create or load your data
df = pl.DataFrame({
    "category": ["A", "A", "B", "B"] * 250,
    "value1": np.random.normal(0, 1, 1000),
    "value2": np.random.uniform(-5, 5, 1000)
})

# Sample extreme values
# This will automatically detect "category" as a stratum
# and sample extreme values within each category
result = entiny(df, n=10).collect()
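
The height formula above applies within each stratum, so the result here has up to n * 2 * (number of numeric columns) rows per category:

# With n=10 and two numeric columns: up to 10 * 2 * 2 = 40 rows per
# category, i.e. at most 80 rows in total
print(result.height)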

Usage

Python API

from entiny import entiny

# From a DataFrame
result = entiny(df, n=10).collect()

# From a CSV file
result = entiny("data.csv", n=10).collect()

# From a Parquet file
result = entiny("data.parquet", n=10).collect()

# With custom options
result = entiny(
    data=df,
    n=10,                    # Number of extreme values to select from each end
    seed=42,                 # For reproducibility
    show_progress=True       # Show progress bars
).collect()

Command Line Interface

# Basic usage
entiny -i input.csv -o output.csv -n 10

# With all options
entiny \
    --input data.csv \
    --output sampled.csv \
    --n 10 \
    --seed 42 \
    --no-progress  # Optional: disable progress bars

How It Works

  1. Automatic Feature Detection:

    • Numeric columns are used for sampling extreme values
    • String/categorical columns are automatically detected as strata (a sketch of this rule follows the list)
  2. Stratified Sampling:

    • If categorical columns are present, sampling is performed within each stratum
    • For each numeric variable in each stratum:
      • Selects n highest values
      • Selects n lowest values
  3. Memory Efficiency:

    • Uses Polars' lazy evaluation
    • Processes data in chunks
    • Minimizes memory usage for large datasets
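
To make the detection rule concrete, here is a minimal sketch using Polars dtypes. The helper split_columns is hypothetical and is not entiny's actual internals:

import polars as pl

# Hypothetical helper illustrating the detection rule described above
def split_columns(df: pl.DataFrame) -> tuple[list[str], list[str]]:
    # Numeric dtypes are candidates for extreme-value sampling
    numeric = [name for name, dtype in df.schema.items() if dtype.is_numeric()]
    # String/categorical dtypes are treated as strata
    strata = [name for name, dtype in df.schema.items()
              if dtype in (pl.String, pl.Categorical)]
    return numeric, strata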

Example with Stratification

import polars as pl
import numpy as np
from entiny import entiny

# Create a dataset with multiple strata
df = pl.DataFrame({
    "region": ["North", "South"] * 500,
    "category": ["A", "B", "A", "B"] * 250,
    "sales": np.random.lognormal(0, 1, 1000),
    "quantity": np.random.poisson(5, 1000)
})

# Sample extreme values
# Will automatically detect "region" and "category" as strata
result = entiny(df, n=5).collect()
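
To see how many rows each stratum contributed, group the collected result (group_by and len are standard Polars; the counts themselves will vary with your data):

print(result.group_by(["region", "category"]).len())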

Performance Considerations

  • Uses Polars for high-performance data operations
  • Lazy evaluation minimizes memory usage (see the example below)
  • Progress bars show operation status
  • Efficient handling of large datasets through streaming
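
Since every call in this README ends with .collect(), entiny returns a Polars LazyFrame. A minimal sketch of what that buys you (the file name and the "sales" column are placeholders):

from entiny import entiny

# Nothing is read or computed yet: entiny returns a lazy query
lazy_result = entiny("big.parquet", n=10)

# Further lazy operations can be chained before materializing the small result
small_df = lazy_result.sort("sales").collect()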

About IBOSS

IBOSS is a very simple subset selection method that works well in regression-like settings.

Information Gain

The core algorithm, rendered as runnable Python (selecting whole rows, as entiny does):

import polars as pl

def iboss(df: pl.DataFrame, k: int) -> pl.DataFrame:
    # Assumes all columns of df are numeric, as in the IBOSS setting
    # Collect the selected rows for every column (parameter)
    pieces = []
    for column in df.columns:
        # Sort rows by this column in ascending order
        sorted_df = df.sort(column)

        # Select the k rows with the smallest values
        pieces.append(sorted_df.head(k))

        # Select the k rows with the largest values
        pieces.append(sorted_df.tail(k))

    # A row can be extreme in more than one column; drop duplicates
    return pl.concat(pieces).unique()
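
Continuing from the function above, a toy usage sketch (the data is made up):

df = pl.DataFrame({"x": [5, 1, 9, 3], "y": [2.0, 8.0, 4.0, 6.0]})
print(iboss(df, k=1))  # the rows holding the min and max of x and of y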

Note: If the majority of your columns are numeric, then this is a great fit. For tabular data that is mostly categorical, look at Data Nuggets.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License
