
entiny

entiny takes larger-than-memory data and makes it small.

entiny is a subset selection package that implements the Information-Based Optimal Subdata Selection (IBOSS) algorithm.

Features

  • 🐘 Larger-than-memory implementation suitable for large datasets
  • 🐰 Automatic detection and handling of stratification variables
  • 🥗 Support for both CSV and Parquet file formats
  • 🤖 Command-line interface for easy usage

Installation

# Install the package with pip
pip install entiny

# The CLI command 'entiny' will be automatically installed
# Verify the installation
entiny --help

Installing the package also adds the entiny command to your PATH; run entiny --help to see the available options.

Quick Start

import polars as pl
from entiny import entiny

a = pl.int_range(1, 30, eager=True)
df = pl.DataFrame({"a": a})

b = df.select(pl.col("a").shuffle(seed=1))
c = df.select(pl.col("a").shuffle(seed=2))

df = df.with_columns(
    b=b.to_series(),
    c=c.to_series()
)

print(df)

# "1" will select the row with the largest and smallest value from each column.
# The height of the final dataframe will be n * 2 * number of columns
df_entiny= entiny(df, n=1).collect()

print(df_entiny)

Here, with three numeric columns and n=1, the result has 1 * 2 * 3 = 6 rows. A second example, this time with a stratification column:

import polars as pl
import numpy as np
from entiny import entiny

# Create or load your data
df = pl.DataFrame({
    "category": ["A", "A", "B", "B"] * 250,
    "value1": np.random.normal(0, 1, 1000),
    "value2": np.random.uniform(-5, 5, 1000)
})

# Sample extreme values
# This will automatically detect "category" as a stratum
# and sample extreme values within each category
result = entiny(df, n=10).collect()
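
The height formula above applies within each stratum, so the result here has up to n * 2 * (number of numeric columns) rows per category:

# With n=10 and two numeric columns: up to 10 * 2 * 2 = 40 rows per
# category, i.e. at most 80 rows in total
print(result.height)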

Usage

Python API

from entiny import entiny

# From a DataFrame
result = entiny(df, n=10).collect()

# From a CSV file
result = entiny("data.csv", n=10).collect()

# From a Parquet file
result = entiny("data.parquet", n=10).collect()

# With custom options
result = entiny(
    data=df,
    n=10,                    # Number of extreme values to select from each end
    seed=42,                 # For reproducibility
    show_progress=True       # Show progress bars
).collect()

Command Line Interface

# Basic usage
entiny -i input.csv -o output.csv -n 10

# With all options
entiny \
    --input data.csv \
    --output sampled.csv \
    --n 10 \
    --seed 42 \
    --no-progress  # Optional: disable progress bars

How It Works

  1. Automatic Feature Detection:

    • Numeric columns are used for sampling extreme values
    • String/categorical columns are automatically detected as strata (a sketch of this rule follows the list)
  2. Stratified Sampling:

    • If categorical columns are present, sampling is performed within each stratum
    • For each numeric variable in each stratum:
      • Selects n highest values
      • Selects n lowest values
  3. Memory Efficiency:

    • Uses Polars' lazy evaluation
    • Processes data in chunks
    • Minimizes memory usage for large datasets
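
To make the detection rule concrete, here is a minimal sketch using Polars dtypes. The helper split_columns is hypothetical and is not entiny's actual internals:

import polars as pl

# Hypothetical helper illustrating the detection rule described above
def split_columns(df: pl.DataFrame) -> tuple[list[str], list[str]]:
    # Numeric dtypes are candidates for extreme-value sampling
    numeric = [name for name, dtype in df.schema.items() if dtype.is_numeric()]
    # String/categorical dtypes are treated as strata
    strata = [name for name, dtype in df.schema.items()
              if dtype in (pl.String, pl.Categorical)]
    return numeric, strata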

Example with Stratification

import polars as pl
import numpy as np
from entiny import entiny

# Create a dataset with multiple strata
df = pl.DataFrame({
    "region": ["North", "South"] * 500,
    "category": ["A", "B", "A", "B"] * 250,
    "sales": np.random.lognormal(0, 1, 1000),
    "quantity": np.random.poisson(5, 1000)
})

# Sample extreme values
# Will automatically detect "region" and "category" as strata
result = entiny(df, n=5).collect()
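
To see how many rows each stratum contributed, group the collected result (group_by and len are standard Polars; the counts themselves will vary with your data):

print(result.group_by(["region", "category"]).len())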

Performance Considerations

  • Uses Polars for high-performance data operations
  • Lazy evaluation minimizes memory usage (see the example below)
  • Progress bars show operation status
  • Efficient handling of large datasets through streaming
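
Since every call in this README ends with .collect(), entiny returns a Polars LazyFrame. A minimal sketch of what that buys you (the file name and the "sales" column are placeholders):

from entiny import entiny

# Nothing is read or computed yet: entiny returns a lazy query
lazy_result = entiny("big.parquet", n=10)

# Further lazy operations can be chained before materializing the small result
small_df = lazy_result.sort("sales").collect()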

About IBOSS

IBOSS is a very simple subset selection method that works well in regression-like settings.

Information Gain

The core algorithm, rendered as runnable Python (selecting whole rows, as entiny does):

import polars as pl

def iboss(df: pl.DataFrame, k: int) -> pl.DataFrame:
    # Assumes all columns of df are numeric, as in the IBOSS setting
    # Collect the selected rows for every column (parameter)
    pieces = []
    for column in df.columns:
        # Sort rows by this column in ascending order
        sorted_df = df.sort(column)

        # Select the k rows with the smallest values
        pieces.append(sorted_df.head(k))

        # Select the k rows with the largest values
        pieces.append(sorted_df.tail(k))

    # A row can be extreme in more than one column; drop duplicates
    return pl.concat(pieces).unique()
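
Continuing from the function above, a toy usage sketch (the data is made up):

df = pl.DataFrame({"x": [5, 1, 9, 3], "y": [2.0, 8.0, 4.0, 6.0]})
print(iboss(df, k=1))  # the rows holding the min and max of x and of y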

Note: If the majority of your columns are numeric, then this is a great fit. For tabular data that is mostly categorical, look at Data Nuggets.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License
