A single common PyTableProvider that can be created either via a pycapsule or into_view #1239

@kosiew

Description


Background

Users of datafusion-python can create table providers through at least two different pathways:

  1. A table provider created via PyCapsule (i.e. custom providers implemented in Rust or elsewhere, exposed to Python via __datafusion_table_provider__).
  2. A view-based provider created via into_view() on an existing DataFrame, then registered with a SessionContext.

These two types serve similar roles (they supply data / logical plans to DataFusion), but currently they behave differently, which can lead to confusion.


Problem / Confusion

  • A user might reasonably expect that a table provider object created with into_view() could be registered with a session context in the same way as a PyCapsule‐exposed provider, but that may not always work (or may not be documented).
  • There is a risk of mismatch in how the internals treat providers from the two sources (views vs PyCapsules).
  • Without a unified type or interface, it’s unclear whether certain operations should/can be supported for both.
  • The divergence might cause unexpected errors or surprising behavior for the user, especially around registration, reuse, or compatibility of providers.

Desired Behavior / Suggestion

  • Define a single common PyTableProvider (or similarly named abstraction) that works identically whether created via into_view() or via a PyCapsule / external source.

  • Ensure that the SessionContext.register_table_provider(...) accepts this common type regardless of source.

  • Document clearly:

    • what kinds of table providers are accepted (views, PyCapsules, external)
    • how to obtain them from each path
    • equivalence or limitations (if any)
  • Possibly enhance the implementation so that a view-based provider can be converted (or wrapped) into the same internal abstraction that a PyCapsule provider uses.
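
To make the suggestion concrete, here is a minimal pure-Python sketch of what such a common abstraction could look like. It is not the datafusion-python implementation: `PyTableProvider.coerce`, `from_capsule`, `from_view`, and the `Fake*` classes are hypothetical names invented for illustration, and the stand-ins only mimic the two entry points mentioned above (`__datafusion_table_provider__`, assumed here to be a callable, and `into_view()`).

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any

CAPSULE_ATTR = "__datafusion_table_provider__"


@dataclass(frozen=True)
class PyTableProvider:
    """Hypothetical common wrapper; NOT the real datafusion-python class."""

    inner: Any   # whatever the underlying pathway produced
    source: str  # "capsule" or "view", kept only for illustration

    @classmethod
    def from_capsule(cls, obj: Any) -> "PyTableProvider":
        # Assumption: the exporting object exposes the capsule via a callable.
        export = getattr(obj, CAPSULE_ATTR, None)
        if not callable(export):
            raise TypeError(f"{type(obj).__name__} does not define {CAPSULE_ATTR}")
        return cls(inner=export(), source="capsule")

    @classmethod
    def from_view(cls, df: Any) -> "PyTableProvider":
        # Assumption: DataFrame-like objects expose into_view().
        return cls(inner=df.into_view(), source="view")

    @classmethod
    def coerce(cls, obj: Any) -> "PyTableProvider":
        """Accept either pathway and return the single common type."""
        if isinstance(obj, cls):
            return obj
        if hasattr(obj, CAPSULE_ATTR):
            return cls.from_capsule(obj)
        if hasattr(obj, "into_view"):
            return cls.from_view(obj)
        raise TypeError(f"cannot build a table provider from {type(obj).__name__}")


class FakeCapsuleProvider:
    """Stand-in for a provider implemented in Rust or elsewhere."""

    def __datafusion_table_provider__(self) -> object:
        return object()  # a real provider would return a PyCapsule


class FakeDataFrame:
    """Stand-in for a DataFrame."""

    def into_view(self) -> object:
        return object()


class FakeSessionContext:
    """Stand-in context whose registration accepts either source."""

    def __init__(self) -> None:
        self.tables: dict[str, PyTableProvider] = {}

    def register_table_provider(self, name: str, provider: Any) -> None:
        self.tables[name] = PyTableProvider.coerce(provider)


ctx = FakeSessionContext()
ctx.register_table_provider("from_capsule", FakeCapsuleProvider())
ctx.register_table_provider("from_view", FakeDataFrame())
sources = {name: provider.source for name, provider in ctx.tables.items()}
```

In this sketch both registrations go through the same `coerce` step, so the context only ever stores one type. Whether the real fix should wrap on the Python side like this or unify the types on the Rust side is left open.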


Benefits

  • Reduced confusion for users.
  • More consistency in the API.
  • Easier to reason about table providers across different parts of a codebase.
  • Potentially fewer bugs when mixing providers from different sources.

Context

This issue is motivated by this comment: #1016 (comment)
