Extension properties docs #1246

Merged
merged 6 commits into from
Jun 13, 2025
3 changes: 3 additions & 0 deletions docs/StardustDocs/resources/example.csv
@@ -0,0 +1,3 @@
name,info
Alice,"{""age"":23,""height"":175.5}"
Bob,"{""age"":27,""height"":160.2}"
170 changes: 155 additions & 15 deletions docs/StardustDocs/topics/extensionPropertiesApi.md
@@ -1,24 +1,164 @@
[//]: # (title: Extension Properties API)

<!---IMPORT org.jetbrains.kotlinx.dataframe.samples.api.ApiLevels-->

Auto-generated extension properties are the safest and easiest way to access columns in a [`DataFrame`](DataFrame.md).
They are generated based on a [dataframe schema](schemas.md),
When working with a [`DataFrame`](DataFrame.md), the most convenient and reliable way
to access its columns — including for operations and retrieving column values
in row expressions — is through auto-generated extension properties.
They are generated based on a [dataframe schema](schemas.md),
with the name and type of properties inferred from the name and type of the corresponding columns.
They also work for hierarchical dataframes of any kind.
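
To make the idea concrete, here is a library-free toy sketch of properties whose names and types follow a schema. Everything in it (`ToyRow`, the `Person` and `PersonInfo` marker interfaces, `sampleRows`) is hypothetical and hand-written for illustration; it is not the code Kotlin DataFrame generates, only an approximation of the concept for the `name` / `info { age, height }` shape used in this page.

```kotlin
// Toy model only: NOT the Kotlin DataFrame implementation.
// Schema markers: empty interfaces used purely as type tags.
interface Person
interface PersonInfo

// A row is a map from column name to value; T says which schema it follows.
class ToyRow<T>(private val values: Map<String, Any?>) {
    operator fun get(column: String): Any? = values[column]
}

// Hand-written stand-ins for what a generator could emit from the schema
// name: String, info: { age: Int, height: Double }.
val ToyRow<Person>.name: String get() = this["name"] as String

// A nested map plays the role of a column group.
@Suppress("UNCHECKED_CAST")
val ToyRow<Person>.info: ToyRow<PersonInfo>
    get() = ToyRow(this["info"] as Map<String, Any?>)

val ToyRow<PersonInfo>.age: Int get() = this["age"] as Int
val ToyRow<PersonInfo>.height: Double get() = this["height"] as Double

fun sampleRows(): List<ToyRow<Person>> = listOf(
    ToyRow<Person>(mapOf("name" to "Alice", "info" to mapOf("age" to 23, "height" to 175.5))),
    ToyRow<Person>(mapOf("name" to "Bob", "info" to mapOf("age" to 27, "height" to 160.2))),
)

fun main() {
    // Typed, name-based access, including the nested group.
    val selected = sampleRows().filter { it.name.startsWith("A") && it.info.age >= 16 }
    println(selected.map { it.name }) // prints [Alice]
}
```

Because each property is declared on a row tagged with a specific marker, `age` is only reachable through `info`, which is roughly why name-based access of this kind can be checked at compile time.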

> The behavior of data schema generation differs between the
> [Compiler Plugin](Compiler-Plugin.md) and [Kotlin Notebook](gettingStartedKotlinNotebook.md).
>
> * In **Kotlin Notebook**, a schema is generated *only after cell execution* for
> `DataFrame` variables defined within that cell.
Collaborator

*and `DataRow`. Or do you mean "DataFrame" the library? as in "DataFrame variables". Then don't use backticks.

Collaborator Author

but DataRow not defined within that cell.
I mean, we can omit it here, it's described further.

> * With the **Compiler Plugin**, a new schema is generated *after every operation*
> — but support for all operations is still in progress.
> Retrieving the schema for a `DataFrame` read from a file or URL is *not yet supported* either.
>
> This behavior may change in future releases. See the [example](#example) below that demonstrates these differences.
Collaborator

maybe mention that we're working to bring the compiler plugin to notebooks as well?

Collaborator Author

The meaning is "ALL of these points will be improved in the future"

{style="warning"}
Collaborator

maybe warning is a bit aggressive, how about "info"?

Collaborator Author

I believe it's very crucial, so set a warning.


## Example

Consider a simple hierarchical dataframe from
<resource src="example.csv"></resource>.

This table consists of two columns: `name`, which is a `String` column, and `info`,
which is a [**column group**](DataColumn.md#columngroup) containing two nested
[value columns](DataColumn.md#valuecolumn) —
`age` of type `Int`, and `height` of type `Double`.

<table>
<thead>
<tr>
<th>name</th>
<th colspan="2">info</th>
</tr>
<tr>
<th></th>
<th>age</th>
<th>height</th>
</tr>
</thead>
<tbody>
<tr>
<td>Alice</td>
<td>23</td>
<td>175.5</td>
</tr>
<tr>
<td>Bob</td>
<td>27</td>
<td>160.2</td>
</tr>
</tbody>
</table>

<tabs>
<tab title="Kotlin Notebook">
Read the [`DataFrame`](DataFrame.md) from the CSV file:

```kotlin
val df = DataFrame.readCsv("example.csv")
```

*After cell execution*, a data schema and extension properties for this `DataFrame` are generated,
so you can use them to access columns directly
and to refer to columns in operations inside the [Column Selector DSL](ColumnSelectors.md)
and [DataRow API](DataRow.md):


```kotlin
// Get nested column
df.info.age
// Sort by multiple columns
df.sortBy { name and info.height }
// Filter rows using a row condition.
// These extensions express the exact value in the row
// with the corresponding type:
df.filter { name.startsWith("A") && info.age >= 16 }
```

If you change the dataframe's schema by changing a column's [name](rename.md)
or [type](convert.md), or by [adding](add.md) a new column, you need to
run a cell with the new [`DataFrame`](DataFrame.md) declaration first.
For example, rename the `name` column into "firstName":

```kotlin
val dfRenamed = df.rename { name }.into("firstName")
```

Having these, it allows you to work with your dataframe like:
After running the cell with the code above, you can use `firstName` extensions in the following cells:

```kotlin
dfRenamed.firstName
dfRenamed.rename { firstName }.into("name")
dfRenamed.filter { firstName == "Nikita" }
```

See the [](quickstart.md) for basic Extension Properties API examples in Kotlin Notebook.

</tab>
<tab title="Compiler Plugin">

For now, if you read [`DataFrame`](DataFrame.md) from a file or URL, you need to define its schema manually.
You can do it quickly with [`generate..()` methods](DataSchema-Data-Classes-Generation.md).

Define schemas:
```kotlin
val peopleDf /* : DataFrame<Person> */ = DataFrame.read("people.csv").cast<Person>()
val nameColumn /* : DataColumn<String> */ = peopleDf.name
val ageColumn /* : DataColumn<Int> */ = peopleDf.personData.age
@DataSchema
data class PersonInfo(
    val age: Int,
    val height: Float
)

@DataSchema
data class Person(
    val info: PersonInfo,
    val name: String
)
```
and of course

Read the `DataFrame` from the CSV file and specify the schema with
[`.convertTo()`](convertTo.md) or [`cast()`](cast.md):

```kotlin
val df = DataFrame.readCsv("example.csv").convertTo<Person>()
```

Extension properties for this `DataFrame` are generated automatically by the plugin,
so you can use them to access columns directly
and to refer to columns in operations inside the [Column Selector DSL](ColumnSelectors.md)
and [DataRow API](DataRow.md).


```kotlin
// Get nested column
df.info.age
// Sort by multiple columns
df.sortBy { name and info.height }
// Filter rows using a row condition.
// These extensions express the exact value in the row
// with the corresponding type:
df.filter { name.startsWith("A") && info.age >= 16 }
```

Moreover, new extension properties are generated on the fly after each schema change,
such as changing a column's [name](rename.md)
or [type](convert.md), or [adding](add.md) a new column.
For example, rename the `name` column into "firstName"; the `firstName` extension properties
can then be used in the operations that follow:

```kotlin
peopleDf.add("lastName") { name.split(",").last() }
    .dropNulls { personData.age }
    .filter { survived && home.endsWith("NY") && personData.age in 10..20 }
// Rename "name" column into "firstName"
df.rename { name }.into("firstName")
    // Can use `firstName` extension in the row condition
    // right after renaming
    .filter { firstName == "Nikita" }
```

To find out how to use this API in your environment, check out [Working with Data Schemas](schemas.md)
or jump straight to [Data Schemas in Gradle projects](schemasGradle.md),
or [Data Schemas in Jupyter notebooks](schemasJupyter.md).
See the [Compiler Plugin Example](https://github.com/Kotlin/dataframe/tree/plugin_example/examples/kotlin-dataframe-plugin-example)
IntelliJ IDEA project for basic Extension Properties API examples.
</tab>
</tabs>
3 changes: 3 additions & 0 deletions docs/StardustDocs/topics/guides/Guides-And-Examples.md
@@ -24,6 +24,9 @@ Explore our structured, in-depth guides to steadily improve your Kotlin DataFram

<img src="quickstart_preview.png" border-effect="rounded" width="705"/>

* [](extensionPropertiesApi.md) — learn about extension properties for [`DataFrame`](DataFrame.md)
and make working with your data both convenient and type-safe.

* [Enhanced Column Selection DSL](https://blog.jetbrains.com/kotlin/2024/07/enhanced-column-selection-dsl-in-kotlin-dataframe/)
— explore powerful DSL for typesafe and flexible column selection in Kotlin DataFrame.
* [](Kotlin-DataFrame-Features-in-Kotlin-Notebook.md)
21 changes: 13 additions & 8 deletions docs/StardustDocs/topics/guides/quickstart.md
@@ -88,8 +88,8 @@ columns.
Column selectors are widely used across operations — one of the simplest examples is `.select { }`, which returns a new
DataFrame with only the columns chosen in Columns Selection expression.

After executing the cell where a `DataFrame` variable is declared, an extension with properties for its columns is
automatically generated.
*After executing the cell* where a `DataFrame` variable is declared,
[extension properties](extensionPropertiesApi.md) for its columns are automatically generated.
These properties can then be used in the Columns Selection DSL expression for typesafe and convenient column access.

Select some columns:
@@ -104,18 +104,20 @@ dfSelected

<!---END-->

<inline-frame src="./resources/notebook_test_quickstart_5.html" width="705px" height="500px"></inline-frame>

> With a [Kotlin DataFrame Compiler Plugin](Compiler-Plugin.md) enabled,
> you can use auto-generated properties in your IntelliJ IDEA projects.

<inline-frame src="./resources/notebook_test_quickstart_5.html" width="705px" height="500px"></inline-frame>
## Row Filtering

## Raw Filtering

Some operations use `RowExpression`, i.e., expression that applies for all `DataFrame` rows. For example `.filter { }`
that returns a new `DataFrame` with rows that satisfy a condition given by row expression.
Some operations use the [DataRow API](DataRow.md), with expressions and conditions
that apply to all `DataFrame` rows.
For example, `.filter { }` returns a new `DataFrame` containing only the rows that satisfy the condition given by a row expression.

Inside a row expression, you can access the values of the current row by column names through auto-generated properties.
Similar to the Columns Selection DSL, but in this case the properties represent actual values, not column references.
This is similar to the [Columns Selection DSL](ColumnSelectors.md),
but in this case the properties represent actual values, not column references.
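
The difference between a column reference and a row value can be sketched without the library. The toy model below is an assumption-laden illustration, not Kotlin DataFrame code: `MiniFrame`, `MiniRow`, `sampleFrame`, and the `stars` column are all hypothetical names. It declares the same property name on two receivers, so `stars` means "the whole column" on a frame and "this row's value" inside a row condition.

```kotlin
// Toy model only: NOT the Kotlin DataFrame implementation.
class MiniRow(private val values: Map<String, Any?>) {
    operator fun get(column: String): Any? = values[column]
}

class MiniFrame(val rows: List<MiniRow>) {
    // A "column" here is just one field's values across all rows.
    fun column(name: String): List<Any?> = rows.map { it[name] }

    // The predicate runs with a single row as its receiver.
    fun filter(predicate: MiniRow.() -> Boolean): MiniFrame =
        MiniFrame(rows.filter { it.predicate() })
}

// Same property name, two receivers:
val MiniFrame.stars: List<Int> get() = column("stars").map { it as Int } // whole column
val MiniRow.stars: Int get() = this["stars"] as Int                      // one value

fun sampleFrame(): MiniFrame = MiniFrame(
    listOf(
        MiniRow(mapOf("repo" to "a", "stars" to 50)),
        MiniRow(mapOf("repo" to "b", "stars" to 1200)),
        MiniRow(mapOf("repo" to "c", "stars" to 3000)),
    )
)

fun main() {
    val frame = sampleFrame()
    println(frame.stars)                         // prints [50, 1200, 3000]
    // Inside the lambda, `stars` is an Int, so it can be compared directly.
    println(frame.filter { stars > 1000 }.stars) // prints [1200, 3000]
}
```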

Filter rows by "stargazers_count" value:

@@ -349,6 +351,9 @@ Ready to go deeper? Check out what’s next:

- 🧠 **Understand the design** and core concepts in the [library overview](overview.md).

- 🔤 **[Learn more about Extension Properties](extensionPropertiesApi.md)**
and make working with your data both convenient and type-safe.

- 💡 **[Use Kotlin DataFrame Compiler Plugin](Compiler-Plugin.md)**
for auto-generated column access in your IntelliJ IDEA projects.
