Extension properties docs #1246

Merged
merged 6 commits into from
Jun 13, 2025
Changes from 4 commits
3 changes: 3 additions & 0 deletions docs/StardustDocs/resources/example.csv
@@ -0,0 +1,3 @@
name,info
Alice,"{""age"":23,""height"":175.5}"
Bob,"{""age"":27,""height"":160.2}"
167 changes: 152 additions & 15 deletions docs/StardustDocs/topics/extensionPropertiesApi.md
@@ -1,24 +1,161 @@
[//]: # (title: Extension Properties API)

<!---IMPORT org.jetbrains.kotlinx.dataframe.samples.api.ApiLevels-->

Auto-generated extension properties are the safest and easiest way to access columns in a [`DataFrame`](DataFrame.md).
They are generated based on a [dataframe schema](schemas.md),
When working with a DataFrame, the most convenient and reliable way
Collaborator: *"a dataframe" (the concept), or *"a [`DataFrame`](DataFrame.md)" (the instance of the type, with link)

to access its columns — including for operations and retrieving column values
in row expressions — is through auto-generated extension properties.
They are generated based on a [dataframe schema](schemas.md),
with the name and type of properties inferred from the name and type of the corresponding columns.
It also works for all types of hierarchical dataframes
Collaborator: .
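
To make "generated extension properties" more concrete, here is a hand-written approximation of what such properties might look like for a single `name` column. This is only a sketch: the marker interface `Person`, the helper `people()`, and the exact receiver types are assumptions made for illustration, and the code actually generated by the notebook or the compiler plugin may differ. It assumes the Kotlin DataFrame library is on the classpath.

```kotlin
import org.jetbrains.kotlinx.dataframe.ColumnsContainer
import org.jetbrains.kotlinx.dataframe.DataColumn
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.DataRow
import org.jetbrains.kotlinx.dataframe.api.cast
import org.jetbrains.kotlinx.dataframe.api.dataFrameOf
import org.jetbrains.kotlinx.dataframe.api.filter

// Hypothetical marker type standing in for a generated schema.
interface Person

// Column accessor: available on the dataframe itself (returns the whole column).
@Suppress("UNCHECKED_CAST")
val ColumnsContainer<Person>.name: DataColumn<String>
    get() = this["name"] as DataColumn<String>

// Row accessor: available inside row expressions (returns the value in the current row).
val DataRow<Person>.name: String
    get() = this["name"] as String

// Small in-memory dataframe, typed with the marker interface above.
fun people(): DataFrame<Person> =
    dataFrameOf("name", "age")("Alice", 23, "Bob", 27).cast<Person>()

fun main() {
    val df = people()
    println(df.name)                            // the "name" column
    println(df.filter { name.startsWith("A") }) // rows whose name starts with "A"
}
```

The same property name resolves to a column on the dataframe and to a plain `String` inside `filter { }`, which is the distinction the text draws between column access and row expressions.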


> The behavior of data schema generation differs between the
> [Compiler Plugin](Compiler-Plugin.md) and [Kotlin Notebook](gettingStartedKotlinNotebook.md).
>
> * In the **Kotlin Notebook**, a schema is generated *only after cell execution* for
Collaborator: the

Collaborator Author: why?

Collaborator: If you're talking about the concept, it's "in notebooks", "in the notebook", or "in the Kotlin notebook". If you're talking about the product, it's a name, so "in Kotlin Notebook".

> `DataFrame` variables defined within that cell.
Collaborator: *and `DataRow`. Or do you mean "DataFrame" the library? as in "DataFrame variables". Then don't use backticks.

Collaborator Author: but DataRow not defined within that cell.
I mean, we can omit it here, it's described further.

> * With the **Compiler Plugin**, a new schema is generated *after every operation*
> — but support for all operations is still in progress.
> Retrieving the schema for a `DataFrame` read from a file or URL is *not yet supported* either.
>
> This behavior may change in future releases. See the [example](#example) below that demonstrates these differences.
Collaborator: maybe mention that we're working to bring the compiler plugin to notebooks as well?

Collaborator Author: The meaning is "ALL of these points will be improved in the future"

{style="warning"}
Collaborator: maybe warning is a bit aggressive, how about "info"?

Collaborator Author: I believe it's very crucial, so set a warning.


## Example

Consider
<resource src="example.csv"></resource>.
This table consists of two columns: `name`, which is a `String` column, and `info`,
which is a **column group** containing two nested value columns —
Collaborator: we can link to column group, value columns etc.

`age` of type `Int`, and `height` of type `Double`.

<table>
<thead>
<tr>
<th>name</th>
<th colspan="2">info</th>
</tr>
<tr>
<th></th>
<th>age</th>
<th>height</th>
</tr>
</thead>
<tbody>
<tr>
<td>Alice</td>
<td>23</td>
<td>175.5</td>
</tr>
<tr>
<td>Bob</td>
<td>27</td>
<td>160.2</td>
</tr>
</tbody>
</table>
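
For readers who want to reproduce the table above without the CSV file, the following sketch builds the same two rows in memory and nests `age` and `height` under an `info` column group. The function name `exampleTable` is hypothetical, and the snippet assumes the Kotlin DataFrame library is on the classpath. It is not how the documentation itself loads the data.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Builds the two-row example table in memory instead of reading example.csv.
fun exampleTable(): DataFrame<*> =
    dataFrameOf("name", "age", "height")(
        "Alice", 23, 175.5,
        "Bob", 27, 160.2,
    )
        // Nest the two value columns under a column group called "info".
        .group("age", "height").into("info")

fun main() {
    val df = exampleTable()
    println(df.schema()) // column names and types, including the nested group
    println(df)
}
```

Printing the schema is a quick way to check that `info` is a column group with an `Int` and a `Double` column inside it.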

<tabs>
<tab title="Kotlin Notebook">
Read the `DataFrame` from the CSV file:
Collaborator: backticks don't render this close to html in markdown/writerside. But In this case I'd write *dataframe, the concept

Collaborator: see what I wrote here some time ago #661

Collaborator Author: But here it's an object!

Collaborator: But in the sentence you treat it like a concept "Read the DataFrame from the CSV file". If you mean the object/type, you could write "Create a DataFrame (instance) by reading from the CSV file". I always imagine the word "instance" being after it. In the old sentence that doesn't make sense, because there's no DataFrame instance inside the csv file.


```kotlin
val df = DataFrame.readCsv("example.csv")
```

*After cell execution*, a data schema and extensions for this `DataFrame` are generated,
so you can use the extensions to access columns
and use them in operations inside the [Column Selector DSL](ColumnSelectors.md)
and [DataRow API](DataRow.md):


```kotlin
// Get nested column
df.info.age
// Sort by multiple columns
df.sortBy { name and info.height }
// Filter rows using a row condition.
// These extensions express the exact value in the row
// with the corresponding type:
df.filter { name.startsWith("A") && info.age >= 16 }
```

If you change the dataframe's schema by changing any column [name](rename.md)
Collaborator: *the dataframe's schema

Collaborator: the Oxford comma way of writing it is "changing any column name, or type or add a new one...". So adding a comma between summations but not before other or/ands

or [type](convert.md), or [add](add.md) a new one, you need to
run a cell with the new dataframe declaration first.
Collaborator: *dataframe

For example, rename the "name" column into "firstName":
Collaborator: renaming the `name` column into "firstName"


```kotlin
val dfRenamed = df.rename { name }.into("firstName")
```

After running the cell with the code above, you can use `firstName` extensions in the following cells:

Having these, it allows you to work with your dataframe like:
```kotlin
val peopleDf /* : DataFrame<Person> */ = DataFrame.read("people.csv").cast<Person>()
val nameColumn /* : DataColumn<String> */ = peopleDf.name
val ageColumn /* : DataColumn<Int> */ = peopleDf.personData.age
dfRenamed.firstName
dfRenamed.rename { firstName }.into("name")
dfRenamed.filter { firstName == "Nikita" }
```
and of course

See [](quickstart.md) in the Kotlin Notebook with basic Extension Properties API examples.
Collaborator: the


</tab>
<tab title="Compiler Plugin">

For now, if you read a `DataFrame` from a file or URL, you need to define its schema manually.
Collaborator: you know the gist :)

You can do it quickly with [`generate..()` methods](DataSchema-Data-Classes-Generation.md).
Collaborator: *quickly?


Define schemas:
```kotlin
@DataSchema
data class PersonInfo(
    val age: Int,
    val height: Float
)

@DataSchema
data class Person(
    val info: PersonInfo,
    val name: String
)
```

Read the `DataFrame` from the CSV file and specify the schema with `convertTo`:
Collaborator: hmm maybe also mention cast<>()? convertTo is safer but also heavier


```kotlin
val df = DataFrame.readCsv("example.csv").convertTo<Person>()
```
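
The step above uses `convertTo`, and the review comment on it raises `cast` as a lighter alternative. A minimal sketch of the two calls side by side follows. The flat schema `PersonFlat` and the helper functions are hypothetical and exist only for this illustration, and the in-memory dataframe stands in for one read from a file. As far as I understand, `cast` only changes the compile-time type, while `convertTo` matches the columns against the schema and converts values where needed. The snippet assumes the Kotlin DataFrame library is on the classpath and does not rely on generated extension properties.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.annotations.DataSchema
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical flat schema used only for this sketch.
@DataSchema
data class PersonFlat(val name: String, val age: Int)

// In-memory stand-in for a dataframe read from a file or URL.
fun untyped(): DataFrame<*> =
    dataFrameOf("name", "age")("Alice", 23, "Bob", 27)

// cast: treats the dataframe as DataFrame<PersonFlat> without converting any values.
fun viaCast(): DataFrame<PersonFlat> = untyped().cast<PersonFlat>()

// convertTo: matches the columns against the schema and converts values if needed.
fun viaConvertTo(): DataFrame<PersonFlat> = untyped().convertTo<PersonFlat>()

fun main() {
    println(viaCast().schema())
    println(viaConvertTo().schema())
}
```

Here both calls succeed because the columns already match the schema. The difference matters when the data read from a file does not line up exactly with the declared types.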

Extensions for this `DataFrame` will be generated automatically by the plugin,
Collaborator: *the plugin

so you can use the extensions to access columns
and use them in operations inside the [Column Selector DSL](ColumnSelectors.md)
and [DataRow API](DataRow.md).


```kotlin
// Get nested column
df.info.age
// Sort by multiple columns
df.sortBy { name and info.height }
// Filter rows using a row condition.
// These extensions express the exact value in the row
// with the corresponding type:
df.filter { name.startsWith("A") && info.age >= 16 }
```

Moreover, new extensions will be generated on the fly after each schema change:
changing any column [name](rename.md)
or [type](convert.md), or [adding](add.md) a new one.
Collaborator: oxford comma

For example, rename the "name" column into "firstName"; the `firstName` extensions can then be used
Collaborator: `name` column

Collaborator Author: My logic is simple:
name is an extension property (DataColumn/row value).
"name" is a column.

Collaborator: hmm but then "" could both refer to a column and a column name, which is also confusing IMO. I'd be okay with the column and the column accessor being written the same.

in the following operations:

```kotlin
peopleDf.add("lastName") { name.split(",").last() }
    .dropNulls { personData.age }
    .filter { survived && home.endsWith("NY") && personData.age in 10..20 }
// Rename "name" column into "firstName"
df.rename { name }.into("firstName")
    // Can use `firstName` extension in the row condition
    // right after renaming
    .filter { firstName == "Nikita" }
```

To find out how to use this API in your environment, check out [Working with Data Schemas](schemas.md)
or jump straight to [Data Schemas in Gradle projects](schemasGradle.md),
or [Data Schemas in Jupyter notebooks](schemasJupyter.md).
See [Kotlin DataFrame Compiler Plugin Example](https://github.com/Kotlin/dataframe/tree/plugin_example/examples/kotlin-dataframe-plugin-example)
Collaborator: can you make the name a bit shorter here as well?

IDEA project with basic Extension Properties API examples.
</tab>
</tabs>
3 changes: 3 additions & 0 deletions docs/StardustDocs/topics/guides/Guides-And-Examples.md
@@ -24,6 +24,9 @@ Explore our structured, in-depth guides to steadily improve your Kotlin DataFram

<img src="quickstart_preview.png" border-effect="rounded" width="705"/>

* [](extensionPropertiesApi.md) — learn about extension properties for `DataFrame`
Collaborator: if you're referring to the library "DataFrame", if you're referring to the type "DataFrame"

Collaborator Author: What? DataFrame is a type/object here, and there are extensions for it.

Collaborator: then please add a link :)

and make working with your data both convenient and type-safe.

* [Enhanced Column Selection DSL](https://blog.jetbrains.com/kotlin/2024/07/enhanced-column-selection-dsl-in-kotlin-dataframe/)
— explore powerful DSL for typesafe and flexible column selection in Kotlin DataFrame.
* [](Kotlin-DataFrame-Features-in-Kotlin-Notebook.md)
19 changes: 12 additions & 7 deletions docs/StardustDocs/topics/guides/quickstart.md
@@ -88,8 +88,8 @@ columns.
Column selectors are widely used across operations — one of the simplest examples is `.select { }`, which returns a new
DataFrame with only the columns chosen in Columns Selection expression.

After executing the cell where a `DataFrame` variable is declared, an extension with properties for its columns is
automatically generated.
After executing the cell where a `DataFrame` variable is declared,
Collaborator: or DataRow

Collaborator Author: unnecessary information IMHO. I mean there's mention about Row API below, I think it's enough for begin.

[extension properties](extensionPropertiesApi.md) for its columns are automatically generated.
These properties can then be used in the Columns Selection DSL expression for typesafe and convenient column access.

Select some columns:
@@ -104,15 +104,17 @@ dfSelected

<!---END-->

<inline-frame src="./resources/notebook_test_quickstart_5.html" width="705px" height="500px"></inline-frame>

> With a [Kotlin DataFrame Compiler Plugin](Compiler-Plugin.md) enabled,
> you can use auto-generated properties in your IntelliJ IDEA projects.

<inline-frame src="./resources/notebook_test_quickstart_5.html" width="705px" height="500px"></inline-frame>
## Row Filtering

## Raw Filtering

Some operations use `RowExpression`, i.e., expression that applies for all `DataFrame` rows. For example `.filter { }`
that returns a new `DataFrame` with rows that satisfy a condition given by row expression.
Some operations use the [DataRow API](DataRow.md), with expressions and conditions
Collaborator: the

that apply to all `DataFrame` rows.
For example, `.filter { }` returns a new `DataFrame` with rows
Collaborator: that

Collaborator: random backslash?

that satisfy a condition given by row expression.

Inside a row expression, you can access the values of the current row by column names through auto-generated properties.
This is similar to the Columns Selection DSL, but in this case the properties represent actual values, not column references.
@@ -349,6 +351,9 @@ Ready to go deeper? Check out what’s next:

- 🧠 **Understand the design** and core concepts in the [library overview](overview.md).

- 🔤 **[Learn more about Extension Properties](extensionPropertiesApi.md)**
and make working with your data both convenient and type-safe.

- 💡 **[Use Kotlin DataFrame Compiler Plugin](Compiler-Plugin.md)**
for auto-generated column access in your IntelliJ IDEA projects.
