From 71c360143a8e3e89177d6b8b37ecc097894ff93f Mon Sep 17 00:00:00 2001 From: "andrei.kislitsyn" Date: Wed, 11 Jun 2025 18:25:59 +0400 Subject: [PATCH 1/5] improve extensionPropertiesApi.md --- docs/StardustDocs/resources/example.csv | 3 + .../topics/extensionPropertiesApi.md | 167 ++++++++++++++++-- 2 files changed, 155 insertions(+), 15 deletions(-) create mode 100644 docs/StardustDocs/resources/example.csv diff --git a/docs/StardustDocs/resources/example.csv b/docs/StardustDocs/resources/example.csv new file mode 100644 index 0000000000..029c836f3a --- /dev/null +++ b/docs/StardustDocs/resources/example.csv @@ -0,0 +1,3 @@ +name,info +Alice,"{""age"":23,""height"":175.5}" +Bob,"{""age"":27,""height"":160.2}" diff --git a/docs/StardustDocs/topics/extensionPropertiesApi.md b/docs/StardustDocs/topics/extensionPropertiesApi.md index 8bbd73b096..1fb8cde140 100644 --- a/docs/StardustDocs/topics/extensionPropertiesApi.md +++ b/docs/StardustDocs/topics/extensionPropertiesApi.md @@ -1,24 +1,161 @@ [//]: # (title: Extension Properties API) - - -Auto-generated extension properties are the safest and easiest way to access columns in a [`DataFrame`](DataFrame.md). -They are generated based on a [dataframe schema](schemas.md), +When working with a DataFrame, the most convenient and reliable way +to access its columns — including for operations and retrieving column values +in row expressions — is through auto-generated extension properties. +They are generated based on a [dataframe schema](schemas.md), with the name and type of properties inferred from the name and type of the corresponding columns. +It also works for all types of hierarchical dataframes + +> The behavior of data schema generation differs between the +> [Compiler Plugin](Compiler-Plugin.md) and [Kotlin Notebook](gettingStartedKotlinNotebook.md). +> +> * In the **Kotlin Notebook**, a schema is generated *only after cell execution* for +> `DataFrame` variables defined within that cell. 
+> * With the **Compiler Plugin**, a new schema is generated *after every operation* +> — but support for all operations is still in progress. +> Retrieving the schema for `DataFrame` read from a file or URL is *not yet supported* either. +> +> This behavior may change in future releases. See the [example](#example) below that demonstrates these differences. +{style="warning"} + +## Example + +Consider +. +This table consists of two columns: `name`, which is a `String` column, and `info`, +which is a **column group** containing two nested value columns — +`age` of type `Int`, and `height` of type `Double`. + + + + + + + + + + + + + + + + + + + + + + + + + +
+| name  | info.age | info.height |
+|-------|----------|-------------|
+| Alice | 23       | 175.5       |
+| Bob   | 27       | 160.2       |
+ + + +Read the `DataFrame` from the CSV file: + +```kotlin +val df = DataFrame.readCsv("example.csv") +``` + +*After cell execution* data schema and extensions for this `DataFrame` will be generated +so you can use extensions for accessing columns, +using it in operations inside the [Column Selector DSL](ColumnSelectors.md) +and [DataRow API](DataRow.md): + + +```kotlin +// Get nested column +df.info.age +// Sort by multiple columns +df.sortBy { name and info.height } +// Filter rows using a row condition. +// These extensions express the exact value in the row +// with the corresponding type: +df.filter { name.startsWith("A") && info.age >= 16 } +``` + +If you change DataFrame schema by changing any column [name](rename.md) +or [type](convert.md), or [add](add.md) a new one, you need to +run a cell with a new DataFrame declaration first. +For example, rename the "name" column into "firstName": + +```kotlin +val dfRenamed = df.rename { name }.into("firstName") +``` + +After running the cell with the code above, you can use `firstName` extensions in the following cells: -Having these, it allows you to work with your dataframe like: ```kotlin -val peopleDf /* : DataFrame */ = DataFrame.read("people.csv").cast() -val nameColumn /* : DataColumn */ = peopleDf.name -val ageColumn /* : DataColumn */ = peopleDf.personData.age +dfRenamed.firstName +dfRenamed.rename { firstName }.into("name") +dfRenamed.filter { firstName == "Nikita" } ``` -and of course + +See [](quickstart.md) in the Kotlin Notebook with basic Extension Properties API examples. + + + + +For now, if you read `DatFrame` from a file or URL, you need to define its schema manually. +You can do it fast with [`generate..()` methods](DataSchema-Data-Classes-Generation.md). 
+ +Define schemas: +```kotlin +@DataSchema +data class PersonInfo( + val age: Int, + val height: Float +) + +@DataSchema +data class Person( + val info: PersonInfo, + val name: String +) +``` + +Read the `DataFrame` from the CSV file and specify the schema with `convertTo`: + +```kotlin +val df = DataFrame.readCsv("example.csv").convertTo() +``` + +Extensions for this `DataFrame` will be generated automatically by plugin, +so you can use extensions for accessing columns, +using it in operations inside the [Column Selector DSL](ColumnSelectors.md) +and [DataRow API](DataRow.md). + + +```kotlin +// Get nested column +df.info.age +// Sort by multiple columns +df.sortBy { name and info.height } +// Filter rows using a row condition. +// These extensions express the exact value in the row +// with the corresponding type: +df.filter { name.startsWith("A") && info.age >= 16 } +``` + +Moreover, new extensions will be generated on-the-fly after each schema change: +by changing any column [name](rename.md) +or [type](convert.md), or [add](add.md) a new one. +For example, rename the "name" column into "firstName" and then we can use `firstName` extensions +in the following operations: + ```kotlin -peopleDf.add("lastName") { name.split(",").last() } - .dropNulls { personData.age } - .filter { survived && home.endsWith("NY") && personData.age in 10..20 } +// Rename "name" column into "firstName" +df.rename { name }.into("firstName") + // Can use `firstName` extension in the row condition + // right after renaming + .filter { firstName == "Nikita" } ``` -To find out how to use this API in your environment, check out [Working with Data Schemas](schemas.md) -or jump straight to [Data Schemas in Gradle projects](schemasGradle.md), -or [Data Schemas in Jupyter notebooks](schemasJupyter.md). 
+See [Kotlin DataFrame Compiler Plugin Example](https://github.com/Kotlin/dataframe/tree/plugin_example/examples/kotlin-dataframe-plugin-example) +IDEA project with basic Extension Properties API examples. + + From 6ed9e24f28aae487d589fbc94e235880ff3c99d8 Mon Sep 17 00:00:00 2001 From: "andrei.kislitsyn" Date: Wed, 11 Jun 2025 18:32:17 +0400 Subject: [PATCH 2/5] quickstart improvements --- docs/StardustDocs/topics/guides/quickstart.md | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/docs/StardustDocs/topics/guides/quickstart.md b/docs/StardustDocs/topics/guides/quickstart.md index 5840671b03..046656f191 100644 --- a/docs/StardustDocs/topics/guides/quickstart.md +++ b/docs/StardustDocs/topics/guides/quickstart.md @@ -88,8 +88,8 @@ columns. Column selectors are widely used across operations — one of the simplest examples is `.select { }`, which returns a new DataFrame with only the columns chosen in Columns Selection expression. -After executing the cell where a `DataFrame` variable is declared, an extension with properties for its columns is -automatically generated. +After executing the cell where a `DataFrame` variable is declared, +[extension properties](extensionPropertiesApi.md) for its columns are automatically generated. These properties can then be used in the Columns Selection DSL expression for typesafe and convenient column access. Select some columns: @@ -104,15 +104,17 @@ dfSelected + + > With a [Kotlin DataFrame Compiler Plugin](Compiler-Plugin.md) enabled, > you can use auto-generated properties in your IntelliJ IDEA projects. - - -## Raw Filtering +## Row Filtering -Some operations use `RowExpression`, i.e., expression that applies for all `DataFrame` rows. For example `.filter { }` -that returns a new `DataFrame` with rows that satisfy a condition given by row expression. +Some operations use [DataRow API](DataRow.md), with expressions and conditions +that apply for all `DataFrame` rows. 
+For example, `.filter { }` that returns a new `DataFrame` with rows \ +that satisfy a condition given by row expression. Inside a row expression, you can access the values of the current row by column names through auto-generated properties. Similar to the Columns Selection DSL, but in this case the properties represent actual values, not column references. From 3b4994c0ec54735ecd61fd692e77a87e0e954979 Mon Sep 17 00:00:00 2001 From: "andrei.kislitsyn" Date: Wed, 11 Jun 2025 18:38:21 +0400 Subject: [PATCH 3/5] more extension properties topic refs --- docs/StardustDocs/topics/guides/Guides-And-Examples.md | 3 +++ docs/StardustDocs/topics/guides/quickstart.md | 3 +++ 2 files changed, 6 insertions(+) diff --git a/docs/StardustDocs/topics/guides/Guides-And-Examples.md b/docs/StardustDocs/topics/guides/Guides-And-Examples.md index 49dfed9434..ff22712f91 100644 --- a/docs/StardustDocs/topics/guides/Guides-And-Examples.md +++ b/docs/StardustDocs/topics/guides/Guides-And-Examples.md @@ -24,6 +24,9 @@ Explore our structured, in-depth guides to steadily improve your Kotlin DataFram +* [](extensionPropertiesApi.md) — learn about extension properties for `DataFrame` +and make working with your data both convenient and type-safe. + * [Enhanced Column Selection DSL](https://blog.jetbrains.com/kotlin/2024/07/enhanced-column-selection-dsl-in-kotlin-dataframe/) — explore powerful DSL for typesafe and flexible column selection in Kotlin DataFrame. * [](Kotlin-DataFrame-Features-in-Kotlin-Notebook.md) diff --git a/docs/StardustDocs/topics/guides/quickstart.md b/docs/StardustDocs/topics/guides/quickstart.md index 046656f191..b0323e9a38 100644 --- a/docs/StardustDocs/topics/guides/quickstart.md +++ b/docs/StardustDocs/topics/guides/quickstart.md @@ -351,6 +351,9 @@ Ready to go deeper? Check out what’s next: - 🧠 **Understand the design** and core concepts in the [library overview](overview.md). 
+- 🔤 **[Learn more about Extension Properties](extensionPropertiesApi.md)** + and make working with your data both convenient and type-safe. + - 💡 **[Use Kotlin DataFrame Compiler Plugin](Compiler-Plugin.md)** for auto-generated column access in your IntelliJ IDEA projects. From ba442a1b6d64627575fa0344a674bf9e2c85a7eb Mon Sep 17 00:00:00 2001 From: "andrei.kislitsyn" Date: Thu, 12 Jun 2025 17:10:17 +0400 Subject: [PATCH 4/5] fixes in extensionPropertiesApi.md --- .../topics/extensionPropertiesApi.md | 39 ++++++++++--------- 1 file changed, 21 insertions(+), 18 deletions(-) diff --git a/docs/StardustDocs/topics/extensionPropertiesApi.md b/docs/StardustDocs/topics/extensionPropertiesApi.md index 1fb8cde140..3a98fe1037 100644 --- a/docs/StardustDocs/topics/extensionPropertiesApi.md +++ b/docs/StardustDocs/topics/extensionPropertiesApi.md @@ -1,16 +1,16 @@ [//]: # (title: Extension Properties API) -When working with a DataFrame, the most convenient and reliable way +When working with a [`DataFrame`](DataFrame.md), the most convenient and reliable way to access its columns — including for operations and retrieving column values in row expressions — is through auto-generated extension properties. They are generated based on a [dataframe schema](schemas.md), with the name and type of properties inferred from the name and type of the corresponding columns. -It also works for all types of hierarchical dataframes +It also works for all types of hierarchical dataframes. > The behavior of data schema generation differs between the > [Compiler Plugin](Compiler-Plugin.md) and [Kotlin Notebook](gettingStartedKotlinNotebook.md). > -> * In the **Kotlin Notebook**, a schema is generated *only after cell execution* for +> * In **Kotlin Notebook**, a schema is generated *only after cell execution* for > `DataFrame` variables defined within that cell. 
> * With the **Compiler Plugin**, a new schema is generated *after every operation* > — but support for all operations is still in progress. @@ -21,10 +21,12 @@ It also works for all types of hierarchical dataframes ## Example -Consider +Consider a simple hierarchical dataframe from . + This table consists of two columns: `name`, which is a `String` column, and `info`, -which is a **column group** containing two nested value columns — +which is a [**column group**](DataColumn.md#columngroup) containing two nested +[value columns](DataColumn.md#valuecolumn) — `age` of type `Int`, and `height` of type `Double`. @@ -55,7 +57,7 @@ which is a **column group** containing two nested value columns — -Read the `DataFrame` from the CSV file: +Read the [`DataFrame`](DataFrame.md) from the CSV file: ```kotlin val df = DataFrame.readCsv("example.csv") @@ -78,10 +80,10 @@ df.sortBy { name and info.height } df.filter { name.startsWith("A") && info.age >= 16 } ``` -If you change DataFrame schema by changing any column [name](rename.md) -or [type](convert.md), or [add](add.md) a new one, you need to -run a cell with a new DataFrame declaration first. -For example, rename the "name" column into "firstName": +If you change the dataframe's schema by changing any column [name](rename.md), +or [type](convert.md) or [add](add.md) a new one, you need to +run a cell with a new [`DataFrame`](DataFrame.md) declaration first. +For example, rename the `name` column into "firstName": ```kotlin val dfRenamed = df.rename { name }.into("firstName") @@ -95,13 +97,13 @@ dfRenamed.rename { firstName }.into("name") dfRenamed.filter { firstName == "Nikita" } ``` -See [](quickstart.md) in the Kotlin Notebook with basic Extension Properties API examples. +See the [](quickstart.md) in Kotlin Notebook with basic Extension Properties API examples. -For now, if you read `DatFrame` from a file or URL, you need to define its schema manually. 
-You can do it fast with [`generate..()` methods](DataSchema-Data-Classes-Generation.md). +For now, if you read [`DataFrame`](DataFrame.md) from a file or URL, you need to define its schema manually. +You can do it quickly with [`generate..()` methods](DataSchema-Data-Classes-Generation.md). Define schemas: ```kotlin @@ -118,13 +120,14 @@ data class Person( ) ``` -Read the `DataFrame` from the CSV file and specify the schema with `convertTo`: +Read the `DataFrame` from the CSV file and specify the schema with +[`.convertTo()`](convertTo.md) or [`cast()`](cast.md): ```kotlin val df = DataFrame.readCsv("example.csv").convertTo() ``` -Extensions for this `DataFrame` will be generated automatically by plugin, +Extensions for this `DataFrame` will be generated automatically by the plugin, so you can use extensions for accessing columns, using it in operations inside the [Column Selector DSL](ColumnSelectors.md) and [DataRow API](DataRow.md). @@ -142,8 +145,8 @@ df.filter { name.startsWith("A") && info.age >= 16 } ``` Moreover, new extensions will be generated on-the-fly after each schema change: -by changing any column [name](rename.md) -or [type](convert.md), or [add](add.md) a new one. +by changing any column [name](rename.md), +or [type](convert.md) or [add](add.md) a new one. For example, rename the "name" column into "firstName" and then we can use `firstName` extensions in the following operations: @@ -155,7 +158,7 @@ df.rename { name }.into("firstName") .filter { firstName == "Nikita" } ``` -See [Kotlin DataFrame Compiler Plugin Example](https://github.com/Kotlin/dataframe/tree/plugin_example/examples/kotlin-dataframe-plugin-example) +See [Compiler Plugin Example](https://github.com/Kotlin/dataframe/tree/plugin_example/examples/kotlin-dataframe-plugin-example) IDEA project with basic Extension Properties API examples. 
From 8bc936f4837d3025af616e0396795e52bc536187 Mon Sep 17 00:00:00 2001 From: "andrei.kislitsyn" Date: Thu, 12 Jun 2025 18:26:01 +0400 Subject: [PATCH 5/5] several topics fixes --- docs/StardustDocs/topics/extensionPropertiesApi.md | 2 +- docs/StardustDocs/topics/guides/Guides-And-Examples.md | 2 +- docs/StardustDocs/topics/guides/quickstart.md | 10 +++++----- 3 files changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/StardustDocs/topics/extensionPropertiesApi.md b/docs/StardustDocs/topics/extensionPropertiesApi.md index 3a98fe1037..83c7976552 100644 --- a/docs/StardustDocs/topics/extensionPropertiesApi.md +++ b/docs/StardustDocs/topics/extensionPropertiesApi.md @@ -147,7 +147,7 @@ df.filter { name.startsWith("A") && info.age >= 16 } Moreover, new extensions will be generated on-the-fly after each schema change: by changing any column [name](rename.md), or [type](convert.md) or [add](add.md) a new one. -For example, rename the "name" column into "firstName" and then we can use `firstName` extensions +For example, rename the `name` column into "firstName" and then we can use `firstName` extensions in the following operations: ```kotlin diff --git a/docs/StardustDocs/topics/guides/Guides-And-Examples.md b/docs/StardustDocs/topics/guides/Guides-And-Examples.md index ff22712f91..82e92f98a1 100644 --- a/docs/StardustDocs/topics/guides/Guides-And-Examples.md +++ b/docs/StardustDocs/topics/guides/Guides-And-Examples.md @@ -24,7 +24,7 @@ Explore our structured, in-depth guides to steadily improve your Kotlin DataFram -* [](extensionPropertiesApi.md) — learn about extension properties for `DataFrame` +* [](extensionPropertiesApi.md) — learn about extension properties for [`DataFrame`](DataFrame.md) and make working with your data both convenient and type-safe. 
* [Enhanced Column Selection DSL](https://blog.jetbrains.com/kotlin/2024/07/enhanced-column-selection-dsl-in-kotlin-dataframe/) diff --git a/docs/StardustDocs/topics/guides/quickstart.md b/docs/StardustDocs/topics/guides/quickstart.md index b0323e9a38..5b8ad63a74 100644 --- a/docs/StardustDocs/topics/guides/quickstart.md +++ b/docs/StardustDocs/topics/guides/quickstart.md @@ -88,7 +88,7 @@ columns. Column selectors are widely used across operations — one of the simplest examples is `.select { }`, which returns a new DataFrame with only the columns chosen in Columns Selection expression. -After executing the cell where a `DataFrame` variable is declared, +*After executing the cell* where a `DataFrame` variable is declared, [extension properties](extensionPropertiesApi.md) for its columns are automatically generated. These properties can then be used in the Columns Selection DSL expression for typesafe and convenient column access. @@ -111,13 +111,13 @@ dfSelected ## Row Filtering -Some operations use [DataRow API](DataRow.md), with expressions and conditions +Some operations use the [DataRow API](DataRow.md), with expressions and conditions that apply for all `DataFrame` rows. -For example, `.filter { }` that returns a new `DataFrame` with rows \ -that satisfy a condition given by row expression. +For example, `.filter { }` that returns a new `DataFrame` with rows that satisfy a condition given by row expression. Inside a row expression, you can access the values of the current row by column names through auto-generated properties. -Similar to the Columns Selection DSL, but in this case the properties represent actual values, not column references. +Similar to the [Columns Selection DSL](ColumnSelectors.md), +but in this case the properties represent actual values, not column references. Filter rows by "stargazers_count" value: