Extension properties docs #1246
Changes from 4 commits
71c3601
8f784cd
6ed9e24
3b4994c
ba442a1
8bc936f
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
name,info | ||
Alice,"{""age"":23,""height"":175.5}" | ||
Bob,"{""age"":27,""height"":160.2}" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,24 +1,161 @@ | ||
[//]: # (title: Extension Properties API) | ||
|
||
<!---IMPORT org.jetbrains.kotlinx.dataframe.samples.api.ApiLevels--> | ||
|
||
Auto-generated extension properties are the safest and easiest way to access columns in a [`DataFrame`](DataFrame.md). | ||
They are generated based on a [dataframe schema](schemas.md), | ||
When working with a DataFrame, the most convenient and reliable way | ||
to access its columns — including for operations and retrieving column values | ||
in row expressions — is through auto-generated extension properties. | ||
They are generated based on a [dataframe schema](schemas.md), | ||
with the name and type of properties inferred from the name and type of the corresponding columns. | ||
It also works for all types of hierarchical dataframes. | ||
||
|
||
> The behavior of data schema generation differs between the | ||
> [Compiler Plugin](Compiler-Plugin.md) and [Kotlin Notebook](gettingStartedKotlinNotebook.md). | ||
> | ||
> * In the **Kotlin Notebook**, a schema is generated *only after cell execution* for | ||
Review comment: why? |
Review comment: If you're talking about the concept, it's "in notebooks", "in the notebook", or "in the Kotlin notebook". If you're talking about the product, it's a name, so "in Kotlin Notebook". |
||
> `DataFrame` variables defined within that cell. | ||
Review comment: * |
Review comment: but |
||
> * With the **Compiler Plugin**, a new schema is generated *after every operation* | ||
> — but support for all operations is still in progress. | ||
> Retrieving the schema for `DataFrame` read from a file or URL is *not yet supported* either. | ||
> | ||
> This behavior may change in future releases. See the [example](#example) below that demonstrates these differences. | ||
Review comment: maybe mention that we're working to bring the compiler plugin to notebooks as well? |
Review comment: The meaning is "ALL of these points will be improved in the future" |
||
{style="warning"} | ||
Review comment: maybe warning is a bit aggressive, how about "info"? |
Review comment: I believe it's very crucial, so set a warning. |
||
|
||
## Example | ||
|
||
Consider | ||
<resource src="example.csv"></resource>. | ||
This table consists of two columns: `name`, which is a `String` column, and `info`, | ||
which is a **column group** containing two nested value columns — | ||
Review comment: we can link to column group, value columns etc. |
||
`age` of type `Int`, and `height` of type `Double`. | ||
|
||
<table> | ||
<thead> | ||
<tr> | ||
<th>name</th> | ||
<th colspan="2">info</th> | ||
</tr> | ||
<tr> | ||
<th></th> | ||
<th>age</th> | ||
<th>height</th> | ||
</tr> | ||
</thead> | ||
<tbody> | ||
<tr> | ||
<td>Alice</td> | ||
<td>23</td> | ||
<td>175.5</td> | ||
</tr> | ||
<tr> | ||
<td>Bob</td> | ||
<td>27</td> | ||
<td>160.2</td> | ||
</tr> | ||
</tbody> | ||
</table> | ||
|
||
<tabs> | ||
<tab title="Kotlin Notebook"> | ||
Read the `DataFrame` from the CSV file: | ||
Review comment: backticks don't render this close to html in markdown/writerside. But in this case I'd write *dataframe, the concept |
Review comment: see what I wrote here some time ago #661 |
Review comment: But here it's an object! |
Review comment: But in the sentence you treat it like a concept "Read the |
||
|
||
```kotlin | ||
val df = DataFrame.readCsv("example.csv") | ||
``` | ||
|
||
*After cell execution* data schema and extensions for this `DataFrame` will be generated | ||
so you can use extensions for accessing columns, | ||
using it in operations inside the [Column Selector DSL](ColumnSelectors.md) | ||
and [DataRow API](DataRow.md): | ||
|
||
|
||
```kotlin | ||
// Get nested column | ||
df.info.age | ||
// Sort by multiple columns | ||
df.sortBy { name and info.height } | ||
// Filter rows using a row condition. | ||
// These extensions express the exact value in the row | ||
// with the corresponding type: | ||
df.filter { name.startsWith("A") && info.age >= 16 } | ||
``` | ||
|
||
If you change DataFrame schema by changing any column [name](rename.md) | ||
Review comment: *the dataframe's schema |
Review comment: the Oxford comma way of writing it is "changing any column name, or type or add a new one...". So adding a comma between summations but not before other or/ands |
||
or [type](convert.md), or [add](add.md) a new one, you need to | ||
run a cell with a new DataFrame declaration first. | ||
Review comment: *dataframe |
||
For example, rename the "name" column into "firstName": | ||
|
||
|
||
```kotlin | ||
val dfRenamed = df.rename { name }.into("firstName") | ||
``` | ||
|
||
After running the cell with the code above, you can use `firstName` extensions in the following cells: | ||
|
||
Having these, it allows you to work with your dataframe like: | ||
```kotlin | ||
val peopleDf /* : DataFrame<Person> */ = DataFrame.read("people.csv").cast<Person>() | ||
val nameColumn /* : DataColumn<String> */ = peopleDf.name | ||
val ageColumn /* : DataColumn<Int> */ = peopleDf.personData.age | ||
dfRenamed.firstName | ||
dfRenamed.rename { firstName }.into("name") | ||
dfRenamed.filter { firstName == "Nikita" } | ||
``` | ||
and of course | ||
|
||
See [](quickstart.md) in the Kotlin Notebook with basic Extension Properties API examples. | ||
|
||
|
||
</tab> | ||
<tab title="Compiler Plugin"> | ||
|
||
For now, if you read a `DataFrame` from a file or URL, you need to define its schema manually. | ||
Review comment: you know the gist :) |
||
You can do it fast with [`generate..()` methods](DataSchema-Data-Classes-Generation.md). | ||
Review comment: *quickly? |
||
|
||
Define schemas: | ||
```kotlin | ||
@DataSchema | ||
data class PersonInfo( | ||
val age: Int, | ||
val height: Double | ||
) | ||
|
||
@DataSchema | ||
data class Person( | ||
val info: PersonInfo, | ||
val name: String | ||
) | ||
``` | ||
|
||
Read the `DataFrame` from the CSV file and specify the schema with `convertTo`: | ||
Review comment: hmm maybe also mention cast<>()? convertTo is safer but also heavier |
||
|
||
```kotlin | ||
val df = DataFrame.readCsv("example.csv").convertTo<Person>() | ||
``` | ||
|
||
Extensions for this `DataFrame` will be generated automatically by plugin, | ||
Review comment: *the plugin |
||
so you can use extensions for accessing columns, | ||
using it in operations inside the [Column Selector DSL](ColumnSelectors.md) | ||
and [DataRow API](DataRow.md). | ||
|
||
|
||
```kotlin | ||
// Get nested column | ||
df.info.age | ||
// Sort by multiple columns | ||
df.sortBy { name and info.height } | ||
// Filter rows using a row condition. | ||
// These extensions express the exact value in the row | ||
// with the corresponding type: | ||
df.filter { name.startsWith("A") && info.age >= 16 } | ||
``` | ||
|
||
Moreover, new extensions will be generated on-the-fly after each schema change: | ||
by changing any column [name](rename.md) | ||
or [type](convert.md), or [add](add.md) a new one. | ||
Review comment: oxford comma |
||
For example, rename the "name" column into "firstName" and then we can use `firstName` extensions | ||
Review comment: My logic is simple: |
Review comment: hmm but then |
||
in the following operations: | ||
|
||
```kotlin | ||
peopleDf.add("lastName") { name.split(",").last() } | ||
.dropNulls { personData.age } | ||
.filter { survived && home.endsWith("NY") && personData.age in 10..20 } | ||
// Rename "name" column into "firstName" | ||
df.rename { name }.into("firstName") | ||
// Can use `firstName` extension in the row condition | ||
// right after renaming | ||
.filter { firstName == "Nikita" } | ||
``` | ||
|
||
To find out how to use this API in your environment, check out [Working with Data Schemas](schemas.md) | ||
or jump straight to [Data Schemas in Gradle projects](schemasGradle.md), | ||
or [Data Schemas in Jupyter notebooks](schemasJupyter.md). | ||
See [Kotlin DataFrame Compiler Plugin Example](https://github.com/Kotlin/dataframe/tree/plugin_example/examples/kotlin-dataframe-plugin-example) | ||
Review comment: can you make the name a bit shorter here as well? |
||
IDEA project with basic Extension Properties API examples. | ||
</tab> | ||
</tabs> |
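To make the idea behind generated extension properties more concrete, here is a minimal, stdlib-only Kotlin sketch. `ToyFrame` and its `name` and `age` properties are hypothetical stand-ins written by hand for illustration; they are not the Kotlin DataFrame API and not the code that the notebook or the compiler plugin actually generates. The sketch only shows how a typed extension property can wrap string-keyed column access.

```kotlin
// Toy stand-in for a dataframe: column name -> list of values.
// NOT the Kotlin DataFrame API; every name here is hypothetical.
class ToyFrame(val columns: Map<String, List<Any?>>) {
    // Untyped, string-based access: the caller must know the name and cast.
    operator fun get(name: String): List<Any?> =
        columns[name] ?: error("No column named '$name'")

    // Return a copy with one column renamed, keeping the column order.
    fun rename(oldName: String, newName: String): ToyFrame =
        ToyFrame(columns.entries.associate { (k, v) -> (if (k == oldName) newName else k) to v })
}

// Hand-written equivalents of what a generator could emit for the schema
// `name: String, age: Int`. Each property hides the string key and the cast.
@Suppress("UNCHECKED_CAST")
val ToyFrame.name: List<String> get() = this["name"] as List<String>

@Suppress("UNCHECKED_CAST")
val ToyFrame.age: List<Int> get() = this["age"] as List<Int>

fun main() {
    val df = ToyFrame(mapOf("name" to listOf("Alice", "Bob"), "age" to listOf(23, 27)))
    println(df.name)       // prints [Alice, Bob]
    println(df.age.sum())  // prints 50

    val renamed = df.rename("name", "firstName")
    println(renamed.columns.keys) // prints [firstName, age]
    // `renamed.name` still compiles in this toy model but would throw at run time,
    // because no property was written for the new column name.
}
```

In this toy model the stale `name` property fails only at run time after the rename. That is a loose analogy for why new extensions have to be produced after a schema change, either by running a cell (Kotlin Notebook) or by the compiler plugin, as the tabs above describe.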
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -24,6 +24,9 @@ Explore our structured, in-depth guides to steadily improve your Kotlin DataFram | |
|
||
<img src="quickstart_preview.png" border-effect="rounded" width="705"/> | ||
|
||
* [](extensionPropertiesApi.md) — learn about extension properties for `DataFrame` | ||
Review comment: if you're referring to the library "DataFrame", if you're referring to the type " |
Review comment: What? |
Review comment: then please add a link :) |
||
and make working with your data both convenient and type-safe. | ||
|
||
* [Enhanced Column Selection DSL](https://blog.jetbrains.com/kotlin/2024/07/enhanced-column-selection-dsl-in-kotlin-dataframe/) | ||
— explore powerful DSL for typesafe and flexible column selection in Kotlin DataFrame. | ||
* [](Kotlin-DataFrame-Features-in-Kotlin-Notebook.md) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -88,8 +88,8 @@ columns. | |
Column selectors are widely used across operations — one of the simplest examples is `.select { }`, which returns a new | ||
DataFrame with only the columns chosen in Columns Selection expression. | ||
|
||
After executing the cell where a `DataFrame` variable is declared, an extension with properties for its columns is | ||
automatically generated. | ||
After executing the cell where a `DataFrame` variable is declared, | ||
Review comment: or DataRow |
Review comment: unnecessary information IMHO. I mean there's mention about Row API below, I think it's enough for begin. |
||
[extension properties](extensionPropertiesApi.md) for its columns are automatically generated. | ||
These properties can then be used in the Columns Selection DSL expression for typesafe and convenient column access. | ||
|
||
Select some columns: | ||
|
@@ -104,15 +104,17 @@ dfSelected | |
|
||
<!---END--> | ||
|
||
<inline-frame src="./resources/notebook_test_quickstart_5.html" width="705px" height="500px"></inline-frame> | ||
|
||
> With a [Kotlin DataFrame Compiler Plugin](Compiler-Plugin.md) enabled, | ||
> you can use auto-generated properties in your IntelliJ IDEA projects. | ||
|
||
<inline-frame src="./resources/notebook_test_quickstart_5.html" width="705px" height="500px"></inline-frame> | ||
## Row Filtering | ||
|
||
## Raw Filtering | ||
|
||
Some operations use `RowExpression`, i.e., expression that applies for all `DataFrame` rows. For example `.filter { }` | ||
that returns a new `DataFrame` with rows that satisfy a condition given by row expression. | ||
Some operations use [DataRow API](DataRow.md), with expressions and conditions | ||
Review comment: the |
||
that apply for all `DataFrame` rows. | ||
For example, `.filter { }` that returns a new `DataFrame` with rows \ | ||
Review comment: random backslash? |
||
that satisfy a condition given by row expression. | ||
|
||
Inside a row expression, you can access the values of the current row by column names through auto-generated properties. | ||
Similar to the Columns Selection DSL, but in this case the properties represent actual values, not column references. | ||
Jolanrensen marked this conversation as resolved.
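The distinction drawn in the quickstart text above, that properties in a row expression represent actual values rather than column references, can be sketched with plain Kotlin. `ToyRow`, `ToyTable`, and the `name` and `age` properties below are hypothetical, hand-written stand-ins, not the Kotlin DataFrame API. The sketch assumes only that a row expression is a lambda whose receiver is the current row.

```kotlin
// Hypothetical toy types for illustration; NOT the Kotlin DataFrame API.
class ToyRow(val values: Map<String, Any?>)

class ToyTable(val rows: List<ToyRow>) {
    // Keep only the rows for which the row expression returns true.
    // The lambda runs with the current row as its receiver.
    fun filter(predicate: ToyRow.() -> Boolean): ToyTable =
        ToyTable(rows.filter { it.predicate() })
}

// Row-level properties return the value stored in the current row,
// already cast to the expected type.
val ToyRow.name: String get() = values["name"] as String
val ToyRow.age: Int get() = values["age"] as Int

fun main() {
    val table = ToyTable(
        listOf(
            ToyRow(mapOf("name" to "Alice", "age" to 23)),
            ToyRow(mapOf("name" to "Bob", "age" to 27)),
        )
    )
    // Inside the lambda, `name` is a String and `age` is an Int,
    // so ordinary String and Int operations apply directly.
    val filtered = table.filter { name.startsWith("A") && age >= 16 }
    println(filtered.rows.map { it.name }) // prints [Alice]
}
```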
|
||
|
@@ -349,6 +351,9 @@ Ready to go deeper? Check out what’s next: | |
|
||
- 🧠 **Understand the design** and core concepts in the [library overview](overview.md). | ||
|
||
- 🔤 **[Learn more about Extension Properties](extensionPropertiesApi.md)** | ||
and make working with your data both convenient and type-safe. | ||
|
||
- 💡 **[Use Kotlin DataFrame Compiler Plugin](Compiler-Plugin.md)** | ||
for auto-generated column access in your IntelliJ IDEA projects. | ||
|
||
|
Review comment: *"a dataframe" (the concept), or *"a [`DataFrame`](DataFrame.md)" (the instance of the type, with link)