Extension properties docs #1246
Changes from all commits
71c3601
8f784cd
6ed9e24
3b4994c
ba442a1
8bc936f
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
name,info | ||
Alice,"{""age"":23,""height"":175.5}" | ||
Bob,"{""age"":27,""height"":160.2}" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,24 +1,164 @@ | ||
[//]: # (title: Extension Properties API) | ||
|
||
<!---IMPORT org.jetbrains.kotlinx.dataframe.samples.api.ApiLevels--> | ||
|
||
Auto-generated extension properties are the safest and easiest way to access columns in a [`DataFrame`](DataFrame.md). | ||
They are generated based on a [dataframe schema](schemas.md), | ||
When working with a [`DataFrame`](DataFrame.md), the most convenient and reliable way | ||
to access its columns — including for operations and retrieving column values | ||
in row expressions — is through auto-generated extension properties. | ||
They are generated based on a [dataframe schema](schemas.md), | ||
with the name and type of properties inferred from the name and type of the corresponding columns. | ||
It also works for all types of hierarchical dataframes. | ||
|
||
> The behavior of data schema generation differs between the | ||
> [Compiler Plugin](Compiler-Plugin.md) and [Kotlin Notebook](gettingStartedKotlinNotebook.md). | ||
> | ||
> * In **Kotlin Notebook**, a schema is generated *only after cell execution* for | ||
> `DataFrame` variables defined within that cell. | ||
> * With the **Compiler Plugin**, a new schema is generated *after every operation*, | ||
> although support for all operations is still in progress. | ||
> Retrieving the schema for a `DataFrame` read from a file or URL is *not yet supported* either. | ||
> | ||
> This behavior may change in future releases. See the [example](#example) below that demonstrates these differences. | ||
Review comment: maybe mention that we're working to bring the compiler plugin to notebooks as well?
Reply: The meaning is "ALL of these points will be improved in the future"
||
{style="warning"} | ||
Review comment: maybe warning is a bit aggressive, how about "info"?
Reply: I believe it's very crucial, so set a warning.
||
|
||
## Example | ||
|
||
Consider a simple hierarchical dataframe from | ||
<resource src="example.csv"></resource>. | ||
|
||
This table consists of two columns: `name`, which is a `String` column, and `info`, | ||
which is a [**column group**](DataColumn.md#columngroup) containing two nested | ||
[value columns](DataColumn.md#valuecolumn) — | ||
`age` of type `Int`, and `height` of type `Double`. | ||
|
||
<table> | ||
<thead> | ||
<tr> | ||
<th>name</th> | ||
<th colspan="2">info</th> | ||
</tr> | ||
<tr> | ||
<th></th> | ||
<th>age</th> | ||
<th>height</th> | ||
</tr> | ||
</thead> | ||
<tbody> | ||
<tr> | ||
<td>Alice</td> | ||
<td>23</td> | ||
<td>175.5</td> | ||
</tr> | ||
<tr> | ||
<td>Bob</td> | ||
<td>27</td> | ||
<td>160.2</td> | ||
</tr> | ||
</tbody> | ||
</table> | ||
|
||
<tabs> | ||
<tab title="Kotlin Notebook"> | ||
Read the [`DataFrame`](DataFrame.md) from the CSV file: | ||
|
||
```kotlin | ||
val df = DataFrame.readCsv("example.csv") | ||
``` | ||
|
||
*After cell execution*, the data schema and extensions for this `DataFrame` are generated, | ||
so you can use the extensions to access columns, | ||
both in operations inside the [Column Selector DSL](ColumnSelectors.md) | ||
and in the [DataRow API](DataRow.md): | ||
|
||
|
||
```kotlin | ||
// Get nested column | ||
df.info.age | ||
// Sort by multiple columns | ||
df.sortBy { name and info.height } | ||
// Filter rows using a row condition. | ||
// These extensions express the exact value in the row | ||
// with the corresponding type: | ||
df.filter { name.startsWith("A") && info.age >= 16 } | ||
``` | ||
|
||
If you change the dataframe's schema by changing a column [name](rename.md) | ||
or [type](convert.md), or by [adding](add.md) a new column, you need to | ||
run a cell with a new [`DataFrame`](DataFrame.md) declaration first. | ||
For example, rename the `name` column to `firstName`: | ||
|
||
```kotlin | ||
val dfRenamed = df.rename { name }.into("firstName") | ||
``` | ||
|
||
Having these, it allows you to work with your dataframe like: | ||
After running the cell with the code above, you can use `firstName` extensions in the following cells: | ||
|
||
```kotlin | ||
dfRenamed.firstName | ||
dfRenamed.rename { firstName }.into("name") | ||
dfRenamed.filter { firstName == "Nikita" } | ||
``` | ||
|
||
See the [](quickstart.md) in Kotlin Notebook with basic Extension Properties API examples. | ||
|
||
</tab> | ||
<tab title="Compiler Plugin"> | ||
|
||
For now, if you read a [`DataFrame`](DataFrame.md) from a file or URL, you need to define its schema manually. | ||
You can do this quickly with the [`generate..()` methods](DataSchema-Data-Classes-Generation.md). | ||
|
||
Define schemas: | ||
```kotlin | ||
val peopleDf /* : DataFrame<Person> */ = DataFrame.read("people.csv").cast<Person>() | ||
val nameColumn /* : DataColumn<String> */ = peopleDf.name | ||
val ageColumn /* : DataColumn<Int> */ = peopleDf.personData.age | ||
@DataSchema | ||
data class PersonInfo( | ||
val age: Int, | ||
val height: Double | ||
) | ||
|
||
@DataSchema | ||
data class Person( | ||
val info: PersonInfo, | ||
val name: String | ||
) | ||
``` | ||
and of course | ||
|
||
Read the `DataFrame` from the CSV file and specify the schema with | ||
[`.convertTo()`](convertTo.md) or [`cast()`](cast.md): | ||
|
||
```kotlin | ||
val df = DataFrame.readCsv("example.csv").convertTo<Person>() | ||
``` | ||
|
||
Extensions for this `DataFrame` are generated automatically by the plugin, | ||
so you can use them to access columns, | ||
both in operations inside the [Column Selector DSL](ColumnSelectors.md) | ||
and in the [DataRow API](DataRow.md). | ||
|
||
|
||
```kotlin | ||
// Get nested column | ||
df.info.age | ||
// Sort by multiple columns | ||
df.sortBy { name and info.height } | ||
// Filter rows using a row condition. | ||
// These extensions express the exact value in the row | ||
// with the corresponding type: | ||
df.filter { name.startsWith("A") && info.age >= 16 } | ||
``` | ||
|
||
Moreover, new extensions are generated on the fly after each schema change: | ||
when you change a column [name](rename.md) or [type](convert.md), | ||
or [add](add.md) a new column. | ||
For example, after renaming the `name` column to `firstName`, you can use the `firstName` extension | ||
in the following operations: | ||
|
||
```kotlin | ||
peopleDf.add("lastName") { name.split(",").last() } | ||
.dropNulls { personData.age } | ||
.filter { survived && home.endsWith("NY") && personData.age in 10..20 } | ||
// Rename "name" column into "firstName" | ||
df.rename { name }.into("firstName") | ||
// Can use `firstName` extension in the row condition | ||
// right after renaming | ||
.filter { firstName == "Nikita" } | ||
``` | ||
|
||
To find out how to use this API in your environment, check out [Working with Data Schemas](schemas.md) | ||
or jump straight to [Data Schemas in Gradle projects](schemasGradle.md), | ||
or [Data Schemas in Jupyter notebooks](schemasJupyter.md). | ||
See [Compiler Plugin Example](https://github.com/Kotlin/dataframe/tree/plugin_example/examples/kotlin-dataframe-plugin-example) | ||
IDEA project with basic Extension Properties API examples. | ||
</tab> | ||
</tabs> |
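For readers of this diff who have not used the API, the idea behind extension properties can be sketched in plain Kotlin. This is only an illustration under stated assumptions: `Row`, `Frame`, the marker interfaces, and `exampleFrame` are hypothetical stand-ins written for this note, not the Kotlin DataFrame types, and the properties are written by hand here rather than generated from a schema.

```kotlin
// Hypothetical stand-ins for illustration only; NOT the Kotlin DataFrame API.
// The type parameter T is a schema marker: it only selects which extensions apply.
class Row<T>(private val values: Map<String, Any?>) {
    operator fun get(column: String): Any? = values[column]
}

class Frame<T>(val rows: List<Row<T>>) {
    // The row is the lambda receiver, so its extensions are usable without a qualifier.
    fun filter(predicate: Row<T>.() -> Boolean): Frame<T> =
        Frame(rows.filter { it.predicate() })
}

// Marker interfaces standing in for the schema of example.csv.
interface Person
interface PersonInfo

// Hand-written equivalents of what a generator could emit for the schema
// name: String, info: { age: Int, height: Double }.
val Row<Person>.name: String get() = this["name"] as String

@Suppress("UNCHECKED_CAST")
val Row<Person>.info: Row<PersonInfo> get() = this["info"] as Row<PersonInfo>

val Row<PersonInfo>.age: Int get() = this["age"] as Int
val Row<PersonInfo>.height: Double get() = this["height"] as Double

// The two rows from example.csv, with `info` already expanded into a nested row.
fun exampleFrame(): Frame<Person> = Frame(
    listOf(
        Row<Person>(mapOf(
            "name" to "Alice",
            "info" to Row<PersonInfo>(mapOf("age" to 23, "height" to 175.5)),
        )),
        Row<Person>(mapOf(
            "name" to "Bob",
            "info" to Row<PersonInfo>(mapOf("age" to 27, "height" to 160.2)),
        )),
    )
)

fun main() {
    // Same shape as the row condition used in the documentation examples.
    val filtered = exampleFrame().filter { name.startsWith("A") && info.age >= 16 }
    println(filtered.rows.map { it.name }) // prints [Alice]
}
```

Because the properties are declared on `Row<Person>` and `Row<PersonInfo>` separately, `age` is only visible on the nested `info` row, which mirrors why a renamed or added column needs a new schema before its extension becomes available.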
Review comment: * and `DataRow`. Or do you mean "DataFrame" the library, as in "DataFrame variables"? Then don't use backticks.
Reply: But `DataRow` is not defined within that cell. I mean, we can omit it here; it's described further.