|
1 | 1 | [//]: # (title: Extension Properties API)
|
2 | 2 |
|
3 |
| -<!---IMPORT org.jetbrains.kotlinx.dataframe.samples.api.ApiLevels--> |
4 |
| - |
5 |
| -Auto-generated extension properties are the safest and easiest way to access columns in a [`DataFrame`](DataFrame.md). |
6 |
| -They are generated based on a [dataframe schema](schemas.md), |
| 3 | +When working with a [`DataFrame`](DataFrame.md), the most convenient and reliable way |
| 4 | +to access its columns — including for operations and retrieving column values |
| 5 | +in row expressions — is through auto-generated extension properties. |
| 6 | +They are generated based on a [dataframe schema](schemas.md), |
7 | 7 | with the name and type of properties inferred from the name and type of the corresponding columns.
|
| 8 | +They also work for all types of hierarchical dataframes. |
| 9 | + |
| 10 | +> The behavior of data schema generation differs between the |
| 11 | +> [Compiler Plugin](Compiler-Plugin.md) and [Kotlin Notebook](gettingStartedKotlinNotebook.md). |
| 12 | +> |
| 13 | +> * In **Kotlin Notebook**, a schema is generated *only after cell execution* for |
| 14 | +> `DataFrame` variables defined within that cell. |
| 15 | +> * With the **Compiler Plugin**, a new schema is generated *after every operation* |
| 16 | +> — but support for all operations is still in progress. |
| 17 | +> Retrieving the schema for a `DataFrame` read from a file or URL is also *not yet supported*. |
| 18 | +> |
| 19 | +> This behavior may change in future releases. See the [example](#example) below that demonstrates these differences. |
| 20 | +{style="warning"} |
| 21 | + |
| 22 | +## Example |
| 23 | + |
| 24 | +Consider a simple hierarchical dataframe from |
| 25 | +<resource src="example.csv"></resource>. |
| 26 | + |
| 27 | +This table consists of two columns: `name`, which is a `String` column, and `info`, |
| 28 | +which is a [**column group**](DataColumn.md#columngroup) containing two nested |
| 29 | +[value columns](DataColumn.md#valuecolumn) — |
| 30 | +`age` of type `Int`, and `height` of type `Double`. |
| 31 | + |
| 32 | +<table> |
| 33 | + <thead> |
| 34 | + <tr> |
| 35 | + <th>name</th> |
| 36 | + <th colspan="2">info</th> |
| 37 | + </tr> |
| 38 | + <tr> |
| 39 | + <th></th> |
| 40 | + <th>age</th> |
| 41 | + <th>height</th> |
| 42 | + </tr> |
| 43 | + </thead> |
| 44 | + <tbody> |
| 45 | + <tr> |
| 46 | + <td>Alice</td> |
| 47 | + <td>23</td> |
| 48 | + <td>175.5</td> |
| 49 | + </tr> |
| 50 | + <tr> |
| 51 | + <td>Bob</td> |
| 52 | + <td>27</td> |
| 53 | + <td>160.2</td> |
| 54 | + </tr> |
| 55 | + </tbody> |
| 56 | +</table> |
| 57 | + |
| 58 | +<tabs> |
| 59 | +<tab title="Kotlin Notebook"> |
| 60 | +Read the [`DataFrame`](DataFrame.md) from the CSV file: |
| 61 | + |
| 62 | +```kotlin |
| 63 | +val df = DataFrame.readCsv("example.csv") |
| 64 | +``` |
| 65 | + |
| 66 | +*After the cell is executed*, a data schema and extension properties are generated for this `DataFrame`, |
| 67 | +so you can use them to access columns directly |
| 68 | +and in operations inside the [Column Selector DSL](ColumnSelectors.md) |
| 69 | +and [DataRow API](DataRow.md): |
| 70 | + |
| 71 | + |
| 72 | +```kotlin |
| 73 | +// Get nested column |
| 74 | +df.info.age |
| 75 | +// Sort by multiple columns |
| 76 | +df.sortBy { name and info.height } |
| 77 | +// Filter rows using a row condition. |
| 78 | +// These extensions express the exact value in the row |
| 79 | +// with the corresponding type: |
| 80 | +df.filter { name.startsWith("A") && info.age >= 16 } |
| 81 | +``` |
| 82 | + |
| 83 | +If you change the dataframe's schema by [renaming](rename.md) a column, |
| 84 | +[converting](convert.md) its type, or [adding](add.md) a new column, you need to |
| 85 | +run the cell that declares the new [`DataFrame`](DataFrame.md) before the updated extensions become available. |
| 86 | +For example, rename the `name` column to `firstName`: |
| 87 | + |
| 88 | +```kotlin |
| 89 | +val dfRenamed = df.rename { name }.into("firstName") |
| 90 | +``` |
8 | 91 |
|
9 |
| -Having these, it allows you to work with your dataframe like: |
| 92 | +After running the cell with the code above, you can use `firstName` extensions in the following cells: |
| 93 | + |
| 94 | +```kotlin |
| 95 | +dfRenamed.firstName |
| 96 | +dfRenamed.rename { firstName }.into("name") |
| 97 | +dfRenamed.filter { firstName == "Nikita" } |
| 98 | +``` |
| 99 | + |
| 100 | +See the [](quickstart.md) in Kotlin Notebook with basic Extension Properties API examples. |
| 101 | + |
| 102 | +</tab> |
| 103 | +<tab title="Compiler Plugin"> |
| 104 | + |
| 105 | +For now, if you read a [`DataFrame`](DataFrame.md) from a file or URL, you need to define its schema manually. |
| 106 | +You can do this quickly with the [`generate..()` methods](DataSchema-Data-Classes-Generation.md). |
| 107 | + |
| 108 | +Define schemas: |
10 | 109 | ```kotlin
|
11 |
| -val peopleDf /* : DataFrame<Person> */ = DataFrame.read("people.csv").cast<Person>() |
12 |
| -val nameColumn /* : DataColumn<String> */ = peopleDf.name |
13 |
| -val ageColumn /* : DataColumn<Int> */ = peopleDf.personData.age |
| 110 | +@DataSchema |
| 111 | +data class PersonInfo( |
| 112 | + val age: Int, |
| 113 | + val height: Double |
| 114 | +) |
| 115 | + |
| 116 | +@DataSchema |
| 117 | +data class Person( |
| 118 | + val info: PersonInfo, |
| 119 | + val name: String |
| 120 | +) |
14 | 121 | ```
|
15 |
| -and of course |
| 122 | + |
| 123 | +Read the `DataFrame` from the CSV file and specify the schema with |
| 124 | +[`.convertTo()`](convertTo.md) or [`cast()`](cast.md): |
| 125 | + |
| 126 | +```kotlin |
| 127 | +val df = DataFrame.readCsv("example.csv").convertTo<Person>() |
| 128 | +``` |
| 129 | + |
| 130 | +The plugin generates extension properties for this `DataFrame` automatically, |
| 131 | +so you can use them to access columns directly |
| 132 | +and in operations inside the [Column Selector DSL](ColumnSelectors.md) |
| 133 | +and [DataRow API](DataRow.md): |
| 134 | + |
| 135 | + |
| 136 | +```kotlin |
| 137 | +// Get nested column |
| 138 | +df.info.age |
| 139 | +// Sort by multiple columns |
| 140 | +df.sortBy { name and info.height } |
| 141 | +// Filter rows using a row condition. |
| 142 | +// These extensions express the exact value in the row |
| 143 | +// with the corresponding type: |
| 144 | +df.filter { name.startsWith("A") && info.age >= 16 } |
| 145 | +``` |
| 146 | + |
| 147 | +Moreover, new extensions are generated on the fly after each schema change, such as |
| 148 | +[renaming](rename.md) a column, |
| 149 | +[converting](convert.md) its type, or [adding](add.md) a new column. |
| 150 | +For example, after renaming the `name` column to `firstName`, the `firstName` extension |
| 151 | +is available in the operations that follow: |
| 152 | + |
16 | 153 | ```kotlin
|
17 |
| -peopleDf.add("lastName") { name.split(",").last() } |
18 |
| - .dropNulls { personData.age } |
19 |
| - .filter { survived && home.endsWith("NY") && personData.age in 10..20 } |
| 154 | +// Rename "name" column into "firstName" |
| 155 | +df.rename { name }.into("firstName") |
| 156 | + // Can use `firstName` extension in the row condition |
| 157 | + // right after renaming |
| 158 | + .filter { firstName == "Nikita" } |
20 | 159 | ```
|
21 | 160 |
|
22 |
| -To find out how to use this API in your environment, check out [Working with Data Schemas](schemas.md) |
23 |
| -or jump straight to [Data Schemas in Gradle projects](schemasGradle.md), |
24 |
| -or [Data Schemas in Jupyter notebooks](schemasJupyter.md). |
| 161 | +See the [Compiler Plugin Example](https://github.com/Kotlin/dataframe/tree/plugin_example/examples/kotlin-dataframe-plugin-example) |
| 162 | +IDEA project for basic Extension Properties API examples. |
| 163 | +</tab> |
| 164 | +</tabs> |
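
To illustrate how property names and types can be derived from a schema, here is a minimal sketch that uses toy types instead of the library. `Row`, the marker interfaces, and the hand-written properties are hypothetical stand-ins invented for this illustration; the code actually generated for a `DataFrame` targets the library's own types and may look different.

```kotlin
// Toy stand-in, NOT a kotlinx-dataframe type: a row is an untyped map
// tagged with a phantom schema type T.
class Row<T>(val values: Map<String, Any?>)

// Marker interfaces playing the role of data schemas (hypothetical).
interface PersonInfo
interface Person

// Hand-written equivalents of what a generator could emit:
// one typed extension property per column, named after the column.
val Row<Person>.name: String get() = values["name"] as String

// A column group becomes a property that returns a nested typed row.
@Suppress("UNCHECKED_CAST")
val Row<Person>.info: Row<PersonInfo> get() = Row(values["info"] as Map<String, Any?>)

val Row<PersonInfo>.age: Int get() = values["age"] as Int
val Row<PersonInfo>.height: Double get() = values["height"] as Double

// The two rows from the example table above.
val people: List<Row<Person>> = listOf(
    Row(mapOf("name" to "Alice", "info" to mapOf("age" to 23, "height" to 175.5))),
    Row(mapOf("name" to "Bob", "info" to mapOf("age" to 27, "height" to 160.2)))
)

fun main() {
    // Typed access: `name` is a String and `info.age` is an Int, so no casts are needed here.
    val selected = people.filter { it.name.startsWith("A") && it.info.age >= 16 }
    println(selected.map { it.name }) // prints [Alice]
}
```

Because the properties are extensions on `Row<Person>` rather than members, a different schema type gets a different set of properties over the same underlying storage, which is the idea behind regenerating extensions after a schema change.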