melt() functionality #2677

st-pasha · 2020-10-02T20:16:58Z

st-pasha
Oct 2, 2020

So, I've been thinking about the melt() function as requested in #2499. An equivalent of this function exists in R data.table, pandas, tidyr, SQL Server, and Snowflake.

Interestingly, in data.table, pandas and tidyr this function is a full-frame transform. That is, the method is applied to a whole frame, and produces another frame as a result. The method also has a dozen or so various options to control different aspects of the transformation. In SQL, on the other hand, the same functionality is done via a separate clause which looks like this: UNPIVOT (vcol FOR ncol IN collist...). However, this clause is attached to FROM, so in this sense it's also a transformation of the frame.

What would be really nice for us, however, is to implement melting as just another simple building block that can be combined with other functions inside a DT[i,j,...] call. Something like this:

melt(columns)

(where columns is an FExpr that selects 1 or more columns). The melt() function would then produce a 2-column FExpr where the first column contains column labels, and the second column is the values from the table. Internally, the resulting columns will be at a grouping level which expands every row into as many rows as the number of columns in the columnset. This means that melt() cannot be combined with a groupby. But it also means that all regular columns will be auto-expanded to match the shape of the melted columns.

For example, given the dataset

DT = dt.Frame(A=['a','b','c'], B=[1,3,5], C=[2,4,6])

we can write

DT[:, [f.A, melt(f["B":"C"])]]

to produce

   | A   variable  value
-- + --  --------  -----
 0 | a   B             1
 1 | a   C             2
 2 | b   B             3
 3 | b   C             4
 4 | c   B             5
 5 | c   C             6

[6 rows x 3 columns]
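To make the proposed semantics concrete, here is a minimal pure-Python sketch. It is not datatable code: melt() does not exist yet, and the helper name melt_rows_first, its parameters, and the list-of-dicts representation are all assumptions made for illustration. It only shows how each input row expands into one output row per melted column, with the regular column A repeated to match.

```python
def melt_rows_first(rows, id_cols, measure_cols):
    """Expand every input row into len(measure_cols) output rows.

    Hypothetical helper: `rows` is a list of dicts standing in for a frame.
    """
    out = []
    for row in rows:
        for name in measure_cols:
            # Regular (id) columns are repeated for every melted column.
            rec = {c: row[c] for c in id_cols}
            rec["variable"] = name      # column label
            rec["value"] = row[name]    # value taken from that column
            out.append(rec)
    return out


rows = [
    {"A": "a", "B": 1, "C": 2},
    {"A": "b", "B": 3, "C": 4},
    {"A": "c", "B": 5, "C": 6},
]
melted = melt_rows_first(rows, id_cols=["A"], measure_cols=["B", "C"])
for rec in melted:
    print(rec["A"], rec["variable"], rec["value"])
```

The printed rows follow the same order as the table above: all values of the first row, then all values of the second row, and so on.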

pradkrish · 2020-10-03T19:26:55Z

pradkrish
Oct 3, 2020

I agree, the syntax you propose matches the way other functions in datatable are used. However, shouldn't the output look more like this?

   | A   variable  value
-- + --  --------  -----
 0 | a   B             1
 1 | b   B             3
 2 | c   B             5
 3 | a   C             2
 4 | b   C             4
 5 | c   C             6

[6 rows x 3 columns]

Otherwise it may not agree with how it's done in other packages.

0 replies

st-pasha · 2020-10-05T07:59:04Z

st-pasha
Oct 5, 2020
Author

Hmm, is that how it's done in other packages? This would mean that melting is merely splitting the frame into columns and then r-binding those columns. I thought that melting should preserve the local order of values: if some values are close in the original frame, they must remain close in the transformed frame. Given that for a typical frame nrows >> ncols, this would imply that we first list all observations for the first row, then observations for the next row, and so on.

I've looked into the packages listed above, and none of them documents the exact order of rows in the molten frame. De facto, however, they use the following orders:

R data.table: columns-first;
pandas: columns-first;
tidyr: columns-first;
SQL Server: rows-first;
Snowflake: rows-first.

I wonder why they chose differently, and whether there is a clearly better alternative here.
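The difference between the two orders comes down to index arithmetic. The sketch below is only an illustration (the function name source_cell and the order strings are made up here, not taken from any of the packages listed): it maps output row k of the molten frame back to the source cell it reads from.

```python
def source_cell(k, nrows, ncols, order):
    """Map output row k of the molten frame to (source_row, source_col)."""
    if order == "rows-first":
        # All melted columns of row 0, then all of row 1, ...
        return k // ncols, k % ncols
    if order == "columns-first":
        # All rows of column 0, then all rows of column 1, ...
        return k % nrows, k // nrows
    raise ValueError(f"unknown order: {order!r}")


nrows, ncols = 3, 2   # the 3-row frame above with melted columns B and C
rows_first = [source_cell(k, nrows, ncols, "rows-first") for k in range(nrows * ncols)]
cols_first = [source_cell(k, nrows, ncols, "columns-first") for k in range(nrows * ncols)]
print(rows_first)
print(cols_first)
```

With rows-first order, consecutive output rows stay within one source row; with columns-first order, they walk down a single source column.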

2 replies

pradkrish Oct 5, 2020

It is going to be a tricky one then. If the people adopting Python datatable mainly come from a pandas or R data.table background, there is an incentive to keep the structure similar to those packages.

st-pasha Oct 6, 2020
Author

Stata is also rows-first: https://stats.idre.ucla.edu/stata/modules/reshaping-data-wide-to-long/
And so is SAS: https://stats.idre.ucla.edu/sas/modules/how-to-reshape-data-wide-to-long-using-proc-transpose/

tdhock · 2020-10-06T22:15:02Z

tdhock
Oct 6, 2020

I'm not sure if this is a big deal in terms of performance, but data.table uses memcpy in the getvaluecols function (https://github.com/Rdatatable/data.table/blob/master/src/fmelt.c#L488) to copy entire columns from the input to the output, so if you want to do that I think you would need the columns-first approach. I guess SQL DBs use the rows-first approach because they have row-wise storage, rather than the column-wise storage of R/data.table?

2 replies

st-pasha Oct 7, 2020
Author

memcpy is amazing for copying contiguous chunks of data in a single thread, but may be easily outperformed by an element-by-element copy in multiple threads (though this highly depends on hardware specs).
An even better approach (imho) is not to copy any data at all, but to create a "virtual column" (a.k.a. ALTREP in R) that would refer to the data in the original dataset without copying that data. Nevertheless, data locality would imply that a "columns-first" virtual column may still perform better than a "rows-first" one.
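As a rough illustration of the virtual-column idea, here is a pure-Python sketch. The class name VirtualMeltedValues is hypothetical and says nothing about how datatable or ALTREP implement this internally; it only shows a columns-first "values" column that keeps references to the source columns and resolves each element on access.

```python
from collections.abc import Sequence


class VirtualMeltedValues(Sequence):
    """Read-only, columns-first view over several equally long columns.

    Hypothetical sketch: no element data is copied, only references to
    the source column objects are stored.
    """

    def __init__(self, columns):
        self._columns = list(columns)
        self._nrows = len(self._columns[0]) if self._columns else 0
        if any(len(col) != self._nrows for col in self._columns):
            raise ValueError("all columns must have the same length")

    def __len__(self):
        return self._nrows * len(self._columns)

    def __getitem__(self, k):
        if isinstance(k, slice):
            return [self[i] for i in range(*k.indices(len(self)))]
        if k < 0:
            k += len(self)
        if not 0 <= k < len(self):
            raise IndexError(k)
        # Columns-first: the first nrows elements come from column 0, etc.
        col, row = divmod(k, self._nrows)
        return self._columns[col][row]


B = [1, 3, 5]
C = [2, 4, 6]
values = VirtualMeltedValues([B, C])
print(list(values))
B[0] = 10          # the view reflects changes because nothing was copied
print(values[0])
```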

PS: do you know why fmelt in data.table performs all data copying manually instead of simply rbinding several columns?

tdhock Oct 7, 2020

Not sure why, but I bet it's for speed.

tdhock · 2020-10-06T23:22:56Z

tdhock
Oct 6, 2020

I have recently added some advanced features to fmelt (Rdatatable/data.table#4731), so you may want to consider doing something similar, or at least plan for those features in your Python version.

3 replies

st-pasha Oct 7, 2020
Author

The way I currently see it, the melt() function should be as small as possible, so that it can be used as a building block to mix and match with other datatable expressions.

For example, consider the measure.vars= option in data.table. It allows you to select multiple columns that will all be molten together. However, we already have a mechanism for selecting multiple columns: the f[...] symbol. For the developers it's much easier to reuse the existing mechanism, and for users too: they don't have to learn any new functionality. And even if some functionality is missing, e.g. selecting columns based on a pattern, it's much better to implement this as part of f rather than as a special argument to melt: this way the same pattern-based matching functionality can be used everywhere.

Your PR 4731 is concerned with a different problem (if I understood it correctly): how to split column names, such as Petal.Length, Petal.Width, Sepal.Depth, into components that will become two separate columns. However, the way I see it, it's just a multi-step operation:

first, perform melt(columns...), which will produce 2 columns: names_column, values_column;
select the first column (names_column) for transform;
apply any string function to that column, such as split(), strip(), re_match(), -, etc;
apply any subsequent transform. For example if the suffixes are integer, you can cast them to an int column;
etc.

This way we don't need to have extra arguments on melt - we'll just create a larger set of reusable blocks. And of course there are performance considerations here too: the names_column produced will be categorical, so when applying any transform to such a column, it is important to transform each "label" only once instead of doing the same transform over and over again. (cc @oleksiyskononenko )
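The "transform each label only once" point can be sketched in plain Python. This is only an illustration under assumptions: split_labels_once is a made-up helper, and the names column is modelled as a plain list rather than a datatable categorical column.

```python
def split_labels_once(names, sep="."):
    """Split every name on `sep`, computing the split once per distinct label.

    Hypothetical helper: returns the per-row parts plus the label cache.
    """
    cache = {}
    parts = []
    for name in names:
        if name not in cache:
            cache[name] = tuple(name.split(sep))   # done once per label
        parts.append(cache[name])
    return parts, cache


# A names column as melt would produce it for 2 source rows (rows-first).
names_column = ["Petal.Length", "Petal.Width", "Sepal.Depth"] * 2
parts, cache = split_labels_once(names_column)
print(parts[:3])
print(len(cache), "distinct labels were split for", len(names_column), "rows")
```

However many rows the frame has, the string split runs only as many times as there are melted columns.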

tdhock Oct 7, 2020

You are absolutely right that it is possible to achieve this using a sequence of operations, but that is less efficient; see the comment on "melt with custom variable columns using variable_table attribute" (Rdatatable/data.table#4731). If you know that you want to melt columns and split the names at the same time, you save time and space because you don't have to compute or store the intermediate column (which you don't actually want anyway).
This approach may also be worth considering because it makes these kinds of operations more convenient to specify, when you want to say "here is a regex/separator, use it BOTH for specifying measure.vars/id.vars (via which columns match or don't match) and for converting the column names to the desired output format (via the capture groups / separator / conversion function)". This avoids repetition in user code.

st-pasha Oct 8, 2020
Author

In terms of efficiency, we are creating a virtual column here. For example when melting three columns a, b, c in a dataset with 1,000,000 rows, we don't have to create a million copies of each name -- it's sufficient to have a virtual column that "knows" that the first 1M values are a, the second 1M are b, and the last 1M are c.
The trick, though, is to make sure that whenever any function is applied to such a column, it is "pushed back" to the underlying column of labels ['a', 'b', 'c']. It may not be exactly trivial, but such functionality would be generically useful, not just for this particular function, but for any manipulation of a categorical column.
I agree that specifying a single regex to both select and transform the columns is very convenient. I guess it is always possible to have the standalone melt() function provide the basic interface, while the frame method .melt() (or .pivot_longer()) can include the more traditional API with many options. This API would essentially be syntactic sugar for the simple melt().
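A minimal sketch of the virtual names column and the "push back" of a function to the labels might look as follows. The class VirtualNamesColumn and its map() method are hypothetical names chosen for this illustration; nothing here reflects datatable's actual internals.

```python
class VirtualNamesColumn:
    """Columns-first names column: each label stands for `nrows` rows.

    Hypothetical sketch: only the short list of labels is stored.
    """

    def __init__(self, labels, nrows):
        self.labels = list(labels)
        self.nrows = nrows

    def __len__(self):
        return len(self.labels) * self.nrows

    def __getitem__(self, k):
        if not 0 <= k < len(self):
            raise IndexError(k)
        return self.labels[k // self.nrows]

    def map(self, fn):
        """Push `fn` back to the labels: it runs once per label, not per row."""
        return VirtualNamesColumn([fn(label) for label in self.labels], self.nrows)


calls = 0

def counting_upper(s):
    """Upper-case a label and count how often the transform is invoked."""
    global calls
    calls += 1
    return s.upper()


names = VirtualNamesColumn(["a", "b", "c"], nrows=1_000_000)
upper = names.map(counting_upper)
print(len(upper), upper[0], upper[1_500_000], upper[2_999_999], calls)
```

The transformed column still reports 3,000,000 rows, but the function is called only three times.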

samukweku · 2020-11-05T07:37:27Z

samukweku
Nov 5, 2020

While considering this, it would be useful if a transpose function could be implemented, so that the user can turn rows into columns or columns into rows. In pandas, this is the transpose function, and it also exists in data.table.

0 replies

lafet1 · 2021-01-26T13:25:40Z

lafet1
Jan 26, 2021

Hi, I'm sorry if there's a better place to ask this question and if so, please feel free to tell me where to move it.

I came across this package and I'm really excited about it! But I've noticed that the reshaping functions (melt/dcast/pivot_table) are missing.

Is there a timetable for this kind of functionality? I've checked the list of features planned for the 1.0.0 release in the documentation and didn't see it there.

4 replies
