API: Change the handling of IDs in from_dataframe constructors #477

martinfleis · 2022-09-03T11:04:22Z

As a follow-up of the discussion we had yesterday during the dev call, I am starting this PR more as a place for a discussion on top of a code rather than an actual code change.

I have created a from_dataframe_new method to match the ideal outcome of the transition from ids, idVariable and id_order to a cleaner API. What needs to be figured out is a transitioning phase.

Looking at the code, I also think that @ljwolf's suggestion to deprecate id_order should be done at the same time as the rest of the changes as these are quite closely linked together.

There is one question we haven't discussed though. What shall we do with the actual dataframe index? We currently completely ignore it and when no ids are passed use positional indexing to encode weights. But the most sensible default would be to use the dataframe index instead imho. However, this is quite a big breaking change, so I am not sure what would be the best on that front. edit: the docstring now says If nothing is provided, the dataframe index is used but the reality does not follow that.

The API change proposed here on the example of Rook should be then mirrored to all relevant from_dataframe methods. In some cases (KNN, DistanceBand), it involves nearly no change as ids behave as we want it and there is no idVariable. So the main goal here is to ensure the API is consistent.

Relevant discussions: #473, #183, #184 (which does essentially the same thing as this PR), #102

martinfleis · 2022-09-03T11:04:57Z

libpysal/weights/contiguity.py

+            geom_col = df.geometry.name
+
+        if isinstance(ids, str):
+            ids = df[ids]


Not using get here as if you pass a string that does not match a column name we should raise an informative warning. The current code above gives AttributeError: 'NoneType' object has no attribute 'tolist', this one gives KeyError: 'missing', which is sensible enough.

libpysal/weights/contiguity.py

codecov · 2022-09-03T11:11:21Z

Codecov Report

Merging #477 (c796673) into master (d291e01) will increase coverage by 0.0%.
The diff coverage is 69.9%.

@@          Coverage Diff           @@
##           master    #477   +/-   ##
======================================
  Coverage    78.8%   78.8%           
======================================
  Files         122     122           
  Lines       13100   13161   +61     
======================================
+ Hits        10320   10371   +51     
- Misses       2780    2790   +10

Impacted Files	Coverage Δ
libpysal/weights/contiguity.py	`77.2% <64.4%> (-1.9%)`	⬇️
libpysal/weights/gabriel.py	`47.1% <65.2%> (+3.3%)`	⬆️
libpysal/weights/distance.py	`85.4% <77.8%> (ø)`
libpysal/weights/tests/test_contiguity.py	`95.1% <100.0%> (+0.3%)`	⬆️
libpysal/_version.py	`40.7% <0.0%> (+2.7%)`	⬆️

sjsrey · 2022-09-08T22:39:43Z

I have created a from_dataframe_new method to match the ideal outcome of the transition from ids, idVariable and id_order to a cleaner API. What needs to be figured out is a transitioning phase.

So the plan would be to deprecate idVariable and id_order, while refactoring ids for a new api, correct?

If yes, I can start exploring deprecation.

sjsrey · 2022-09-09T03:19:23Z

On a related note, turning off the sorting of the ids on init can be done. I explored this 13c89b6 and two tests broke in all of weights. These had to do with block_weights in both cases. I updated the doctests and unittests to correct for the case of turning off sort on init fc0f8b6. So removing the behavior of sorting on init could be done in the no sort branch.

martinfleis · 2022-09-09T14:36:23Z

So the plan would be to deprecate idVariable and id_order, while refactoring ids for a new api, correct?

Yes. And figure out what to do with positional index vs the dataframe index of when nothing else is passed.

sjsrey · 2022-09-10T00:49:57Z

So the plan would be to deprecate idVariable and id_order, while refactoring ids for a new api, correct?

Yes. And figure out what to do with positional index vs the dataframe index of when nothing else is passed.

Since 3.6, dicts order the keys based on insertion order, so the positional index for our w would be w.neighbors.keys(). Prior to 3.6 we had to do all the id_sort business to support alignment between w and the arrays they would be applied to (in spatial lags).

So one thought is the new w.ids could be a property with a getter that returns w.neighborhs.keys().

Users could then sort the array to align with the w.ids order and then pass the sorted array into lag_spatial or we provide a way to reorder the w.ids and pass that into the lag_spatial function with the original (unsorted) array.

sjsrey · 2022-09-19T23:12:52Z

The timing for the changes might be something like this:

For meta 23.01 released Jan 2023, deprecation warnings have been added for the old id-related api.

We then target 23.07 (July release) for the new api.

Maintaining two api's seems like a painful route, so the above is suggested as a way to avoid this.

What do devs think?

martinfleis · 2022-09-20T15:30:34Z

Not sure I fully understand. Esp. this

Maintaining two api's seems like a painful route, so the above is suggested as a way to avoid this.

I agree that we should not have two APIs but we need a transition period. We cannot just raise a warning in January and make the change in June. Everything breaks in June if you don't give a way of adjusting to the new API in January.

sjsrey · 2022-09-20T22:47:05Z

Not sure I fully understand. Esp. this

Maintaining two api's seems like a painful route, so the above is suggested as a way to avoid this.

I agree that we should not have two APIs but we need a transition period. We cannot just raise a warning in January and make the change in June. Everything breaks in June if you don't give a way of adjusting to the new API in January.

What I am trying to understand is how we go about this. Given the nature of the changes we are proposing, I can't see a way (yet) where say; [1] we have deprecation warnings around ids and id_order which would permit legacy code to work, but alert the user to upcoming changes; and [2] also allow for the new functionality of ids in order to point the user to the functionality to adopt.

Its easier if we have a mapping of: old_property -> new_property with different names. For ids we don't have this to leverage as its ids -> ids.

Maybe this suggests a different name for the property/method in the new API?
old is ids new maybe something like iloc and loc?

The other problem is that if we deprecate idVariable and the sort of ids on init, then that functionality has to coexist peacefully with the new API (which we agree should not sort on init).

Fleshing all this out here is really helpful :->

martinfleis · 2022-09-21T09:23:42Z

We don't have to deprecate anything around ids, we just make that variable smarter to consume either string representing the columns name, or any array-like of the matching length. That is fully backwards compatible and can be done immediately.

At the same time, we can raise a deprecation warning on idVariable but pass its value temporarily to enhanced ids.

The id_order can now raise a warning and since we want to remove it without a replacement, that is it.

One thing I still don't know how to approach is the issue of index vs positional index. We currently claim in the docstring that index is used automatically but it is not. If we just start doing that instead of using the position (iloc), but may break some things (e.g. it would break small bits in momepy). So I would say we should add a new keyword to control that behaviour.

Something like this

    @classmethod
    def from_dataframe(
        cls, df, geom_col=None, idVariable=None, ids=None, id_order=None, use_index=None, **kwargs
    ):
        """
        Construct a weights object from a pandas dataframe with a geometry
        column. This will cast the polygons to PySAL polygons, then build the W
        using ids from the dataframe.

        Parameters
        ----------
        df          : DataFrame
                      a :class: `pandas.DataFrame` containing geometries to use
                      for spatial weights
        geom_col    : string
                      the name of the column in `df` that contains the
                      geometries. Defaults to active geometry column.
        idVariable  : string
                      the name of the column to use as IDs. If nothing is
                      provided, the dataframe index is used DEPRECATED - use ids
        ids         : list
                      a list of ids to use to index the spatial weights object.
                      Order is not respected from this list.
        id_order    : list
                      an ordered list of ids to use to index the spatial weights
                      object. If used, the resulting weights object will iterate
                      over results in the order of the names provided in this
                      argument. DEPRECATED
        use_index   : bool
                      use index of `df` as ids to index the spatial weights object

        See Also
        --------
        :class:`libpysal.weights.weights.W`
        :class:`libpysal.weights.contiguity.Rook`
        """
        if geom_col is None:
            geom_col = df.geometry.name

        if id_order is not None:
            warnings.warn("deprecated, don't use")
            if id_order is True and ((idVariable is not None) or (ids is not None)):
                # if idVariable is None, we want ids. Otherwise, we want the
                # idVariable column
                id_order = list(df.get(idVariable, ids))
            else:
                id_order = df.get(id_order, ids)
        
        if idVariable is not None:
            if ids is None:
                warnings.warn("deprecated, use ids")
                ids = idVariable
            else:
                warnings.warn("both idVariable and ids passed, using ids")

        if ids is None:
            if use_index is None:
                warnings.warn("use_index defaults to False but will default to True in future. Set True/False directly to control this behaviour and silence this warning")
                use_index = False
            if use_index:
                ids = df.index.tolist()

        if isinstance(ids, str):
            ids = df[ids]

        if not isinstance(ids, list):
            ids = ids.tolist()


        return cls.from_iterable(
            df[geom_col].tolist(), ids=ids, id_order=id_order, **kwargs

This already supports all we want in the future but is backwards compatible and provides a deprecation period for transition.

martinfleis · 2022-10-28T19:10:21Z

I have updated from_dataframe in Rook, and Queen, adding the deprecation and ensuring all is controllable (use of index or not) and matching the docstring. I have also updated from_dataframe in KNN, Kernel, and DistanceBand by adding use_index keyword to ensure consistency between all methods.

All changes are backwards compatible. In contiguity, we will want to change the default in use_index to True to match the default in distance weights (that has already been used).

@ljwolf how are Delaunay weights indexed? It would be good to have the same options there if possible.

ljwolf · 2022-10-28T19:35:19Z

They only take input, and do no id processing iirc, so they create a range index. They're in weights.gabriel

martinfleis · 2022-10-28T19:52:11Z

Okay, I added the same options to handle index to Delaunay.from_dataframe as well. The default on use_index will need to change there as well, the situation is the same as we have in contiguity right now.

jGaboardi

LGTM

ljwolf · 2022-11-03T10:49:42Z

OK, still some odd behavior around weights ordering here, akin to the issues I outlined in the test bench from #184. I think that part of this fix has to be stopping sorting ids by default.

An example using FIPS codes from the south example data:

import geopandas
from libpysal import weights, examples
examples.load_example("south.shp") # I can't locate south if I don't load it first, even if it's downloaded?
south = geopandas.read_file(examples.get_path("south.shp"))
south_shuffled = south.sample(frac=1, replace=False)
print(weights.Rook.from_dataframe(south, ids="FIPS").id_order[:5])
['01001', '01003', '01005', '01007', '01009']
print(weights.DistanceBand.from_dataframe(south, threshold=2, ids="FIPS").id_order[:5])
['54029', '54009', '54069', '54051', '10003']

Sorting of the ids is happening for contiguity weights, but not distance weights. The passing isn't working at all for the Delaunay-family weights:

print(weights.Delaunay.from_dataframe(south.set_geometry(south.centroid), ids="FIPS").id_order[:5])
[0, 1, 2, 3, 4, 5]
print(weights.Gabriel.from_dataframe(south.set_geometry(south.centroid), ids="FIPS").id_order[:5])
[0, 1, 2, 3, 4, 5]

And this fails entirely, since the columns aren't propagated down to the FIPS

weights.Voronoi(south.set_geometry(south.centroid), ids='FIPS')

because columns aren't correctly propagated through to the Rook.from_dataframe() call.

ljwolf · 2022-11-03T10:58:13Z

libpysal/weights/gabriel.py

+                "The numba package is used extensively in this module"
+                " to accelerate the computation of graphs. Without numba,"
+                " these computations may become unduly slow on large data."
+            )
        edges, _ = self._voronoi_edges(coordinates)
        voronoi_neighbors = pandas.DataFrame(edges).groupby(0)[1].apply(list).to_dict()


I think this is where we have to intervene to get ids to "work":

Build pandas.DataFrame(edges) and store as adjlist

re-assign the head/tail values for each edge according to the id list (e.g. edges.loc[:,0] = ids[edges.loc[:,0])

then proceed with the group by and dict to get the weights into W input format.

I believe this would have to happen at the same location in Gabriel neighbors (line 181) and I don't yet understand how to do this at 223 (maybe pass WSP(sp, ids=...?)

libpysal/weights/contiguity.py

ljwolf

I think, too, the Voronoi fix would be to set the index of the voronoi cell dataframe in the Voronoi function...

martinfleis · 2022-11-03T11:08:09Z

@ljwolf whoops. I'll have a look at Delaunay/Voronoi stuff later unless you want to commit some changes yourself (which may be faster).

some odd behavior around weights ordering here

Can we solve the ordering issue in another PR and have this focused on passing the ids through? Since this hasn't changed from what we have on master.

ljwolf · 2022-11-03T11:56:48Z

Sure! I'm drafting a fix for these inconsistencies now, will send

ljwolf · 2022-11-03T12:20:37Z

I've sent a PR upstream (martinfleis/libpysal#35), since I'm not sure if this counts as "fixing" sorting by default.... What I'm referring to is this:

.from_dataframe() methods should always provide weights that are aligned to the input dataframe.

The current state of things is:

contiguity weights' .from_dataframe() will break the ordering
distance weights' .from_dataframe() does not break the ordering
With martinfleis/libpysal#35, Delaunay-based weights will not break the ordering.

I'd encourage us to not break order in from_dataframe(), which means that we consider the input to from_dataframe as having a "set" id_order, rather than sorting it.

propagate relabelling for gabriel weights

martinfleis · 2022-11-03T13:54:43Z

The order should now be preserved for contiguity as well.

I think I'll add some tests for this before merging.

martinfleis · 2023-01-01T21:31:32Z

Should be ready now.

knaaptime

sweet

sjsrey · 2023-01-06T19:47:07Z

@martinfleis should we merge this?

martinfleis · 2023-01-06T21:13:15Z

Unless there are any other comments it is ready to merge, yes.

make a new method as a target

1bbf014

martinfleis commented Sep 3, 2022

View reviewed changes

libpysal/weights/contiguity.py Outdated Show resolved Hide resolved

martinfleis mentioned this pull request Sep 13, 2022

API: deprecation decorators for transition to functional API pysal/momepy#406

Merged

jGaboardi mentioned this pull request Oct 28, 2022

items for deprecation #484

Open

50 tasks

martinfleis added 7 commits October 28, 2022 20:37

contiguity

936d5de

rm idVariable from test

ba6eea3

distance

2e2ef64

Merge remote-tracking branch 'upstream/master' into ids

39259a5

avoid NonType

3e8e353

check length

47ae204

redo to minimise checks

081f96d

martinfleis marked this pull request as ready for review October 28, 2022 19:06

add the same options to Delaunay

c4f00f2

martinfleis requested review from sjsrey, jGaboardi and ljwolf November 2, 2022 14:08

jGaboardi approved these changes Nov 2, 2022

View reviewed changes

jGaboardi requested a review from knaaptime November 2, 2022 14:45

sjsrey approved these changes Nov 3, 2022

View reviewed changes

ljwolf reviewed Nov 3, 2022

View reviewed changes

libpysal/weights/contiguity.py Outdated Show resolved Hide resolved

ljwolf reviewed Nov 3, 2022

View reviewed changes

propagate relabelling for gabriel weights

a61ae97

ljwolf and others added 3 commits November 3, 2022 12:27

get, do not pop, from kwargs

5338bec

Merge pull request #35 from ljwolf/ids

91240dc

propagate relabelling for gabriel weights

preserve order in contiguity

f527126

martinfleis mentioned this pull request Nov 15, 2022

New release? #489

Closed

martinfleis added 2 commits January 1, 2023 21:57

Merge branch 'master' into ids

f334af6

test order

c796673

knaaptime approved these changes Jan 2, 2023

View reviewed changes

sjsrey merged commit ad1afcb into pysal:master Jan 10, 2023

martinfleis mentioned this pull request Jan 13, 2023

[WIP] shared perimeter-weighted contiguity #507

Closed

sjsrey mentioned this pull request Jan 23, 2023

[ENH]: option to return series from lag_spatial #512

Closed

martinfleis deleted the ids branch June 29, 2024 19:30

API: Change the handling of IDs in from_dataframe constructors #477

API: Change the handling of IDs in from_dataframe constructors #477

Uh oh!

Conversation

martinfleis commented Sep 3, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

martinfleis Sep 3, 2022

Choose a reason for hiding this comment

Uh oh!

Uh oh!

codecov bot commented Sep 3, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

sjsrey commented Sep 8, 2022

Uh oh!

sjsrey commented Sep 9, 2022

Uh oh!

martinfleis commented Sep 9, 2022

Uh oh!

sjsrey commented Sep 10, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sjsrey commented Sep 19, 2022

Uh oh!

martinfleis commented Sep 20, 2022

Uh oh!

sjsrey commented Sep 20, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

martinfleis commented Sep 21, 2022

Uh oh!

martinfleis commented Oct 28, 2022

Uh oh!

ljwolf commented Oct 28, 2022

Uh oh!

martinfleis commented Oct 28, 2022

Uh oh!

jGaboardi left a comment

Choose a reason for hiding this comment

Uh oh!

ljwolf commented Nov 3, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ljwolf Nov 3, 2022

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ljwolf left a comment

Choose a reason for hiding this comment

Uh oh!

martinfleis commented Nov 3, 2022

Uh oh!

ljwolf commented Nov 3, 2022

Uh oh!

ljwolf commented Nov 3, 2022

Uh oh!

martinfleis commented Nov 3, 2022

Uh oh!

martinfleis commented Jan 1, 2023

Uh oh!

knaaptime left a comment

Choose a reason for hiding this comment

Uh oh!

sjsrey commented Jan 6, 2023

Uh oh!

martinfleis commented Jan 6, 2023

Uh oh!

Uh oh!

martinfleis commented Sep 3, 2022 •

edited

Loading

codecov bot commented Sep 3, 2022 •

edited

Loading

sjsrey commented Sep 10, 2022 •

edited

Loading

sjsrey commented Sep 20, 2022 •

edited

Loading

ljwolf commented Nov 3, 2022 •

edited

Loading