-
Notifications
You must be signed in to change notification settings - Fork 79
API: Change the handling of IDs in from_dataframe constructors #477
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
libpysal/weights/contiguity.py
Outdated
geom_col = df.geometry.name | ||
|
||
if isinstance(ids, str): | ||
ids = df[ids] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not using get
here as if you pass a string that does not match a column name we should raise an informative warning. The current code above gives AttributeError: 'NoneType' object has no attribute 'tolist', this one gives KeyError: 'missing', which is sensible enough.
Codecov Report
@@ Coverage Diff @@
## master #477 +/- ##
======================================
Coverage 78.8% 78.8%
======================================
Files 122 122
Lines 13100 13161 +61
======================================
+ Hits 10320 10371 +51
- Misses 2780 2790 +10
|
So the plan would be to deprecate If yes, I can start exploring deprecation. |
On a related note, turning off the sorting of the ids on init can be done. I explored this 13c89b6 and two tests broke in all of weights. These had to do with block_weights in both cases. I updated the doctests and unittests to correct for the case of turning off sort on init fc0f8b6. So removing the behavior of sorting on init could be done in the no sort branch. |
Yes. And figure out what to do with positional index vs the dataframe index of when nothing else is passed. |
Since 3.6, dicts order the keys based on insertion order, so the positional index for our So one thought is the new Users could then sort the array to align with the |
The timing for the changes might be something like this: For meta 23.01 released Jan 2023, deprecation warnings have been added for the old id-related api. We then target 23.07 (July release) for the new api. Maintaining two api's seems like a painful route, so the above is suggested as a way to avoid this. What do devs think? |
Not sure I fully understand. Esp. this
I agree that we should not have two APIs but we need a transition period. We cannot just raise a warning in January and make the change in June. Everything breaks in June if you don't give a way of adjusting to the new API in January. |
What I am trying to understand is how we go about this. Given the nature of the changes we are proposing, I can't see a way (yet) where say; [1] we have deprecation warnings around Its easier if we have a mapping of: Maybe this suggests a different name for the property/method in the new API? The other problem is that if we deprecate Fleshing all this out here is really helpful :-> |
We don't have to deprecate anything around At the same time, we can raise a deprecation warning on The One thing I still don't know how to approach is the issue of index vs positional index. We currently claim in the docstring that index is used automatically but it is not. If we just start doing that instead of using the position (iloc), but may break some things (e.g. it would break small bits in momepy). So I would say we should add a new keyword to control that behaviour. Something like this @classmethod
def from_dataframe(
cls, df, geom_col=None, idVariable=None, ids=None, id_order=None, use_index=None, **kwargs
):
"""
Construct a weights object from a pandas dataframe with a geometry
column. This will cast the polygons to PySAL polygons, then build the W
using ids from the dataframe.
Parameters
----------
df : DataFrame
a :class: `pandas.DataFrame` containing geometries to use
for spatial weights
geom_col : string
the name of the column in `df` that contains the
geometries. Defaults to active geometry column.
idVariable : string
the name of the column to use as IDs. If nothing is
provided, the dataframe index is used DEPRECATED - use ids
ids : list
a list of ids to use to index the spatial weights object.
Order is not respected from this list.
id_order : list
an ordered list of ids to use to index the spatial weights
object. If used, the resulting weights object will iterate
over results in the order of the names provided in this
argument. DEPRECATED
use_index : bool
use index of `df` as ids to index the spatial weights object
See Also
--------
:class:`libpysal.weights.weights.W`
:class:`libpysal.weights.contiguity.Rook`
"""
if geom_col is None:
geom_col = df.geometry.name
if id_order is not None:
warnings.warn("deprecated, don't use")
if id_order is True and ((idVariable is not None) or (ids is not None)):
# if idVariable is None, we want ids. Otherwise, we want the
# idVariable column
id_order = list(df.get(idVariable, ids))
else:
id_order = df.get(id_order, ids)
if idVariable is not None:
if ids is None:
warnings.warn("deprecated, use ids")
ids = idVariable
else:
warnings.warn("both idVariable and ids passed, using ids")
if ids is None:
if use_index is None:
warnings.warn("use_index defaults to False but will default to True in future. Set True/False directly to control this behaviour and silence this warning")
use_index = False
if use_index:
ids = df.index.tolist()
if isinstance(ids, str):
ids = df[ids]
if not isinstance(ids, list):
ids = ids.tolist()
return cls.from_iterable(
df[geom_col].tolist(), ids=ids, id_order=id_order, **kwargs This already supports all we want in the future but is backwards compatible and provides a deprecation period for transition. |
I have updated All changes are backwards compatible. In contiguity, we will want to change the default in @ljwolf how are Delaunay weights indexed? It would be good to have the same options there if possible. |
They only take input, and do no id processing iirc, so they create a range index. They're in weights.gabriel |
Okay, I added the same options to handle index to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
OK, still some odd behavior around weights ordering here, akin to the issues I outlined in the test bench from #184. I think that part of this fix has to be stopping sorting ids by default. An example using FIPS codes from the import geopandas
from libpysal import weights, examples
examples.load_example("south.shp") # I can't locate south if I don't load it first, even if it's downloaded?
south = geopandas.read_file(examples.get_path("south.shp"))
south_shuffled = south.sample(frac=1, replace=False)
print(weights.Rook.from_dataframe(south, ids="FIPS").id_order[:5])
['01001', '01003', '01005', '01007', '01009']
print(weights.DistanceBand.from_dataframe(south, threshold=2, ids="FIPS").id_order[:5])
['54029', '54009', '54069', '54051', '10003'] Sorting of the print(weights.Delaunay.from_dataframe(south.set_geometry(south.centroid), ids="FIPS").id_order[:5])
[0, 1, 2, 3, 4, 5]
print(weights.Gabriel.from_dataframe(south.set_geometry(south.centroid), ids="FIPS").id_order[:5])
[0, 1, 2, 3, 4, 5] And this fails entirely, since the columns aren't propagated down to the FIPS weights.Voronoi(south.set_geometry(south.centroid), ids='FIPS') because columns aren't correctly propagated through to the |
"The numba package is used extensively in this module" | ||
" to accelerate the computation of graphs. Without numba," | ||
" these computations may become unduly slow on large data." | ||
) | ||
edges, _ = self._voronoi_edges(coordinates) | ||
voronoi_neighbors = pandas.DataFrame(edges).groupby(0)[1].apply(list).to_dict() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is where we have to intervene to get ids to "work":
- Build
pandas.DataFrame(edges)
and store asadjlist
- re-assign the head/tail values for each edge according to the id list (e.g.
edges.loc[:,0] = ids[edges.loc[:,0]
) - then proceed with the group by and dict to get the weights into
W
input format.
I believe this would have to happen at the same location in Gabriel neighbors (line 181) and I don't yet understand how to do this at 223 (maybe pass WSP(sp, ids=...?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, too, the Voronoi fix would be to set the index of the voronoi cell dataframe in the Voronoi function...
@ljwolf whoops. I'll have a look at Delaunay/Voronoi stuff later unless you want to commit some changes yourself (which may be faster).
Can we solve the ordering issue in another PR and have this focused on passing the ids through? Since this hasn't changed from what we have on master. |
Sure! I'm drafting a fix for these inconsistencies now, will send |
I've sent a PR upstream (martinfleis/libpysal#35), since I'm not sure if this counts as "fixing" sorting by default.... What I'm referring to is this:
The current state of things is:
I'd encourage us to not break order in |
propagate relabelling for gabriel weights
The order should now be preserved for contiguity as well. I think I'll add some tests for this before merging. |
Should be ready now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sweet
@martinfleis should we merge this? |
Unless there are any other comments it is ready to merge, yes. |
As a follow-up of the discussion we had yesterday during the dev call, I am starting this PR more as a place for a discussion on top of a code rather than an actual code change.
I have created a
from_dataframe_new
method to match the ideal outcome of the transition fromids
,idVariable
andid_order
to a cleaner API. What needs to be figured out is a transitioning phase.Looking at the code, I also think that @ljwolf's suggestion to deprecate
id_order
should be done at the same time as the rest of the changes as these are quite closely linked together.There is one question we haven't discussed though. What shall we do with the actual dataframe index? We currently completely ignore it and when no
ids
are passed use positional indexing to encode weights. But the most sensible default would be to use the dataframe index instead imho. However, this is quite a big breaking change, so I am not sure what would be the best on that front. edit: the docstring now saysIf nothing is provided, the dataframe index is used
but the reality does not follow that.The API change proposed here on the example of Rook should be then mirrored to all relevant
from_dataframe
methods. In some cases (KNN, DistanceBand), it involves nearly no change asids
behave as we want it and there is noidVariable
. So the main goal here is to ensure the API is consistent.Relevant discussions: #473, #183, #184 (which does essentially the same thing as this PR), #102