clarifying the meaning of the bounds variable #416

dcherian · 2025-05-08T04:07:04Z

dcherian
May 8, 2025

Question

Hello, over at Xarray we are trying to add support for indexing using interval bounds recorded as CF recommends it.

To do so we need to set whether the bounds values are half-open on one side or closed.

From example 7.1, which says "The instant time(i) should be contained within the interval, or be at one end of it.", it seems like the answer is "closed on both edges".

However when combined with a later sentence "if the interval i=3 begins at 11.0 days, when interval i=2 ends, the values in timebnd(3,0) and timebnd(2,1) must be exactly the same." then it seems like the value at t=11.0 days is undefined since it technically belongs to two intervals.

Am I understanding this correctly?

sethmcg · 2025-05-08T05:04:47Z

sethmcg
May 8, 2025
Collaborator

Which end is open depends on where the time coordinate sits relative to the bounds. If the time coordinate is at the beginning of the interval, then it's half-open to the right. If the time coordinate is at the end of the interval, it's half-open to the left. If it's somewhere in the middle, then it doesn't really matter.

So if the bounds of interval i=3 are [11.0, 12.0] and time[3] = 11.0, then the time coordinate is at the beginning of the interval, it's half-open to the right, and t=11.0 belongs to interval 3.

If time[3] = 12.0, then the time coordinate is at the end of the interval, it's half-open to the left, and t=11.0 belongs to interval 2.

If time[3] = 11.5, then the time coordinate is in the middle of the interval and it's indeterminate which interval t=11.0 belongs to, but it's immaterial because there isn't any data at t=11.0.

(Your question has made me realize that CF doesn't specify that the relative positioning of the coordinate with respect to the bounds should be constant, and that it's not actually forbidden to change where you put it from interval to interval. However, that would be incredibly bad practice, and I would be genuinely shocked if anyone has ever done it. It would be perfectly justifiable for you to write code that only checks a single timestep to determine which end of the interval is open, and if somebody has problems because their data violates that assumption, that's on them.)
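The rule described above can be sketched as follows. This is only an illustration under stated assumptions: 1-D coordinates, bounds given as (lower, upper) pairs, and exact equality between a coordinate and a bound. The function name `infer_closed_side` and its return labels are hypothetical, not part of CF or xarray. It checks every cell rather than a single timestep, and gives up when the placement is interior or inconsistent.

```python
def infer_closed_side(coords, bounds):
    """Guess which side of the intervals is closed from where the coordinates sit.

    Returns "left" if every coordinate equals its lower bound, "right" if every
    coordinate equals its upper bound, and "ambiguous" otherwise.
    """
    at_lower = all(c == lo for c, (lo, hi) in zip(coords, bounds))
    at_upper = all(c == hi for c, (lo, hi) in zip(coords, bounds))
    if at_lower and not at_upper:
        return "left"    # half-open to the right: [lo, hi)
    if at_upper and not at_lower:
        return "right"   # half-open to the left: (lo, hi]
    return "ambiguous"   # interior, mixed, or zero-width cells


bounds = [(10.0, 11.0), (11.0, 12.0), (12.0, 13.0)]
print(infer_closed_side([10.0, 11.0, 12.0], bounds))  # left
print(infer_closed_side([11.0, 12.0, 13.0], bounds))  # right
print(infer_closed_side([10.5, 11.5, 12.5], bounds))  # ambiguous
```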

1 reply

benbovy May 8, 2025

then the time coordinate is in the middle of the interval and it's indeterminate which interval t=11.0 belongs to, but it's immaterial because there isn't any data at t=11.0.

I'm not sure this is always true? Cell data may be defined over the full extent of the cell.

taylor13 · 2025-05-08T06:41:36Z

taylor13
May 8, 2025
Collaborator

I would agree with Seth that in practical terms, it doesn't matter much except in a formal sense the coordinate value must always be found in the interval defined by the bounds. A common use of bounds is to determine the size of the interval represented by the coordinate value, and I think the "width" of a cell is independent of whether the bounds of the cell are open or closed.

0 replies

Armin-RS · 2025-05-08T07:21:39Z

Armin-RS
May 8, 2025

But for the data user, it does matter on which side the bounds of the cell are open or closed.

I produce a gridded data set on a regular grid in projected x/y coordinates. Similar to Figure 7.1, my x and y grid cell centre coordinates are at nn.5 km, while the bounds are at nn.0 km and (nn+1).0 km. And each cell has its own daily precipitation value.

If my user wants to find the correct daily precipitation value for arbitrary coordinates, it is easy to select the correct cell if the arbitrary coordinates are not equal to any of the cell bounds. But if one or both arbitrary coordinates match the cell bounds, i.e. they are full kilometre values, which of the 2 or 4 cells should they select to extract the correct daily precipitation value for the selected coordinates?
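To make the ambiguity concrete, here is a small 1-D sketch assuming cells with edges at whole kilometres. The helper `find_cell` and the precipitation values are made up for illustration; this is not an xarray or CF API. A query exactly on a shared edge lands in a different cell depending on which side is treated as closed.

```python
from bisect import bisect_left, bisect_right

edges = [0.0, 1.0, 2.0, 3.0]          # cell i spans edges[i] .. edges[i + 1]
precip = [4.2, 0.0, 7.5]              # one made-up daily value per cell


def find_cell(x, closed="left"):
    """Return the index of the cell containing x, or None if x is outside the grid."""
    if closed == "left":               # cells are [lo, hi)
        i = bisect_right(edges, x) - 1
    else:                              # cells are (lo, hi]
        i = bisect_left(edges, x) - 1
    return i if 0 <= i < len(precip) else None


print(find_cell(1.5), find_cell(1.5, closed="right"))  # 1 1
print(find_cell(1.0), find_cell(1.0, closed="right"))  # 1 0
```

Away from the edges both conventions agree; only the full-kilometre query changes cell.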

3 replies

benbovy May 8, 2025

Yes I also think that for data selection / extraction it does matter to know whether cell boundaries are open or closed.

sethmcg May 8, 2025
Collaborator

If my user wants to find the correct daily precipitation value for arbitrary coordinates

The problem is that there's no such thing.

If you look at the value for a single gridcell / time interval, that's a discrete sample of an underlying field that is continuous. It's representative of the overall conditions in the cell / interval, but it's not the value for the whole thing. It's not like the temperature is exactly 72F over an entire 25x25 km gridcell, and then when you get to the edge there's a step transition down to 68F when you move into the next one.

It's more like the temperature here is 72, and over there it's 68, and in-between it's somewhere in the middle, but even that is only an approximation. There's all kinds of sub-scale variability being glossed over. It could be that it's 72 here and 68 there, but the point in the middle is 55 because it's on top of a big hill. Or if we're talking about time, if it was 68 at 11 am and 72 at noon, then at 11:30 it was probably around 70. But maybe not; maybe there was a sudden thundershower that briefly cooled things off. And if it was 68 at noon yesterday and it's 72 at noon today, that definitely doesn't imply that it was 70 at midnight last night!

Fundamentally, you only know what the values are at the points where they are defined. If you want values at other points, you have to do some kind of interpolation, and there are a million different ways to do it. There is no correct answer. "Nearest neighbor" is a valid interpolation method, and how you handle it when the distance to the two nearest neighbors is equal is a choice about the interpolation method you're using, not anything to do with what the value of the field at that point would theoretically be if you measured it. You can implement the algorithm however you like and it will be equally valid; the only consideration is that, as others have mentioned, it's valuable to users for it to be stable and give you the same answer each time you use it.

benbovy May 8, 2025

If you look at the value for a single gridcell / time interval, that's a discrete sample of an underlying field that is continuous. It's representative of the overall conditions in the cell / interval, but it's not the value for the whole thing.

What about other examples where the value results from a measurement that has been made over the whole gridcell / time interval? Like, e.g., the number of events that occurred during the time interval, the maximum value of some quantity that has been recorded over the interval, etc. There's no sub-cell or sub-interval variability in this case, the value is defined over the entire cell or interval and not at a particular point within that cell / interval. Regardless of how stable an algorithm is, depending on how the boundaries are interpreted it will yield either right or wrong results.

I could thus imagine that for those examples data providers may want to let users know exactly which intervals have been used for making those measurements, i.e., which bounds are open/closed... At least in theory, as in practice I don't have enough experience with CF-conventions and the related domain to know if CF should really take care of that. Or maybe CF already addresses that by other means?

davidhassell · 2025-05-08T13:20:58Z

davidhassell
May 8, 2025
Maintainer

I agree that for data selection / extraction it does matter to know whether cell boundaries are open or closed, but it is currently beyond the realm of CF to provide us with that information. Therefore the user can do as they please!

I don't quite follow the example described by @Armin-RS above ... what does the user want here - the value from a unique cell, or a number of cells from which an interpolation can be performed? or something else? In the former case I would say that it doesn't matter which unique cell is returned, as long as the software gives the same result when the operation is repeated (i.e. the software makes a predictable assumption, and ideally the user can configure it).

0 replies

taylor13 · 2025-05-08T13:49:10Z

taylor13
May 8, 2025
Collaborator

I agree with @davidhassell . I think it is the user's responsibility to decide what data to extract at the bounds of a grid cell. I would usually choose to extract the mean of the values carried by the two cells separated by the boundary (perhaps with some weighting based on grid cell size). But others might make a different choice.
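One possible reading of that choice, as a rough sketch: on an interior edge, return the mean of the two neighbouring cell values weighted by cell width. The width weighting is an assumption (a plain mean would be equally defensible), and `value_at`, the edges and the values are hypothetical.

```python
from bisect import bisect_left, bisect_right

edges = [0.0, 1.0, 3.0, 4.0]           # unequal cell widths: 1, 2, 1
values = [10.0, 40.0, 20.0]            # made-up value per cell


def value_at(x):
    """Cell value at x; on an interior edge, the width-weighted mean of both neighbours."""
    if not edges[0] <= x <= edges[-1]:
        raise ValueError("x is outside the grid")
    left = max(bisect_left(edges, x) - 1, 0)                   # cell if cells are (lo, hi]
    right = min(bisect_right(edges, x) - 1, len(values) - 1)   # cell if cells are [lo, hi)
    if left == right:
        return values[left]            # x is strictly inside a cell or on an outer edge
    w_left = edges[left + 1] - edges[left]
    w_right = edges[right + 1] - edges[right]
    return (w_left * values[left] + w_right * values[right]) / (w_left + w_right)


print(value_at(2.0))  # 40.0
print(value_at(1.0))  # 30.0
```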

0 replies

dcherian · 2025-05-08T17:43:48Z

dcherian
May 8, 2025
Author

"If the time coordinate is at the beginning of the interval, then it's half-open to the right. If the time coordinate is at the end of the interval, it's half-open to the left. If it's somewhere in the middle, then it doesn't really matter."

I don't see this written anywhere. Is this the common interpretation? If so, would it be in scope for CF to document it?

I think it is the user's responsibility to decide what data to extract at the bounds of a grid cell.

Clarifying the conventions document with regard to this would be great for downstream users and analysis libraries. We will add an option to allow the user to choose which side is closed.

1 reply

pvanlaake May 30, 2025

It may be a common interpretation, but it is not an established custom. To give a very practical contrary example: ERA5 data for cumulative properties (such as precipitation) are recorded at the end of the time interval. So the time coordinate 13:00:00 records the interval noon to 13:00:00, inclusive. That is well documented.

That latter point leads straight to the sore point, also noted above by @sethmcg: since CF neither requires a particular design nor provides a mechanism to convey it to the data consumer, it is up to the data producer to choose and to document the convention to be used. But that documentation is not standardised, so xarray has nothing to navigate by.
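A small sketch of what end-of-interval labelling implies for lookups. It assumes hourly accumulations and treats each cell as (start, end]; that reading, the `label_for` helper and the timestamps are illustrative assumptions, not a description of how ERA5 files encode their intervals.

```python
from datetime import datetime, timedelta

step = timedelta(hours=1)
# Time coordinates stamped at the END of each accumulation interval.
times = [datetime(2025, 5, 8, 13), datetime(2025, 5, 8, 14), datetime(2025, 5, 8, 15)]
# Bounds reconstructed from the labels: each cell is (label - step, label].
bounds = [(t - step, t) for t in times]


def label_for(instant):
    """Return the end-of-interval label whose (start, end] cell contains instant."""
    for (start, end), label in zip(bounds, times):
        if start < instant <= end:
            return label
    return None


print(label_for(datetime(2025, 5, 8, 12, 30)))  # 2025-05-08 13:00:00
print(label_for(datetime(2025, 5, 8, 13, 0)))   # 2025-05-08 13:00:00
```

Under this reading 13:00:00 belongs to the cell labelled 13:00, not to the one labelled 14:00; with left-closed cells it would be the other way round.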

ChrisBarker-NOAA · 2025-05-08T18:42:15Z

ChrisBarker-NOAA
May 8, 2025
Collaborator

Hmm -- it sure would make sense to make this clear in CF, but ...

This is a real challenge. I'm not sure this issue is well clarified, well, anywhere at all. So I'm a bit wary of requiring it in CF.

It's a very common use case to have a grid, or mesh, or what have you, defined by points, and from that cells bounded by lines (or planes, if 3D) -- this applies to 1-D rectangular, curvilinear, unstructured grids, what have you. And indeed, it is essentially the same issue as, say, polygonal regions in GIS systems.

So you always have the issue of what cell, or polygon, or whatever, a point is in if it lies EXACTLY on a boundary:

If I'm exactly on the US-Canadian border -- which country am I in? Should a CF file define that?

In theory, the bounds themselves are of zero size: infinitesimally small points, lines, planes so it's actually impossible for a finite "thing" to be on the boundary.

In practice, these concepts are represented by finite numbers, so a point can be exactly on a boundary.

But also in practice, these numbers are often floating point numbers with limited precision, so "exactly" on the boundary is pretty darn hard to define -- it's actually "within an epsilon of the boundary"

I think the common way to deal with this is to take the concepts at their word -- the bounds are infinitesimally small, and so it really doesn't matter which side of a boundary you use, as long as it's consistent.

If you are interpolating, then you should get a continuous result either way.
If there is a single value applied to the entire cell, then there is a discontinuity at the boundaries in any case, the only question is which side of the boundary the discontinuity is -- within one epsilon.

Practically speaking, particularly with floating point computations, the answers are the same within the precision of the numbers, regardless.

A key note on dealing with this is that code needs to assure that you always get the same answer at the bounds:

e.g:
Good point-in-polygon code may not clearly define whether a point is inside or outside a polygon when it is exactly on a vertex or segment. But it will always give the same answer, regardless of direction -- that is, a point will always be in one, and only one, of two adjacent polygons.

So -- my thoughts on this question:

I don't think CF should define which side of bounds are "open".

But if it does -- we need to make sure to consider the more complex cells: 2D, 3D, curvilinear, polygonal ....

And once you get past simple 1-D or 2-D rectangular, it gets very complicated!

NOTE: what's the precedent in xarray (or numpy, or ...) for other similar problems, like for instance generating a histogram? which bin do you put the values that lie exactly on a bin boundary?

For xarray -- I think it should concern itself with making sure algorithms manage cell boundaries consistently, and not worry about what the data producer thinks :-).

The option to "add an option to allow the user to choose which side is closed" sounds fine to me -- maybe it IS appropriate for the user to decide.

3 replies

dcherian May 8, 2025
Author

what's the precedent in xarray (or numpy, or ...) for other similar problems, like for instance generating a histogram? which bin do you put the values that lie exactly on a bin boundary?

Bins are always half-open, you can pick which side: https://numpy.org/doc/stable/reference/generated/numpy.digitize.html . Pandas will error if you pass fully-closed intervals to pd.cut for example.
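As a quick illustration of the `right` flag in `numpy.digitize` (the edges and sample values are made up), values that sit exactly on an edge move to the neighbouring bin when the flag changes:

```python
import numpy as np

edges = np.array([0.0, 1.0, 2.0, 3.0])
x = np.array([0.5, 1.0, 2.0, 2.5])

# right=False (default): edges[i-1] <= x < edges[i], i.e. closed on the left.
left_closed = np.digitize(x, edges, right=False) - 1
# right=True: edges[i-1] < x <= edges[i], i.e. closed on the right.
right_closed = np.digitize(x, edges, right=True) - 1

print(left_closed)   # [0 1 2 2]
print(right_closed)  # [0 0 1 2]
```

Subtracting 1 converts the index returned by `digitize` into a zero-based cell index.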

sethmcg May 8, 2025
Collaborator

I don't think CF should define which side of bounds are "open".

Agreed, and I think it would be a problem for backwards-compatibility. I think it's very likely that there are datasets out there that have made different choices about putting the time coordinate at different ends of the interval.

dcherian May 9, 2025
Author

The inverse problem is that if I generate a histogram with half-open bins (e.g. with np.histogram or pandas.cut), CF does not define a standard way to record that information.

ChrisBarker-NOAA · 2025-05-08T18:44:56Z

ChrisBarker-NOAA
May 8, 2025
Collaborator

Hmm -- just looked back and noticed that the example, at least is for time -- which is 1D and relatively simple.

So if we wanted to special-case time and define which side is open, I can't see the harm in that. But I'm still not sure it's necessary.

0 replies

ChrisBarker-NOAA · 2025-05-08T19:06:17Z

ChrisBarker-NOAA
May 8, 2025
Collaborator

which bin do you put the values that lie exactly on a bin boundary?

Bins are always half-open, you can pick which side: https://numpy.org/doc/stable/reference/generated/numpy.digitize.html .

That seems like a reasonable approach then.

Again, the concept is far less straightforward for more complex cells, but for the simple cases this works.

It would be good to come up with a more universal term than "right" -- for time, and vertical, and N-S ....

But, as they say, "naming things is hard".

In short, I think a number of us have said that it's OK that this be a decision left to the analyst / user of the data, rather than the producer of the data.

-CHB

1 reply

benbovy May 8, 2025

In short, I think a number of us have said that it's OK that this be a decision left to the analyst / user of the data, rather than the producer of the data.

I could see some potential cases where the decision would be better left to the data provider (https://github.com/orgs/cf-convention/discussions/416#discussioncomment-13080381) although I'm missing good practical examples illustrating it.

JonathanGregory · 2025-05-09T12:51:07Z

JonathanGregory
May 9, 2025
Maintainer

Would it be helpful to insert a statement somewhere in the CF document that the convention does not define which side is closed, in order to be clear that it's not clear, so to speak?

1 reply

dcherian May 9, 2025
Author

Yes absolutely, that would help in the near-term.

IMO in the long term CF should consider defining a way for users to record this information. For example, if I generate a histogram with half-open bins (e.g. with np.histogram or pandas.cut), how do I communicate that to a downstream user?
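To show the kind of information that would get lost, here is a minimal sketch with `np.histogram`, whose bins are half-open on the right except for the last one, which is closed on both sides. The sample data are made up, and the free-text `comment` entry is just one non-standardised way to note the choice; CF does not define its content for this purpose.

```python
import numpy as np

samples = np.array([1.0, 2.0, 2.0, 3.0])
edges = np.array([1.0, 2.0, 3.0])

counts, _ = np.histogram(samples, bins=edges)
# np.histogram bins are [lo, hi) except the last bin, which is [lo, hi].
bounds = np.column_stack([edges[:-1], edges[1:]])   # shape (nbins, 2), like a CF bounds variable

# Free-text note only; there is no standard attribute saying which side is closed.
attrs = {"comment": "bins closed on the left; the last bin is closed on both sides"}

print(counts.tolist())  # [1, 3]
print(bounds.tolist())  # [[1.0, 2.0], [2.0, 3.0]]
```

The two samples equal to 2.0 fall in the second bin, and 3.0 is kept only because the last bin is closed; none of that can be recovered from `bounds` alone.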

ChrisBarker-NOAA · 2025-05-09T17:53:18Z

ChrisBarker-NOAA
May 9, 2025
Collaborator

For example, if I generate a histogram with half-open bins (e.g. with np.histogram or pandas.cut), how do I communicate that to a downstream user?

I was about to ask a similar question -- I can't say I've ever seen a histogram where someone specified whether the bins were defined as "right" or "left" open. Is this an important piece of information to communicate?

See my earlier note -- in theory, it's irrelevant -- or no more relevant than any number of other details about how a derived product may have been computed. (e.g. precision of the input data -- integer, single or double?)

A histogram is a discrete version of a (theoretically) continuous distribution -- the divisions between bins are theoretically infinitely narrow -- if the results are sensitive to which side of the bins was open, you've got other more significant problems with your analysis ;-)

-CHB

0 replies

taylor13 · 2025-05-09T18:02:01Z

taylor13
May 9, 2025
Collaborator

Regarding histogram bounds for the "category" axis (say, named "counts"), you could insert text like the following in the cell_methods: cell_methods = "counts: sum (with counts included that coincide with lower bound and counts excluded that coincide with the upper bound)"
[I'm sure someone could come up with more succinct and clear phrasing.]

3 replies

ChrisBarker-NOAA May 9, 2025
Collaborator

well, IIUC:

"""
The values of method should be selected from the list in Appendix E, Cell Methods, which includes point, sum, mean, among others. Case is not significant in the method name. Some methods (e.g., variance) imply a change of units of the variable, as is indicated in Appendix E, Cell Methods.
"""

So you are supposed to use an existing method -- is there one for "binning"? Maybe there should be.

I don't think "sum" is really the correct term, is it?

And I don't see that adding a modifier to a cell method is allowed -- maybe it should be -- ideally in a standardized way -- so you'd have something like:

cell_method = "binning: right"

where either "right" or "left" are allowed.

But at the end of the day -- we can't standardize everything -- so does this need to be standardized?

Kind of like standard names: they specify what the "thing" is -- but there's always additional info that a user might want.

So as long as a CF file can be interpreted correctly (e.g. this is a histogram, or an average value in a cell, or ...) -- then exactly how it was computed can be included in non-standardized meta data.

JonathanGregory May 15, 2025
Maintainer

I think sum is the correct method, because a count is a kind of sum, and because sum is the default for quantities which are extensive within their cells, implying that by sum we mean "extensive". Perhaps we should provide extensive as an alias for sum. However, for a probability density function, sum wouldn't be appropriate, because it's intensive with respect to the cell, having been divided by the width. The default for intensive quantities is point, which isn't correct either for PDFs. The nearest might be mean because probability divided by the width of the cell is a kind of weighted mean.

ChrisBarker-NOAA May 15, 2025
Collaborator

sum is the correct method, because a count is a kind of sum,

well, sure -- but the question is: does setting cell_methods: "sum" make more or less clear what the data mean?

I would say less clear, not more. When making a histogram, you are not summing up the values within the cell, you are counting them.

presumably the fact that it's a histogram needs to be specified somewhere, somehow, I think using cell_methods does not help, and could confuse.

The nearest might be mean because probability divided by the width of the cell is a kind of weighted mean.

Yes -- if you think of a histogram as a discrete version of a distribution, then it would be mean, yes? However, that doesn't really help the problem at hand.
