clarifying the meaning of the bounds variable #416
Replies: 12 comments 13 replies
-
Which end is open depends on where the time coordinate sits relative to the bounds. If the time coordinate is at the beginning of the interval, then it's half-open to the right. If the time coordinate is at the end of the interval, it's half-open to the left. If it's somewhere in the middle, then it doesn't really matter. So if the bounds of interval If If

(Your question has made me realize that CF doesn't specify that the relative positioning of the coordinate with respect to the bounds should be constant, and that it's not actually forbidden to change where you put it from interval to interval. However, that would be incredibly bad practice, and I would be genuinely shocked if anyone has ever done it. It would be perfectly justifiable for you to write code that only checks a single timestep to determine which end of the interval is open, and if somebody has problems because their data violates that assumption, that's on them.)
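As a rough illustration of the rule described above, here is a minimal Python sketch. The function name `infer_closed_side` is hypothetical (it is not part of CF or of any library), and the sketch assumes each pair of bounds is stored in increasing order.

```python
def infer_closed_side(coord, lower, upper):
    """Guess which end of a cell is closed from where the coordinate sits.

    Assumes lower <= upper. Returns "left", "right" or "either".
    """
    if not (lower <= coord <= upper):
        raise ValueError("coordinate lies outside its bounds")
    if coord == lower and coord != upper:
        return "left"    # coordinate at the start: cell is [lower, upper)
    if coord == upper and coord != lower:
        return "right"   # coordinate at the end: cell is (lower, upper]
    return "either"      # coordinate strictly inside, or a zero-width cell


print(infer_closed_side(10.0, 10.0, 11.0))  # -> left
print(infer_closed_side(11.0, 10.0, 11.0))  # -> right
print(infer_closed_side(10.5, 10.0, 11.0))  # -> either
```

Checking only the first timestep, as suggested above, would amount to calling this once with `time[0]` and its two bounds.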
-
I would agree with Seth that in practical terms it doesn't matter much, except that in a formal sense the coordinate value must always lie within the interval defined by the bounds. A common use of bounds is to determine the size of the interval represented by the coordinate value, and I think the "width" of a cell is independent of whether the bounds of the cell are open or closed.
-
But for the data user, it does matter on which side the bounds of the cell are open or closed. I produce a gridded data set on a regular grid in projected x/y coordinates. Similar to Figure 7.1, my x and y grid cell centre coordinates are at nn.5 km, while the bounds are at nn.0 km and (nn+1).0 km. Each cell has its own daily precipitation value. If my user wants to find the correct daily precipitation value for arbitrary coordinates, it is easy to select the correct cell as long as the coordinates do not coincide with any cell bound. But if one or both coordinates match a cell bound, i.e. they are full kilometre values, which of the 2 or 4 cells should they select to extract the correct daily precipitation value?
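To make the boundary case concrete, here is a minimal sketch of a one-dimensional cell lookup under the two half-open conventions. The function name `find_cell` and the three-cell axis are made up for illustration; nothing here is prescribed by CF.

```python
from bisect import bisect_left, bisect_right


def find_cell(edges, x, closed="left"):
    """Index of the cell containing x, or None if x falls outside the grid.

    edges: ascending cell boundaries; cell i spans edges[i]..edges[i + 1].
    closed="left" treats cells as [lo, hi); closed="right" as (lo, hi].
    """
    if closed == "left":
        i = bisect_right(edges, x) - 1
    elif closed == "right":
        i = bisect_left(edges, x) - 1
    else:
        raise ValueError("closed must be 'left' or 'right'")
    return i if 0 <= i < len(edges) - 1 else None


x_edges = [0.0, 1.0, 2.0, 3.0]            # km, hypothetical 3-cell axis
print(find_cell(x_edges, 1.5))            # interior point  -> 1
print(find_cell(x_edges, 1.0, "left"))    # on a boundary   -> 1
print(find_cell(x_edges, 1.0, "right"))   # on a boundary   -> 0
```

For a 2D grid, the same lookup applied separately to x and y picks one of the 4 candidate cells, and which one depends only on the `closed` choice.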
-
I agree that for data selection / extraction it does matter to know whether cell boundaries are open or closed, but it is currently beyond the realm of CF to provide us with that information. Therefore the user can do as they please! I don't quite follow the example described by @Armin-RS above ... what does the user want here - the value from a unique cell, or a number of cells from which an interpolation can be performed? Or something else? In the former case I would say that it doesn't matter which unique cell is returned, as long as the software gives the same result when the operation is repeated (i.e. the software makes a predictable assumption, and ideally the user can configure it).
-
I agree with @davidhassell. I think it is the user's responsibility to decide what data to extract at the bounds of a grid cell. I would usually choose to extract the mean of the values carried by the two cells separated by the boundary (perhaps with some weighting based on grid cell size). But others might make a different choice.
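The averaging choice described here could be sketched roughly as follows. The name `value_at` is hypothetical, the mean is unweighted (weighting by cell size is left out), and the sketch only covers one dimension with ascending edges.

```python
from bisect import bisect_left, bisect_right


def value_at(edges, values, x):
    """Cell value at x; on an interior boundary, the mean of both neighbours."""
    if not (edges[0] <= x <= edges[-1]):
        raise ValueError("x is outside the grid")
    # Cell that would be chosen if cells were right-closed, (lo, hi]
    lo = max(bisect_left(edges, x) - 1, 0)
    # Cell that would be chosen if cells were left-closed, [lo, hi)
    hi = min(bisect_right(edges, x) - 1, len(values) - 1)
    if lo == hi:
        return values[lo]                   # x is not on an interior boundary
    return 0.5 * (values[lo] + values[hi])  # x sits between two cells


edges = [0.0, 1.0, 2.0]
precip = [2.0, 4.0]                  # made-up values, one per cell
print(value_at(edges, precip, 0.5))  # -> 2.0
print(value_at(edges, precip, 1.0))  # -> 3.0
```

Because both conventions agree everywhere except on a shared boundary, the two lookups differ only in the ambiguous case, which is where the average is taken.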
-
I don't see this written anywhere. Is this the common interpretation? If so, would it be in scope for CF to document it?
Clarifying the conventions document with regard to this would be great for downstream users and analysis libraries. We will add an option to allow the user to choose which side is closed.
-
Hmm -- it sure would make sense to make this clear in CF, but ... This is a real challenge. I'm not sure this issue is well clarified, well, anywhere at all. So I'm a bit wary of requiring it in CF.

It's a very common use case to have a grid, or mesh, or what have you, defined by points, and from that cells bounded by lines (or planes, if 3D) -- this applies to 1-D rectangular, curvilinear, unstructured grids, what have you. And indeed, it is essentially the same issue as, say, polygonal regions in GIS systems. So you always have the issue of what cell, or polygon, or whatever, a point is in if it lies EXACTLY on a boundary: if I'm exactly on the US-Canadian border -- which country am I in? Should a CF file define that?

In theory, the bounds themselves are of zero size: infinitesimally small points, lines, planes, so it's actually impossible for a finite "thing" to be on the boundary. In practice, these concepts are represented by finite numbers, so a point can be exactly on a boundary. But also in practice, these numbers are often floating point numbers with limited precision, so "exactly" on the boundary is pretty darn hard to define -- it's actually "within an epsilon of the boundary".

I think the common way to deal with this is to take the concepts at their word -- the bounds are infinitesimally small, and so it really doesn't matter which side of a boundary you use, as long as it's consistent. Practically speaking, particularly with floating point computations, the answers are the same within the precision of the numbers, regardless. A key note on dealing with this is that code needs to ensure that you always get the same answer at the bounds.

So -- my thoughts on this question: I don't think CF should define which side of bounds is "open". But if it does -- we need to make sure to consider the more complex cells: 2D, 3D, curvilinear, polygonal .... And once you get past simple 1-D or 2-D rectangular, it gets very complicated!

NOTE: what's the precedent in xarray (or numpy, or ...) for other similar problems, like for instance generating a histogram? Which bin do you put the values that lie exactly on a bin boundary?

For xarray -- I think it should concern itself with making sure algorithms manage cell boundaries consistently, and not worry about what the data producer thinks :-). The option of "add an option to allow the user to choose which side is closed" sounds fine to me -- maybe it IS appropriate for the user to decide.
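On the histogram precedent: the `numpy.histogram` documentation states that every bin except the last is half-open, `[a, b)`, while the last bin is closed on both ends so that the maximum edge is still counted. The pure-Python sketch below only imitates that convention for illustration; `bin_counts` is a hypothetical name, not a NumPy function.

```python
from bisect import bisect_right


def bin_counts(data, edges):
    """Count values per bin: [lo, hi) for all bins, but [lo, hi] for the last."""
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for v in data:
        if v < edges[0] or v > edges[-1]:
            continue              # outside the overall range: not counted
        i = bisect_right(edges, v) - 1
        if i == n_bins:           # v equals the last edge
            i = n_bins - 1        # so it goes into the last (closed) bin
        counts[i] += 1
    return counts


print(bin_counts([1, 2, 2, 3, 4], [1, 2, 3, 4]))  # -> [1, 2, 2]
```

The point is only that a deterministic rule is applied: a value on an interior edge always lands in the bin to its right.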
-
Hmm -- just looked back and noticed that the example, at least, is for time -- which is 1D and relatively simple. So if we wanted to special-case time and define which side is open, I can't see the harm in that. But I'm still not sure it's necessary.
-
That seems like a reasonable approach then. Again, the concept is far less straightforward for more complex cells, but for the simple cases this works. It would be good to come up with a more universal term than "right" -- for time, and vertical, and N-S .... But, as they say, "naming things is hard". In short, I think a number of us have said that it's OK for this to be a decision left to the analyst / user of the data, rather than the producer of the data. -CHB
-
Would it be helpful to insert a statement somewhere in the CF document that the convention does not define which side is closed, in order to be clear that it's not clear, so to speak?
-
I was about to ask a similar question -- I can't say I've ever seen a histogram where someone specified whether the bins were defined as "right" or "left" open. Is this an important piece of information to communicate? See my earlier note -- in theory, it's irrelevant -- or no more relevant than any number of other details about how a derived product may have been computed (e.g. precision of the input data -- integer, single or double?). A histogram is a discrete version of a (theoretically) continuous distribution -- the divisions between bins are theoretically infinitely narrow -- if the results are sensitive to which side of the bins was open, you've got other more significant problems with your analysis ;-) -CHB
-
regarding histogram bounds for the "category" axis" (say, named "counts"), you could in the |
-
Question
Hello, over at Xarray we are trying to add support for indexing using interval bounds recorded as CF recommends it.
To do so we need to set whether the bounds values are half-open on one side or closed. From example 7.1, which says
"The instant time(i) should be contained within the interval, or be at one end of it."
it seems like the answer is "closed on both edges". However, when combined with a later sentence
"if the interval i=3 begins at 11.0 days, when interval i=2 ends, the values in timebnd(3,0) and timebnd(2,1) must be exactly the same."
then it seems like the value at t=11.0 days is undefined since it technically belongs to two intervals. Am I understanding this correctly?
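A small sketch of the ambiguity in question, in plain Python. The bounds values below are made up, `containing_intervals` is a hypothetical helper, and indices are 0-based, so CF's intervals i=2 and i=3 appear as 1 and 2.

```python
def containing_intervals(bounds, t, closed):
    """Indices of the intervals in `bounds` that contain t."""
    tests = {
        "both":  lambda lo, hi: lo <= t <= hi,
        "left":  lambda lo, hi: lo <= t < hi,
        "right": lambda lo, hi: lo < t <= hi,
    }
    inside = tests[closed]
    return [i for i, (lo, hi) in enumerate(bounds) if inside(lo, hi)]


timebnd = [(9.0, 10.0), (10.0, 11.0), (11.0, 12.0)]  # made-up bounds in days
print(containing_intervals(timebnd, 11.0, "both"))   # -> [1, 2]
print(containing_intervals(timebnd, 11.0, "left"))   # -> [2]
print(containing_intervals(timebnd, 11.0, "right"))  # -> [1]
```

With both ends closed, t=11.0 matches two intervals; either half-open choice gives a unique match, but the two choices disagree on which interval it is.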