New types of datasets supported for Delmic HDF5 format #328
Conversation
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main     #328      +/-   ##
==========================================
+ Coverage   88.02%   88.08%   +0.06%
==========================================
  Files          91       91
  Lines       11538    11798     +260
  Branches     2131     2186      +55
==========================================
+ Hits        10156    10392     +236
- Misses        875      890      +15
- Partials      507      516       +9
```
Thanks @noemiebonnet for the substantial progress! I left a few comments from a first pass over the code, but did not yet have time for a closer look. Please include the standard tracking list in the initial comment to give an overview of how far along the PR is.

There are still a number of lines uncovered by tests, as reported by codecov. If lines should explicitly be ignored in the coverage test, one can add the comment `# pragma: no cover` at the end of the line. I see that most uncovered lines concern warnings or errors that might surface during file reading. It might be hard to test for those if one only has proper test files. What is our current best practice in that case @ericpre? (Note that you can also get inspiration from other file readers implemented recently, such as Horiba JobinYvon or Hamamatsu.)
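For illustration, a minimal sketch of such an exclusion (the function name and message are made up, not the code from this PR):

```python
def _check_acquisition(acq):
    # Defensive branch that valid test files cannot reach; exclude it
    # from the coverage statistics with a pragma.
    if acq is None:
        raise TypeError("not a Delmic acquisition group")  # pragma: no cover
    return acq
```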
A single spectrum is still loaded as …
rsciio/delmic/_api.py (Outdated):

```python
        the associated image type.
        """
        if Acq is None:
            raise TypeError(
```
Currently, all the various loading errors give codecov warnings, because they are not covered by the tests. With HDF5, it should be rather straightforward to create a minimal test file for each case, where certain elements are deleted from the file, to trigger each of these errors. Another idea could be to use HDF5 files from other rsciio plugins that do not match the Delmic specification to test some of the errors.
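As a hedged sketch of that idea (file names and the group name are assumptions, not the actual Delmic layout), one could derive a broken file from a valid one with h5py:

```python
import shutil

import h5py

# Copy a known-good test file, then delete one element so that the
# corresponding error branch in the reader is exercised.
shutil.copy("delmic_valid.h5", "delmic_missing_acq.h5")
with h5py.File("delmic_missing_acq.h5", "r+") as f:
    del f["Acquisition0"]  # assumed group name
```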
rsciio/delmic/_api.py (Outdated):

```python
            or Image.shape[3] < 1
            or Image.shape[4] < 1
        ):
            raise ValueError(
```
As an example for testing errors, here you could take the file from a different type of data and change the img_type to create a case that triggers this error.
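A possible sketch of that test, with all HDF5 paths and values purely hypothetical:

```python
import h5py

# Take a copy of a valid file of a *different* acquisition type and
# replace its image-type tag, so the dimension check above is reached
# and raises ValueError.
with h5py.File("delmic_copy.h5", "r+") as f:
    grp = f["Acquisition0/PhysicalData"]  # assumed path
    del grp["ImageType"]                  # assumed dataset name
    grp.create_dataset("ImageType", data="angular")  # mismatching type
```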
rsciio/delmic/_api.py (Outdated):

```python
        Scale = np.array(ImgData.get(scale_key))

        if axis_name in ["C", "T"]:
            scale_value = np.mean(
```
Some other readers have a parameter to choose whether to read such axes as a UniformDataAxis, by calculating the mean scale, or as a non-uniform DataAxis, by taking exactly the vector that is in the file.

See e.g. rosettasciio/rsciio/jobinyvon/_api.py, line 684 in 1274ed5:

```python
    use_uniform_signal_axis : bool, default=False
```
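A hedged sketch of how such an option could look (the helper name and dictionary layout follow the usual RosettaSciIO axis dictionaries, but are not taken from this PR):

```python
import numpy as np

def _build_axis(vector, name, units, use_uniform_signal_axis=False):
    # Hypothetical helper: turn an axis vector from the file into an
    # axis dictionary for the returned signal.
    if use_uniform_signal_axis:
        # UniformDataAxis: approximate the vector by offset + mean step.
        return {
            "name": name,
            "units": units,
            "size": len(vector),
            "offset": float(vector[0]),
            "scale": float(np.mean(np.diff(vector))),
        }
    # Non-uniform DataAxis: keep exactly the values stored in the file.
    return {"name": name, "units": units, "axis": np.asarray(vector)}
```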
@noemiebonnet, is it possible to update the PR with the progress you have made on your local branch? It might be enough to include the current PR in the release planned for the coming days: #381
Force-pushed from 3ab696c to 853f36c.
I'm running into an error message after converting a .h5 into a .hspy and then trying to load the .hspy: …

To reproduce: …

It looks like it might be related to the original metadata - if the initial loading of the .h5 is done without the original metadata, the error is avoided, e.g.: …
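The traceback and repro snippet were not captured above; as a hedged reconstruction of the described steps (file names are placeholders, and this assumes the reader returns a single signal):

```python
import hyperspy.api as hs

# Convert .h5 to .hspy, then reload -- the reload is where the error shows up.
s = hs.load("acquisition.h5")
s.save("acquisition.hspy")
s2 = hs.load("acquisition.hspy")  # fails when original_metadata was kept

# Described workaround: skip the original metadata on the first load.
s = hs.load("acquisition.h5", load_original_metadata=False)
s.save("acquisition.hspy", overwrite=True)
s2 = hs.load("acquisition.hspy")  # loads fine
```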
Indeed, the parsing of … results in: all values are read in as arrays, regardless of their content - none of them actually seem to contain more than one array element. In particular, nested dictionaries such as for the …
Force-pushed from 853f36c to cd5b958.
I will pick up this PR from now on. I went through all the code and the comments of this pull request and attempted to address all the issues raised. Here is a summary of the (major) updates: …

I now see that some CI tests are failing (though all the test cases passed on my computer). I'll dig into these issues tomorrow. Any feedback is already much appreciated!
upcoming_changes/328.new.rst (Outdated):

```
@@ -0,0 +1 @@
:ref:`Delmic <delmic-format>` format: add support for Delmic HDF5 (cathodoluminescence) acquisition files
```
We added basic support in the 0.7 release:
https://rosettasciio.readthedocs.io/en/latest/changes.html#id42
So the changelog should make clear that more complete support for different types of acquisitions is now added.
```rst
of three datasets. It is possible to load each of them separately.

.. Note::
    To load the various types of datasets in the file, use the ``signal`` argument
```
We were discussing what is actually the best default here - reading only CL to return a single signal item, as in most other readers, or the list with all items - but I guess the main point is that it should be well documented.

Also, I am not certain that having the survey first is the best order of the signals in the list. I assume the order is motivated by the odemis files? Generally, I would find it more intuitive to have the actual CL signal first, then concurrent, and survey last (making the survey always the last item, even if there are multiple streams in one file).

Finally, I would say there is not really a point in reading in the concurrent SE when it is dimensionless (for spectra, single transients or streak images).
In particular, having the anchor region data from drift correction included in the default loading seems a bit of an overload - so maybe default to "cl" and have an option "all" to include the other datasets?
Sure. I've now changed it to return only the "cl" datasets by default. Also changed the value to pass to get all the datasets from None to "all".

Concerning the datasets with a single point: although it's probably not useful, I'd rather always return them when that type of dataset (or "all") is requested. This keeps the code simple and makes sure that if for some reason the user actually does care about this single pixel, they can still read it.
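For clarity, the resulting behaviour (a sketch; the file name is a placeholder, and `file_reader` is the standard RosettaSciIO plugin entry point):

```python
from rsciio.delmic import file_reader

# Default: only the cathodoluminescence dataset(s) are returned.
cl_only = file_reader("acquisition.h5")  # same as signal="cl"

# Pass signal="all" to also get the SE, anchor and survey datasets.
everything = file_reader("acquisition.h5", signal="all")
```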
Fair enough for the datasets with a single point. Still, I would tend to reverse the order of the different signals in the list for "all", to have 'cl' first (as it is the main one and now also the one read by default) and then add the additional signals, with 'survey' last.
Thanks Eric for picking up on this. In general it looks good, and some first tests with real data did not surface any problems. We are actually interested in getting this released ASAP, so I would try to resolve anything that needs to be decided in terms of API (see some comments on the …).

In the mid-term, additional features such as metadata parsing and reading in scan areas/points on the survey as markers could be added via separate PRs. Coverage is not optimal, but the missing lines mainly seem to concern errors, which could maybe also be addressed later.
Thanks Jonas for the review. I've tried to correct the code for all your comments. I agree with the way forward: let's try to get this big change merged, and then we can more easily bring in extra features later in "small bites". Let me know if you want me to squash some of the commits together before merging the PR.
I've tried hard to get the packaging test to pass, but I'm now at a loss as to how to handle the current errors. I assume it's related to the way the test data is stored... but I have no idea what should be changed to fix it, or even whether it's a real error. Do you have any hint on how to solve such an error?
I don't remember the exact reason, but this is related to the recently merged PR #417, which added these test files; they are not included in this branch. A rebase should fix it.
Maybe squash the last few commits with only minor corrections into one.
Commits:
- Major overhaul of the code structure, to be more generic and handle all (corner) cases.
- Adjust the behaviour of the "signal" argument: it now defaults to passing all the data, and all the possible options are lower-case only.
- The metadata now also includes the acquisition time and dataset title.
- The original_metadata now also includes the SVIData and acquisition time.
- Ensure that the original_metadata dict only contains basic Python types (i.e., no numpy arrays or JSON-encoded strings).
- Fix offset values in X & Y, which were incorrectly computed.
- Fix incorrect direction of the X axes in 4D angular-resolved datasets.
- If LumiSpy is installed, use the corresponding Signal type for datasets matching the CL_SEM, LumiTransient, and LumiTransientSpectrum types.
- Support angular-resolved data with multiple polarizations.
- Remove some test cases which were only testing the same as another one.
- Add a test case attempting to load an incorrect HDF5 file.
- ruff check
- registry: handle lumispy not present
- By default, only return cathodoluminescence datasets. Also adjust the value to pass to get all the datasets from None to "all".
Force-pushed from c1449f7 to 9c63bc9.
- Don't return the datasets based on the order stored in the HDF5 file, but by type. For the user, the CL data is the most interesting, so it is returned first. Also add a test case for the anchor region (aka drift correction).
- Some of the exception handlers should only be reached with malformed data, which we don't have, so let's not count these paths for now.
Force-pushed from 9c63bc9 to 9bac123.
I've updated the pull request with these changes: …

Talking of coverage, I don't have the tools to measure it locally, so it's a bit hard to check how well it will go. Also, it seems the coverage check is considered a "fail" if it's below the project average, which is currently at 87%. Please be aware that, mathematically, if every merge is above the average, this average will increase, which will make the bar to entry higher and higher for subsequent pull requests, and the work of submitters harder and harder...
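For measuring coverage locally, a hedged sketch using the pytest-cov plugin (the test path is an assumption about the repository layout):

```python
import pytest

# Requires pytest-cov; prints the missed lines per file.
pytest.main([
    "--cov=rsciio.delmic",
    "--cov-report=term-missing",
    "rsciio/tests/",
])
```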
Thanks Éric :-)

My thought was to have the survey last so that it can always be accessed as the [-1] element of the list (the anchor is not always there).
You can see the current report and a markup of missing lines at Codecov, but you have reached the threshold now. The aim should be close to 100% coverage for the relevant parts of new contributions (and most of the rest can usually be ignored by pragmas). The overall value is well below 100% mainly because some of the oldest readers started out without any tests and were never brought close to it, but indeed we have improved over the years. In the end, the check is mainly a guideline for where we are heading, and it is at the discretion of the maintainers to merge PRs even with below-average coverage. From experience, it is good to push for good coverage directly instead of postponing that to separate PRs.
There will be an expected failure on the build labelled "hyperspy_dev", which uses the development branch of hyperspy. This can be ignored in this PR. For the coverage, if it doesn't pass the currently set criteria, it is not a big deal; usually what we do is check online (using the link in the "github checks") where code is missed and correct where needed.
Seems to have solved all but one …
Yes, these are the ones that are expected and are sorted out in #425 and hyperspy/hyperspy#3528.
Ok. Changed the order to CL, SE, anchor, survey.
Thanks @pieleric - that is a great step forward. TODO: …
New supported formats: intensity, hyperspectral, angle-resolved, E-k, time-resolved.
Progress of the PR:
- original_metadata
- upcoming_changes folder (see upcoming_changes/README.rst)
- docs/readthedocs.org:rosettasciio build of this PR (link in github checks)