Feature proposal: mzIdentML (mzid) #7

julianu · 2023-10-25T13:57:19Z

julianu
Oct 25, 2023

I would like to propose implementing an mzid reader and possibly also a writer.

One other thing that is urgently important for this is to define a PSM struct, which can then be used by all implementations handling identifications. Here, I would propose to start off with what mzid can offer. We discussed quite a lot while developing the format, and I assume that almost everything you can say about a PSM is modeled in it. How we implement those is another thing, but having all the attributes and params defined in mzid would be helpful.

Should we implement this feature?

yes

75%

no

25%

4 votes

lazear · 2023-10-25T17:39:24Z

lazear
Oct 25, 2023
Collaborator

Writing mzIdentML is unfortunately somewhat complicated. If you decide to go that route, you can look at my implementation here: https://github.com/lazear/sage/tree/mzid/crates/mzidentml/src

4 replies

david-bouyssie Oct 25, 2023
Maintainer

I also have a non-public crate for this purpose that I could share for this work.

@lazear what do you find particularly difficult?

david-bouyssie Oct 25, 2023
Maintainer

I have some concerns about reaching a common consensus on PSM representation. PSMs are tricky because they can have a lot of properties. So we may end up with a common denominator (maybe using some definitions out there, including the mzid one) and a more extended one that could hold additional and user-defined properties (based on CV ontologies or not).
IMO, this is the kind of point that clearly requires proper specs.
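
To make the "common denominator plus extended properties" idea concrete, here is a minimal Rust sketch. It is not an existing Rusteomics API: the names `CorePsm`, `ExtendedPsm`, and `Param`, and the choice of core fields, are assumptions made purely for illustration.

```rust
/// A property that is either backed by a controlled vocabulary (CV) term
/// or freely defined by the user. Hypothetical type, for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub enum Param {
    /// CV-backed property; `accession` would hold the ontology accession.
    Cv { accession: String, name: String, value: Option<String> },
    /// Free-form property with no ontology behind it.
    User { name: String, value: Option<String> },
}

/// Hypothetical "common denominator": fields most tools could plausibly fill.
#[derive(Debug, Clone, PartialEq)]
pub struct CorePsm {
    pub spectrum_id: String,
    pub peptide_sequence: String,
    pub charge: i32,
    pub experimental_mz: f64,
    pub rank: u32,
}

/// Extended representation: the core plus an open-ended list of properties.
#[derive(Debug, Clone, PartialEq)]
pub struct ExtendedPsm {
    pub core: CorePsm,
    pub params: Vec<Param>,
}

impl ExtendedPsm {
    /// Returns the value of the first property with the given name,
    /// regardless of whether it is CV-backed or user-defined.
    pub fn param_value(&self, name: &str) -> Option<&str> {
        self.params.iter().find_map(|p| match p {
            Param::Cv { name: n, value, .. } | Param::User { name: n, value }
                if n.as_str() == name =>
            {
                value.as_deref()
            }
            _ => None,
        })
    }
}
```

A tool that only needs the shared fields would work with `CorePsm`, while engine-specific scores would travel in `params`. Which fields belong in the core is exactly the part that would need proper specs.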

lazear Oct 25, 2023
Collaborator

Mostly just the lack of any kind of consistency when it comes to naming different things in the spec ("dBSequence_ref" and "DBSequence"), and having to keep track of various IDs everywhere.
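
As an illustration of the ID bookkeeping, here is a minimal sketch of resolving a `dBSequence_ref` against the `DBSequence` elements it points to. The structs are heavily simplified stand-ins (only the attributes needed for the lookup), and `DbSequenceIndex` is a hypothetical helper, not code from Sage or any existing crate.

```rust
use std::collections::HashMap;

/// Simplified stand-in for an mzIdentML `DBSequence` element.
#[derive(Debug, Clone, PartialEq)]
pub struct DbSequence {
    pub id: String,
    pub accession: String,
}

/// Simplified stand-in for `PeptideEvidence`, which points at a
/// `DBSequence` through its `dBSequence_ref` attribute.
#[derive(Debug, Clone, PartialEq)]
pub struct PeptideEvidence {
    pub id: String,
    pub db_sequence_ref: String,
}

/// Hypothetical lookup table from `DBSequence` id to the element itself.
pub struct DbSequenceIndex {
    by_id: HashMap<String, DbSequence>,
}

impl DbSequenceIndex {
    /// Builds the index once, after all `DBSequence` elements are parsed.
    pub fn new(sequences: Vec<DbSequence>) -> Self {
        let by_id = sequences.into_iter().map(|s| (s.id.clone(), s)).collect();
        Self { by_id }
    }

    /// Follows a `dBSequence_ref`; a dangling reference is reported as an error.
    pub fn resolve(&self, evidence: &PeptideEvidence) -> Result<&DbSequence, String> {
        self.by_id.get(&evidence.db_sequence_ref).ok_or_else(|| {
            format!(
                "PeptideEvidence {} refers to unknown DBSequence {}",
                evidence.id, evidence.db_sequence_ref
            )
        })
    }
}
```

A writer has the mirror-image problem: it must hand out unique IDs and remember them so that every `*_ref` it emits points at an element that is actually written.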

What's the goal for supporting reading (or writing) mzID? I also have concerns with common consensus (or lack thereof) regarding PSM repr. Search engine A and B might have vastly different property sets for their PSMs. I certainly wouldn't use a generic PSM struct in Sage.

david-bouyssie Oct 25, 2023
Maintainer

I think the goal is interop with some existing downstream processing tools (some search engines only give .mzid outputs, and some tools require .mzid data as input). It could also be used for the development of benchmarking tools (e.g. ProteoBench). It could be used to create a Rust-based version of psm_utils (https://github.com/compomics/psm_utils). Maybe @RalfG can comment on this.

I have to confess I'm not a big fan of XML-based formats (and would prefer alternatives like SQLite, HDF5, Parquet, or MDBX), but these are the standards we have today.

julianu · 2023-10-26T07:33:10Z

julianu
Oct 26, 2023
Author

I totally agree with the problems of XML in mzIdentML, and more modern (binary) implementations are probably more useful.
But until now, mzid (and mzML) at least offer a way to encode basically everything possible, at least as of the time the standard was written. Newer ideas that are not fully supported right now, like crosslinking, will also be added to mzid soonish (probably finalized at the next spring meeting, as far as I know).

So, my reasoning for taking the mzid model, e.g. for the PSMs, is that with it we would cover all possibilities. I know that each search engine or other downstream processing tool has its own ideas about naming and about what is actually mandatory. But still, you can put it into the mzid representation. And that is definitely the thing I like very much about the PSI formats, speaking of mzid and mzML, not mzTab, which lets you model, well, final results at best, which cannot be used as input.

Also, having these implementations (or structs) will probably make it easy in the future to support any binary, more future-proof formats. But right now, mzid would be the best full-featured interchange format for all search engine results, or actually any ID results in proteomics.

1 reply

julianu Oct 26, 2023
Author

(Maybe this should have gone into an answer above. Apologies, discussions are new for me in GitHub :) )

david-bouyssie · 2023-10-26T08:14:15Z

david-bouyssie
Oct 26, 2023
Maintainer

I think two distinct and interesting topics are being raised here:

1. Are there important use cases of mzid for Rusteomics that could justify the need to support it?
2. Should we adopt a common representation of PSMs (e.g. in mzcore)?

We could maybe start by answering the first point, because it is more pragmatic and has fewer consequences (despite the required amount of work).

Regarding the second point, there are a lot of considerations, and IMO it will take time to find a consensus. If we decide to answer the first question positively, we can then adopt an mzid PSM representation within the mzid module. This would give a viable solution to people wanting to deal with PSM-level data. And if we later find a strong rationale for a common PSM representation, this would give us a starting point.

3 replies

douweschulte Oct 26, 2023
Maintainer

Regarding question 1: For any work I want to do, I need support for identified peptides. For me personally that would be in the form of many different vendor formats (Peaks, Novor, Casanovo, ...), so if it were not built into mzio I would have to build it myself. I think that will be the case for many people wanting to do more post-processing of MS data obtained from other tools. So I would vote for inclusion, to prevent fragmentation and to build a gold-standard implementation in Rust.
Regarding 2: If I can dream for a second, the very nicest implementation gives you the option to read in many different vendor formats into the same memory layout (same struct) which can then be saved into a set of commonly used standardised identified peptide standards for use in other tools. This struct will need quite a bit of discussion before the layout is finalised but after that is done adding any new input or output format should be somewhat straightforward as the final memory layout is already known. And the effort of adding support for all different file types could be left as an exercise for later.
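
A rough sketch of that "many readers, one struct" shape, under stated assumptions: the `PsmReader` trait, the `Psm` fields, and both toy text layouts below are invented for illustration. They do not reproduce the real Peaks, Novor, or Casanovo formats, and nothing here is an existing mzio API.

```rust
/// Hypothetical shared in-memory layout that every reader produces.
#[derive(Debug, Clone, PartialEq)]
pub struct Psm {
    pub spectrum_id: String,
    pub sequence: String,
    pub charge: i32,
    pub score: f64,
}

/// One implementation per input format; the output type is always `Psm`.
pub trait PsmReader {
    fn read_psms(&self, input: &str) -> Result<Vec<Psm>, String>;
}

/// Toy format A (made up): spectrum_id, sequence, charge, score, tab-separated.
pub struct TabReader;
/// Toy format B (made up): sequence, score, charge, spectrum_id, comma-separated.
pub struct CommaReader;

/// Shared conversion from raw text fields to the common struct.
fn build_psm(id: &str, seq: &str, charge: &str, score: &str) -> Result<Psm, String> {
    Ok(Psm {
        spectrum_id: id.trim().to_string(),
        sequence: seq.trim().to_string(),
        charge: charge.trim().parse().map_err(|e| format!("bad charge {:?}: {}", charge, e))?,
        score: score.trim().parse().map_err(|e| format!("bad score {:?}: {}", score, e))?,
    })
}

impl PsmReader for TabReader {
    fn read_psms(&self, input: &str) -> Result<Vec<Psm>, String> {
        input.lines().filter(|l| !l.trim().is_empty()).map(|line| {
            let fields: Vec<&str> = line.split('\t').collect();
            match fields.as_slice() {
                [id, seq, charge, score] => build_psm(id, seq, charge, score),
                _ => Err(format!("expected 4 fields, got {}", fields.len())),
            }
        }).collect()
    }
}

impl PsmReader for CommaReader {
    fn read_psms(&self, input: &str) -> Result<Vec<Psm>, String> {
        input.lines().filter(|l| !l.trim().is_empty()).map(|line| {
            let fields: Vec<&str> = line.split(',').collect();
            match fields.as_slice() {
                // Column order differs from format A; the result does not.
                [seq, score, charge, id] => build_psm(id, seq, charge, score),
                _ => Err(format!("expected 4 fields, got {}", fields.len())),
            }
        }).collect()
    }
}
```

Once the layout of `Psm` is settled, supporting a new input format means adding one more `PsmReader` implementation, and writers for standardised output formats could be handled by an analogous trait.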

david-bouyssie Oct 26, 2023
Maintainer

> which can then be saved into a set of commonly used standardised identified peptide standards for use in other tools.

I think it depends on which tools we are talking about. Some tools will always require their own design, for several reasons. Maybe I'm wrong, but IMO it should benefit some downstream analysis tools, though not necessarily tools that process raw data directly (like Sage, for instance).
That doesn't mean it's not useful, but the target audience is presently maybe not as broad as for other features.
I would suggest first trying to list which tools/features would require this common API. Also, as mentioned in another discussion (https://github.com/orgs/rusteomics/discussions/8), the Rusteomics design could be driven by the implementation of some end-user features. This could help us select the most urgent APIs to develop, and decide how they should be developed.

julianu Oct 26, 2023
Author

Actually, your dream in 2) is exactly why I brought up the idea and why it would be great to have a good implementation of a PSM. I would definitely avoid having different structures for PSMs (or spectra) for each search engine, just because the naming or intentions differ slightly. Having one fixed struct would make all downstream work much easier.
And: it kind of works in OpenMS, they also read in everything and put it into their one structure.
