While working on #135, I have realized the idea is solid. This issue briefly describes what I plan to do; the milestones will need to change a little, though.
The idea in short: to support DOCX files, I plan to implement an ODT parser and a converter to Coradoc. This will not remove the LibreOffice dependency (unless users generate the ODT file themselves). In my experience, ODT is very close to HTML, yet it preserves far more semantic information than LibreOffice's HTML output, so this should be fairly easy to do, at least compared to DOCX. I would describe the difference as follows: ODT was designed for document interchange, whereas DOCX was designed to represent internal MS Word structures serialized to XML, and as @opoudjis noted, it isn't even well documented.
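To illustrate how close ODT markup is to HTML-like block structure, here is a minimal sketch that maps a tiny ODT `content.xml` fragment to AsciiDoc text. It is not the planned implementation (which would go through Rubyzip and Lutaml::Model); it uses REXML only to stay dependency-free. The fragment, the `odt_fragment_to_adoc` helper, and the heading-level mapping (outline level N becomes N+1 equals signs) are all assumptions made for illustration.

```ruby
require "rexml/document"

# Hypothetical, heavily trimmed fragment of an ODT content.xml.
# Real files carry many more namespaces, styles, and wrapper elements.
ODT_FRAGMENT = <<~XML
  <office:text xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
               xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
    <text:h text:outline-level="2">Scope</text:h>
    <text:p>This document specifies a method.</text:p>
  </office:text>
XML

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Illustrative helper (not part of Coradoc): converts top-level headings
# and paragraphs to AsciiDoc, ignoring everything else.
def odt_fragment_to_adoc(xml)
  root = REXML::Document.new(xml).root
  blocks = root.elements.to_a.map do |el|
    next unless el.namespace == TEXT_NS

    case el.name
    when "h"
      # Assumed mapping: ODT outline level 1 -> "==", level 2 -> "===", ...
      level = el.attributes["text:outline-level"].to_i
      "#{'=' * (level + 1)} #{el.text}"
    when "p"
      el.text
    end
  end
  blocks.compact.join("\n\n")
end

puts odt_fragment_to_adoc(ODT_FRAGMENT)
```

The heading level is carried explicitly in `text:outline-level`, which is the kind of semantic information that tends to be harder to recover from exported HTML.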
The plan is as follows:
- vendor in the word-to-markdown dependency (part of "Remove unsuitable gem dependencies" #121)
- the rationale for that:
- while our new implementation will parse ODT directly, there will always be LibreOffice HTML documents in the wild
- this implementation is in use and, for the most part, it works
- there are some (small) issues with word-to-markdown that we may be able to fix locally
- it's not a big task; most of the work is already done in HTML
- create a gem that maps the ODT format using Rubyzip and Lutaml::Model
- Rubyzip won't work with Opal, but we could polyfill it with some Node.js library, or not ship this part
- use the above gem to create Coradoc::Input::Odt (this would supersede "Ability to convert Word into Coradoc (and to adoc)" #115; I recommend reading the discussion on that issue, as it refers to this one)
- benchmark the implementation using the ISO Simple Template ("Update implementation to be able to transform the ISO Simple Template docx" #87)
- this would make #87 depend on the ODT implementation
- generalize the plugin system and create an ISO Simple Template plugin (I assume this will be needed)
- ensure the implementation works with MS Word-generated ODT files
- optional, but it would require me to buy an MS Word license
- rationale:
- it would allow users to export ODT directly from MS Word
- in the future, we could perhaps script an option to export ODT using the MS Word executable
- switch the default for DOCX from the current Coradoc::Input::Docx to Coradoc::Input::Odt
- Even at this point, I think we should keep the old implementation, so that users can choose the other one if the first breaks (the implementations could be given descriptive names: DocxViaHtml and DocxViaOdt).
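The last step, keeping both implementations selectable, could look roughly like the sketch below. Everything here is hypothetical: the converters are stand-in lambdas rather than real Coradoc classes, and `ConversionError`, `DOCX_CONVERTERS`, and `convert_docx` are names invented for this illustration. It only shows the idea of a default order with a fallback that the user can override.

```ruby
# Hypothetical error type raised by a converter that cannot handle a file.
class ConversionError < StandardError; end

# Stand-in for the new ODT-based pipeline; it fails on a made-up file
# suffix so the fallback path can be demonstrated.
DocxViaOdt = lambda do |path|
  raise ConversionError, "ODT pipeline cannot handle #{path}" if path.end_with?(".broken.docx")

  "adoc from ODT pipeline: #{path}"
end

# Stand-in for the existing LibreOffice HTML pipeline.
DocxViaHtml = ->(path) { "adoc from HTML pipeline: #{path}" }

DOCX_CONVERTERS = { via_odt: DocxViaOdt, via_html: DocxViaHtml }.freeze

# Tries each converter in the given order and returns the first result.
# The default order reflects the proposed switch to the ODT pipeline.
def convert_docx(path, order: %i[via_odt via_html])
  errors = []
  order.each do |name|
    return DOCX_CONVERTERS.fetch(name).call(path)
  rescue ConversionError => e
    errors << "#{name}: #{e.message}"
  end
  raise ConversionError, "all converters failed (#{errors.join('; ')})"
end

puts convert_docx("report.docx")
puts convert_docx("report.broken.docx")
puts convert_docx("report.docx", order: [:via_html])
```

Whether the fallback should be automatic, as here, or an explicit user choice is a design decision the plan leaves open.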
Any opinions on that plan?