ODT proposal #136

@hmdne

While working on #135, I realized the idea is solid. This issue briefly describes what I plan to do; the milestones will need to change a little, though.

The idea in short: to support DOCX files, I plan to implement an ODT parser and a converter to Coradoc. This will not remove the LibreOffice dependency (unless users generate the ODT file themselves). In my experience, ODT is very close to HTML, yet it preserves far more semantics than LibreOffice's HTML output, so this should be fairly easy to do, at least compared to DOCX. I would describe the difference as follows: ODT was designed for document interchange, whereas DOCX was designed to represent internal MS Word structures serialized to XML, and, as @opoudjis noted, it isn't even well documented.
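To illustrate how close ODT's `content.xml` is to HTML, here is a minimal sketch that walks a hand-written ODT fragment with REXML (used only to keep the example dependency-free; the plan itself is to use Rubyzip and Lutaml::Model). The helper `odt_blocks` and the sample document are hypothetical and not part of Coradoc. The element names (`text:h`, `text:p`, `text:list`) and namespace URIs come from the OpenDocument format.

```ruby
require "rexml/document"

# OpenDocument namespace URIs.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS   = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Hand-written stand-in for the content.xml found inside an .odt archive.
SAMPLE_CONTENT_XML = <<~XML
  <office:document-content xmlns:office="#{OFFICE_NS}" xmlns:text="#{TEXT_NS}">
    <office:body>
      <office:text>
        <text:h text:outline-level="1">Scope</text:h>
        <text:p>This document specifies a template.</text:p>
        <text:list>
          <text:list-item><text:p>First item</text:p></text:list-item>
          <text:list-item><text:p>Second item</text:p></text:list-item>
        </text:list>
      </office:text>
    </office:body>
  </office:document-content>
XML

# Hypothetical helper: maps top-level ODT text elements to simple tuples,
# roughly the way <h1>, <p> and <ul> would be handled in HTML.
# Assumes the document binds the "text" prefix to TEXT_NS, as in the sample.
def odt_blocks(xml)
  doc  = REXML::Document.new(xml)
  body = REXML::XPath.first(doc, "//office:text", "office" => OFFICE_NS)

  body.elements.to_a.filter_map do |el|
    next unless el.namespace == TEXT_NS

    case el.name
    when "h"
      [:heading, el.attributes["text:outline-level"].to_i, el.text]
    when "p"
      [:paragraph, el.text]
    when "list"
      items = REXML::XPath.match(el, "text:list-item/text:p", "text" => TEXT_NS)
      [:list, items.map(&:text)]
    end
  end
end

blocks = odt_blocks(SAMPLE_CONTENT_XML)
blocks.each { |block| p block }
```

A real converter would also have to handle inline spans, styles, tables, and images; this only shows that the block structure maps over almost one to one.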

The plan is as follows:

  • vendor in the word-to-markdown dependency (part of Remove unsuitable gem dependencies #121)
    • the rationale for that:
      • while our new implementation will parse ODT directly, there will always be LibreOffice HTML documents in the wild
      • this implementation is in use and, for the most part, it works
      • there are some (small) issues with word-to-markdown that we may be able to fix locally
      • it's not a big job; most of the work is already done in HTML
  • create a gem that maps the ODT format using Rubyzip and Lutaml::Model
    • Rubyzip won't work with Opal, but we could polyfill it with a Node.js library, or simply not ship this part
  • use the above gem to create Coradoc::Input::Odt (this would supersede Ability to convert Word into Coradoc (and to adoc) #115; I recommend reading the discussion on that issue, as it refers to this one)
  • benchmark the implementation using the ISO Simple Template (Update implementation to be able to transform the ISO Simple Template docx #87)
  • ensure the implementation works with MS Word-generated ODT files
    • optional, and it would require me to buy an MS Word license
    • rationale:
      • would allow users to export ODT directly from MS Word
      • in the future, we could perhaps script an option to export ODT using the MS Word executable
  • switch the default for DOCX from the current Coradoc::Input::Docx to Coradoc::Input::Odt
    • Even at this point, I think we should keep the old implementation, so that users can choose the other one if the first breaks (the implementations could be named descriptively: DocxViaHtml and DocxViaOdt).
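The last step, keeping both implementations selectable with a changeable default, could look roughly like the sketch below. Everything here is hypothetical: `DocxInputRegistry`, the `via:` option, and the stand-in converter blocks are illustrations only, not Coradoc's actual API.

```ruby
# Hypothetical registry; none of these names are Coradoc's actual API.
class DocxInputRegistry
  def initialize(default:)
    @default = default
    @implementations = {}
  end

  # Registers a converter block under a descriptive name.
  def register(name, &converter)
    @implementations[name] = converter
    self
  end

  # Converts with the default implementation unless the caller picks another.
  def convert(path, via: @default)
    converter = @implementations.fetch(via) do
      raise ArgumentError, "unknown DOCX implementation: #{via.inspect}"
    end
    converter.call(path)
  end
end

registry = DocxInputRegistry.new(default: :docx_via_odt)

# Stand-in converters; real ones would return a Coradoc document.
registry.register(:docx_via_html) { |path| "converted #{path} via LibreOffice HTML" }
registry.register(:docx_via_odt)  { |path| "converted #{path} via ODT" }

puts registry.convert("report.docx")                      # uses the default
puts registry.convert("report.docx", via: :docx_via_html) # explicit choice
```

Switching the default then becomes a one-line change, and a user whose document breaks the ODT path can pass the other name explicitly.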

Any opinions on that plan?

@ronaldtse @ReesePlews @opoudjis @webdev778 @xyz65535
