|
7 | 7 |
|
8 | 8 | #+TODO: DRAFT PROPOSED | ACCEPTED REJECTED DEPRECATED SUPERSEDED |
9 | 9 |
|
| 10 | +* DRAFT Hyphenation |
| 11 | +- Deciders :: CE |
| 12 | +- Date :: [2024-12-20 Fr] |
| 13 | + |
| 14 | +** Context and Problem Statement |
| 15 | + |
| 16 | +Liblouis uses hyphenation dictionaries from the TeX project to |
| 17 | +implement the ~nocross~ opcode prefix. It would be preferable to use |
| 18 | +an off-the-shelf library instead of re-implementing hyphenation as |
| 19 | +the C version does. |
| 20 | + |
| 21 | +The [[https://crates.io/crates/hyphenation][hyphenation crate]] makes it fairly easy to use a dictionary. It |
| 22 | +ships with [[https://github.com/tapeinosyne/hyphenation/tree/master/dictionaries][many]] TeX and OpenOffice hyphenation dictionaries. |
| 23 | +These are not stored in their standard form but are serialized |
| 24 | +in the bincode format. This encoding happens during the build |
| 25 | +process of the hyphenation crate, where all the [[https://github.com/tapeinosyne/hyphenation/tree/master/patterns][pattern files]] in |
| 26 | +the ~patterns~ directory are encoded and stored in the ~dictionaries~ |
| 27 | +directory. |
| 28 | + |
| 29 | +I added the 3 relevant dictionaries from liblouis, namely |
| 30 | +~da-dk-g2.dic~, ~de-g1-core-patterns.dic~ and |
| 31 | +~de-g2-core-patterns.dic~ to the patterns folder of hyphenation, added |
| 32 | +the files to ~build.rs~ and ~hyphenation_commons/src/language.rs~ and |
| 33 | +finally built the hyphenation crate with |
| 34 | + |
| 35 | +#+begin_src shell |
| 36 | + cargo build --features build_dictionaries |
| 37 | +#+end_src |
| 38 | + |
| 39 | +The liblouis dictionary files were encoded, and I copied them out of |
| 40 | +~target/debug/build/hyphenation-4f7fc3b4af290d85/out/dictionaries~. |
| 41 | + |
| 42 | +You can now load this dictionary and hyphenate words: |
| 43 | + |
| 44 | +#+begin_src rust |
| 45 | +use std::error::Error; |
| 46 | + |
| 47 | +use hyphenation::{Hyphenator, Language, Load, Standard}; |
| 48 | + |
| 49 | +fn main() -> Result<(), Box<dyn Error>> { |
| 50 | +    let path_to_dict = "/path/to/da-g2.standard.bincode"; |
| 51 | +    // The published crate only accepts this file as Language::Dutch. |
| 52 | +    let dictionary = Standard::from_path(Language::Dutch, path_to_dict)?; |
| 53 | + |
| 54 | +    let hyphenated = dictionary.hyphenate("bestemmer"); |
| 55 | +    println!("Hello, {:?}!", hyphenated); |
| 56 | + |
| 57 | +    Ok(()) |
| 58 | +} |
| 59 | +#+end_src |
| 60 | + |
| 61 | +which results in |
| 62 | + |
| 63 | +#+begin_src shell |
| 64 | +cargo run |
| 65 | +Hello, Word { text: "bestemmer", breaks: [7] }! |
| 66 | +#+end_src |
| 67 | + |
| 68 | +You'll notice that I used the language ~Language::Dutch~. The |
| 69 | +language ~DanishGrade2~, which I had added to my local version of the |
| 70 | +~hyphenation_commons~ crate, does not exist when I use the |
| 71 | +~hyphenation~ crate from crates.io. If I use ~Language::EnglishUS~ it |
| 72 | +compiles, but loading fails with an error reporting that the |
| 73 | +dictionary is for ~Language::Dutch~. |
| 74 | + |
| 75 | +The problem is that the ~hyphenation_commons~ crate converts the list |
| 76 | +of languages to an enum that is baked into the build. There does not |
| 77 | +seem to be a way to load a dictionary without the ~Language~ enum. |
| 78 | +The bincode seems to contain the language in its serialized data |
| 79 | +structure. |
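
One possible explanation for the ~Language::Dutch~ report, offered as an assumption rather than a confirmed finding: bincode 1.x stores an enum variant as its numeric index, so a variant added locally (~DanishGrade2~, presumably right after ~Danish~) could receive the index that the published enum assigns to ~Dutch~. The sketch below illustrates the effect with two shortened, hypothetical stand-in enums (~LocalLanguage~, ~PublishedLanguage~) and hand-written helpers (~encode_local~, ~decode_published~). None of these names exist in the hyphenation crate, and the real enum is much longer.

#+begin_src rust
// Hypothetical stand-in for the locally patched enum with the extra variant.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum LocalLanguage {
    Czech,
    Danish,
    DanishGrade2, // added locally, shifts every later variant by one
    Dutch,
}

// Hypothetical stand-in for the enum as published on crates.io.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PublishedLanguage {
    Czech,
    Danish,
    Dutch,
}

// Assumption: the serializer writes the variant's position in the enum.
fn encode_local(lang: LocalLanguage) -> u32 {
    lang as u32
}

// The published build maps the stored index back onto its own variant list.
fn decode_published(index: u32) -> Option<PublishedLanguage> {
    match index {
        0 => Some(PublishedLanguage::Czech),
        1 => Some(PublishedLanguage::Danish),
        2 => Some(PublishedLanguage::Dutch),
        _ => None,
    }
}

fn main() {
    // Written by the patched build as DanishGrade2 (index 2) ...
    let stored = encode_local(LocalLanguage::DanishGrade2);
    // ... and read back by the published build as whatever sits at index 2.
    println!("{:?}", decode_published(stored)); // prints "Some(Dutch)"
}
#+end_src

If this is what happens, any dictionary tagged with a locally added variant will be read as some unrelated language, or fail to load, whenever the reader's enum differs from the writer's.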
| 80 | + |
| 81 | +More research is needed into how we could use the hyphenation crate |
| 82 | +with our own dictionaries. We may have to extract the relevant |
| 83 | +parsing code from ~hyphenation_commons~ and then provide the |
| 84 | +hyphenator with a deserialized version of the dictionary. |
| 85 | + |
| 86 | + |
| 87 | +** Decision Drivers |
| 88 | + |
| 89 | +** Considered Options |
| 90 | + |
| 91 | +** Decision Outcome |
| 92 | + |
| 93 | +Chosen option: "TBD", because ... |
| 94 | + |
| 95 | +** Positive Consequences |
| 96 | + |
| 97 | +- |
| 98 | + |
| 99 | +** Negative Consequences |
| 100 | + |
| 101 | +- |
| 102 | + |
| 103 | +** Pros and Cons of the Options |
| 104 | + |
| 105 | +** Links |
| 106 | + |
10 | 107 | * DRAFT Handle word boundaries |
11 | 108 | - Deciders :: CE |
12 | 109 | - Date :: [2024-03-08 Fr] |