Commit ef6e06d

Add my notes from the hyphenation exploration
1 parent 8357f4c commit ef6e06d

1 file changed: +97 -0 lines changed

doc/Architecture_Decision_Records.org

Lines changed: 97 additions & 0 deletions
@@ -7,6 +7,103 @@
#+TODO: DRAFT PROPOSED | ACCEPTED REJECTED DEPRECATED SUPERSEDED

* DRAFT Hyphenation
- Deciders :: CE
- Date :: [2024-12-20 Fr]

** Context and Problem Statement

Liblouis uses hyphenation dictionaries from the TeX project to provide
some functionality in the form of the ~nocross~ opcode prefix. It
would be nice if we could use off-the-shelf functionality instead of
re-implementing hyphenation ourselves, as the C version does.
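
To make the role of hyphenation concrete: as far as I understand the
liblouis documentation, a rule prefixed with ~nocross~ must not be
applied when the characters it matches span a hyphenation point. A
minimal sketch of that check, assuming break points are given as
offsets into the word; the function name ~crosses_break~ and the two
example rules are made up for illustration:

#+begin_src rust
/// Returns true if a rule matching the range `start..end` of a word
/// would span one of the hyphenation `breaks`.
/// A break at offset `b` sits between position `b - 1` and `b`.
fn crosses_break(breaks: &[usize], start: usize, end: usize) -> bool {
    breaks.iter().any(|&b| start < b && b < end)
}

fn main() {
    // Break offsets as reported for "bestemmer" further down in this note.
    let breaks = [7];

    // Hypothetical rule matching "mme" (offsets 5..8): spans the break.
    println!("{}", crosses_break(&breaks, 5, 8)); // prints true

    // Hypothetical rule matching "best" (offsets 0..4): does not.
    println!("{}", crosses_break(&breaks, 0, 4)); // prints false
}
#+end_src

A break that falls exactly on the start or end of the match does not
count as crossing, since the match then sits entirely on one side of
it.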

The [[https://crates.io/crates/hyphenation][hyphenation crate]] makes it fairly easy to use a dictionary. It
comes pre-configured with [[https://github.com/tapeinosyne/hyphenation/tree/master/dictionaries][a lot]] of TeX and OpenOffice hyphenation
dictionaries. These are not shipped in their standard form but are
encoded in the bincode format. This encoding happens during the build
process of the hyphenation crate, where all the [[https://github.com/tapeinosyne/hyphenation/tree/master/patterns][pattern files]] in
the ~patterns~ directory are encoded and stored in the ~dictionaries~
directory.
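
For context, a TeX pattern file is a list of patterns such as
~hy3ph~: the digits score the gaps between letters, the highest score
wins for each gap, and odd scores allow a break. The following is a
minimal sketch of that matching step (Liang's algorithm), not the
hyphenation crate's implementation. The function names are made up,
the patterns are the small subset often quoted for the word
"hyphenation" rather than anything from a liblouis dictionary, and
minimum prefix/suffix lengths and exception lists are ignored.

#+begin_src rust
use std::collections::HashMap;

/// Split a pattern such as "hy3ph" into its letters ("hyph") and the
/// score of each gap: before every letter and after the last one.
fn parse_pattern(pattern: &str) -> (String, Vec<u8>) {
    let mut letters = String::new();
    let mut scores = vec![0u8];
    for c in pattern.chars() {
        if let Some(d) = c.to_digit(10) {
            *scores.last_mut().unwrap() = d as u8;
        } else {
            letters.push(c);
            scores.push(0);
        }
    }
    (letters, scores)
}

/// Return the character offsets in `word` at which a break is allowed.
fn break_offsets(word: &str, patterns: &HashMap<String, Vec<u8>>) -> Vec<usize> {
    // Patterns use '.' to anchor at the start or end of a word.
    let chars: Vec<char> = format!(".{}.", word).chars().collect();
    // scores[i] is the score of the gap before chars[i].
    let mut scores = vec![0u8; chars.len() + 1];
    for start in 0..chars.len() {
        for end in (start + 1)..=chars.len() {
            let key: String = chars[start..end].iter().collect();
            if let Some(pattern_scores) = patterns.get(&key) {
                for (i, &s) in pattern_scores.iter().enumerate() {
                    scores[start + i] = scores[start + i].max(s);
                }
            }
        }
    }
    // Offset k in the word corresponds to gap k + 1 because of the leading '.'.
    let word_len = chars.len() - 2;
    (1..word_len).filter(|&k| scores[k + 1] % 2 == 1).collect()
}

fn main() {
    let patterns: HashMap<String, Vec<u8>> =
        ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]
            .iter()
            .map(|p| parse_pattern(p))
            .collect();
    // prints [2, 6], i.e. hy-phen-ation
    println!("{:?}", break_offsets("hyphenation", &patterns));
}
#+end_src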

I added the three relevant dictionaries from liblouis, namely
~da-dk-g2.dic~, ~de-g1-core-patterns.dic~ and
~de-g2-core-patterns.dic~, to the ~patterns~ folder of hyphenation,
added the files to ~build.rs~ and
~hyphenation_commons/src/language.rs~, and finally built the
hyphenation crate with

#+begin_src shell
cargo build --features build_dictionaries
#+end_src

The liblouis dictionary files were encoded and I grabbed them out of
~target/debug/build/hyphenation-4f7fc3b4af290d85/out/dictionaries~.

You can now load this dictionary and hyphenate words:

#+begin_src rust
use std::error::Error;

use hyphenation::Load;
use hyphenation::{Hyphenator, Language, Standard};

fn main() -> Result<(), Box<dyn Error>> {
    // Bincode-encoded dictionary taken from the build output directory.
    let path_to_dict = "/path/to/da-g2.standard.bincode";
    // The language passed here has to match the one stored in the file.
    let dictionary = Standard::from_path(Language::Dutch, path_to_dict)?;

    let hyphenated = dictionary.hyphenate("bestemmer");
    println!("Hello, {:?}!", hyphenated);

    Ok(())
}
#+end_src

which results in

#+begin_src shell
cargo run
Hello, Word { text: "bestemmer", breaks: [7] }!
#+end_src
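
To read that output: ~breaks~ lists the positions at which the word
may be split. A small sketch that turns such a result into a visibly
hyphenated string, assuming the entries are byte offsets into ~text~
in ascending order; ~mark_breaks~ is a made-up helper, not part of
the hyphenation crate:

#+begin_src rust
/// Insert `mark` at each byte offset in `breaks`.
/// Assumes the offsets are sorted and fall on character boundaries;
/// slicing panics otherwise.
fn mark_breaks(text: &str, breaks: &[usize], mark: &str) -> String {
    let mut out = String::with_capacity(text.len() + breaks.len() * mark.len());
    let mut prev = 0;
    for &b in breaks {
        out.push_str(&text[prev..b]);
        out.push_str(mark);
        prev = b;
    }
    out.push_str(&text[prev..]);
    out
}

fn main() {
    // prints bestemm-er
    println!("{}", mark_breaks("bestemmer", &[7], "-"));
}
#+end_src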

You'll notice that I used the language ~Language::Dutch~. The
language ~DanishGrade2~, which I had added to my local version of the
~hyphenation_commons~ crate, does not exist when I use the
~hyphenation~ crate from crates.io. If I use ~Language::EnglishUS~
instead, it compiles, but loading complains that the dictionary is for
~Language::Dutch~.

The problem is that the ~hyphenation_commons~ crate converts the list
of languages to an enum that is baked into the build. There does not
seem to be a way to load a dictionary without the ~Language~ enum.
The bincode seems to contain the language in its serialized data
structure.
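
One possible explanation for the dictionary showing up as Dutch, not
verified against the crate's source: bincode stores an enum value as
the index of its variant, so a file written with a locally extended
enum is read back as whichever variant has that index in the stock
enum. The sketch below only illustrates this effect. Both enums are
shortened, hypothetical excerpts, and the placement of
~DanishGrade2~ directly after ~Danish~ is an assumption.

#+begin_src rust
/// Language list as it might look in the crate from crates.io
/// (hypothetical excerpt).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Upstream {
    Czech,
    Danish,
    Dutch,
    EnglishGB,
}

/// The same list with a locally added variant (assumed placement).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Local {
    Czech,
    Danish,
    DanishGrade2,
    Dutch,
    EnglishGB,
}

const UPSTREAM: [Upstream; 4] = [
    Upstream::Czech,
    Upstream::Danish,
    Upstream::Dutch,
    Upstream::EnglishGB,
];

/// Stand-in for serialization: keep only the variant index.
fn encode(lang: Local) -> u32 {
    lang as u32
}

/// Stand-in for deserialization against the stock enum.
fn decode(index: u32) -> Option<Upstream> {
    UPSTREAM.get(index as usize).copied()
}

fn main() {
    let stored = encode(Local::DanishGrade2);
    // prints Some(Dutch)
    println!("{:?}", decode(stored));
}
#+end_src

If this is indeed what happens, any dictionary built against a
modified language list can only be loaded reliably by a build that
uses the same list.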

At the moment it looks like more research is needed into how we could
use the hyphenation crate with our own dictionaries. Maybe we'll have
to rip out the relevant parsing code from ~hyphenation_commons~ and
then hand the hyphenator a dictionary that we have deserialized
ourselves.

** Decision Drivers

** Considered Options

** Decision Outcome

Chosen option: "TBD", because ...

** Positive Consequences

-

** Negative Consequences

-

** Pros and Cons of the Options

** Links

* DRAFT Handle word boundaries
- Deciders :: CE
- Date :: [2024-03-08 Fr]
