# Solr OCR Coordinate Payload Plugin
*Efficient indexing and bounding-box "highlighting" for OCR text*

[Javadoc](http://javadoc.io/doc/de.digitalcollections.search/solr-ocrpayload-plugin)
[Build Status](https://travis-ci.org/dbmdz/solr-ocrpayload-plugin)
[Codecov](https://codecov.io/gh/dbmdz/solr-ocrpayload-plugin)
[License](LICENSE)
[GitHub Releases](https://github.com/dbmdz/solr-ocrpayload-plugin/releases)
[Maven Central](http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22solr-ocrpayload-plugin%22)

## tl;dr

- Store OCR bounding box information and token position directly in the Solr index in a space-efficient manner
- Retrieve bounding boxes and token positions directly in your Solr query results, with no additional parsing necessary

**Indexing**:

The OCR information is appended after each token as a concatenated list of `<key><val>` pairs; see further down
for a detailed description of the available keys.

`POST /solr/mycore/update`

```json
[{ "id": "test_document",
   "ocr_text": "this|p13l5n6x111y222w333h444 is|p13l5n7x222y333w444h555 a|p13l5n8x333y333w444h555 test|p13l5n9x444y333w444h555" }]
```

**Querying**:

The plugin adds a new top-level key (`ocr_highlight` in this case) that contains the OCR information for
each matching token as a structured object.

`GET /solr/mycore/select?ocr_hl=true&ocr_hl.fields=ocr_text&indent=true&wt=json&q=test`

```json
{
  "responseHeader": "...",
  "response": {
    "numFound": 1,
    "docs": [{"id": "test_document"}]
  },
  "ocr_highlight": {
    "test_document": {
      "ocr_text": [{
        "term": "test",
        "page": 13,
        "line": 5,
        "word": 9,
        "x": 0.444,
        "y": 0.333,
        "width": 0.444,
        "height": 0.555}]
    }
  }
}
```

## Use Case
At the Bavarian State Library, we try to provide full-text search over all of our OCRed content. In addition
to obtaining matching documents, the user should also get a small snippet of the corresponding part of the
page image, with the matching words highlighted, similar to what e.g. Google Books provides.

## Approaches
For this to work, we need some way of mapping matching tokens to their corresponding locations in the underlying
OCR text. A common approach, used by a number of libraries, is to **use a secondary microservice** that takes
a document identifier and a text snippet as input and returns the coordinates of all matching text snippets on
the page. While this approach generally works, it has several drawbacks:

- **Performance:** Every snippet requires a query to the OCR service, which itself has to do a linear scan
  through the OCR document. A result set of 100 snippets, for example, results in 101 queries (the initial
  Solr query and 100 snippet queries). This can of course be optimized by batching and by having a good index
  structure for the coordinate lookup, but it's still less than ideal.
- **Storage:** To reliably map text matches back to the base text, you have to store a copy of the
  full text alongside the regular index. This blows up the index size significantly.
  Foregoing storing the text and only using the normalized terms from the index for matching would
  break the mapping to the OCR, since, depending on the analyzer configuration, Lucene performs stemming, etc.

Alternatively, you could also **store the coordinates directly as strings in the index**. This works by e.g.
indexing each token as `<token>|<coordinates>` and telling Lucene to ignore everything after the pipe during
analysis. Since the full text of the document is stored, you will get back a series of these annotated tokens
as query results and can then parse the coordinates from your highlighting information. This solves the
*Performance* part of the above approach, but worsens the *Storage* problem: for every token, we now not only
have to store the token itself, but an expensive coordinate string as well.

## Our Approach

This plugin uses an approach similar to the above, but solves the *Storage* problem by using an efficient binary
format to store the OCR coordinate information in the index: We use bit-packing to combine a number of OCR
coordinate parameters into a **byte payload**, which is not stored in the field itself, but as an associated
[Lucene Payload](https://lucidworks.com/2017/09/14/solr-payloads/):

- `x`, `y`, `w`, `h`: **Relative** coordinates of the bounding box on the page, as floating point values between 0 and 1
- `pageIndex`: Unsigned integer that stores the page index of a token (optional)
- `lineIndex`: Unsigned integer that stores the line index of a token (optional)
- `wordIndex`: Unsigned integer that stores the word index of a token (optional)

For each of these values, you can configure the number of bits the plugin should use to store them, or disable
certain parameters entirely. This allows you to fine-tune the settings to your needs. In our case, for example, we
use these values: `4 * 12 bits (coordinates) + 9 bits (word index) + 11 bits (line index) + 12 bits (page index)`,
resulting in an 80-bit or 10-byte payload per token. A comparable string representation `p0l0n0x000y000w000h000`
would take at least 22 bytes, so we save more than 50% for every token.
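
To illustrate the encoding, here is a minimal bit-packing sketch in Java. It is illustrative only, not the plugin's actual implementation; all names are ours. It packs the example configuration's indices and quantized coordinates into a 10-byte payload:

```java
import java.math.BigInteger;

/** Illustrative sketch of the bit-packing idea; not the plugin's actual code. */
public class PayloadPackingSketch {

  // Bit widths from the example configuration above: 4 * 12 + 9 + 11 + 12 = 80 bits.
  static final int COORD_BITS = 12, WORD_BITS = 9, LINE_BITS = 11, PAGE_BITS = 12;

  /** Quantize a relative coordinate in [0, 1] to an unsigned COORD_BITS-bit integer. */
  static int quantize(double value) {
    return (int) Math.round(value * ((1 << COORD_BITS) - 1));
  }

  /** Pack page/line/word indices and a bounding box into a 10-byte payload. */
  static byte[] encode(int page, int line, int word, double x, double y, double w, double h) {
    BigInteger packed = BigInteger.valueOf(page);
    packed = packed.shiftLeft(LINE_BITS).or(BigInteger.valueOf(line));
    packed = packed.shiftLeft(WORD_BITS).or(BigInteger.valueOf(word));
    for (double coord : new double[] {x, y, w, h}) {
      packed = packed.shiftLeft(COORD_BITS).or(BigInteger.valueOf(quantize(coord)));
    }
    byte[] out = new byte[10];                 // 80 bits = 10 bytes
    byte[] raw = packed.toByteArray();         // big-endian, variable length
    int n = Math.min(raw.length, out.length);  // drop a possible leading sign byte
    System.arraycopy(raw, raw.length - n, out, out.length - n, n);
    return out;
  }
}
```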

At query time, we then retrieve the payload for each matching token and put the decoded information into the
`ocr_highlight` result key, where it can be used directly without any additional parsing.

## Usage
### Installation

Download the [latest release from GitHub](https://github.com/dbmdz/solr-ocrpayload-plugin/releases) and put the JAR into your `$SOLR_HOME/$SOLR_CORE/lib/` directory.

### Indexing configuration

To use the plugin, first add the `DelimitedOcrInfoPayloadTokenFilterFactory` filter to your analyzer chain (e.g. for an `ocr_text` field type):

```xml
<fieldtype name="text_ocr" class="solr.TextField"
           termVectors="true" termPositions="true" termPayloads="true">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="de.digitalcollections.lucene.analysis.util.DelimitedOcrInfoPayloadTokenFilterFactory"
            delimiter="☞" coordinateBits="10" wordBits="0" lineBits="0" pageBits="12" />
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldtype>
```

The filter takes five parameters:

- `coordinateBits`: Number of bits to use for encoding each OCR coordinate in the index (mandatory).<br/>
  A value of `10` is recommended, which stores coordinates accurate to approximately three decimal places.
- `delimiter`: Character used for delimiting the payload from the token in the input document (default: `|`)
- `wordBits`: Number of bits to use for encoding the word index.<br/>
  Set to 0 (default) to disable storage of the word index.
- `lineBits`: Number of bits to use for encoding the line index.<br/>
  Set to 0 (default) to disable storage of the line index.
- `pageBits`: Number of bits to use for encoding the page index.<br/>
  Set to 0 (default) to disable storage of the page index.

An *n*-bit index can store values from 0 to 2ⁿ−1, so e.g. `pageBits="12"` covers documents of up to 4096 pages.

The filter expects the input text to have the coordinates encoded as floating point values between
0 and 1, with the leading `0.` discarded, laid out as follows (values in brackets are optional):

`<token><delimiter>[p<pageIdx>][l<lineIdx>][n<wordIdx>]x<x>y<y>w<w>h<h>`

As an example, consider the token `foobar` with an OCR box of `(0.50712, 0.31432, 0.87148, 0.05089)`,
the configured delimiter `☞` and storage of indices for the word (`30`), line (`12`) and page (`13`):
`foobar☞p13l12n30x507y314w871h051`.
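
A minimal helper for producing this format might look like the following Java sketch (hypothetical, not part of the plugin; it rounds each relative coordinate to three digits, as in the example above):

```java
/** Hypothetical helper class; not part of the plugin. */
class OcrTokenFormatter {

  /** Format one OCR token for indexing, rounding each relative coordinate to three digits. */
  static String formatToken(String token, char delimiter, int page, int line, int word,
                            double x, double y, double w, double h) {
    return String.format("%s%cp%dl%dn%dx%03dy%03dw%03dh%03d",
        token, delimiter, page, line, word,
        Math.round(x * 1000), Math.round(y * 1000),
        Math.round(w * 1000), Math.round(h * 1000));
  }
}
```

Calling `OcrTokenFormatter.formatToken("foobar", '☞', 13, 12, 30, 0.50712, 0.31432, 0.87148, 0.05089)` yields exactly the example token above.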

Finally, you just have to configure your schema to use the field type defined above. Storing the content is **not**
recommended, since it significantly increases the index size and is not used at all for querying and highlighting:

```xml
<field name="ocr_text" type="text_ocr" indexed="true" stored="false" />
```

### Highlighting configuration

To enable highlighting using the OCR payloads, add the `OcrHighlighting` component to your Solr
configuration and configure it with the same `coordinateBits`, `wordBits`, `lineBits` and `pageBits` values
that were used for the filter in the analyzer chain:

```xml
<config>
  <searchComponent name="ocr_highlight"
                   class="de.digitalcollections.solr.plugin.components.ocrhighlighting.OcrHighlighting"
                   coordinateBits="10" wordBits="0" lineBits="0" pageBits="12" />

  <requestHandler name="standard" class="solr.StandardRequestHandler">
    <arr name="last-components">
      <str>ocr_highlight</str>
    </arr>
  </requestHandler>
</config>
```

Now at query time, you can just set the `ocr_hl=true` parameter, specify the fields you want highlighted via
`ocr_hl.fields=myfield,myotherfield` and retrieve highlighted matches with their OCR coordinates:

`GET /solr/mycore/select?ocr_hl=true&ocr_hl.fields=ocr_text&indent=true&q=augsburg&wt=json`

```json
{
  "responseHeader":{
    "status":0,
    "QTime":158},
  "response":{"numFound":526,"start":0,"docs":[
      {
        "id":"bsb10502835"},
      {
        "id":"bsb11032147"},
      {
        "id":"bsb10485243"},
      ...]},
  "ocr_highlight":{
    "bsb10502835":{
      "ocr_text":[{
          "page":7,
          "position":9,
          "term":"augsburg",
          "x":0.111,
          "y":0.062,
          "width":0.075,
          "height":0.013},
        {
          "page":7,
          "position":264,
          "term":"augsburg",
          "x":0.320,
          "y":0.670,
          "width":0.099,
          "height":0.012},
        ...]},
    ...
  }
}
```
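
Since the returned coordinates are relative, a client has to scale them by the pixel dimensions of the page image before drawing highlight boxes. A minimal sketch in Java (hypothetical helper names, not part of the plugin):

```java
/** Hypothetical client-side helper; not part of the plugin. */
class HighlightScaler {

  /** Scale a relative OCR box from the ocr_highlight response to pixel coordinates. */
  static int[] toPixelBox(double x, double y, double width, double height,
                          int imageWidth, int imageHeight) {
    return new int[] {
        (int) Math.round(x * imageWidth),       // left edge in pixels
        (int) Math.round(y * imageHeight),      // top edge in pixels
        (int) Math.round(width * imageWidth),   // box width in pixels
        (int) Math.round(height * imageHeight)  // box height in pixels
    };
  }
}
```

For the first match above on a 1000×1500 px page image, this yields a 75×20 px box at (111, 93).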

## FAQ

- **How does highlighting work with phrase queries?**

  You will receive a bounding box object for every individual matching term in the phrase.

- **What are the performance and storage implications of using this plugin?**

  *Performance*: With an Intel Xeon E5-1620 @ 3.5GHz on a single core, we measured (with JMH):

  - Encoding the payload: 1,484,443.200 payloads/second, or ~14.2MiB/s with an 80-bit payload
  - Decoding the payload: 1,593,036.372 payloads/second, or ~15.2MiB/s with an 80-bit payload

  *Storage*: This depends on your configuration. With our sample configuration of an 80-bit payload
  (see above), the payload overhead is 10 bytes per token. That is, for a corpus of 10 million tokens,
  you will need approximately 95MiB to store the payloads (10,000,000 × 10 bytes ≈ 95.4MiB).
  The actual storage required might be lower, since Lucene compresses the payloads with LZ4.

- **Does this work with SolrCloud?**

  It does! We're running it with SolrCloud ourselves.