This repository was archived by the owner on Mar 20, 2019. It is now read-only.

Commit faf2b8d

Initial open-source commit

Changes compared to the in-house development version:

- Made the number of bits for page, line and word indices configurable
- Much more detailed documentation
- Updated all docstrings
- Tests for various usage scenarios

23 files changed, +1954 -0 lines

.gitignore

Lines changed: 2 additions & 0 deletions

/.idea/
target

.travis.yml

Lines changed: 23 additions & 0 deletions

language: java
jdk:
  - openjdk8

before_script:
  - if [ ! -z "$TRAVIS_TAG" ]; then mvn versions:set -DnewVersion=$TRAVIS_TAG; fi

script:
  - mvn clean install

after_success:
  - bash <(curl -s https://codecov.io/bash)

deploy:
  provider: releases
  api_key:
    secure: B2aWP1iYhmrBeyERQxlgnt1qA7pjCNuNwjY31CrrOrCo0tlhKGN1S2yB6xpNZqT4uXdM52Y1xFI+wS5Pnptck9LpdrLhOVVEKqC0YRHdiuUQ2PjzdBTPwsP1euk6deKWiOkhB7srqi/+Wc7Yu78yFSluVAWmTHVeXwy53Eu9VaJhMPaLRH/tRF2xvh2DDHzl0B+ESRNeUMMG+rHl+8kwahg9RkR6BzIW8dPHMNTXLo5p0uqXt5TlquRvqlp6wcB3D/OjYiaNtMaxM17GuEq6GOHzEk2Ctx2ahLP1zT8rRIG1VPWKqlGZyXmMnj4jmWrJid6O9LMPcAWmHmysZ/Ii1g/rVOaifarBIkGXmSZGfjjKmiZKOXgdatJTfm7qTy/SBbjZsxiXVA1FXOQO44MJbpQccS+omnKID+uYe+J5rO8vqjHISfKVuLYy2EjkFZfG1p4rhQ4Egjo4g7QHjx7hUb/ASTPBv4tgz6CrJ7Hd3o2Zxyyt0ZunUzSWgQgtPXKzMfaCnvwyQrheMT15ZveC4sOpsZqUzqd6Vl2zg/IMsCOAYZz7koB/xA0MFBGlaMYtXSx8sMJtSR+RkpiSkc32xFsXP/ae0lRvPykcmHNoUJvQgfrM0lkvjxIdY+PfsII+yToz9ex62md0wnIZ73UFM2V8azlrUDHTYXGqagpnTTU=
  file_glob: true
  file:
    - "**/target/*.jar"
  skip_cleanup: true
  on:
    tags: true

LICENSE

Lines changed: 21 additions & 0 deletions

The MIT License (MIT)

Copyright (c) 2018 Munich Digitization Center/Bavarian State Library

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

Lines changed: 245 additions & 0 deletions

# Solr OCR Coordinate Payload Plugin

*Efficient indexing and bounding-box "highlighting" for OCR text*

[![Javadocs](http://javadoc.io/badge/de.digitalcollections.search/solr-ocrpayload-plugin.svg)](http://javadoc.io/doc/de.digitalcollections.search/solr-ocrpayload-plugin)
[![Build Status](https://travis-ci.org/dbmdz/solr-ocrpayload-plugin.svg?branch=master)](https://travis-ci.org/dbmdz/solr-ocrpayload-plugin)
[![codecov](https://codecov.io/gh/dbmdz/solr-ocrpayload-plugin/branch/master/graph/badge.svg)](https://codecov.io/gh/dbmdz/solr-ocrpayload-plugin)
[![MIT License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
[![GitHub release](https://img.shields.io/github/release/dbmdz/solr-ocrpayload-plugin.svg?maxAge=2592000)](https://github.com/dbmdz/solr-ocrpayload-plugin/releases)
[![Maven Central](https://img.shields.io/maven-central/v/de.digitalcollections.search/solr-ocrpayload-plugin.svg?maxAge=2592000)](http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22solr-ocrpayload-plugin%22)

## tl;dr

- Store OCR bounding box information and token positions directly in the Solr index in a space-efficient manner
- Retrieve bounding boxes and token positions directly in your Solr query results, with no additional parsing necessary

**Indexing**:

The OCR information is appended after each token as a concatenated list of `<key><val>` pairs; see further down for a detailed description of the available keys.

`POST /solr/mycore/update`

```json
[{ "id": "test_document",
   "ocr_text": "this|p13l5n6x111y222w333h444 is|p13l5n7x222y333w444h555 a|p13l5n8x333y333w444h555 test|p13l5n9x444y333w444h555" }]
```

**Querying**:

The plugin adds a new top-level key (`ocr_highlight` in this case) that contains the OCR information for each matching token as a structured object.

`GET /solr/mycore/select?ocr_hl=true&ocr_hl.fields=ocr_text&indent=true&wt=json&q=test`

```json
{
  "responseHeader": "...",
  "response": {
    "numFound": 1,
    "docs": [{"id": "test_document"}]
  },
  "ocr_highlight": {
    "test_document": {
      "ocr_text": [{
        "term": "test",
        "page": 13,
        "line": 5,
        "word": 9,
        "x": 0.444,
        "y": 0.333,
        "width": 0.444,
        "height": 0.555}]
    }
  }
}
```

## Use Case

At the Bavarian State Library, we aim to provide full-text search over all of our OCRed content. In addition to obtaining matching documents, the user should also get a small snippet of the corresponding part of the page image, with the matching words highlighted, similar to what e.g. Google Books provides.

## Approaches

For this to work, we need some way of mapping matching tokens to their corresponding locations in the underlying OCR text. A common approach, used by a number of libraries, is to **use a secondary microservice** that takes a document identifier and a text snippet as input and returns the coordinates of all matching text snippets on the page. While this approach generally works okay, it has several drawbacks:

- **Performance:** Every snippet requires a query to the OCR service, which itself has to do a linear scan through the OCR document. For a result set of e.g. 100 snippets, this results in 101 queries (the initial Solr query plus 100 snippet queries). This can be optimized by batching and by having a good index structure for the coordinate lookup, but it is still less than ideal.
- **Storage:** To reliably map text matches back to the base text, you have to store a copy of the full text in the index, alongside the regular index, which blows up the index size significantly. Forgoing storage of the text and only using the normalized terms from the index for matching will break the mapping to the OCR, since, depending on the analyzer configuration, Lucene will perform stemming, etc.

Alternatively, you could **store the coordinates directly as strings in the index**. This works by e.g. indexing each token as `<token>|<coordinates>` and telling Lucene to ignore everything after the pipe during analysis. Since the full text of the document is stored, you will get back a series of these annotated tokens as query results and can then parse the coordinates from your highlighting information. This solves the *Performance* problem of the above approach, but worsens the *Storage* problem: for every token, we now have to store not only the token itself but an expensive coordinate string as well.

## Our Approach

This plugin uses a similar approach to the above, but solves the *Storage* problem by using an efficient binary format to store the OCR coordinate information in the index: we use bit-packing to combine a number of OCR coordinate parameters into a **byte payload**, which is not stored in the field itself, but as an associated [Lucene Payload](https://lucidworks.com/2017/09/14/solr-payloads/):

- `x`, `y`, `w`, `h`: **Relative** coordinates of the bounding box on the page, as floating point values between 0 and 1
- `pageIndex`: Unsigned integer that stores the page index of a token (optional)
- `lineIndex`: Unsigned integer that stores the line index of a token (optional)
- `wordIndex`: Unsigned integer that stores the word index of a token (optional)

For each of these values, you can configure the number of bits the plugin should use to store them, or disable certain parameters entirely. This allows you to fine-tune the settings to your needs. In our case, for example, we use `4 * 12 bits (coordinates) + 9 bits (word index) + 11 bits (line index) + 12 bits (page index)`, resulting in an 80 bit (10 byte) payload per token. A comparable string representation `p0l0n0x000y000w000h000` would take at least 22 bytes, so we save >50% for every token.
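
To make the bit-packing concrete, here is a minimal Java sketch of how such an 80 bit payload could be assembled (an illustration of the technique under the sample layout above, not the plugin's actual classes; `quantize` and `writeBits` are made-up helper names):

```java
/** Illustrative sketch: 4 * 12 coordinate bits + 9 word bits + 11 line bits
 *  + 12 page bits = 80 bits, i.e. a 10 byte payload per token. */
public class OcrPayloadSketch {

  /** Quantizes a relative coordinate in [0, 1] to an unsigned `bits`-bit integer. */
  static int quantize(float value, int bits) {
    return Math.round(value * ((1 << bits) - 1));
  }

  /** Writes the lowest `bits` bits of `value` (MSB first) at bit offset `pos`. */
  static int writeBits(byte[] buf, int pos, int value, int bits) {
    for (int i = bits - 1; i >= 0; i--, pos++) {
      if (((value >>> i) & 1) == 1) {
        buf[pos / 8] |= 1 << (7 - (pos % 8));
      }
    }
    return pos;
  }

  static byte[] encode(float x, float y, float w, float h,
                       int wordIdx, int lineIdx, int pageIdx) {
    byte[] payload = new byte[10]; // 80 bits total
    int pos = 0;
    pos = writeBits(payload, pos, quantize(x, 12), 12);
    pos = writeBits(payload, pos, quantize(y, 12), 12);
    pos = writeBits(payload, pos, quantize(w, 12), 12);
    pos = writeBits(payload, pos, quantize(h, 12), 12);
    pos = writeBits(payload, pos, wordIdx, 9);
    pos = writeBits(payload, pos, lineIdx, 11);
    writeBits(payload, pos, pageIdx, 12);
    return payload;
  }

  public static void main(String[] args) {
    byte[] payload = encode(0.507f, 0.314f, 0.871f, 0.051f, 30, 12, 13);
    // 10 bytes instead of the >=22 bytes a string representation would need
    System.out.println(payload.length + " bytes");
  }
}
```

Decoding reverses the same layout: read each bit field back and divide the quantized coordinates by `(1 << bits) - 1`.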

At query time, we then retrieve the payload for each matching token and put the decoded information into the `ocr_highlight` result key, where it can be used directly, without any additional parsing.

## Usage

### Installation

Download the [latest release from GitHub](https://github.com/dbmdz/solr-ocrpayload-plugin/releases) and put the JAR into your `$SOLR_HOME/$SOLR_CORE/lib/` directory.

### Indexing configuration

To use the plugin, first add the `DelimitedOcrInfoPayloadTokenFilterFactory` filter to your analyzer chain (e.g. for an `ocr_text` field type):

```xml
<fieldtype name="text_ocr" class="solr.TextField"
           termVectors="true" termPositions="true" termPayloads="true">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="de.digitalcollections.lucene.analysis.util.DelimitedOcrInfoPayloadTokenFilterFactory"
            delimiter="|" coordinateBits="10" wordBits="0" lineBits="0" pageBits="12" />
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldtype>
```

The filter takes five parameters:

- `coordinateBits`: Number of bits to use for encoding each OCR coordinate in the index (mandatory).<br/>
  A value of `10` is recommended, which resolves the coordinates to approximately three decimal places.
- `delimiter`: Character used for delimiting the payload from the token in the input document (default: `|`)
- `wordBits`: Number of bits to use for encoding the word index.<br/>
  Set to 0 (default) to disable storage of the word index.
- `lineBits`: Number of bits to use for encoding the line index.<br/>
  Set to 0 (default) to disable storage of the line index.
- `pageBits`: Number of bits to use for encoding the page index.<br/>
  Set to 0 (default) to disable storage of the page index.

The filter expects the input text to have the coordinates encoded as floating point values between 0 and 1, with the leading `0.` discarded, laid out as follows (values in brackets are optional):

`<token><delimiter>[p<pageIdx>][l<lineIdx>][n<wordIdx>]x<x>y<y>w<w>h<h>`

As an example, consider the token `foobar` with an OCR box of `(0.50712, 0.31432, 0.87148, 0.05089)`, the default delimiter `|`, and storage of indices for the word (`30`), line (`12`) and page (`13`):
`foobar|p13l12n30x507y314w871h051`.
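
As a concrete illustration of this input syntax, a small Java helper (hypothetical, not part of the plugin) could produce it from raw OCR data, assuming the default `|` delimiter and three decimal places of precision:

```java
public class OcrTokenFormatter {

  /** Formats one OCR token into the input syntax expected by the filter. */
  static String formatToken(String token, int page, int line, int word,
                            float x, float y, float w, float h) {
    // Relative coordinates in [0, 1]: drop the leading "0." and keep three
    // decimal places, e.g. 0.50712 -> "507".
    return String.format("%s|p%dl%dn%dx%03dy%03dw%03dh%03d",
        token, page, line, word,
        Math.round(x * 1000), Math.round(y * 1000),
        Math.round(w * 1000), Math.round(h * 1000));
  }

  public static void main(String[] args) {
    // Prints: foobar|p13l12n30x507y314w871h051
    System.out.println(formatToken("foobar", 13, 12, 30,
        0.50712f, 0.31432f, 0.87148f, 0.05089f));
  }
}
```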

Finally, you just have to configure your schema to use the field type defined above. Storing the content is **not** recommended, since it significantly increases the index size and is not used at all for querying and highlighting:

```xml
<field name="ocr_text" type="text_ocr" indexed="true" stored="false" />
```

### Highlighting configuration

To enable highlighting using the OCR payloads, add the `OcrHighlighting` component to your Solr configuration and configure it with the same `coordinateBits`, `wordBits`, `lineBits` and `pageBits` values that were used for the filter in the analyzer chain:

```xml
<config>
  <searchComponent name="ocr_highlight"
                   class="de.digitalcollections.solr.plugin.components.ocrhighlighting.OcrHighlighting"
                   coordinateBits="10" wordBits="0" lineBits="0" pageBits="12" />

  <requestHandler name="standard" class="solr.StandardRequestHandler">
    <arr name="last-components">
      <str>ocr_highlight</str>
    </arr>
  </requestHandler>
</config>
```

Now at query time, you can simply set the `ocr_hl=true` parameter, specify the fields you want highlighted via `ocr_hl.fields=myfield,myotherfield`, and retrieve the highlighted matches with their OCR coordinates:

`GET /solr/mycore/select?ocr_hl=true&ocr_hl.fields=ocr_text&indent=true&q=augsburg&wt=json`

```json
{
  "responseHeader":{
    "status":0,
    "QTime":158},
  "response":{"numFound":526,"start":0,"docs":[
      {
        "id":"bsb10502835"},
      {
        "id":"bsb11032147"},
      {
        "id":"bsb10485243"},
      ...
    ]},
  "ocr_highlight":{
    "bsb10502835":{
      "ocr_text":[{
        "page":7,
        "position":9,
        "term":"augsburg",
        "x":0.111,
        "y":0.062,
        "width":0.075,
        "height":0.013},
       {
        "page":7,
        "position":264,
        "term":"augsburg",
        "x":0.320,
        "y":0.670,
        "width":0.099,
        "height":0.012},
       ...]}},
    ...
  }
}
```
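
If you query Solr from Java, the same parameters can be passed with SolrJ. A minimal sketch, assuming a local core named `mycore` (the client setup is an illustration, not part of the plugin):

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class OcrHighlightQueryExample {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient solr =
        new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
      SolrQuery query = new SolrQuery("augsburg");
      query.set("ocr_hl", true);               // enable OCR highlighting
      query.set("ocr_hl.fields", "ocr_text");  // fields to highlight
      QueryResponse resp = solr.query(query);
      // The decoded OCR boxes arrive in the custom "ocr_highlight" section.
      System.out.println(resp.getResponse().get("ocr_highlight"));
    }
  }
}
```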

## FAQ

- **How does highlighting work with phrase queries?**

  You will receive a bounding box object for every individual matching term in the phrase.

- **What are the performance and storage implications of using this plugin?**

  *Performance*: With an Intel Xeon E5-1620@3.5GHz on a single core, we measured (with JMH):

  - Encoding the payload: 1,484,443.200 payloads/second, or ~14.2 MiB/s with an 80 bit payload
  - Decoding the payload: 1,593,036.372 payloads/second, or ~15.2 MiB/s with an 80 bit payload

  *Storage*: This depends on your configuration. With our sample configuration of an 80 bit payload (see above), the payload overhead is 10 bytes per token. That is, for a corpus of 10 million tokens, you will need approximately 95 MiB to store the payloads (10,000,000 × 10 bytes ≈ 95.4 MiB). The actual storage required might be lower, since Lucene compresses the payloads with LZ4.

- **Does this work with SolrCloud?**

  It does! We're running it with SolrCloud ourselves.
