Skip to content

Commit 06a4529

Browse files
authored
Merge pull request #64 from cipherstash/docs-v1
docs: add docs v1 and improve readme
2 parents 9c4c08a + aa8fa98 commit 06a4529

File tree

16 files changed

+567
-458
lines changed

16 files changed

+567
-458
lines changed

.gitignore

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -185,3 +185,7 @@ cipherstash-proxy.toml
185185
release/
186186

187187
.mise.*
188+
189+
# jupyter notebook
190+
.ipynb_checkpoints
191+
__pycache__

README.md

Lines changed: 265 additions & 422 deletions
Large diffs are not rendered by default.

cipherstash-proxy/cipherstash-proxy.toml.example

Lines changed: 0 additions & 25 deletions
This file was deleted.

cipherstash-proxy/docker-compose.yaml

Lines changed: 0 additions & 11 deletions
This file was deleted.
File renamed without changes.

docs/reference/INDEX.md

Lines changed: 241 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,241 @@
1+
# EQL index configuration
2+
3+
The following functions allow you to configure indexes for encrypted columns.
4+
All these functions modify the `cs_configuration_v1` table in your database, and is added during the EQL installation.
5+
6+
> **IMPORTANT:** When you modify or add an index, you must re-encrypt data that's already been stored in the database.
7+
The CipherStash encryption solution will encrypt the data based on the current state of the configuration.
8+
9+
### Adding an index (`cs_add_index`)
10+
11+
Add an index to an encrypted column.
12+
13+
```sql
14+
SELECT cs_add_index_v1(
15+
'table_name', -- Name of the table
16+
'column_name', -- Name of the column
17+
'index_name', -- Index kind ('unique', 'match', 'ore', 'ste_vec')
18+
'cast_as', -- PostgreSQL type to cast decrypted data ('text', 'int', etc.)
19+
'opts' -- Index options as JSONB (optional)
20+
);
21+
```
22+
23+
| Parameter | Description | Notes |
24+
| ------------- | -------------------------------------------------- | ------------------------------------------------------------------------ |
25+
| `table_name` | Name of target table | Required |
26+
| `column_name` | Name of target column | Required |
27+
| `index_name` | The index kind | Required. |
28+
| `cast_as` | The PostgreSQL type decrypted data will be cast to | Optional. Defaults to `text` |
29+
| `opts` | Index options | Optional for `match` indexes, required for `ste_vec` indexes (see below) |
30+
31+
#### Option (`cast_as`)
32+
33+
Supported types:
34+
35+
- `text`
36+
- `int`
37+
- `small_int`
38+
- `big_int`
39+
- `boolean`
40+
- `date`
41+
- `jsonb`
42+
43+
#### Options for match indexes (`opts`)
44+
45+
A match index enables full text search across one or more text fields in queries.
46+
47+
The default Match index options are:
48+
49+
```json
50+
{
51+
"k": 6,
52+
"m": 2048,
53+
"include_original": true,
54+
"tokenizer": {
55+
"kind": "ngram",
56+
"token_length": 3
57+
}
58+
"token_filters": {
59+
"kind": "downcase"
60+
}
61+
}
62+
```
63+
64+
- `tokenFilters`: a list of filters to apply to normalize tokens before indexing.
65+
- `tokenizer`: determines how input text is split into tokens.
66+
- `m`: The size of the backing [bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) in bits. Defaults to `2048`.
67+
- `k`: The maximum number of bits set in the bloom filter per term. Defaults to `6`.
68+
69+
**Token filters**
70+
71+
There are currently only two token filters available: `downcase` and `upcase`. These are used to normalise the text before indexing and are also applied to query terms. An empty array can also be passed to `tokenFilters` if no normalisation of terms is required.
72+
73+
**Tokenizer**
74+
75+
There are two `tokenizer`s provided: `standard` and `ngram`.
76+
`standard` simply splits text into tokens using this regular expression: `/[ ,;:!]/`.
77+
`ngram` splits the text into n-grams and accepts a configuration object that allows you to specify the `tokenLength`.
78+
79+
**m** and **k**
80+
81+
`k` and `m` are optional fields for configuring [bloom filters](https://en.wikipedia.org/wiki/Bloom_filter) that back full text search.
82+
83+
`m` is the size of the bloom filter in bits. `filterSize` must be a power of 2 between `32` and `65536` and defaults to `2048`.
84+
85+
`k` is the number of hash functions to use per term.
86+
This determines the maximum number of bits that will be set in the bloom filter per term.
87+
`k` must be an integer from `3` to `16` and defaults to `6`.
88+
89+
**Caveats around n-gram tokenization**
90+
91+
While using n-grams as a tokenization method allows greater flexibility when doing arbitrary substring matches, it is important to bear in mind the limitations of this approach.
92+
Specifically, searching for strings _shorter_ than the `tokenLength` parameter will not _generally_ work.
93+
94+
If you're using n-gram as a token filter, then a token that is already shorter than the `tokenLength` parameter will be kept as-is when indexed, and so a search for that short token will match that record.
95+
However, if that same short string only appears as a part of a larger token, then it will not match that record.
96+
In general, therefore, you should try to ensure that the string you search for is at least as long as the `tokenLength` of the index, except in the specific case where you know that there are shorter tokens to match, _and_ you are explicitly OK with not returning records that have that short string as part of a larger token.
97+
98+
#### Options for ste_vec indexes (`opts`)
99+
100+
An ste_vec index on a encrypted JSONB column enables the use of PostgreSQL's `@>` and `<@` [containment operators](https://www.postgresql.org/docs/16/functions-json.html#FUNCTIONS-JSONB-OP-TABLE).
101+
102+
An ste_vec index requires one piece of configuration: the `context` (a string) which is passed as an info string to a MAC (Message Authenticated Code).
103+
This ensures that all of the encrypted values are unique to that context.
104+
It is generally recommended to use the table and column name as a the context (e.g. `users/name`).
105+
106+
Within a dataset, encrypted columns indexed using an `ste_vec` that use different contexts cannot be compared.
107+
Containment queries that manage to mix index terms from multiple columns will never return a positive result.
108+
This is by design.
109+
110+
The index is generated from a JSONB document by first flattening the structure of the document such that a hash can be generated for each unique path prefix to a node.
111+
112+
The complete set of JSON types is supported by the indexer.
113+
Null values are ignored by the indexer.
114+
115+
- Object `{ ... }`
116+
- Array `[ ... ]`
117+
- String `"abc"`
118+
- Boolean `true`
119+
- Number `123.45`
120+
121+
For a document like this:
122+
123+
```json
124+
{
125+
"account": {
126+
"email": "alice@example.com",
127+
"name": {
128+
"first_name": "Alice",
129+
"last_name": "McCrypto"
130+
},
131+
"roles": ["admin", "owner"]
132+
}
133+
}
134+
```
135+
136+
Hashes would be produced from the following list of entries:
137+
138+
```js
139+
[
140+
[Obj, Key("account"), Obj, Key("email"), String("alice@example.com")],
141+
[
142+
Obj,
143+
Key("account"),
144+
Obj,
145+
Key("name"),
146+
Obj,
147+
Key("first_name"),
148+
String("Alice"),
149+
],
150+
[
151+
Obj,
152+
Key("account"),
153+
Obj,
154+
Key("name"),
155+
Obj,
156+
Key("last_name"),
157+
String("McCrypto"),
158+
],
159+
[Obj, Key("account"), Obj, Key("roles"), Array, String("admin")],
160+
[Obj, Key("account"), Obj, Key("roles"), Array, String("owner")],
161+
];
162+
```
163+
164+
Using the first entry to illustrate how an entry is converted to hashes:
165+
166+
```js
167+
[Obj, Key("account"), Obj, Key("email"), String("alice@example.com")];
168+
```
169+
170+
The hashes would be generated for all prefixes of the full path to the leaf node.
171+
172+
```js
173+
[
174+
[Obj],
175+
[Obj, Key("account")],
176+
[Obj, Key("account"), Obj],
177+
[Obj, Key("account"), Obj, Key("email")],
178+
[Obj, Key("account"), Obj, Key("email"), String("alice@example.com")],
179+
// (remaining leaf nodes omitted)
180+
];
181+
```
182+
183+
Query terms are processed in the same manner as the input document.
184+
185+
A query prior to encrypting & indexing looks like a structurally similar subset of the encrypted document, for example:
186+
187+
```json
188+
{
189+
"account": {
190+
"email": "alice@example.com",
191+
"roles": "admin"
192+
}
193+
}
194+
```
195+
196+
The expression `cs_ste_vec_v1(encrypted_account) @> cs_ste_vec_v1($query)` would match all records where the `encrypted_account` column contains a JSONB object with an "account" key containing an object with an "email" key where the value is the string "alice@example.com".
197+
198+
When reduced to a prefix list, it would look like this:
199+
200+
```js
201+
[
202+
[Obj],
203+
[Obj, Key("account")],
204+
[Obj, Key("account"), Obj],
205+
[Obj, Key("account"), Obj, Key("email")],
206+
[Obj, Key("account"), Obj, Key("email"), String("alice@example.com")][
207+
(Obj, Key("account"), Obj, Key("roles"))
208+
],
209+
[Obj, Key("account"), Obj, Key("roles"), Array],
210+
[Obj, Key("account"), Obj, Key("roles"), Array, String("admin")],
211+
];
212+
```
213+
214+
Which is then turned into an ste_vec of hashes which can be directly queries against the index.
215+
216+
### Modifying an index (`cs_modify_index`)
217+
218+
Modifies an existing index configuration.
219+
Accepts the same parameters as `cs_add_index`
220+
221+
```sql
222+
SELECT cs_modify_index_v1(
223+
table_name text,
224+
column_name text,
225+
index_name text,
226+
cast_as text,
227+
opts jsonb
228+
);
229+
```
230+
231+
### Removing an index (`cs_remove_index`)
232+
233+
Removes an index configuration from the column.
234+
235+
```sql
236+
SELECT cs_remove_index_v1(
237+
table_name text,
238+
column_name text,
239+
index_name text
240+
);
241+
```
File renamed without changes.
File renamed without changes.
File renamed without changes.

0 commit comments

Comments
 (0)