
Commit dbe27f6

initial commit (0 parents)

File tree

9 files changed: +1388 -0 lines changed


.gitignore

Lines changed: 5 additions & 0 deletions

env
.ruff_cache
dist
*.egg-info
**/__pycache__

LICENSE

Lines changed: 21 additions & 0 deletions

MIT License

Copyright (c) 2023 Matt Rickard

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

Lines changed: 114 additions & 0 deletions

# ReLLM

Regular Expressions for Language Model Completions.

Get exact structure out of any language model completion with regular expressions.

Return specific syntactic structure (e.g. JSON or XML), specific semantic structure (e.g. a date or a number), or even complete templates (e.g. a sentence with a blank to fill in).

How does it work? For each token, ReLLM tests every possible completion against a partial regex. It masks the logits of any candidate that cannot match the pattern, so the language model never generates those tokens.
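The partial-match test at the heart of this can be sketched in a few lines. This is a toy illustration, not the library's actual code: the tiny vocabulary, the `allowed_token_ids` helper, and the example pattern are all hypothetical. It relies on the `regex` package's `partial=True` matching to check whether a string could still grow into a full match:

```python
import regex

# Toy stand-in for a real tokenizer vocabulary: token id -> token text.
vocab = {0: "a", 1: "b", 2: '"', 3: "[", 4: "]", 5: ", "}

pattern = regex.compile(r'\["[a-z]", "[a-z]"\]')

def allowed_token_ids(generated_so_far: str) -> set:
    """Return the token ids whose text keeps the output a viable
    prefix of the pattern; regex's partial matching reports whether
    a string could still be extended into a full match."""
    allowed = set()
    for token_id, token_text in vocab.items():
        candidate = generated_so_far + token_text
        if pattern.fullmatch(candidate, partial=True):
            allowed.add(token_id)
    return allowed

print(allowed_token_ids(""))         # only '[' is viable at the start
print(allowed_token_ids('["a", "'))  # only the lowercase letters are viable
```

In the real decoding loop, the logits of every token id outside the allowed set would be masked before sampling.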
10+
### Installation

```
pip install rellm
```

The preliminary results are interesting: even for small models, constraining the token space with ReLLM can improve the quality of the completions, and the output becomes much easier to parse programmatically. Take a look at some of the examples below (you can run them with [example.py](example.py)).

```python
import regex
from transformers import AutoModelForCausalLM, AutoTokenizer

from rellm import complete_re

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

prompt = "ReLLM, the best way to get structured data out of LLMs, is an acronym for "
pattern = regex.compile(r'Re[a-z]+ L[a-z]+ L[a-z]+ M[a-z]+')
output = complete_re(tokenizer=tokenizer,
                     model=model,
                     prompt=prompt,
                     pattern=pattern,
                     do_sample=True,
                     max_new_tokens=80)
print(output)
```

```
> Realized Logistic Logistics Model
```

## Examples using GPT2 (124 million parameters)

**Prompt**: Return the first three letters of the alphabet in a json array:

**Pattern**: \[\"[a-z]\", \"[a-z]\", \"[a-z]\"\]

**ReLLM**: ["a", "b", "c"]

**Without ReLLM**: { "index": 0, "id":"1", "description":"", "text": "[{ "id": 0, "name":

#

**Prompt**: Fill in the sentence with an interesting story about the dentist:

**Pattern**: Today I\'m going to the [a-z]+ to [a-z]+ because ([a-z]+ )*\.

**ReLLM**: Today I'm going to the dentist to see because it is a very important day for me

**Without ReLLM**: 'My family bought me an appointment with a dentist when I was 15. The dentist gave me one a year and then I was told on

#

**Prompt**: Is this a good demo?

**Pattern**: (Yes|No)

**ReLLM**: No.

**Without ReLLM**: I don't know, but this is amazing! Even more amazing is how the design can take place on a small stage that uses LEDs.
As

#

**Prompt**: Convert the date May 4, 2023 to the format mm/dd/yyyy:

**Pattern**: [0-9]{2}/[0-9]{2}/[0-9]{4}

**ReLLM**: 00/00/0045

**Without ReLLM**: mm:ss

A-Z, Z-A, W-H (0-9:9:19)

Z-R

#

**Prompt**: Jeff Dean is a

**Pattern**: (Programmer|Computer Scientist|AGI)

**ReLLM**: Computer Scientist

**Without ReLLM**: former national basketball champion and a former professional basketball player. He currently serves as general counsel for the NCAA Office of the Vice President for Academic Affairs.

#

**Prompt**: I can eat

**Pattern**: [0-9]{1,10} [a-z]* of [a-z]*

**ReLLM**: 800 calories of coffee

**Without ReLLM**: iced coffee here on the west side and do this, so can you?"

"Why, I don't understand. What did you mean by

#

**Prompt**: ReLLM, the best way to get structured data out of LLMs, is an acronym for

**Pattern**: Re[a-z]+ L[a-z]+ L[a-z]+ M[a-z]+

**ReLLM**: Realized Logistic Logistics Model

**Without ReLLM**: Largest Largest Address Space (MELSP), which has its roots in the Internet network, at least when compared
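The first example above also shows why constrained output is easier to consume programmatically: the ReLLM completion is valid JSON, while the unconstrained one is not. A quick check with the standard library, using the two completions quoted in that example:

```python
import json

# Completions taken verbatim from the first GPT2 example above.
rellm_output = '["a", "b", "c"]'
free_output = '{ "index": 0, "id":"1", "description":"", "text": "[{ "id": 0, "name":'

print(json.loads(rellm_output))  # parses cleanly into a Python list

try:
    json.loads(free_output)
except json.JSONDecodeError:
    print("unconstrained output is not valid JSON")
```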

example.py

Lines changed: 58 additions & 0 deletions

```python
import regex
from transformers import AutoModelForCausalLM, AutoTokenizer

from rellm import complete_re

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

examples = [
    {
        "prompt": "Return the first three letters of the alphabet in a json array:",
        "pattern": regex.compile(r'\[\"[a-z]\", \"[a-z]\", \"[a-z]\"\]'),
        "max_new_tokens": 10,
    },
    {
        "prompt": "Fill in the sentence with an interesting story about the dentist:",
        "pattern": regex.compile(r'Today I\'m going to the [a-z]+ to [a-z]+ because ([a-z]+ )*\.'),
        "max_new_tokens": 20,
    },
    {
        "prompt": "Is this a good demo?",
        "pattern": regex.compile(r'(Yes|No)\.'),
        "max_new_tokens": 2,
    },
    {
        "prompt": "Convert the date May 4, 2023 to the format mm/dd/yyyy:",
        "pattern": regex.compile(r'[0-9]{2}/[0-9]{2}/[0-9]{4}'),
        "max_new_tokens": 20,
    },
    {
        "prompt": "Jeff Dean is a ",
        "pattern": regex.compile(r'(Programmer|Computer Scientist|AGI)'),
        "max_new_tokens": 10,
    },
    {
        "prompt": "I can eat ",
        "pattern": regex.compile(r'[0-9]{1,10} [a-z]* of [a-z]*'),
        "max_new_tokens": 10,
        "do_sample": True,
    },
    {
        "prompt": "ReLLM, the best way to get structured data out of LLMs, is an acronym for ",
        "pattern": regex.compile(r'Re[a-z]+ L[a-z]+ L[a-z]+ M[a-z]+'),
        "max_new_tokens": 10,
        "do_sample": True,
    },
]

for example in examples:
    print("\n===Prompt===\n", example["prompt"])
    # complete_re takes prompt, pattern, and sampling kwargs straight from the example dict.
    output = complete_re(tokenizer=tokenizer, model=model, **example)
    print("\n===ReLLM===\n", output)
    # Unconstrained generation with the same model, as a baseline for comparison.
    vanilla_output_ids = model.generate(tokenizer.encode(example["prompt"], return_tensors="pt"),
                                        max_new_tokens=30,
                                        pad_token_id=tokenizer.eos_token_id,
                                        do_sample=True)
    print("\n===Without ReLLM===\n", tokenizer.decode(vanilla_output_ids[0])[len(example["prompt"]):])
```
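Once the set of viable tokens is known, constraining generation comes down to zeroing out the rest of the distribution before sampling. A minimal stdlib sketch of that masking step (the function names here are hypothetical, not part of rellm's API):

```python
import math

def mask_logits(logits, allowed_ids):
    """Send the logits of disallowed token ids to -inf so that
    softmax assigns them exactly zero probability."""
    return [x if i in allowed_ids else float("-inf")
            for i, x in enumerate(logits)]

def softmax(xs):
    # Subtracting the max is the standard numerical-stability trick;
    # math.exp(-inf) evaluates to 0.0, so masked entries vanish.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 3.0]  # raw scores for a 4-token vocabulary
probs = softmax(mask_logits(logits, allowed_ids={0, 3}))
# Tokens 1 and 2 get probability 0; tokens 0 and 3 renormalize to sum to 1.
```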
