| 1 | +# ReLLM |
| 2 | +Regular Expressions for Language Model Completions. |
| 3 | + |
| 4 | +Get exact structure out of any language model completion with regular expressions. |
| 5 | + |
| 6 | +Return specific syntactic structure (e.g. JSON or XML), or specific semantic structure (e.g. a date or a number), or even complete templates (e.g. a sentence with a blank to fill in). |
| 7 | + |
| 8 | +How does it work? At each generation step, ReLLM tests every candidate next token against the pattern using partial regex matching. For candidates that cannot lead to a match, ReLLM masks the logits so that the language model does not generate them. |
| 9 | + |
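The masking step described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not ReLLM's actual implementation: `VOCAB`, `TEMPLATE`, `is_valid_prefix`, and `mask_logits` are hypothetical names, the vocabulary is a toy one, and the per-position template is a simplified stand-in for real partial regex matching (the third-party `regex` package offers that through its `partial=True` argument).

```python
import math
import re

# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of tokens.
VOCAB = ["0", "1", "05", "/", "04", "2023", "May", " "]

# Fixed-length stand-in for the pattern [0-9]{2}/[0-9]{2}/[0-9]{4}:
# one single-character class per output position.
TEMPLATE = [r"[0-9]"] * 2 + [r"/"] + [r"[0-9]"] * 2 + [r"/"] + [r"[0-9]"] * 4


def is_valid_prefix(text):
    """Return True if text could still be extended into a full match."""
    if len(text) > len(TEMPLATE):
        return False
    return all(re.fullmatch(cls, ch) for cls, ch in zip(TEMPLATE, text))


def mask_logits(logits, generated):
    """Set the logit of every token that would break the pattern to -inf."""
    return [
        logit if is_valid_prefix(generated + token) else -math.inf
        for token, logit in zip(VOCAB, logits)
    ]


# Made-up logits, one per vocabulary entry.
logits = [0.1, 0.3, 0.2, 0.9, 0.4, 0.8, 1.5, 0.7]

# After "05" has been generated, only "/" keeps the output on the pattern.
masked = mask_logits(logits, generated="05")
allowed = [tok for tok, value in zip(VOCAB, masked) if value != -math.inf]
print(allowed)
```

Masked tokens get a logit of negative infinity, so they receive zero probability after the softmax whether the model samples or decodes greedily.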
| 10 | +### Installation |
| 11 | +``` |
| 12 | +pip install rellm |
| 13 | +``` |
| 14 | + |
| 15 | +Preliminary results are interesting: even for small models, constraining the token space with ReLLM can improve the quality of the completions. It also makes the output much easier to parse programmatically. Take a look at some of the examples below (you can run them with [example.py](example.py)). |
| 16 | + |
| 17 | +```python |
| 18 | +import regex |
| 19 | +from transformers import AutoModelForCausalLM, AutoTokenizer |
| 20 | + |
| 21 | +from rellm import complete_re |
| 22 | + |
| 23 | +model = AutoModelForCausalLM.from_pretrained("gpt2") |
| 24 | +tokenizer = AutoTokenizer.from_pretrained("gpt2") |
| 25 | + |
| 26 | +prompt = "ReLLM, the best way to get structured data out of LLMs, is an acronym for " |
| 27 | +pattern = regex.compile(r'Re[a-z]+ L[a-z]+ L[a-z]+ M[a-z]+') |
| 28 | +output = complete_re(tokenizer=tokenizer, |
| 29 | + model=model, |
| 30 | + prompt=prompt, |
| 31 | + pattern=pattern, |
| 32 | + do_sample=True, |
| 33 | + max_new_tokens=80) |
| 34 | +print(output) |
| 35 | +``` |
| 36 | + |
| 37 | +``` |
| 38 | +> Realized Logistic Logistics Model |
| 39 | +``` |
| 40 | + |
| 41 | + |
| 42 | +## Examples using GPT2 (124 million parameters) |
| 43 | + |
| 44 | +**Prompt**: Return the first three letters of the alphabet in a json array: |
| 45 | + |
| 46 | +**Pattern**: \[\"[a-z]\", \"[a-z]\", \"[a-z]\"\] |
| 47 | + |
| 48 | +**ReLLM**: ["a", "b", "c"] |
| 49 | + |
| 50 | +**Without ReLLM**: { "index": 0, "id":"1", "description":"", "text": "[{ "id": 0, "name": |
| 51 | +# |
| 52 | +**Prompt**: Fill in the sentence with an interesting story about the dentist: |
| 53 | + |
| 54 | +**Pattern**: Today I\'m going to the [a-z]+ to [a-z]+ because ([a-z]+ )*\. |
| 55 | + |
| 56 | +**ReLLM**: Today I'm going to the dentist to see because it is a very important day for me |
| 57 | + |
| 58 | +**Without ReLLM**: 'My family bought me an appointment with a dentist when I was 15. The dentist gave me one a year and then I was told on |
| 59 | +# |
| 60 | + |
| 61 | +**Prompt**: Is this a good demo? |
| 62 | + |
| 63 | +**Pattern**: (Yes|No) |
| 64 | + |
| 65 | +**ReLLM**: No. |
| 66 | + |
| 67 | +**Without ReLLM**: I don't know, but this is amazing! Even more amazing is how the design can take place on a small stage that uses LEDs. |
| 68 | +As |
| 69 | + |
| 70 | +# |
| 71 | + |
| 72 | +**Prompt**: Convert the date May 4, 2023 to the format mm/dd/yyyy: |
| 73 | + |
| 74 | +**Pattern**: [0-9]{2}/[0-9]{2}/[0-9]{4} |
| 75 | + |
| 76 | +**ReLLM**: 00/00/0045 |
| 77 | + |
| 78 | +**Without ReLLM**: mm:ss |
| 79 | + |
| 80 | +A-Z, Z-A, W-H (0-9:9:19) |
| 81 | + |
| 82 | +Z-R |
| 83 | + |
| 84 | +# |
| 85 | + |
| 86 | +**Prompt**: Jeff Dean is a |
| 87 | + |
| 88 | +**Pattern**: (Programmer|Computer Scientist|AGI) |
| 89 | + |
| 90 | +**ReLLM**: Computer Scientist |
| 91 | + |
| 92 | +**Without ReLLM**: former national basketball champion and a former professional basketball player. He currently serves as general counsel for the NCAA Office of the Vice President for Academic Affairs. |
| 93 | + |
| 94 | +# |
| 95 | + |
| 96 | +**Prompt**: I can eat |
| 97 | + |
| 98 | +**Pattern**: [0-9]{1,10} [a-z]* of [a-z]* |
| 99 | + |
| 100 | +**ReLLM**: 800 calories of coffee |
| 101 | + |
| 102 | +**Without ReLLM**: iced coffee here on the west side and do this, so can you?" |
| 103 | + |
| 104 | +"Why, I don't understand. What did you mean by |
| 105 | + |
| 106 | +# |
| 107 | + |
| 108 | +**Prompt**: ReLLM, the best way to get structured data out of LLMs, is an acronym for |
| 109 | + |
| 110 | +**Pattern**: Re[a-z]+ L[a-z]+ L[a-z]+ M[a-z]+ |
| 111 | + |
| 112 | +**ReLLM**: Realized Logistic Logistics Model |
| 113 | + |
| 114 | +**Without ReLLM**: Largest Largest Address Space (MELSP), which has its roots in the Internet network, at least when compared |
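
Because the constrained outputs follow a known pattern, they can usually be handed straight to a parser. The sketch below uses two outputs copied from the examples above. `parse_date` is a hypothetical helper, not part of ReLLM. It also shows a caveat: a string can match the pattern syntactically and still not be a meaningful value.

```python
import json
import re
from datetime import datetime

# The JSON array example parses without any cleanup.
letters = json.loads('["a", "b", "c"]')
print(letters)

DATE_PATTERN = re.compile(r"[0-9]{2}/[0-9]{2}/[0-9]{4}")


def parse_date(text):
    """Return a datetime if text is a real mm/dd/yyyy date, else None."""
    if DATE_PATTERN.fullmatch(text) is None:
        return None
    try:
        return datetime.strptime(text, "%m/%d/%Y")
    except ValueError:
        # The shape is right, but the month or day is out of range.
        return None


print(parse_date("00/00/0045"))  # matches the pattern, yet is not a real date
print(parse_date("05/04/2023"))
```

A tighter pattern, for example one that restricts the month to 01 through 12, would move more of this validation into the constraint itself.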