ocha221/semantic-tagging-tools

semantic-tagging-tools

simple tag expander & merger based on tags within your dataset

the model you use for synonym matching will HEAVILY influence the results you get.

Overview

  1. Collect all unique tags; a set() makes this easy
  2. Pass the tags, in the format "x","y","z" to the embedding model (i used mistral-embed)
  3. Normalise and pass the new embeddings to faiss
  4. Adjust the threshold for matching within faiss
     4b. For each tag in your original list, the LLM is given the tag plus the candidate tags whose vectors faiss found within your similarity threshold (distance)
     4c. The LLM (i chose mistral-small v3) returns a json with the valid synonyms it picked
  5. Rewrite the original label.
  6. Profit?
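
A minimal sketch of the steps above, using only the standard library so it runs anywhere. Several things here are assumptions for illustration and not taken from the repo: the toy 2-d vectors stand in for real mistral-embed output, the brute-force loop in `candidates_for` stands in for the faiss lookup, the `{"synonyms": [...]}` reply shape is a guess at what the LLM returns, and the helper names are hypothetical.

```python
import json
import math

def normalise(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def candidates_for(tag, vectors, threshold):
    """Return other tags whose cosine similarity to `tag` is >= threshold.

    Brute-force stand-in for the faiss lookup in step 4 (assumption).
    """
    query = vectors[tag]
    scored = []
    for other, vec in vectors.items():
        if other == tag:
            continue
        sim = sum(a * b for a, b in zip(query, vec))
        if sim >= threshold:
            scored.append((sim, other))
    return [name for _, name in sorted(scored, reverse=True)]

def parse_synonyms(reply, candidates):
    """Step 4c: keep only synonyms that were actually offered as candidates."""
    picked = json.loads(reply).get("synonyms", [])
    return [s for s in picked if s in candidates]

# Step 1: unique tags from a tiny hypothetical dataset.
dataset = [["cat", "outdoors"], ["kitten", "outside"], ["cat", "outside"]]
unique_tags = sorted({tag for tags in dataset for tag in tags})

# Steps 2-3: toy vectors standing in for real embeddings, then normalised.
toy_embeddings = {
    "cat": [1.0, 0.1], "kitten": [0.9, 0.2],
    "outdoors": [0.1, 1.0], "outside": [0.2, 0.9],
}
vectors = {tag: normalise(toy_embeddings[tag]) for tag in unique_tags}

# Step 4: candidates within the threshold, then a hard-coded "LLM" reply.
cat_candidates = candidates_for("cat", vectors, threshold=0.9)
fake_reply = '{"synonyms": ["kitten", "dog"]}'  # "dog" was never a candidate
cat_synonyms = parse_synonyms(fake_reply, cat_candidates)

# Step 5: rewrite labels so every accepted synonym maps to one tag.
canonical = {syn: "cat" for syn in cat_synonyms}
rewritten = [[canonical.get(t, t) for t in tags] for tags in dataset]
```

Filtering the reply against the candidate list is a defensive choice: it drops anything the LLM invents that was never in your dataset.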

Thoughts

I used mistral-small as it's free and good enough for the task, though it will make mistakes sometimes. If you use something bigger (like Qwen 32+B, or Claude) you will likely get better results, as they're smarter models.

i believe normalising before feeding the vectors to faiss is probably unnecessary, but in my testing results seemed better with it on than without
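
A small sketch of why normalising can change which neighbours you get. Whether it matters in practice depends on the faiss index type you use and on whether your embedding model already returns unit-length vectors (if it does, normalising again changes nothing). The vectors and helper names below are made up for illustration.

```python
import math

def normalise(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sq_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

query = [1.0, 0.0]
same_direction = [0.5, 0.0]   # points the same way as the query, but short
off_direction = [3.0, 3.0]    # 45 degrees away from the query, but long

# Raw inner product favours the long vector even though it points elsewhere.
raw_prefers_off = dot(query, off_direction) > dot(query, same_direction)

# After normalising, inner product is cosine similarity and the order flips.
q, s, o = normalise(query), normalise(same_direction), normalise(off_direction)
unit_prefers_same = dot(q, s) > dot(q, o)

# For unit vectors, squared L2 distance is exactly 2 - 2 * cosine, so
# ranking by L2 distance and ranking by cosine similarity agree.
identity_holds = math.isclose(sq_l2(q, o), 2 - 2 * dot(q, o))
```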

Currently, the program goes through the tag list and, if necessary, batches the candidates (for my testing i set both batch size and k neighbours to 10, so batching never kicks in; please change this). this means that if word X has 200 candidates and you have a batch size of 100, the llm gets two batches of 100 and the synonyms from both are combined at the end. you might wanna mess with the semantic threshold a little so you dont get too many overlapping candidates.
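
A rough sketch of that batching behaviour, not the repo's actual code. `fake_model` is a hypothetical stand-in for the real LLM call, and the `{"synonyms": [...]}` reply shape is an assumption.

```python
import json

def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def collect_synonyms(tag, candidates, batch_size, ask_model):
    """Query the model once per batch and merge the answers, keeping order."""
    merged = []
    for batch in batched(candidates, batch_size):
        reply = ask_model(tag, batch)
        for syn in json.loads(reply)["synonyms"]:
            # Ignore anything outside the batch and skip duplicates.
            if syn in batch and syn not in merged:
                merged.append(syn)
    return merged

calls = []  # records the size of every batch the "model" received

def fake_model(tag, batch):
    """Hypothetical LLM: pretends every candidate ending in '0' is a synonym."""
    calls.append(len(batch))
    return json.dumps({"synonyms": [c for c in batch if c.endswith("0")]})

# Word X with 200 candidates and a batch size of 100 gives two model calls.
candidates = [f"cand{i}" for i in range(200)]
synonyms = collect_synonyms("X", candidates, batch_size=100, ask_model=fake_model)
```

With k neighbours equal to the batch size (10 and 10 in the author's testing), `batched` only ever yields one slice per tag, which is why batching never shows up there.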

info on faiss: https://github.com/facebookresearch/faiss/wiki/

Examples:

ex1 ex2 ex3
