
Parsing Rules for the Glove.42B.300D #200

@hontimzam

Description


Hello, I am Tim.
I have some questions about the pre-trained vectors in glove.42B.300D.txt

I am working on some text that I would like to convert to vectors via glove.42B.300d in Python. I created my own parsing rules/tokenizer (via the spaCy library) for my text, but the tokens produced by my rules do not always match the words/vocabulary in the GloVe file.
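For context, my pipeline looks roughly like this (a minimal sketch; the file path is a placeholder and my custom rules are simplified to a blank spaCy tokenizer):

```python
# Minimal sketch of my pipeline (path and rules simplified).
import numpy as np
import spacy

def load_glove(path):
    """Load the word -> vector mapping from the plain-text GloVe file."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors

glove = load_glove("glove.42B.300d.txt")   # placeholder path
nlp = spacy.blank("en")                    # my own rules sit on top of this

text = "New York is a big city and there are many stores."
tokens = [t.text.lower() for t in nlp(text)]
vectors = [glove[t] for t in tokens if t in glove]   # missing tokens are silently dropped
```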

For example in some texts:
"New York is a big city and there are many stores. The items in the stores are non-expensive. There are 5.5-billion peoples in the world"

After applying my parsing rules:
"new york", "is", "a", "big", "city", "and", "there", "are", "many", "stores", "the", "items", "in", "the", "stores", "are", "non", "expensive",
"there", "are", "5.5 billion", "peoples", "in", "the", "world".

However, glove.42B.300d.txt:
1) has no "new york", but does have "new-york";
2) contains "non" and "expensive", but also "non-expensive" (which has a different vector);
3) has no "5.5-billion" even though hyphens do appear, yet it does contain "9.5-billion", "4.5-billion", etc.;
4) has other similar mismatch cases.

As a result, only about 65% of my tokens are covered by the vocabulary. This is not because anything is wrong with the dictionary; it is simply because my parsing rules are not good enough. The question is: how can I modify the parsing rules so that the tokens fit the dictionary well? Do such parsing rules already exist somewhere?
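For reference, this is roughly how I compute that coverage number (a sketch; the path and the token list below are just examples):

```python
# Rough coverage check: fraction of my tokens that have an entry in the GloVe vocabulary.
def load_vocab(path):
    """Read only the words (first column) of the GloVe text file."""
    with open(path, encoding="utf-8") as f:
        return {line.split(" ", 1)[0] for line in f}

def coverage(tokens, vocab):
    return sum(t in vocab for t in tokens) / len(tokens)

vocab = load_vocab("glove.42B.300d.txt")
tokens = ["new york", "is", "a", "big", "city", "non", "expensive", "5.5 billion"]
print(coverage(tokens, vocab))   # around 0.65 on my real corpus
```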

I have tried to look at the words and fix the issues case by case. However, I cannot guarantee that new exceptional cases will not appear in the future...
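To make this concrete, the kind of case-by-case mapping I maintain by hand looks roughly like the sketch below (the entries are just examples):

```python
# Hand-maintained fixes that map my tokens to keys that actually exist in the GloVe vocabulary.
MANUAL_FIXES = {
    "new york": ["new-york"],               # multi-word entity -> hyphenated GloVe key
    "non expensive": ["non", "expensive"],  # keep as two separate words
    "5.5 billion": ["5.5", "billion"],      # no "5.5-billion" in the vocabulary
}

def normalize(token):
    """Return the GloVe key(s) hand-picked for this token, or the token itself."""
    return MANUAL_FIXES.get(token.lower(), [token.lower()])
```

Every new text seems to need new entries, which is why I am looking for more general rules.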
