-
I'm familiar with many smoothing formulas. I asked you for evidence that your preferred smoothing formula would make any difference in dedupe performance; by that I mean showing that it makes a consistent difference in performance and recall across datasets, for example the datasets in the dedupe-examples repo. I would be very interested in what is possible with sklearn. It would be great to drop the dependency on BTrees. I suspect that it will not be favorable, because you will still need an implementation of an inverted index.
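For context, the inverted index mentioned here can be sketched in a few lines. This is purely illustrative (it is not dedupe's actual data structure, and the record ids and strings are made up); it just shows the token-to-record-ids mapping that TF-IDF blocking needs regardless of which IDF formula supplies the weights:

```python
from collections import defaultdict

# Hypothetical toy records; in dedupe these would come from the data to block.
records = {
    1: "123 main st",
    2: "123 main street",
    3: "456 oak ave",
}

# Inverted index: each token maps to the set of record ids containing it.
index = defaultdict(set)
for rid, text in records.items():
    for tok in text.split():
        index[tok].add(rid)

# Candidate pairs are records sharing at least one indexed token.
print(sorted(index["main"]))  # records 1 and 2 share "main"
```

Whatever library computes the IDF weights, some structure like this still has to exist to retrieve the documents containing a given term.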
-
Hi @fgregg, following your suggestion in issue #1126, here are my thoughts about how the IDF should be computed.
Currently the IDF is computed as

idf(t) = log(1 + N / n_t)

where N is the total number of documents and n_t is the number of documents in which the term t appears. Reading the sklearn documentation, the smoothed variant it uses (with smooth_idf=True) is instead

idf(t) = log((1 + N) / (1 + n_t)) + 1

which adds one "virtual" document containing every term, so no term gets a zero or unbounded weight.
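To make the difference concrete, here is a small worked comparison of the two formulas on a toy corpus (pure Python, assuming natural logs as sklearn uses; the corpus size N = 4 is made up for illustration):

```python
import math

def idf_current(N, n_t):
    # IDF as currently computed (per the post): log(1 + N / n_t)
    return math.log(1 + N / n_t)

def idf_sklearn_smooth(N, n_t):
    # sklearn's smoothed IDF (smooth_idf=True): log((1 + N) / (1 + n_t)) + 1
    return math.log((1 + N) / (1 + n_t)) + 1

# Toy corpus of 4 documents: compare a rare term (n_t = 1)
# against a term appearing in every document (n_t = 4).
for n_t in (1, 4):
    print(n_t, idf_current(4, n_t), idf_sklearn_smooth(4, n_t))
```

Note that the sklearn variant never falls below 1, so even a term present in every document keeps a small positive weight, whereas the current formula lets common terms approach log(2) and rarer corpora produce very different scales.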
In this Medium post you can see a comparison, on a toy example, of the behaviour of the standard TF-IDF against the version implemented in sklearn.
Please keep in mind that I'm not an expert in this field.
P.S.: have you ever considered using the sklearn implementation to compute TF-IDF? It is likely much faster.