forked from ecprice/newsdiffs
Open
Description
You have a small bug in NYT-first-said.parsers.simple_scrape.context: if the new word first appears as a substring of a longer word before appearing on its own, the context snippet is taken from that first occurrence rather than from the standalone word.
The bug shows up when a new word appears first in the plural (with an s at the end) and only later in the singular: the snippet always shows the context of the plural, since str.find() returns the index of the first occurrence. See: https://twitter.com/NYT_first_said/status/1135591139413778433
One possible fix would be to find the shortest word (token) in the article that contains the new word and use that to determine the snippet:
def context(content, word):
    tokens_containing_word = []
    tokens = content.split()
    for token in tokens:
        if word in token:
            tokens_containing_word.append(token)
    # you also might want to write a custom key function here that calculates length after
    # removing punctuation, otherwise "crocodyliforms" is the same length as "crocodyliform."
    context_token = min(tokens_containing_word, key=len)
    loc = content.find(context_token)
    # existing logic proceeds...
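One caveat with the sketch above: content.find(context_token) is still a substring search, so when the shortest token is the bare word with no punctuation attached, it can still land inside the earlier plural. A possible alternative, offered only as a rough sketch, is to look for a whole-word match with a regular expression and fall back to the substring search when there is none. The helper name standalone_index is hypothetical and not part of the existing code, and the sketch assumes the new word is made up of ordinary word characters (letters, digits, underscore) so that \b behaves as expected:

```python
import re


def standalone_index(content, word):
    """Return the index of the first whole-word occurrence of `word`.

    Falls back to the first substring occurrence, and returns -1 if the
    word does not appear at all. Hypothetical helper, not existing code.
    """
    # \b matches between a word character and a non-word character, so
    # "crocodyliform" does not match inside "crocodyliforms".
    # re.escape guards against regex metacharacters in the word.
    match = re.search(r"\b" + re.escape(word) + r"\b", content)
    if match:
        return match.start()
    # No standalone occurrence: behave like the current str.find() lookup.
    return content.find(word)
```

The returned index could then stand in for loc, with the existing snippet logic left unchanged. This also sidesteps the punctuation issue, since a trailing period or comma counts as a word boundary.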