Context snippet returns first occurrence even if the word appears as a substring #5

@sethdandridge

You have a small bug in NYT-first-said.parsers.simple_scrape.context: if the word appears as a substring of another word before it appears on its own, the context snippet shows that first occurrence rather than the standalone word.

This shows up when a new word appears first in the plural (with an s at the end) and then in the singular: the snippet always returns the context of the plural, since str.find() returns the index of the first occurrence. See: https://twitter.com/NYT_first_said/status/1135591139413778433

One possible fix would be to find the shortest word (token) in the article that contains the new word and use that to determine the snippet:

def context(content, word):
    # Collect every whitespace-delimited token that contains the new word.
    tokens_containing_word = [token for token in content.split() if word in token]
    # You might also want a custom key function here that calculates length after
    # removing punctuation; otherwise "crocodyliforms" is the same length as "crocodyliform."
    context_token = min(tokens_containing_word, key=len)
    loc = content.find(context_token)
    # existing logic proceeds...
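
One caveat with the snippet above: content.find(context_token) can still land inside an earlier, longer token when the shortest token carries no punctuation (searching for "crocodyliform" still matches inside "crocodyliforms"). Another option would be a word-boundary regex. This is only a rough sketch: find_standalone is a hypothetical helper name, and I'm assuming the rest of context() only needs the character offset.

```python
import re


def find_standalone(content, word):
    """Return the offset of the first standalone occurrence of word.

    Hypothetical helper, not part of the existing parser.
    """
    # \b matches at word boundaries, so "crocodyliform" does not match
    # inside "crocodyliforms". re.escape guards against regex metacharacters.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    match = pattern.search(content)
    if match:
        return match.start()
    # No standalone occurrence: fall back to the plain substring search,
    # which returns -1 if the word is absent altogether.
    return content.find(word)
```

Inside context(), loc = find_standalone(content, word) would then replace the str.find() call, and the fallback keeps the current behaviour for articles where the word only ever appears inside a longer token.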
