Open
Labels
enhancement (New feature or request)
Description
As documented in mediacloud/story-indexer#278, we're seeing instances of headlines appearing at the tail end of stories and polluting results. We should consider the original idea of moving this stage of deduplication (which we used to do in the legacy system) to a sous-chef feature. The idea would be to do something like (a) tokenize by sentence, (b) remove duplicate sentences from stories after their first appearance, and (c) remove stories from the corpus that no longer match the query after sentence deduplication. This is non-trivial and will take some design work.
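To make steps (a) through (c) concrete, here is a rough sketch of one way this could look. It is not the legacy implementation and not an existing sous-chef API. The function names (`split_sentences`, `dedup_and_filter`), the regex-based sentence splitter, and the `matches_query` callback are all hypothetical. It also assumes "first appearance" means first appearance anywhere in the corpus, in processing order, which is one of the design questions still to settle.

```python
import re
from typing import Callable, Iterable, List, Set

# Naive sentence boundary: whitespace after ., ! or ?, or any run of newlines.
# A real implementation would likely use a proper sentence tokenizer.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+|\n+")


def split_sentences(text: str) -> List[str]:
    """(a) Split a story into sentences, dropping empty fragments."""
    return [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s.strip()]


def dedup_and_filter(
    stories: Iterable[str],
    matches_query: Callable[[str], bool],
) -> List[str]:
    """Apply sentence dedup across stories, then re-check the query.

    Assumption: stories arrive in the order that defines "first appearance"
    (for example, by publication date).
    """
    seen: Set[str] = set()  # normalized sentences seen so far in the corpus
    kept_stories: List[str] = []
    for story in stories:
        kept_sentences: List[str] = []
        for sentence in split_sentences(story):
            # Normalize case and whitespace so trivial variants compare equal.
            key = " ".join(sentence.lower().split())
            if key in seen:
                continue  # (b) drop repeats after the first appearance
            seen.add(key)
            kept_sentences.append(sentence)
        cleaned = " ".join(kept_sentences)
        # (c) keep the story only if it still matches the query
        if matches_query(cleaned):
            kept_stories.append(cleaned)
    return kept_stories


if __name__ == "__main__":
    corpus = [
        "Mayor wins election. Turnout was high.",
        "Local bakery opens downtown. Mayor wins election.",
    ]
    # Hypothetical stand-in for the real query: a case-insensitive keyword test.
    print(dedup_and_filter(corpus, lambda text: "election" in text.lower()))
```

In this example the second story only matched "election" through a trailing headline from the first story, so it drops out once that sentence is removed. Processing order matters: if the second story were seen first, the real headline would be the one removed, which is part of the design work mentioned above.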