Consider approaches to sentence-based deduplication

As documented in https://github.com/mediacloud/story-indexer/issues/278, we're seeing instances of headlines appearing at the tail end of stories and polluting results. We should consider the original idea of moving this stage of deduplication (which we used to do in the legacy system) to a sous-chef feature. The idea would be to do something like (a) tokenize by sentence, (b) remove duplicate sentences from stories after their first appearance, and (c) remove stories from the corpus that no longer match the query after sentence deduplication. This is non-trivial and will take some design work.
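To make the design discussion concrete, here is a minimal sketch of steps (a) through (c). Everything in it is an assumption for illustration, not existing sous-chef or story-indexer code: the function names are hypothetical, the regex splitter stands in for a real sentence tokenizer, and `matches_query` is a placeholder for however the query would be re-evaluated. Whether "first appearance" is scoped to a single story or to the whole result set is one of the open design questions; the sketch shares one `seen` set across the stories passed in, so results depend on story order.

```python
import re

# Assumption: a naive splitter that breaks after ., ! or ? followed by
# whitespace, and on newlines. A real implementation would likely use a
# language-aware sentence tokenizer.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+|\n+")


def split_sentences(text):
    """(a) Tokenize a story into sentences."""
    return [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s.strip()]


def normalize(sentence):
    """Collapse whitespace and lowercase so trivial variants compare equal."""
    return re.sub(r"\s+", " ", sentence).strip().lower()


def dedup_sentences(text, seen=None):
    """(b) Keep each sentence only the first time it appears.

    Pass a shared `seen` set to deduplicate across stories; omit it to
    deduplicate within a single story only.
    """
    seen = set() if seen is None else seen
    kept = []
    for sentence in split_sentences(text):
        key = normalize(sentence)
        if key in seen:
            continue
        seen.add(key)
        kept.append(sentence)
    return " ".join(kept)


def filter_stories(stories, matches_query):
    """(c) Drop stories that no longer match the query after deduplication."""
    seen = set()
    results = []
    for story in stories:
        cleaned = dedup_sentences(story, seen)
        if matches_query(cleaned):
            results.append(cleaned)
    return results
```

For example, if a second story matches the query "bridge" only because another story's headline was appended to its tail, that trailing headline is removed as a repeat and the story drops out of the results:

```python
stories = [
    "Mayor opens bridge. The bridge cost millions.",
    "Rain expected Friday.\nMayor opens bridge.",
]
filter_stories(stories, lambda text: "bridge" in text.lower())
```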

Consider approaches to sentence-based deduplication #18
