
v0.3.1: Improve efficiency by avoiding explicit whitespace rows

@github-actions github-actions released this 28 May 10:29

Previous versions stored an explicit zeroed row for each whitespace token in spaCy. This required duplication and a number of assignments into the transformer output, which was inefficient.

Instead, whitespace tokens are now treated as not aligning to any wordpiece tokens. If you access doc._.trf_data[i], where i is the index of a whitespace token, you'll receive an array of shape (0, n), where n is the output dimension. Thinc's pooling operations handle this empty case, so models consuming the trf_data don't need any changes.
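
To illustrate the idea, here is a minimal pure-Python sketch of pooling over an alignment in which a whitespace token maps to no wordpiece rows. It is not the library's or Thinc's actual code: the helper names (rows_for_token, mean_pool), the toy data, and the choice to pool an empty group to a zero vector are all assumptions made for illustration.

```python
from typing import List

Row = List[float]


def rows_for_token(wordpiece_rows: List[Row], alignment: List[List[int]], i: int) -> List[Row]:
    """Return the wordpiece rows aligned to token i (hypothetical helper).

    A whitespace token has an empty alignment, so this returns an empty
    list, the plain-Python analogue of an array of shape (0, n).
    """
    return [wordpiece_rows[j] for j in alignment[i]]


def mean_pool(rows: List[Row], width: int) -> Row:
    """Mean-pool rows column-wise; an empty group yields a zero vector.

    Returning zeros for the empty case is an assumption of this sketch.
    """
    if not rows:
        return [0.0] * width
    return [sum(col) / len(rows) for col in zip(*rows)]


# Toy data: three wordpiece rows of width 2 (made up for illustration).
wordpiece_rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# Token 0 aligns to wordpieces 0 and 1, token 1 is whitespace, token 2 aligns to wordpiece 2.
alignment = [[0, 1], [], [2]]
width = 2

pooled = [
    mean_pool(rows_for_token(wordpiece_rows, alignment, i), width)
    for i in range(len(alignment))
]
print(pooled)
```

In this sketch no zeroed row is ever written into the wordpiece output. The empty case is resolved only at pooling time, which is the kind of saving the change above describes.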