Previous versions used explicit zeroed rows corresponding to whitespace tokens in spaCy. This required duplication and a number of assignments into the transformer output, which was inefficient.
Instead, whitespace tokens are now regarded as not aligning to any wordpiece tokens. If you do doc._.trf_data[i], where i is the index of a whitespace token, you'll receive an array of shape (0, n), where n is the output dimension. This is handled in Thinc's pooling operations, so the change doesn't require any update to models consuming the trf_data.
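To make the shape behaviour concrete, here is a minimal NumPy sketch of pooling over token-aligned rows when one token has no wordpieces. It is not Thinc's implementation: the helper `mean_pool_ragged`, the sample data, and the choice to pool an empty segment to a zero vector are all assumptions made for illustration.

```python
import numpy as np


def mean_pool_ragged(rows, lengths):
    """Mean-pool consecutive row segments of `rows`, one segment per token.

    Hypothetical helper: an empty segment (length 0) is assumed to pool
    to a zero vector, so consumers always get one row per token.
    """
    width = rows.shape[1]
    pooled = np.zeros((len(lengths), width), dtype=rows.dtype)
    start = 0
    for i, length in enumerate(lengths):
        if length > 0:
            pooled[i] = rows[start:start + length].mean(axis=0)
        start += length
    return pooled


# Made-up data: 3 wordpiece rows of width n=4, aligned to 3 tokens.
# The middle token stands in for a whitespace token with no wordpieces.
rows = np.arange(12, dtype="float32").reshape(3, 4)
lengths = [2, 0, 1]

# The rows aligned to the whitespace token form an empty (0, n) array.
whitespace_rows = rows[2:2]
print(whitespace_rows.shape)

# Pooling still yields one fixed-width row per token.
pooled = mean_pool_ragged(rows, lengths)
print(pooled.shape)
```

Because the pooled output keeps one row of width n per token, code that only consumes pooled vectors does not need to special-case whitespace tokens.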