Hi @Jetcodery. Yes, I think the decreased sequence length would hurt performance. In the original experiment reproducing BERT, we trained with sequence length 512. Nowadays, many people train BERT in two stages: a first stage at sequence length 128, followed by a second stage at length 512. The two-stage training appears to close the performance gap.
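For illustration, here is a minimal sketch of that two-stage recipe using the Hugging Face `transformers` Trainer for masked-language-model pretraining. The dataset, step counts, batch size, and learning rate below are illustrative assumptions, not values from the original experiment, and this is not this repository's training script.

```python
# Two-stage BERT pretraining sketch: most steps at length 128, then a
# shorter run at length 512. All hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (
    BertConfig, BertForMaskedLM, BertTokenizerFast,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM(BertConfig())  # train from scratch; max positions default to 512

raw = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

def make_dataset(max_len):
    # Re-tokenize the corpus at the sequence length used for this stage.
    def tok(batch):
        return tokenizer(batch["text"], truncation=True,
                         max_length=max_len, padding="max_length")
    return raw.map(tok, batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

def run_stage(max_len, max_steps, output_dir):
    args = TrainingArguments(
        output_dir=output_dir,
        max_steps=max_steps,
        per_device_train_batch_size=32,
        learning_rate=1e-4,
        save_steps=10_000,
    )
    Trainer(model=model, args=args, data_collator=collator,
            train_dataset=make_dataset(max_len)).train()

# Stage 1: the bulk of training at sequence length 128.
run_stage(max_len=128, max_steps=900_000, output_dir="stage1-len128")
# Stage 2: a shorter continuation at length 512 so the position
# embeddings beyond index 128 also get trained.
run_stage(max_len=512, max_steps=100_000, output_dir="stage2-len512")
```

The main design point is that the same model object carries over between stages, so the second stage only has to adapt the longer-range position embeddings rather than learn everything from scratch.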
