I’m confused about the difference in intent between the [SEP] tokens and the Segment Embeddings that are added to BERT’s input during pretraining.
As far as I understand, a [SEP] token is inserted between sentence A and sentence B (and after B) to help the model distinguish the two sentences for BERT’s Next-Sentence Prediction (NSP) pretraining task. Similarly, the Segment Embeddings are added to the input embeddings, giving the model another signal that sentences A and B are distinct.
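To make this concrete, here is a small sketch of the input construction I mean, using the Hugging Face transformers tokenizer (my own illustration of the two mechanisms, not taken from the BERT paper):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Encode a sentence pair the way BERT's pretraining input is built.
encoded = tokenizer("The man went to the store.", "He bought milk.")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'the', 'man', 'went', 'to', 'the', 'store', '.', '[SEP]',
#  'he', 'bought', 'milk', '.', '[SEP]']

print(encoded["token_type_ids"])
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
# token_type_ids are the segment IDs: 0 for sentence A (including its [SEP]),
# 1 for sentence B -- they select which Segment Embedding gets added.
```

So both the [SEP] tokens and the segment IDs mark the same A/B boundary in the input, which is exactly what prompts my question.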
However, these seem to serve the same purpose. Why can’t BERT be trained with only Segment Embeddings, omitting the [SEP] tokens? What additional information do the [SEP] tokens conceptually provide that the Segment Embeddings don’t?
Furthermore, the [SEP] tokens aren’t used directly anyway: NSP is trained on the [CLS] embedding, which I understand to roughly represent the continuity between the two sentences.
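For reference, this is what I mean by NSP being trained on the [CLS] embedding, again sketched with transformers (assuming its BertForNextSentencePrediction head, which classifies on top of the pooled [CLS] representation):

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

inputs = tokenizer("The man went to the store.", "He bought milk.",
                   return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# The NSP head is a binary classifier over the pooled [CLS] output:
# logits[:, 0] scores "B follows A", logits[:, 1] scores "B is random".
print(outputs.logits)  # shape: (1, 2)
```

Nothing in this loss touches the [SEP] positions directly, which is why I’m unsure what role they play beyond what the Segment Embeddings already encode.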