BERT pretraining: [SEP] vs. Segment Embeddings?

I’m confused about the difference in intent between the [SEP] tokens and the Segment Embeddings applied to BERT’s input during pretraining.

As far as I’ve understood, a [SEP] token is inserted between sentences A and B so that the model can distinguish the two sentences for BERT’s Next-Sentence Prediction (NSP) pretraining task. Similarly, the Segment Embeddings are added to the input embeddings, giving the model another way to learn that sentences A and B are distinct things.
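
To make sure I’m reading the inputs right, here’s roughly what the two mechanisms look like at the tokenizer level (a quick sketch using the Hugging Face tokenizer; the printed outputs are just what I’d expect for bert-base-uncased):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("The dog ran.", "It was fast.")

# The [SEP] tokens show up in the token sequence itself:
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'the', 'dog', 'ran', '.', '[SEP]', 'it', 'was', 'fast', '.', '[SEP]']

# The Segment Embeddings are selected by token_type_ids (0 = sentence A, 1 = sentence B):
print(encoded["token_type_ids"])
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```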

However, these seem to serve the same purpose. Why can’t BERT be trained with only Segment Embeddings, omitting the [SEP] tokens? What additional information do [SEP] tokens conceptually provide that the Segment Embeddings don’t?

Furthermore, the [SEP] tokens aren’t even used directly: NSP is trained on the [CLS] embedding, which I understand to roughly represent sentence continuity.
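
In other words, the loss only ever reads the pooled [CLS] vector, something like this (a minimal sketch of an NSP-style head; the class name is mine, not BERT’s actual module):

```python
import torch
import torch.nn as nn

# Minimal sketch of an NSP-style head: one linear layer over the pooled
# [CLS] representation, predicting "is sentence B the real next sentence?".
class NextSentenceHead(nn.Module):
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 2)  # [is_next, not_next] logits

    def forward(self, pooled_cls: torch.Tensor) -> torch.Tensor:
        # pooled_cls: (batch, hidden_size), taken from the [CLS] position
        return self.classifier(pooled_cls)           # (batch, 2)
```

The [SEP] positions are never read out here; they only influence the result indirectly through attention.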


Hi Velixo,

It’s a good question! I believe in the past Chris and I asked the same thing but couldn’t come up with a good answer. As far as I see it, your reasoning is correct: you need the segment embeddings / token_type_ids but don’t necessarily need the [SEP] tokens. A few ideas:

  1. [SEP] isn’t used for any of the pretraining tasks, but maybe it helps out anyway, e.g. by acting as some kind of buffer between sentences or by encoding useful information. My intuition about this black-box part of pretraining doesn’t really point one way or the other. Real evidence on this would be a) some BERTology paper that investigates whether the [SEP] token encodes anything, or better b) an ablation study that tests a BERT-like model with and without the [SEP] token. I haven’t come across either of those. The T5 or RoBERTa papers would have been good candidates for this ablation experiment, but I didn’t see anything. Everyone keeps using the [SEP] token, and I can’t tell whether that’s just convention by now or whether it gives some slight experimental improvement that no one mentions.
  2. Maybe it’s just a relic of the input formatting. Putting a [SEP] token in the raw text is possibly just a natural and easy way to distinguish the input sentences: rather than counting token indices and supplying your own segment-embedding 0s and 1s, you leave it to a downstream function to generate the segment embeddings (see the sketch after this list). Not a great answer (why not have the function that creates the segment embeddings also remove the [SEP] token?), but it strikes me as plausible.

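To make idea 2 concrete, the kind of downstream function I have in mind would be something like this (a hypothetical helper, not the actual BERT preprocessing code):

```python
def segment_ids_from_sep(token_ids, sep_id):
    """Assign segment 0 up to and including the first [SEP], segment 1 afterwards."""
    segment_ids = []
    current_segment = 0
    for tok in token_ids:
        segment_ids.append(current_segment)
        if tok == sep_id and current_segment == 0:
            current_segment = 1
    return segment_ids

# [CLS] A A [SEP] B B [SEP]  (101/102 are [CLS]/[SEP] in bert-base-uncased; the rest are made up)
print(segment_ids_from_sep([101, 7, 8, 102, 9, 10, 102], sep_id=102))
# -> [0, 0, 0, 0, 1, 1, 1]
```
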
If you figure it out, let us know! When I get a chance, I’ll ask someone who’s trained BERT-like models from scratch.

Chris found some information on the SEP token in section 3.2 of this paper: https://arxiv.org/pdf/1906.04341.pdf

They note that in BERT’s middle layers, [SEP] tokens receive much more attention than other tokens, but at the same time the attention paid to them has a much, much lower gradient w.r.t. the loss. So the [SEP] tokens are attended to heavily in the middle layers yet have very little effect on the output. The authors therefore believe the model uses attention to [SEP] as a kind of do-nothing, “no-op” fallback when an attention head’s function isn’t applicable. There are some interesting figures on that page of the paper showing relative attention to, and gradient w.r.t. loss for, the special tokens across layers.
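
If you want to eyeball that pattern yourself rather than take the figures at face value, something along these lines should work (a rough sketch with transformers, not the paper’s code):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The dog ran.", "It was fast.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
sep_positions = [i for i, t in enumerate(tokens) if t == "[SEP]"]

# outputs.attentions is a tuple of (batch, heads, seq, seq) tensors, one per layer.
for layer, att in enumerate(outputs.attentions):
    to_sep = att[0, :, :, sep_positions].mean().item()  # average attention paid to [SEP]
    print(f"layer {layer}: mean attention to [SEP] = {to_sep:.3f}")
```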

Some interesting insights there, but I’m still interested in seeing an ablation study :slight_smile: