Adding custom vocabulary and embeddings to BERT

I skimmed through the notebook that shows how to do the above, and I have a couple of questions about it. Let us say that I want to fine-tune BERT on a dataset of apparel descriptions, for example: “Feel angelic in the Extratropical Dress. In a beautiful neutral taupe shade, this dress is the perfect shade to complement any and every skin tone.” Since I am using a domain-specific dataset, I would like to add certain words to the existing vocabulary, and the notebook demonstrates that well.

Now, consider the word “print”. It already exists in tokenizer.vocab, but the meaning it carries in the pretrained vocabulary is quite different from its meaning in my dataset (a pattern on a dress). How should I deal with such a situation?

In the final section of the notebook, @ChrisMcC, you show a neat trick for customizing the embedding of a new word that has been added to the vocab. Can you please explain how this technique could be used when I am adding, say, 1,000 new words to the vocab?

Sincerely,
Vishal

Hey Vishal,

For your first question about the “print” token, you might try averaging its original embedding with the embeddings of other existing tokens that pull it closer to your intended meaning. For example:

embedding(‘print’) = (embedding(‘print’) + embedding(‘fabric’) + embedding(‘pattern’)) / 3

This approach is entirely experimental, so you might want to try different combinations (a weighted sum, perhaps) to nudge the embedding towards your domain.
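Here is a minimal sketch of that averaging idea, assuming bert-base-uncased and the transformers/PyTorch stack (the model name and the particular tokens are just for illustration):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# The input embedding matrix, shape (vocab_size, hidden_size).
emb = model.get_input_embeddings()

# Look up the rows for the tokens we want to blend.
# (If a word is not in the vocab, convert_tokens_to_ids returns the [UNK] id.)
ids = tokenizer.convert_tokens_to_ids(["print", "fabric", "pattern"])

with torch.no_grad():
    # Overwrite the "print" row with the mean of the three embeddings.
    emb.weight[ids[0]] = emb.weight[ids].mean(dim=0)
```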

For adding 1,000 new words to your vocab, there’s no real trick to scaling custom embeddings. You can initialize them randomly, customize each one by hand, or make some kind of best guess at an initial embedding. For example, if they are all names of different kinds/prints of fabric, then you might initialize them all with the embedding of the “fabric” token.
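A sketch of that best-guess initialization, where new_fabric_terms is a hypothetical (much shorter) stand-in for your 1,000 words:

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical list of domain-specific words; yours might have ~1,000 entries.
new_fabric_terms = ["jacquard", "herringbone", "seersucker"]
tokenizer.add_tokens(new_fabric_terms)

# Grow the embedding matrix so it has a row for each new token.
model.resize_token_embeddings(len(tokenizer))

emb = model.get_input_embeddings()
fabric_id = tokenizer.convert_tokens_to_ids("fabric")

with torch.no_grad():
    for token in new_fabric_terms:
        new_id = tokenizer.convert_tokens_to_ids(token)
        # Best-guess initialization: copy the existing "fabric" embedding.
        emb.weight[new_id] = emb.weight[fabric_id].clone()
```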

Note that these are all techniques to try to shortcut the model into creating good embeddings for your domain-specific dataset. I believe the most reliable and robust approach is to continue pretraining the model on your dataset or on a dataset in the same domain.
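For continued pretraining, a rough sketch of masked-language-model training on a plain-text file of apparel descriptions (descriptions.txt is a placeholder, and newer transformers versions prefer building the dataset with the datasets library instead of LineByLineTextDataset, but the idea is the same):

```python
from transformers import (BertForMaskedLM, BertTokenizer,
                          DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# One apparel description per line in a plain-text file.
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="descriptions.txt",
                                block_size=128)

# Randomly mask 15% of tokens, the standard BERT MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True,
                                           mlm_probability=0.15)

training_args = TrainingArguments(output_dir="bert-apparel",
                                  num_train_epochs=3,
                                  per_device_train_batch_size=16)

trainer = Trainer(model=model,
                  args=training_args,
                  data_collator=collator,
                  train_dataset=dataset)
trainer.train()
```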

If I were working on fashion specifically, I would also google around and browse the Hugging Face community models for models that have already been trained in this domain. FashionBERT itself, or the Related Works section of the FashionBERT paper, would be a good place to start.

Best,
Nick