Should we clean text before BERT?

Asked by Aswin Candra here:

Do we still need to do stopword removal or other text preprocessing steps before we feed our dataset into the fine-tuning process? Or should we just leave it the way it is? Or does it depend on our dataset?

Hi Aswin - When BERT was pre-trained, there was no stopword removal or stemming, so I think BERT expects to see everything. Plus, I think BERT is intelligent enough to use any subtle information that those details convey.

If you have any weird formatting or metadata in your text, then I think it’d be worth trying to clean that up.
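To give a rough idea of what I mean by "cleaning up" (the function name and exact rules here are just my sketch, not a standard recipe), something like this strips markup and stray URLs while deliberately leaving stopwords, casing, and punctuation alone:

```python
import re

def clean_for_bert(text):
    """Light cleanup before fine-tuning: strip markup and metadata,
    but keep stopwords, casing, and punctuation, since BERT was
    pre-trained on raw text and can use those signals."""
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML tags
    text = re.sub(r"http\S+", " ", text)      # drop bare URLs
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text

print(clean_for_bert("<p>The movie was   great!</p> http://example.com"))
# The movie was great!
```

Note that this doesn't touch the words themselves - the whole sentence, stopwords and all, still goes to the tokenizer.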

Since BERT was pre-trained on Wikipedia and BooksCorpus, it was trained on pretty clean text, so I’d be curious to know whether things like spelling correction or expanding abbreviated text (like “pls”, “np”, …) would help. :man_shrugging: