Non-English/Multilingual BERT

Hi Chris and Nick,

Thank you for all the amazing content! :slight_smile:
I have been looking into multilingual models and exploring their performance on low-resource languages, and I would like to hear your take on the following: it has been observed that a multilingual model's performance on a given language tends to lag behind that of a similarly sized monolingual model trained solely on that language.
Since it would be practically inefficient or infeasible to train models from scratch for every language, what are your thoughts on how to further improve multilingual models for specific languages (or domains)?

Thanks!

Best,
Stella

Hi Stella,

Good question!

I think the introduction of good benchmarks and datasets is responsible for more research advances than many people realize, so the creation of multilingual benchmarks and datasets like XNLI or XTREME helps get the broader research community interested and gives them something to aim at.
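If you want to poke at one of these benchmarks yourself, here's a minimal sketch (assuming the Hugging Face `datasets` library and its public `xnli` dataset; the Greek config is just one example) of loading and inspecting it:

```python
# Minimal sketch: load the Greek validation split of XNLI with the
# Hugging Face `datasets` library (assumes `pip install datasets`).
from datasets import load_dataset

xnli_el = load_dataset("xnli", "el", split="validation")

example = xnli_el[0]
print(example["premise"])
print(example["hypothesis"])
print(example["label"])  # 0 = entailment, 1 = neutral, 2 = contradiction
```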

I’d say availability is a huge factor as well. Platforms like Hugging Face do a great job of facilitating multilingual NLP by making models and datasets easier to access, train, and (especially) share via the community model hub. Before that, if you needed an NLP solution in another language you had to worry about (a) finding something at all and (b) getting it to run, which, I can say from experience, can be very time-consuming and discouraging.
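To make that concrete, here's a hedged sketch (assuming the `transformers` library; the checkpoint name is just one example) of pulling a multilingual model off the Hub:

```python
# Sketch: download a multilingual checkpoint from the Hugging Face Hub
# (assumes `pip install transformers`; the model name is one example of many).
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "bert-base-multilingual-cased"  # mBERT, pretrained on ~100 languages
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Swapping model_name for a community-shared monolingual model works the same way.
```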

Transfer learning and zero-shot learning work better between languages that are typologically similar; for example, low-resource Slavic languages benefit from high-resource Russian language models through transfer or zero-shot learning. Another idea is to create multilingual models that specifically target language families and typologies that aren’t already well covered,
so if you can’t create a monolingual model for a low-resource language X, you could at least transfer from a multilingual model trained specifically on languages in the same family.
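As a rough sketch of that transfer idea (assuming `transformers` and `datasets`; the Russian-to-Bulgarian pairing, the small training slice, and the hyperparameters are purely illustrative), you could fine-tune a multilingual encoder on a high-resource language and evaluate it zero-shot on a related one:

```python
# Hedged sketch of zero-shot cross-lingual transfer: fine-tune a multilingual
# encoder on Russian NLI data, then evaluate it unchanged on Bulgarian.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

def encode(batch):
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, padding="max_length", max_length=128)

train_ru = load_dataset("xnli", "ru", split="train[:2000]").map(encode, batched=True)
eval_bg = load_dataset("xnli", "bg", split="validation").map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="xlmr-ru-nli", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=train_ru,
    eval_dataset=eval_bg,  # no Bulgarian examples were seen during fine-tuning
)
trainer.train()
print(trainer.evaluate())
```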

On a more general note: just like monolingual models, multilingual model development has a number of different “engineering” directions for improvement: training regimens, architecture, etc. You can see a lot of these new ideas in the more recent model submissions to cross-lingual/multilingual benchmark tasks. For example, by browsing new submissions on the XTREME leaderboard (https://sites.research.google/xtreme) you can get a sense of the directions different research groups are taking multilingual language modeling.

Getting researchers to invest more in multilingual models and NLP for low-resource languages is another challenge that has seen better progress in recent years. Incentivizing and studying this is important for a variety of reasons. These two papers have great ideas I don’t have space to cover myself!

https://www.aclweb.org/anthology/2020.acl-main.560.pdf