Two or more text columns as input to transformer

Hi Chris,
Thanks a lot for creating this channel. I have been following your BERT tutorials, and they have helped me a lot.

My question is: say I have two or more text columns as input, how do I pass them to BERT? Any examples of this? And how do we make use of categorical input features?

Thanks,
Sai

Hey, Sai! Thanks for the encouragement, and for joining the forum!

I feel like I’m missing something obvious in your question about the columns… Is it that you are trying to incorporate the layout of the text into the input? And if so, what information is conveyed by the fact that the text is in columns?

As for categorical input features, I’ve been asked this before and I would love to research it! When you have categorical data, my understanding is that tree-based models like gradient-boosted decision trees are the usual answer. I suspect that people combine text and categorical data by feeding the categorical features, along with the output of BERT (e.g., the classification scores), into a model like XGBoost.
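
Just to sketch the idea (and again, this isn’t something I’ve researched deeply, so the data, feature names, and model settings below are all made up for illustration):

```python
# Rough sketch: one-hot encode the categorical column, stack it next to
# BERT's output scores, and train a gradient-boosted tree model on the
# combined feature matrix. All values here are pretend data.
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier

# Suppose we have BERT's classification scores for four examples
# (two classes), plus one categorical column and the true labels.
bert_scores = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6]])
categories = np.array([["red"], ["blue"], ["red"], ["green"]])
labels = np.array([0, 1, 0, 1])

# One-hot encode the categorical column...
cat_features = OneHotEncoder(handle_unknown="ignore").fit_transform(categories).toarray()

# ...and place it alongside BERT's scores.
features = np.hstack([bert_scores, cat_features])

# Fit XGBoost on the combined features.
model = XGBClassifier(n_estimators=50, max_depth=3)
model.fit(features, labels)
```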

Again, it’s an area where I’m lacking in knowledge, but I would love to fix that :slight_smile:. If you’re willing, maybe post a request in the “content requests” category, and we’ll see how much support it gets?

Thanks!
Chris

Chris, let me rephrase my question on two or more text columns as input:

For example, say we have a data source with three columns:
column_a: text data which describes one feature
column_b: text data which describes another feature
column_c: category/label

If I have to approach this kind of text classification problem with BERT, how can we pass column_a and column_b as inputs to the BERT model? As of now, I have concatenated both columns and am passing the result as input to BERT, and I am not sure whether this is the right approach or not.

Ah! That makes perfect sense now, thanks :blush:.

BERT is actually capable of taking in two independent pieces of text, because it was pre-trained on some two-sentence tasks. At a low level, you give BERT two pieces of text by concatenating their token sequences, with the special “[SEP]” token inserted between them. You also add a special “Segment A” embedding to all of the tokens in the first text and a “Segment B” embedding to all of the tokens in the second text.

In practice, whatever library you’re working with should have the ability to feed in two independent pieces of text and take care of the above steps for you, so make sure to use that functionality rather than simply concatenating your strings before passing them in. (It’d be interesting to try both approaches, though, and see whether it changes much!)
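
For instance, here’s a rough sketch with the Hugging Face transformers library (I’m assuming bert-base-uncased and placeholder text, since I don’t know your actual data):

```python
# Passing the two texts as separate arguments (instead of one concatenated
# string) makes the tokenizer insert [SEP] and set the segment
# ("token type") IDs for you.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text_a = "first column text"   # stand-in for column_a
text_b = "second column text"  # stand-in for column_b

encoding = tokenizer(text_a, text_b)

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# ['[CLS]', 'first', 'column', 'text', '[SEP]', 'second', 'column', 'text', '[SEP]']

print(encoding["token_type_ids"])
# [0, 0, 0, 0, 0, 1, 1, 1, 1]
```

The `token_type_ids` are what become the “Segment A” / “Segment B” embeddings I mentioned above.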

So the support is there in the architecture, BUT, it’s limited to two pieces of text. You could pre-train your own BERT model to handle more than just two pieces of text, but that can get expensive in GPU time…

I think I would experiment first with the two pieces of text to see whether treating them independently actually improves BERT’s performance. If it doesn’t improve it, then that simplifies things and you can just go back to concatenating your text before inputting.

Interesting to think about, thanks again for asking that!

Chris

Thanks for the insights, @ChrisMcC. Any suggestions/ideas/examples you can point me to on passing two independent text columns using Hugging Face transformers?