Is the final output the concatenation of the outputs from the 12 heads?

Hi,

I understand that the output of one Encoder is the input to the next Encoder. For the final output: are we concatenating the 12 outputs of dimension [1 * 64] from the 12 Encoders (12 self-attention heads, hence multi-headed attention), which gives a [1 * 768]-dimensional output for a single entity (an entity could be a word or a character)?
Now, assuming we have a 2-word sentence, the above process happens for both words at the same time. So do we have 24 self-attention layers running at once (12 per word), with the concatenation happening at the end across the 12 self-attention heads for each word?
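
To make the shapes concrete, here is a rough sketch of how I picture the computation. It is only my understanding and may be wrong. The sizes (12 heads, 64 dimensions per head, 768 hidden) are the ones from my question, the weights are random placeholders rather than trained values, and the helper names (`attention_head`, `multi_head_concat`, etc.) are made up for illustration. I also left out the output projection that, as far as I know, is normally applied after the concatenation.

```python
import math
import random

# Assumed sizes taken from the question: 12 heads x 64 dims = 768 hidden.
NUM_HEADS = 12
HEAD_DIM = 64
HIDDEN = NUM_HEADS * HEAD_DIM
SEQ_LEN = 2  # the 2-word sentence


def matmul(a, b):
    """Multiply an [n x k] list-of-lists by a [k x m] list-of-lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Placeholder weights; a real model would use trained parameters."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def attention_head(x, wq, wk, wv):
    """One head: scaled dot-product attention over ALL tokens at once."""
    q = matmul(x, wq)                     # [seq_len x head_dim]
    k = matmul(x, wk)                     # [seq_len x head_dim]
    v = matmul(x, wv)                     # [seq_len x head_dim]
    k_t = [list(col) for col in zip(*k)]  # [head_dim x seq_len]
    scores = matmul(q, k_t)               # [seq_len x seq_len]
    scale = math.sqrt(len(wq[0]))
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)             # [seq_len x head_dim]


def multi_head_concat(x, heads):
    """Run every head, then concatenate the head outputs per token."""
    outs = [attention_head(x, wq, wk, wv) for (wq, wk, wv) in heads]
    return [[val for out in outs for val in out[t]] for t in range(len(x))]


rng = random.Random(0)
x = random_matrix(SEQ_LEN, HIDDEN, rng)  # one 768-dim vector per word
heads = [
    tuple(random_matrix(HIDDEN, HEAD_DIM, rng) for _ in range(3))
    for _ in range(NUM_HEADS)
]
head_out = attention_head(x, *heads[0])
concat = multi_head_concat(x, heads)

print(len(head_out), len(head_out[0]))  # 2 64
print(len(concat), len(concat[0]))      # 2 768
```

In this sketch each head handles both words in a single pass and returns a [2 * 64] result, so there are 12 head computations rather than 24 separate ones, and the concatenation produces one [1 * 768] row per word. Is that the right way to think about it?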