BPEmb: Subword Embeddings in 275 Languages

Benjamin Heinzerling and Michael Strube

What is this?

BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia. Its intended use is as input for neural models in natural language processing.

tl;dr

  • Subwords allow guessing the meaning of unknown / out-of-vocabulary words. E.g., the suffix -shire in Melfordshire indicates a location.
  • Byte-Pair Encoding gives a subword segmentation that is often good enough, without requiring tokenization or morphological analysis. In this case the BPE segmentation might be something like melf ord shire.
  • Pre-trained byte-pair embeddings work surprisingly well, while requiring no tokenization and being much smaller than alternatives: an 11 MB BPEmb English model matches the results of the 6 GB FastText model in our evaluation.

Usage

Import BPEmb:
>>> from bpemb import BPEmb

Load a BPEmb model for English:
>>> bpemb_en = BPEmb(lang="en")

Byte-pair encode text:
>>> bpemb_en.encode("Stratford")
['▁strat', 'ford']

>>> bpemb_en.encode("This is anarchism")
['▁this', '▁is', '▁an', 'arch', 'ism']

Load a Chinese model with vocabulary size 100,000:
>>> bpemb_zh = BPEmb(lang="zh", vs=100000)
>>> bpemb_zh.encode("这是一个中文句子") # "This is a Chinese sentence."
['▁这是一个', '中文', '句子'] # ["This is a", "Chinese", "sentence"]

BPEmb objects wrap a gensim KeyedVectors instance:
>>> type(bpemb_zh.emb)
gensim.models.keyedvectors.Word2VecKeyedVectors

The actual embedding vectors are stored as a numpy array:
>>> bpemb_zh.vectors.shape
(100000, 100)

Use these vectors as a pretrained embedding layer in your neural model. For example, in PyTorch:
>>> from torch import nn, tensor
>>> emb_layer = nn.Embedding.from_pretrained(tensor(bpemb_zh.vectors))
>>> emb_layer
Embedding(100000, 100)

To perform an embedding lookup, encode the text into byte-pair IDs, which correspond to row indices in the embedding matrix:
>>> ids = bpemb_zh.encode_ids("这是一个中文句子")
>>> ids
[25950, 695, 20199]
>>> bpemb_zh.vectors[ids].shape
(3, 100)
>>> emb_layer(tensor(ids)).shape # in PyTorch
torch.Size([3, 100])

The embed method combines encoding and embedding lookup:
>>> bpemb_zh.embed("这是一个中文句子").shape
(3, 100)

What are subword embeddings and why should I use them?

If you are using word embeddings like word2vec or GloVe, you have probably encountered out-of-vocabulary words, i.e., words for which no embedding exists. A makeshift solution is to replace such words with an <unk> token and to train a generic embedding that represents all unknown words.

Subword approaches try to solve the unknown word problem differently, by assuming that you can reconstruct a word's meaning from its parts. For example, the suffix -shire lets you guess that Melfordshire is probably a location, or the suffix -osis that Myxomatosis might be a sickness.

There are many ways of splitting a word into subwords. A simple method is to split the word into characters and then learn to transform this character sequence into a vector representation by feeding it to a convolutional neural network (CNN) or a recurrent neural network (RNN), usually a long short-term memory (LSTM) network. This vector representation can then be used like a word embedding.
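As a rough, self-contained illustration (not part of BPEmb), such a character-level encoder might look like this in PyTorch; the character vocabulary and dimensions are made up:
>>> import torch
>>> from torch import nn
>>> char2id = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}  # toy character vocabulary
>>> char_emb = nn.Embedding(len(char2id), 25)                # character embeddings
>>> char_lstm = nn.LSTM(input_size=25, hidden_size=50)       # composes characters into a word vector
>>> chars = torch.tensor([[char2id[c]] for c in "melfordshire"])  # shape: (word length, batch=1)
>>> _, (word_vector, _) = char_lstm(char_emb(chars))         # final hidden state represents the word
>>> word_vector.shape
torch.Size([1, 1, 50])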

Another, more linguistically motivated way is a morphological analysis, but this requires tools and training data which might not be available for your language and domain of interest.

Enter Byte-Pair Encoding (BPE) [Sennrich et al., 2016], an unsupervised subword segmentation method. BPE starts with a sequence of symbols, for example characters, and iteratively merges the most frequent symbol pair into a new symbol.

For example, applying BPE to English might first merge the characters h and e into a new symbol he, then t and h into th, then t and he into the, and so on.
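The core of this procedure fits in a few lines. Below is a toy sketch on a handful of words, ignoring word frequencies, word boundaries, and the other details of the reference implementation by Sennrich et al.:
>>> from collections import Counter
>>> def learn_bpe(words, num_merges):
...     """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
...     corpus = [list(word) for word in words]    # start from character symbols
...     merges = []
...     for _ in range(num_merges):
...         pairs = Counter()
...         for symbols in corpus:
...             pairs.update(zip(symbols, symbols[1:]))
...         best = max(pairs, key=pairs.get)       # most frequent adjacent pair
...         merges.append(best)
...         for symbols in corpus:                 # replace the pair with a merged symbol
...             i = 0
...             while i < len(symbols) - 1:
...                 if (symbols[i], symbols[i + 1]) == best:
...                     symbols[i:i + 2] = ["".join(best)]
...                 else:
...                     i += 1
...     return merges, corpus
>>> learn_bpe(["the", "then", "there", "he", "her"], num_merges=2)
([('h', 'e'), ('t', 'he')], [['the'], ['the', 'n'], ['the', 'r', 'e'], ['he'], ['he', 'r']])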

Learning these merge operations from a large corpus (e.g. all Wikipedia articles in a given language) often yields reasonable subword segmentations. For example, a BPE model trained on English Wikipedia splits melfordshire into mel, ford, and shire.

Applying BPE to a large corpus and then training embeddings allows capturing semantic similarity on the subword level:
>>> bpemb_en.most_similar("shire", topn=20)
[('hire', 0.9314496517181396),
('▁yorkshire', 0.8953744173049927),
('kshire', 0.8550578355789185),
('borough', 0.8407645225524902),
('▁somers', 0.8306833505630493),
('bury', 0.8140815496444702),
('ingham', 0.7990236878395081),
('cester', 0.7609555721282959),
('ford', 0.754569411277771),
('▁lanc', 0.7440991401672363),
('▁kent', 0.7358010411262512),
('wick', 0.7234422564506531),
('wich', 0.7164122462272644),
('ampton', 0.710740327835083),
('worth', 0.7066425085067749),
('▁chester', 0.6958329677581787),
('ham', 0.6919639110565186),
('▁wales', 0.688853919506073),
('bridge', 0.6859016418457031),
('ington', 0.6753650903701782)]

The most similar BPE symbols include many English location suffixes like bury (e.g. Salisbury), ford (Stratford), bridge (Cambridge), or ington (Islington).

Let's load a model with a different vocabulary size and query the subwords most similar to osis:
>>> bpemb_en_100k = BPEmb(lang="en", vs=100000)
>>> bpemb_en_100k.most_similar("osis")
[('itis', 0.7558665871620178),
('otic', 0.7047083377838135),
('▁hyp', 0.6756638288497925),
('▁chronic', 0.6747802495956421),
('▁disease', 0.6637889742851257),
('▁dermat', 0.6540030241012573),
('▁infection', 0.6292201280593872),
('asis', 0.6207979321479797),
('ogenic', 0.6143321990966797),
('omy', 0.6116182804107666)]

You can also explore similar subwords in the TensorFlow embedding projector (linked from each language's download page).

A similar example with a common German place name suffix:
>>> bpemb_de = BPEmb(lang="de")
>>> bpemb_de.most_similar("ingen")
[('hausen', 0.7823410034179688),
('lingen', 0.7540420889854431),
('hofen', 0.7394638061523438),
('sheim', 0.7340810298919678),
('sbach', 0.722784161567688),
('heim', 0.7004263997077942),
('weiler', 0.6882345676422119),
('dorf', 0.6798275709152222),
('bach', 0.6773027181625366),
('burg', 0.6758529543876648)]

And with the German equivalent of -osis:
>>> bpemb_de.most_similar("ose", topn=15)
[('omat', 0.6046937108039856),
('krank', 0.6041499376296997),
('yse', 0.5890116691589355),
('▁diagn', 0.5877634882926941),
('syn', 0.5804826021194458),
('apie', 0.563900887966156),
('▁erkrank', 0.5624006390571594),
('pt', 0.5604687333106995),
('▁hy', 0.5580164790153503),
('▁behandlung', 0.5550136566162109),
('amin', 0.5534716844558716),
('hy', 0.5433641076087952),
('drom', 0.5430601835250854),
('ase', 0.5377143621444702),
('itis', 0.5358116626739502)]

Which vocabulary size should I pick?

The vocabulary size is the sum of the number of BPE merge operations and the number of distinct characters in the training data. The number of merge operations determines whether the resulting symbol sequences tend to be short (few merge operations) or long (many merge operations). Using very few merge operations produces mostly character unigrams, bigrams, and trigrams, while performing a large number of merge operations creates symbols representing the most frequent words. The table below shows this for English (top), Japanese (middle), and Chinese (bottom):

Vocab. size   Byte-pair encoded text

English
1000          to y od a _station _is _a _r ail way _station _on _the _ch ū ō _main _l ine
3000          to y od a _station _is _a _railway _station _on _the _ch ū ō _main _line
10000         toy oda _station _is _a _railway _station _on _the _ch ū ō _main _line
50000         toy oda _station _is _a _railway _station _on _the _chū ō _main _line
100000        toy oda _station _is _a _railway _station _on _the _chūō _main _line
Tokenized     toyoda station is a railway station on the chūō main line

Japanese
5000          豊 田 駅 ( と よ だ え き ) は 、 東京都 日 野 市 豊 田 四 丁目 にある
10000         豊 田 駅 ( と よ だ えき ) は 、 東京都 日 野市 豊 田 四 丁目にある
25000         豊 田駅 ( とよ だ えき ) は 、 東京都 日 野市 豊田 四 丁目にある
50000         豊 田駅 ( とよ だ えき ) は 、 東京都 日 野市 豊田 四丁目にある
Tokenized     豊田 駅 ( と よ だ え き ) は 、 東京 都 日野 市 豊田 四 丁目 に ある

Chinese
10000         豐 田 站 是 東 日本 旅 客 鐵 道 ( JR 東 日本 ) 中央 本 線 的 鐵路 車站
25000         豐田 站是 東日本旅客鐵道 ( JR 東日本 ) 中央 本 線的鐵路車站
50000         豐田 站是 東日本旅客鐵道 ( JR 東日本 ) 中央 本線的鐵路車站
Tokenized     豐田站 是 東日本 旅客 鐵道 ( JR 東日本 ) 中央本線 的 鐵路車站

The advantage of performing only a few merge operations is a small symbol vocabulary: you need less data to learn representations (embeddings) of these symbols. The disadvantage is that your model needs data to learn how to compose those symbols into meaningful units (e.g. words).

The advantage of performing many merge operations is that many frequent words get their own symbols, so you don't have to learn what the word railway means by composing it from the symbols r, ail, and way. The disadvantage is that you need more data to train good embeddings for these longer symbols; such data is available for high-resource languages like English, but less so for low-resource languages like Khmer.
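If you want to compare segmentations yourself, you can load the same language with several vocabulary sizes and encode the same sentence; the table above shows the kind of output to expect (each model is downloaded on first use):
>>> sentence = "toyoda station is a railway station on the chūō main line"
>>> for vs in (1000, 10000, 100000):
...     print(vs, BPEmb(lang="en", vs=vs).encode(sentence))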

Download

The easiest way to use BPEmb is to install it as a Python package via pip:
pip install bpemb
Embeddings and SentencePiece models will be downloaded automatically the first time you use them.

Alternatively, you can download pretrained embeddings and SentencePiece models on the download page of the language of your choice. The download page also contains various visualizations, a link to view the embeddings in the TensorFlow projector, and a sample of the Wikipedia edition used for training.

Visualization

Each language's download page includes a UMAP projection of the embedding space. You can also load the embeddings into the TensorFlow projector.
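If you prefer to upload your own selection of vectors to projector.tensorflow.org, it is enough to write the embedding matrix and the corresponding subword labels to two TSV files. A minimal sketch, using the English model loaded above:
>>> import numpy as np
>>> np.savetxt("vectors.tsv", bpemb_en.vectors, delimiter="\t")   # one embedding per line
>>> with open("metadata.tsv", "w", encoding="utf-8") as f:        # one subword label per line
...     f.writelines(word + "\n" for word in bpemb_en.emb.index2word)  # bpemb_en.emb is the gensim KeyedVectors instance (see above)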

MultiBPEmb

MultiBPEmb is the multilingual version of BPEmb. Instead of training 275 monolingual subword segmentation models and embeddings, we trained one large, multilingual segmentation model and corresponding embeddings with a subword vocabulary that is shared among all 275 languages. If you want to do multilingual NLP, find MultiBPEmb here.

Citing BPEmb

If you use BPEmb in academic work, please cite:
Benjamin Heinzerling and Michael Strube, 2018: BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages.

@InProceedings{heinzerling2018bpemb,
  author = {Benjamin Heinzerling and Michael Strube},
  title = "{BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages}",
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year = {2018},
  month = {May 7-12, 2018},
  address = {Miyazaki, Japan},
  editor = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Hélène Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {979-10-95546-00-9},
  language = {english}
  }