Marshallese (mh) subword embeddings

Vocab size: 1000
Files: vocab | model
25 dim: txt | bin; plots: bokeh | umap | matrix
50 dim: txt | bin; plots: bokeh | umap | matrix
100 dim: txt | bin; plots: bokeh | umap | matrix
200 dim: txt | bin; plots: bokeh | umap | matrix
300 dim: txt | bin; plots: bokeh | umap | matrix
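
The files listed above can also be loaded from Python. The following is a minimal sketch, assuming the third-party `bpemb` package is installed (`pip install bpemb`) and is able to download the Marshallese files on first use. The helper names `check_available`, `load_mh`, and `demo` are hypothetical and exist only for this illustration; the vocabulary size and dimensions are the ones listed on this page.

```python
# Sizes listed in the download table on this page.
AVAILABLE_VOCAB_SIZES = (1000,)
AVAILABLE_DIMS = (25, 50, 100, 200, 300)


def check_available(vocab_size, dim):
    """Raise ValueError unless the combination is listed on this page."""
    if vocab_size not in AVAILABLE_VOCAB_SIZES:
        raise ValueError(f"vocab size {vocab_size} is not listed for mh")
    if dim not in AVAILABLE_DIMS:
        raise ValueError(f"dimension {dim} is not listed for mh")


def load_mh(vocab_size=1000, dim=100):
    """Load the Marshallese BPE model and embeddings (hypothetical helper)."""
    check_available(vocab_size, dim)
    # Imported here so the rest of the sketch works without the package.
    from bpemb import BPEmb  # third-party
    return BPEmb(lang="mh", vs=vocab_size, dim=dim)


def demo():
    """Encode and embed a few words taken from the corpus sample."""
    bpemb_mh = load_mh()
    text = "meram eo ej itok jen lan"
    print(bpemb_mh.encode(text))       # list of subword pieces
    print(bpemb_mh.embed(text).shape)  # (number of pieces, dim)
```

Calling `demo()` requires network access the first time, because the package fetches the model and embedding files before caching them locally.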

2-D UMAP plots

Embedding matrix plots

Training corpus sample, encoded with different BPE vocabulary sizes

Vocab size | mhwiki sample
original | ke ear jar ear loi juon meram. meram eo ej itok jen lan im ear wanlaltak mae iien ear bed iion beran.
         | joseph ear kajjitok ipperro kabun ta eo emol. jisos ear uak im ba bwe kabun eo an ejjab bed iion lalin.
         | ilo raan kein, kabun eo an jisos kraist im armij ro rekwojarjar ilo raan ko aliktata ej bed ilo majol kiia. elon branch ilo laura, ajeltaki, long isla
1000     | ▁ke ▁ear ▁jar ▁ear ▁loi ▁juon ▁meram . ▁meram ▁eo ▁ej ▁itok ▁jen ▁lan ▁im ▁ear ▁wanlaltak ▁mae ▁iien ▁ear ▁bed ▁iion ▁beran .
         | ▁joseph ▁ear ▁kajjitok ▁ipperro ▁kabun ▁ta ▁eo ▁emol . ▁jisos ▁ear ▁uak ▁im ▁ba ▁bwe ▁kabun ▁eo ▁an ▁ejjab ▁bed ▁iion ▁lalin .
         | ▁ilo ▁raan ▁kein , ▁kabun ▁eo ▁an ▁jisos ▁kraist ▁im ▁armij ▁ro ▁rekwojarjar ▁ilo ▁raan ▁ko ▁aliktata ▁ej ▁bed ▁ilo ▁majol ▁kiia . ▁elon ▁branch ▁ilo ▁laura , ▁ajeltaki , ▁long ▁is la
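
In the encoded sample, `▁` marks the start of a word, so a piece without it (such as `la` or a punctuation mark) attaches to the piece before it. The sketch below shows how the pieces map back to the original text, using only the standard library; the function name `pieces_to_text` is a hypothetical helper, not part of any released tooling.

```python
WORD_START = "\u2581"  # the "▁" marker used in the encoded sample


def pieces_to_text(pieces):
    """Rebuild plain text from subword pieces.

    Pieces are concatenated as-is, then every word-start marker
    is turned back into a space.
    """
    return "".join(pieces).replace(WORD_START, " ").strip()


# Tail of the third sample line, encoded with vocabulary size 1000.
encoded = "▁ilo ▁laura , ▁ajeltaki , ▁long ▁is la"
pieces = encoded.split()

print(len(pieces))             # 8 pieces
print(pieces_to_text(pieces))  # ilo laura, ajeltaki, long isla
```

Here the word `isla` is split into `▁is` and `la`, while every other word in the fragment is kept as a single piece.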