Southern Sotho (st) subword embeddings

Vocab size  vocab  model  25 dim                 50 dim                 100 dim                200 dim                300 dim
1000        vocab  model  txt | bin              txt | bin              txt | bin              txt | bin              txt | bin
                          bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix
3000        vocab  model  txt | bin              txt | bin              txt | bin              txt | bin              txt | bin
                          bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix
5000        vocab  model  txt | bin              txt | bin              txt | bin              txt | bin              txt | bin
                          bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix
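
The txt downloads listed above can be read without any special tooling if they follow the common word2vec text layout: a header line with the entry count and the dimension, then one subword and its vector per line. That layout is an assumption here and is not stated on this page, and the three-entry sample below is made up for illustration (real files would hold 1000, 3000 or 5000 entries). A minimal sketch:

```python
import io


def load_text_embeddings(stream):
    """Parse word2vec-style text: 'count dim' header, then 'token v1 ... vd' lines.

    The file layout is an assumption; check a downloaded file before relying on it.
    """
    header = stream.readline().split()
    declared_count, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return declared_count, dim, vectors


# Hypothetical 2-dimensional sample; the numbers are invented.
sample = "3 2\n▁ke 0.1 -0.2\n▁ka 0.3 0.4\nholo -0.5 0.6\n"
declared_count, dim, vectors = load_text_embeddings(io.StringIO(sample))
print(declared_count, dim, sorted(vectors))
```

For a real file, the same function could be called on `open(path, encoding="utf-8")`, where `path` points at one of the downloaded txt files.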

Embedding matrix plots

Training corpus sample, encoded with different BPE vocabulary sizes

Vocab size  stwiki sample
original    digae ya barkly east barkly east ke motse-moholo oa joe gqabi district municipality, leboya la provense kapa botjhabela ka moka afrika borwa.
            poland (pl. - ''polska'') naha ka uropa. 00 000 000 (0000) baahi. warsaw ke motsemoholo wa poland. warsaw
            vietjet air vietjet air ke vietnamesisch tekanyetso lifofane. ntlo-kholo ba hanoi. e le setsi sa mosebetsi e tsoa ho tan son nhat international airpor
1000        ▁di ga e ▁ya ▁ba r k ly ▁ea st ▁ba r k ly ▁ea st ▁ke ▁mo tse - mo holo ▁oa ▁jo e ▁g qa bi ▁district ▁municipality , ▁lebo ya ▁la ▁provense ▁kapa ▁botjhabela ▁ka ▁mo ka ▁afrika ▁borwa .
            ▁po land ▁( p l . ▁- ▁'' p ol s ka '' ) ▁naha ▁ka ▁u ro pa . ▁00 ▁000 ▁000 ▁(0000) ▁baahi . ▁wa r sa w ▁ke ▁mo ts emo holo ▁wa ▁po land . ▁wa r sa w
            ▁vi e tj et ▁air ▁vi e tj et ▁air ▁ke ▁vi et na me sis ch ▁te ka ny etso ▁li fofane . ▁nt lo - k holo ▁ba ▁ha no i . ▁e ▁le ▁se tsi ▁sa ▁mo sebetsi ▁e ▁tsoa ▁ho ▁ta n ▁so n ▁n ha t ▁i nt er na tiona l ▁air po r
3000        ▁di ga e ▁ya ▁bark ly ▁east ▁bark ly ▁east ▁ke ▁motse - moholo ▁oa ▁joe ▁gqabi ▁district ▁municipality , ▁leboya ▁la ▁provense ▁kapa ▁botjhabela ▁ka ▁moka ▁afrika ▁borwa .
            ▁poland ▁( p l . ▁- ▁'' p ol ska '') ▁naha ▁ka ▁uropa . ▁00 ▁000 ▁000 ▁(0000) ▁baahi . ▁war saw ▁ke ▁motsemoholo ▁wa ▁poland . ▁war saw
            ▁vietjet ▁air ▁vietjet ▁air ▁ke ▁vi etna me sis ch ▁tekanyetso ▁lifofane . ▁nt lo - k holo ▁ba ▁ha no i . ▁e ▁le ▁setsi ▁sa ▁mosebetsi ▁e ▁tsoa ▁ho ▁ta n ▁son ▁n ha t ▁international ▁air po r
5000        ▁di ga e ▁ya ▁barkly ▁east ▁barkly ▁east ▁ke ▁motse - moholo ▁oa ▁joe ▁gqabi ▁district ▁municipality , ▁leboya ▁la ▁provense ▁kapa ▁botjhabela ▁ka ▁moka ▁afrika ▁borwa .
            ▁poland ▁( p l . ▁- ▁'' polska '') ▁naha ▁ka ▁uropa . ▁00 ▁000 ▁000 ▁(0000) ▁baahi . ▁warsaw ▁ke ▁motsemoholo ▁wa ▁poland . ▁warsaw
            ▁vietjet ▁air ▁vietjet ▁air ▁ke ▁vi etna mesisch ▁tekanyetso ▁lifofane . ▁nt lo - k holo ▁ba ▁ha no i . ▁e ▁le ▁setsi ▁sa ▁mosebetsi ▁e ▁tsoa ▁ho ▁tan ▁son ▁n ha t ▁international ▁air po r
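
To make the comparison above concrete, the following sketch takes one sentence from the sample ("warsaw ke motsemoholo wa poland.") as encoded at each vocabulary size, counts the pieces, and joins them back into text. The piece sequences are copied from the table; the helper name `decode_pieces` is hypothetical, and treating `▁` as a word-boundary marker that maps back to a space is an assumption based on how the samples look, not on anything stated here.

```python
def decode_pieces(pieces):
    """Join subword pieces and turn the '▁' boundary marker back into spaces."""
    return "".join(pieces).replace("▁", " ").strip()


# Piece sequences copied from the sample table, keyed by vocabulary size.
encoded = {
    1000: "▁wa r sa w ▁ke ▁mo ts emo holo ▁wa ▁po land .",
    3000: "▁war saw ▁ke ▁motsemoholo ▁wa ▁poland .",
    5000: "▁warsaw ▁ke ▁motsemoholo ▁wa ▁poland .",
}

piece_counts = {}
decoded = {}
for vocab_size, line in encoded.items():
    pieces = line.split(" ")
    piece_counts[vocab_size] = len(pieces)
    decoded[vocab_size] = decode_pieces(pieces)
    print(vocab_size, len(pieces), decoded[vocab_size])
```

All three encodings decode to the same sentence; only the number of pieces changes, dropping from 13 at vocabulary size 1000 to 7 at 3000 and 6 at 5000.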