Somali (so) subword embeddings

Vocab size    Vocab    Model    Embeddings (25, 50, 100, 200, and 300 dim)
1000          vocab    model    each dimension: txt | bin, bokeh | umap | matrix
3000          vocab    model    each dimension: txt | bin, bokeh | umap | matrix
5000          vocab    model    each dimension: txt | bin, bokeh | umap | matrix
10000         vocab    model    each dimension: txt | bin, bokeh | umap | matrix
25000         vocab    model    each dimension: txt | bin, bokeh | umap | matrix
50000         vocab    model    each dimension: txt | bin, bokeh | umap | matrix
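
A minimal sketch of reading one of the txt downloads listed above. It assumes the txt files use the common word2vec text layout (a header line with the entry count and the dimension, then one subword and its vector per line); the page itself does not state the format, so check a downloaded file first. The function name `load_text_embeddings` and the toy values are placeholders for illustration, not real Somali embeddings.

```python
import io


def load_text_embeddings(handle):
    """Parse word2vec-style text (assumed layout): 'count dim' header,
    then 'token v1 ... v_dim' on each following line."""
    header = handle.readline().split()
    dim = int(header[1])  # header[0] is the number of entries
    vectors = {}
    for line in handle:
        parts = line.rstrip("\n").split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


# Placeholder content standing in for a downloaded txt file (values are made up).
toy_file = io.StringIO(
    "3 2\n"
    "\u2581waa 0.1 0.2\n"
    "\u2581ku -0.3 0.4\n"
    "ka 0.5 -0.6\n"
)
dim, vectors = load_text_embeddings(toy_file)
print(dim, sorted(vectors))
```

For a real file, pass `open(path, encoding="utf-8")` in place of the in-memory `toy_file`.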

Training corpus sample, encoded with different BPE vocabulary sizes

Vocab size    sowiki sample
original af-jarmal (''deutsch'', ''deutsche sprache'') waa afka lagagahadlo dalka jarmalka. afka's waana lagagahadlo dawladdaha kale ee austriya, switzerland,
israafiil waa beel ku abtirsato beesha karanle beesha israafiil waxeey caan ku yihiin nabada iyo xasiloonida labaatanki sano oo wadanka dagaalada ahli
wxa ay kutaalaa galbeedka koonfur ameerika xeebaha jili waxa ay ku fidsanyihiin, badweynta baasifik waxa ay xuduudo la wadaagtaa wadama peru oo waqoyi
1000 ▁af - j ar ma l ▁( '' d e u t s ch '' , ▁'' d e u t s c he ▁s p r ac he '' ) ▁waa ▁af ka ▁laga ga ha d lo ▁dalka ▁j ar ma lka . ▁af ka ' s ▁waana ▁laga ga ha d lo ▁dawlad daha ▁kale ▁ee ▁a us t ri ya , ▁s w it z er land ,
▁is r aaf iil ▁waa ▁b eel ▁ku ▁ab t ir sa to ▁beesha ▁kar an le ▁beesha ▁is r aaf iil ▁wax eey ▁caan ▁ku ▁yihiin ▁na bada ▁iyo ▁x as il oon ida ▁lab aa tan ki ▁sano ▁oo ▁wadanka ▁dagaa lada ▁ah li
▁w xa ▁ay ▁ku taa laa ▁galbeed ka ▁koonfur ▁am eer i ka ▁x ee baha ▁j ili ▁waxa ▁ay ▁ku ▁f id san y ihiin , ▁bad weyn ta ▁b aasi f i k ▁waxa ▁ay ▁x ud uud o ▁la ▁wad aag taa ▁wada ma ▁p er u ▁oo ▁waq o y i
3000 ▁af - j ar mal ▁( '' de ut s ch '', ▁'' de ut s c he ▁sp r ac he '') ▁waa ▁afka ▁lagaga had lo ▁dalka ▁jarmalka . ▁afka ' s ▁waana ▁lagaga had lo ▁dawlad daha ▁kale ▁ee ▁a ust riya , ▁s w it z er land ,
▁is r aaf iil ▁waa ▁beel ▁ku ▁ab tir sato ▁beesha ▁kar an le ▁beesha ▁is r aaf iil ▁wax eey ▁caan ▁ku ▁yihiin ▁nabada ▁iyo ▁xas il oon ida ▁lab aa tan ki ▁sano ▁oo ▁wadanka ▁dagaa lada ▁ah li
▁w xa ▁ay ▁ku taa laa ▁galbeedka ▁koonfur ▁ameerika ▁xeebaha ▁jili ▁waxa ▁ay ▁ku ▁fid san yihiin , ▁badweynta ▁b aasi f ik ▁waxa ▁ay ▁xuduud o ▁la ▁wadaag taa ▁wada ma ▁per u ▁oo ▁waq o y i
5000 ▁af - jar mal ▁('' de ut s ch '', ▁'' de ut s che ▁sp r ac he '') ▁waa ▁afka ▁lagaga had lo ▁dalka ▁jarmalka . ▁afka ' s ▁waana ▁lagaga had lo ▁dawlad daha ▁kale ▁ee ▁aust riya , ▁sw it z er land ,
▁is raaf iil ▁waa ▁beel ▁ku ▁abtir sato ▁beesha ▁kar an le ▁beesha ▁is raaf iil ▁waxeey ▁caan ▁ku ▁yihiin ▁nabada ▁iyo ▁xas il oon ida ▁labaatan ki ▁sano ▁oo ▁wadanka ▁dagaa lada ▁ah li
▁w xa ▁ay ▁kutaalaa ▁galbeedka ▁koonfur ▁ameerika ▁xeebaha ▁jili ▁waxa ▁ay ▁ku ▁fid san yihiin , ▁badweynta ▁baasif ik ▁waxa ▁ay ▁xuduud o ▁la ▁wadaag taa ▁wada ma ▁per u ▁oo ▁waq o yi
10000 ▁af - jar mal ▁('' de ut sch '', ▁'' de ut s che ▁sp rac he '') ▁waa ▁afka ▁lagaga had lo ▁dalka ▁jarmalka . ▁afka ' s ▁waana ▁lagaga had lo ▁dawlad daha ▁kale ▁ee ▁austriya , ▁sw it z er land ,
▁is raaf iil ▁waa ▁beel ▁ku ▁abtir sato ▁beesha ▁karan le ▁beesha ▁is raaf iil ▁waxeey ▁caan ▁ku ▁yihiin ▁nabada ▁iyo ▁xasiloon ida ▁labaatan ki ▁sano ▁oo ▁wadanka ▁dagaalada ▁ah li
▁w xa ▁ay ▁kutaalaa ▁galbeedka ▁koonfur ▁ameerika ▁xeebaha ▁jili ▁waxa ▁ay ▁ku ▁fidsan yihiin , ▁badweynta ▁baasifik ▁waxa ▁ay ▁xuduud o ▁la ▁wadaagtaa ▁wadama ▁per u ▁oo ▁waq o yi
25000 ▁af - jarmal ▁('' de utsch '', ▁'' de uts che ▁sp rac he '') ▁waa ▁afka ▁lagaga had lo ▁dalka ▁jarmalka . ▁afka ' s ▁waana ▁lagaga had lo ▁dawladdaha ▁kale ▁ee ▁austriya , ▁switzerland ,
▁is raaf iil ▁waa ▁beel ▁ku ▁abtirsato ▁beesha ▁karanle ▁beesha ▁is raaf iil ▁waxeey ▁caan ▁ku ▁yihiin ▁nabada ▁iyo ▁xasiloonida ▁labaatan ki ▁sano ▁oo ▁wadanka ▁dagaalada ▁ah li
▁wxa ▁ay ▁kutaalaa ▁galbeedka ▁koonfur ▁ameerika ▁xeebaha ▁jili ▁waxa ▁ay ▁ku ▁fidsan yihiin , ▁badweynta ▁baasifik ▁waxa ▁ay ▁xuduudo ▁la ▁wadaagtaa ▁wadama ▁peru ▁oo ▁waqo yi
50000 ▁af - jarmal ▁('' deutsch '', ▁'' de utsche ▁sp rac he '') ▁waa ▁afka ▁lagagahadlo ▁dalka ▁jarmalka . ▁afka ' s ▁waana ▁lagagahadlo ▁dawladdaha ▁kale ▁ee ▁austriya , ▁switzerland ,
▁israafiil ▁waa ▁beel ▁ku ▁abtirsato ▁beesha ▁karanle ▁beesha ▁israafiil ▁waxeey ▁caan ▁ku ▁yihiin ▁nabada ▁iyo ▁xasiloonida ▁labaatan ki ▁sano ▁oo ▁wadanka ▁dagaalada ▁ah li
▁wxa ▁ay ▁kutaalaa ▁galbeedka ▁koonfur ▁ameerika ▁xeebaha ▁jili ▁waxa ▁ay ▁ku ▁fidsan yihiin , ▁badweynta ▁baasifik ▁waxa ▁ay ▁xuduudo ▁la ▁wadaagtaa ▁wadama ▁peru ▁oo ▁waqoyi
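
The effect of vocabulary size can be read off the last word of the third sample, "waqoyi". The sketch below copies its segmentation from each row of the table, joins the pieces back into text, and counts them. The helper name `detokenize` is made up for this example; the only assumption is that `▁` marks the start of a word, as the samples suggest.

```python
WORD_START = "\u2581"  # the '▁' marker that prefixes word-initial pieces


def detokenize(pieces):
    """Join subword pieces and turn word-start markers back into spaces."""
    return "".join(pieces).replace(WORD_START, " ").strip()


# Segmentations of "waqoyi", copied from the sample table above.
segmentations = {
    1000: "▁waq o y i",
    3000: "▁waq o y i",
    5000: "▁waq o yi",
    10000: "▁waq o yi",
    25000: "▁waqo yi",
    50000: "▁waqoyi",
}

piece_counts = {}
for vocab_size, encoded in segmentations.items():
    pieces = encoded.split(" ")
    assert detokenize(pieces) == "waqoyi"  # every segmentation is lossless
    piece_counts[vocab_size] = len(pieces)

print(piece_counts)
# {1000: 4, 3000: 4, 5000: 3, 10000: 3, 25000: 2, 50000: 1}
```

Larger vocabularies keep more whole words as single pieces, while smaller ones fall back to shorter, more reusable fragments.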