Igbo (ig) subword embeddings

Vocab size  vocab  model  25 dim                 50 dim                 100 dim                200 dim                300 dim
1000        vocab  model  txt | bin              txt | bin              txt | bin              txt | bin              txt | bin
                          bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix
3000        vocab  model  txt | bin              txt | bin              txt | bin              txt | bin              txt | bin
                          bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix
5000        vocab  model  txt | bin              txt | bin              txt | bin              txt | bin              txt | bin
                          bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix
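The txt downloads can be read without any special tooling. The sketch below assumes they use the word2vec text layout (a header line with the entry count and dimension, then one subword and its vector per line); that layout is an assumption here, not something this page states. The function name `read_word2vec_text` and the toy values are hypothetical and only illustrate the parsing.

```python
import io


def read_word2vec_text(handle):
    """Parse word2vec-style text: a 'count dim' header, then 'token v1 ... v_dim' per line."""
    header = handle.readline().split()
    dim = int(header[1])  # header[0] is the declared number of entries
    vectors = {}
    for line in handle:
        parts = line.split()
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return dim, vectors


# Toy in-memory stand-in for a downloaded txt file (made-up numbers, not real embeddings).
toy_file = io.StringIO("2 3\n▁na 0.1 0.2 0.3\n▁ọrụ -0.4 0.5 0.6\n")
dim, vectors = read_word2vec_text(toy_file)
print(dim, sorted(vectors))
```

For a real file, the same function could be called on `open(path, encoding="utf-8")`, where `path` points at one of the downloaded txt files.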

Embedding matrix plots

Training corpus sample, encoded with different BPE vocabulary sizes

Vocab size    igwiki sample

original
  the chimamanda ngozi adichie website, english department of the university of liège, belgium, looked up on 00 maachị 0000. *
  chinụa achebe rụkwara ọrụ onye ''diplomat'' n'ọchịchị nke biafra site na 0000. mana site na 0000 ruo 0000, ọ kụzikwara na mahadum dị iche iche n'ala n
  habila bi na njikota obodo amerika n'obodo nke ''centreville, fairfax'', végíníyà. ọ na-arụ ọrụ na mahadum ''george mason'' nso washington, d.c.

1000
  ▁the ▁chi ma man da ▁n go zi ▁adi chi e ▁web si te , ▁eng lish ▁de p ar t ment ▁of ▁the ▁uni versi ty ▁of ▁li è ge , ▁be l gi um , ▁looked ▁up ▁on ▁00 ▁maachị ▁0000. ▁*
  ▁chi n ụ a ▁a che be ▁ rụ kwara ▁ọrụ ▁onye ▁'' di p lo ma t '' ▁n ' ọ chị chị ▁nke ▁bi af ra ▁site ▁na ▁0000. ▁mana ▁site ▁na ▁0000 ▁ru o ▁0000, ▁ọ ▁k ụ zi kwara ▁na ▁ma hadum ▁dị ▁iche ▁iche ▁n ' ala ▁n
  ▁ha bi la ▁bi ▁na ▁njikota ▁obodo ▁amerika ▁n ' o bodo ▁nke ▁'' c ent re vi l le , ▁fa ir f a x '', ▁v é g í n í y à . ▁ọ ▁na - ar ụ ▁ọrụ ▁na ▁ma hadum ▁'' ge or ge ▁ma s on '' ▁nso ▁wa sh ing ton , ▁d . c .

3000
  ▁the ▁chi ma man da ▁ngozi ▁adi chie ▁website , ▁english ▁de par t ment ▁of ▁the ▁university ▁of ▁li è ge , ▁belgi um , ▁looked ▁up ▁on ▁00 ▁maachị ▁0000. ▁*
  ▁chinụ a ▁achebe ▁rụ kwara ▁ọrụ ▁onye ▁'' di p lo mat '' ▁n ' ọ chịchị ▁nke ▁biafra ▁site ▁na ▁0000. ▁mana ▁site ▁na ▁0000 ▁ruo ▁0000, ▁ọ ▁k ụ zi kwara ▁na ▁mahadum ▁dị ▁iche ▁iche ▁n ' ala ▁n
  ▁ha bi la ▁bi ▁na ▁njikota ▁obodo ▁amerika ▁n ' obodo ▁nke ▁'' cent re vi l le , ▁fair fa x '', ▁v é g í ní y à . ▁ọ ▁na - ar ụ ▁ọrụ ▁na ▁mahadum ▁'' ge or ge ▁ma son '' ▁nso ▁washington , ▁d . c .

5000
  ▁the ▁chimamanda ▁ngozi ▁adichie ▁website , ▁english ▁de part ment ▁of ▁the ▁university ▁of ▁li è ge , ▁belgium , ▁looked ▁up ▁on ▁00 ▁maachị ▁0000. ▁*
  ▁chinụa ▁achebe ▁rụ kwara ▁ọrụ ▁onye ▁'' di plo mat '' ▁n ' ọ chịchị ▁nke ▁biafra ▁site ▁na ▁0000. ▁mana ▁site ▁na ▁0000 ▁ruo ▁0000, ▁ọ ▁k ụ zi kwara ▁na ▁mahadum ▁dị ▁iche ▁iche ▁n ' ala ▁n
  ▁habila ▁bi ▁na ▁njikota ▁obodo ▁amerika ▁n ' obodo ▁nke ▁'' cent revi lle , ▁fair fa x '', ▁v égí níyà . ▁ọ ▁na - arụ ▁ọrụ ▁na ▁mahadum ▁'' ge or ge ▁ma son '' ▁nso ▁washington , ▁d . c .
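The encoded rows above can be read back into text by hand: pieces are separated by spaces, and a leading "▁" marks a piece that starts a new word, as in SentencePiece-style output. The sketch below is a minimal illustration of that reading, not the tooling used to produce this page. The helper names `decode_pieces` and `count_pieces` are hypothetical, and the strings are copied from the first sample row.

```python
WORD_BOUNDARY = "\u2581"  # the "▁" marker that starts each word-initial piece


def decode_pieces(encoded):
    """Join space-separated subword pieces and turn word-boundary markers into spaces."""
    pieces = encoded.split()
    return "".join(pieces).replace(WORD_BOUNDARY, " ").strip()


def count_pieces(encoded):
    """Number of subword pieces in an encoded string."""
    return len(encoded.split())


# "chimamanda ngozi adichie" as segmented in the first sample row for each vocabulary size.
samples = {
    1000: "▁chi ma man da ▁n go zi ▁adi chi e",
    3000: "▁chi ma man da ▁ngozi ▁adi chie",
    5000: "▁chimamanda ▁ngozi ▁adichie",
}

for vocab_size, encoded in samples.items():
    print(vocab_size, count_pieces(encoded), decode_pieces(encoded))
```

All three segmentations decode to the same text, but the piece count drops from 10 to 7 to 3 as the vocabulary grows. A larger vocabulary yields shorter sequences at the cost of more embedding rows.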