Buginese (bug) subword embeddings

Vocab size 1000: vocab | model
    25 dim: txt | bin; bokeh | umap | matrix
    50 dim: txt | bin; bokeh | umap | matrix
    100 dim: txt | bin; bokeh | umap | matrix
    200 dim: txt | bin; bokeh | umap | matrix
    300 dim: txt | bin; bokeh | umap | matrix
Vocab size 3000: vocab | model
    25 dim: txt | bin; bokeh | umap | matrix
    50 dim: txt | bin; bokeh | umap | matrix
    100 dim: txt | bin; bokeh | umap | matrix
    200 dim: txt | bin; bokeh | umap | matrix
    300 dim: txt | bin; bokeh | umap | matrix
Vocab size 5000: vocab | model
    25 dim: txt | bin; bokeh | umap | matrix
    50 dim: txt | bin; bokeh | umap | matrix
    100 dim: txt | bin; bokeh | umap | matrix
    200 dim: txt | bin; bokeh | umap | matrix
    300 dim: txt | bin; bokeh | umap | matrix
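
Reading a txt embedding file can be sketched as follows. This assumes the txt downloads use the word2vec text format (a "count dim" header line, then one "token v1 ... v_dim" row per subword); this page does not state the format, so check a downloaded file before relying on it. The function name load_w2v_text and the three-row sample are hypothetical and only illustrate the layout; they are not real Buginese vectors.

```python
import io


def load_w2v_text(handle):
    """Parse word2vec text format from an open text handle.

    Assumed layout (not confirmed by this page):
      line 1:   "<number of tokens> <dimension>"
      line 2..: "<token> <v1> ... <v_dim>", separated by single spaces
    """
    count, dim = (int(field) for field in handle.readline().split())
    vectors = {}
    for line in handle:
        line = line.rstrip()
        if not line:
            continue
        # Split from the right so the last `dim` fields are always the values.
        fields = line.rsplit(" ", dim)
        if len(fields) != dim + 1:
            raise ValueError(f"expected a token and {dim} values, got: {line!r}")
        vectors[fields[0]] = [float(value) for value in fields[1:]]
    if len(vectors) != count:
        raise ValueError(f"header announced {count} rows, found {len(vectors)}")
    return dim, vectors


# Toy data for illustration only; real files hold one row per subword.
SAMPLE = "3 2\n▁tvri 0.1 -0.2\n▁aceh 0.3 0.4\nat 0.0 0.5\n"

dim, vectors = load_w2v_text(io.StringIO(SAMPLE))
print(dim, len(vectors), vectors["▁aceh"])
```

For a real download, the same function can be given `open(path, encoding="utf-8")` in place of the `io.StringIO` wrapper, where `path` points at the extracted txt file.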

Embedding matrix plots

Training corpus sample, encoded with different BPE vocabulary sizes

Vocab size    bugwiki sample

original:
* labview * lagoona * latex * lava * leda * lexico * limbo * linc * lingo * lisp * logo * lotusscript * lpc * lse * lua * lucid * lustre * lyapas
imagesize = width:000 height:000 plotarea = left:000 right:0 top:00 bottom:00 dateformat = yyyy timeaxis = orientation:horizontal format:yyyy
berikut adalah daftar tvri stasiun daerah: * sumatera: ** tvri aceh (banda aceh) ** tvri jambi (jambi) ** tvri sumatera barat (padang) ** tvri sumater

1000:
▁* ▁l ab vi ew ▁* ▁l ag o ona ▁* ▁l ate x ▁* ▁la v a ▁* ▁l ed a ▁* ▁le x ic o ▁* ▁l im b o ▁* ▁l in c ▁* ▁l ing o ▁* ▁l is p ▁* ▁logo ▁* ▁lo tu s s c ri pt ▁* ▁l p c ▁* ▁l se ▁* ▁l ua ▁* ▁l u c id ▁* ▁l ust re ▁* ▁l y ap as
▁im ag esi z e ▁ = ▁w id th : 000 ▁h e ig ht : 000 ▁p lo t ar e a ▁ = ▁le f t : 000 ▁ri g ht : 0 ▁t op : 00 ▁b ot t om : 00 ▁d ate f orm at ▁ = ▁y y y y ▁tim e a x is ▁ = ▁o ri ent ation : h ori z ont al ▁f orm at : y y y y
▁beri k ut ▁adalah ▁da f t ar ▁tvri ▁stasiun ▁da er ah : ▁* ▁sum at era : ▁** ▁tvri ▁aceh ▁( b and a ▁aceh ) ▁** ▁tvri ▁j amb i ▁( j amb i ) ▁** ▁tvri ▁sum at era ▁bar at ▁( p ad ang ) ▁** ▁tvri ▁sum at er

3000:
▁* ▁lab vi ew ▁* ▁l ag o ona ▁* ▁l ate x ▁* ▁la va ▁* ▁l eda ▁* ▁le x ic o ▁* ▁lim bo ▁* ▁lin c ▁* ▁l ing o ▁* ▁lis p ▁* ▁logo ▁* ▁lo tus script ▁* ▁l p c ▁* ▁l se ▁* ▁l ua ▁* ▁lu c id ▁* ▁l ust re ▁* ▁l y ap as
▁im ag esi ze ▁= ▁wid th :000 ▁he ight :000 ▁p lo t are a ▁= ▁le ft :000 ▁right :0 ▁top :00 ▁bot t om :00 ▁d ate form at ▁= ▁y y y y ▁tim e a x is ▁= ▁o ri ent ation : h ori z ont al ▁f orm at : y y y y
▁berikut ▁adalah ▁da ft ar ▁tvri ▁stasiun ▁daerah : ▁* ▁sumatera : ▁** ▁tvri ▁aceh ▁( b anda ▁aceh ) ▁** ▁tvri ▁jambi ▁( j ambi ) ▁** ▁tvri ▁sumatera ▁barat ▁( p ad ang ) ▁** ▁tvri ▁sumat er

5000:
▁* ▁lab view ▁* ▁l ago ona ▁* ▁late x ▁* ▁la va ▁* ▁l eda ▁* ▁le x ico ▁* ▁lim bo ▁* ▁lin c ▁* ▁ling o ▁* ▁lis p ▁* ▁logo ▁* ▁lo tus script ▁* ▁lp c ▁* ▁l se ▁* ▁l ua ▁* ▁lu c id ▁* ▁l ust re ▁* ▁l y ap as
▁im ag esi ze ▁= ▁width :000 ▁he ight :000 ▁plo tare a ▁= ▁left :000 ▁right :0 ▁top :00 ▁bottom :00 ▁d ate form at ▁= ▁y yyy ▁tim e axis ▁= ▁ori ent ation : hori z ont al ▁f orm at : yy yy
▁berikut ▁adalah ▁daftar ▁tvri ▁stasiun ▁daerah : ▁* ▁sumatera : ▁** ▁tvri ▁aceh ▁( b anda ▁aceh ) ▁** ▁tvri ▁jambi ▁( j ambi ) ▁** ▁tvri ▁sumatera ▁barat ▁( pad ang ) ▁** ▁tvri ▁sumat er
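
The effect of vocabulary size on segmentation can be checked with a few lines of Python. The sketch below takes the first three words of the third sample ("berikut adalah daftar") as they are segmented above, counts the pieces per vocabulary size, and joins them back into text. It assumes "▁" (U+2581) marks the start of a word, as in SentencePiece-style output; the page itself does not name the tokenizer. The helper decode_pieces is a hypothetical name used only for this illustration.

```python
# Segmentations copied from the sample rows above (first three words only).
SEGMENTATIONS = {
    1000: "▁beri k ut ▁adalah ▁da f t ar",
    3000: "▁berikut ▁adalah ▁da ft ar",
    5000: "▁berikut ▁adalah ▁daftar",
}

# Assumption: U+2581 marks a word start, as in SentencePiece-style output.
WORD_BOUNDARY = "\u2581"


def decode_pieces(pieces):
    """Join subword pieces and turn word-boundary markers back into spaces."""
    return "".join(pieces).replace(WORD_BOUNDARY, " ").strip()


piece_counts = {}
for vocab_size, encoded in SEGMENTATIONS.items():
    pieces = encoded.split(" ")
    piece_counts[vocab_size] = len(pieces)
    print(vocab_size, len(pieces), decode_pieces(pieces))
```

All three segmentations decode to the same text, while the piece count drops from 8 to 5 to 3 as the vocabulary grows: a larger vocabulary keeps more whole words and needs fewer pieces per sentence.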