Galician (gl) subword embeddings

Vocab size | Vocab / model | 25 dim | 50 dim | 100 dim | 200 dim | 300 dim
1000 | vocab, model | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix
3000 | vocab, model | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix
5000 | vocab, model | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix
10000 | vocab, model | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix
25000 | vocab, model | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix
50000 | vocab, model | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix
100000 | vocab, model | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix
200000 | vocab, model | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix
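
The txt downloads can be read without any special tooling. The sketch below is a minimal reader, assuming the txt files follow the word2vec text layout (a header line with token count and dimension, then one token and its vector per line); that layout is an assumption and is not stated on this page. The function names and the three-token, two-dimensional sample are hypothetical and only stand in for a real file.

```python
import io
import math
from typing import Dict, List, TextIO


def read_w2v_text(handle: TextIO) -> Dict[str, List[float]]:
    """Read embeddings from a word2vec-style text stream (assumed layout)."""
    header = handle.readline().split()
    count, dim = int(header[0]), int(header[1])
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != dim + 1:
            raise ValueError(f"expected {dim} values, got {len(parts) - 1}")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != count:
        raise ValueError(f"header promised {count} tokens, found {len(vectors)}")
    return vectors


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical stand-in for a downloaded file: 3 tokens, 2 dimensions.
SAMPLE = "3 2\n▁de 1.0 0.0\n▁e 0.0 1.0\n▁na 1.0 1.0\n"

vectors = read_w2v_text(io.StringIO(SAMPLE))
similarity = cosine(vectors["▁de"], vectors["▁na"])
```

For a real download, the same function would be called on a file opened with `encoding="utf-8"`, since the subword tokens contain the non-ASCII marker `▁`.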

Training corpus sample, encoded with different BPE vocabulary sizes

Vocab size | glwiki sample
original ''tijuana''baixa california ''san cristobal de las casas''chiapas ''porto vallarta''xalisco ''el tajín''veracruz ''morelia''michoacán
o esperanto ten cinco vogais. a lingua non fai distinción na lonxitude das vogais e tampouco existen vogais nasalizadas.
* carlos del álamo: conselleiro de medio ambiente na xunta de galicia entre 0000 e 0000. * francisco álvarez cascos: ministro de fomento entre 0000 e
1000 ▁'' ti j u ana '' ba ix a ▁c ali for n ia ▁'' s an ▁c ris to bal ▁de ▁l as ▁cas as '' ch ia p as ▁'' por to ▁val l ar ta '' x al is co ▁'' el ▁ta j ín '' v era c ru z ▁'' mo re l ia '' m ich o ac án
▁o ▁espe ran to ▁ten ▁c in co ▁vo ga is . ▁a ▁lingua ▁non ▁fa i ▁dis tin ción ▁na ▁lon xi tu de ▁das ▁vo ga is ▁e ▁ta mp ou co ▁ex isten ▁vo ga is ▁nas aliz adas .
▁* ▁car los ▁del ▁á la mo : ▁cons el le iro ▁de ▁me dio ▁a mb i ente ▁na ▁x un ta ▁de ▁galicia ▁entre ▁0000 ▁e ▁0000. ▁* ▁fran cis co ▁á l va re z ▁cas cos : ▁min is tro ▁de ▁fo mento ▁entre ▁0000 ▁e
3000 ▁'' ti j u ana '' ba ix a ▁cali for nia ▁'' san ▁cris to bal ▁de ▁las ▁cas as '' ch ia pas ▁'' por to ▁val lar ta '' x al is co ▁'' el ▁ta j ín '' v era c ru z ▁'' mo rel ia '' m ich o ac án
▁o ▁espe ran to ▁ten ▁cinco ▁vo ga is . ▁a ▁lingua ▁non ▁fai ▁distin ción ▁na ▁lonxitude ▁das ▁vo ga is ▁e ▁ta mp ou co ▁existen ▁vo ga is ▁nas aliz adas .
▁* ▁carlos ▁del ▁á la mo : ▁cons elle iro ▁de ▁medio ▁ambi ente ▁na ▁xun ta ▁de ▁galicia ▁entre ▁0000 ▁e ▁0000. ▁* ▁francisco ▁ál va rez ▁cas cos : ▁minis tro ▁de ▁fo mento ▁entre ▁0000 ▁e
5000 ▁'' ti ju ana '' ba ixa ▁california ▁'' san ▁cristo bal ▁de ▁las ▁casas '' ch ia pas ▁'' porto ▁val lar ta '' x al isco ▁'' el ▁ta j ín '' vera cru z ▁'' mo rel ia '' m ich o ac án
▁o ▁espe ran to ▁ten ▁cinco ▁vo ga is . ▁a ▁lingua ▁non ▁fai ▁distin ción ▁na ▁lonxitude ▁das ▁vo ga is ▁e ▁ta mp ou co ▁existen ▁vo ga is ▁nas aliz adas .
▁* ▁carlos ▁del ▁á la mo : ▁cons elle iro ▁de ▁medio ▁ambiente ▁na ▁xunta ▁de ▁galicia ▁entre ▁0000 ▁e ▁0000. ▁* ▁francisco ▁álvarez ▁cas cos : ▁ministro ▁de ▁fo mento ▁entre ▁0000 ▁e
10000 ▁'' ti ju ana '' ba ixa ▁california ▁'' san ▁cristo bal ▁de ▁las ▁casas '' ch ia pas ▁'' porto ▁val lar ta '' x al isco ▁'' el ▁ta j ín '' vera cru z ▁'' mo rel ia '' m ich o ac án
▁o ▁espe ran to ▁ten ▁cinco ▁vo gais . ▁a ▁lingua ▁non ▁fai ▁distin ción ▁na ▁lonxitude ▁das ▁vo gais ▁e ▁tampouco ▁existen ▁vo gais ▁nas aliz adas .
▁* ▁carlos ▁del ▁á la mo : ▁conselleiro ▁de ▁medio ▁ambiente ▁na ▁xunta ▁de ▁galicia ▁entre ▁0000 ▁e ▁0000. ▁* ▁francisco ▁álvarez ▁cas cos : ▁ministro ▁de ▁fo mento ▁entre ▁0000 ▁e
25000 ▁'' ti ju ana '' ba ixa ▁california ▁'' san ▁cristo bal ▁de ▁las ▁casas '' chia pas ▁'' porto ▁val lar ta '' xal isco ▁'' el ▁ta j ín '' vera cruz ▁'' mo rel ia '' mich o ac án
▁o ▁esperan to ▁ten ▁cinco ▁vogais . ▁a ▁lingua ▁non ▁fai ▁distinción ▁na ▁lonxitude ▁das ▁vogais ▁e ▁tampouco ▁existen ▁vogais ▁nas alizadas .
▁* ▁carlos ▁del ▁á lamo : ▁conselleiro ▁de ▁medio ▁ambiente ▁na ▁xunta ▁de ▁galicia ▁entre ▁0000 ▁e ▁0000. ▁* ▁francisco ▁álvarez ▁cas cos : ▁ministro ▁de ▁fomento ▁entre ▁0000 ▁e
50000 ▁'' ti juana '' baixa ▁california ▁'' san ▁cristo bal ▁de ▁las ▁casas '' chia pas ▁'' porto ▁val lar ta '' xal isco ▁'' el ▁taj ín '' vera cruz ▁'' mo relia '' mich o acán
▁o ▁esperanto ▁ten ▁cinco ▁vogais . ▁a ▁lingua ▁non ▁fai ▁distinción ▁na ▁lonxitude ▁das ▁vogais ▁e ▁tampouco ▁existen ▁vogais ▁nas alizadas .
▁* ▁carlos ▁del ▁álamo : ▁conselleiro ▁de ▁medio ▁ambiente ▁na ▁xunta ▁de ▁galicia ▁entre ▁0000 ▁e ▁0000. ▁* ▁francisco ▁álvarez ▁cascos : ▁ministro ▁de ▁fomento ▁entre ▁0000 ▁e
100000 ▁'' ti juana '' baixa ▁california ▁'' san ▁cristobal ▁de ▁las ▁casas '' chia pas ▁'' porto ▁val lar ta '' xal isco ▁'' el ▁taj ín '' vera cruz ▁'' mo relia '' mich o acán
▁o ▁esperanto ▁ten ▁cinco ▁vogais . ▁a ▁lingua ▁non ▁fai ▁distinción ▁na ▁lonxitude ▁das ▁vogais ▁e ▁tampouco ▁existen ▁vogais ▁nas alizadas .
▁* ▁carlos ▁del ▁álamo : ▁conselleiro ▁de ▁medio ▁ambiente ▁na ▁xunta ▁de ▁galicia ▁entre ▁0000 ▁e ▁0000. ▁* ▁francisco ▁álvarez ▁cascos : ▁ministro ▁de ▁fomento ▁entre ▁0000 ▁e
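
The samples above show pieces growing longer as the vocabulary grows: "vogais" goes from `▁vo ga is` to `▁vo gais` to `▁vogais`. A minimal sketch of why, assuming a simplified BPE encoder that applies learned merges in priority order. The merge list here is hypothetical, chosen only to reproduce those three segmentations; it is not taken from the actual models, and `apply_bpe` is an illustrative name.

```python
from typing import List, Tuple


def apply_bpe(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Segment one word by applying merge rules in priority order."""
    # Start from single characters, with the word-boundary marker in front.
    symbols = list("▁" + word)
    for left, right in merges:
        merged: List[str] = []
        i = 0
        while i < len(symbols):
            # Fuse every adjacent (left, right) pair into one symbol.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge rules, earliest (most frequent) first.
MERGES = [
    ("▁", "v"),
    ("▁v", "o"),
    ("g", "a"),
    ("i", "s"),
    ("ga", "is"),
    ("▁vo", "gais"),
]

small = apply_bpe("vogais", MERGES[:4])   # fewer merges, shorter pieces
medium = apply_bpe("vogais", MERGES[:5])
large = apply_bpe("vogais", MERGES)       # all merges, whole word
```

A larger vocabulary corresponds to a longer merge list, so frequent words end up as single pieces while rare strings such as "michoacán" stay split.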