Scottish Gaelic (gd) subword embeddings

Vocab size  vocab  model  25 dim                 50 dim                 100 dim                200 dim                300 dim
1000        vocab  model  txt | bin              txt | bin              txt | bin              txt | bin              txt | bin
                          bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix
3000        vocab  model  txt | bin              txt | bin              txt | bin              txt | bin              txt | bin
                          bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix
5000        vocab  model  txt | bin              txt | bin              txt | bin              txt | bin              txt | bin
                          bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix
10000       vocab  model  txt | bin              txt | bin              txt | bin              txt | bin              txt | bin
                          bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix
25000       vocab  model  txt | bin              txt | bin              txt | bin              txt | bin              txt | bin
                          bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix
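
The txt downloads can be read without any special tooling if they follow the common word2vec text layout: a header line with the entry count and the dimension, then one line per subword holding the token and its vector components. That layout is an assumption here, since the page itself does not describe the file format. The function name `read_text_embeddings` and the miniature sample data are hypothetical, made up for illustration only.

```python
import io


def read_text_embeddings(handle):
    """Parse word2vec-style text (assumed layout, not confirmed by this page).

    First line: '<count> <dim>'. Every following line: '<token> <v1> ... <vdim>'.
    """
    header = handle.readline().split()
    count, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return count, dim, vectors


# Made-up miniature file; the numbers are NOT real embedding values.
sample = io.StringIO("3 2\n▁tha 0.1 0.2\n▁an -0.3 0.4\nair 0.5 -0.6\n")
count, dim, vectors = read_text_embeddings(sample)
print(count, dim, vectors["▁tha"])
```

For a real file, the same function could be given a handle opened with `open(path, encoding="utf-8")`, where `path` points at one of the downloaded txt files.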

Training corpus sample, encoded with different BPE vocabulary sizes

Vocab size  gdwiki sample
original * mac griogair, an t-urramach alasdair. (0000) 'air cruinn-mheallaibh soillseach nan speur' ann an ''rosg gàidhlig: specimens of gaelic prose'' w. j.
tha atlantis (greugais: ἀτλαντὶς νῆσος, "eilean atlais") na tìr uirsgeulach, a chreid cuid a bha anns a' chuan siar. se plato a sgriobh mu dheidhinn b
tha ''yossi & jagger'' (eabhra: יוסי וג'אגר) na film 0000 israelach air sheòladh le eytan fox, mu dà saighdeir, yossi agus lior (a bheil ainmichte "ja
1000 ▁* ▁mac ▁g ri og air , ▁an ▁t - ur ra mach ▁al as d air . ▁(0000) ▁' air ▁c ruinn - mh eall aibh ▁s o ills each ▁nan ▁sp eur ' ▁ann ▁an ▁'' ro sg ▁gàidhlig : ▁sp e c im en s ▁of ▁ga e li c ▁p ro se '' ▁w . ▁j .
▁tha ▁a t lan tis ▁( g re ug ais : ▁ ἀ τ λ α ν τ ὶ ς ▁ ν ῆ σ ο ς , ▁" eil ean ▁a t lais " ) ▁na ▁t ìr ▁ uir s ge ul ach , ▁a ▁ch re id ▁cuid ▁a ▁bha ▁anns ▁a ' ▁ch u an ▁s iar . ▁se ▁p la to ▁a ▁sg ri obh ▁mu ▁dhe idh inn ▁b
▁tha ▁'' y os s i ▁ & ▁j ag g er '' ▁( eabh ra : ▁ י ו ס י ▁ ו ג ' אג ר ) ▁na ▁f il m ▁0000 ▁is ra el ach ▁air ▁sh e òl adh ▁le ▁e y tan ▁fo x , ▁mu ▁dà ▁s aigh de ir , ▁y os s i ▁agus ▁l ior ▁( a ▁bheil ▁ainm ichte ▁" j a
3000 ▁* ▁mac ▁g riog air , ▁an ▁t - ur ra mach ▁alasdair . ▁(0000) ▁' air ▁cruinn - mh eall aibh ▁so ills each ▁nan ▁sp eur ' ▁ann ▁an ▁'' ro sg ▁gàidhlig : ▁sp e c im ens ▁of ▁gaelic ▁pro se '' ▁w . ▁j .
▁tha ▁at lan tis ▁( gre ugais : ▁ ἀ τ λ α ν τ ὶ ς ▁ ν ῆ σ ο ς , ▁" eilean ▁at lais ") ▁na ▁tìr ▁ uir sgeul ach , ▁a ▁ch reid ▁cuid ▁a ▁bha ▁anns ▁a ' ▁chuan ▁siar . ▁se ▁pla to ▁a ▁sg ri obh ▁mu ▁dheidhinn ▁b
▁tha ▁'' y os si ▁& ▁j ag ger '' ▁( eabh ra : ▁ י ו ס י ▁ ו ג ' אג ר ) ▁na ▁film ▁0000 ▁is ra el ach ▁air ▁sh eòl adh ▁le ▁e y tan ▁fo x , ▁mu ▁dà ▁s aigh de ir , ▁y os si ▁agus ▁l ior ▁( a ▁bheil ▁ainm ichte ▁" ja
5000 ▁* ▁mac ▁g riogair , ▁an ▁t - ur ramach ▁alasdair . ▁(0000) ▁' air ▁cruinn - mh eall aibh ▁so ills each ▁nan ▁speur ' ▁ann ▁an ▁'' ro sg ▁gàidhlig : ▁sp ec im ens ▁of ▁gaelic ▁pro se '' ▁w . ▁j .
▁tha ▁at lan tis ▁( greugais : ▁ ἀ τ λ α ν τ ὶ ς ▁ ν ῆ σ ο ς , ▁" eilean ▁at lais ") ▁na ▁tìr ▁ uir sgeul ach , ▁a ▁chreid ▁cuid ▁a ▁bha ▁anns ▁a ' ▁chuan ▁siar . ▁se ▁pla to ▁a ▁sg ri obh ▁mu ▁dheidhinn ▁b
▁tha ▁'' y os si ▁& ▁j ag ger '' ▁( eabh ra : ▁ י ו ס י ▁ ו ג ' אג ר ) ▁na ▁film ▁0000 ▁is rael ach ▁air ▁sh eòl adh ▁le ▁e y tan ▁fo x , ▁mu ▁dà ▁s aigh de ir , ▁y os si ▁agus ▁l ior ▁( a ▁bheil ▁ainmichte ▁" ja
10000 ▁* ▁mac ▁griogair , ▁an ▁t - urramach ▁alasdair . ▁(0000) ▁' air ▁cruinn - mh eall aibh ▁so ills each ▁nan ▁speur ' ▁ann ▁an ▁'' ro sg ▁gàidhlig : ▁sp ec im ens ▁of ▁gaelic ▁pro se '' ▁w . ▁j .
▁tha ▁atlan tis ▁( greugais : ▁ ἀ τ λ α ν τ ὶ ς ▁ ν ῆ σ ος , ▁" eilean ▁at lais ") ▁na ▁tìr ▁uir sgeulach , ▁a ▁chreid ▁cuid ▁a ▁bha ▁anns ▁a ' ▁chuan ▁siar . ▁se ▁pla to ▁a ▁sgriobh ▁mu ▁dheidhinn ▁b
▁tha ▁'' y os si ▁& ▁jag ger '' ▁( eabhra : ▁ י ו ס י ▁ ו ג ' אג ר ) ▁na ▁film ▁0000 ▁is rael ach ▁air ▁sheòl adh ▁le ▁e y tan ▁fox , ▁mu ▁dà ▁s aigh de ir , ▁y os si ▁agus ▁l ior ▁( a ▁bheil ▁ainmichte ▁" ja
25000 ▁* ▁mac ▁griogair , ▁an ▁t - urramach ▁alasdair . ▁(0000) ▁' air ▁cruinn - mh eall aibh ▁so ills each ▁nan ▁speur ' ▁ann ▁an ▁'' rosg ▁gàidhlig : ▁spec im ens ▁of ▁gaelic ▁prose '' ▁w . ▁j .
▁tha ▁atlan tis ▁( greugais : ▁ ἀ τ λ αν τ ὶ ς ▁ ν ῆ σ ος , ▁" eilean ▁at lais ") ▁na ▁tìr ▁uir sgeulach , ▁a ▁chreid ▁cuid ▁a ▁bha ▁anns ▁a ' ▁chuan ▁siar . ▁se ▁plato ▁a ▁sgriobh ▁mu ▁dheidhinn ▁b
▁tha ▁'' y ossi ▁& ▁jagger '' ▁( eabhra : ▁י ו ס י ▁ ו ג ' אג ר ) ▁na ▁film ▁0000 ▁israel ach ▁air ▁sheòl adh ▁le ▁ey tan ▁fox , ▁mu ▁dà ▁saigh deir , ▁y ossi ▁agus ▁lior ▁( a ▁bheil ▁ainmichte ▁" ja
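
The rows above can be checked with a small sketch: joining the pieces and turning the "▁" marker back into a space should recover the original text, and larger vocabularies should need fewer pieces for the same sentence. This assumes "▁" marks the start of a word, as in the SentencePiece convention. The helper name `decode_pieces` is hypothetical, and the piece lists are copied from the opening of the first sample in the 1000 and 10000 rows.

```python
def decode_pieces(pieces):
    """Join subword pieces and map the '▁' word-start marker back to spaces."""
    return "".join(pieces).replace("▁", " ").strip()


# Opening of the first sample sentence, as segmented in the table above.
pieces_1000 = "▁* ▁mac ▁g ri og air , ▁an ▁t - ur ra mach ▁al as d air .".split()
pieces_10000 = "▁* ▁mac ▁griogair , ▁an ▁t - urramach ▁alasdair .".split()

text_1000 = decode_pieces(pieces_1000)
text_10000 = decode_pieces(pieces_10000)

print(text_1000)
print(text_1000 == text_10000)
print(len(pieces_1000), len(pieces_10000))
```

Both segmentations decode to the same string, which matches the start of the "original" row; only the number of pieces differs between vocabulary sizes.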