Hawaiian (haw) subword embeddings

Vocab size | vocab | model | 25 dim | 50 dim | 100 dim | 200 dim | 300 dim
1000 | vocab | model | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix
3000 | vocab | model | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix
5000 | vocab | model | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix
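
The txt downloads can be read without any extra library. The sketch below assumes they use the word2vec text layout (a header line with the entry count and the dimension, then one subword piece per line followed by its vector); that layout is an assumption and is not stated on this page. The function name read_w2v_text and the toy two-dimensional data are hypothetical and stand in for a real file.

```python
import io


def read_w2v_text(handle):
    """Parse word2vec-style text: a 'count dim' header, then 'piece v1 ... vN' rows.

    ASSUMPTION: the txt files follow this layout; check a real file first.
    """
    count, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed rows
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim, count


# Toy stand-in for a downloaded file (made-up values, 2 dimensions only).
toy_file = io.StringIO(
    "3 2\n"
    "\u2581ua 0.1 0.2\n"
    "\u2581noho 0.3 0.4\n"
    "\u02bb 0.5 0.6\n"
)

vectors, dim, count = read_w2v_text(toy_file)
print(dim, count, sorted(vectors))
```

For a real download, the same function would be called on a file opened with encoding="utf-8", since the pieces contain non-ASCII characters such as ʻ and ▁.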

Embedding matrix plots

Training corpus sample, encoded with different BPE vocabulary sizes

Vocab size | hawwiki sample
original ua noho aliʻi nō 00 makahiki a me 000 lā, ʻoiai i ka mahele mua o kona noho aliʻi i lalo ka noho aliʻi na kuini kaʻahumanu ā mahope aku na kuini kaʻah
{| border=0 align=right cellpadding=0 cellspacing=0 width=000 style="margin: 0 0 0em 0em; background: #f0f0f0; border: 0px #aaaaaa solid; border-colla
‘o ''‘imi loa dora'' (''dora the explorer'') ke pūkaʻina polokalamu kīwī no kamali’i pēlā ho’olaha i nickelodeon i ka ‘amelika hui pū ‘ia. la kuhina n
1000 ▁ua ▁noho ▁ali ʻ i ▁nō ▁00 ▁makahiki ▁a ▁me ▁000 ▁lā , ▁ʻ oia i ▁i ▁ka ▁ma hele ▁mua ▁o ▁kona ▁noho ▁ali ʻ i ▁i ▁la lo ▁ka ▁noho ▁ali ʻ i ▁na ▁k ui ni ▁ka ʻ ahu man u ▁ā ▁mahope ▁aku ▁na ▁k ui ni ▁ka ʻ a h
▁{| ▁border =0 ▁align = right ▁cellpadding =0 ▁cellspacing =0 ▁width =000 ▁style =" margin : ▁0 ▁0 ▁0 em ▁0 em ; ▁background : ▁# f 0 f 0 f 0; ▁border : ▁0 px ▁# aaaaaa ▁solid ; ▁border - c ol la
▁‘ o ▁'' ‘ i mi ▁loa ▁do ra '' ▁('' do ra ▁the ▁e x p lo r er '') ▁ke ▁pū ka ʻ ina ▁polo kala mu ▁kī w ī ▁no ▁kama li ’ i ▁p ē lā ▁ho ’ ola ha ▁i ▁n ic ke lo de on ▁i ▁ka ▁‘ amelika ▁hui ▁pū ▁‘ ia . ▁la ▁ku hina ▁n
3000 ▁ua ▁noho ▁ali ʻ i ▁nō ▁00 ▁makahiki ▁a ▁me ▁000 ▁lā , ▁ʻ oiai ▁i ▁ka ▁mahele ▁mua ▁o ▁kona ▁noho ▁ali ʻ i ▁i ▁lalo ▁ka ▁noho ▁ali ʻ i ▁na ▁kuini ▁ka ʻ ahu manu ▁ā ▁mahope ▁aku ▁na ▁kuini ▁ka ʻ a h
▁{| ▁border =0 ▁align = right ▁cellpadding =0 ▁cellspacing =0 ▁width =000 ▁style =" margin : ▁0 ▁0 ▁0 em ▁0 em ; ▁background : ▁# f 0 f 0 f 0; ▁border : ▁0 px ▁# aaaaaa ▁solid ; ▁border - c ol la
▁‘ o ▁'' ‘ imi ▁loa ▁dora '' ▁('' do ra ▁the ▁ex p lor er '') ▁ke ▁pūka ʻ ina ▁polokalamu ▁kīwī ▁no ▁kamali ’ i ▁pēlā ▁ho ’ ola ha ▁i ▁nickelodeon ▁i ▁ka ▁‘ amelika ▁hui ▁pū ▁‘ ia . ▁la ▁kuhina ▁n
5000 ▁ua ▁noho ▁ali ʻ i ▁nō ▁00 ▁makahiki ▁a ▁me ▁000 ▁lā , ▁ʻ oiai ▁i ▁ka ▁mahele ▁mua ▁o ▁kona ▁noho ▁ali ʻ i ▁i ▁lalo ▁ka ▁noho ▁ali ʻ i ▁na ▁kuini ▁ka ʻ ahumanu ▁ā ▁mahope ▁aku ▁na ▁kuini ▁ka ʻ a h
▁{| ▁border =0 ▁align = right ▁cellpadding =0 ▁cellspacing =0 ▁width =000 ▁style =" margin : ▁0 ▁0 ▁0 em ▁0 em ; ▁background : ▁# f 0 f 0 f 0; ▁border : ▁0 px ▁# aaaaaa ▁solid ; ▁border - c ol la
▁‘ o ▁''‘ imi ▁loa ▁dora '' ▁('' do ra ▁the ▁exp lorer '') ▁ke ▁pūka ʻ ina ▁polokalamu ▁kīwī ▁no ▁kamali ’ i ▁pēlā ▁ho ’ ola ha ▁i ▁nickelodeon ▁i ▁ka ▁‘ amelika ▁hui ▁pū ▁‘ ia . ▁la ▁kuhina ▁n
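
To make the encoded rows in the table above easier to read, here is a minimal sketch of how the pieces map back to text. It relies on the usual convention that ▁ (U+2581) marks the start of a word; the helper names decode_pieces and pieces_per_word are hypothetical, and the piece lists are copied from the sample rows rather than produced by a trained model.

```python
# Marker that flags the start of a new word in the encoded samples (U+2581).
WORD_START = "\u2581"


def decode_pieces(pieces):
    """Join subword pieces and turn word-start markers back into spaces."""
    return "".join(pieces).replace(WORD_START, " ").strip()


def pieces_per_word(pieces):
    """Group pieces into words: a new group starts at every marked piece."""
    words = []
    for piece in pieces:
        if piece.startswith(WORD_START) or not words:
            words.append([piece])
        else:
            words[-1].append(piece)
    return words


# First pieces of the first sample line (identical for all three vocabularies).
sample = "▁ua ▁noho ▁ali ʻ i ▁nō ▁00 ▁makahiki".split()

# "kuini" as segmented with the 1000 and the 3000 vocabulary in the table.
kuini_1000 = "▁k ui ni".split()
kuini_3000 = "▁kuini".split()

print(decode_pieces(sample))
print(pieces_per_word(sample))
print(len(kuini_1000), len(kuini_3000))
```

Both segmentations of "kuini" decode to the same string; the larger vocabulary simply covers the word with fewer pieces, which is the pattern visible when comparing the 1000, 3000 and 5000 rows.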