Inuktitut (iu) subword embeddings

Vocab size  vocab  model  25 dim                 50 dim                 100 dim                200 dim                300 dim
1000        vocab  model  txt | bin              txt | bin              txt | bin              txt | bin              txt | bin
                          bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix
3000        vocab  model  txt | bin              txt | bin              txt | bin              txt | bin              txt | bin
                          bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix
5000        vocab  model  txt | bin              txt | bin              txt | bin              txt | bin              txt | bin
                          bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix
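
The txt downloads can be read without any special tooling if they follow the common word2vec text layout. That layout is an assumption here, not something this page states: a header line with the number of vectors and the dimension, then one subword and its vector per line. The sketch below parses such a file from any text handle. The function name read_text_embeddings and the three-entry sample with dummy 4-dimensional values are illustrative only and are not taken from the real files.

```python
import io
from typing import Dict, List, TextIO


def read_text_embeddings(handle: TextIO) -> Dict[str, List[float]]:
    """Parse word2vec-style text embeddings (assumed layout, see lead-in)."""
    header = handle.readline().split()
    count, dim = int(header[0]), int(header[1])

    vectors: Dict[str, List[float]] = {}
    for line in handle:
        line = line.rstrip()
        if not line:
            continue  # tolerate blank lines at the end of the file
        parts = line.split(" ")
        if len(parts) != dim + 1:
            raise ValueError(f"expected {dim} values, got {len(parts) - 1}")
        # First field is the subword (may start with the "▁" marker).
        vectors[parts[0]] = [float(value) for value in parts[1:]]

    if len(vectors) != count:
        raise ValueError(f"header announced {count} vectors, read {len(vectors)}")
    return vectors


# Dummy data for illustration only: 3 subwords, 4 dimensions.
SAMPLE = (
    "3 4\n"
    "▁the 0.1 0.2 0.3 0.4\n"
    "▁of 0.0 -0.1 0.2 0.5\n"
    "ing 0.3 0.3 -0.2 0.1\n"
)

embeddings = read_text_embeddings(io.StringIO(SAMPLE))
print(len(embeddings), "vectors of dimension", len(embeddings["▁the"]))
```

For a downloaded file, the same function can be called on a handle opened with open(path, encoding="utf-8"), provided the layout assumption holds.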

Embedding matrix plots

Training corpus sample, encoded with different BPE vocabulary sizes

Vocab size  iuwiki sample
original * admins can change the ''text'' of the interface by editing the pages in the mediawiki namespace. this includes the text at the top of pages such as
ᐃᓐᑎᐋᓈ—ᖃᓪᓗᓈᑎᑐᑦ—-{indiana}-— 0,000,000 (0000) ᐃᑎᐊᔪᑦ ᐃᓗᐊᓂ. ᐃᓐᑎᐋᓈ ᐃᓄᖁᑎ ᐊᒥᐊᓕᑲ. ᐊᐅᓚᑦᑎᔩᑦ ᓯᕗᓕᖅᑎᖓᑦ-ᓄᓇᓖᑦ ᐃᓐᑎᐋᓈᐳᓕᔅ «ᖃᓪᓗᓈᑎᑐᑦ--{indianapolis}-» ᓄᓇᓖᑦ
{| border=0 align=right cellpadding=0 cellspacing=0 width=000 style="margin: 0 0 0em 0em; background: #f0f0f0; border: 0px #aaaaaa solid; border-colla
1000 ▁* ▁admins ▁can ▁chang e ▁the ▁'' te x t '' ▁of ▁the ▁inter f ac e ▁by ▁edit ing ▁the ▁pages ▁in ▁the ▁m edia w ik i ▁n am es p ac e . ▁this ▁inc lu d es ▁the ▁t e x t ▁at ▁the ▁to p ▁of ▁pages ▁s u ch ▁a s
▁ᐃ ᓐ ᑎ ᐋ ᓈ — ᖃᓪᓗᓈᑎᑐᑦ —-{ in d ian a }-— ▁0,000,000 ▁(0000) ▁ᐃᑎᐊᔪᑦ ▁ᐃᓗᐊᓂ . ▁ᐃ ᓐ ᑎ ᐋ ᓈ ▁ᐃᓄᖁᑎ ▁ᐊᒥᐊᓕᑲ . ▁ᐊᐅᓚᑦᑎᔩᑦ ▁ᓯᕗᓕᖅᑎᖓᑦ - ᓄᓇᓖᑦ ▁ᐃ ᓐ ᑎ ᐋ ᓈ ᐳ ᓕ ᔅ ▁« ᖃᓪᓗᓈᑎᑐᑦ --{ in d ian ap ol is }-» ▁ᓄᓇᓖᑦ
▁{| ▁border =0 ▁align = right ▁cellpadding =0 ▁cellspacing =0 ▁width =000 ▁style =" margin : ▁0 ▁0 ▁0 em ▁0 em ; ▁background : ▁# f 0 f 0 f 0; ▁border : ▁0 px ▁# aaaaaa ▁solid ; ▁border - c oll a
3000 ▁* ▁admins ▁can ▁change ▁the ▁'' text '' ▁of ▁the ▁interface ▁by ▁editing ▁the ▁pages ▁in ▁the ▁mediawiki ▁n am es p ace . ▁this ▁inclu des ▁the ▁t ext ▁at ▁the ▁top ▁of ▁pages ▁su ch ▁as
▁ᐃᓐᑎᐋᓈ — ᖃᓪᓗᓈᑎᑐᑦ —-{ ind iana }-— ▁0,000,000 ▁(0000) ▁ᐃᑎᐊᔪᑦ ▁ᐃᓗᐊᓂ . ▁ᐃᓐᑎᐋᓈ ▁ᐃᓄᖁᑎ ▁ᐊᒥᐊᓕᑲ . ▁ᐊᐅᓚᑦᑎᔩᑦ ▁ᓯᕗᓕᖅᑎᖓᑦ - ᓄᓇᓖᑦ ▁ᐃᓐᑎᐋᓈ ᐳᓕᔅ ▁« ᖃᓪᓗᓈᑎᑐᑦ --{ ind ian apol is }-» ▁ᓄᓇᓖᑦ
▁{| ▁border =0 ▁align = right ▁cellpadding =0 ▁cellspacing =0 ▁width =000 ▁style =" margin : ▁0 ▁0 ▁0 em ▁0 em ; ▁background : ▁# f 0 f 0 f 0; ▁border : ▁0 px ▁# aaaaaa ▁solid ; ▁border - c oll a
5000 ▁* ▁admins ▁can ▁change ▁the ▁'' text '' ▁of ▁the ▁interface ▁by ▁editing ▁the ▁pages ▁in ▁the ▁mediawiki ▁n am es p ace . ▁this ▁includes ▁the ▁text ▁at ▁the ▁top ▁of ▁pages ▁such ▁as
▁ᐃᓐᑎᐋᓈ — ᖃᓪᓗᓈᑎᑐᑦ —-{ indiana }-— ▁0,000,000 ▁(0000) ▁ᐃᑎᐊᔪᑦ ▁ᐃᓗᐊᓂ . ▁ᐃᓐᑎᐋᓈ ▁ᐃᓄᖁᑎ ▁ᐊᒥᐊᓕᑲ . ▁ᐊᐅᓚᑦᑎᔩᑦ ▁ᓯᕗᓕᖅᑎᖓᑦ - ᓄᓇᓖᑦ ▁ᐃᓐᑎᐋᓈ ᐳᓕᔅ ▁« ᖃᓪᓗᓈᑎᑐᑦ --{ indian apolis }-» ▁ᓄᓇᓖᑦ
▁{| ▁border =0 ▁align = right ▁cellpadding =0 ▁cellspacing =0 ▁width =000 ▁style =" margin : ▁0 ▁0 ▁0 em ▁0 em ; ▁background : ▁# f 0 f 0 f 0; ▁border : ▁0 px ▁# aaaaaa ▁solid ; ▁border - c oll a
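
The effect of vocabulary size in the table above can be checked with a few lines of Python. The sketch below copies the three segmentations of the English sample line from the table, counts the pieces for each vocabulary size, and rebuilds the original text by joining the pieces and turning each "▁" word-boundary marker back into a space. It assumes that pieces are separated by single spaces and contain no spaces themselves, which holds for this sample. The helper names decode_pieces and piece_count are illustrative.

```python
# Segmentations of the first sample line, copied from the table above.
SEGMENTED = {
    1000: "▁* ▁admins ▁can ▁chang e ▁the ▁'' te x t '' ▁of ▁the ▁inter f ac e "
          "▁by ▁edit ing ▁the ▁pages ▁in ▁the ▁m edia w ik i ▁n am es p ac e . "
          "▁this ▁inc lu d es ▁the ▁t e x t ▁at ▁the ▁to p ▁of ▁pages ▁s u ch ▁a s",
    3000: "▁* ▁admins ▁can ▁change ▁the ▁'' text '' ▁of ▁the ▁interface "
          "▁by ▁editing ▁the ▁pages ▁in ▁the ▁mediawiki ▁n am es p ace . "
          "▁this ▁inclu des ▁the ▁t ext ▁at ▁the ▁top ▁of ▁pages ▁su ch ▁as",
    5000: "▁* ▁admins ▁can ▁change ▁the ▁'' text '' ▁of ▁the ▁interface "
          "▁by ▁editing ▁the ▁pages ▁in ▁the ▁mediawiki ▁n am es p ace . "
          "▁this ▁includes ▁the ▁text ▁at ▁the ▁top ▁of ▁pages ▁such ▁as",
}

ORIGINAL = (
    "* admins can change the ''text'' of the interface by editing the pages "
    "in the mediawiki namespace. this includes the text at the top of pages such as"
)


def piece_count(segmented: str) -> int:
    """Number of subword pieces in a space-separated segmentation."""
    return len(segmented.split(" "))


def decode_pieces(segmented: str) -> str:
    """Join pieces and map the '▁' boundary marker back to a space."""
    return "".join(segmented.split(" ")).replace("▁", " ").strip()


for vocab_size, segmented in SEGMENTED.items():
    round_trip_ok = decode_pieces(segmented) == ORIGINAL
    print(vocab_size, piece_count(segmented), round_trip_ok)
```

Larger vocabularies merge more character sequences into single pieces, so the same line needs fewer pieces as the vocabulary grows, while the decoded text stays identical.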