Cherokee (chr) subword embeddings

| Vocab size | vocab | model | 25 dim | 50 dim | 100 dim | 200 dim | 300 dim |
|---|---|---|---|---|---|---|---|
| 1000 | vocab | model | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix |
| 3000 | vocab | model | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix |
| 5000 | vocab | model | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix |
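
To illustrate how one of the txt embedding files might be read, here is a minimal sketch. It assumes the files follow the common word2vec text layout (a header line with the number of entries and the dimension, then one subword and its vector per line); that layout is an assumption and is not stated on this page. The function name `load_text_embeddings` and the tiny in-memory sample are hypothetical.

```python
import io

def load_text_embeddings(lines):
    """Parse word2vec-style text (assumed layout): 'count dim' header,
    then one 'token v1 ... v_dim' record per line."""
    it = iter(lines)
    header = next(it).split()
    dim = int(header[1])  # header[0] is the declared number of entries
    vectors = {}
    for line in it:
        # Split from the right so the token is whatever precedes the dim values.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors

# Made-up sample standing in for a real file; the values are not real embeddings.
sample = io.StringIO(
    "3 4\n"
    "▁ 0.1 0.2 0.3 0.4\n"
    "▁f -0.5 0.0 0.5 1.0\n"
    "eral 1 2 3 4\n"
)
dim, vectors = load_text_embeddings(sample)
print(dim, len(vectors), vectors["▁f"])
```

For a downloaded file, the same function could be called on a handle opened with `open(path, encoding="utf-8")`, where `path` is wherever the txt file was saved.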

Embedding matrix plots

Training corpus sample, encoded with different BPE vocabulary sizes

Vocab size    chrwiki sample
original ᎾᏍᎩᎾᎢ ᏄᎵᏍᏛ ᏣᏔᎦ ᎠᎩᏏ ᏰᎵᏇ ᎠᏠᏯᏍᏗ, ᎦᏙ ᎤᏍᏗ ᎨᏒᎢ ᎤᏙᎯᏳ ᏓᎦᏘᎴᎬ ᎠᎴ ᎬᎾ ᎠᏨᏯ ᏰᎵᏇ ᏂᏓᏙᎳᎬᎾ ᎦᏙ ᎤᏍᏗ ᎨᏒᎢ ᎤᏙᎯᏳᎯᏯ ᎤᏙᎯᏳ ᏧᏣᏔᏊ.
ᎯᎠ ᎠᎹᏳᎸᏗ ᎨᏒᎢ ᎧᏁᎢᏍᏔᏅᎯ ᏥᏄᏍᏗ ᎯᎠ ᎢᎦᏛ ᎯᎠ ᎦᏙᎯ ᎾᎥᏂᎨᏍᏙᏗ ᎠᎴ ᎾᎥᏂᎨ ᎯᎠ ᎠᎺᏉᎯ. coastline ᎨᏒᎢ ᏗᏙᎳᏤᎩ, ᎠᏍᏓᏅᏅ ᎾᎿ ᎡᎶᎯ ᏓᏟᎶᏍᏔᏅ ᎠᎾᏓᏎᎮᎲᎢ ᎯᎠ ᏚᏂᏲᏍᎬ ᎠᎹᏳᎸᏗ ᎠᎴ ᎯᎠ ᎧᏁᏨ ᎨᏒᎢ ᎢᏳᏓᎵᎭ Ꭼ
feral ᏪᏌ ᎭᏫᎾᏗᏢ ᎢᎦᏘᎭ ᎢᎬᏁᏗ colonies ᏰᎵᏇ ᎠᎴᏂᏓᏍᏗ ᎤᏣᏘ ᎦᏅᎯᏓᎨ; ᎯᎠ ᎩᎵᏏ ᏪᏌ ᎠᎵᏖᎸᏗ ᎰᏩ ᎠᏰᎸᏗ ᎬᏂᎨᏒ ᎢᎬᏁᎸ 00-ᏑᏕᏘᏴᏓ-ᎠᎦᏴᎵ feral ᎠᎨᏴ.
1000 ▁ ꮎꮝꭹꮎꭲ ▁ ꮔꮅꮝꮫ ▁ ꮳꮤꭶ ▁ ꭰꭹꮟ ▁ ᏸꮅꮗ ▁ ꭰꮰꮿꮝꮧ , ▁ ꭶꮩ ▁ ꭴꮝꮧ ▁ ꭸꮢꭲ ▁ ꭴꮩꭿᏻ ▁ ꮣꭶꮨꮄꭼ ▁ ꭰꮄ ▁ ꭼꮎ ▁ ꭰꮸꮿ ▁ ᏸꮅꮗ ▁ ꮒꮣꮩꮃꭼꮎ ▁ ꭶꮩ ▁ ꭴꮝꮧ ▁ ꭸꮢꭲ ▁ ꭴꮩꭿᏻꭿꮿ ▁ ꭴꮩꭿᏻ ▁ ꮷꮳꮤꮚ .
▁ ꭿꭰ ▁ ꭰꮉᏻꮈꮧ ▁ ꭸꮢꭲ ▁ ꭷꮑꭲꮝꮤꮕꭿ ▁ ꮵꮔꮝꮧ ▁ ꭿꭰ ▁ ꭲꭶꮫ ▁ ꭿꭰ ▁ ꭶꮩꭿ ▁ ꮎꭵꮒꭸꮝꮩꮧ ▁ ꭰꮄ ▁ ꮎꭵꮒꭸ ▁ ꭿꭰ ▁ ꭰꮊꮙꭿ . ▁c o as t l ine ▁ ꭸꮢꭲ ▁ ꮧꮩꮃꮴꭹ , ▁ ꭰꮝꮣꮕꮕ ▁ ꮎꮏ ▁ ꭱꮆꭿ ▁ ꮣꮯꮆꮝꮤꮕ ▁ ꭰꮎꮣꮞꭾꮂꭲ ▁ ꭿꭰ ▁ ꮪꮒᏺꮝꭼ ▁ ꭰꮉᏻꮈꮧ ▁ ꭰꮄ ▁ ꭿꭰ ▁ ꭷꮑꮸ ▁ ꭸꮢꭲ ▁ ꭲᏻꮣꮅꭽ ▁ ꭼ
▁f er al ▁ ꮺꮜ ▁ ꭽꮻꮎꮧꮲ ▁ ꭲꭶꮨꭽ ▁ ꭲꭼꮑꮧ ▁c ol on ies ▁ ᏸꮅꮗ ▁ ꭰꮄꮒꮣꮝꮧ ▁ ꭴꮳꮨ ▁ ꭶꮕꭿꮣꭸ ; ▁ ꭿꭰ ▁ ꭹꮅꮟ ▁ ꮺꮜ ▁ ꭰꮅꮦꮈꮧ ▁ ꮀꮹ ▁ ꭰᏸꮈꮧ ▁ ꭼꮒꭸꮢ ▁ ꭲꭼꮑꮈ ▁00 - ꮡꮥꮨᏼꮣ - ꭰꭶᏼꮅ ▁f er al ▁ ꭰꭸᏼ .
3000 ▁ ꮎꮝꭹꮎꭲ ▁ ꮔꮅꮝꮫ ▁ ꮳꮤꭶ ▁ ꭰꭹꮟ ▁ ᏸꮅꮗ ▁ ꭰꮰꮿꮝꮧ , ▁ ꭶꮩ ▁ ꭴꮝꮧ ▁ ꭸꮢꭲ ▁ ꭴꮩꭿᏻ ▁ ꮣꭶꮨꮄꭼ ▁ ꭰꮄ ▁ ꭼꮎ ▁ ꭰꮸꮿ ▁ ᏸꮅꮗ ▁ ꮒꮣꮩꮃꭼꮎ ▁ ꭶꮩ ▁ ꭴꮝꮧ ▁ ꭸꮢꭲ ▁ ꭴꮩꭿᏻꭿꮿ ▁ ꭴꮩꭿᏻ ▁ ꮷꮳꮤꮚ .
▁ ꭿꭰ ▁ ꭰꮉᏻꮈꮧ ▁ ꭸꮢꭲ ▁ ꭷꮑꭲꮝꮤꮕꭿ ▁ ꮵꮔꮝꮧ ▁ ꭿꭰ ▁ ꭲꭶꮫ ▁ ꭿꭰ ▁ ꭶꮩꭿ ▁ ꮎꭵꮒꭸꮝꮩꮧ ▁ ꭰꮄ ▁ ꮎꭵꮒꭸ ▁ ꭿꭰ ▁ ꭰꮊꮙꭿ . ▁co ast l ine ▁ ꭸꮢꭲ ▁ ꮧꮩꮃꮴꭹ , ▁ ꭰꮝꮣꮕꮕ ▁ ꮎꮏ ▁ ꭱꮆꭿ ▁ ꮣꮯꮆꮝꮤꮕ ▁ ꭰꮎꮣꮞꭾꮂꭲ ▁ ꭿꭰ ▁ ꮪꮒᏺꮝꭼ ▁ ꭰꮉᏻꮈꮧ ▁ ꭰꮄ ▁ ꭿꭰ ▁ ꭷꮑꮸ ▁ ꭸꮢꭲ ▁ ꭲᏻꮣꮅꭽ ▁ ꭼ
▁f er al ▁ ꮺꮜ ▁ ꭽꮻꮎꮧꮲ ▁ ꭲꭶꮨꭽ ▁ ꭲꭼꮑꮧ ▁col on ies ▁ ᏸꮅꮗ ▁ ꭰꮄꮒꮣꮝꮧ ▁ ꭴꮳꮨ ▁ ꭶꮕꭿꮣꭸ ; ▁ ꭿꭰ ▁ ꭹꮅꮟ ▁ ꮺꮜ ▁ ꭰꮅꮦꮈꮧ ▁ ꮀꮹ ▁ ꭰᏸꮈꮧ ▁ ꭼꮒꭸꮢ ▁ ꭲꭼꮑꮈ ▁00 - ꮡꮥꮨᏼꮣ - ꭰꭶᏼꮅ ▁f er al ▁ ꭰꭸᏼ .
5000 ▁ ꮎꮝꭹꮎꭲ ▁ ꮔꮅꮝꮫ ▁ ꮳꮤꭶ ▁ ꭰꭹꮟ ▁ ᏸꮅꮗ ▁ ꭰꮰꮿꮝꮧ , ▁ ꭶꮩ ▁ ꭴꮝꮧ ▁ ꭸꮢꭲ ▁ ꭴꮩꭿᏻ ▁ ꮣꭶꮨꮄꭼ ▁ ꭰꮄ ▁ ꭼꮎ ▁ ꭰꮸꮿ ▁ ᏸꮅꮗ ▁ ꮒꮣꮩꮃꭼꮎ ▁ ꭶꮩ ▁ ꭴꮝꮧ ▁ ꭸꮢꭲ ▁ ꭴꮩꭿᏻꭿꮿ ▁ ꭴꮩꭿᏻ ▁ ꮷꮳꮤꮚ .
▁ ꭿꭰ ▁ ꭰꮉᏻꮈꮧ ▁ ꭸꮢꭲ ▁ ꭷꮑꭲꮝꮤꮕꭿ ▁ ꮵꮔꮝꮧ ▁ ꭿꭰ ▁ ꭲꭶꮫ ▁ ꭿꭰ ▁ ꭶꮩꭿ ▁ ꮎꭵꮒꭸꮝꮩꮧ ▁ ꭰꮄ ▁ ꮎꭵꮒꭸ ▁ ꭿꭰ ▁ ꭰꮊꮙꭿ . ▁co ast line ▁ ꭸꮢꭲ ▁ ꮧꮩꮃꮴꭹ , ▁ ꭰꮝꮣꮕꮕ ▁ ꮎꮏ ▁ ꭱꮆꭿ ▁ ꮣꮯꮆꮝꮤꮕ ▁ ꭰꮎꮣꮞꭾꮂꭲ ▁ ꭿꭰ ▁ ꮪꮒᏺꮝꭼ ▁ ꭰꮉᏻꮈꮧ ▁ ꭰꮄ ▁ ꭿꭰ ▁ ꭷꮑꮸ ▁ ꭸꮢꭲ ▁ ꭲᏻꮣꮅꭽ ▁ ꭼ
▁f eral ▁ ꮺꮜ ▁ ꭽꮻꮎꮧꮲ ▁ ꭲꭶꮨꭽ ▁ ꭲꭼꮑꮧ ▁col on ies ▁ ᏸꮅꮗ ▁ ꭰꮄꮒꮣꮝꮧ ▁ ꭴꮳꮨ ▁ ꭶꮕꭿꮣꭸ ; ▁ ꭿꭰ ▁ ꭹꮅꮟ ▁ ꮺꮜ ▁ ꭰꮅꮦꮈꮧ ▁ ꮀꮹ ▁ ꭰᏸꮈꮧ ▁ ꭼꮒꭸꮢ ▁ ꭲꭼꮑꮈ ▁00 - ꮡꮥꮨᏼꮣ - ꭰꭶᏼꮅ ▁f eral ▁ ꭰꭸᏼ .
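
The table shows rare Latin-script words splitting into fewer, longer pieces as the vocabulary grows (for example "c o as t l ine" at 1000, "co ast l ine" at 3000, and "co ast line" at 5000). The following toy byte-pair-encoding sketch reproduces that effect on a made-up word list. It is not the tokenizer that produced the table: the word list and the names `train_bpe`, `apply_merge`, and `encode` are hypothetical, and the only detail borrowed from the table is the "▁" word-start marker.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules by repeatedly joining the most frequent pair."""
    corpus = Counter(tuple("▁" + w) for w in words)  # symbol tuple -> frequency
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=lambda p: (pairs[p], p))  # deterministic tie-break
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[tuple(apply_merge(symbols, best))] += freq
        corpus = new_corpus
    return merges

def encode(word, merges):
    """Segment `word` by applying the learned merges in the order they were learned."""
    symbols = list("▁" + word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols

# Hypothetical toy corpus; a real vocabulary is learned from far more text.
toy_words = ["coast", "coastline", "line", "feral", "colonies", "coastal", "online"]
for n in (0, 5, 15):
    print(n, " ".join(encode("coastline", train_bpe(toy_words, n))))
```

More merges correspond to a larger vocabulary, so the piece count for a given word can only stay the same or shrink as `n` grows.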