Official Aramaic (700-300 BCE) (arc) subword embeddings

| Vocab size | vocab | model | 25 dim | 50 dim | 100 dim | 200 dim | 300 dim |
|---|---|---|---|---|---|---|---|
| 1000 | vocab | model | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix |
| 3000 | vocab | model | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix |
| 5000 | vocab | model | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix |
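
For readers who want to inspect a txt download without extra dependencies, the following is a minimal sketch of a parser. It assumes the txt files use the common word2vec text layout (a header line with the entry count and dimension, then one subword and its vector per line); this page does not confirm that layout, so check it against a real file first. The function name `load_text_vectors` and the three-row sample are hypothetical and for illustration only.

```python
import io


def load_text_vectors(handle):
    """Parse word2vec-style text (ASSUMED layout, not confirmed by this page):
    a 'count dim' header line, then 'token v1 ... v_dim' rows."""
    header = handle.readline().split()
    count, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed rows
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return count, dim, vectors


# Hypothetical toy data standing in for a downloaded file; the values are made up.
SAMPLE = (
    "3 4\n"
    "▁ܡܕ 0.1 0.2 0.3 0.4\n"
    "▁ܐܘ -0.5 0.0 0.25 1.0\n"
    "ܝܬ 0.0 0.0 0.0 0.5\n"
)

count, dim, vectors = load_text_vectors(io.StringIO(SAMPLE))
print(count, dim, sorted(vectors))
```

For a real file, the same function could be given a handle opened with `open(path, encoding="utf-8")`, where `path` points at whichever txt file was downloaded.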

Embedding matrix plots

Training corpus sample, encoded with different BPE vocabulary sizes

Vocab size arcwiki sample
original ܡܕܝܕ ܐܘ ܡܕܝܬ ܡܕܝܢܬܐ ܣܘܪܝܝܬܐ ܢܩܦܐ ܠܡܪܥܝܬܐ ܕܩܠܝܡܐ ܛܘܪ ܥܒܕܝܢ ܝܘܡܢܐ. ܘܗܝ ܐܝܬܝܗ ܐܪܫܟܝܬܐ ܕܡܪܥܝܬܐ ܕܛܘܪ ܥܒܕܝܢ ܣܘܪܝܝܬܐ܀
ܡܪܝܡ ܡܓܕܠܝܬܐ ܥܡ ܩܝܣܐ ܕܨܠܝܒܐ. ܡܪܝܡ ܡܓܕܠܝܬܐ (ܝܘܢܝܐ ܥܬܝܩܐ: μαρία ἡ μαγδαληνή) ܗܝ ܚܕܐ ܐܢܬܬܐ ܡܢ ܕܝܬܝܩܝ ܚܕܬܐ ܡܢ ܩܪܝܬܐ ܕܡܓܕܠܐ. ܒܕܝܬܝܩܝ ܚܕܬܐ ܚܙܝܢܢ ܝܫܘܥ ܢܨܪܝܐ
ܟܪܟ ܣܠܘܟ ܐܘ ܟܪܟܘܟ (ܥܪܒܐܝܬ: كركوك܄ ܟܘܪܕܐܝܬ: kerkûk) ܗܝ ܡܕܝܢܬܐ ܪܒܬܐ ܘܐܪܫܟܝܬܐ ܕܗܘܦܪܟܝܐ ܕܟܪܟ ܣܠܘܟ ܒܐܬܪܐ ܕܥܝܪܐܩ. ܐܬܐ ܫܡܐ ܕܟܪܟܘܟ ܡܢ ܫܡܐ ܕܟܪܟ ܣܠܘܟ ܫܡܐ ܥܬܝܩܐ
1000 ▁ܡܕ ܝܕ ▁ܐܘ ▁ܡܕ ܝܬ ▁ܡܕܝܢܬܐ ▁ܣܘܪܝܝܬܐ ▁ܢ ܩ ܦܐ ▁ܠܡܪ ܥܝܬܐ ▁ܕܩ ܠܝ ܡܐ ▁ܛܘܪ ▁ܥܒܕܝܢ ▁ܝܘܡܢܐ . ▁ܘ ܗܝ ▁ܐܝܬܝܗ ▁ܐܪܫܟܝܬܐ ▁ܕܡܪ ܥܝܬܐ ▁ܕܛܘܪ ▁ܥܒܕܝܢ ▁ܣܘܪܝܝܬܐ ܀
▁ܡܪܝ ܡ ▁ܡܓ ܕ ܠ ܝܬܐ ▁ܥܡ ▁ܩ ܝܣ ܐ ▁ܕܨ ܠܝ ܒܐ . ▁ܡܪܝ ܡ ▁ܡܓ ܕ ܠ ܝܬܐ ▁( ܝܘ ܢܝܐ ▁ܥܬܝܩܐ : ▁ μ α ρ ί α ▁ ἡ ▁ μ α γ δ α λ η ν ή ) ▁ܗܝ ▁ܚܕܐ ▁ܐܢ ܬ ܬܐ ▁ܡܢ ▁ܕ ܝܬܝܩܝ ▁ܚܕܬܐ ▁ܡܢ ▁ܩܪ ܝܬܐ ▁ܕܡ ܓ ܕ ܠܐ . ▁ܒ ܕ ܝܬܝܩܝ ▁ܚܕܬܐ ▁ܚ ܙ ܝܢ ܢ ▁ܝܫܘܥ ▁ܢܨܪܝܐ
▁ܟ ܪܟ ▁ܣ ܠ ܘܟ ▁ܐܘ ▁ܟ ܪܟ ܘܟ ▁( ܥܪܒܐܝܬ : ▁ ك ر ك و ك ܄ ▁ܟ ܘܪ ܕܐ ܝܬ : ▁k er k û k ) ▁ܗܝ ▁ܡܕܝܢܬܐ ▁ܪܒܬܐ ▁ܘܐܪ ܫܟܝܬܐ ▁ܕܗܘܦܪܟܝܐ ▁ܕܟ ܪܟ ▁ܣ ܠ ܘܟ ▁ܒܐܬܪܐ ▁ܕܥܝܪܐܩ . ▁ܐܬܐ ▁ܫܡܐ ▁ܕܟ ܪܟ ܘܟ ▁ܡܢ ▁ܫܡܐ ▁ܕܟ ܪܟ ▁ܣ ܠ ܘܟ ▁ܫܡܐ ▁ܥܬܝܩܐ
3000 ▁ܡܕܝܕ ▁ܐܘ ▁ܡܕ ܝܬ ▁ܡܕܝܢܬܐ ▁ܣܘܪܝܝܬܐ ▁ܢ ܩܦܐ ▁ܠܡܪܥܝܬܐ ▁ܕܩ ܠܝܡܐ ▁ܛܘܪ ▁ܥܒܕܝܢ ▁ܝܘܡܢܐ . ▁ܘܗܝ ▁ܐܝܬܝܗ ▁ܐܪܫܟܝܬܐ ▁ܕܡܪ ܥܝܬܐ ▁ܕܛܘܪ ▁ܥܒܕܝܢ ▁ܣܘܪܝܝܬܐ ܀
▁ܡܪܝܡ ▁ܡܓܕ ܠܝܬܐ ▁ܥܡ ▁ܩ ܝܣܐ ▁ܕܨ ܠܝܒܐ . ▁ܡܪܝܡ ▁ܡܓܕ ܠܝܬܐ ▁( ܝܘܢܝܐ ▁ܥܬܝܩܐ : ▁μα ρία ▁ ἡ ▁μ αγ δ α λ η ν ή ) ▁ܗܝ ▁ܚܕܐ ▁ܐܢ ܬܬܐ ▁ܡܢ ▁ܕܝܬܝܩܝ ▁ܚܕܬܐ ▁ܡܢ ▁ܩܪܝܬܐ ▁ܕܡ ܓܕ ܠܐ . ▁ܒܕ ܝܬܝܩܝ ▁ܚܕܬܐ ▁ܚܙܝܢܢ ▁ܝܫܘܥ ▁ܢܨܪܝܐ
▁ܟ ܪܟ ▁ܣܠܘܟ ▁ܐܘ ▁ܟ ܪܟ ܘܟ ▁( ܥܪܒܐܝܬ : ▁ك رك و ك ܄ ▁ܟܘܪ ܕܐܝܬ : ▁k er k û k ) ▁ܗܝ ▁ܡܕܝܢܬܐ ▁ܪܒܬܐ ▁ܘܐܪܫܟܝܬܐ ▁ܕܗܘܦܪܟܝܐ ▁ܕܟܪܟ ▁ܣܠܘܟ ▁ܒܐܬܪܐ ▁ܕܥܝܪܐܩ . ▁ܐܬܐ ▁ܫܡܐ ▁ܕܟܪܟ ܘܟ ▁ܡܢ ▁ܫܡܐ ▁ܕܟܪܟ ▁ܣܠܘܟ ▁ܫܡܐ ▁ܥܬܝܩܐ
5000 ▁ܡܕܝܕ ▁ܐܘ ▁ܡܕܝܬ ▁ܡܕܝܢܬܐ ▁ܣܘܪܝܝܬܐ ▁ܢ ܩܦܐ ▁ܠܡܪܥܝܬܐ ▁ܕܩ ܠܝܡܐ ▁ܛܘܪ ▁ܥܒܕܝܢ ▁ܝܘܡܢܐ . ▁ܘܗܝ ▁ܐܝܬܝܗ ▁ܐܪܫܟܝܬܐ ▁ܕܡܪ ܥܝܬܐ ▁ܕܛܘܪ ▁ܥܒܕܝܢ ▁ܣܘܪܝܝܬܐ ܀
▁ܡܪܝܡ ▁ܡܓܕ ܠܝܬܐ ▁ܥܡ ▁ܩ ܝܣܐ ▁ܕܨ ܠܝܒܐ . ▁ܡܪܝܡ ▁ܡܓܕ ܠܝܬܐ ▁( ܝܘܢܝܐ ▁ܥܬܝܩܐ : ▁μα ρία ▁ἡ ▁μ αγ δ αλ η ν ή ) ▁ܗܝ ▁ܚܕܐ ▁ܐܢܬܬܐ ▁ܡܢ ▁ܕܝܬܝܩܝ ▁ܚܕܬܐ ▁ܡܢ ▁ܩܪܝܬܐ ▁ܕܡ ܓܕ ܠܐ . ▁ܒܕܝܬܝܩܝ ▁ܚܕܬܐ ▁ܚܙܝܢܢ ▁ܝܫܘܥ ▁ܢܨܪܝܐ
▁ܟܪܟ ▁ܣܠܘܟ ▁ܐܘ ▁ܟܪܟ ܘܟ ▁( ܥܪܒܐܝܬ : ▁ك ركوك ܄ ▁ܟܘܪܕܐܝܬ : ▁ker kûk ) ▁ܗܝ ▁ܡܕܝܢܬܐ ▁ܪܒܬܐ ▁ܘܐܪܫܟܝܬܐ ▁ܕܗܘܦܪܟܝܐ ▁ܕܟܪܟ ▁ܣܠܘܟ ▁ܒܐܬܪܐ ▁ܕܥܝܪܐܩ . ▁ܐܬܐ ▁ܫܡܐ ▁ܕܟܪܟ ܘܟ ▁ܡܢ ▁ܫܡܐ ▁ܕܟܪܟ ▁ܣܠܘܟ ▁ܫܡܐ ▁ܥܬܝܩܐ
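
The effect of vocabulary size on segmentation can be checked directly from the sample above. The sketch below takes the first clause of the first sample sentence as encoded at each vocabulary size, counts the subword pieces, and reverses the encoding by joining the pieces and turning the `▁` word-start marker back into a space. The helper names `pieces` and `decode` are hypothetical, and the decoding rule assumes the SentencePiece-style convention that `▁` marks the beginning of a word.

```python
# First clause of the first sample sentence, copied from the table above.
ENCODED = {
    1000: "▁ܡܕ ܝܕ ▁ܐܘ ▁ܡܕ ܝܬ ▁ܡܕܝܢܬܐ ▁ܣܘܪܝܝܬܐ ▁ܢ ܩ ܦܐ ▁ܠܡܪ ܥܝܬܐ ▁ܕܩ ܠܝ ܡܐ ▁ܛܘܪ ▁ܥܒܕܝܢ ▁ܝܘܡܢܐ .",
    3000: "▁ܡܕܝܕ ▁ܐܘ ▁ܡܕ ܝܬ ▁ܡܕܝܢܬܐ ▁ܣܘܪܝܝܬܐ ▁ܢ ܩܦܐ ▁ܠܡܪܥܝܬܐ ▁ܕܩ ܠܝܡܐ ▁ܛܘܪ ▁ܥܒܕܝܢ ▁ܝܘܡܢܐ .",
    5000: "▁ܡܕܝܕ ▁ܐܘ ▁ܡܕܝܬ ▁ܡܕܝܢܬܐ ▁ܣܘܪܝܝܬܐ ▁ܢ ܩܦܐ ▁ܠܡܪܥܝܬܐ ▁ܕܩ ܠܝܡܐ ▁ܛܘܪ ▁ܥܒܕܝܢ ▁ܝܘܡܢܐ .",
}


def pieces(encoded):
    """Split an encoded line into its whitespace-separated subword pieces."""
    return encoded.split()


def decode(encoded):
    """Join pieces and map the word-start marker back to a space
    (ASSUMES the SentencePiece-style meaning of the marker)."""
    return "".join(pieces(encoded)).replace("▁", " ").strip()


piece_counts = {vs: len(pieces(text)) for vs, text in ENCODED.items()}
decoded = {vs: decode(text) for vs, text in ENCODED.items()}

for vs in sorted(ENCODED):
    print(vs, piece_counts[vs], decoded[vs])
```

Larger vocabularies merge more character sequences into single pieces, so the piece count for the same text should not grow as the vocabulary size increases, while the decoded text stays the same.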