Hausa (ha) subword embeddings

Vocab size 1000: vocab | model
  25 dim: txt | bin, bokeh | umap | matrix
  50 dim: txt | bin, bokeh | umap | matrix
  100 dim: txt | bin, bokeh | umap | matrix
  200 dim: txt | bin, bokeh | umap | matrix
  300 dim: txt | bin, bokeh | umap | matrix
Vocab size 3000: vocab | model
  25 dim: txt | bin, bokeh | umap | matrix
  50 dim: txt | bin, bokeh | umap | matrix
  100 dim: txt | bin, bokeh | umap | matrix
  200 dim: txt | bin, bokeh | umap | matrix
  300 dim: txt | bin, bokeh | umap | matrix
Vocab size 5000: vocab | model
  25 dim: txt | bin, bokeh | umap | matrix
  50 dim: txt | bin, bokeh | umap | matrix
  100 dim: txt | bin, bokeh | umap | matrix
  200 dim: txt | bin, bokeh | umap | matrix
  300 dim: txt | bin, bokeh | umap | matrix
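
As a rough illustration of how one of the txt downloads might be read, here is a minimal sketch. It assumes the txt files use the word2vec text layout (a header line with the vocabulary count and dimension, then one subword and its vector per line); this page does not confirm that layout. The function name `load_text_embeddings` and the sample values are hypothetical, not taken from the real files.

```python
import io


def load_text_embeddings(handle):
    """Parse word2vec-style text: a 'count dim' header, then 'token v1 ... vdim' rows.

    Assumption: the txt embedding files follow this layout.
    """
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip("\n").split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed rows
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


# Tiny made-up stand-in for a downloaded txt file (values are NOT real embeddings).
sample = io.StringIO("3 2\n▁da 0.1 -0.2\n▁filin 0.3 0.4\nin -0.5 0.6\n")
dim, vectors = load_text_embeddings(sample)
print(dim, len(vectors), vectors["▁filin"])
```

For a real file, the `io.StringIO` stand-in would be replaced by `open(path, encoding="utf-8")`.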

Embedding matrix plots

Training corpus sample, encoded with different BPE vocabulary sizes

Vocab size | hawiki sample
original {| border=0 align=right cellpadding=0 cellspacing=0 width=000 style="margin: 0 0 0em 0em; border: 0px #aaa solid; border-collapse: collapse;"
faransa ta mamaye turai wajen a’adu, da siyasa da mulkin soja kalkashin louis. masu ilimin falfasa sun taka rawa mai kyau a lokacin wayewan kai a karn
ca mau filin dong hoi filin yana a filin jirgin sama da ca mau, lardin ca mau, da vietnam. filin jirgin sama da ke bakin teku. ya yi runway, 0000 mito
1000 ▁{| ▁border =0 ▁align = right ▁cellpadding =0 ▁cellspacing =0 ▁width = 000 ▁style =" margin : ▁0 ▁0 ▁0 em ▁0 em ; ▁border : ▁0 px ▁# aa a ▁so li d ; ▁border - collapse : ▁co llapse ;"
▁faransa ▁ta ▁mama ye ▁turai ▁wajen ▁a ’ adu , ▁da ▁siyasa ▁da ▁mulkin ▁so ja ▁kal ka shin ▁lo u is . ▁masu ▁ilimin ▁fa l fa sa ▁sun ▁ta ka ▁ra wa ▁mai ▁kya u ▁a ▁lokacin ▁wa ye wan ▁kai ▁a ▁kar n
▁ca ▁ma u ▁fil in ▁don g ▁ho i ▁fil in ▁yana ▁a ▁fil in ▁jirgin ▁sama ▁da ▁ca ▁ma u , ▁l ar din ▁ca ▁ma u , ▁da ▁v i et na m . ▁fil in ▁jirgin ▁sama ▁da ▁ke ▁bakin ▁te ku . ▁ya ▁yi ▁r un wa y , ▁0000 ▁mi to
3000 ▁{| ▁border =0 ▁align = right ▁cellpadding =0 ▁cellspacing =0 ▁width =000 ▁style =" margin : ▁0 ▁0 ▁0 em ▁0 em ; ▁border : ▁0 px ▁# aa a ▁solid ; ▁border - collapse : ▁collapse ;"
▁faransa ▁ta ▁mamaye ▁turai ▁wajen ▁a ’ adu , ▁da ▁siyasa ▁da ▁mulkin ▁soja ▁kal ka shin ▁lo u is . ▁masu ▁ilimin ▁fal fasa ▁sun ▁taka ▁rawa ▁mai ▁kyau ▁a ▁lokacin ▁waye wan ▁kai ▁a ▁kar n
▁ca ▁ma u ▁filin ▁don g ▁ho i ▁filin ▁yana ▁a ▁filin ▁jirgin ▁sama ▁da ▁ca ▁ma u , ▁lardin ▁ca ▁ma u , ▁da ▁vietnam . ▁filin ▁jirgin ▁sama ▁da ▁ke ▁bakin ▁teku . ▁ya ▁yi ▁r un way , ▁0000 ▁mi to
5000 ▁{| ▁border =0 ▁align = right ▁cellpadding =0 ▁cellspacing =0 ▁width =000 ▁style =" margin : ▁0 ▁0 ▁0 em ▁0 em ; ▁border : ▁0 px ▁# aaa ▁solid ; ▁border - collapse : ▁collapse ;"
▁faransa ▁ta ▁mamaye ▁turai ▁wajen ▁a ’ adu , ▁da ▁siyasa ▁da ▁mulkin ▁soja ▁kal ka shin ▁lo u is . ▁masu ▁ilimin ▁fal fasa ▁sun ▁taka ▁rawa ▁mai ▁kyau ▁a ▁lokacin ▁waye wan ▁kai ▁a ▁kar n
▁ca ▁mau ▁filin ▁dong ▁hoi ▁filin ▁yana ▁a ▁filin ▁jirgin ▁sama ▁da ▁ca ▁mau , ▁lardin ▁ca ▁mau , ▁da ▁vietnam . ▁filin ▁jirgin ▁sama ▁da ▁ke ▁bakin ▁teku . ▁ya ▁yi ▁runway , ▁0000 ▁mi to
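
To make the piece notation in the samples above concrete, the following sketch maps space-separated pieces back to text and compares piece counts across vocabulary sizes. It assumes that `▁` (U+2581) marks the start of a word and that pieces without it continue the preceding word, which matches the samples shown. `decode_pieces` is a hypothetical helper, not part of any library, and the two strings are copied from the end of the third sample row.

```python
def decode_pieces(encoded):
    """Join space-separated subword pieces; '▁' (U+2581) marks a word start."""
    return "".join(encoded.split(" ")).replace("\u2581", " ").strip()


# Fragments copied from the 1000 and 5000 vocabulary rows of the sample table.
pieces_1000 = "▁ya ▁yi ▁r un wa y , ▁0000 ▁mi to"
pieces_5000 = "▁ya ▁yi ▁runway , ▁0000 ▁mi to"

for name, pieces in (("1000", pieces_1000), ("5000", pieces_5000)):
    # A larger vocabulary covers the same text with fewer, longer pieces.
    print(name, len(pieces.split(" ")), decode_pieces(pieces))
```

Both fragments decode to the same text; the larger vocabulary merges `r un wa y` into the single piece `runway`.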