Pennsylvania German (pdc) subword embeddings

| Vocab size | Vocab | Model | 25 dim | 50 dim | 100 dim | 200 dim | 300 dim |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1000 | vocab | model | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix |
| 3000 | vocab | model | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix |
| 5000 | vocab | model | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix |
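
A minimal sketch of reading one of the txt downloads, assuming (this page does not state it) that the txt files use the common word2vec text layout: a header line with the entry count and dimension, then one subword and its vector per line. The function name `read_w2v_text` and the 3-dimensional sample values are hypothetical and only illustrate the layout.

```python
from io import StringIO
from typing import Dict, List, TextIO


def read_w2v_text(handle: TextIO) -> Dict[str, List[float]]:
    """Parse word2vec-style text: 'count dim' header, then 'token v1 ... v_dim' rows."""
    count, dim = (int(field) for field in handle.readline().split())
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # tolerate blank lines
        if len(parts) != dim + 1:
            raise ValueError(f"expected {dim} values, got {len(parts) - 1}")
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    if len(vectors) != count:
        raise ValueError(f"header announced {count} entries, found {len(vectors)}")
    return vectors


# Hypothetical in-memory sample with made-up values (assumed layout, not real data).
sample = StringIO("2 3\n▁abdeeling 0.1 -0.2 0.3\n▁vun 0.0 0.5 -0.5\n")
vectors = read_w2v_text(sample)
print(len(vectors), "subwords loaded")
```

For a downloaded file, the same function could be called on `open(path, encoding="utf-8")`, where `path` points at whichever txt file was saved locally.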

Embedding matrix plots

Training corpus sample, encoded with different BPE vocabulary sizes

| Vocab size | pdcwiki sample |
| --- | --- |
| original | abdeeling:mennischde abdeeling:gebore im 00. yaahrhunnert abdeeling:gschtaerewe im 00. yaahrhunnert abdeeling:mann |
| | wappen vun zurich view of the inner city with the four main churches visible, and the albis in the backdrop panorama mit grossmünster vun zurich zuric |
| | der pennsylvaanisch-deitsche gehle bledder iss en pdf file mit alle addresse, ass wichtig sinn fer ebber, wu geindresst iss in ebbes ausfinne iwwer di |
| 1000 | ▁abdeeling : m enn ischde ▁abdeeling : gebore ▁im ▁00. ▁yaahr h unnert ▁abdeeling : gschtaerewe ▁im ▁00. ▁yaahr h unnert ▁abdeeling : mann |
| | ▁wa pp en ▁vun ▁z ur ich ▁vie w ▁of ▁the ▁in ner ▁c ity ▁w ith ▁the ▁f o ur ▁ma in ▁church es ▁v is i b le , ▁and ▁the ▁al b is ▁in ▁the ▁b ack d rop ▁p an or am a ▁mit ▁gro ss m ü n st er ▁vun ▁z ur ich ▁zu ri c |
| | ▁der ▁pennsylvaanisch - deitsche ▁geh le ▁b le dder ▁iss ▁en ▁p d f ▁f ile ▁mit ▁alle ▁a dd re sse , ▁ass ▁w icht ig ▁sinn ▁fer ▁eb ber , ▁wu ▁ge ind re ss t ▁iss ▁in ▁eb b es ▁aus f inn e ▁iwwer ▁di |
| 3000 | ▁abdeeling : mennischde ▁abdeeling : gebore ▁im ▁00. ▁yaahrhunnert ▁abdeeling : gschtaerewe ▁im ▁00. ▁yaahrhunnert ▁abdeeling : mann |
| | ▁wappen ▁vun ▁zurich ▁vie w ▁of ▁the ▁in ner ▁city ▁with ▁the ▁fo ur ▁main ▁church es ▁v is ib le , ▁and ▁the ▁al b is ▁in ▁the ▁back d rop ▁pan or ama ▁mit ▁gross m ün ster ▁vun ▁zurich ▁zu ri c |
| | ▁der ▁pennsylvaanisch - deitsche ▁geh le ▁bledder ▁iss ▁en ▁p d f ▁file ▁mit ▁alle ▁add re sse , ▁ass ▁w icht ig ▁sinn ▁fer ▁ebber , ▁wu ▁ge ind ress t ▁iss ▁in ▁ebbes ▁aus f inne ▁iwwer ▁di |
| 5000 | ▁abdeeling : mennischde ▁abdeeling : gebore ▁im ▁00. ▁yaahrhunnert ▁abdeeling : gschtaerewe ▁im ▁00. ▁yaahrhunnert ▁abdeeling : mann |
| | ▁wappen ▁vun ▁zurich ▁vie w ▁of ▁the ▁in ner ▁city ▁with ▁the ▁fo ur ▁main ▁churches ▁v is ible , ▁and ▁the ▁alb is ▁in ▁the ▁back d rop ▁pan or ama ▁mit ▁gross m ün ster ▁vun ▁zurich ▁zu ric |
| | ▁der ▁pennsylvaanisch - deitsche ▁geh le ▁bledder ▁iss ▁en ▁p d f ▁file ▁mit ▁alle ▁add re sse , ▁ass ▁wicht ig ▁sinn ▁fer ▁ebber , ▁wu ▁ge ind ress t ▁iss ▁in ▁ebbes ▁aus f inne ▁iwwer ▁di |
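
The effect of vocabulary size on the samples above can be checked with a short sketch. It assumes the `▁` symbol marks a word boundary in the SentencePiece convention, so that concatenating the pieces and turning `▁` back into a space recovers the original text. The helper name `pieces_to_text` is hypothetical; the strings are the first sample line copied from the table.

```python
def pieces_to_text(encoded: str) -> str:
    """Join space-separated subword pieces and restore word boundaries."""
    pieces = encoded.split(" ")
    # Assumption: a leading '▁' on a piece stands for the space before a word.
    return "".join(pieces).replace("▁", " ").strip()


original = (
    "abdeeling:mennischde abdeeling:gebore im 00. yaahrhunnert "
    "abdeeling:gschtaerewe im 00. yaahrhunnert abdeeling:mann"
)

# First sample line from the table, keyed by BPE vocabulary size.
encoded = {
    1000: "▁abdeeling : m enn ischde ▁abdeeling : gebore ▁im ▁00. ▁yaahr h unnert "
          "▁abdeeling : gschtaerewe ▁im ▁00. ▁yaahr h unnert ▁abdeeling : mann",
    3000: "▁abdeeling : mennischde ▁abdeeling : gebore ▁im ▁00. ▁yaahrhunnert "
          "▁abdeeling : gschtaerewe ▁im ▁00. ▁yaahrhunnert ▁abdeeling : mann",
    5000: "▁abdeeling : mennischde ▁abdeeling : gebore ▁im ▁00. ▁yaahrhunnert "
          "▁abdeeling : gschtaerewe ▁im ▁00. ▁yaahrhunnert ▁abdeeling : mann",
}

# Number of subword pieces needed at each vocabulary size.
piece_counts = {size: len(text.split(" ")) for size, text in encoded.items()}
# Whether each encoding decodes back to the original line.
round_trips = {size: pieces_to_text(text) == original for size, text in encoded.items()}

for size in sorted(encoded):
    print(size, piece_counts[size], round_trips[size])
```

On this line the 1000-piece vocabulary needs 24 pieces, while the 3000 and 5000 vocabularies each need 18, because frequent words such as `mennischde` and `yaahrhunnert` become single pieces once the vocabulary is large enough.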