Understanding GPT tokenizers: I wrote about how the tokenizers used by the various GPT models actually work, including an interactive tool for experimenting with their output https://simonwillison.net/2023/Jun/8/gpt-tokenizers/
The tokenizers have a strong bias towards English: "The dog eats the apples" is 5 tokens, "El perro come las manzanas" is 8 tokens, and many Japanese characters end up using two integer tokens for each character of text.
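These counts are easy to reproduce with OpenAI's tiktoken library. Here is a minimal sketch, assuming the GPT-2/GPT-3 "r50k_base" encoding; other model families (e.g. GPT-3.5/GPT-4's "cl100k_base") will give somewhat different numbers, and the Japanese sentence is just an illustrative translation, not taken from the post.

```python
# Minimal sketch: count tokens for a few sentences with tiktoken.
# Assumption: the r50k_base encoding (GPT-2/GPT-3); other encodings
# such as cl100k_base produce different counts.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

sentences = [
    "The dog eats the apples",        # English
    "El perro come las manzanas",     # Spanish
    "犬はリンゴを食べる",               # illustrative Japanese translation
]

for text in sentences:
    tokens = enc.encode(text)
    print(f"{text!r}: {len(tokens)} tokens -> {tokens}")
```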
@simon
The longest words in the official Finnish dictionary:
pyyhkäisyelektronimikroskooppi - 17 tokens
= electron microscope - 3 tokens
Elintarviketurvallisuusvirasto - 13 tokens
= food safety authority - 3 tokens
A constructed compound word:
lentokonesuihkuturbiinimoottoriapumekaanikkoaliupseerioppilas - 29 tokens
= airplane jet turbine engine assistant mechanic non-commissioned officer in training - 15 tokens
Those are pretty extreme differences.
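The Finnish figures can be checked the same way. A quick sketch, again assuming tiktoken's r50k_base encoding, which is why the exact counts may differ slightly from those quoted above:

```python
# Compare the constructed Finnish compound with its English translation.
# Assumption: r50k_base encoding; counts vary between GPT model families.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

finnish = "lentokonesuihkuturbiinimoottoriapumekaanikkoaliupseerioppilas"
english = ("airplane jet turbine engine assistant mechanic "
           "non-commissioned officer in training")

print(len(enc.encode(finnish)), "tokens for the Finnish compound")
print(len(enc.encode(english)), "tokens for the English translation")
```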