Understanding GPT tokenizers: I wrote about how the tokenizers used by the various GPT models actually work, including an interactive tool for experimenting with their output https://simonwillison.net/2023/Jun/8/gpt-tokenizers/
The tokenizers have a strong bias towards English: "The dog eats the apples" is 5 tokens, "El perro come las manzanas" is 8 tokens, and many Japanese characters end up taking two integer tokens each.
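If you want to reproduce counts like these yourself, here's a minimal sketch using the tiktoken library (my choice for illustration, not necessarily what the interactive tool uses). The exact numbers depend on which encoding you load:

    import tiktoken

    # "gpt2" is the GPT-2/GPT-3 era encoding; gpt-3.5-turbo and gpt-4 use "cl100k_base"
    encoding = tiktoken.get_encoding("gpt2")

    for text in ["The dog eats the apples", "El perro come las manzanas"]:
        tokens = encoding.encode(text)
        # decode each integer token back to text so the splits are visible
        pieces = [encoding.decode([t]) for t in tokens]
        print(f"{len(tokens)} tokens: {pieces}")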
Here's the interactive @observablehq notebook I built to help demonstrate how the tokenizers work: https://observablehq.com/@simonw/gpt-tokenizer
@simon
The longest words in the official Finnish dictionary: pyyhkäisyelektronimikroskooppi - 17 tokens
= electron microscope - 3 tokens
Elintarviketurvallisuusvirasto - 13 tokens
= food safety authority - 3 tokens
A constructed compound word:
lentokonesuihkuturbiinimoottoriapumekaanikkoaliupseerioppilas - 29 tokens
= airplane jet turbine engine assistant mechanic non-commissioned officer in training - 15 tokens
Those are pretty extreme differences.
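For anyone curious exactly where those splits fall, here's a rough sketch, again using tiktoken as a stand-in (the thread doesn't say which tokenizer produced the counts above, so the numbers may not match exactly):

    import tiktoken

    encoding = tiktoken.get_encoding("gpt2")  # counts will differ with other encodings

    words = [
        "pyyhkäisyelektronimikroskooppi",
        "Elintarviketurvallisuusvirasto",
        "lentokonesuihkuturbiinimoottoriapumekaanikkoaliupseerioppilas",
    ]
    for word in words:
        tokens = encoding.encode(word)
        # decode_single_token_bytes avoids mojibake when a token splits a UTF-8 character
        pieces = [encoding.decode_single_token_bytes(t) for t in tokens]
        print(len(tokens), pieces)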
@fuzzychef 28!
@simon @fuzzychef Now the challenge is to find the German word with the most tokens that has a Wikipedia page. Can you try https://en.wikipedia.org/wiki/Donaudampfschiffahrtselektrizit%C3%A4tenhauptbetriebswerkbauunterbeamtengesellschaft?
And here's a demo of my "llm" tool (https://github.com/simonw/llm) showing output from GPT-4 a token at a time - note how the word "Pelly" is two tokens but the word "Captain" in "Captain Gulliver" is only one.
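You can check that particular split outside the demo too. Here's a small sketch with tiktoken's cl100k_base encoding (the one gpt-4 uses); this is not the llm tool itself, and the leading space matters because mid-sentence words carry it as part of the token:

    import tiktoken

    encoding = tiktoken.get_encoding("cl100k_base")  # used by gpt-3.5-turbo and gpt-4

    # Mid-sentence words arrive with a leading space attached to the token
    for text in [" Pelly", " Captain"]:
        tokens = encoding.encode(text)
        pieces = [encoding.decode_single_token_bytes(t) for t in tokens]
        print(repr(text), "->", len(tokens), "token(s):", pieces)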