Understanding GPT tokenizers: I wrote about how the tokenizers used by the various GPT models actually work, including an interactive tool for experimenting with their output https://simonwillison.net/2023/Jun/8/gpt-tokenizers/
The tokenizers have a strong bias towards English: "The dog eats the apples" is 5 tokens, "El perro come las manzanas" is 8 tokens, and many Japanese characters end up using two integer tokens for each character of text.
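These counts are easy to reproduce with OpenAI's tiktoken library. Here is a minimal sketch, assuming the GPT-2/GPT-3 "r50k_base" encoding; other model families (e.g. GPT-3.5/GPT-4's "cl100k_base") will give somewhat different numbers, and the Japanese sentence is just an illustrative translation, not taken from the post.

```python
# Minimal sketch: count tokens for a few sentences with tiktoken.
# Assumption: the r50k_base encoding (GPT-2/GPT-3); other encodings
# such as cl100k_base produce different counts.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

sentences = [
    "The dog eats the apples",        # English
    "El perro come las manzanas",     # Spanish
    "犬はリンゴを食べる",               # illustrative Japanese translation
]

for text in sentences:
    tokens = enc.encode(text)
    print(f"{text!r}: {len(tokens)} tokens -> {tokens}")
```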
@simon
The longest words in the official Finnish dictionary:
pyyhkäisyelektronimikroskooppi - 17 tokens
= electron microscope - 3 tokens
Elintarviketurvallisuusvirasto - 13 tokens
= food safety authority - 3 tokens
A constructed compound word:
lentokonesuihkuturbiinimoottoriapumekaanikkoaliupseerioppilas - 29 tokens
= airplane jet turbine engine assistant mechanic non-commissioned officer in training - 15 tokens
Those are pretty extreme differences.
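The Finnish figures can be checked the same way. A quick sketch, again assuming tiktoken's r50k_base encoding, which is why the exact counts may differ slightly from those quoted above:

```python
# Compare the constructed Finnish compound with its English translation.
# Assumption: r50k_base encoding; counts vary between GPT model families.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

finnish = "lentokonesuihkuturbiinimoottoriapumekaanikkoaliupseerioppilas"
english = ("airplane jet turbine engine assistant mechanic "
           "non-commissioned officer in training")

print(len(enc.encode(finnish)), "tokens for the Finnish compound")
print(len(enc.encode(english)), "tokens for the English translation")
```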