Understanding GPT tokenizers: I wrote about how the tokenizers used by the various GPT models actually work, including an interactive tool for experimenting with their output https://simonwillison.net/2023/Jun/8/gpt-tokenizers/
The tokenizers have a strong bias towards English: "The dog eats the apples" is 5 tokens, "El perro come las manzanas" is 8 tokens, and many Japanese characters end up taking two integer tokens each.
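If you want to reproduce counts like these yourself, here's a minimal sketch using the tiktoken library (my choice for illustration, not necessarily what the interactive tool uses). The exact numbers depend on which encoding you load:

    import tiktoken

    # "gpt2" is the GPT-2/GPT-3 era encoding; gpt-3.5-turbo and gpt-4 use "cl100k_base"
    encoding = tiktoken.get_encoding("gpt2")

    for text in ["The dog eats the apples", "El perro come las manzanas"]:
        tokens = encoding.encode(text)
        # decode each integer token back to text so the splits are visible
        pieces = [encoding.decode([t]) for t in tokens]
        print(f"{len(tokens)} tokens: {pieces}")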
Here's the interactive @observablehq notebook I built to help demonstrate how the tokenizers work: https://observablehq.com/@simonw/gpt-tokenizer
@simon
The longest words in the official Finnish dictionary: pyyhkäisyelektronimikroskooppi - 17 tokens
= electron microscope - 3 tokens
Elintarviketurvallisuusvirasto - 13 tokens
= food safety authority - 3 tokens
A constructed compound word:
lentokonesuihkuturbiinimoottoriapumekaanikkoaliupseerioppilas - 29 tokens
= airplane jet turbine engine assistant mechanic non-commissioned officer in training - 15 tokens
Those are pretty extreme differences.
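For anyone curious exactly where those splits fall, here's a rough sketch, again using tiktoken as a stand-in (the thread doesn't say which tokenizer produced the counts above, so the numbers may not match exactly):

    import tiktoken

    encoding = tiktoken.get_encoding("gpt2")  # counts will differ with other encodings

    words = [
        "pyyhkäisyelektronimikroskooppi",
        "Elintarviketurvallisuusvirasto",
        "lentokonesuihkuturbiinimoottoriapumekaanikkoaliupseerioppilas",
    ]
    for word in words:
        tokens = encoding.encode(word)
        # decode_single_token_bytes avoids mojibake when a token splits a UTF-8 character
        pieces = [encoding.decode_single_token_bytes(t) for t in tokens]
        print(len(tokens), pieces)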
@fuzzychef 28!
@simon @fuzzychef Now the challenge is to find the German word with the most tokens that has a Wikipedia page. Can you try https://en.wikipedia.org/wiki/Donaudampfschiffahrtselektrizit%C3%A4tenhauptbetriebswerkbauunterbeamtengesellschaft?
And here's a demo of my "llm" tool (https://github.com/simonw/llm) showing output from GPT-4 a token at a time - note how the word "Pelly" is two tokens but the word "Captain" in "Captain Gulliver" is only one.
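You can check that particular split outside the demo too. Here's a small sketch with tiktoken's cl100k_base encoding (the one gpt-4 uses); this is not the llm tool itself, and the leading space matters because mid-sentence words carry it as part of the token:

    import tiktoken

    encoding = tiktoken.get_encoding("cl100k_base")  # used by gpt-3.5-turbo and gpt-4

    # Mid-sentence words arrive with a leading space attached to the token
    for text in [" Pelly", " Captain"]:
        tokens = encoding.encode(text)
        pieces = [encoding.decode_single_token_bytes(t) for t in tokens]
        print(repr(text), "->", len(tokens), "token(s):", pieces)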