Understanding GPT tokenizers: I wrote about how the tokenizers used by the various GPT models actually work, including an interactive tool for experimenting with their output: https://simonwillison.net/2023/Jun/8/gpt-tokenizers/
The tokenizers have a strong bias towards English: "The dog eats the apples" is 5 tokens, "El perro come las manzanas" is 8 tokens, and many Japanese characters end up using two integer tokens for each character of text.
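If you want to reproduce counts like these outside the interactive tool, here's a minimal sketch using OpenAI's tiktoken library (my own addition, not from the post; exact counts depend on which encoding you pick, since GPT-2/GPT-3 use different tokenizers than gpt-3.5-turbo and gpt-4):

```python
import tiktoken

# cl100k_base is the encoding used by gpt-3.5-turbo and gpt-4;
# older GPT-3 models use r50k_base/p50k_base, so counts can differ slightly.
enc = tiktoken.get_encoding("cl100k_base")

for text in ("The dog eats the apples", "El perro come las manzanas"):
    tokens = enc.encode(text)
    print(f"{text!r}: {len(tokens)} tokens -> {tokens}")
```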
@fuzzychef 28!
@simon @fuzzychef Now the challenge is to find the German word with the most tokens that also has a Wikipedia page. Can you try https://en.wikipedia.org/wiki/Donaudampfschiffahrtselektrizit%C3%A4tenhauptbetriebswerkbauunterbeamtengesellschaft?
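For scoring candidates like that one, the same sketch works; treat the output as an illustration rather than a verified count, since it depends on the encoding chosen:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
# Long German compounds split into many subword tokens.
word = (
    "Donaudampfschiffahrtselektrizitäten"
    "hauptbetriebswerkbauunterbeamtengesellschaft"
)
print(len(enc.encode(word)))
```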