Understanding GPT tokenizers: I wrote about how the tokenizers used by the various GPT models actually work, including an interactive tool for experimenting with their output https://simonwillison.net/2023/Jun/8/gpt-tokenizers/
The tokenizers have a strong bias towards English: "The dog eats the apples" is 5 tokens, "El perro come las manzanas" is 8 tokens, and many Japanese characters end up needing two integer tokens each.
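You can poke at counts like these yourself with OpenAI's tiktoken library. Here's a rough sketch - I'm assuming the GPT-3 era "r50k_base" encoding here (other encodings will give different numbers), and the Japanese sentence is just my own illustrative example:

```python
import tiktoken

# r50k_base is the encoding used by the original GPT-3 models
# (assumption: the counts quoted above came from a GPT-3 era tokenizer)
encoding = tiktoken.get_encoding("r50k_base")

examples = [
    "The dog eats the apples",
    "El perro come las manzanas",
    "犬はリンゴを食べる",  # Japanese text often needs multiple tokens per character
]

for text in examples:
    tokens = encoding.encode(text)
    print(f"{len(tokens):2d} tokens: {text}")
```

Run against a newer encoding such as "cl100k_base" the gap narrows a little, but English still comes out cheapest.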
Here's the interactive @observablehq notebook I built to help demonstrate how the tokenizers work: https://observablehq.com/@simonw/gpt-tokenizer
And here's a demo of my "llm" tool (https://github.com/simonw/llm) showing output from GPT-4 a token at a time - note how the word "Pelly" is two tokens but the word "Captain" in "Captain Gulliver" is only one.
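If you want to check those splits yourself, here's a quick sketch using tiktoken again - "cl100k_base" is the encoding GPT-4 uses, and the leading spaces matter because the tokenizer treats " Captain" and "Captain" as different tokens:

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # the GPT-4 / GPT-3.5-turbo encoding

for word in (" Captain", " Pelly"):  # leading space included, as mid-sentence words have one
    tokens = encoding.encode(word)
    pieces = [encoding.decode([t]) for t in tokens]
    print(f"{word!r}: {len(tokens)} token(s) -> {pieces}")
```

Made-up names like "Pelly" tend to get broken into multiple pieces, while common words like "Captain" have earned their own single token.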