Byte pair encoding

Byte pair encoding^[1]^[2] (also known as BPE, or digram coding)^[3] is an algorithm, first described in 1994 by Philip Gage, that encodes strings of text into shorter strings by building and applying a translation table.^[4] A slightly modified version of the algorithm is used in large language model tokenizers.

The original version of the algorithm focused on compression. It repeatedly replaces the most frequent pair of adjacent bytes with a new byte that does not occur in the initial dataset. A lookup table of the replacements is required to rebuild the initial dataset. The modified version builds "tokens" (units of recognition) that match varying amounts of source text, from single characters (including single digits or single punctuation marks) to whole words (even long compound words).^[5]^[6]^[7]
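The compression variant described above can be illustrated with a minimal Python sketch. It is not Gage's original implementation: the function names `bpe_compress` and `bpe_decompress`, the stopping rule (stop when no pair occurs at least twice or no unused byte value remains), and the choice of the lowest unused byte value as the replacement are all assumptions made for illustration.

```python
from collections import Counter


def bpe_compress(data: bytes):
    """Sketch of byte-pair compression (illustrative, not Gage's code).

    Returns the packed bytes and a table mapping each new byte value
    to the pair of byte values it replaced.
    """
    seq = list(data)
    used = set(seq)   # byte values that may never be reused as replacements
    table = {}        # new byte -> (left, right), in order of creation

    while True:
        # Count adjacent pairs. Overlapping runs such as "aaa" are
        # over-counted, which is acceptable for a sketch.
        pairs = Counter(zip(seq, seq[1:]))
        new = next((b for b in range(256) if b not in used), None)
        if not pairs or new is None:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # assumed stopping rule: no pair repeats
            break

        # Replace non-overlapping occurrences of the pair, left to right.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(new)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        table[new] = pair
        used.add(new)

    return bytes(seq), table


def bpe_decompress(packed: bytes, table) -> bytes:
    """Rebuild the original data by undoing replacements, newest first."""
    seq = list(packed)
    for new, (left, right) in reversed(list(table.items())):
        out = []
        for b in seq:
            out.extend((left, right) if b == new else (b,))
        seq = out
    return bytes(seq)


original = b"aaabdaaabac"
packed, table = bpe_compress(original)
assert bpe_decompress(packed, table) == original
```

The table must travel with the packed data, since without it the replacement bytes cannot be expanded. Tokenizer variants of BPE keep the same merge loop but, instead of substituting unused byte values, record each merged pair as a new vocabulary entry; that variant is not shown here.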

[1] Gage, Philip (1994). "A New Algorithm for Data Compression". The C Users Journal.
[2] "A New Algorithm for Data Compression". Dr. Dobb's Journal. 1 February 1994. Retrieved 10 August 2020.
[3] Witten, Ian H.; Moffat, Alistair; Bell, Timothy C. (1994). Managing Gigabytes. New York: Van Nostrand Reinhold. ISBN 978-0-442-01863-4.
[4] "Byte Pair Encoding". Archived from the original on 2016-03-26.
[5] Sennrich, Rico; Birch, Alexandra; Haddow, Barry (2015-08-31). "Neural Machine Translation of Rare Words with Subword Units". arXiv:1508.07909 [cs.CL].
[6] Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini (2020-06-04). "Language Models are Few-Shot Learners". arXiv:2005.14165 [cs.CL].
[7] "google/sentencepiece". Google. 2021-03-02. Retrieved 2021-03-02.
