Question 1

What does the R-package 'sentencepiece' do?

Accepted Answer

Text tokenization using byte pair encoding and unigram modelling. The package is an unsupervised text tokenizer that supports both byte pair encoding and unigram modelling. It wraps the 'sentencepiece' library [https://github.com/google/sentencepiece](https://github.com/google/sentencepiece), which provides a language-independent tokenizer that splits text into words and smaller subword units. The techniques are explained in the paper "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing" by Taku Kudo and John Richardson (2018) [doi:10.18653/v1/D18-2012](https://doi.org/10.18653%2Fv1%2FD18-2012). The package also provides straightforward access to pretrained byte pair encoding models and subword embeddings trained on Wikipedia using 'word2vec', as described in "BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages" by Benjamin Heinzerling and Michael Strube (2018) [http://www.lrec-conf.org/proceedings/lrec2018/pdf/1049.pdf](http://www.lrec-conf.org/proceedings/lrec2018/pdf/1049.pdf).
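To make the idea of byte pair encoding more concrete, here is a minimal toy sketch in Python. It is not the 'sentencepiece' implementation and not the R package's API: the function names `learn_bpe`, `merge_pair`, and `encode` are hypothetical, and the sketch works on single characters of whole words, with none of the normalization, whitespace handling, or unigram modelling that the real library does. It only shows the core loop: repeatedly merge the most frequent adjacent pair of symbols, then replay those merges to split a new word into subword units.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of training words (toy version)."""
    # Each word starts out as a tuple of single characters, weighted by its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Apply the new merge to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split `word` into subword units by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe(["abab", "abab", "ab"], num_merges=2)
print(merges)                 # [('a', 'b'), ('ab', 'ab')]
print(encode("abb", merges))  # ['ab', 'b']
```

A word never seen in training, such as "abb" here, is still split into known pieces, which is the practical appeal of subword tokenization. The package itself delegates training and encoding to the wrapped library rather than to code like this.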

Question 2

Who maintains sentencepiece?

Accepted Answer

Jan Wijffels

Question 3

Who authored sentencepiece?

Accepted Answer

BNOSAC, Google Inc., Yuta Mori, The Abseil Authors, Kenton Varda (Google Inc.), Sanjay Ghemawat (Google Inc.), Jeff Dean (Google Inc.), Laszlo Csomor (Google Inc.), Wink Saville (Google Inc.), Jim Meehan (Google Inc.), Chris Atenasio (Google Inc.), Jason Hsueh (Google Inc.), Anton Carver (Google Inc.), Maxim Lifantsev (Google Inc.), Susumu Yata, Daisuke Okanohara, Benjamin Heinzerling

Question 4

What is the current version of sentencepiece?

Accepted Answer

The current version of the R-package 'sentencepiece' is 0.2.3

Question 5

When was the last release of sentencepiece?

Accepted Answer

The last release of the R-package 'sentencepiece' (version 0.2.3) was on 11/13/2022

Question 6

Where can I search for the R-package 'sentencepiece'?

Accepted Answer

You can search for the R-package 'sentencepiece' on CRAN/E at https://cran-e.com
