sentencepiece
Text Tokenization using Byte Pair Encoding and Unigram Modelling
Unsupervised text tokenizer that performs byte pair encoding and unigram modelling. Wraps the 'sentencepiece' library https://github.com/google/sentencepiece, which provides a language-independent tokenizer that splits text into words and smaller subword units. The techniques are explained in the paper "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing" by Taku Kudo and John Richardson (2018) doi:10.18653/v1/D18-2012. Also provides straightforward access to pretrained byte pair encoding models and subword embeddings trained on Wikipedia using 'word2vec', as described in "BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages" by Benjamin Heinzerling and Michael Strube (2018) http://www.lrec-conf.org/proceedings/lrec2018/pdf/1049.pdf.
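To make the idea of byte pair encoding concrete, here is a minimal toy sketch in Python. It is not the package's API and not the 'sentencepiece' implementation: the function names `merge_pair`, `learn_bpe_merges` and `encode` are hypothetical, the corpus is made up, and the sketch works on pre-split words, whereas the real library operates on raw text. It only shows the core loop: repeatedly merge the most frequent adjacent symbol pair, then replay the learned merges to split a word into subword units.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each distinct word becomes a tuple of single characters, weighted by its count.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
        merges.append(best)
    return merges


def encode(word, merges):
    """Split `word` into subword units by replaying the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus, used only for illustration.
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe_merges(corpus, num_merges=4)
print(encode("lowest", merges))  # ['low', 'est']
```

The word "lowest" never appears in the toy corpus, yet it is split into two units that do occur there, which is the property that makes subword tokenization useful for unseen words.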
- Version: 0.2.3
- R version: ≥ 2.10
- License: MPL-2.0
- Needs compilation? Yes
- Last release: 11/13/2022
Team
Jan Wijffels
BNOSAC (Copyright holder)
Google Inc.
Yuta Mori (Contributor, Copyright holder)
The Abseil Authors (Contributor, Copyright holder)
Kenton Varda (Google Inc.) (Contributor, Copyright holder)
Sanjay Ghemawat (Google Inc.) (Contributor, Copyright holder)
Jeff Dean (Google Inc.) (Contributor, Copyright holder)
Laszlo Csomor (Google Inc.) (Contributor, Copyright holder)
Wink Saville (Google Inc.) (Contributor, Copyright holder)
Jim Meehan (Google Inc.) (Contributor, Copyright holder)
Chris Atenasio (Google Inc.) (Contributor, Copyright holder)
Jason Hsueh (Google Inc.) (Contributor, Copyright holder)
Anton Carver (Google Inc.) (Contributor, Copyright holder)
Maxim Lifantsev (Google Inc.) (Contributor, Copyright holder)
Susumu Yata (Contributor, Copyright holder)
Daisuke Okanohara (Contributor, Copyright holder)
Benjamin Heinzerling (Contributor, Copyright holder)
Dependencies
- Imports: 1 package
- Suggests: 2 packages
- Linking To: 1 package
- Reverse Suggests: 1 package