sentencepiece

Text Tokenization using Byte Pair Encoding and Unigram Modelling

CRAN Package

An unsupervised text tokenizer that performs byte pair encoding and unigram modelling. It wraps the 'sentencepiece' library https://github.com/google/sentencepiece, which provides a language-independent tokenizer that splits text into words and smaller subword units. The techniques are explained in the paper "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing" by Taku Kudo and John Richardson (2018) doi:10.18653/v1/D18-2012. The package also provides straightforward access to pretrained byte pair encoding models and subword embeddings trained on Wikipedia using 'word2vec', as described in "BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages" by Benjamin Heinzerling and Michael Strube (2018) http://www.lrec-conf.org/proceedings/lrec2018/pdf/1049.pdf.
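To make the byte pair encoding idea concrete, here is a minimal toy sketch in Python. It is not the SentencePiece implementation and not this package's R interface; the function names (merge_pair, learn_bpe, encode) are hypothetical and exist only for illustration. It assumes the input is a plain list of words and that ties between equally frequent pairs are broken alphabetically.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each distinct word starts as a tuple of single characters, with its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken alphabetically (an assumption
        # made here so the result is deterministic).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split `word` into subword units by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
pieces = encode("lowly", merges)
```

With this toy corpus the two learned merges build the subword "low", so an unseen word such as "lowly" is split into a known subword plus leftover characters rather than being treated as out of vocabulary.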

  • Version: 0.2.3
  • R version: ≥ 2.10
  • License: MPL-2.0
  • Needs compilation? Yes
  • Last release: 11/13/2022

Dependencies

  • Imports: 1 package
  • Suggests: 2 packages
  • Linking To: 1 package
  • Reverse Suggests: 1 package