tokenizers
Fast, Consistent Tokenization of Natural Language Text
Convert natural language text into tokens. Includes tokenizers for shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, shingled characters, lines, Penn Treebank, regular expressions, as well as functions for counting characters, words, and sentences, and a function for splitting longer texts into separate documents, each with the same number of words. The tokenizers have a consistent interface, and the package is built on the 'stringi' and 'Rcpp' packages for fast yet correct tokenization in 'UTF-8'.
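A minimal sketch of the consistent interface described above: each tokenizer takes a character vector and returns a list of character vectors, one element per input document. The functions shown here (`tokenize_words()`, `tokenize_sentences()`, `tokenize_ngrams()`, `count_words()`, `count_sentences()`, `chunk_text()`) are part of the package's documented API; the sample text and parameter values are purely illustrative.

```r
library(tokenizers)

text <- "The quick brown fox jumps over the lazy dog. It barked twice."

# Words: lowercased, with punctuation stripped by default
tokenize_words(text)

# Sentences
tokenize_sentences(text)

# Shingled n-grams (here, bigrams)
tokenize_ngrams(text, n = 2)

# Counting helpers
count_words(text)
count_sentences(text)

# Split a longer text into separate documents of roughly equal word counts
chunk_text(text, chunk_size = 5)
```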
- Version: 0.3.0
- R version: unknown
- License: MIT + file LICENSE
- Needs compilation? Yes
- Citation: tokenizers citation info
- Last release: 12/22/2022
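Since the package is released on CRAN, installation follows the standard route; a minimal sketch:

```r
# Install the released version from CRAN
install.packages("tokenizers")

# Retrieve the citation information mentioned above
citation("tokenizers")
```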
Documentation
Team
- Lincoln Mullen (author, maintainer)
- Os Keyes (contributor)
- Dmitriy Selivanov (contributor)
- Kenneth Benoit (contributor)
- Jeffrey Arnold (contributor)
Insights
Last 30 days
This package has been downloaded 35,693 times in the last 30 days. The academic equivalent of having a dedicated subreddit: there are fans, and maybe even a few trolls. Yesterday it was downloaded 1,391 times.
[Heatmap and line graph: downloads per day over the last 30 days]
Last 365 days
This package has been downloaded 352,802 times in the last 365 days. This is the kind of download count that makes grant committees nod approvingly; a job well done, even the stoic reviewers might be impressed. The day with the most downloads was Apr 30, 2024, with 1,746 downloads.
[Line graph: downloads per day over the last 365 days]
Data provided by CRAN
Dependencies
- Imports: 3 packages
- Suggests: 5 packages
- Linking To: 1 package
- Reverse Imports: 11 packages
- Reverse Suggests: 2 packages
- Reverse Enhances: 1 package