--- title: "Overview of the inverseRegex Package" author: "Jasper Watson" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{overview} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r, echo = FALSE} library(inverseRegex) ``` # inverseRegex The inverseRegex package allows users to reverse engineer regular expression patterns for R objects. Individual characters that make up an object are categorised into common groups and encoded into run-lengths. For example, the phrase "Hello World!" can be translated to `"[[:upper:]][[:lower:]]{4} [[:upper:]][[:lower:]]{4}!"`. This could be useful to summarise a dataset without viewing all individual entries or to aid in data cleaning. One could check that a column of dates all follow a "nnnn-nn-nn" format or that a column of strings consisted entirely of alphabetic characters (no zeros entered instead of the letter O for example). ## Usage The main function to use is `inverseRegex(x)` which will identify the different characters that make up the input object `x`. The different groups that can be identified are - `'[[:digit:]]'` - `'[[:lower:]]'` - `'[[:upper:]]'` - `'[[:alpha:]]'` - `'[[:alnum:]]'` - `'[[:space:]]'` - `'[[:punct:]]'` See `?regex` for an explanation of their meanings. By default the only groups that will be identified are `[[:digit:]]`, `[[:upper:]]`, and `[[:lower:]]`, with any other characters being left as is. This can altered with the following arguments: - `combineCases`: Use `'[[:alpha:]]'` instead of `'[[:lower:]]'` and `'[[:upper:]]'`. - `combineAlphanumeric`: Use '`[[:alnum:]]`' instead of '`[[:digit:]]`', '`[[:lower:]]`', '`[[:upper:]]`', and '`[[:alpha:]]`'. - `combinePunctuation`: Use '`[[:punct:]]`' instead of leaving punctuation characters as is. - `combineSpace`: Use '`[[:space:]]`' instead of leaving space characters as is. Some examples of these arguments are below: ```{r, echo = TRUE} inverseRegex('1aA') inverseRegex('1aA', combineCases = TRUE) inverseRegex('1aA', combineAlphanumeric = TRUE) inverseRegex('Hello World!') inverseRegex('Hello World!', combineSpace = TRUE, combinePunctuation = TRUE) ``` Users can also specify the different run lengths that will be identified. The `inverseRegex` function has an argument called `numbersToKeep` which allows the user to specify what lengths of repeated sequences should be identified explicitly. The default value is `c(2, 3, 4, 5, 10)`. Run lengths not requested will be identified with a `+`. ```{r, echo = TRUE} inverseRegex('abcd1234567') inverseRegex('abcd1234567', numbersToKeep = NULL) inverseRegex('abcd1234567', numbersToKeep = 1:10) ``` ## Non-character Inputs Many objects with a class other than `character` are supported, including `logical`, `integer`, `numeric`, `Date`, `POSIXct`, `factor`, `matrix`, `data.frame`, and `tibble`. They are all (except `logical`) converted to characters first and then the collection of regex patterns returned either as character vectors or as the same class as the input object if it was a matrix, data frame, or tibble. See `?inverseRegex` for a full description of how they are treated. If users need a different character conversion method they can do it themselves prior to calling `inverseRegex`. Special mention of numerics and data frames will be given here: ### Inputs of Class `numeric` An attempt has been made to convert numeric values into characters as directly as possible without losing or adding any information. When passed a numeric vector `inverseRegex` will convert it to character using: `vapply(x, format, character(1), nsmall = 1)`. This will force at least one decimal place for all entries but will not add extra decimal places beyond that unless they were present in the individual input element; it will however remove trailing decimal zeros. For example: ```{r, echo = TRUE} vapply(c(1, 1.0, 1.10, 1.12, 1.123), format, character(1), nsmall = 1) inverseRegex(c(1, 1.0, 1.10, 1.12, 1.123), numbersToKeep = 2:10) ## Vectors of class integer are just converted using as.character. inverseRegex(1L) ``` Numerics are treated differently if they are present in a matrix, data frame, or tibble. In the case of a matrix if it has a mode of numeric then the entire object will be converted to character using `trimws(format(x))`. For data frames and tibbles each column of type numeric will be converted using `trimws(format(x))`. This means that unlike for numeric vectors described above, all numeric entries in matrices, data frames, and tibbles will have the same number of decimal places. ```{r, echo = TRUE} inverseRegex(c(1, 1.0, 1.10, 1.12, 1.123)) inverseRegex(data.frame(a = c(1, 1.0, 1.10, 1.12, 1.123))) ``` ### Inputs of Class `data.frame` When giving a data frame `inverseRegex` will return a data frame of similar dimensions with each column representing an individual call to inverseRegex. ```{r, echo = TRUE} unique(inverseRegex(iris, numbersToKeep = 2:10)) ``` ## Identifying Rare Patterns One of the main use cases of the package is to identify irregular entries in a dataset. To this end there is a function `occurrencesLessThan` which will call `inverseRegex` and return logical values with `TRUE` giving the location of any regex patterns that occur less than a certain number of times. ```{r, echo = TRUE} occurrencesLessThan(c(LETTERS, 1)) ## When called on a data frame occurrencesLessThan will assess each column individually. x <- iris x$Species <- as.character(x$Species) x[27, 'Species'] <- 'set0sa' unique(occurrencesLessThan(x)) ``` What constitutes a "rare" pattern can be specified with the `fraction` or `n` arguments. See `?occurrencesLessThan` for a full description.