The inverseRegex package allows users to reverse engineer regular
expression patterns for R objects. Individual characters that make up an
object are categorised into common groups and encoded into run-lengths.
For example, the phrase “Hello World!” can be translated to "[[:upper:]][[:lower:]]{4} [[:upper:]][[:lower:]]{4}!".
This could be useful to summarise a dataset without viewing all individual entries or to aid in data cleaning. One could check that every entry in a column of dates follows a “nnnn-nn-nn” format or that a column of strings consists entirely of alphabetic characters (no zeros entered instead of the letter O, for example).
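As a rough illustration of the date check described above, the base R sketch below tests a column against the pattern that a “nnnn-nn-nn” entry would produce. The vector dates and the anchored pattern are made up for this example, and only grepl from base R is used.

```r
# Hypothetical column of dates stored as strings; the third entry is malformed.
dates <- c("2021-03-04", "2021-11-30", "21-3-4", "2022-01-15")

# Pattern for the "nnnn-nn-nn" shape, anchored so the whole string must match.
datePattern <- "^[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}$"

# TRUE where an entry follows the expected shape.
isWellFormed <- grepl(datePattern, dates)

# Positions of entries that need attention.
which(!isWellFormed)
#> [1] 3
```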
The main function to use is inverseRegex(x), which will identify the different characters that make up the input object x. The different groups that can be identified are:
- '[[:digit:]]'
- '[[:lower:]]'
- '[[:upper:]]'
- '[[:alpha:]]'
- '[[:alnum:]]'
- '[[:space:]]'
- '[[:punct:]]'
See ?regex for an explanation of their meanings.
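To see how these groups overlap, the following base R sketch tests a handful of sample characters against each class with grepl. The sample characters are chosen arbitrarily for illustration.

```r
# A few sample characters: a digit, a lower-case letter, an upper-case
# letter, a space, and a punctuation mark.
chars <- c("7", "a", "A", " ", "!")
classes <- c("[[:digit:]]", "[[:lower:]]", "[[:upper:]]", "[[:alpha:]]",
             "[[:alnum:]]", "[[:space:]]", "[[:punct:]]")

# One row per character, one column per class; TRUE where the character matches.
membership <- vapply(classes, function(cl) grepl(cl, chars),
                     logical(length(chars)))
rownames(membership) <- chars
membership
```

Letters match both their case-specific class and '[[:alpha:]]', and letters and digits both match '[[:alnum:]]', which is why the broader groups can stand in for the narrower ones.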
By default the only groups that will be identified are [[:digit:]], [[:upper:]], and [[:lower:]], with any other characters being left as is.
This can be altered with the following arguments:
- combineCases: Use '[[:alpha:]]' instead of '[[:lower:]]' and '[[:upper:]]'.
- combineAlphanumeric: Use '[[:alnum:]]' instead of '[[:digit:]]', '[[:lower:]]', '[[:upper:]]', and '[[:alpha:]]'.
- combinePunctuation: Use '[[:punct:]]' instead of leaving punctuation characters as is.
- combineSpace: Use '[[:space:]]' instead of leaving space characters as is.
Some examples of these arguments are below:
inverseRegex('1aA')
#> [1] "[[:digit:]][[:lower:]][[:upper:]]"
inverseRegex('1aA', combineCases = TRUE)
#> [1] "[[:digit:]][[:alpha:]]{2}"
inverseRegex('1aA', combineAlphanumeric = TRUE)
#> [1] "[[:alnum:]]{3}"
inverseRegex('Hello World!')
#> [1] "[[:upper:]][[:lower:]]{4} [[:upper:]][[:lower:]]{4}!"
inverseRegex('Hello World!', combineSpace = TRUE, combinePunctuation = TRUE)
#> [1] "[[:upper:]][[:lower:]]{4}[[:space:]][[:upper:]][[:lower:]]{4}[[:punct:]]"
Users can also specify the different run lengths that will be identified. The inverseRegex function has an argument called numbersToKeep which allows the user to specify what lengths of repeated sequences should be identified explicitly. The default value is c(2, 3, 4, 5, 10). Run lengths not requested will be identified with a +.
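The grouping and run-length steps described so far can be sketched in base R as below. The helper sketchPattern is hypothetical and written only for illustration; it is not the package's implementation, it handles a single string, and it covers just the three default groups.

```r
# Hypothetical helper, written for illustration only; not the package's code.
sketchPattern <- function(x, numbersToKeep = c(2, 3, 4, 5, 10)) {
  chars <- strsplit(x, "")[[1]]

  # Map each character to a group; anything unrecognised is left as is.
  groups <- chars
  groups[grepl("[[:digit:]]", chars)] <- "[[:digit:]]"
  groups[grepl("[[:lower:]]", chars)] <- "[[:lower:]]"
  groups[grepl("[[:upper:]]", chars)] <- "[[:upper:]]"

  # Collapse consecutive repeats of the same group into run lengths.
  runs <- rle(groups)

  # Runs of one get no suffix, requested lengths get {n}, the rest get +.
  suffix <- ifelse(runs$lengths == 1, "",
                   ifelse(runs$lengths %in% numbersToKeep,
                          paste0("{", runs$lengths, "}"), "+"))

  paste0(runs$values, suffix, collapse = "")
}

sketchPattern("Hello World!")
#> [1] "[[:upper:]][[:lower:]]{4} [[:upper:]][[:lower:]]{4}!"
sketchPattern("abcdefg")
#> [1] "[[:lower:]]+"
sketchPattern("abcdefg", numbersToKeep = 2:10)
#> [1] "[[:lower:]]{7}"
```

A run of seven is not in the default numbersToKeep, so it collapses to +; widening numbersToKeep to 2:10 reports the exact length instead.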
Many objects with a class other than character are supported, including logical, integer, numeric, Date, POSIXct, factor, matrix, data.frame, and tibble. They are all (except logical) converted to characters first, and the collection of regex patterns is then returned either as a character vector or as the same class as the input object if it was a matrix, data frame, or tibble. See ?inverseRegex for a full description of how they are treated. If users need a different character conversion method they can do it themselves prior to calling inverseRegex.
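As one possible example of converting an object yourself first, the sketch below formats dates in a day/month/year layout with base R's format before passing the result on. The date values and the chosen layout are assumptions made for illustration; the call to inverseRegex is guarded so the snippet still runs when the package is not installed.

```r
# as.character() on a Date gives "yyyy-mm-dd"; suppose a day/month/year
# layout is wanted instead.
x <- as.Date(c("2020-01-31", "2021-12-05"))
xChar <- format(x, "%d/%m/%Y")
xChar
#> [1] "31/01/2020" "05/12/2021"

# Pass the pre-converted character vector on, if the package is installed.
if (requireNamespace("inverseRegex", quietly = TRUE)) {
  inverseRegex::inverseRegex(xChar)
}
```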
Special mention of numerics and data frames will be given here:
numeric
An attempt has been made to convert numeric values into characters as directly as possible without losing or adding any information. When passed a numeric vector inverseRegex will convert it to character using: vapply(x, format, character(1), nsmall = 1). This will force at least one decimal place for all entries but will not add extra decimal places beyond that unless they were present in the individual input element; it will however remove trailing decimal zeros. For example:
vapply(c(1, 1.0, 1.10, 1.12, 1.123), format, character(1), nsmall = 1)
#> [1] "1.0" "1.0" "1.1" "1.12" "1.123"
inverseRegex(c(1, 1.0, 1.10, 1.12, 1.123), numbersToKeep = 2:10)
#> [1] "[[:digit:]].[[:digit:]]" "[[:digit:]].[[:digit:]]"
#> [3] "[[:digit:]].[[:digit:]]" "[[:digit:]].[[:digit:]]{2}"
#> [5] "[[:digit:]].[[:digit:]]{3}"
## Vectors of class integer are just converted using as.character.
inverseRegex(1L)
#> [1] "[[:digit:]]"
Numerics are treated differently if they are present in a matrix, data frame, or tibble. In the case of a matrix, if it has a mode of numeric then the entire object will be converted to character using trimws(format(x)). For data frames and tibbles each column of type numeric will be converted using trimws(format(x)). This means that unlike for numeric vectors described above, all numeric entries in matrices, data frames, and tibbles will have the same number of decimal places.
inverseRegex(c(1, 1.0, 1.10, 1.12, 1.123))
#> [1] "[[:digit:]].[[:digit:]]" "[[:digit:]].[[:digit:]]"
#> [3] "[[:digit:]].[[:digit:]]" "[[:digit:]].[[:digit:]]{2}"
#> [5] "[[:digit:]].[[:digit:]]{3}"
inverseRegex(data.frame(a = c(1, 1.0, 1.10, 1.12, 1.123)))
#> a
#> 1 [[:digit:]].[[:digit:]]{3}
#> 2 [[:digit:]].[[:digit:]]{3}
#> 3 [[:digit:]].[[:digit:]]{3}
#> 4 [[:digit:]].[[:digit:]]{3}
#> 5 [[:digit:]].[[:digit:]]{3}
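The difference comes from the character conversion itself, which can be reproduced in base R without the package. The sketch below applies the two conversions quoted above to the same values; the variable names are chosen for illustration.

```r
x <- c(1, 1.0, 1.10, 1.12, 1.123)

# Element by element, as described for numeric vectors.
perElement <- vapply(x, format, character(1), nsmall = 1)
perElement
#> [1] "1.0"   "1.0"   "1.1"   "1.12"  "1.123"

# Whole vector at once, as described for numeric columns.
wholeVector <- trimws(format(x))
wholeVector
#> [1] "1.000" "1.000" "1.100" "1.120" "1.123"

# trimws() matters when widths differ: format() pads to a common width.
format(c(1, 10))
#> [1] " 1" "10"
trimws(format(c(1, 10)))
#> [1] "1"  "10"
```

Formatting the whole vector at once makes format choose a common number of decimal places, which is why every entry in the data frame example maps to the same pattern.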
data.frame
When given a data frame, inverseRegex will return a data frame of similar dimensions with each column representing an individual call to inverseRegex.
unique(inverseRegex(iris, numbersToKeep = 2:10))
#> Sepal.Length Sepal.Width Petal.Length
#> 1 [[:digit:]].[[:digit:]] [[:digit:]].[[:digit:]] [[:digit:]].[[:digit:]]
#> 51 [[:digit:]].[[:digit:]] [[:digit:]].[[:digit:]] [[:digit:]].[[:digit:]]
#> 101 [[:digit:]].[[:digit:]] [[:digit:]].[[:digit:]] [[:digit:]].[[:digit:]]
#> Petal.Width Species
#> 1 [[:digit:]].[[:digit:]] [[:lower:]]{6}
#> 51 [[:digit:]].[[:digit:]] [[:lower:]]{10}
#> 101 [[:digit:]].[[:digit:]] [[:lower:]]{9}
One of the main use cases of the package is to identify irregular entries in a dataset. To this end there is a function occurrencesLessThan which will call inverseRegex and return logical values, with TRUE giving the location of any regex patterns that occur fewer than a certain number of times.
occurrencesLessThan(c(LETTERS, 1))
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [25] FALSE FALSE TRUE
## When called on a data frame occurrencesLessThan will assess each column individually.
x <- iris
x$Species <- as.character(x$Species)
x[27, 'Species'] <- 'set0sa'
unique(occurrencesLessThan(x))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 FALSE FALSE FALSE FALSE FALSE
#> 27 FALSE FALSE FALSE FALSE TRUE
What constitutes a “rare” pattern can be specified with the fraction or n arguments. See ?occurrencesLessThan for a full description.
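The general idea of flagging rare patterns by count can be sketched in base R as follows. The helper flagRare and its strict "fewer than n" rule are assumptions made for illustration; the package's own handling of fraction and n is the one documented in ?occurrencesLessThan.

```r
# Hypothetical helper for illustration; not the package's implementation.
flagRare <- function(patterns, n) {
  # Count how often each distinct pattern appears.
  counts <- table(patterns)
  # Look up each element's pattern count and compare with the threshold.
  as.vector(counts[patterns]) < n
}

# 26 entries sharing one pattern and a single entry with another.
patterns <- c(rep("[[:upper:]]", 26), "[[:digit:]]")
which(flagRare(patterns, n = 2))
#> [1] 27
```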