agrep {base}    R Documentation

Approximate String Matching (Fuzzy Matching)

Description

Searches for approximate matches to pattern (the first argument) within each element of the character vector x (the second argument) using the Levenshtein edit distance.

Usage

agrep(pattern, x, ignore.case = FALSE, value = FALSE,
      max.distance = 0.1, useBytes = FALSE)

Arguments

pattern a non-empty character string to be matched (not a regular expression!). Coerced by as.character to a string if possible.
x character vector where matches are sought. Coerced by as.character to a character vector if possible.
ignore.case if FALSE, the pattern matching is case sensitive and if TRUE, case is ignored during matching.
value if FALSE, a vector containing the (integer) indices of the matches determined is returned and if TRUE, a vector containing the matching elements themselves is returned.
max.distance maximum distance allowed for a match. Expressed either as an integer, or as a fraction of the pattern length (which is replaced by the smallest integer not less than the corresponding fraction of the pattern length), or as a list with possible components:
all: maximal (overall) distance
insertions: maximum number/fraction of insertions
deletions: maximum number/fraction of deletions
substitutions: maximum number/fraction of substitutions
If all is missing, it is set to 10%; the other components default to all. The component names can be abbreviated.
useBytes logical. In a multibyte locale, should the comparison be character-by-character (the default) or byte-by-byte?
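
The forms that max.distance accepts can be sketched as follows. This is an illustrative example only: the input strings are made up, and the first expression simply restates the fraction-to-integer rule given in the argument description.

```r
pattern <- "lasy"

# A fraction is converted to the smallest integer not less than
# fraction * pattern length: here ceiling(0.1 * 4), i.e. 1 edit.
ceiling(0.1 * nchar(pattern))

# An integer bound: at most 1 edit of any kind.
agrep(pattern, "1 lazy 2", max.distance = 1)

# A bound of 0 allows exact substring matches only.
agrep(pattern, "1 lazy 2", max.distance = 0)

# A list: at most 1 edit overall, and no substitutions at all.
x <- c("1 lazy 2", "1 lasy 2")
agrep(pattern, x, max.distance = list(all = 1, substitutions = 0))
```

In the list form, "lazy" cannot be reached from "lasy" without a substitution unless one insertion and one deletion are combined, which exceeds the overall bound of 1, so only the second element is expected to match.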

Details

The Levenshtein edit distance is used as the measure of approximateness: it is the total number of insertions, deletions and substitutions required to transform one string into another.
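
To make the definition concrete, here is a minimal sketch of the classic dynamic-programming computation of the Levenshtein distance with unit costs. The function lev_dist is a hypothetical helper written for illustration; it is not part of base R and is not how agrep is implemented. It also compares two whole strings, whereas agrep looks for the best-matching substring of each element.

```r
# Hypothetical helper (not part of base R): Levenshtein distance between
# two single strings, counting each insertion, deletion and substitution as 1.
lev_dist <- function(a, b) {
  a <- strsplit(a, "")[[1]]
  b <- strsplit(b, "")[[1]]
  n <- length(a)
  m <- length(b)

  # d[i + 1, j + 1] holds the distance between the first i characters
  # of a and the first j characters of b.
  d <- matrix(0L, n + 1L, m + 1L)
  d[, 1L] <- 0:n
  d[1L, ] <- 0:m

  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (a[i] == b[j]) 0L else 1L
      d[i + 1L, j + 1L] <- min(d[i, j + 1L] + 1L,  # deletion
                               d[i + 1L, j] + 1L,  # insertion
                               d[i, j] + cost)     # substitution or match
    }
  }
  d[n + 1L, m + 1L]
}

lev_dist("lasy", "lazy")    # one substitution
lev_dist("laysy", "lazy")   # one deletion plus one substitution
```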

As of R 2.10.0 this uses tre by Ville Laurikari (http://laurikari.net/tre/), which supports MBCS character matching much better than the previous version.

The main effect of useBytes is to avoid errors/warnings about invalid inputs and spurious matches in multibyte locales. It inhibits the conversion of inputs with marked encodings, and is forced (with a warning) if any input is found which is marked as "bytes".

Value

Either a vector giving the indices of the elements that yielded a match, or, if value is TRUE, the matched elements (after coercion, preserving names but no other attributes).
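
The two return forms can be compared with a short sketch. The named vector x below is invented for illustration.

```r
# Hypothetical input: a named character vector.
x <- c(first = "1 lazy 2", second = "nothing here")

# Default (value = FALSE): positions of the matching elements.
idx <- agrep("lasy", x)

# value = TRUE: the matching elements themselves, with names kept.
hits <- agrep("lasy", x, value = TRUE)

idx
hits
names(hits)
```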

Note

Since someone who read the description carelessly even filed a bug report on it, do note that this matches substrings of each element of x (just as grep does) and not whole elements.
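
A small sketch of the substring behaviour described in this note, using made-up inputs: a long element matches as soon as some part of it is close to the pattern, even though the element as a whole is far from it.

```r
# Hypothetical inputs: the first element merely contains something close
# to the pattern; the second contains nothing close to it.
x <- c("the quick brown fox jumps over the lazy dog", "brown fox")

# Element 1 matches because its substring "lazy" is one substitution
# away from "lasy", regardless of the rest of the string.
res <- agrep("lasy", x)
res
```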

Author(s)

Original version by David Meyer. Current version by Brian Ripley.

See Also

grep

Examples

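# One substitution (s -> z) is within the default bound of 1 edit.
agrep("lasy", "1 lazy 2")
# With substitutions forbidden, only the exact occurrence in element 2 matches.
agrep("lasy", c(" 1 lazy 2", "1 lasy 2"), max = list(sub = 0))
# max abbreviates max.distance; up to 2 edits are allowed, case-sensitively.
agrep("laysy", c("1 lazy", "1", "1 LAZY"), max = 2)
# Return the matching elements instead of their indices.
agrep("laysy", c("1 lazy", "1", "1 LAZY"), max = 2, value = TRUE)
# Ignoring case lets "1 LAZY" match as well.
agrep("laysy", c("1 lazy", "1", "1 LAZY"), max = 2, ignore.case = TRUE)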

[Package base version 2.13.1 Index]