`levitate` is based on the Python fuzzywuzzy package for fuzzy string matching. An R port of this already exists, but unlike fuzzywuzzyR, `levitate` is written entirely in R with no external dependencies on `reticulate` or Python. It also offers a couple of extra bells and whistles in the form of vectorised functions.

View the docs at https://lewinfox.github.io/levitate/.

Why "`levitate`"?

A common measure of string similarity is the **Lev**enshtein distance, and the name was available on CRAN.

Install the released version from CRAN:

`install.packages("levitate")`

Alternatively, you can install the development version from GitHub:

`devtools::install_github("lewinfox/levitate")`

`lev_distance()`

The edit distance is the minimum number of single-character insertions, deletions or substitutions needed to transform one string into another. Base R provides the `adist()` function to compute this. `levitate` provides `lev_distance()`, which is powered by the `stringdist` package.

```
lev_distance("cat", "bat")
#> [1] 1
lev_distance("rat", "rats")
#> [1] 1
lev_distance("cat", "rats")
#> [1] 2
```

The function can accept vectorised input. Where the inputs have a `length()` greater than 1, the results are returned as a vector unless `pairwise = FALSE`, in which case a matrix is returned.

```
lev_distance(c("cat", "dog", "clog"), c("rat", "log", "frog"))
#> [1] 1 1 2
lev_distance(c("cat", "dog", "clog"), c("rat", "log", "frog"), pairwise = FALSE)
#>      rat log frog
#> cat    1   3    4
#> dog    3   1    2
#> clog   4   1    2
```
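For comparison, here is a small sketch of the same calculation using base R's `adist()`, which always returns the full matrix. The variable names are illustrative only, and the row and column labels are added by hand because `adist()` only labels its result when the inputs are named.

```r
a <- c("cat", "dog", "clog")
b <- c("rat", "log", "frog")

# adist() compares every element of `a` with every element of `b`
m <- utils::adist(a, b)
dimnames(m) <- list(a, b)
m

# The element-wise distances (first with first, second with second, ...)
# sit on the diagonal of that matrix
elementwise <- unname(diag(m))
elementwise
```

The diagonal corresponds to the default vector result, and the full matrix to the `pairwise = FALSE` result.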

If either input (or both) is scalar (length 1), the result is a vector. Its elements are named after the longer input, unless `useNames = FALSE`.

```
lev_distance(c("cat", "dog", "clog"), "rat")
#>  cat  dog clog
#>    1    3    4
lev_distance("cat", c("rat", "log", "frog", "other"))
#>   rat   log  frog other
#>     1     3     4     5
lev_distance("cat", c("rat", "log", "frog", "other"), useNames = FALSE)
#> [1] 1 3 4 5
```

`lev_ratio()`

More useful than the edit distance, `lev_ratio()` makes it easier to compare similarity across different strings. Identical strings will get a score of 1 and entirely dissimilar strings will get a score of 0.

This function handles scalar and vectorised input in the same way as `lev_distance()`.
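The formula behind the score is not spelled out here. The following is a minimal sketch, assuming the score is one minus the edit distance divided by the length of the longer string. That normalisation is an assumption, although it agrees with the `lev_ratio()` outputs shown elsewhere in this document. It uses base R's `adist()` rather than `stringdist`, and `ratio_sketch()` is a hypothetical helper, not part of `levitate`.

```r
# Hypothetical helper (not part of levitate): normalised similarity
# for two single, non-empty strings.
ratio_sketch <- function(a, b) {
  d <- drop(utils::adist(a, b))      # Levenshtein distance as a plain number
  1 - d / max(nchar(a), nchar(b))    # 1 = identical, 0 = nothing in common
}

ratio_sketch("cat", "cat")
ratio_sketch("cat", "bat")
ratio_sketch("actor", "racto")
```

Dividing by the longer length keeps the score between 0 and 1, which is what makes scores comparable across string pairs of different sizes.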

`lev_partial_ratio()`

If `a` and `b` are different lengths, this function compares all the substrings of the longer string that are the same length as the shorter string and returns the highest `lev_ratio()` of all of them. E.g. when comparing `"actor"` and `"tractor"` we would compare `"actor"` with `"tract"`, `"racto"` and `"actor"` and return the highest score (in this case 1).

```
lev_partial_ratio("actor", "tractor")
#> [1] 1
# What's actually happening is the max() of this result is being returned
lev_ratio("actor", c("tract", "racto", "actor"))
#> tract racto actor
#>   0.2   0.6   1.0
```
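The sliding-window idea can be sketched in base R as follows. This is an illustration under stated assumptions, not `levitate`'s implementation: `partial_ratio_sketch()` is a hypothetical name, it expects two single non-empty strings, and it assumes the ratio is one minus the distance divided by the window length.

```r
# Hypothetical helper (not part of levitate)
partial_ratio_sketch <- function(a, b) {
  # Make sure `a` is the shorter string
  if (nchar(a) > nchar(b)) {
    tmp <- a
    a <- b
    b <- tmp
  }
  n <- nchar(a)

  # Every substring of `b` that has the same length as `a`
  starts <- seq_len(nchar(b) - n + 1)
  windows <- substring(b, starts, starts + n - 1)

  # Score each window against `a` and keep the best one
  d <- drop(utils::adist(a, windows))
  max(1 - d / n)
}

partial_ratio_sketch("actor", "tractor")
```

Because one of the windows of `"tractor"` is exactly `"actor"`, the best score here is 1.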

`lev_token_sort_ratio()`

The inputs are tokenised and the tokens are sorted alphabetically, then the resulting strings are compared.

```
x <- "Episode IV - Star Wars: A New Hope"
y <- "Star Wars Episode IV - New Hope"
# Because the order of words is different the simple approach gives a low match ratio.
lev_ratio(x, y)
#> [1] 0.3529412
# The sorted token approach ignores word order.
lev_token_sort_ratio(x, y)
#> [1] 0.9354839
```
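A rough base R sketch of the sorted-token approach is shown below. The tokenisation rule (lower-casing and treating any run of non-alphanumeric characters as a separator) is an assumption, as is the ratio formula; `sort_tokens()` and `token_sort_sketch()` are hypothetical names, not part of `levitate`.

```r
# Hypothetical helpers (not part of levitate)
sort_tokens <- function(s) {
  # Assumed tokenisation: lower-case, punctuation treated as whitespace
  cleaned <- tolower(gsub("[^[:alnum:]]+", " ", s))
  tokens <- strsplit(cleaned, " +")[[1]]
  tokens <- tokens[nzchar(tokens)]
  # method = "radix" sorts independently of the current locale
  paste(sort(tokens, method = "radix"), collapse = " ")
}

token_sort_sketch <- function(a, b) {
  sa <- sort_tokens(a)
  sb <- sort_tokens(b)
  1 - drop(utils::adist(sa, sb)) / max(nchar(sa), nchar(sb))
}

x <- "Episode IV - Star Wars: A New Hope"
y <- "Star Wars Episode IV - New Hope"
sort_tokens(x)
sort_tokens(y)
token_sort_sketch(x, y)
```

After sorting, the two titles differ only by the extra token `a`, so the score comes out close to 1.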

`lev_token_set_ratio()`

Similar to `lev_token_sort_ratio()`, this function breaks the input down into tokens. It then identifies any common tokens between strings and creates three new strings:

```
x <- {common_tokens}
y <- {common_tokens}{remaining_unique_tokens_from_string_a}
z <- {common_tokens}{remaining_unique_tokens_from_string_b}
```

and performs three pairwise `lev_ratio()` calculations between them (`x` vs `y`, `y` vs `z` and `x` vs `z`). The highest of those three ratios is returned.

```
x <- "the quick brown fox jumps over the lazy dog"
y <- "my lazy dog was jumped over by a quick brown fox"
lev_ratio(x, y)
#> [1] 0.2916667
lev_token_sort_ratio(x, y)
#> [1] 0.6458333
lev_token_set_ratio(x, y)
#> [1] 0.7435897
```
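The three-string construction can be sketched in base R like this. Several details are assumptions rather than documented behaviour: tokens are lower-cased, de-duplicated, sorted and joined with single spaces, and the ratio is one minus the distance divided by the longer length. All function names are hypothetical and not part of `levitate`.

```r
# Hypothetical helpers (not part of levitate)
unique_tokens <- function(s) {
  cleaned <- tolower(gsub("[^[:alnum:]]+", " ", s))
  tokens <- strsplit(cleaned, " +")[[1]]
  sort(unique(tokens[nzchar(tokens)]), method = "radix")
}

ratio <- function(a, b) {
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) return(1)  # two empty strings count as identical
  1 - drop(utils::adist(a, b)) / longest
}

token_set_sketch <- function(a, b) {
  ta <- unique_tokens(a)
  tb <- unique_tokens(b)
  common <- paste(intersect(ta, tb), collapse = " ")
  # common tokens followed by the tokens unique to each input
  x <- common
  y <- trimws(paste(common, paste(setdiff(ta, tb), collapse = " ")))
  z <- trimws(paste(common, paste(setdiff(tb, ta), collapse = " ")))
  max(ratio(x, y), ratio(y, z), ratio(x, z))
}

s1 <- "the quick brown fox jumps over the lazy dog"
s2 <- "my lazy dog was jumped over by a quick brown fox"
token_set_sketch(s1, s2)
```

Under these assumptions the shared tokens dominate all three strings, which is why inputs with a lot of overlapping vocabulary score highly even when their word order and surrounding words differ.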

Porting code from `fuzzywuzzy` or `fuzzywuzzyR`

Results differ between `levitate` and `fuzzywuzzy`, not least because `stringdist` offers several possible similarity measures. Be careful if you are porting code that relies on hard-coded or learned cutoffs for similarity measures.