Title: | Cluster Strings by Edit-Distance |
---|---|
Description: | Returns an edit-distance based clustering of an input vector of strings. Each cluster contains a set of strings with small mutual edit distance (e.g., Levenshtein, optimal string alignment, Damerau-Levenshtein), as computed by stringdist::stringdist(). The set of all mutual edit distances is then used by graph algorithms (from package igraph) to single out subsets of high connectivity. |
Authors: | Dan S. Reznik |
Maintainer: | Dan S. Reznik <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0 |
Built: | 2024-11-10 05:15:30 UTC |
Source: | https://github.com/dan-reznik/clustringr |
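The description above outlines a three-step pipeline: compute pairwise edit distances, link strings whose distance is small, and extract the connected groups. The following is a minimal sketch of that idea in base R only, not the package's implementation. It assumes `utils::adist()` (plain Levenshtein) as a stand-in for stringdist::stringdist(), and a simple label-propagation loop as a stand-in for igraph's connected components; the variable names are illustrative.

```r
# "cachaça" is written with a Unicode escape to stay encoding-safe.
s_vec <- c("alcool", "alcohol", "alcoholic", "brandy", "brandie", "cacha\u00e7a")
max_dist <- 3

# Step 1: pairwise Levenshtein distances (stand-in for stringdist).
d <- utils::adist(s_vec)

# Step 2: adjacency matrix; two strings are linked when distance <= max_dist.
adj <- d <= max_dist

# Step 3: connected components by label propagation (stand-in for igraph).
# Every string starts in its own cluster and repeatedly adopts the smallest
# label among itself and its neighbours until nothing changes.
labels <- seq_along(s_vec)
repeat {
  new_labels <- vapply(seq_along(s_vec),
                       function(i) min(labels[adj[i, ]]),
                       integer(1))
  if (identical(new_labels, labels)) break
  labels <- new_labels
}

# One list element per cluster.
clusters <- split(s_vec, labels)
print(clusters)
```

With this input and threshold, the three "alcohol" variants end up in one group, the two "brandy" variants in another, and the last string stays on its own.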
Plot string clusters as graph.
cluster_plot(cluster, min_cluster_size = 2, label_size = 2.5, repel = T)
cluster | string clusters returned from 'cluster_strings()' |
min_cluster_size | minimum size for a cluster to be plotted |
label_size | font size of the cluster name labels |
repel | whether to "repel" cluster names so they do not overlap |
a graph plot (using 'ggraph') of the string clusters.
s_vec <- c("alcool","alcohol","alcoholic","brandy","brandie","cachaça")
s_clust <- cluster_strings(s_vec,method="lv",max_dist=3,algo="cc")
cluster_plot(s_clust,min_cluster_size=1)
Cluster Strings by Edit-Distance
cluster_strings(s_vec, clean = T, method = "osa", max_dist = 3, algo = "cc")
s_vec | a vector of character strings |
clean | whether to space-squish and de-duplicate s_vec |
method | one of "osa","lv","dl" (as in 'stringdist') |
max_dist | maximum edit distance, under the chosen method, between two strings for them to be considered related |
algo | one of "cc" (connected components) or "eb" (edge betweenness) |
a data frame containing cluster membership for each input string
s_vec <- c("alcool","alcohol","alcoholic","brandy","brandie","cachaça")
s_clust <- cluster_strings(s_vec,method="lv",max_dist=3,algo="cc")
s_clust$df_clusters
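The `clean` argument is described as space-squishing and de-duplicating the input. As a rough illustration of what such a step might look like, here is a base R sketch. The helper `squish_and_dedup()` is hypothetical and not part of the package, and the exact cleaning rules the package applies are an assumption here.

```r
# Hypothetical helper (not from the package): trim both ends, collapse
# runs of whitespace to a single space, then drop repeated strings.
squish_and_dedup <- function(x) {
  squished <- gsub("\\s+", " ", trimws(x))
  unique(squished)
}

raw <- c("  brandy ", "brandy", "single   malt", "single malt")
cleaned <- squish_and_dedup(raw)
print(cleaned)  # returns c("brandy", "single malt")
```

Cleaning before clustering avoids spending distance computations on strings that differ only in whitespace.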
Data frame listing all distinct words (length > 3), their length, and their frequency of appearance in the text.
quijote_words
A data frame with ~22k rows and 3 columns:
the unique word, in Spanish
the word's length
number of appearances in text
http://www.gutenberg.org/cache/epub/2000/pg2000.txt
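A table of this shape (distinct word, length, frequency) could be derived from raw text along the following lines. This is only a sketch under assumptions: the toy input, the tokenization rule, and the column names `word`, `len`, and `freq` are all illustrative and are not taken from the package.

```r
# Toy input standing in for the full source text.
toy_text <- c("el hidalgo de la mancha", "la mancha y su hidalgo ingenioso")

# Lower-case, split on anything that is not a letter, keep words longer than 3.
words <- unlist(strsplit(tolower(toy_text), "[^[:alpha:]]+"))
words <- words[nchar(words) > 3]

# Count occurrences and assemble one row per distinct word.
freq <- table(words)
word_table <- data.frame(
  word = names(freq),            # hypothetical column name
  len  = nchar(names(freq)),     # hypothetical column name
  freq = as.integer(freq),       # hypothetical column name
  stringsAsFactors = FALSE
)

# Most frequent words first.
word_table <- word_table[order(-word_table$freq), ]
print(word_table)
```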