Recomposing the text is an involved process. The procedure here works only because both the data table and the original text had already been cleaned manually. Because this kind of cleaning and aligning is part and parcel of CL more broadly, the specific steps have been left out.
Tidy the begin_word vector by unnesting it into individual words. In the tidytext package, unnesting removes white space, punctuation, and most special characters, and by default converts the words to lowercase. The only special character that might remain is the curly quote [‘’]. Curly quotes tend to distort the matching, so they are replaced with straight quotes [']. Once all the words are clean, they are collapsed back into one string per event.
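absalom_events_tidy_begin <- absalom_events %>%
  # Split begin_word into one word per row, dropping punctuation
  unnest_tokens(begin_word, begin_word) %>%
  # Replace any surviving curly quotes with straight quotes
  mutate(begin_word = gsub("[‘’]", "'", begin_word)) %>%
  # Collapse the cleaned words back into one string per event
  group_by(EventID) %>%
  summarize(begin_word = str_c(begin_word, collapse = " ")) %>%
  ungroup()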
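To see what this tidying step does, here is a minimal sketch on a toy data frame. The object toy_events and its two rows are invented for illustration and are not part of absalom_events; the sketch assumes the dplyr, stringr, and tidytext packages are installed.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Toy stand-in for absalom_events (hypothetical rows, for illustration only)
toy_events <- tibble(
  EventID = c(1, 2),
  begin_word = c("The Door, at last, was shut.", "He said: ‘don’t wait.’")
)

toy_tidy_begin <- toy_events %>%
  # One word per row; punctuation is dropped and words are lowercased
  unnest_tokens(word, begin_word) %>%
  # Curly apostrophes inside words can survive tokenizing; straighten them
  mutate(word = gsub("[‘’]", "'", word)) %>%
  # Rebuild one cleaned string per event
  group_by(EventID) %>%
  summarize(begin_word = str_c(word, collapse = " ")) %>%
  ungroup()

print(toy_tidy_begin)
```

Printing the result next to the input makes it easy to check that each EventID still has exactly one row and that no curly quotes are left.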
Repeat the process for the end_word vector.
absalom_events_tidy_end <- absalom_events %>%
  unnest_tokens(end_word, end_word) %>%
  mutate(end_word = gsub("[‘’]", "'", end_word)) %>%
  group_by(EventID) %>%
  summarize(end_word = str_c(end_word, collapse = " ")) %>%
  ungroup()
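Because the begin_word and end_word steps differ only in the column name, they could also be wrapped in a single function. The sketch below is one possible way to do that, not the method used in this lesson: tidy_text_column is a hypothetical helper (it is not part of tidytext), the toy data frame is invented, and the working column names text and word are assumed not to clash with the ID column.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical helper (not part of tidytext): tidy one text column of `df`.
# `text_col` and `id_col` are column names given as strings.
tidy_text_column <- function(df, text_col, id_col = "EventID") {
  df %>%
    # Keep only the key and the text, under a fixed working name
    select(all_of(id_col), text = all_of(text_col)) %>%
    unnest_tokens(word, text) %>%
    mutate(word = gsub("[‘’]", "'", word)) %>%
    group_by(across(all_of(id_col))) %>%
    summarize(text = str_c(word, collapse = " "), .groups = "drop") %>%
    # Restore the original column name
    rename(all_of(setNames("text", text_col)))
}

# Toy data (invented for illustration)
toy <- tibble(
  EventID = c(1, 2),
  begin_word = c("The Door, at last, was shut.", "He said: ‘don’t wait.’"),
  end_word = c("Nobody answered.", "She didn’t.")
)

toy_begin <- tidy_text_column(toy, "begin_word")
toy_end <- tidy_text_column(toy, "end_word")
```

Each call returns one row per EventID with a single cleaned column, which is the same shape the two pipelines above produce.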
Join the tidied begin_word and end_word back to the data_frame