This function uses pdf_text from the pdftools package to perform keyword searches. For each match it returns the keyword's location: the page number and the line of text on which the keyword was found.
keyword_search(x, keyword, path = FALSE, split_pdf = FALSE, surround_lines = FALSE, ignore_case = FALSE, remove_hyphen = TRUE, token_results = TRUE, heading_search = FALSE, heading_args = NULL, convert_sentence = TRUE, remove_equations = TRUE, split_pattern = "\\p{WHITE_SPACE}{3,}", ...)
Argument | Description |
---|---|
x | Either the text of the pdf read in with the pdftools package or a path for the location of the pdf file. |
keyword | The keyword(s) to be used to search in the text. Multiple keywords can be specified with a character vector. |
path | An optional path designation for the location of the pdf to be converted to text. The pdftools package is used for this conversion. |
split_pdf | TRUE/FALSE indicating whether to split the pdf using white space. This would be most useful with multicolumn pdf files. The split_pdf function attempts to recreate the column layout of the text into a single column starting with the left column and proceeding to the right. |
surround_lines | numeric/FALSE indicating whether the output should extract the surrounding lines of text in addition to the matching line. Default is FALSE; otherwise, supply the number of additional surrounding lines to extract. |
ignore_case | TRUE/FALSE/vector of TRUE/FALSE, indicating whether the case of the keyword should be ignored. Default is FALSE, meaning the keyword is matched case-sensitively. If a vector, it must be the same length as the keyword vector. |
remove_hyphen | TRUE/FALSE indicating whether words hyphenated across a line break should be rejoined onto a single line. Default is TRUE. |
token_results | TRUE/FALSE indicating whether the results text returned should be split into tokens. See the tokenizers package for details. |
heading_search | TRUE/FALSE indicating whether to search for headings in the pdf. |
heading_args | A list of arguments to pass on to the heading search. Default is NULL. |
convert_sentence | TRUE/FALSE indicating if the individual lines of the PDF file should be collapsed into a single large paragraph before keyword searching. Default is TRUE. |
remove_equations | TRUE/FALSE indicating if equations should be removed. Default behavior is to search for the following regex: "\([0-9]{1,}\)$". This matches a literal opening parenthesis, followed by at least one digit, followed by a closing parenthesis at the end of the text line. It will not detect other patterns, nor the entire equation if the equation spans multiple rows. |
split_pattern | Regular expression pattern used to split multicolumn PDF files. Default is "\\p{WHITE_SPACE}{3,}", which splits on three or more consecutive white space characters. |
... | token_function to pass to |
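
The two regex-driven arguments, remove_equations and split_pattern, can be tried out on plain strings. The following is a minimal base R sketch, not the package's internal code. It reads the equation pattern as "\([0-9]{1,}\)$" (a parenthesised number at the end of a line), and it uses `\\s{3,}` as a stand-in for `\\p{WHITE_SPACE}{3,}`, on the assumption that base R's PCRE engine does not accept the `WHITE_SPACE` property name. The sample lines are hypothetical.

```r
# Pattern from the remove_equations description:
# "(", one or more digits, ")" at the end of the line.
eq_pattern <- "\\([0-9]{1,}\\)$"

# Hypothetical lines of extracted PDF text (not from any real document).
lines <- c(
  "y = b0 + b1 * x + e (1)",
  "as shown in equation (1), the slope is b1",
  "Var(y) = s2 (12)",
  "plain prose with no equation label"
)

# Flag lines that end in an equation number, then drop them.
is_equation <- grepl(eq_pattern, lines)
kept <- lines[!is_equation]

# Stand-in for the default split_pattern: three or more white space
# characters mark a column boundary; single spaces inside a column do not.
two_col <- "left column text     right column text"
cols <- strsplit(two_col, "\\s{3,}", perl = TRUE)[[1]]

print(is_equation)
print(kept)
print(cols)
```

Note that the second sample line mentions "(1)" mid-sentence and is kept, because the pattern is anchored to the end of the line.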
A tibble data frame that contains the keyword, location of match, the line of text match, and optionally the tokens associated with the line of text match.
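
Because line_text is a list column, each match can carry more than one line of text (for example when surround_lines is set). The sketch below shows one way to summarise and flatten such a result. It builds a hypothetical base R data frame with the same column layout instead of calling keyword_search, so every value in it is made up for illustration.

```r
# Hypothetical stand-in for a keyword_search() result; values are invented.
res <- data.frame(
  keyword  = c("variance", "variance", "mixed effects"),
  page_num = c(1L, 2L, 2L),
  line_num = c(4L, 72L, 80L),
  stringsAsFactors = FALSE
)

# List column: one character vector per match
# (three lines each, as with surround_lines = 1).
res$line_text <- list(
  c("line before", "the variance of the estimator", "line after"),
  c("line before", "variance components", "line after"),
  c("line before", "a mixed effects model", "line after")
)

# Count matches per keyword and page.
counts <- table(res$keyword, res$page_num)

# Collapse each match's lines into one string for reading or export.
res$context <- vapply(res$line_text, paste, character(1), collapse = " ")

print(counts)
print(res$context)
```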
file <- system.file('pdf', '1501.00450.pdf', package = 'pdfsearch')

keyword_search(file, keyword = c('repeated measures', 'mixed effects'),
  path = TRUE)
#> # A tibble: 9 x 5
#>   keyword           page_num line_num line_text token_text
#>   <chr>                <int>    <int> <list>    <list>
#> 1 repeated measures        1        9 <chr [1]> <list [1]>
#> 2 repeated measures        1       30 <chr [1]> <list [1]>
#> 3 repeated measures        2       57 <chr [1]> <list [1]>
#> 4 repeated measures        2       59 <chr [1]> <list [1]>
#> 5 repeated measures        2       69 <chr [1]> <list [1]>
#> 6 repeated measures        3      165 <chr [1]> <list [1]>
#> 7 repeated measures        3      176 <chr [1]> <list [1]>
#> 8 repeated measures        3      181 <chr [1]> <list [1]>
#> 9 repeated measures        4      308 <chr [1]> <list [1]>

# Add surrounding text
keyword_search(file, keyword = c('variance', 'mixed effects'),
  path = TRUE, surround_lines = 1)
#> # A tibble: 64 x 5
#>    keyword  page_num line_num line_text token_text
#>    <chr>       <int>    <int> <list>    <list>
#>  1 variance        1        4 <chr [3]> <list [3]>
#>  2 variance        1       10 <chr [3]> <list [3]>
#>  3 variance        1       21 <chr [3]> <list [3]>
#>  4 variance        1       31 <chr [3]> <list [3]>
#>  5 variance        1       33 <chr [3]> <list [3]>
#>  6 variance        1       38 <chr [3]> <list [3]>
#>  7 variance        1       40 <chr [3]> <list [3]>
#>  8 variance        2       72 <chr [3]> <list [3]>
#>  9 variance        2       73 <chr [3]> <list [3]>
#> 10 variance        2       75 <chr [3]> <list [3]>
#> # … with 54 more rows

# split pdf
keyword_search(file, keyword = c('repeated measures', 'mixed effects'),
  path = TRUE, split_pdf = TRUE, remove_hyphen = FALSE)
#> # A tibble: 0 x 5
#> # … with 5 variables: keyword <chr>, page_num <int>, line_num <int>,
#> #   line_text <list>, token_text <list>