Extracts the location of headings in the text so that the text can be separated by section. The function returns the headings along with their location in the PDF.
heading_search(x, headings, path = FALSE, pdf_toc = FALSE, full_line = FALSE, ignore_case = FALSE, split_pdf = FALSE, convert_sentence = FALSE)
Argument | Description |
---|---|
x | Either the text of the pdf read in with the pdftools package or a path for the location of the pdf file. |
headings | A character vector representing the headings to search for. Can be NULL if pdf_toc = TRUE. |
path | An optional path designation for the location of the pdf to be converted to text. The pdftools package is used for this conversion. |
pdf_toc | TRUE/FALSE indicating whether the pdf_toc function from the pdftools package should be used. |
full_line | TRUE/FALSE indicating whether the headings should reside on their own line. This can create problems with multiple column pdfs. |
ignore_case | TRUE/FALSE, or a vector of TRUE/FALSE values, indicating whether the case of each heading should be ignored. Default is FALSE, meaning that the headings are matched case-sensitively, exactly as written. If a vector, it must be the same length as the headings vector. |
split_pdf | TRUE/FALSE indicating whether to split the pdf using white space. This would be most useful with multicolumn pdf files. The split_pdf function attempts to recreate the column layout of the text into a single column starting with the left column and proceeding to the right. |
convert_sentence | TRUE/FALSE indicating whether individual lines of the PDF file should be collapsed into a single large paragraph before keyword searching. Default is FALSE. |
file <- system.file('pdf', '1501.00450.pdf', package = 'pdfsearch')
heading_search(file, headings = c('abstract', 'introduction'), path = TRUE)
#> # A tibble: 1 x 5
#>   keyword      page_num line_num line_text token_text
#>   <chr>           <int>    <int> <list>    <list>
#> 1 introduction        4      226 <chr [1]> <list [1]>
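The effect of the ignore_case and full_line arguments can be illustrated with a small base R sketch. This is not the pdfsearch implementation: `find_headings` is a hypothetical helper written for this illustration, it works on a plain character vector rather than a PDF, and treating each heading as a regular expression is an assumption.

```r
# Hypothetical helper (not part of pdfsearch): locate headings in a character
# vector that holds one element per line of text.
find_headings <- function(lines, headings, ignore_case = FALSE,
                          full_line = FALSE) {
  # Recycle a single ignore_case value so each heading has its own setting.
  ignore_case <- rep_len(ignore_case, length(headings))
  hits <- lapply(seq_along(headings), function(i) {
    # Assumption: full_line means the heading is the only text on the line.
    pattern <- if (full_line) {
      paste0("^\\s*", headings[i], "\\s*$")
    } else {
      headings[i]
    }
    line_num <- grep(pattern, lines, ignore.case = ignore_case[i])
    data.frame(keyword = rep(headings[i], length(line_num)),
               line_num = line_num,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, hits)
}

lines <- c("Abstract",
           "We study online experiments.",
           "1 Introduction",
           "The introduction above is brief.")

# Case-sensitive search: only the lower-case "introduction" on line 4 matches.
find_headings(lines, c("abstract", "introduction"))

# Ignoring case also finds "Abstract" (line 1) and "1 Introduction" (line 3).
find_headings(lines, c("abstract", "introduction"), ignore_case = TRUE)

# Requiring the heading to fill the whole line keeps only "Abstract".
find_headings(lines, c("abstract", "introduction"),
              ignore_case = TRUE, full_line = TRUE)
```

In this sketch the case-sensitive call misses the capitalised headings entirely, which mirrors the example above, where only 'introduction' is returned when ignore_case is left at its default.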