Function to locate sections of pdf: heading_search

Extracts the location of section headings in a pdf so that the text can be separated by section. The function returns each heading found together with its location in the pdf (page number and line number).

heading_search(
  x,
  headings,
  path = FALSE,
  pdf_toc = FALSE,
  full_line = FALSE,
  ignore_case = FALSE,
  split_pdf = FALSE,
  convert_sentence = FALSE
)

Arguments

x: Either the text of the pdf read in with the pdftools package or a path for the location of the pdf file.
headings: A character vector representing the headings to search for. Can be NULL if pdf_toc = TRUE.
path: TRUE/FALSE indicating whether x is a path to the location of a pdf file to be converted to text, rather than text that has already been read in. The pdftools package is used for this conversion. Default is FALSE.
pdf_toc: TRUE/FALSE indicating whether the pdf_toc function from the pdftools package should be used. This is most useful if the pdf has a table of contents embedded within it. path = TRUE must be specified if pdf_toc = TRUE.
full_line: TRUE/FALSE indicating whether each heading must appear on a line by itself. This can create problems with multi-column pdfs.
ignore_case: TRUE/FALSE, or a vector of TRUE/FALSE, indicating whether the case of the headings should be ignored. Default is FALSE, meaning that the headings are matched case-sensitively. If a vector, it must be the same length as the headings vector.
split_pdf: TRUE/FALSE indicating whether to split the pdf using white space. This is most useful with multi-column pdf files. The split_pdf function attempts to recreate the column layout of the text as a single column, starting with the left column and proceeding to the right.
convert_sentence: TRUE/FALSE indicating whether the individual lines of the pdf file should be collapsed into a single large paragraph before keyword searching is performed. Default is FALSE.
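
To make the effect of full_line and ignore_case more concrete, the following base R sketch applies the same two ideas to a small character vector. It is only an illustration under stated assumptions: find_heading() and the sample lines are hypothetical, and this is not how pdfsearch implements the search internally.

```r
# Hypothetical sample text, one element per line of a document.
lines <- c(
  "Abstract",
  "This paper has an introduction to methods.",
  "1 Introduction",
  "INTRODUCTION"
)

# Hypothetical helper (not part of pdfsearch): return the indices of the
# lines that contain `heading`. The heading is treated as a regular
# expression.
find_heading <- function(lines, heading, full_line = FALSE,
                         ignore_case = FALSE) {
  pattern <- heading
  if (full_line) {
    # Require the heading to be the only text on the line,
    # allowing surrounding white space.
    pattern <- paste0("^[[:space:]]*", heading, "[[:space:]]*$")
  }
  which(grepl(pattern, lines, ignore.case = ignore_case))
}

# Case-sensitive, anywhere on the line: only the body sentence matches.
case_sensitive <- find_heading(lines, "introduction")

# Ignoring case: the body sentence and both heading-like lines match.
any_case <- find_heading(lines, "introduction", ignore_case = TRUE)

# Ignoring case and requiring a line of its own: only line 4 matches,
# because line 3 also carries a section number.
own_line <- find_heading(lines, "introduction", full_line = TRUE,
                         ignore_case = TRUE)
```

In this sketch, a numbered heading such as "1 Introduction" is missed when the heading must occupy the whole line, which is one reason to consider how headings are typeset in a given pdf before setting full_line = TRUE.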

Examples

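library(pdfsearch)

# Path to an example pdf that ships with the pdfsearch package
file <- system.file('pdf', '1501.00450.pdf', package = 'pdfsearch')

# x is a file path here, so path = TRUE is required
heading_search(file, headings = c('abstract', 'introduction'),
  path = TRUE)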
#> # A tibble: 1 × 5
#>   keyword      page_num line_num line_text token_text
#>   <chr>           <int>    <int> <list>    <list>    
#> 1 introduction        3      233 <chr [1]> <list [1]>
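
The search above uses the default ignore_case = FALSE. As a further sketch using only the arguments documented on this page, ignore_case can be set for all headings at once or supplied as a vector with one value per heading. The object names res_any_case and res_mixed are illustrative, and no output is shown because it depends on the installed package version.

```r
library(pdfsearch)

# Example pdf shipped with the pdfsearch package
file <- system.file('pdf', '1501.00450.pdf', package = 'pdfsearch')

# Ignore case for every heading
res_any_case <- heading_search(file, headings = c('abstract', 'introduction'),
  path = TRUE, ignore_case = TRUE)

# One ignore_case value per heading: ignore case for 'abstract',
# keep the case-sensitive match for 'introduction'
res_mixed <- heading_search(file, headings = c('abstract', 'introduction'),
  path = TRUE, ignore_case = c(TRUE, FALSE))

res_any_case
res_mixed
```

Comparing the two results with the case-sensitive search is a quick way to check whether a heading is being missed only because of capitalization.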