Function to extract tables

extract_tables(
  x,
  path = FALSE,
  split_pdf = FALSE,
  remove_equations = TRUE,
  delimiter = "\\s{2,}",
  delimiter_table = "\\s{2,}",
  split_pattern = "\\p{WHITE_SPACE}{3,}",
  replacement = "\\/",
  col_names = FALSE
)

Arguments

x

Either the text of the pdf read in with the pdftools package or a path for the location of the pdf file.

path

TRUE/FALSE indicating whether x is a path to the pdf file rather than text that has already been read in. Default is FALSE. When TRUE, the pdftools package is used to convert the pdf to text.
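As a rough sketch of the two ways x can be supplied, the calls below pass either a file path or text that has already been read in. The file name is hypothetical, and the sketch assumes extract_tables is already attached and that the pdftools package is installed; the guard makes the block a no-op otherwise.

```r
# Hypothetical file location; replace with a real pdf to try this out.
pdf_file <- "path/to/article.pdf"

# Only run when the function is available and the file actually exists.
if (exists("extract_tables") && file.exists(pdf_file)) {
  # x is a path, so path = TRUE lets the function read the pdf itself.
  tables_from_path <- extract_tables(pdf_file, path = TRUE)

  # x is text already read in with pdftools, so path keeps its default.
  pdf_pages <- pdftools::pdf_text(pdf_file)
  tables_from_text <- extract_tables(pdf_pages, path = FALSE)
}
```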

split_pdf

TRUE/FALSE indicating whether to split the pdf using white space. This is most useful for multicolumn pdf files. The split_pdf function attempts to rearrange the column layout of the text into a single column, starting with the left column and proceeding to the right.

remove_equations

TRUE/FALSE indicating whether equations should be removed. Default behavior is to search for a literal opening parenthesis, followed by at least one digit, followed by a closing parenthesis at the end of the text line. Other patterns are not detected, and a multi-row equation is not detected in its entirety.
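The matching rule described above can be approximated in base R. The regular expression below is an assumption written from the description, not the pattern the function uses internally, and the input lines are made up for illustration.

```r
# Assumed pattern: "(", one or more digits, ")" at the very end of the line.
equation_pattern <- "\\([0-9]+\\)$"

lines <- c(
  "y = b0 + b1 * x + e      (1)",  # single-row equation with a number
  "z = a + b",                     # first row of a two-row equation
  "    + c + d            (2)",    # second row carries the number
  "Group A    12.5    0.31"        # table row, no equation number
)

# Flag lines that end in an equation number, then drop them.
is_equation <- grepl(equation_pattern, lines)
kept <- lines[!is_equation]
```

Only the rows that end in the number are flagged, so the first row of the two-row equation stays in `kept`, which mirrors the limitation noted for multi-row equations.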

delimiter

A regular expression used to detect tables. The default, "\\s{2,}", matches two or more consecutive white space characters.

delimiter_table

A regular expression used to separate table cells. The default, "\\s{2,}", matches two or more consecutive white space characters.
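To show what the default pattern matches, the base R sketch below flags lines containing a run of two or more white space characters and splits those lines into cells. How the function itself decides that a block of lines is a table is not shown here; the detection step and the sample lines are illustrative assumptions.

```r
# Default value shared by delimiter and delimiter_table.
delimiter <- "\\s{2,}"

lines <- c(
  "Variable    Mean    SD",
  "Age         34.2    5.1",
  "This is an ordinary sentence of body text."
)

# Lines with wide gaps look tabular; single spaces do not match.
looks_tabular <- grepl(delimiter, lines, perl = TRUE)

# Split the tabular lines into cells on the same pattern.
cells <- strsplit(lines[looks_tabular], delimiter, perl = TRUE)
```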

split_pattern

Regular expression pattern used to split multicolumn PDF files using stringi::stri_split_regex. Default pattern is to split based on three or more consecutive white space characters.
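A small sketch of the default split pattern, assuming the stringi package is installed. The two-column page is invented, and the reassembly step (left column first, then right) is a simplification of the idea rather than the code the package runs.

```r
# Default split_pattern: three or more consecutive white space characters.
split_pattern <- "\\p{WHITE_SPACE}{3,}"

# Two lines of a made-up two-column page, columns separated by wide gaps.
page <- c(
  "left line one     right line one",
  "left line two     right line two"
)

# Each line is split into its left and right pieces.
parts <- stringi::stri_split_regex(page, split_pattern)

# Stack the left column on top of the right column.
left <- vapply(parts, `[`, character(1), 1)
right <- vapply(parts, `[`, character(1), 2)
single_column <- c(left, right)
```

Single spaces between words are left alone because the pattern requires at least three white space characters in a row.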

replacement

The delimiter that separates table cells once the white space between them has been replaced. The default is "\\/".
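The replacement step can be pictured with base R's gsub. This is a sketch under the assumption that the white space matched by delimiter_table is what gets replaced; the sample row is made up, and the function may perform the substitution differently.

```r
# Defaults taken from the usage block above.
delimiter_table <- "\\s{2,}"
replacement <- "\\/"

row <- "Age         34.2    5.1"

# In a gsub replacement string the backslash escapes the slash,
# so each run of white space becomes a single "/".
delimited <- gsub(delimiter_table, replacement, row, perl = TRUE)

# The delimited text can then be split into cells on the literal "/".
cells <- strsplit(delimited, "/", fixed = TRUE)[[1]]
```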

col_names

TRUE/FALSE value passed to `readr::read_delim` to indicate whether the first row of extracted data should be used as column names. The default, FALSE, gives generic column names (e.g., X1, X2, ...). TRUE takes the column names from the first row of the extracted data.