Extracts the location of headings in the text so that a PDF can be separated into sections. The function returns the headings along with their location in the PDF.
heading_search(
x,
headings,
path = FALSE,
pdf_toc = FALSE,
full_line = FALSE,
ignore_case = FALSE,
split_pdf = FALSE,
split_method = c("regex", "coordinates"),
column_count = c("auto", "1", "2"),
mask_nonprose = FALSE,
nonprose_digit_ratio = 0.35,
nonprose_symbol_ratio = 0.15,
nonprose_short_token_max = 3,
remove_section_headers = FALSE,
remove_page_headers = FALSE,
remove_page_footers = FALSE,
page_margin_ratio = 0.08,
remove_repeated_furniture = FALSE,
repeated_edge_n = 3,
repeated_edge_min_pages = 4,
remove_captions = FALSE,
caption_continuation_max = 2,
table_mode = c("keep", "remove", "only"),
table_min_numeric_tokens = 3,
table_min_digit_ratio = 0.18,
table_min_block_lines = 2,
table_block_max_gap = 3,
table_include_headers = TRUE,
table_header_lookback = 3,
table_include_notes = FALSE,
table_note_lookahead = 2,
concatenate_pages = FALSE,
convert_sentence = FALSE
)
Either the text of the pdf read in with the pdftools package or a path for the location of the pdf file.
A character vector representing the headings to search for. Can be NULL if pdf_toc = TRUE.
An optional path designation for the location of the pdf to be converted to text. The pdftools package is used for this conversion.
TRUE/FALSE whether the pdf_toc function should be used from the pdftools package. This is most useful if the pdf has the table of contents embedded within the pdf. Must specify path = TRUE if pdf_toc = TRUE.
TRUE/FALSE indicating whether the headings should reside on their own line. This can create problems with multiple column pdfs.
TRUE/FALSE or a vector of TRUE/FALSE indicating whether the case of the headings should be ignored when matching. The default is FALSE, meaning the headings are matched case-sensitively. If a vector, it must be the same length as the headings vector.
TRUE/FALSE indicating whether to split the pdf using white space. This would be most useful with multicolumn pdf files. The split_pdf function attempts to recreate the column layout of the text into a single column starting with the left column and proceeding to the right.
Method used for splitting multicolumn PDF text. Defaults to "regex". Use "coordinates" to split with pdftools::pdf_data() token coordinates.
Expected number of columns for coordinate splitting. Options are "auto", "1", or "2". Used when split_method = "coordinates".
TRUE/FALSE indicating if non-prose lines (likely equations, tables, figure/table captions) should be removed when using coordinate splitting.
Numeric threshold for classifying a line as non-prose based on digit character ratio.
Numeric threshold for classifying a line as non-prose based on math-symbol character ratio.
Maximum token count for short symbolic lines to classify as non-prose.
TRUE/FALSE indicating if section-header-like lines should be removed when using coordinate splitting.
TRUE/FALSE indicating if page-header furniture (e.g., arXiv identifiers, emails, URLs) should be removed when using coordinate splitting.
TRUE/FALSE indicating if page-footer furniture (e.g., page numbers, copyright markers) should be removed when using coordinate splitting.
Numeric ratio used to define top and bottom page bands for header/footer removal.
TRUE/FALSE indicating if repeated text found in the first/last lines across many pages should be removed.
Number of lines from top and bottom of each page to consider for repeated edge-line detection.
Minimum number of pages an edge line must appear on before being removed.
TRUE/FALSE indicating if figure/table caption lines should be removed.
Number of additional lines after a caption start line to remove when they appear to be caption continuations.
How to handle detected table blocks. "keep" keeps all lines, "remove" excludes table blocks, and "only" keeps only table blocks.
Minimum numeric tokens used to classify a line as table-like.
Minimum digit-character ratio used to classify a line as table-like.
Minimum number of adjacent table-like lines for a block to be treated as a table block.
Maximum gap (in lines) allowed between table-like lines inside one table block.
TRUE/FALSE indicating if table header lines adjacent to detected table blocks should be included in table blocks.
Number of lines above a detected table block to inspect for header rows.
TRUE/FALSE indicating if trailing note/source lines should be included with detected table blocks.
Number of lines after a detected table block to inspect for note/source rows.
TRUE/FALSE indicating if page text should be concatenated after column rectification and cleaning, before sentence conversion. This is only used when convert_sentence = TRUE.
TRUE/FALSE indicating if individual lines of the PDF file should be collapsed into a single large paragraph to perform keyword searching. Default is FALSE.
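To give a feel for the ratio-based thresholds described above (nonprose_digit_ratio and nonprose_symbol_ratio), the following base R sketch computes a digit ratio and a symbol ratio per line and flags lines that exceed the default values from the usage. It is only an illustration: char_ratio is a hypothetical helper, and the set of characters treated as math symbols, the choice to ignore whitespace, and the strict "greater than" comparison are all assumptions, not the package's actual implementation.

```r
# Hypothetical helper (not part of pdfsearch): share of non-whitespace
# characters in a line that match a regular-expression character class.
char_ratio <- function(line, class) {
  chars <- strsplit(gsub("\\s+", "", line), "")[[1]]
  if (length(chars) == 0) return(0)
  mean(grepl(class, chars))
}

lines <- c(
  "We evaluate the method on three datasets.",
  "0.91 0.88 0.79 12 340",
  "y = a + b * x"
)

# Assumed character classes: digits, and a small set of math symbols.
digit_ratio <- vapply(lines, char_ratio, numeric(1),
                      class = "[0-9]", USE.NAMES = FALSE)
symbol_ratio <- vapply(lines, char_ratio, numeric(1),
                       class = "[=+*/^<>-]", USE.NAMES = FALSE)

# Flag a line as non-prose if it exceeds either default threshold
# (0.35 for digits, 0.15 for symbols).
nonprose <- digit_ratio > 0.35 | symbol_ratio > 0.15

data.frame(line = lines, digit_ratio, symbol_ratio, nonprose)
```

Under these assumptions the prose sentence is kept, while the row of numbers and the equation-like line are flagged. Raising either threshold would make the filter more permissive.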
file <- system.file('pdf', '1501.00450.pdf', package = 'pdfsearch')
heading_search(file, headings = c('abstract', 'introduction'),
path = TRUE)
#> # A tibble: 1 × 5
#> keyword page_num line_num line_text token_text
#> <chr> <int> <int> <list> <list>
#> 1 introduction 3 233 <chr [1]> <list [1]>
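Because ignore_case defaults to FALSE, the search above is case-sensitive. The sketch below repeats the same call with ignore_case = TRUE. It assumes the pdfsearch package and its bundled example PDF are installed (the call is skipped otherwise), and no output is shown because which headings are matched depends on how they appear in the extracted text.

```r
# Same search as above, but ignoring the case of the headings.
# Assumes pdfsearch (and its example PDF) is installed; otherwise skipped.
headings <- c("abstract", "introduction")
res <- NULL

if (requireNamespace("pdfsearch", quietly = TRUE)) {
  file <- system.file("pdf", "1501.00450.pdf", package = "pdfsearch")
  res <- pdfsearch::heading_search(
    file,
    headings = headings,
    path = TRUE,
    ignore_case = TRUE
  )
  print(res)
}
```

A vector such as ignore_case = c(TRUE, FALSE) could be used instead to control case handling for each heading separately, provided it has the same length as headings.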