vignettes/multicolumn_and_tables.Rmd
This vignette demonstrates newer pdfsearch workflows for:

- keyword_search()
- extract_tables()
```r
library(pdfsearch)

file <- system.file("pdf", "LeBeauetal2020-gcq.pdf", package = "pdfsearch")
```

The split_method = "coordinates" option uses token coordinates from pdftools::pdf_data() and can be more robust than whitespace-only splitting.
Use column_count to control how column order is handled:

- "auto": infer the number of columns.
- "1": force single-column reading order.
- "2": force left-column then right-column order.
```r
res_coord <- keyword_search(
  file,
  keyword = c("repeated measures", "mixed effect"),
  path = TRUE,
  split_pdf = TRUE,
  split_method = "coordinates",
  column_count = "2",
  remove_hyphen = TRUE
)
head(res_coord)
#> # A tibble: 0 × 5
#> # ℹ 5 variables: keyword <chr>, page_num <int>, line_num <int>,
#> #   line_text <list>, token_text <list>
```

Several options are available to reduce non-body text before keyword searching, including page headers, page footers, section headings, repeated page furniture, and captions. These can be particularly helpful for multi-column documents, where such elements are more prevalent. Removing them better aligns column text and keeps sentence structure and keyword proximity intact.
```r
res_clean <- keyword_search(
  file,
  keyword = "variance",
  path = TRUE,
  split_pdf = TRUE,
  split_method = "coordinates",
  column_count = "2",
  remove_section_headers = TRUE,
  remove_page_headers = TRUE,
  remove_page_footers = TRUE,
  remove_repeated_furniture = TRUE,
  repeated_edge_n = 2,
  repeated_edge_min_pages = 4,
  remove_captions = TRUE,
  caption_continuation_max = 2
)
head(res_clean)
#> # A tibble: 6 × 5
#>   keyword  page_num line_num line_text token_text
#>   <chr>       <int>    <int> <list>    <list>
#> 1 variance        4       85 <chr [1]> <list [1]>
#> 2 variance        4       87 <chr [1]> <list [1]>
#> 3 variance        4       89 <chr [1]> <list [1]>
#> 4 variance        7      185 <chr [1]> <list [1]>
#> 5 variance       17      467 <chr [1]> <list [1]>
#> 6 variance       18      618 <chr [1]> <list [1]>
```

## keyword_search()
Use table_mode to choose whether table-like blocks are searched:

- "keep": include all text (the default).
- "remove": exclude table-like blocks from the search.
- "only": search only table-like blocks.

Additional options can improve table-only extraction:

- table_include_headers: include nearby table header rows (default TRUE).
- table_header_lookback: number of lines above detected table blocks to inspect for header rows (default 3).
- table_include_notes: include trailing note/source rows.
- table_note_lookahead: number of lines after detected blocks to inspect for notes.
- table_block_max_gap: maximum number of non-table lines allowed before a block is split. Increase this when tables are fragmented.

With table_mode = "remove", the same cleaning options described above are also applied to table blocks, which helps ensure that only body text is retained for keyword searching. With table_mode = "only", the cleaning options are not applied, since the focus is on analyzing the tables themselves.
```r
res_keep <- keyword_search(
  file,
  keyword = "0.83",
  path = TRUE,
  split_pdf = TRUE,
  split_method = "coordinates",
  table_mode = "keep",
  convert_sentence = FALSE
)

res_remove <- keyword_search(
  file,
  keyword = "0.83",
  path = TRUE,
  split_pdf = TRUE,
  split_method = "coordinates",
  table_mode = "remove",
  convert_sentence = FALSE
)

res_only <- keyword_search(
  file,
  keyword = "0.83",
  path = TRUE,
  split_pdf = TRUE,
  split_method = "coordinates",
  table_mode = "only",
  table_include_headers = TRUE,
  table_header_lookback = 3,
  table_block_max_gap = 3,
  table_include_notes = FALSE,
  table_note_lookahead = 2,
  convert_sentence = FALSE
)

c(
  keep = nrow(res_keep),
  remove = nrow(res_remove),
  only = nrow(res_only)
)
#>   keep remove   only
#>      4      2      2
```
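The matched table lines can be inspected directly, since line_text is returned as a list-column in every mode. A minimal sketch, assuming the res_only search above:

```r
# Peek at the raw table lines that matched "0.83" in table-only mode
unlist(res_only$line_text)
```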
## extract_tables()

extract_tables() now supports coordinate splitting and output modes:

- "parsed": a list of parsed table data frames.
- "blocks": metadata plus raw block lines.
- "both": both parsed tables and block metadata.

It also supports the table-block tuning options:

- table_include_headers and table_header_lookback
- table_include_notes and table_note_lookahead
- table_min_numeric_tokens, table_min_digit_ratio, table_min_block_lines, and table_block_max_gap
- merge_across_pages, for continuation tables that span adjacent pages
```r
tab_blocks <- extract_tables(
  file,
  path = TRUE,
  split_pdf = TRUE,
  split_method = "coordinates",
  column_count = "2",
  remove_section_headers = TRUE,
  remove_page_headers = TRUE,
  remove_page_footers = TRUE,
  remove_repeated_furniture = TRUE,
  remove_captions = TRUE,
  table_include_headers = TRUE,
  table_header_lookback = 3,
  table_block_max_gap = 3,
  table_include_notes = FALSE,
  table_note_lookahead = 2,
  merge_across_pages = TRUE,
  output = "blocks"
)
head(tab_blocks)
#> # A tibble: 3 × 6
#>   page_num block_id line_start line_end line_text  page_end
#>      <int>    <int>      <dbl>    <int> <list>        <int>
#> 1        8        1          1       10 <chr [10]>        8
#> 2        9        1          1       16 <chr [35]>       10
#> 3       14        1          1       29 <chr [29]>       14
```
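In "blocks" mode, the raw lines of each detected block are available in the line_text list-column, so a block can be examined before any parsing is attempted. A minimal sketch:

```r
# Raw lines of the first detected table block (a character vector)
tab_blocks$line_text[[1]]
```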
```r
tab_parsed <- extract_tables(
  file,
  path = TRUE,
  split_pdf = TRUE,
  split_method = "coordinates",
  column_count = "1",
  remove_section_headers = TRUE,
  remove_page_headers = TRUE,
  remove_page_footers = TRUE,
  remove_repeated_furniture = TRUE,
  remove_captions = TRUE,
  table_include_headers = TRUE,
  table_header_lookback = 3,
  table_block_max_gap = 3,
  table_include_notes = FALSE,
  table_note_lookahead = 3,
  merge_across_pages = TRUE,
  output = "parsed"
)
length(tab_parsed)
#> [1] 3

if (length(tab_parsed) > 0) {
  head(tab_parsed[[1]])
}
#> # A tibble: 6 × 1
#>   X1
#>   <chr>
#> 1 Table 1. Fit Statistics for Two-Parameter IRT Multigroup Models by Subject.
#> 2 Subject M 2 RMSEA [CI] CFI M 2 RMSEA [CI] CFI
#> 3 English 3930 (2340), p < .01 0.019 [0.018, 0.020] 0.893 2320 (740), p < .01 0…
#> 4 Math 1630 (1283), p < .01 0.012 [0.010, 0.014] 0.949 805 (405), p < .01 0.023…
#> 5 Reading 1895 (1307), p < .01 0.015 [0.014, 0.017] 0.965 902 (405), p < .01 0.…
#> 6 Science 1818 (1146), p < .01 0.018 [0.016, 0.019] 0.936 948 (350), p < .01 0.…
```

One primary element to test is the number of text columns in the PDF. If a table spans the full page width while the body text is set in multiple columns, specify column_count = "1" when extracting the tables. This ensures the table is not truncated to include only half of its columns.
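When it is unclear whether truncation is occurring, one way to check is to extract with both column settings and compare the detected block lines. A minimal sketch, reusing the example file and options from above; the comparison heuristic is only a rough guide:

```r
# Extract the same document under both column settings
blocks_two <- extract_tables(
  file,
  path = TRUE,
  split_pdf = TRUE,
  split_method = "coordinates",
  column_count = "2",
  output = "blocks"
)
blocks_one <- extract_tables(
  file,
  path = TRUE,
  split_pdf = TRUE,
  split_method = "coordinates",
  column_count = "1",
  output = "blocks"
)

# Markedly shorter block lines under column_count = "2" suggest
# that full-width tables are being cut at the column boundary
summary(nchar(unlist(blocks_two$line_text)))
summary(nchar(unlist(blocks_one$line_text)))
```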
The table detector is controlled by several additional options that can be tuned for better performance on specific documents. The key parameters are:

- table_min_numeric_tokens: minimum number of numeric-looking tokens required for a line to be considered table-like. Larger values are stricter.
- table_min_digit_ratio: minimum proportion of digit characters in a line for table-like classification. Larger values reduce prose false positives.
- table_min_block_lines: minimum number of adjacent table-like lines needed to keep a block.
- table_block_max_gap: maximum number of non-table lines allowed between table-like lines when merging a block. Increase this when tables are split.
- table_include_headers: include nearby table headers and column-label rows.
- table_header_lookback: number of lines above a detected block to inspect for headers.
- table_include_notes: include trailing Note. or Source. rows.
- table_note_lookahead: number of lines after a block to inspect for note lines.
- merge_across_pages: if TRUE, continuation blocks across adjacent pages are merged when they appear to be one table.

A practical tuning workflow (sketched after this list):

- If tables are fragmented, increase table_block_max_gap.
- If prose lines are misclassified as table-like, increase table_min_numeric_tokens and/or table_min_digit_ratio.
- If table headers are missed, set table_include_headers = TRUE and increase table_header_lookback.
- If a table continues across pages, set merge_across_pages = TRUE.
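Under those heuristics, a tuned detector call might look like the following. This is a sketch only; the parameter values are illustrative assumptions, not recommended defaults, and will need adjustment per document:

```r
# A stricter detector: demand more numeric evidence per line,
# tolerate larger gaps inside fragmented tables, and look further
# above each block for header rows.
tab_tuned <- extract_tables(
  file,
  path = TRUE,
  split_pdf = TRUE,
  split_method = "coordinates",
  table_min_numeric_tokens = 4,  # assumed value; stricter than default
  table_min_digit_ratio = 0.3,   # assumed value; fewer prose false positives
  table_min_block_lines = 3,     # assumed value; drops tiny blocks
  table_block_max_gap = 4,       # assumed value; bridges fragmented tables
  table_include_headers = TRUE,
  table_header_lookback = 5,     # assumed value; looks further for headers
  merge_across_pages = TRUE,
  output = "blocks"
)
```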
If desired, sentence conversion can be performed after pages are concatenated. This lets sentence conversion work across page breaks, preserving context and keyword proximity when a sentence is split across pages.

```r
res_cross_page <- keyword_search(
  file,
  keyword = "fixed effects",
  path = TRUE,
  split_pdf = TRUE,
  split_method = "coordinates",
  column_count = "2",
  remove_section_headers = TRUE,
  remove_page_headers = TRUE,
  remove_page_footers = TRUE,
  convert_sentence = TRUE,
  concatenate_pages = TRUE
)
head(res_cross_page)
#> # A tibble: 0 × 5
#> # ℹ 5 variables: keyword <chr>, page_num <int>, line_num <int>,
#> #   line_text <list>, token_text <list>
```

For dense multi-column journal articles, a practical default is:
split_method = "coordinates"column_count = "2"remove_section_headers = TRUEremove_page_headers = TRUEremove_page_footers = TRUEremove_repeated_furniture = TRUEremove_captions = TRUEtable_mode = "remove" for prose-focused keyword
searchUse table_mode = "only" or
extract_tables(..., output = "blocks") when the goal is
specifically to analyze tables. If table headers are being missed, set
table_include_headers = TRUE and increase
table_header_lookback. If the table continues across pages,
use merge_across_pages = TRUE.
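Putting those defaults together, a minimal sketch for a prose-focused keyword search of a two-column article, reusing the example file from above:

```r
# Practical defaults for a dense two-column journal article
res_default <- keyword_search(
  file,
  keyword = "variance",
  path = TRUE,
  split_pdf = TRUE,
  split_method = "coordinates",
  column_count = "2",
  remove_section_headers = TRUE,
  remove_page_headers = TRUE,
  remove_page_footers = TRUE,
  remove_repeated_furniture = TRUE,
  remove_captions = TRUE,
  table_mode = "remove"
)
head(res_default)
```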