R/compare_mask_effect.r
compare_mask_effect.Rd
Runs keyword_search twice with coordinate splitting: once with mask_nonprose = FALSE and once with mask_nonprose = TRUE. Returns a compact summary for A/B checks.
compare_mask_effect(
  x,
  keyword,
  path = FALSE,
  column_count = c("auto", "1", "2"),
  nonprose_digit_ratio = 0.35,
  nonprose_symbol_ratio = 0.15,
  nonprose_short_token_max = 3,
  ...
)
x: Either the text of the PDF as read in with the pdftools package, or a path to the PDF file.
keyword: The keyword(s) to search for in the text. Specify multiple keywords as a character vector.
path: Whether x is a path to a PDF file to be converted to text (default FALSE). The pdftools package is used for this conversion. Must be TRUE for coordinate splitting.
column_count: Expected number of columns for coordinate splitting. Options are "auto", "1", or "2".
nonprose_digit_ratio: Numeric threshold on the ratio of digit characters in a line, used to classify the line as non-prose.
nonprose_symbol_ratio: Numeric threshold on the ratio of math-symbol characters in a line, used to classify the line as non-prose.
nonprose_short_token_max: Maximum token count for a short symbolic line to be classified as non-prose.
...: Additional arguments passed to keyword_search.
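To give a feel for how the three nonprose_* thresholds might interact, here is a minimal base R sketch. It is not the pdfsearch implementation: the helper name is_nonprose_line, the set of characters treated as math symbols, the use of non-whitespace characters as the ratio denominator, and the way the three rules are combined are all assumptions made for illustration.

```r
# Hypothetical helper for illustration only; NOT the pdfsearch internals.
# Assumptions: ratios are taken over non-whitespace characters, the symbol
# set below stands in for "math symbols", and any one rule flags the line.
is_nonprose_line <- function(line,
                             digit_ratio = 0.35,
                             symbol_ratio = 0.15,
                             short_token_max = 3) {
  squeezed <- gsub("\\s+", "", line)
  if (!nzchar(squeezed)) return(FALSE)          # blank lines are not flagged
  chars <- strsplit(squeezed, "")[[1]]

  digit_share  <- mean(grepl("[0-9]", chars))   # share of digit characters
  symbol_share <- mean(chars %in% c("=", "+", "-", "*", "/", "<", ">", "^", "%"))

  tokens <- strsplit(trimws(line), "\\s+")[[1]] # whitespace-separated tokens
  short_symbolic <- length(tokens) <= short_token_max && symbol_share > 0

  digit_share >= digit_ratio || symbol_share >= symbol_ratio || short_symbolic
}

is_nonprose_line("0.12 0.34 0.56 0.78")                             # TRUE
is_nonprose_line("The standard error shrinks as the sample grows.") # FALSE
is_nonprose_line("x = y + 1")                                       # TRUE
```

Under these assumptions, raising nonprose_digit_ratio or nonprose_symbol_ratio would mask fewer lines, and lowering them would mask more.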
A tibble data frame with one row per mode ("unmasked", "masked") and the number of matches.
file <- system.file('pdf', '1501.00450.pdf', package = 'pdfsearch')
compare_mask_effect(file, keyword = "error", path = TRUE, column_count = "2")
#> # A tibble: 2 × 2
#>   mode     num_matches
#>   <chr>          <int>
#> 1 unmasked           2
#> 2 masked             2
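One way to use the summary for an A/B check is to compute how many matches the mask removed. The sketch below builds a stand-in data frame with the documented columns (mode, num_matches) and the values from the example output, so that it runs without the PDF; the names res, counts, and removed_by_mask are illustrative and not part of the package.

```r
# Stand-in for the documented return value; values copied from the example.
res <- data.frame(
  mode = c("unmasked", "masked"),
  num_matches = c(2L, 2L),
  stringsAsFactors = FALSE
)

# Index the counts by mode, then take the difference.
counts <- setNames(res$num_matches, res$mode)
removed_by_mask <- counts[["unmasked"]] - counts[["masked"]]
removed_by_mask
```

A difference of zero, as here, suggests that masking non-prose lines did not change the number of matches for this keyword in this document.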