Determine row or column on the fly

Find the location of a variable not based on it's columns/rows, but based on a regular expression or function

.find(
  fun = NULL,
  pattern = NULL,
  col = NULL,
  row = NULL,
  invert = FALSE,
  relative = FALSE
)

Arguments

fun: [character(1)]
function to identify columns or rows in the input table on the fly.
pattern: [character(1)]
character string containing a regular expression to identify columns or rows in the input table on the fly.
col: [integerish(1)]
optionally, in case this function should only be applied to certain columns, provides this here.
row: [integerish(1)]
optionally, in case this function should only be applied to certain rows, provides this here.
invert: [logical(1)]
whether or not the identified columns or rows should be inverted, i.e., all other columns or rows should be selected.
relative: [logical(1)]
whether or not the values provided in col or row are relative to the cluster position(s) or whether they are absolute positions, i.e, refer to the overall table.

Value

the index values where the target was found.

Details

This functions is basically a wild-card for when columns or rows are not known ad-hoc, but have to be assigned on the fly. This can be very helpful when several tables contain the same variables, but the arrangement may be slightly different.

How does this work

The first step in using any schema is validating it via the function validateSchema. This happens by default in reorganise, but can also be done manually, for example when debugging complicated schema descriptions.

In case that function encounters a schema that wants to find columns or rows on the fly via .find, it combines all cells of columns and all cells of rows into one character string and matches the regular expression or function on those. Columns/rows that have a match are returned as the respective column/row value.

Examples

# use regular expressions to find cell positions
(input <- tabs2shift$clusters_messy)
#> # A tibble: 13 × 7
#>    X1          X2        X3         X4          X5        X6         X7    
#>    <chr>       <chr>     <chr>      <chr>       <chr>     <chr>      <chr> 
#>  1 commodities harvested production NA          NA        NA         NA    
#>  2 unit 1      NA        NA         NA          NA        NA         NA    
#>  3 soybean     1111      1112       year 1      NA        NA         NA    
#>  4 maize       1121      1122       year 1      NA        NA         NA    
#>  5 soybean     1211      1212       year 2      NA        NA         NA    
#>  6 maize       1221      1222       year 2      NA        NA         NA    
#>  7 NA          NA        NA         NA          NA        NA         NA    
#>  8 commodities harvested production commodities harvested production NA    
#>  9 unit 2      NA        NA         unit 3      NA        NA         NA    
#> 10 soybean     2111      2112       soybean     3111      3112       year 1
#> 11 maize       2121      2122       maize       3121      3122       year 1
#> 12 soybean     2211      2212       soybean     3211      3212       year 2
#> 13 maize       2221      2222       maize       3221      3222       year 2

schema <- setCluster(id = "territories",
                     left = .find(pattern = "comm*"), top = .find(pattern = "comm*")) %>%
  setIDVar(name = "territories", columns = c(1, 1, 4), rows = c(2, 9, 9)) %>%
  setIDVar(name = "year", columns = 4, rows = c(3:6), distinct = TRUE) %>%
  setIDVar(name = "commodities", columns = c(1, 1, 4)) %>%
  setObsVar(name = "harvested", columns = c(2, 2, 5)) %>%
  setObsVar(name = "production", columns = c(3, 3, 6))

schema
#>   1 cluster
#>     origin : list(by = ~"comm*", col = NULL, row = NULL, invert = FALSE, relative = FALSE)|list(by = ~"comm*", col = NULL, row = NULL, invert = FALSE, relative = FALSE)  (row|col)
#>     id     : territories
#> 
#>    variable      type       row    col    dist 
#>   ------------- ---------- ------ ------ ------  
#>    territories   id         2, 9   1, 4   F  
#>    year          id         3:6    4      T  
#>    commodities   id                1, 4   F  
#>    harvested     observed          2, 5   F  
#>    production    observed          3, 6   F  
validateSchema(schema = schema, input = input)
#>   3 clusters
#>     origin : 1|1, 8|1, 8|4  (row|col)
#>     id     : territories
#> 
#>   filter  [rows 3, 4, 5, 6, 10, 11, 12, 13]
#> 
#>    variable      type       row    top    col    dist 
#>   ------------- ---------- ------ ------ ------ ------  
#>    territories   id         2, 9          1, 4   F  
#>    year          id         3:6           4      T  
#>    commodities   id                       1, 4   F  
#>    harvested     observed          1, 8   2, 5   F  
#>    production    observed          1, 8   3, 6   F  

# use a function to find rows
(input <- tabs2shift$messy_rows)
#> # A tibble: 16 × 7
#>    X1          X2     X3          X4             X5        X6         X7       
#>    <chr>       <chr>  <chr>       <chr>          <chr>     <chr>      <chr>    
#>  1 territories period commodities other_observed harvested production empty_col
#>  2 1           -      all         xyz            999       999        NA       
#>  3 2           -      none        xyz            999       999        NA       
#>  4 NA          NA     NA          NA             NA        NA         NA       
#>  5 3           -      xyz         xyz            999       999        NA       
#>  6 4           -      5           xyz            999       999        NA       
#>  7 unit 1      year 1 soybean     xyz            1111      1112       NA       
#>  8 unit 1      year 1 maize       xyz            1121      1122       NA       
#>  9 unit 1      year 2 soybean     xyz            1211      1212       NA       
#> 10 unit 1      year 2 maize       xyz            1221      1222       NA       
#> 11 5           -      all         NA             999       999        NA       
#> 12 6           -      none        NA             999       999        NA       
#> 13 unit 2      year 1 soybean     xyz            2111      2112       NA       
#> 14 unit 2      year 1 maize       xyz            2121      2122       NA       
#> 15 unit 2      year 2 soybean     xyz            2211      2212       NA       
#> 16 unit 2      year 2 maize       xyz            2221      2222       NA       

schema <-
  setFilter(rows = .find(fun = is.numeric, col = 1, invert = TRUE)) %>%
  setIDVar(name = "territories", columns = 1) %>%
  setIDVar(name = "year", columns = 2) %>%
  setIDVar(name = "commodities", columns = 3) %>%
  setObsVar(name = "harvested", columns = 5) %>%
  setObsVar(name = "production", columns = 6)

reorganise(schema = schema, input = input)
#> filling NA-values in downwards direction in column 'territories'.
#> filling NA-values in downwards direction in column 'year'.
#> filling NA-values in downwards direction in column 'commodities'.
#> # A tibble: 8 × 5
#>   territories year   commodities harvested production
#>   <chr>       <chr>  <chr>           <dbl>      <dbl>
#> 1 unit 1      year 1 soybean          1111       1112
#> 2 unit 1      year 1 maize            1121       1122
#> 3 unit 1      year 2 soybean          1211       1212
#> 4 unit 1      year 2 maize            1221       1222
#> 5 unit 2      year 1 soybean          2111       2112
#> 6 unit 2      year 1 maize            2121       2122
#> 7 unit 2      year 2 soybean          2211       2212
#> 8 unit 2      year 2 maize            2221       2222