Find the location of a variable not based on it's columns/rows, but based on a regular expression or function
.find(
fun = NULL,
pattern = NULL,
col = NULL,
row = NULL,
invert = FALSE,
relative = FALSE
)
[character(1)
]
function to identify columns or rows in
the input table on the fly.
[character(1)
]
character string containing a regular
expression to identify columns or rows in the input table on the fly.
[integerish(1)
]
optionally, in case this function should
only be applied to certain columns, provides this here.
[integerish(1)
]
optionally, in case this function should
only be applied to certain rows, provides this here.
[logical(1)
]
whether or not the identified columns or
rows should be inverted, i.e., all other columns or rows should be
selected.
[logical(1)
]
whether or not the values provided in
col
or row
are relative to the cluster position(s) or whether
they are absolute positions, i.e, refer to the overall table.
the index values where the target was found.
This functions is basically a wild-card for when columns or rows are not known ad-hoc, but have to be assigned on the fly. This can be very helpful when several tables contain the same variables, but the arrangement may be slightly different.
The first step in using any schema is validating
it via the function validateSchema
. This happens by default
in reorganise
, but can also be done manually, for example
when debugging complicated schema descriptions.
In case that function encounters a schema that wants to find columns or
rows on the fly via .find
, it combines all cells of columns and all
cells of rows into one character string and matches the regular expression
or function on those. Columns/rows that have a match are returned as the
respective column/row value.
# use regular expressions to find cell positions
(input <- tabs2shift$clusters_messy)
#> # A tibble: 13 × 7
#> X1 X2 X3 X4 X5 X6 X7
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 commodities harvested production NA NA NA NA
#> 2 unit 1 NA NA NA NA NA NA
#> 3 soybean 1111 1112 year 1 NA NA NA
#> 4 maize 1121 1122 year 1 NA NA NA
#> 5 soybean 1211 1212 year 2 NA NA NA
#> 6 maize 1221 1222 year 2 NA NA NA
#> 7 NA NA NA NA NA NA NA
#> 8 commodities harvested production commodities harvested production NA
#> 9 unit 2 NA NA unit 3 NA NA NA
#> 10 soybean 2111 2112 soybean 3111 3112 year 1
#> 11 maize 2121 2122 maize 3121 3122 year 1
#> 12 soybean 2211 2212 soybean 3211 3212 year 2
#> 13 maize 2221 2222 maize 3221 3222 year 2
schema <- setCluster(id = "territories",
left = .find(pattern = "comm*"), top = .find(pattern = "comm*")) %>%
setIDVar(name = "territories", columns = c(1, 1, 4), rows = c(2, 9, 9)) %>%
setIDVar(name = "year", columns = 4, rows = c(3:6), distinct = TRUE) %>%
setIDVar(name = "commodities", columns = c(1, 1, 4)) %>%
setObsVar(name = "harvested", columns = c(2, 2, 5)) %>%
setObsVar(name = "production", columns = c(3, 3, 6))
schema
#> 1 cluster
#> origin : list(by = ~"comm*", col = NULL, row = NULL, invert = FALSE, relative = FALSE)|list(by = ~"comm*", col = NULL, row = NULL, invert = FALSE, relative = FALSE) (row|col)
#> id : territories
#>
#> variable type row col dist
#> ------------- ---------- ------ ------ ------
#> territories id 2, 9 1, 4 F
#> year id 3:6 4 T
#> commodities id 1, 4 F
#> harvested observed 2, 5 F
#> production observed 3, 6 F
validateSchema(schema = schema, input = input)
#> 3 clusters
#> origin : 1|1, 8|1, 8|4 (row|col)
#> id : territories
#>
#> filter [rows 3, 4, 5, 6, 10, 11, 12, 13]
#>
#> variable type row top col dist
#> ------------- ---------- ------ ------ ------ ------
#> territories id 2, 9 1, 4 F
#> year id 3:6 4 T
#> commodities id 1, 4 F
#> harvested observed 1, 8 2, 5 F
#> production observed 1, 8 3, 6 F
# use a function to find rows
(input <- tabs2shift$messy_rows)
#> # A tibble: 16 × 7
#> X1 X2 X3 X4 X5 X6 X7
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 territories period commodities other_observed harvested production empty_col
#> 2 1 - all xyz 999 999 NA
#> 3 2 - none xyz 999 999 NA
#> 4 NA NA NA NA NA NA NA
#> 5 3 - xyz xyz 999 999 NA
#> 6 4 - 5 xyz 999 999 NA
#> 7 unit 1 year 1 soybean xyz 1111 1112 NA
#> 8 unit 1 year 1 maize xyz 1121 1122 NA
#> 9 unit 1 year 2 soybean xyz 1211 1212 NA
#> 10 unit 1 year 2 maize xyz 1221 1222 NA
#> 11 5 - all NA 999 999 NA
#> 12 6 - none NA 999 999 NA
#> 13 unit 2 year 1 soybean xyz 2111 2112 NA
#> 14 unit 2 year 1 maize xyz 2121 2122 NA
#> 15 unit 2 year 2 soybean xyz 2211 2212 NA
#> 16 unit 2 year 2 maize xyz 2221 2222 NA
schema <-
setFilter(rows = .find(fun = is.numeric, col = 1, invert = TRUE)) %>%
setIDVar(name = "territories", columns = 1) %>%
setIDVar(name = "year", columns = 2) %>%
setIDVar(name = "commodities", columns = 3) %>%
setObsVar(name = "harvested", columns = 5) %>%
setObsVar(name = "production", columns = 6)
reorganise(schema = schema, input = input)
#> filling NA-values in downwards direction in column 'territories'.
#> filling NA-values in downwards direction in column 'year'.
#> filling NA-values in downwards direction in column 'commodities'.
#> # A tibble: 8 × 5
#> territories year commodities harvested production
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 unit 1 year 1 soybean 1111 1112
#> 2 unit 1 year 1 maize 1121 1122
#> 3 unit 1 year 2 soybean 1211 1212
#> 4 unit 1 year 2 maize 1221 1222
#> 5 unit 2 year 1 soybean 2111 2112
#> 6 unit 2 year 1 maize 2121 2122
#> 7 unit 2 year 2 soybean 2211 2212
#> 8 unit 2 year 2 maize 2221 2222