Check and update schema descriptions — validateSchema • tabshiftr

This function takes a raw schema description and fills in values that were given only as wildcards or were left implicit. It is called automatically by reorganise, but can also be used together with the getters to debug a schema.

validateSchema(schema = NULL, input = NULL)

Arguments

schema: [symbol(1)]
the schema description.
input: [data.frame(1)]
an input for which to check a schema description.

Value

An updated schema description.

Details

The core idea of a schema description is that it can be written in a very generic way, as long as it sufficiently describes where in a table each variable can be found. One way to keep it generic is to use the function .find to identify initially unknown cell locations of a variable on the fly, for example when it is known only that a variable must be in the table, but not where.

validateSchema matches a schema against an input table, inserts the positions (of clusters, filters and variables) evaluated from that table, adapts some of the metadata and ensures the formal consistency of the schema.
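The debugging workflow mentioned above (validating a schema by hand and then inspecting it with the getters) might look roughly like the following. This is a minimal sketch, not a definitive recipe: it reuses the tidy example table and schema from the Examples, and it assumes that the getters getIDVars and getObsVars accept the validated schema and the same input as named arguments, and that the pipe %>% is available once tabshiftr is attached (otherwise magrittr would need to be loaded as well).

```r
library(tabshiftr)

# Same tidy example table and schema as in the Examples.
input <- tabs2shift$tidy

schema <-
  setIDVar(name = "territories", col = 1) %>%
  setIDVar(name = "year", col = .find(pattern = "period")) %>%
  setIDVar(name = "commodities", col = 3) %>%
  setObsVar(name = "harvested", col = 5) %>%
  setObsVar(name = "production", col = 6)

# Resolve wildcards and implied values against the input table.
validated <- validateSchema(schema = schema, input = input)

# Assumption: the getters take the validated schema plus the same input
# and return the cells that the schema identifies for each variable.
idVars  <- getIDVars(schema = validated, input = input)
obsVars <- getObsVars(schema = validated, input = input)

# reorganise calls validateSchema itself, so it receives the raw schema.
tidy <- reorganise(input = input, schema = schema)
```

Inspecting idVars and obsVars before calling reorganise can help narrow down whether an unexpected result comes from a wrongly located variable or from a later reshaping step.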

Examples

library(tabshiftr)

# build a schema for an already tidy table
(tidyTab <- tabs2shift$tidy)
#> # A tibble: 10 × 7
#>    X1          X2     X3          X4             X5        X6         X7       
#>    <chr>       <chr>  <chr>       <chr>          <chr>     <chr>      <chr>    
#>  1 territories period commodities other_observed harvested production empty_col
#>  2 unit 1      year 1 soybean     xyz            1111      1112       NA       
#>  3 unit 1      year 1 maize       xyz            1121      1122       NA       
#>  4 unit 1      year 2 soybean     xyz            1211      1212       NA       
#>  5 unit 1      year 2 maize       xyz            1221      1222       NA       
#>  6 NA          NA     NA          NA             NA        NA         NA       
#>  7 unit 2      year 1 soybean     xyz            2111      2112       NA       
#>  8 unit 2      year 1 maize       xyz            2121      2122       NA       
#>  9 unit 2      year 2 soybean     xyz            2211      2212       NA       
#> 10 unit 2      year 2 maize       xyz            2221      2222       NA       

schema <-
  setIDVar(name = "territories", col = 1) %>%
  setIDVar(name = "year", col = .find(pattern = "period")) %>%
  setIDVar(name = "commodities", col = 3) %>%
  setObsVar(name = "harvested", col = 5) %>%
  setObsVar(name = "production", col = 6)

# before ...
schema
#>   1 cluster (whole spreadsheet)
#> 
#>    variable      type       col       
#>   ------------- ---------- --------  
#>    territories   id         1        
#>    year          id         period   
#>    commodities   id         3        
#>    harvested     observed   5        
#>    production    observed   6        

# ... after
validateSchema(schema = schema, input = tidyTab)
#>   1 cluster
#>     origin : 1|1  (row|col)
#> 
#>   filter  [rows 2, 3, 4, 5, 7, 8, 9, 10]
#> 
#>    variable      type       top   col    
#>   ------------- ---------- ----- -----  
#>    territories   id               1     
#>    year          id               2     
#>    commodities   id               3     
#>    harvested     observed   1     5     
#>    production    observed   1     6