Tables as places where data are recorded can be pretty messy. The ‘tidy’ paradigm in R proposes that data are organised so that variables are recorded in columns, observations in rows and that there is only one value per cell (Wickham 2014). This, however, is only one interpretation of how data should be organised (Broman and Woo 2018) and especially when scraping data off the internet, one frequently encounters spreadsheets that don’t follow this paradigm.
The tidyr
package is one of the most popular tools to
bring data into a tidy format. However, up until today it is limited to
tables that are already organised into topologically coherent
(rectangular) chunks and any messiness beyond that requires dedicated
scripts for reorganization. In tabshiftr
we try to describe
and work with a further dimension of messiness, where data are available
as so-called disorganized (messy) data, data that are not yet
arranged into rectangular form.
The approach of tabshiftr
is based on describing the
arrangement of such tables in a so-called schema
description, which is then the basis for automatic reorganization via
the function reorganise()
. Typically there is an input and
an output schema, describing the arrangement of the input and output
tables, respectively. The advantage of this procedure is that input and
output tables exist explicitly and the schema maps the transformation of
the data. As we want to end up with tidy tables, the output schema is
pre-determined by a tidy table of the included variables and the input
schema needs to be put together by you, the
user.
Data can be disorganised according to multiple dimensions. To understand those dimensions, we first need to understand the nature of data. Data "of the same kind" are collected in a variable, which is always a combination of a name and the values. In a table, names are typically in the topmost row and values are in the column below that name (Tab. 1). Conceptually, there are two types of variables in any table:
Moreover, a table is part of a series when other tables of that series contain the same variables with slightly different values for the identifying variables, irrespective of how the distinct tables of that series are arranged.
Here, we do use the term spreadsheet to denote what would typically be seen as table, as the data we deal with here are typically "disorganised" and thus not what would be recognised as a table. Even though data in many spreadsheets are disorganised, they are mostly not non-systematic. Especially in complex spreadsheets, one often encounters a situation where a set of variables occurs more than once with the same or very similar arrangement, which we call cluster here. Data that are part of clusters are split along one of the, typically categorical, variables (the cluster ID), with the aim to increase the visual accessibility or direct the focus for human readers (Tab. 2). This may also be the case where data are split up into several files or spreadsheets of a file, where the cluster ID can be found in the file or spreasheet name or the meta-data. In many cases, the cluster ID is an implicit variable, a variable that is not immediately recogniseable as such.
To set up a schema description, go through the following questions step by step and provide the respective answer in the respective function. Linked tables can serve as examples to compare against and right after the decision tree there is a step-by-step hands-on example of how to build a schema description.
Variables: Clarify which are the identifying variables and which are the observed variables. Make sure not to mistake a listed observed variable (Tab. 10) as an identifying variable.
Format: Provide potentially information about
the table format in setFormat()
.
Clusters: In case there are clusters, provide in
setCluster()
Identifying variables: provide in
setIDVar()
Observed variable: provide in
setObsVar()
tabshiftr
, tables need to be read in while
treating any header rows as data, i.e., by not setting the
first row as header
input <- read_csv(file = ...,
col_names = FALSE,
col_types = cols(.default = "c"))
reorganise()
takes care of reformatting the data-types
into the most permissive data type that does not introduce
NA
s where there should be data, i.e, if a variable can be
numeric, it is formatted as numeric column, otherwise it will be stored
as a character column.unit1/2/3
values disappear, which makes it
easier to see that the respective column is actually a tidy column of
commodities.As an example, we show here how to build a schema description for a
table that has a wide identifying variable and a listed observed
variable. This table contains additionally some dummy information one
would typically encounter in tables, such as empty_col
s and
rows and data that are not immediately of interest
(other_observed
).
kable(input <- tabs2shift$listed_column_wide)
X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 |
---|---|---|---|---|---|---|---|
territories | period | . | dimension | other_observed | soybean | maize | empty_col |
unit 1 | year 1 | . | harvested | xyz | 1111 | 1121 | . |
unit 1 | year 1 | . | production | xyz | 1112 | 1122 | . |
unit 1 | year 2 | . | harvested | xyz | 1211 | 1221 | . |
unit 1 | year 2 | . | production | xyz | 1212 | 1222 | . |
. | . | . | . | . | . | . | . |
unit 2 | year 1 | . | harvested | xyz | 2111 | 2121 | . |
unit 2 | year 1 | . | production | xyz | 2112 | 2122 | . |
unit 2 | year 2 | . | harvested | xyz | 2211 | 2221 | . |
unit 2 | year 2 | . | production | xyz | 2212 | 2222 | . |
In this case we don’t need to set clusters and can start immediately
with setting the first id variable territories
, which is in
the first column and otherwise tidy. The order by which we set the
variables determines where they ocurr in the output table. Any of the
setters start by default with an empty schema, in case none is provided
to them from a previous setter, thus none needs to be provided at the
beginning of a schema.
schema <- setIDVar(name = "territories", columns = 1)
Since version 0.3.0, tabshiftr
comes with getters that
allow to debug the current schema description. To do this, however, the
schema first needs to be validated. This is in order to make sure that
all the generic information are evaluated with the respective input.
After that, a getter can be used to extract the respective information,
for example the reorganised id variables with
getIDVars()
.
validateSchema(schema = schema, input = input) %>%
getIDVars(input = input)
#> filling NA-values in downwards direction in column 'territories'.
#> [[1]]
#> [[1]]$territories
#> # A tibble: 9 × 1
#> X1
#> <chr>
#> 1 territories
#> 2 unit 1
#> 3 unit 1
#> 4 unit 1
#> 5 unit 1
#> 6 unit 2
#> 7 unit 2
#> 8 unit 2
#> 9 unit 2
After seeing that our specification results in a meaningful output,
we can continue setting the other id variables years
(tidy
and in column 2) and commodities
(spread over two columns
and the values are in the first row). Note, how we pipe the previous
schema into the next setter. This results in the next variable being
added to that schema.
schema <- schema %>%
setIDVar(name = "year", columns = 2) %>%
setIDVar(name = "commodities", columns = c(6, 7), rows = 1)
Validating and checking for id variables again results in the following
validateSchema(schema = schema, input = input) %>%
getIDVars(input = input)
#> filling NA-values in downwards direction in column 'territories'.
#> filling NA-values in downwards direction in column 'year'.
#> [[1]]
#> [[1]]$territories
#> # A tibble: 8 × 1
#> X1
#> <chr>
#> 1 unit 1
#> 2 unit 1
#> 3 unit 1
#> 4 unit 1
#> 5 unit 2
#> 6 unit 2
#> 7 unit 2
#> 8 unit 2
#>
#> [[1]]$year
#> # A tibble: 8 × 1
#> X2
#> <chr>
#> 1 year 1
#> 2 year 1
#> 3 year 2
#> 4 year 2
#> 5 year 1
#> 6 year 1
#> 7 year 2
#> 8 year 2
#>
#> [[1]]$commodities
#> # A tibble: 1 × 2
#> X6 X7
#> <chr> <chr>
#> 1 soybean maize
The id variable commodities
is clearly wide (more than
one column) and its’ values are not repeated four times, as it should
be, judging by the combination of the other variables. However, this is
an expected tentative output that will be handled in a later step and
the id variables have been specified correctly.
Next, we set the listed observed variables. Listed means
that the column names of the observed variables are treated as if they
were the values of an identifying variable (in column 4), while the
values are in the columns 6 and 7. In this case, the values need to be
filtered by value
(i.e., the values of that variable are
found in columns 6 and 7, where column 4 contains
value
).
schema <- schema %>%
setObsVar(name = "harvested", columns = c(6, 7), key = 4, value = "harvested") %>%
setObsVar(name = "production", columns = c(6, 7), key = 4, value = "production")
We then get the following observed variables, which is also an expected tentative output.
validateSchema(schema = schema, input = input) %>%
getObsVars(input = input)
#> [[1]]
#> [[1]]$listed
#> # A tibble: 8 × 3
#> key X6 X7
#> <chr> <chr> <chr>
#> 1 harvested 1111 1121
#> 2 production 1112 1122
#> 3 harvested 1211 1221
#> 4 production 1212 1222
#> 5 harvested 2111 2121
#> 6 production 2112 2122
#> 7 harvested 2211 2221
#> 8 production 2212 2222
From both, the output of getIDVars
and
getObsVars
we can calculate how many and which combinations
of data exist (e.g., the two columns in the observed variables
correspond to the two values of the identifying variable
commodities
) and that they still need to be pivoted to be
in a tidy arrangement.
The reorganise()
function carries out the steps of
validating, extracting the variables, pivoting the tentative output and
putting the final table together automatically, so it merely requires
the finalized (non-validated) schema
and the
input
table.
schema # has a pretty print function
#> 1 cluster (whole spreadsheet)
#>
#> variable type row col key value
#> ------------- ---------- ----- ----- ----- ------------
#> territories id 1
#> year id 2
#> commodities id 1 6:7
#> harvested observed 6:7 4 harvested
#> production observed 6:7 4 production
reorganise(input = input, schema = schema)
#> filling NA-values in downwards direction in column 'territories'.
#> filling NA-values in downwards direction in column 'year'.
#> # A tibble: 8 × 5
#> territories year commodities harvested production
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 unit 1 year 1 soybean 1111 1112
#> 2 unit 1 year 1 maize 1121 1122
#> 3 unit 1 year 2 soybean 1211 1212
#> 4 unit 1 year 2 maize 1221 1222
#> 5 unit 2 year 1 soybean 2111 2112
#> 6 unit 2 year 1 maize 2121 2122
#> 7 unit 2 year 2 soybean 2211 2212
#> 8 unit 2 year 2 maize 2221 2222
In this section we look at some examples of disorganized data, discuss the dimension along which they are disorganized and show which schema description should be used to reorganize them.
All of the following examples contain an other_observed
,
an empty_col
column and an empty row, which serve the
purpose of dummy information or formatting that could be found in any
table and should not disturb the process of reorganizing. You can run
all the examples by simply loading the schema and calling
reorganise(input = tabs2shift$..., schema = schema)
with
the respective table that is plotted for this example.
In case the observed variables are arranged into individual columns (Tab. 3), we have tidy data (Wickham 2014), which are largely already in the target arrangement. The tidy table may however, still contain unneeded data, need different names, or transformation factors for the values.
kable(tabs2shift$tidy)
schema <-
setIDVar(name = "territories", columns = 1) %>%
setIDVar(name = "year", columns = 2) %>%
setIDVar(name = "commodities", columns = 3) %>%
setObsVar(name = "harvested", columns = 5) %>%
setObsVar(name = "production", columns = 6, factor = 0.1)
reorganise(input = tabs2shift$tidy, schema = schema)
#> filling NA-values in downwards direction in column 'territories'.
#> filling NA-values in downwards direction in column 'year'.
#> filling NA-values in downwards direction in column 'commodities'.
#> # A tibble: 8 × 5
#> territories year commodities harvested production
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 unit 1 year 1 soybean 1111 111.
#> 2 unit 1 year 1 maize 1121 112.
#> 3 unit 1 year 2 soybean 1211 121.
#> 4 unit 1 year 2 maize 1221 122.
#> 5 unit 2 year 1 soybean 2111 211.
#> 6 unit 2 year 1 maize 2121 212.
#> 7 unit 2 year 2 soybean 2211 221.
#> 8 unit 2 year 2 maize 2221 222.
Sometimes it may be the case that the number of variables is not the same as there are columns because either one variable is spread over several column, or one column contains several variables.
In the former case, columns need to be merged (Tab. 4) and in the latter case, columns need to be
split via regular
expressions (Tab. 5). For example,
.+?(?=_)
gives everything up until the first _
and (?<=\\_).*
everything after the _
.
kable(input <- tabs2shift$split_column)
schema <-
setIDVar(name = "territories", columns = 1) %>%
setIDVar(name = "year", columns = c(2, 4), merge = " ") %>%
setIDVar(name = "commodities", columns = 5) %>%
setObsVar(name = "harvested", columns = 6) %>%
setObsVar(name = "production", columns = 7)
kable(tabs2shift$merged_column)
When data are split up into clusters that are stored in separate files or spreadsheets, the cluster ID is often recorded not in the table as an explicit variable, but is only provided in the file or table name. In those cases, we have to register this cluster ID as an identifying variable nevertheless, to output a consistent table.
kable(input <- tabs2shift$implicit_variable)
In case identifying variables are factors with a small number of levels, those levels may be falsely used as names of other variables, where they would be next to each other and thus "wide" (Tab. 7). Those other variables (both identifying and observed variables) would then be "nested" in the wide identifying variables. In those cases we have to record for the identifying variable(s) the columns and the row in which the values of the identifying variable are found (they will look like they are part of the header). For the observed variable(s) we need to record the columns and the row where the name of that variable is found.
kable(input <- tabs2shift$one_wide_id)
schema <-
setIDVar(name = "territories", columns = 1) %>%
setIDVar(name = "year", columns = 3) %>%
setIDVar(name = "commodities", columns = c(4, 6), rows = 1) %>%
setObsVar(name = "harvested", columns = c(4, 6), top = 2) %>%
setObsVar(name = "production", columns = c(5, 7), top = 2)
The same is also true in case the identifying variables are actually
nested within the observed variables, i.e., the names of the observed
variables are on top of the names of the id variables (Tab. 8). However, if an observed variables is in
the topmost row (and there are no clusters), top = 1
can be
omitted.
kable(input <- tabs2shift$wide_obs)
schema <-
setIDVar(name = "territories", columns = 1) %>%
setIDVar(name = "year", columns = 2) %>%
setIDVar(name = "commodities", columns = c(3:6), rows = 2) %>%
setObsVar(name = "harvested", columns = c(3, 4)) %>%
setObsVar(name = "production", columns = c(5, 6))
In case several variables are nested within other variables, we have to specify for all nested or nesting variables in which respective rows their values sit.
kable(input <- tabs2shift$two_wide_id)
schema <-
setIDVar(name = "territories", columns = 1) %>%
setIDVar(name = "year", columns = c(2, 6), rows = 1) %>%
setIDVar(name = "commodities", columns = c(2, 4, 6, 8), rows = 2) %>%
setObsVar(name = "harvested", columns = c(2, 4, 6, 8), top = 3) %>%
setObsVar(name = "production", columns = c(3, 5, 7, 9), top = 3)
Some tables contain a column where the names of observed variables
(harvested
and production
) are treated as if
they were the values of an identifying variable
(dimension
), while the values are presented in only one
column (values
) (Tab. 10). To end up
with tidy data in those cases, we need to extract the values associated
with the observed variables. Thus, we define the observed variables and
specify the key =
in which the variable names sit, and the
value =
the variable name has, to extract that
variable.
kable(input <- tabs2shift$listed_column)
schema <-
setIDVar(name = "territories", columns = 1) %>%
setIDVar(name = "year", columns = 2) %>%
setIDVar(name = "commodities", columns = 3) %>%
setObsVar(name = "harvested", columns = 7, key = 6, value = "harvested") %>%
setObsVar(name = "production", columns = 7, key = 6, value = "production")
Moreover, (several) identifying variables may be wide additionally and we have to proceed as mentioned above, by providing the columns and the rows of the variable values (which appear to be names).
kable(input <- tabs2shift$listed_column_wide)
schema <-
setIDVar(name = "territories", columns = 1) %>%
setIDVar(name = "year", columns = 2) %>%
setIDVar(name = "commodities", columns = c(6, 7), rows = 1) %>%
setObsVar(name = "harvested", columns = c(6, 7), key = 4, value = "harvested") %>%
setObsVar(name = "production", columns = c(6, 7), key = 4, value = "production")
Clusters are typically of the same arrangement within one table, they can be repeated along rows (horizontally) or along columns (vertically), but also a non-systematic distribution is possible. Moreover, clusters may be nested into some grouping variable in a similar way that data are nested into clusters. A table could also be treated like a cluster when the table is not only composed of the table, but perhaps also text in the form of some table description (that may be scattered in the document and) that does not allow the table to start at the table origin in the topmost left cell.
In case clusters are sitting right next to each other in the same origin row (Tab. 12), it is sufficient to provide the topmost row and all leftmost columns at which a new cluster starts.
kable(input <- tabs2shift$clusters_horizontal)
schema <- setCluster(id = "territories",
left = c(1, 6), top = 2) %>%
setIDVar(name = "territories", columns = c(1, 6), rows = 2) %>%
setIDVar(name = "year", columns = c(2, 7)) %>%
setIDVar(name = "commodities", columns = c(1, 6)) %>%
setObsVar(name = "harvested", columns = c(3, 8)) %>%
setObsVar(name = "production", columns = c(4, 9))
For vertically arranged clusters (Tab. 13), just like for the horizontal case, the respective rows and columns need to be provided.
kable(input <- tabs2shift$clusters_vertical)
schema <- setCluster(id = "territories",
left = 1, top = c(3, 9)) %>%
setIDVar(name = "territories", columns = 1, rows = c(3, 9)) %>%
setIDVar(name = "year", columns = 2) %>%
setIDVar(name = "commodities", columns = 5) %>%
setObsVar(name = "harvested", columns = 6) %>%
setObsVar(name = "production", columns = 7)
The previous two types of clusters are clusters of identifying
variables, but it may also be the case that the observed variables are
split up into distinct clusters. Here, we need to specify first of all
setClusters(..., id = "observed")
to indicate that clusters
are observed variables. Next, we need to set up the observed variables
so that they contain "key = "cluster"
and in
value
the number of the cluster this variable can be found
in.
kable(input <- tabs2shift$clusters_observed)
schema <- setCluster(id = "observed",
left = 1, top = c(2, 12)) %>%
setIDVar(name = "territories", columns = 2) %>%
setIDVar(name = "year", columns = 3) %>%
setIDVar(name = "commodities", columns = 5) %>%
setObsVar(name = "harvested", columns = 7, key = "cluster", value = 1) %>%
setObsVar(name = "production", columns = 7, key = "cluster", value = 2)
When (some) clusters are nested into a (grouping) variable of
interest, not only the cluster positions need to be specified, but also
their relation to the grouping variable. Similar to the cluster ID, this
group ID also needs to be specified as an identifying variable and needs
to be provided as group = ...
and the membership of each
cluster to a group needs to be specified in member = ...
.
The cluster position needs to be specified just as it would be without
groups.
kable(input <- tabs2shift$clusters_nested)
schema <- setCluster(id = "sublevel",
group = "territories", member = c(1, 1, 2),
left = 1, top = c(3, 8, 15)) %>%
setIDVar(name = "territories", columns = 1, rows = c(2, 14)) %>%
setIDVar(name = "sublevel", columns = 1, rows = c(3, 8, 15)) %>%
setIDVar(name = "year", columns = 7) %>%
setIDVar(name = "commodities", columns = 2) %>%
setObsVar(name = "harvested", columns = 5) %>%
setObsVar(name = "production", columns = 6)
When not all identifying variables can be provided relative to the
cluster origin, for example because they are missing for some clusters,
it makes more sense to define such a variable as a distinct variable.
This is done by providing row
and col
as
absolute values and setting distinct = TRUE
. Other
variables that are all arranged in the same way in each cluster can be
specified so that their row and column indices are given relative to the
cluster position (relative = TRUE
), as shown in the
alternative shema below.
kable(input <- tabs2shift$clusters_messy)
schema <- setCluster(id = "territories",
left = c(1, 1, 4), top = c(1, 8, 8)) %>%
setIDVar(name = "territories", columns = c(1, 1, 4), rows = c(2, 9, 9)) %>%
setIDVar(name = "year", columns = 4, rows = c(3:6), distinct = TRUE) %>%
setIDVar(name = "commodities", columns = c(1, 1, 4)) %>%
setObsVar(name = "harvested", columns = c(2, 2, 5)) %>%
setObsVar(name = "production", columns = c(3, 3, 6))
schema_alt <- setCluster(id = "territories",
left = c(1, 1, 4), top = c(1, 8, 8)) %>%
setIDVar(name = "territories", columns = 1, rows = .find(row = 2, relative = TRUE)) %>%
setIDVar(name = "year", columns = 4, rows = c(3:6), distinct = TRUE) %>%
setIDVar(name = "commodities", columns = .find(col = 1, relative = TRUE)) %>%
setObsVar(name = "harvested", columns = .find(col = 2, relative = TRUE)) %>%
setObsVar(name = "production", columns = .find(col = 3, relative = TRUE))
In case several clusters are neither aligned along a row nor a column, and are all of differing size, the respective information need to be provided at the same index of the respective property. For example, three clusters, where the first cluster starts at (1,1) and is 3 by 4 cells in size, where the second clusters starts at (5,2) and is 5 by 5 cells in size, and so on, needs to be specified as below.
schema <- setCluster(id = ...,
left = c(1, 2, 5), top = c(1, 5, 1),
width = c(3, 5, 2), height = c(4, 5, 3),
...) %>%
setIDVar(name = "territories", columns = .find(col = 1, relative = TRUE)) %>%
...
Additionally, given that at least the tables within each cluster are
all arranged in the same way, the contained variables can be specified
so that their row and column indices are given relative to the cluster
position (relative = TRUE
). If also that is not the case,
the row and column values for each cluster need to be provided for the
respective variables in the same way as for cluster positions.