A schema stores where each piece of information is located in a table of data.
cluster [list(1)]: description of clusters in the table.
format [list(1)]: description of the table format.
variables [named list(.)]: description of identifying and observed variables.
This section outlines the currently recommended strategy for setting up schema descriptions. See the vignette for example tables and their respective schemas.
Variables: Clarify which variables are identifying variables and which are observed variables. Make sure not to mistake a listed observed variable for an identifying variable.
Clusters: Determine whether there are clusters and, if so, find the origin (top left cell) of each cluster and provide the required information in setCluster(top = ..., left = ...). It is advised to treat a table that contains meta-data in the top rows as a cluster, as this is often the case with implicit variables. All variables need to be specified in each cluster (in case the clusters are all organised in the same arrangement), or relative = TRUE can be used.
Data may be organised into clusters a) whenever a set of variables occurs more than once in the same table, nested into another variable, or b) when the data are organised into separate spreadsheets or files according to one of the variables (depending on the context, these issues can also be solved differently). In both cases the variable responsible for clustering (the cluster ID) can be either an identifying variable or a categorical observed variable:
- in case the cluster ID is an identifying variable, provide its name in setCluster(id = ...) and also specify it as an identifying variable (setIDVar).
- in case it is an observed variable, simply provide setCluster(..., id = "observed").
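The cluster step for an identifying cluster ID can be sketched as follows. This is only an illustration under assumptions: the table layout (three clusters, their origins, and the variable name "territories") is hypothetical, and the argument names schema, name, columns and rows are given as they are believed to exist in the package; check ?setCluster and ?setIDVar for the installed version.

```r
library(tabshiftr)

# Hypothetical layout: three clusters whose top left cells sit at
# (top = 1, left = 1), (top = 8, left = 1) and (top = 8, left = 4).
# The cluster ID is the identifying variable "territories".
schema <- setCluster(id = "territories",
                     left = c(1, 1, 4), top = c(1, 8, 8))

# The cluster ID must also be registered as an identifying variable.
# The columns and rows are made up for this illustration (one position
# per cluster).
schema <- setIDVar(schema = schema, name = "territories",
                   columns = c(1, 1, 4), rows = c(2, 9, 9))
```

If the cluster ID were an observed variable instead, the second call would be dropped and id = "observed" passed to setCluster, as described above.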
Meta-data: Where applicable, provide information about the table format (setFormat).
Identifying variables: Determine the following:
- is the variable available at all? This is particularly important when the data are split up into tables that are in separate spreadsheets or files. Often the variable that splits up the data (and thus identifies the clusters) is no longer explicitly available in the table. In such a case, provide the value in setIDVar(..., value = ...).
- all columns in which the variable values sit.
- in case the variable is in several columns, additionally determine the row in which its values sit. In this case, the values will look like they are part of a header.
- in case the variable must be split off from another column, provide a regular expression that results in the target subset via setIDVar(..., split = ...).
- in case the variable is distinct from the main table, provide the explicit (non-relative) position and set setIDVar(..., distinct = TRUE).
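Two of the cases above, an implicit variable and a variable that has to be split off from a shared column, might look roughly like this. The variable names, the value "unit 1", the column index and the cell format "2019_maize" are all hypothetical, and the name and columns arguments are assumptions about the package interface rather than something stated in this section.

```r
library(tabshiftr)

# Implicit variable: it does not appear in the table, so its value is
# provided directly (hypothetical value).
schema <- setIDVar(name = "territories", value = "unit 1")

# Hypothetical column 2 holds entries such as "2019_maize".
# Each variable is split off with a regular expression:
# everything before the underscore is the year ...
schema <- setIDVar(schema = schema, name = "year",
                   columns = 2, split = ".+?(?=_)")
# ... and everything after the underscore is the commodity.
schema <- setIDVar(schema = schema, name = "commodities",
                   columns = 2, split = "(?<=_).*")
```

Both regular expressions use lookarounds so that the underscore itself is not part of either extracted value.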
Observed variables: Determine the following:
- all columns in which the values of the variable sit.
- the conversion factor.
- in case the variable is not tidy, go through the following cases one after the other:
  - in case the variable is nested in a wide identifying variable, determine, in addition to the columns in which the values sit, the rows in which the variable name sits.
  - in case the names of the variable are given as a value of an identifying variable, give the column name as setObsVar(..., key = ...), together with the name of the respective observed variable (as it appears in the table) in value.
  - in case the name of the variable is the ID of clusters, specify setObsVar(..., key = "cluster", value = ...), where value has the cluster number the variable refers to.