Static analysis of data/ exports in R packages #8521

@lionel-

Follow-up to #1325 and posit-dev/ark#870.

We now examine the DESCRIPTION and NAMESPACE files of a package to statically determine its exports, but datasets are not declared in these files.

While there are no top-level static declarations for exported datasets (https://cran.r-project.org/doc/manuals/r-devel/R-exts.html#Data-in-packages), Rd documentation has specific markup, \docType{data}, for datasets: https://cran.r-project.org/doc/manuals/r-devel/R-exts.html#Documenting-data-sets (thanks for the pointer @DavisVaughan). Fortunately, R CMD check warns about undocumented datasets, which means we can reliably use this markup as data export declarations:

* checking for missing documentation entries ... WARNING
Undocumented data sets:
  'penguins'

The data/ folder in a source repository may contain:

  • txt and CSV files with raw data
  • R files with R code defining the dataset
  • save() images
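As a rough sketch of how a file in data/ could be mapped to the name that data() accepts, the snippet below strips the recognised extension from a file name. The extension list follows my reading of ?data and R-exts (R code, save() images, and tables with optional compression suffixes) and should be checked against those docs; `exporter_name` is a hypothetical helper, not an existing API.

```python
import re

# Extensions assumed to be recognised by data(): R code, save() images,
# and table files, the latter optionally compressed with gz, bz2 or xz.
_DATA_EXT = re.compile(
    r"\.(?:R|r|RData|rdata|rda|(?:tab|txt|TXT|csv|CSV)(?:\.(?:gz|bz2|xz))?)$"
)


def exporter_name(filename):
    """Return the base name of a data/ file, or None if it is not a data file."""
    stem, n = _DATA_EXT.subn("", filename)
    return stem if n == 1 and stem else None
```

For example, `exporter_name("penguins.rda")` gives `"penguins"`, while a file such as `datalist` has no recognised extension and gives `None`.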

What happens to these data sources depends on whether the LazyData field in DESCRIPTION is set to true:

  • If true, the data is loaded, serialized with compression, and packaged with an index. Both the data and index files are RDS files that are lazy-loaded along with the namespace: https://github.com/r-devel/r-svn/blob/fb6baa45/src/library/base/R/namespace.R#L585-L590.

    When the package is attached, the data objects are directly exported to the search path: https://github.com/r-devel/r-svn/blob/fb6baa45069d41a798901c5d39667f9c1a8e1a87/src/library/base/R/namespace.R#L134-L136. This means they can be referenced after a library() call.

  • If false, the data files are installed as is. Let's call these files "data exporter files". A library(mypkg) call doesn't automatically export the datasets to the search path. Instead, it makes data exporter symbols available in data() calls. The user must pass an exporter to data(), e.g. data(mydata). The exporter then exports one or more datasets to the search path.

    If library(mypkg) has not been called, the exporters are not in scope and the user must explicitly qualify them: data(mydata, package = "mypkg"). I've not checked what happens when exporter symbols conflict but I would guess they are masked depending on the order of the library() calls.

  • A key thing to note is that a single exporter may export multiple dataset symbols. The dataset doc files must state which data exporter they belong to in the \usage section.
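Since the two behaviours hinge on the LazyData field, here is a minimal sketch of reading it from the text of a DESCRIPTION file. It assumes the usual DCF layout (field name at the start of the line, followed by a colon) and that logical fields accept yes/true in any capitalisation, as I understand R-exts to say. `lazy_data_enabled` is a hypothetical name.

```python
def lazy_data_enabled(description_text):
    """Return True if the LazyData field of a DESCRIPTION file is truthy."""
    for line in description_text.splitlines():
        # Skip indented continuation lines and lines without a field separator.
        if line[:1].isspace() or ":" not in line:
            continue
        field, _, value = line.partition(":")
        # DESCRIPTION field names are case-sensitive.
        if field.strip() == "LazyData":
            return value.strip().lower() in {"yes", "true"}
    # Assumption: a missing field means datasets are not lazy-loaded.
    return False
```

A real implementation would probably reuse whatever DCF parsing already exists for the DESCRIPTION exports analysis rather than scanning lines by hand.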

So we'll have to support these two ways of exporting data symbols, depending on the use of lazy-loaded data.

Plan:

  • Detect all \docType{data} markups in man/ and collect the \name{}s. These are the datasets exported by the package.

  • If the LazyData DESCRIPTION field is true, export these objects at top-level.

  • If false, detect base file names in the installed data folder, and export these symbols to a special search path specific to the data() context.

  • When package = "foo" is passed to data(), narrow the special data search path to the installed data exporters of the package foo. Match the exporter to the relevant set of datasets and export those.
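The Rd side of the plan could be sketched as follows. This is only an illustration in Python with regular expressions, not a proper Rd parser: it ignores Rd comments and escapes, and it assumes that non-lazy datasets write their \usage as data(name). All function names are hypothetical.

```python
import re
from pathlib import Path

_DOCTYPE = re.compile(r"\\docType\{\s*data\s*\}")
_NAME = re.compile(r"\\name\{([^}]*)\}")
# Assumption: the exporter is the first argument of a data() call in \usage.
_USAGE = re.compile(r"\\usage\{\s*data\(\s*[\"']?([A-Za-z0-9._]+)[\"']?")


def dataset_name(rd_text):
    """Return the \\name{} of an Rd file marked \\docType{data}, else None."""
    if not _DOCTYPE.search(rd_text):
        return None
    m = _NAME.search(rd_text)
    return m.group(1).strip() if m else None


def usage_exporter(rd_text):
    """Return the exporter named in \\usage{data(...)}, or None if absent."""
    m = _USAGE.search(rd_text)
    return m.group(1) if m else None


def datasets_by_exporter(man_dir):
    """Group documented dataset names by exporter (None when lazy-loaded)."""
    groups = {}
    for rd in sorted(Path(man_dir).glob("*.[Rr]d")):
        text = rd.read_text(encoding="utf-8", errors="replace")
        name = dataset_name(text)
        if name:
            groups.setdefault(usage_exporter(text), []).append(name)
    return groups
```

With LazyData true, every collected name would be exported at top-level. Otherwise the keys of the returned mapping are the symbols to offer inside data(), and the values are the datasets each one brings into scope.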
