Work with Frictionless 'Data Packages' (<https://frictionlessdata.io/specs/data-package/>). The package lets you load and validate any descriptor for a data package profile, create and modify descriptors, and read and stream the data in the package. When a descriptor is a 'Tabular Data Package', it uses the 'Table Schema' package (<https://CRAN.R-project.org/package=tableschema.r>) and exposes its functionality for each resource object in the resources field.
A Data Package consists of:
Metadata that describes the structure and contents of the package.
Resources such as data files that form the contents of the package.
The Data Package metadata is stored in a "descriptor". This descriptor is what makes a collection of data a Data Package. The structure of this descriptor is the main content of the specification below.
In addition to this descriptor a data package will include other resources such as data files. The Data Package specification does NOT impose any requirements on their form or structure and can therefore be used for packaging any kind of data.
The data included in the package may be provided as:
Files bundled locally with the package descriptor.
Remote resources, referenced by URL.
"Inline" data which is included directly in the descriptor.
The jsonlite package is used internally
to convert JSON data to list objects. Function inputs can be JSON strings,
files or lists, and outputs are returned as lists so that the data can be processed further in R
and exported as desired. It is recommended to use helpers.from.json.to.list or
helpers.from.list.to.json to convert JSON objects to lists and vice versa. For more details about
handling JSON, see the jsonlite documentation or vignettes.
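As a minimal sketch of the conversion described above, the following calls jsonlite directly (not the package helpers, whose exact signatures are not shown here) to turn a JSON descriptor string into an R list and back. The descriptor content is a made-up example.

```r
library(jsonlite)

# Hypothetical descriptor, written as a JSON string for illustration.
descriptor_json <- '{
  "name": "example-package",
  "resources": [
    {"name": "data", "path": "data.csv"}
  ]
}'

# JSON -> R list. simplifyVector = FALSE keeps JSON arrays as plain lists
# instead of collapsing them into vectors or data frames.
descriptor <- fromJSON(descriptor_json, simplifyVector = FALSE)

# Work with the descriptor as an ordinary nested list.
descriptor$title <- "Example Package"

# R list -> JSON. auto_unbox = TRUE writes length-one vectors as scalars.
descriptor_out <- toJSON(descriptor, auto_unbox = TRUE, pretty = TRUE)
```

Keeping arrays as lists means each resource stays a separate list element, which matches the list-based outputs described above.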
Several example data packages can be found in the datasets organization on GitHub.
The descriptor is the central file in a Data Package. It provides:
General metadata such as the package's title, license, publisher, etc.
A list of the data "resources" that make up the package including their location on disk or online and other relevant information (including, possibly, schema information about these data resources in a structured form)
A Data Package descriptor MUST be a valid JSON object. (JSON is defined in
RFC 4627). When available as a file it
MUST be named datapackage.json and it MUST be placed in the top-level directory
(relative to any other resources provided as part of the data package).
The descriptor MUST contain a resources property describing the data resources.
All other properties are considered metadata properties. The descriptor MAY contain any number of
other metadata properties. The following sections provide a description of required and optional
metadata properties for a Data Package descriptor.
Adherence to the specification does not imply that additional, non-specified properties cannot be used:
a descriptor MAY include any number of properties in addition to those described as required and optional properties.
For example, if you were storing time series data and wanted to list the temporal coverage of the data in the Data Package
you could add a property temporal (cf Dublin Core):
"temporal": {
  "name": "19th Century",
  "start": "1800-01-01",
  "end": "1899-12-31"
}
This flexibility enables specific communities to extend Data Packages as appropriate for the data they manage. As an example, the Tabular Data Package specification extends Data Package to the case where all the data is tabular and stored in CSV.
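Since descriptors are handled as R lists, adding a non-specified property such as temporal is plain list assignment. A minimal sketch in base R, using a made-up descriptor:

```r
# Hypothetical descriptor held as an R list (names and paths are illustrative).
descriptor <- list(
  name = "historical-series",
  resources = list(list(name = "series", path = "series.csv"))
)

# Add the custom temporal property from the example above.
descriptor$temporal <- list(
  name = "19th Century",
  start = "1800-01-01",
  end = "1899-12-31"
)

# The extra property sits alongside the specified ones.
names(descriptor)
```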
Packaged data resources are described in the resources property of the package descriptor.
This property MUST be an array/list of objects. Each object MUST follow the
Data Resource specification.
See also Resource Class
The resources property is required, with at least one resource.
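The requirement can be checked on a descriptor list with a few lines of base R. This is a sketch only: has_valid_resources is a hypothetical helper, not part of the package API, and it checks only the shape of the property, not each resource against the Data Resource specification.

```r
# Hypothetical helper, not part of the package API.
# A JSON array becomes an unnamed R list, so resources must be an unnamed,
# non-empty list whose elements are themselves lists (JSON objects).
has_valid_resources <- function(descriptor) {
  resources <- descriptor[["resources"]]
  is.list(resources) &&
    is.null(names(resources)) &&
    length(resources) >= 1 &&
    all(vapply(resources, is.list, logical(1)))
}

has_valid_resources(list(resources = list(list(name = "a", path = "a.csv"))))
has_valid_resources(list(name = "no-resources"))
```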
In addition to the required properties, the following properties SHOULD be included
in every package descriptor:
name: A short url-usable (and preferably human-readable) name of the package. This MUST be lower-case and
contain only alphanumeric characters along with ".", "_" or "-" characters. It will function as a unique identifier
and therefore SHOULD be unique in relation to any registry in which this package will be deposited
(and preferably globally unique).
The name SHOULD be invariant, meaning that it SHOULD NOT change when a data package is updated,
unless the new package version should be considered a distinct package, e.g. due to significant changes in structure or
interpretation. Version distinction SHOULD be left to the version property. As a corollary, the name also
SHOULD NOT include an indication of time range covered.
id: A property reserved for globally unique identifiers. Examples of identifiers that are unique include UUIDs and DOIs.
A common usage pattern for Data Packages is as a packaging format within the bounds of a system or platform.
In these cases, a unique identifier for a package is desired for common data handling workflows, such as updating
an existing package. While at the level of the specification, global uniqueness cannot be validated, consumers using
the id property MUST ensure identifiers are globally unique.
Examples:
{"id": "b03ec84-77fd-4270-813b-0c698943f7ce"}
{"id": "https://doi.org/10.1594/PANGAEA.726855"}
licenses: The license(s) under which the package is provided.
This property is not legally binding and does not guarantee the package is licensed
under the terms defined in this property.
licenses MUST be an array. Each item in the array is a License.
Each MUST be an object. The object MUST contain a name property
and/or a path property. It MAY contain a title property.
Here is an example:
"licenses": [{
  "name": "ODC-PDDL-1.0",
  "path": "http://opendatacommons.org/licenses/pddl/",
  "title": "Open Data Commons Public Domain Dedication and License v1.0"
}]
name: The name MUST be an Open Definition license ID.
path: A url-or-path string,
that is a fully qualified HTTP address, or a relative POSIX path (see
the url-or-path definition in Data Resource for details).
title: A human-readable title.
profile: A string identifying the profile of this descriptor as per the profiles specification.
{"profile": "tabular-data-package"}
{"profile": "http://example.com/my-profiles-json-schema.json"}
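The rules for name and licenses above translate into short checks on a descriptor list. The following base R sketch uses hypothetical helper names (is_valid_name, is_valid_licenses) that are not part of the package API. It checks only the shape of the values and does not verify that a license name is an Open Definition license ID.

```r
# Hypothetical helpers for illustration; not part of the package API.

# name: lower-case alphanumeric characters plus ".", "_" or "-".
is_valid_name <- function(name) {
  is.character(name) && length(name) == 1 &&
    grepl("^[a-z0-9._-]+$", name)
}

# licenses: an array (unnamed list) of objects, each with a name and/or a path.
is_valid_licenses <- function(licenses) {
  is.list(licenses) && is.null(names(licenses)) &&
    all(vapply(
      licenses,
      function(l) {
        is.list(l) && (!is.null(l[["name"]]) || !is.null(l[["path"]]))
      },
      logical(1)
    ))
}

is_valid_name("gdp-data_2.0")
is_valid_name("GDP Data")
is_valid_licenses(list(list(name = "ODC-PDDL-1.0")))
```

The helpers use `[[` and not `$` because `$` on lists allows partial name matching, which could accept a misspelled property.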
The following are commonly used properties that the package descriptor MAY contain:
title: A string providing a title or one sentence description for this package.
description: A description of the package. The description MUST be markdown
formatted -- this also allows for simple plain text as plain text is itself valid markdown. The first paragraph
(up to the first double line break) should be usable as summary information for the package.
homepage: A URL for the home on the web that is related to this data package.
version: A version string identifying the version of the package. It should conform to the Semantic Versioning requirements and should follow the Data Package Version pattern.
sources: The raw sources for this data package. It MUST be an array of Source objects. Each Source object MUST have a title and MAY have path and/or email properties. Example:
"sources": [{
  "title": "World Bank and OECD",
  "path": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD"
}]
title: Title of the source (e.g. document or organization name).
path: A url-or-path string,
that is a fully qualified HTTP address, or a relative POSIX path (see
the url-or-path definition in Data Resource for details).
email: An email address.
contributors: The people or organizations who contributed to this Data Package. It MUST be an array.
Each entry is a Contributor and MUST be an object.
A Contributor MUST have a title property and MAY contain path,
email, role and organization properties. Example:
"contributors": [{
  "title": "Joe Bloggs",
  "email": "joe@bloggs.com",
  "path": "http://www.bloggs.com",
  "role": "author"
}]
title: Name/Title of the contributor (name for person, name/title of organization).
path: A fully qualified http URL pointing to a relevant location online for the contributor.
email: An email address.
role: A string describing the role of the contributor. It MUST be one of: author,
publisher, maintainer, wrangler, and contributor. Defaults to contributor.
Note on semantics: use of the "author" role does not imply that that person was the original creator of
the data in the data package - merely that they created and/or maintain the data package. It is common for data
packages to "package" up data from elsewhere. The origin of the data can be indicated with the sources
property - see above.
organization: A string describing the organization this contributor is affiliated with.
image: An image to use for this data package. For example, when showing the package in a listing.
The value of the image property MUST be a string pointing to the location of the image.
The string must be a url-or-path,
that is a fully qualified HTTP address, or a relative POSIX path (see
the url-or-path definition in Data Resource for details).
created: The datetime on which this was created.
Note: semantics may vary between publishers -- for some this is the datetime the data was created,
for others the datetime the package was created.
The datetime must conform to the string formats for datetime as described in
RFC3339.
Example:
{"created": "1985-04-12T23:20:50.52Z"}
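Two of the optional properties above carry rules that are easy to apply in code: the contributor role has a fixed set of values with a default, and created must be an RFC 3339 datetime. A minimal base R sketch follows. The helper names are hypothetical and not part of the package API.

```r
# Hypothetical helpers for illustration; not part of the package API.
allowed_roles <- c("author", "publisher", "maintainer", "wrangler", "contributor")

# Require a title, fill in the default role, and reject unknown roles.
normalize_contributor <- function(contributor) {
  if (is.null(contributor[["title"]])) {
    stop("a contributor MUST have a title")
  }
  if (is.null(contributor[["role"]])) {
    contributor[["role"]] <- "contributor"
  }
  if (!(contributor[["role"]] %in% allowed_roles)) {
    stop("unknown contributor role: ", contributor[["role"]])
  }
  contributor
}

# Current time as an RFC 3339 UTC timestamp for the created property.
created_now <- function() {
  format(Sys.time(), "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
}

normalize_contributor(list(title = "Joe Bloggs"))$role
created_now()
```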
The term array refers to JSON arrays, which become list objects when converted in R.
The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT,
SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL
in this package's documentation are to be interpreted as described in RFC 2119.