Title: | Tools to Create, Use, and Convert ecocomDP Data |
---|---|
Description: | Work with the Ecological Community Data Design Pattern. 'ecocomDP' is a flexible data model for harmonizing ecological community surveys, in a research question agnostic format, from source data published across repositories, and with methods that keep the derived data up-to-date as the underlying sources change. Described in O'Brien et al. (2021), <doi:10.1016/j.ecoinf.2021.101374>. |
Authors: | Colin Smith [aut, cre, cph], Eric Sokol [aut], Margaret O'Brien [aut], Matt Bitters [ctb], Melissa Chen [ctb], Savannah Gonzales [ctb], Matt Helmus [ctb], Brendan Hobart [ctb], Ruvi Jaimes [ctb], Lara Janson [ctb], Marta Jarzyna [ctb], Michael Just [ctb], Daijiang Li [ctb], Wynne Moss [ctb], Kari Norman [ctb], Stephanie Parker [ctb], Rafael Rangel [ctb], Natalie Robinson [ctb], Thilina Surasinghe [ctb], Kyle Zollo-Venecek [ctb] |
Maintainer: | Colin Smith <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.3.2 |
Built: | 2024-11-17 05:42:20 UTC |
Source: | https://github.com/ediorg/ecocomdp |
View the collection of dataset- and attribute-level annotations from existing ecocomDP datasets.
annotation_dictionary()
Use the search field to find the annotation terms and URIs.
## Not run:
View(annotation_dictionary())
## End(Not run)
A fully joined and flat version of EDI data package knb-lter-hfr.118.33 (Ant Assemblages in Hemlock Removal Experiment at Harvard Forest since 2003) with all relevant ecocomDP L1 identifiers and content added. Use this dataset as an input to the L0_flat argument of the "create" functions.
ants_L0_flat
A data frame with 2931 rows and 45 variables:
dates
block
plot number
treatment type
location of grid with respect to moose exclosure
trap type
applies only to pitfall cups
ant subfamily
head length. We used trait definitions from Del Toro et al. (2015) and filled in missing species' data with information from Ellison et al.
eye length relative to body size
femur length relative to body size
size of colony for each species
feeding preference for each species
nest substrate
primary habitat
secondary habitat associations
whether or not a seed dispersing species
whether or not a slavemaking species
classifications based on behavioral interactions with other ants
biogeographic affinity based on available occurrence records
where trait information was found. Full citations for literature are as follows: Del Toro, I., R.R. Silva, and A.M. Ellison. 2015. Predicted impacts of climatic change on ant functional diversity and distributions in eastern North American forests. Diversity and Distributions 21:781-791; Ellison, A.M., N.J. Gotelli, G. Alpert, and E.J. Farnsworth. 2012. A field guide to the ants of New England. Yale University Press, New Haven, Connecticut, USA.
units for "hl" variable
units for "rel" variable
units for "rll" variable
variables of the primary observation table
values of variable_name
units of variable_name
the observation id
the location id
the event id
approximate latitude of study area
approximate longitude of study area
approximate elevation of study area
name of organism
the taxon id
the taxon rank
the authority system taxon_name was resolved to
the id of taxon_name in authority_system
the identifier of this ecocomDP dataset
the identifier of the source dataset
number of years the survey has been ongoing
number of years during the survey that samples were taken
the standard deviation between surveys in years
number of unique taxa in this dataset
the study area in meters squared
https://portal.edirepository.org/nis/mapbrowse?scope=knb-lter-hfr&identifier=118&revision=33
The ecocomDP (L1) formatted version of EDI data package knb-lter-hfr.118.33 (Ant Assemblages in Hemlock Removal Experiment at Harvard Forest since 2003) read from the EDI API with read_data(id = "edi.193.5"). Use this dataset as an input to data "use" functions.
ants_L1
A list of:
The dataset identifier
See source url for metadata
A list of data frames, each an ecocomDP table
Is NULL because there are no validation issues for this dataset
https://portal.edirepository.org/nis/mapbrowse?scope=edi&identifier=193&revision=5
Calculate geo_extent_bounding_box_m2 for the dataset_summary table
calc_geo_extent_bounding_box_m2(west, east, north, south)
west |
(numeric) West longitude in decimal degrees and negative if west of the prime meridian. |
east |
(numeric) East longitude in decimal degrees and negative if west of the prime meridian. |
north |
(numeric) North latitude in decimal degrees and negative if south of the equator. |
south |
(numeric) South latitude in decimal degrees and negative if south of the equator. |
(numeric) Area of study site in meters squared.
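As a rough illustration of what a bounding-box area in square meters involves, the sketch below computes the area of a latitude/longitude rectangle on a sphere. The helper name bbox_area_m2 and the mean Earth radius are assumptions made for this example. It is not the package's implementation, which may use a different geodesic method and return somewhat different values.

```r
# Hypothetical helper (not part of ecocomDP): spherical area of a lat/lon box.
bbox_area_m2 <- function(west, east, north, south, radius_m = 6371008.8) {
  to_rad <- pi / 180
  # Width of the box in radians of longitude
  dlon <- (east - west) * to_rad
  # Area of a spherical rectangle: R^2 * dlon * (sin(north) - sin(south))
  radius_m^2 * dlon * (sin(north * to_rad) - sin(south * to_rad))
}

# A 2 x 2 degree box centered on the equator and the prime meridian
area <- bbox_area_m2(west = -1, east = 1, north = 1, south = -1)
print(area)
```

Arguments follow the sign convention described above: longitudes west of the prime meridian and latitudes south of the equator are negative.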
Calculate length_of_survey_years for the dataset_summary table
calc_length_of_survey_years(dates)
dates |
(Date) Dates from the L0 source dataset encompassing the entire study duration. |
(numeric) Number of years the study has been ongoing.
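One plausible reading of "number of years the study has been ongoing" is the span between the earliest and latest date. The sketch below shows that calculation in base R. The helper name survey_length_years and the 365.25-day year are assumptions for illustration, and the package function may round or count years differently.

```r
# Hypothetical helper (not part of ecocomDP): span between first and last date.
survey_length_years <- function(dates) {
  span_days <- as.numeric(difftime(max(dates), min(dates), units = "days"))
  span_days / 365.25  # assumed average year length in days
}

# Toy dates spanning two calendar years
dates <- as.Date(c("2003-06-01", "2004-07-15", "2005-06-01"))
len <- survey_length_years(dates)
print(round(len, 2))
```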
Calculate number_of_years_sampled for the dataset_summary table
calc_number_of_years_sampled(dates)
dates |
(Date) Dates from the L0 source dataset encompassing the entire study duration. |
(numeric) Number of survey years in which a sample was taken.
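The idea of counting survey years in which a sample was taken can be sketched as counting distinct calendar years among the dates. The helper name sampled_years is hypothetical, and this is only an assumed interpretation of the calculation, not the package's code.

```r
# Hypothetical helper (not part of ecocomDP): count distinct calendar years.
sampled_years <- function(dates) {
  length(unique(format(dates, "%Y")))
}

# Two samples in 2003, none in 2004, one in 2005
dates <- as.Date(c("2003-06-01", "2003-08-10", "2005-07-04"))
n_years <- sampled_years(dates)
print(n_years)
```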
Calculate std_dev_interval_betw_years for the dataset_summary table
calc_std_dev_interval_betw_years(dates)
dates |
(Date) Dates from the L0 source dataset encompassing the entire study duration. |
(numeric) The standard deviation between sampling events (in years).
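A minimal sketch of a standard deviation of intervals between sampling events, expressed in years, follows. It assumes that each unique date is one sampling event and that a year is 365.25 days; the helper name interval_sd_years is hypothetical, and the package function may define events or intervals differently.

```r
# Hypothetical helper (not part of ecocomDP): SD of gaps between unique dates.
interval_sd_years <- function(dates) {
  events <- sort(unique(dates))
  # Gaps between consecutive sampling events, in days
  gaps_days <- as.numeric(diff(events), units = "days")
  # Convert to years (assumed 365.25 days) and take the standard deviation
  sd(gaps_days / 365.25)
}

# Toy dates with one gap of about 1 year and one of about 2 years
dates <- as.Date(c("2003-01-01", "2004-01-01", "2006-01-01"))
sd_years <- interval_sd_years(dates)
print(sd_years)
```

With fewer than three unique dates there is at most one gap, so sd() returns NA in this sketch.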
Convert a dataset to the Darwin Core Archive format
convert_to_dwca( path, core_name, source_id, derived_id, url = NULL, user_id, user_domain )
path |
(character) Path to which the DwC-A data objects and EML will be written. |
core_name |
(character) The central table of the DwC-A dataset being created. Can be: "event" (event core). Occurrence core is not yet supported. |
source_id |
(character) Identifier of an ecocomDP dataset published in a supported repository. Currently, the EDI Data Repository is supported. |
derived_id |
(character) Identifier of the DwC-A dataset being created. |
url |
(character) URL to the publicly accessible directory containing DwC-A data objects. This argument supports direct download of the data entities by a data repository and is used for automated revisioning and publication. |
user_id |
(character) Identifier of user account associated with the data repository in which this ecocomDP dataset will be archived. Only |
user_domain |
(character) Domain (data repository) the |
Reads in an ecocomDP dataset from a supported repository and converts it to a DwC-A package.
DwC-A tables, meta.xml, and corresponding EML metadata.
## Not run:
# Create directory for DwC-A outputs
mypath <- paste0(tempdir(), "/data")
dir.create(mypath)

# Convert an EDI published ecocomDP dataset to a DwC-A
convert_to_dwca(
  path = mypath,
  core_name = "event",
  source_id = "edi.193.5",
  derived_id = "edi.834.2",
  user_id = "ecocomdp",
  user_domain = "EDI")

dir(mypath)

# Clean up
unlink(mypath, recursive = TRUE)
## End(Not run)
Create the dataset_summary table
create_dataset_summary( L0_flat, package_id, original_package_id = NULL, length_of_survey_years, number_of_years_sampled, std_dev_interval_betw_years, max_num_taxa, geo_extent_bounding_box_m2 = NULL )
L0_flat |
(tbl_df, tbl, data.frame) The fully joined source L0 dataset, in "flat" format (see details). |
package_id |
(character) Column in |
original_package_id |
(character) An optional column in |
length_of_survey_years |
(character) Column in |
number_of_years_sampled |
(character) Column in |
std_dev_interval_betw_years |
(character) Column in |
max_num_taxa |
(character) Column in |
geo_extent_bounding_box_m2 |
(character) An optional column in |
This function collects specified columns from L0_flat and returns distinct rows.
"flat" format refers to the fully joined source L0 dataset in "wide" form with the exception of the core observation variables, which are in "long" form (i.e. using the variable_name, value, unit columns of the observation table). This "flat" format is the "widest" an L1 ecocomDP dataset can be consistently spread due to the frequent occurrence of L0 source datasets with > 1 core observation variable.
(tbl_df, tbl, data.frame) The dataset_summary table.
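The "collect specified columns and return distinct rows" step can be pictured with base R as below. This is a sketch on a toy data frame with made-up values and a hypothetical package identifier, not the function's actual code.

```r
# Toy "flat" table: dataset-level columns repeat on every observation row.
flat <- data.frame(
  package_id = c("pkg.1.1", "pkg.1.1", "pkg.1.1"),   # hypothetical identifier
  max_num_taxa = c(3, 3, 3),                         # made-up value
  taxon_name = c("taxon a", "taxon b", "taxon c"),
  stringsAsFactors = FALSE
)

# Keep only the dataset-level columns, then drop repeated rows
summary_cols <- c("package_id", "max_num_taxa")
dataset_summary <- unique(flat[, summary_cols, drop = FALSE])
print(dataset_summary)
```

Because dataset-level values are constant across rows of the flat table, the result collapses to a single row per dataset.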
flat <- ants_L0_flat

dataset_summary <- create_dataset_summary(
  L0_flat = flat,
  package_id = "package_id",
  original_package_id = "original_package_id",
  length_of_survey_years = "length_of_survey_years",
  number_of_years_sampled = "number_of_years_sampled",
  std_dev_interval_betw_years = "std_dev_interval_betw_years",
  max_num_taxa = "max_num_taxa",
  geo_extent_bounding_box_m2 = "geo_extent_bounding_box_m2")

dataset_summary
Create EML metadata
create_eml( path, source_id, derived_id, script, script_description, is_about = NULL, contact, user_id, user_domain, basis_of_record = NULL, url = NULL )
path |
(character) Path to the directory containing ecocomDP tables, conversion script, and where EML metadata will be written. |
source_id |
(character) Identifier of a data package published in a supported repository. Currently, the EDI Data Repository is supported. |
derived_id |
(character) Identifier of the dataset being created. |
script |
(character) Name of file used to convert |
script_description |
(character) Description of |
is_about |
(named character) An optional argument for specifying dataset level annotations describing what this dataset "is about". |
contact |
(data.frame) Contact information for the person that created this ecocomDP dataset, containing these columns:
|
user_id |
(character) Identifier of user associated with |
user_domain |
(character) Domain (data repository) the |
basis_of_record |
(character) An optional argument to facilitate creation of a Darwin Core record from this dataset using |
url |
(character) URL to the publicly accessible directory containing ecocomDP tables, conversion script, and EML metadata. This argument supports direct download of the data entities by a data repository and is used for automated revisioning and publication. |
This function creates an EML record for an ecocomDP by combining metadata from source_id with boilerplate metadata describing the ecocomDP model. Changes to the source_id EML include:
<access> Adds user_id to the list of principals granted read and write access to the ecocomDP data package this EML describes.
<title> Adds a note that this is a derived data package in the ecocomDP format.
<pubDate> Adds the date this EML was created.
<abstract> Adds a note that this is a derived data package in the ecocomDP format.
<keywordSet> Adds the "ecocomDP" keyword to enable search and discovery of all ecocomDP data packages in the data repository in which it is published, and 7 terms from the LTER Controlled Vocabulary: "communities", "community composition", "community dynamics", "community patterns", "species composition", "species diversity", and "species richness". Darwin Core terms supplied under basis_of_record are listed and used by convert_to_dwca() to create a Darwin Core Archive of this ecocomDP data package.
<intellectualRights> Keeps intact the original intellectual rights license source_id was released under, or uses CC0 if missing.
<taxonomicCoverage> Appends to the taxonomic coverage element with data supplied in the ecocomDP taxon table.
<contact> Adds the ecocomDP creator as a point of contact.
<methodStep> Adds a note that this data package was created by the script, and adds provenance metadata noting that this is a derived dataset and describing where the source_id can be accessed.
<dataTables> Replaces the source_id table metadata with descriptions of the ecocomDP tables.
<otherEntity> Adds script and script_description. otherEntities of source_id are removed.
<annotations> Adds boilerplate annotations describing the ecocomDP at the dataset, entity, and entity attribute levels.
Taxa listed in the taxon table, and resolved to one of the supported authority systems (i.e. ITIS, WORMS, or GBIF), will have their full taxonomic hierarchy expanded, including any common names for each level.
An EML metadata file.
## Not run:
# Create directory with ecocomDP tables for create_eml()
mypath <- paste0(tempdir(), "/data")
dir.create(mypath)
inpts <- c(ants_L1$tables, path = mypath)
do.call(write_tables, inpts)
file.copy(system.file("extdata", "create_ecocomDP.R", package = "ecocomDP"), mypath)
dir(mypath)

# Describe, with annotations, what the source L0 dataset "is about"
dataset_annotations <- c(
  `species abundance` = "http://purl.dataone.org/odo/ECSO_00001688",
  Population = "http://purl.dataone.org/odo/ECSO_00000311",
  `level of ecological disturbance` = "http://purl.dataone.org/odo/ECSO_00002588",
  `type of ecological disturbance` = "http://purl.dataone.org/odo/ECSO_00002589")

# Add self as contact information in case questions arise
additional_contact <- data.frame(
  givenName = 'Colin',
  surName = 'Smith',
  organizationName = 'Environmental Data Initiative',
  electronicMailAddress = '[email protected]',
  stringsAsFactors = FALSE)

# Create EML
eml <- create_eml(
  path = mypath,
  source_id = "knb-lter-hfr.118.33",
  derived_id = "edi.193.5",
  is_about = dataset_annotations,
  script = "create_ecocomDP.R",
  script_description = "A function for converting knb-lter-hfr.118 to ecocomDP",
  contact = additional_contact,
  user_id = 'ecocomdp',
  user_domain = 'EDI',
  basis_of_record = "HumanObservation")

dir(mypath)
View(eml)

# Clean up
unlink(mypath, recursive = TRUE)
## End(Not run)
Create the location table
create_location( L0_flat, location_id, location_name, latitude = NULL, longitude = NULL, elevation = NULL )
L0_flat |
(tbl_df, tbl, data.frame) The fully joined source L0 dataset, in "flat" format (see details). |
location_id |
(character) Column in |
location_name |
(character) One or more columns in |
latitude |
(character) An optional column in |
longitude |
(character) An optional column in |
elevation |
(character) An optional column in |
This function collects specified columns from L0_flat, creates data frames for each location_name, assigns latitude, longitude, and elevation to the lowest nesting level (i.e. the observation level) returning NA for higher levels (these will have to be filled manually afterwards), and determines the relationships between location_id and parent_location_id from L0_flat and location_name.
To prevent the listing of duplicate location_name values, and to enable the return of location_name columns by flatten_data(), location_name values are prefixed with the name of the column they came from according to: paste0(<column name>, "__", <column value>). Example: A column named "plot" with values "1", "2", "3" in L0_flat would otherwise be listed in the resulting location table under the location_name column as "1", "2", "3", leaving no way to discern that these values correspond with "plot". Applying the above solution returns "plot__1", "plot__2", "plot__3" in the location table, and flatten_data() returns the column "plot" with values c("1", "2", "3").
"flat" format refers to the fully joined source L0 dataset in "wide" form with the exception of the core observation variables, which are in "long" form (i.e. using the variable_name, value, unit columns of the observation table). This "flat" format is the "widest" an L1 ecocomDP dataset can be consistently spread due to the frequent occurrence of L0 source datasets with > 1 core observation variable.
Additionally, latitude, longitude, and elevation of sites nested above the observation level will have to be manually added after the location table is returned.
(tbl_df, tbl, data.frame) The location table.
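The naming scheme described in the details, paste0(<column name>, "__", <column value>), can be tried out in base R. The sketch below builds the combined names for a toy "plot" column and then splits them apart again, which is the kind of round trip that lets the original column be recovered later. The splitting code is an illustration only and is not taken from flatten_data().

```r
# Toy source column
flat <- data.frame(plot = c("1", "2", "3"), stringsAsFactors = FALSE)

# Combine the column name and each value with a double underscore
plot_names <- paste0("plot", "__", flat$plot)
print(plot_names)

# Split the combined names back into column name and value
parts <- strsplit(plot_names, "__", fixed = TRUE)
col_name <- vapply(parts, `[`, character(1), 1)
col_value <- vapply(parts, `[`, character(1), 2)
print(unique(col_name))
print(col_value)
```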
flat <- ants_L0_flat

location <- create_location(
  L0_flat = flat,
  location_id = "location_id",
  location_name = c("block", "plot"),
  latitude = "latitude",
  longitude = "longitude",
  elevation = "elevation")

location
Create the location_ancillary table
create_location_ancillary( L0_flat, location_id, datetime = NULL, variable_name, unit = NULL )
L0_flat |
(tbl_df, tbl, data.frame) The fully joined source L0 dataset, in "flat" format (see details). |
location_id |
(character) Column in |
datetime |
(character) An optional column in |
variable_name |
(character) Columns in |
unit |
(character) An optional column in |
This function collects specified columns from L0_flat, converts into long (attribute-value) form by gathering variable_name. Regular expression matching joins unit to any associated variable_name and is listed in the resulting table's "unit" column.
"flat" format refers to the fully joined source L0 dataset in "wide" form with the exception of the core observation variables, which are in "long" form (i.e. using the variable_name, value, unit columns of the observation table). This "flat" format is the "widest" an L1 ecocomDP dataset can be consistently spread due to the frequent occurrence of L0 source datasets with > 1 core observation variable.
(tbl_df, tbl, data.frame) The location_ancillary table.
flat <- ants_L0_flat

location_ancillary <- create_location_ancillary(
  L0_flat = flat,
  location_id = "location_id",
  variable_name = "treatment")

location_ancillary
Create the observation table
create_observation( L0_flat, observation_id, event_id = NULL, package_id, location_id, datetime, taxon_id, variable_name, value, unit = NULL )
L0_flat |
(tbl_df, tbl, data.frame) The fully joined source L0 dataset, in "flat" format (see details). |
observation_id |
(character) Column in |
event_id |
(character) An optional column in |
package_id |
(character) Column in |
location_id |
(character) Column in |
datetime |
(character) Column in |
taxon_id |
(character) Column in |
variable_name |
(character) Column in |
value |
(character) Column in |
unit |
(character) An optional column in |
This function collects specified columns from L0_flat and returns distinct rows.
"flat" format refers to the fully joined source L0 dataset in "wide" form with the exception of the core observation variables, which are in "long" form (i.e. using the variable_name, value, unit columns of the observation table). This "flat" format is the "widest" an L1 ecocomDP dataset can be consistently spread due to the frequent occurrence of L0 source datasets with > 1 core observation variable.
(tbl_df, tbl, data.frame) The observation table.
flat <- ants_L0_flat

observation <- create_observation(
  L0_flat = flat,
  observation_id = "observation_id",
  event_id = "event_id",
  package_id = "package_id",
  location_id = "location_id",
  datetime = "datetime",
  taxon_id = "taxon_id",
  variable_name = "variable_name",
  value = "value",
  unit = "unit")

observation
Create the observation_ancillary table
create_observation_ancillary( L0_flat, observation_id, variable_name, unit = NULL )
L0_flat |
(tbl_df, tbl, data.frame) The fully joined source L0 dataset, in "flat" format (see details). |
observation_id |
(character) Column in |
variable_name |
(character) Columns in |
unit |
(character) An optional column in |
This function collects specified columns from L0_flat, converts into long (attribute-value) form by gathering variable_name. Regular expression matching joins unit to any associated variable_name and is listed in the resulting table's "unit" column.
"flat" format refers to the fully joined source L0 dataset in "wide" form with the exception of the core observation variables, which are in "long" form (i.e. using the variable_name, value, unit columns of the observation table). This "flat" format is the "widest" an L1 ecocomDP dataset can be consistently spread due to the frequent occurrence of L0 source datasets with > 1 core observation variable.
(tbl_df, tbl, data.frame) The observation_ancillary table.
flat <- ants_L0_flat

observation_ancillary <- create_observation_ancillary(
  L0_flat = flat,
  observation_id = "observation_id",
  variable_name = c("trap.type", "trap.num", "moose.cage"))

observation_ancillary
Create the taxon table
create_taxon( L0_flat, taxon_id, taxon_rank = NULL, taxon_name, authority_system = NULL, authority_taxon_id = NULL )
L0_flat |
(tbl_df, tbl, data.frame) The fully joined source L0 dataset, in "flat" format (see details). |
taxon_id |
(character) Column in |
taxon_rank |
(character) An optional column in |
taxon_name |
(character) Column in |
authority_system |
(character) An optional column in |
authority_taxon_id |
(character) An optional column in |
This function collects specified columns from L0_flat and returns distinct rows.
Taxa listed in the taxon table, and resolved to one of the supported authority systems (i.e. ITIS, WORMS, or GBIF), will have their full taxonomic hierarchy expanded, including any common names for each level.
"flat" format refers to the fully joined source L0 dataset in "wide" form with the exception of the core observation variables, which are in "long" form (i.e. using the variable_name, value, unit columns of the observation table). This "flat" format is the "widest" an L1 ecocomDP dataset can be consistently spread due to the frequent occurrence of L0 source datasets with > 1 core observation variable.
(tbl_df, tbl, data.frame) The taxon table.
flat <- ants_L0_flat

taxon <- create_taxon(
  L0_flat = flat,
  taxon_id = "taxon_id",
  taxon_rank = "taxon_rank",
  taxon_name = "taxon_name",
  authority_system = "authority_system",
  authority_taxon_id = "authority_taxon_id")

taxon
Create the taxon_ancillary table
create_taxon_ancillary( L0_flat, taxon_id, datetime = NULL, variable_name, unit = NULL, author = NULL )
L0_flat |
(tbl_df, tbl, data.frame) The fully joined source L0 dataset, in "flat" format (see details). |
taxon_id |
(character) Column in |
datetime |
(character) An optional column in |
variable_name |
(character) Columns in |
unit |
(character) An optional column in |
author |
(character) An optional column in |
This function collects specified columns from L0_flat, converts into long (attribute-value) form by gathering variable_name. Regular expression matching joins unit to any associated variable_name and is listed in the resulting table's "unit" column.
"flat" format refers to the fully joined source L0 dataset in "wide" form with the exception of the core observation variables, which are in "long" form (i.e. using the variable_name, value, unit columns of the observation table). This "flat" format is the "widest" an L1 ecocomDP dataset can be consistently spread due to the frequent occurrence of L0 source datasets with > 1 core observation variable.
(tbl_df, tbl, data.frame) The taxon_ancillary table.
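To make the gathering step and the unit matching concrete, here is a small base R sketch on toy data. It assumes unit columns are named "unit_<variable>", as in the example for this function; the exact regular expression used by the package is not shown in this documentation, so the pattern below is only an assumption, and the values are made up.

```r
# Toy wide table with one measured trait, one categorical trait, one unit column
flat <- data.frame(
  taxon_id = c("t1", "t2"),
  hl = c(1.2, 0.9),                                # made-up values
  subfamily = c("Formicinae", "Myrmicinae"),
  unit_hl = c("millimeter", "millimeter"),
  stringsAsFactors = FALSE
)
variables <- c("hl", "subfamily")
unit_cols <- c("unit_hl")

# Gather each variable into long form and attach its unit column, if one matches
long <- do.call(rbind, lapply(variables, function(v) {
  # Assumed convention: the unit column name ends with "_<variable name>"
  hit <- unit_cols[grepl(paste0("_", v, "$"), unit_cols)]
  data.frame(
    taxon_id = flat$taxon_id,
    variable_name = v,
    value = as.character(flat[[v]]),
    unit = if (length(hit) == 1) flat[[hit]] else NA_character_,
    stringsAsFactors = FALSE
  )
}))
print(long)
```

Variables without a matching unit column, such as "subfamily" here, get NA in the "unit" column.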
flat <- ants_L0_flat

taxon_ancillary <- create_taxon_ancillary(
  L0_flat = flat,
  taxon_id = "taxon_id",
  variable_name = c(
    "subfamily", "hl", "rel", "rll", "colony.size",
    "feeding.preference", "nest.substrate", "primary.habitat",
    "secondary.habitat", "seed.disperser", "slavemaker.sp",
    "behavior", "biogeographic.affinity", "source"),
  unit = c("unit_hl", "unit_rel", "unit_rll"))

taxon_ancillary
Create the variable_mapping table
create_variable_mapping( observation, observation_ancillary = NULL, location_ancillary = NULL, taxon_ancillary = NULL )
observation |
(tbl_df, tbl, data.frame) The observation table. |
observation_ancillary |
(tbl_df, tbl, data.frame) The optional observation_ancillary table. |
location_ancillary |
(tbl_df, tbl, data.frame) The optional location_ancillary table. |
taxon_ancillary |
(tbl_df, tbl, data.frame) The optional taxon_ancillary table. |
This function collects the specified data tables, extracts unique variable_name values from each, and converts them into long (attribute-value) form, writing the table name and variable_name values to the resulting table's "table_name" and "variable_name" columns, respectively. The resulting table's "mapped_system", "mapped_id", and "mapped_label" columns are filled with NA and are to be filled in manually.
(tbl_df, tbl, data.frame) The variable_mapping table.
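The shape of the resulting table can be sketched in base R as follows. The toy input tables and their variable names are made up for illustration, and the code is a simplified stand-in for the function rather than its implementation.

```r
# Toy input tables, each with a variable_name column
observation <- data.frame(
  variable_name = c("abundance", "abundance"),
  stringsAsFactors = FALSE
)
taxon_ancillary <- data.frame(
  variable_name = c("hl", "subfamily", "hl"),
  stringsAsFactors = FALSE
)
tables <- list(observation = observation, taxon_ancillary = taxon_ancillary)

# One row per unique (table, variable) pair, with mapping columns left as NA
variable_mapping <- do.call(rbind, lapply(names(tables), function(tbl) {
  data.frame(
    table_name = tbl,
    variable_name = unique(tables[[tbl]]$variable_name),
    mapped_system = NA_character_,
    mapped_id = NA_character_,
    mapped_label = NA_character_,
    stringsAsFactors = FALSE
  )
}))
print(variable_mapping)
```

The NA columns are the ones a data curator would then fill in by hand with the vocabulary, identifier, and label each variable maps to.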
flat <- ants_L0_flat

# Create inputs to create_variable_mapping()
observation <- create_observation(
  L0_flat = flat,
  observation_id = "observation_id",
  event_id = "event_id",
  package_id = "package_id",
  location_id = "location_id",
  datetime = "datetime",
  taxon_id = "taxon_id",
  variable_name = "variable_name",
  value = "value",
  unit = "unit")

observation_ancillary <- create_observation_ancillary(
  L0_flat = flat,
  observation_id = "observation_id",
  variable_name = c("trap.type", "trap.num", "moose.cage"))

location_ancillary <- create_location_ancillary(
  L0_flat = flat,
  location_id = "location_id",
  variable_name = "treatment")

taxon_ancillary <- create_taxon_ancillary(
  L0_flat = flat,
  taxon_id = "taxon_id",
  variable_name = c(
    "subfamily", "hl", "rel", "rll", "colony.size",
    "feeding.preference", "nest.substrate", "primary.habitat",
    "secondary.habitat", "seed.disperser", "slavemaker.sp",
    "behavior", "biogeographic.affinity", "source"),
  unit = c("unit_hl", "unit_rel", "unit_rll"))

# Create variable_mapping table
variable_mapping <- create_variable_mapping(
  observation = observation,
  observation_ancillary = observation_ancillary,
  location_ancillary = location_ancillary,
  taxon_ancillary = taxon_ancillary)

variable_mapping
Flatten a dataset
flatten_data(data)
data |
(list) The dataset object returned by |
The "flat" format refers to the fully joined source L0 dataset in "wide" form with the exception of the core observation variables, which are in "long" form (i.e. using the variable_name, value, unit columns of the observation table). This "flat" format is the "widest" an L1 ecocomDP dataset can be consistently spread due to the frequent occurrence of L0 source datasets with > 1 core observation variable.
(tbl_df, tbl, data.frame) A single flat table created by joining and spreading all tables, except the observation table. See details for more information on this "flat" format.
Warnings/Errors from flatten_data() can most often be fixed by addressing any validation issues reported by read_data() (e.g. non-unique composite keys).
Ancillary identifiers are dropped from the returned object.
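The "flat" format described above can be pictured with a small base R sketch. The three toy tables and their values are hypothetical, and merge() only stands in for whatever joins flatten_data() performs internally; the column names variable_name, value, unit, location_id and taxon_id are the ones used in this documentation.

```r
# Hypothetical miniature tables; all values are invented for illustration.
observation <- data.frame(
  observation_id = c("o1", "o2", "o3"),
  location_id    = c("l1", "l1", "l2"),
  taxon_id       = c("t1", "t2", "t1"),
  variable_name  = c("abundance", "abundance", "abundance"),
  value          = c(3, 1, 7),
  unit           = c("number", "number", "number"),
  stringsAsFactors = FALSE
)
location <- data.frame(
  location_id   = c("l1", "l2"),
  location_name = c("plot 1", "plot 2"),
  stringsAsFactors = FALSE
)
taxon <- data.frame(
  taxon_id   = c("t1", "t2"),
  taxon_name = c("Taxon A", "Taxon B"),
  stringsAsFactors = FALSE
)

# Join the descriptive tables onto the observation table. The core
# observation variables stay in long form (variable_name, value, unit).
flat <- merge(observation, location, by = "location_id")
flat <- merge(flat, taxon, by = "taxon_id")
flat <- flat[order(flat$observation_id), ]
flat
```

Each observation record remains a single row, so the flat table has as many rows as the observation table while gaining the location and taxon columns.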
# Flatten a dataset object
flat <- flatten_data(ants_L1)
flat

# Flatten a list of tables
tables <- ants_L1$tables
flat <- flatten_data(tables)
flat
Plot dates and times samples were collected or observations were made
plot_sample_space_time( data, id = NA_character_, alpha = 1, color_var = "package_id", shape_var = "package_id", observation = NULL )
data |
(list or tbl_df, tbl, data.frame) The dataset object returned by |
id |
(character) Identifier of dataset to be used in plot subtitles. Is automatically assigned when |
alpha |
(numeric) Alpha-transparency scale of data points. Useful when many data points overlap. Allowed values are between 0 and 1, where 1 is 100% opaque. Default is 1. |
color_var |
(character) Name of column to use to assign colors to the points on the plot |
shape_var |
(character) Name of column to use to assign shapes to the points on the plot |
observation |
(tbl_df, tbl, data.frame) DEPRECATED: Use |
The data parameter accepts a range of input types but ultimately requires the 9 columns of the observation table.
(gg, ggplot) A gg, ggplot object if assigned to a variable, otherwise a plot to your active graphics device
## Not run:
# Read a dataset of interest
dataset <- read_data("edi.193.5")

# Plot the dataset
plot_sample_space_time(dataset)

# Flatten the dataset, manipulate, then plot
dataset %>%
  flatten_data() %>%
  dplyr::filter(lubridate::as_date(datetime) > "2003-07-01") %>%
  dplyr::filter(as.numeric(location_id) > 4) %>%
  plot_sample_space_time()

## End(Not run)

# Plot the example dataset
plot_sample_space_time(ants_L1)
Plot sites on US map
plot_sites( data, id = NA_character_, alpha = 1, labels = TRUE, color_var = "package_id", shape_var = "package_id" )
data |
(list or tbl_df, tbl, data.frame) The dataset object returned by |
id |
(character) Identifier of dataset to be used in plot subtitles. Is automatically assigned when |
alpha |
(numeric) Alpha-transparency scale of data points. Useful when many data points overlap. Allowed values are between 0 and 1, where 1 is 100% opaque. Default is 1. |
labels |
(logical) Argument to show labels of each US state. Default is TRUE. |
color_var |
(character) Name of column to use to assign colors to the points on the plot |
shape_var |
(character) Name of column to use to assign shapes to the points on the plot |
The data parameter accepts a range of input types but ultimately requires the 14 columns of the combined observation and location tables.
(gg, ggplot) A gg, ggplot object if assigned to a variable, otherwise a plot to your active graphics device
## Not run:
library(dplyr)

# Read a dataset of interest
dataset <- read_data("edi.193.5")

# Plot the dataset
plot_sites(dataset)

# Flatten dataset then plot
dataset %>%
  flatten_data() %>%
  plot_sites()

# Download a NEON dataset
dataset2 <- read_data(
  id = "neon.ecocomdp.20120.001.001",
  site = c('COMO', 'LECO'),
  startdate = "2017-06",
  enddate = "2021-03",
  token = Sys.getenv("NEON_TOKEN"), # option to use a NEON token
  check.size = FALSE)

# Combine the two datasets and plot. This requires the datasets be first
# flattened and then stacked.
flattened_data1 <- dataset %>% flatten_data()
flattened_data2 <- dataset2 %>% flatten_data()
stacked_data <- bind_rows(flattened_data1, flattened_data2)
plot_sites(stacked_data)

## End(Not run)

# Plot the example dataset
plot_sites(ants_L1)
Plot taxon abundances averaged across observation records for each taxon. Abundances are reported using the units provided in the dataset. In some cases, these counts are not standardized to sampling effort.
plot_taxa_abund( data, id = NA_character_, min_relative_abundance = 0, trans = "identity", facet_var = NA_character_, color_var = NA_character_, facet_scales = "free", alpha = 1 )
data |
(list or tbl_df, tbl, data.frame) The dataset object returned by |
id |
(character) Identifier of dataset to be used in plot subtitles. Is automatically assigned when |
min_relative_abundance |
(numeric) Minimum relative abundance allowed for taxa included in the plot; a value between 0 and 1, inclusive. |
trans |
(character) Define the transform applied to the response variable; "identity" is the default, "log1p" is the log(x+1) transform. Built-in transformations include "asn", "atanh", "boxcox", "date", "exp", "hms", "identity", "log", "log10", "log1p", "log2", "logit", "modulus", "probability", "probit", "pseudo_log", "reciprocal", "reverse", "sqrt" and "time". |
facet_var |
(character) Name of column to use for faceting. Must be a column of the observation or taxon table. |
color_var |
(character) Name of column to use for plot colors. |
facet_scales |
(character) Should scales be free ("free", default value), fixed ("fixed"), or free in one dimension ("free_x", "free_y")? |
alpha |
(numeric) Alpha-transparency scale of data points. Useful when many data points overlap. Allowed values are between 0 and 1, where 1 is 100% opaque. Default is 1. |
The data parameter accepts a range of input types but ultimately requires the 13 columns of the combined observation and taxon tables.
(gg, ggplot) A gg, ggplot object if assigned to a variable, otherwise a plot to your active graphics device
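As a rough illustration of the summary behind this plot, the sketch below averages values across observation records for each taxon and then applies the log(x+1) transform selected by trans = "log1p". The data are hypothetical, and the use of base R's aggregate() is an assumption made for illustration, not the package's implementation.

```r
# Hypothetical observation records; values are invented.
observation <- data.frame(
  taxon_id = c("t1", "t1", "t2", "t2", "t3"),
  value    = c(4, 6, 1, 3, 0),
  stringsAsFactors = FALSE
)

# Mean abundance per taxon, averaged across observation records
mean_abund <- aggregate(value ~ taxon_id, data = observation, FUN = mean)

# log(x + 1) keeps zero abundances finite, unlike log(x)
mean_abund$value_log1p <- log1p(mean_abund$value)
mean_abund
```

A taxon with a mean abundance of zero maps to zero under log1p, which is why this transform is often preferred over "log" or "log10" for count data that include zeros.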
## Not run:
# Read a dataset of interest
dataset <- read_data("edi.193.5")

# plot ecocomDP formatted dataset
plot_taxa_abund(dataset)

# plot flattened ecocomDP dataset, log(x+1) transform abundances
plot_taxa_abund(
  data = flatten_data(dataset),
  trans = "log1p")

# facet by location, color by taxon_rank, log 10 transform
plot_taxa_abund(
  data = dataset,
  facet_var = "location_id",
  color_var = "taxon_rank",
  trans = "log10")

# facet by location, minimum rel. abund = 0.05, log(x+1) transform
plot_taxa_abund(
  data = dataset,
  facet_var = "location_id",
  min_relative_abundance = 0.05,
  trans = "log1p")

# color by location, log 10 transform
plot_taxa_abund(
  data = dataset,
  color_var = "location_id",
  trans = "log10")

# tidy syntax, flatten then filter data by date
dataset %>%
  flatten_data() %>%
  dplyr::filter(
    lubridate::as_date(datetime) > "2003-07-01") %>%
  plot_taxa_abund(
    trans = "log1p",
    min_relative_abundance = 0.01)

## End(Not run)

# Plot the example dataset
plot_taxa_abund(ants_L1)
Plot taxa accumulation by site accumulation
plot_taxa_accum_sites(data, id = NA_character_, alpha = 1, observation = NULL)
data |
(list or tbl_df, tbl, data.frame) The dataset object returned by |
id |
(character) Identifier of dataset to be used in plot subtitles. Is automatically assigned when |
alpha |
(numeric) Alpha-transparency scale of data points. Useful when many data points overlap. Allowed values are between 0 and 1, where 1 is 100% opaque. Default is 1. |
observation |
(tbl_df, tbl, data.frame) DEPRECATED: Use |
The data parameter accepts a range of input types but ultimately requires the 9 columns of the observation table.
(gg, ggplot) A gg, ggplot object if assigned to a variable, otherwise a plot to your active graphics device
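The idea of a taxa-by-site accumulation curve can be sketched in base R as follows. The data are hypothetical, and adding sites in their order of first appearance is an assumption made for illustration; the function itself may order or resample sites differently.

```r
# Hypothetical observation records; values are invented.
observation <- data.frame(
  location_id = c("l1", "l1", "l2", "l2", "l3"),
  taxon_id    = c("t1", "t2", "t2", "t3", "t1"),
  stringsAsFactors = FALSE
)

sites <- unique(observation$location_id)

# For each number of sites n, count the distinct taxa seen at the first n sites
accum <- vapply(seq_along(sites), function(n) {
  seen <- observation$taxon_id[observation$location_id %in% sites[seq_len(n)]]
  length(unique(seen))
}, integer(1))

data.frame(n_sites = seq_along(sites), n_taxa = accum)
```

The curve flattens once additional sites stop contributing taxa that have not already been recorded elsewhere.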
## Not run:
# Read a dataset of interest
dataset <- read_data("edi.193.5")

# Plot the dataset
plot_taxa_accum_sites(dataset)

# Flatten the dataset, manipulate, then plot
dataset %>%
  flatten_data() %>%
  dplyr::filter(lubridate::as_date(datetime) > "2003-07-01") %>%
  plot_taxa_accum_sites()

# Plot from the observation table directly
plot_taxa_accum_sites(dataset$tables$observation)

## End(Not run)

# Plot the example dataset
plot_taxa_accum_sites(ants_L1)
Plot taxa accumulation through time
plot_taxa_accum_time(data, id = NA_character_, alpha = 1, observation = NULL)
data |
(list or tbl_df, tbl, data.frame) The dataset object returned by |
id |
(character) Identifier of dataset to be used in plot subtitles. Is automatically assigned when |
alpha |
(numeric) Alpha-transparency scale of data points. Useful when many data points overlap. Allowed values are between 0 and 1, where 1 is 100% opaque. Default is 1. |
observation |
(tbl_df, tbl, data.frame) DEPRECATED: Use |
The data parameter accepts a range of input types but ultimately requires the 9 columns of the observation table.
(gg, ggplot) A gg, ggplot object if assigned to a variable, otherwise a plot to your active graphics device
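Taxa accumulation through time amounts to a running count of taxa seen for the first time. The base R sketch below shows one way to compute it from the datetime and taxon_id columns; the data are hypothetical and the approach is an assumption, not necessarily how the function is implemented.

```r
# Hypothetical observation records; values are invented.
observation <- data.frame(
  datetime = as.Date(c("2003-06-01", "2003-06-01", "2004-06-01",
                       "2005-06-01", "2005-06-01")),
  taxon_id = c("t1", "t2", "t2", "t3", "t1"),
  stringsAsFactors = FALSE
)

# Order records by time, then flag the first appearance of each taxon
observation <- observation[order(observation$datetime), ]
first_seen <- !duplicated(observation$taxon_id)

# Number of taxa first seen on each date, then the running total
new_per_date <- tapply(first_seen, format(observation$datetime), sum)
accum <- cumsum(new_per_date)
accum
```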
## Not run:
# Read a dataset of interest
dataset <- read_data("edi.193.5")

# Plot the dataset
plot_taxa_accum_time(dataset)

# Flatten the dataset, manipulate, then plot
dataset %>%
  flatten_data() %>%
  dplyr::filter(lubridate::as_date(datetime) > "2003-07-01") %>%
  plot_taxa_accum_time()

# Plot from the observation table directly
plot_taxa_accum_time(dataset$tables$observation)

## End(Not run)

# Plot the example dataset
plot_taxa_accum_time(ants_L1)
Plot diversity (taxa richness) through time
plot_taxa_diversity( data, id = NA_character_, time_window_size = "day", observation = NULL, alpha = 1 )
data |
(list or tbl_df, tbl, data.frame) The dataset object returned by |
id |
(character) Identifier of dataset to be used in plot subtitles. Is automatically assigned when |
time_window_size |
(character) Define the time window over which to aggregate observations for calculating richness. Can be: "day" or "year" |
observation |
(tbl_df, tbl, data.frame) DEPRECATED: Use |
alpha |
(numeric) Alpha-transparency scale of data points. Useful when many data points overlap. Allowed values are between 0 and 1, where 1 is 100% opaque. Default is 1. |
The data parameter accepts a range of input types but ultimately requires the 9 columns of the observation table.
(gg, ggplot) A gg, ggplot object if assigned to a variable, otherwise a plot to your active graphics device
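To make the effect of time_window_size concrete, the sketch below computes taxa richness (the number of distinct taxa) per day and per year in base R. The helper name richness() and the data are hypothetical, and the aggregation shown is an assumption about the general approach rather than the package's code.

```r
# Hypothetical observation records; values are invented.
observation <- data.frame(
  datetime = as.Date(c("2003-06-01", "2003-06-01", "2003-08-15",
                       "2004-06-01", "2004-06-01")),
  taxon_id = c("t1", "t2", "t2", "t3", "t3"),
  stringsAsFactors = FALSE
)

# Richness = number of distinct taxa observed within each time window.
# richness() is a hypothetical helper, not part of ecocomDP.
richness <- function(obs, time_window_size = c("day", "year")) {
  time_window_size <- match.arg(time_window_size)
  fmt <- if (time_window_size == "day") "%Y-%m-%d" else "%Y"
  window <- format(obs$datetime, fmt)
  tapply(obs$taxon_id, window, function(x) length(unique(x)))
}

richness(observation, "day")
richness(observation, "year")
```

Aggregating by year pools taxa recorded on different days, so yearly richness is never lower than the richness of any single day within that year.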
## Not run:
# Read a dataset of interest
dataset <- read_data("edi.193.5")

# Plot the dataset
plot_taxa_diversity(dataset)

# Plot the dataset with observations aggregated by year
plot_taxa_diversity(dataset, time_window_size = "year")

# Flatten the dataset, manipulate, then plot
dataset %>%
  flatten_data() %>%
  dplyr::filter(
    lubridate::as_date(datetime) > "2007-01-01") %>%
  plot_taxa_diversity()

# Plot from the observation table directly
plot_taxa_diversity(dataset$tables$observation)

## End(Not run)

# Plot the example dataset
plot_taxa_diversity(ants_L1)
Plot taxon occurrence frequencies as the number of 'event_id' by 'location_id' combinations in which a taxon is observed.
plot_taxa_occur_freq( data, id = NA_character_, min_occurrence = 0, facet_var = NA_character_, color_var = NA_character_, facet_scales = "free", alpha = 1 )
data |
(list or tbl_df, tbl, data.frame) The dataset object returned by |
id |
(character) Identifier of dataset to be used in plot subtitles. Is automatically assigned when |
min_occurrence |
(numeric) Minimum number of occurrences allowed for taxa included in the plot. |
facet_var |
(character) Name of column to use for faceting. Must be a column of the observation or taxon table. |
color_var |
(character) Name of column to use for plot colors. |
facet_scales |
(character) Should scales be free ("free", default value), fixed ("fixed"), or free in one dimension ("free_x", "free_y")? |
alpha |
(numeric) Alpha-transparency scale of data points. Useful when many data points overlap. Allowed values are between 0 and 1, where 1 is 100% opaque. Default is 1. |
The data parameter accepts a range of input types but ultimately requires the 13 columns of the combined observation and taxon tables.
(gg, ggplot) A gg, ggplot object if assigned to a variable, otherwise a plot to your active graphics device.
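The occurrence frequency defined above (the number of event_id by location_id combinations in which a taxon is observed) can be sketched in base R. The data are hypothetical and the code is an illustration of the definition, not the package's implementation.

```r
# Hypothetical observation records; values are invented.
observation <- data.frame(
  event_id    = c("e1", "e1", "e1", "e2", "e2"),
  location_id = c("l1", "l1", "l2", "l1", "l1"),
  taxon_id    = c("t1", "t1", "t1", "t1", "t2"),
  stringsAsFactors = FALSE
)

# Keep one row per taxon x event x location, then count rows per taxon
occurrences <- unique(observation[, c("taxon_id", "event_id", "location_id")])
occur_freq <- table(occurrences$taxon_id)
occur_freq

# Drop taxa below a minimum number of occurrences
min_occurrence <- 2
occur_freq[occur_freq >= min_occurrence]
```

Repeated records of the same taxon within one event and location count once, which is what separates occurrence frequency from abundance.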
## Not run:
# Read a dataset of interest
dataset <- read_data("edi.193.5")

# Plot the dataset
plot_taxa_occur_freq(dataset)

# Facet by location and color by taxon_rank
plot_taxa_occur_freq(
  data = dataset,
  facet_var = "location_id",
  color_var = "taxon_rank")

# Color by location and only include taxa with >= 5 occurrences
plot_taxa_occur_freq(
  data = dataset,
  color_var = "location_id",
  min_occurrence = 5)

# Flatten, filter using a time cutoff, then plot
dataset %>%
  flatten_data() %>%
  dplyr::filter(lubridate::as_date(datetime) > "2003-07-01") %>%
  plot_taxa_occur_freq()

## End(Not run)

# Plot the example dataset
plot_taxa_occur_freq(ants_L1)
Plot the number of observations that use each taxonomic rank in the dataset.
plot_taxa_rank( data, id = NA_character_, facet_var = NA_character_, facet_scales = "free_x", alpha = 1 )
data |
(list or tbl_df, tbl, data.frame) The dataset object returned by |
id |
(character) Identifier of dataset to be used in plot subtitles. Is automatically assigned when |
facet_var |
(character) Name of column to use for faceting. Must be a column of the observation or taxon table. |
facet_scales |
(character) Should scales be free ("free"), fixed ("fixed"), or free in one dimension ("free_x", "free_y")? Default is "free_x". |
alpha |
(numeric) Alpha-transparency scale of data points. Useful when many data points overlap. Allowed values are between 0 and 1, where 1 is 100% opaque. Default is 1. |
The data parameter accepts a range of input types but ultimately requires the 13 columns of the combined observation and taxon tables.
(gg, ggplot) A gg, ggplot object if assigned to a variable, otherwise a plot to your active graphics device
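The count behind this plot, the number of observations that use each taxonomic rank, can be sketched by joining the observation and taxon tables. The data are hypothetical and the base R calls are an assumption about the general approach.

```r
# Hypothetical miniature tables; values are invented.
observation <- data.frame(
  observation_id = c("o1", "o2", "o3", "o4"),
  taxon_id       = c("t1", "t1", "t2", "t3"),
  stringsAsFactors = FALSE
)
taxon <- data.frame(
  taxon_id   = c("t1", "t2", "t3"),
  taxon_rank = c("species", "genus", "species"),
  stringsAsFactors = FALSE
)

# Attach each observation's taxonomic rank, then count observations per rank
joined <- merge(observation, taxon, by = "taxon_id")
rank_counts <- table(joined$taxon_rank)
rank_counts
```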
## Not run:
# Read a dataset of interest
dataset <- read_data(
  id = "neon.ecocomdp.20120.001.001",
  site = c('COMO', 'LECO'),
  startdate = "2017-06",
  enddate = "2019-09",
  check.size = FALSE)

# Plot the dataset
plot_taxa_rank(dataset)

# Plot with facet by location
plot_taxa_rank(dataset, facet_var = "location_id")

# Flatten the dataset, manipulate, then plot
dataset %>%
  flatten_data() %>%
  dplyr::filter(lubridate::as_date(datetime) > "2003-07-01") %>%
  dplyr::filter(grepl("COMO", location_id)) %>%
  plot_taxa_rank()

## End(Not run)

# Plot the example dataset
plot_taxa_rank(ants_L1)
Read published data
read_data( id = NULL, parse_datetime = TRUE, unique_keys = FALSE, site = "all", startdate = NA, enddate = NA, package = "basic", check.size = FALSE, nCores = 1, forceParallel = FALSE, token = NA, neon.data.save.dir = NULL, neon.data.read.path = NULL, ..., from = NULL, format = "new" )
id |
(character) Identifier of dataset to read. Identifiers are listed in the "id" column of the |
parse_datetime |
(logical) Parse datetime values if TRUE, otherwise return as character strings. |
unique_keys |
(logical) Whether to create globally unique primary keys (and associated foreign keys). Useful in maintaining referential integrity when working with multiple datasets. If TRUE, |
site |
(character) For NEON data, a character vector of site codes to filter data on. Sites are listed in the "sites" column of the |
startdate |
(character) For NEON data, the start date to filter on in the form YYYY-MM. Defaults to NA, meaning all available dates. |
enddate |
(character) For NEON data, the end date to filter on in the form YYYY-MM. Defaults to NA, meaning all available dates. |
package |
(character) For NEON data, either 'basic' or 'expanded', indicating which data package to download. Defaults to basic. |
check.size |
(logical) For NEON data, should the user approve the total file size before downloading? Defaults to FALSE. |
nCores |
(integer) For NEON data, the number of cores to parallelize the stacking procedure. Defaults to 1. |
forceParallel |
(logical) For NEON data, if the data volume to be processed does not meet minimum requirements to run in parallel, this overrides. Defaults to FALSE. |
token |
(character) For NEON data, a user specific API token (generated within neon.datascience user accounts). |
neon.data.save.dir |
(character) For NEON data, an optional and experimental argument (i.e. may not be supported in future releases), indicating the directory where NEON source data should be saved upon download from the NEON API. Data are downloaded using |
neon.data.read.path |
(character) For NEON data, an optional and experimental argument (i.e. may not be supported in future releases), defining a path to read in an .rds file of 'stacked NEON data' from |
... |
For NEON data, other arguments to |
from |
(character) Full path of file to be read (if .rds), or path to directory containing saved datasets (if .csv). |
format |
(character) Format of returned object, which can be: "new" (the new implementation) or "old" (the original implementation; deprecated). In the new format, the top most level of nesting containing the "id" field has been moved to the same level as the "tables", "metadata", and "validation_issues" fields. |
Validation checks are applied to each dataset ensuring it complies with the ecocomDP model. A warning is issued when any validation checks fail. All datasets are returned, even if they fail validation.
Column classes are coerced to those defined in the ecocomDP specification.
Validation happens each time files are read, from source APIs or local environments.
Details for the read_data() function regarding NEON data: Using this function to read data with an id that begins with "neon.ecocomdp" will result in a query to download NEON data from the NEON Data Portal API using neonUtilities::loadByProduct(). If a query includes provisional data (or if you are not sure whether it does), we recommend saving a copy of the data in the original format provided by NEON in addition to the derived ecocomDP data package. To do this, provide a directory path using the neon.data.save.dir argument. For example, the query my_ecocomdp_data <- read_data(id = "neon.ecocomdp.10022.001.001", neon.data.save.dir = "my_neon_data") will download the data for NEON Data Product ID DP1.10022.001 (ground beetles in pitfall traps) and convert it to the ecocomDP data model. In doing so, a copy of the original NEON download will be saved in the directory "my_neon_data" with the filename "DP1.10022.001_<timestamp>.RDS", and the derived data package in the ecocomDP format will be stored in your R environment in an object named "my_ecocomdp_data". Further, if you wish to reload a previously downloaded NEON dataset into the ecocomDP format, you can do so using my_ecocomdp_data <- read_data(id = "neon.ecocomdp.10022.001.001", neon.data.read.path = "my_neon_data/DP1.10022.001_<timestamp>.RDS").
Provisional NEON data. Despite NEON's controlled data entry, at times, errors are found in published data; for example, an analytical lab may adjust its calibration curve and re-calculate past analyses, or field scientists may discover a past misidentification. In these cases, Level 0 data are edited and the data are re-processed to Level 1 and re-published. Published data files include a time stamp in the file name; a new time stamp indicates data have been re-published and may contain differences from previously published data. Data are subject to re-processing at any time during an initial provisional period; data releases are never re-processed. All records downloaded from the NEON API will have a "release" field. For any provisional record, the value of this field will be "PROVISIONAL", otherwise, this field will have a value indicating the version of the release to which the record belongs. More details can be found at https://www.neonscience.org/data-samples/data-management/data-revisions-releases.
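A quick way to act on the "release" field is to check for the "PROVISIONAL" value before relying on a download. The sketch below uses a hypothetical table of records; only the column name release and the value "PROVISIONAL" come from the description above, and "RELEASE-A" is a placeholder rather than a real release tag.

```r
# Hypothetical records; "RELEASE-A" is a placeholder release tag.
records <- data.frame(
  record_id = c("r1", "r2", "r3"),
  release   = c("RELEASE-A", "PROVISIONAL", "RELEASE-A"),
  stringsAsFactors = FALSE
)

# Flag records that may still be re-processed by NEON
is_provisional <- records$release == "PROVISIONAL"

if (any(is_provisional)) {
  message(sum(is_provisional), " provisional record(s) found; ",
          "consider archiving the original NEON download.")
}
```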
(list) A dataset with the structure:
id - Dataset identifier
metadata - List of info about the dataset. NOTE: This object is under development and content may change in future releases.
tables - List of dataset tables as data.frames.
validation_issues - List of validation issues. If the dataset fails any validation checks, then descriptions of each issue are listed here.
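The structure listed above can be explored like any nested R list. The sketch below builds a hand-made stand-in with the same four top-level fields; its contents, including the id "example.1.1", are invented, and a real dataset returned by read_data() will hold more tables and metadata.

```r
# A hand-built stand-in with the same top-level fields; contents are invented.
dataset <- list(
  id = "example.1.1",
  metadata = list(),
  tables = list(
    observation = data.frame(
      observation_id = c("o1", "o2"),
      value = c(3, 1)
    )
  ),
  validation_issues = list()
)

# Top-level fields
names(dataset)

# Tables are ordinary data frames
nrow(dataset$tables$observation)

# With no failed checks, no issue descriptions are listed
length(dataset$validation_issues) == 0
```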
This function may not work between 01:00 - 03:00 UTC on Wednesdays due to regular maintenance of the EDI Data Repository.
## Not run:
# Read from EDI
dataset <- read_data("edi.193.5")
str(dataset, max.level = 2)

# Read from NEON (full dataset)
dataset <- read_data("neon.ecocomdp.20120.001.001")

# Read from NEON with filters (partial dataset)
dataset <- read_data(
  id = "neon.ecocomdp.20120.001.001",
  site = c("COMO", "LECO", "SUGG"),
  startdate = "2017-06",
  enddate = "2019-09",
  check.size = FALSE)

# Read with datetimes as character
dataset <- read_data("edi.193.5", parse_datetime = FALSE)
is.character(dataset$tables$observation$datetime)

# Read from saved .rds
save_data(dataset, tempdir())
dataset <- read_data(from = paste0(tempdir(), "/dataset.rds"))

# Read from saved .csv
save_data(dataset, tempdir(), type = ".csv") # Save as .csv
dataset <- read_data(from = tempdir())

## End(Not run)
Save a dataset
save_data(dataset, path, type = ".rds", name = NULL)
dataset |
(list) One or more datasets of the structure returned by |
path |
(character) Path to the directory in which |
type |
(character) Type of file to save the |
name |
(character) An optional argument for setting the saved file name (for .rds) if you'd like it to be different than |
.rds |
If |
.csv |
If |
Subsequent calls won't overwrite files or directories
# Create directory for the data
mypath <- paste0(tempdir(), "/data")
dir.create(mypath)

# Save as .rds
save_data(ants_L1, mypath)
dir(mypath)

# Save as .rds with the name "mydata"
save_data(ants_L1, mypath, name = "mydata")
dir(mypath)

# Save as .csv
save_data(ants_L1, mypath, type = ".csv")
dir(mypath)

## Not run:
# Save multiple datasets
ids <- c("edi.193.5", "edi.303.2", "edi.290.2")
datasets <- lapply(ids, read_data)
save_data(datasets, mypath)
dir(mypath)

## End(Not run)

# Clean up
unlink(mypath, recursive = TRUE)
Search published data
search_data(text, taxa, num_taxa, num_years, sd_years, area, boolean = "AND")
text |
(character) Text to search for in dataset titles, descriptions, and abstracts. Datasets matching any exact words or phrase will be returned. Can be a regular expression as used by |
taxa |
(character) Taxonomic names to search on. To effectively search the taxonomic tree, it is advisable to start with specific taxonomic names and then gradually broaden the search to higher rank levels when needed. For instance, if searching for "Astragalus gracilis" (species) doesn't produce any results, try expanding the search to "Astragalus" (Genus), "Fabaceae" (Family), and so on. This approach accounts for variations in organism identification, ensuring a more comprehensive exploration of the taxonomic hierarchy. |
num_taxa |
(numeric) Minimum and maximum number of taxa the dataset should contain. Any datasets within this range will be returned. |
num_years |
(numeric) Minimum and maximum number of years sampled the dataset should contain. Any datasets within this range will be returned. |
sd_years |
(numeric) Minimum and maximum standard deviation between survey dates (in years). Any datasets within this range will be returned. |
area |
(numeric) Bounding coordinates within which the data should originate. Accepted values are in decimal degrees and in the order: North, East, South, West. Any datasets with overlapping areas or contained points will be returned. |
boolean |
(character) Boolean operator to use when searching |
Currently, to accommodate multiple L1 versions of NEON data products, search results for a NEON L0 will also list all the L1 versions available for the match. This method is based on the assumption that the summary data among L1 versions is the same, which may need to be addressed in the future. L0 identifiers and their corresponding L1 identifiers are listed in /inst/L1_versions.txt. Each L1 version is accompanied by qualifying text that's appended to the title, abstract, and descriptions for comprehension of the differences among L1 versions.
(tbl_df, tbl, data.frame) Search results with these fields:
source - Source from which the dataset originates. Currently supported are "EDI" and "NEON".
id - Identifier of the dataset.
title - Title of the dataset.
description - Description of dataset. Only returned for NEON datasets.
abstract - Abstract of dataset.
years - Number of years sampled.
sampling_interval - Standard deviation between sampling events in years.
sites - Site names or abbreviations. Only returned for NEON datasets.
url - URL to dataset.
source_id - Identifier of source L0 dataset.
source_id_url - URL to source L0 dataset.
This function may not work between 01:00 - 03:00 UTC on Wednesdays due to regular maintenance of the EDI Data Repository.
## Not run:
# Empty search returns all available datasets
search_data()

# "text" searches titles, descriptions, and abstracts
search_data(text = "Lake")

# "taxa" searches taxonomic ranks for a match
search_data(taxa = "Plantae")

# "num_years" searches the number of years sampled
search_data(num_years = c(10, 20))

# Use any combination of search fields to find the data you're looking for
search_data(
  text = c("Lake", "River"),
  taxa = c("Plantae", "Animalia"),
  num_taxa = c(0, 10),
  num_years = c(10, 100),
  sd_years = c(.01, 100),
  area = c(47.1, -86.7, 42.5, -92),
  boolean = "OR")
## End(Not run)
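The advice for the taxa argument (start with a specific name, then broaden to higher ranks) can be sketched as a small loop. The helper search_broadening() and the stand-in fake_search() are hypothetical and are not part of ecocomDP. The sketch also assumes that a successful search is a data frame with at least one row and treats any other return value as "no match".

```r
# Hypothetical helper (not part of ecocomDP): try each name in turn, from most
# specific to broadest, and stop at the first search that returns rows.
# `search_fn` is any function that accepts a `taxa` argument.
search_broadening <- function(taxa_names, search_fn) {
  for (name in taxa_names) {
    res <- search_fn(taxa = name)
    # Assumption: a successful search is a data frame with at least one row;
    # anything else is treated as "no match".
    if (is.data.frame(res) && nrow(res) > 0) {
      return(list(matched = name, results = res))
    }
  }
  list(matched = NA_character_, results = NULL)
}

# Stand-in search function so the sketch runs offline; it only "knows" Fabaceae.
fake_search <- function(taxa) {
  if (identical(taxa, "Fabaceae")) {
    data.frame(source = "EDI", id = "example-id", stringsAsFactors = FALSE)
  } else {
    "No results found."
  }
}

hit <- search_broadening(
  c("Astragalus gracilis", "Astragalus", "Fabaceae"),
  fake_search
)
hit$matched
```

With the package installed and a network connection available, search_data could be passed as search_fn in place of the stand-in.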
Validate tables against the model
validate_data(dataset = NULL, path = NULL)
dataset |
(list) A dataset of the structure returned by |
path |
(character) Path to a directory containing ecocomDP tables as files. |
Validation checks:
File names - File names are the ecocomDP table names.
Table presence - Required tables are present.
Column names - Column names of all tables match the model.
Column presence - Required columns are present.
Column classes - Column classes match the model specification.
Datetime format - Date and time formats follow the model specification.
Primary keys - Primary keys of tables are unique.
Composite keys - Composite keys (unique constraints) of each table are unique.
Referential integrity - Foreign keys have a corresponding primary key.
Coordinate format - Values are in decimal degree format.
Coordinate range - Values are within -90 to 90 and -180 to 180.
Elevation - Values are less than Mount Everest (8848 m) and greater than Mariana Trench (-10984 m).
Variable mapping - variable_name is in table_name.
Mapped_id - Values in mapped_id are valid URIs.
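To make a few of these checks concrete, the following base R sketch applies the primary key, referential integrity, and coordinate range rules to two toy tables. It is an illustration only and not the package's implementation; the table contents are invented, and only the column names follow the ecocomDP model.

```r
# Toy tables with deliberate problems; not the package's implementation.
location <- data.frame(
  location_id = c("loc1", "loc2", "loc2"),    # "loc2" is a duplicated primary key
  latitude    = c(42.5, 95.0, 42.6),          # 95 is outside -90 to 90
  longitude   = c(-72.2, -72.3, -72.4),
  stringsAsFactors = FALSE
)
observation <- data.frame(
  observation_id = c("o1", "o2", "o3"),
  location_id    = c("loc1", "loc2", "loc9"), # "loc9" has no location record
  stringsAsFactors = FALSE
)

# Primary keys: values must be unique within the table
dup_keys <- unique(location$location_id[duplicated(location$location_id)])

# Referential integrity: every foreign key needs a matching primary key
orphans <- setdiff(observation$location_id, location$location_id)

# Coordinate range: latitude in -90 to 90, longitude in -180 to 180
bad_lat <- which(location$latitude < -90 | location$latitude > 90)
bad_lon <- which(location$longitude < -180 | location$longitude > 180)

list(dup_keys = dup_keys, orphans = orphans, bad_lat = bad_lat, bad_lon = bad_lon)
```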
(list) If any checks fail, a list of validation issues is returned along with a warning. If no issues are found, NULL is returned.
This function is used by ecocomDP creators (to ensure what has been created is valid), maintainers (to improve the quality of archived ecocomDP datasets), and users (to ensure the data being used is free of error).
## Not run:
# Write a set of ecocomDP tables to file for validation
mydir <- paste0(tempdir(), "/dataset")
dir.create(mydir)
write_tables(
  path = mydir,
  observation = ants_L1$tables$observation,
  observation_ancillary = ants_L1$tables$observation_ancillary,
  location = ants_L1$tables$location,
  location_ancillary = ants_L1$tables$location_ancillary,
  taxon = ants_L1$tables$taxon,
  taxon_ancillary = ants_L1$tables$taxon_ancillary,
  dataset_summary = ants_L1$tables$dataset_summary,
  variable_mapping = ants_L1$tables$variable_mapping)

# Validate
validate_data(path = mydir)

# Clean up
unlink(mydir, recursive = TRUE)
## End(Not run)
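Because the return value is either NULL or a list of issues, calling code usually needs to branch on it. A minimal sketch follows, using a hypothetical helper summarize_validation() that is not part of ecocomDP. It assumes each list element can be coerced to text; the exact structure of the elements is not specified here.

```r
# Hypothetical helper (not part of ecocomDP): turn the documented return value
# (NULL, or a list of issues) into printable lines.
summarize_validation <- function(issues) {
  if (is.null(issues)) {
    return("No validation issues found.")
  }
  # Assumption: each list element can be coerced to text.
  lines <- vapply(
    issues,
    function(x) paste(as.character(x), collapse = " "),
    character(1),
    USE.NAMES = FALSE
  )
  c(sprintf("%d validation issue(s):", length(lines)), paste0("- ", lines))
}

# Stand-in values in place of a real validate_data() result
summarize_validation(NULL)
summarize_validation(list("Example issue one", "Example issue two"))
```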
Write tables to file
write_tables( path, sep = ",", observation = NULL, location = NULL, taxon = NULL, dataset_summary = NULL, observation_ancillary = NULL, location_ancillary = NULL, taxon_ancillary = NULL, variable_mapping = NULL )
path |
(character) A path to the directory in which the files will be written. |
sep |
(character) Field delimiter to use when writing files. Default is comma. |
observation |
(tbl_df, tbl, data.frame) The observation table. |
location |
(tbl_df, tbl, data.frame) The location table. |
taxon |
(tbl_df, tbl, data.frame) The taxon table. |
dataset_summary |
(tbl_df, tbl, data.frame) The dataset_summary table. |
observation_ancillary |
(tbl_df, tbl, data.frame) The observation_ancillary table. |
location_ancillary |
(tbl_df, tbl, data.frame) The location_ancillary table. |
taxon_ancillary |
(tbl_df, tbl, data.frame) The taxon_ancillary table. |
variable_mapping |
(tbl_df, tbl, data.frame) The variable_mapping table. |
ecocomDP tables as sep delimited files.
# Create directory for the tables
mypath <- paste0(tempdir(), "/data")
dir.create(mypath)

# Create a couple inputs to write_tables()
flat <- ants_L0_flat

observation <- create_observation(
  L0_flat = flat,
  observation_id = "observation_id",
  event_id = "event_id",
  package_id = "package_id",
  location_id = "location_id",
  datetime = "datetime",
  taxon_id = "taxon_id",
  variable_name = "variable_name",
  value = "value",
  unit = "unit")

observation_ancillary <- create_observation_ancillary(
  L0_flat = flat,
  observation_id = "observation_id",
  variable_name = c("trap.type", "trap.num", "moose.cage"))

# Write tables to file
write_tables(
  path = mypath,
  observation = observation,
  observation_ancillary = observation_ancillary)

dir(mypath)

# Clean up
unlink(mypath, recursive = TRUE)
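For a quick look at delimited files in a directory, the tables can be read back into a named list with base R. This is a sketch under assumptions: read_delimited_tables() is a hypothetical helper, not part of ecocomDP, and it assumes every file in the directory has a header row and is named after its table. The demo files are written with base R's write.table() as a stand-in for write_tables() so that the sketch runs without the package.

```r
# Hypothetical helper (not part of ecocomDP): read every delimited file in a
# directory into a list named by file name (without extension).
read_delimited_tables <- function(path, sep = ",") {
  files <- list.files(path, full.names = TRUE)
  tables <- lapply(files, utils::read.table, header = TRUE, sep = sep,
                   stringsAsFactors = FALSE)
  names(tables) <- tools::file_path_sans_ext(basename(files))
  tables
}

# Stand-in files written with base R, using invented toy values
demo_dir <- file.path(tempdir(), "ecocomdp_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.table(
  data.frame(taxon_id = c("t1", "t2"), taxon_name = c("Formica", "Lasius")),
  file.path(demo_dir, "taxon.csv"), sep = ",", row.names = FALSE)
write.table(
  data.frame(location_id = "loc1", latitude = 42.5),
  file.path(demo_dir, "location.csv"), sep = ",", row.names = FALSE)

tables <- read_delimited_tables(demo_dir, sep = ",")
names(tables)

# Clean up
unlink(demo_dir, recursive = TRUE)
```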