diff --git a/man/bold.analyze.align.Rd b/man/bold.analyze.align.Rd index 02263c6..c466f5a 100644 --- a/man/bold.analyze.align.Rd +++ b/man/bold.analyze.align.Rd @@ -32,7 +32,8 @@ bold.analyze.align( Function designed to transform and align the sequence data retrieved from the function \code{bold.fetch}. } \details{ -\code{bold.analyze.align} retrieves the sequence information obtained using \code{bold.fetch} function and performs a multiple sequence alignment. It utilizes the \code{msa::msa()} function with default settings. Type of clustering method can be specified using the \code{align_method} argument (\code{Muscle}, \code{ClustalW} and \code{ClustalOmega} are available using the \code{msa} package). Additional arguments from the \code{msa} function can be passed via the \code{...} argument (arguments like \code{gapOpening}, \code{gapExtension}, \code{maxiters}, \code{substitutionMatrix},\code{type}). Marker name provided must match with the standard marker names (Ex. COI-5P) available on the \href{https://boldsystems.org/}{BOLD webpage} (Ratnasingham et al. 2024; pg.404). Name for individual sequences in the output can be customized by using the \code{cols_for_seq_names} argument. If more than one field is specified, the name will follow the sequence of the fields given in the vector. Performing a multiple sequence alignment on large sequence data might slow the system. Additionally, users are responsible for verifying the sequence quality and integrity, as the function does not provide any checks on issues like STOP codons and indels within the data by default. The output of this function is a modified Barcode Core Data Model (BCDM) dataframe, which includes two additional columns: one for the aligned sequences and another for the names given to the sequences. +\code{bold.analyze.align} takes the sequence information obtained using \code{\link[=bold.fetch]{bold.fetch()}} function and performs a multiple sequence alignment. 
It uses the \code{msa::msa()} function with default settings but additional arguments from the \code{msa} function can be passed through the \code{...} argument. The clustering method can be specified using the \code{align_method} argument, with options including \code{Muscle}, \code{ClustalW} and \code{ClustalOmega} (available via the \code{msa} package). The provided marker name must match the standard marker names (Ex. COI-5P) available on the BOLD webpage (Ratnasingham et al. 2024; pg.404). The name for individual sequences in the output can be customized by using the \code{cols_for_seq_names} argument. If multiple fields are specified, the sequence name will follow the order of fields given in the vector. Performing a multiple sequence alignment on large sequence data might slow the system. Additionally, users are responsible for verifying the sequence quality and integrity, as the function does not automatically check for issues like STOP codons and indels within the data. +The output of this function is a modified Barcode Core Data Model (BCDM) dataframe, which includes two additional columns: one for the aligned sequences and one for the customized sequence names. \emph{Note:} Users are required to install and load the \code{Biostrings} and \code{msa} packages using \code{BiocManager} before running this function. } @@ -41,9 +42,10 @@ Function designed to transform and align the sequence data retrieved from the fu # Search for ids seq.data.ids <- bold.public.search(taxonomy = c("Oreochromis tanganicae", "Oreochromis karongae")) -# Fetch the data using the ids -#1. api_key must be obtained from BOLD support before usage -#2. The function `bold.apikey` should be used to set the apikey +# Fetch the data using the ids. +#1. api_key must be obtained from BOLD support before using `bold.fetch` function. +#2. Use the `bold.apikey` function to set the apikey in the global env. 
+ bold.apikey('apikey') seq.data<-bold.fetch(get_by = "processid", diff --git a/man/bold.analyze.diversity.Rd b/man/bold.analyze.diversity.Rd index 495aee8..69933c6 100644 --- a/man/bold.analyze.diversity.Rd +++ b/man/bold.analyze.diversity.Rd @@ -31,7 +31,7 @@ bold.analyze.diversity( \item{presence_absence}{A logical value specifying whether the generated matrix should be converted into a ’presence-absence’ matrix.} -\item{diversity_profile}{A character value specifying the type of diversity profile ("richness","preston","shannon","beta")} +\item{diversity_profile}{A character value specifying the type of diversity profile ("richness","preston","shannon","beta","all").} \item{beta_index}{A character vector specifying the type of beta diversity index (’jaccard’ or ’sorensen’ available).} } @@ -59,29 +59,21 @@ An 'output' list containing results based on the profile selected: This function creates a biodiversity profile of the downloaded data using \code{\link[=bold.fetch]{bold.fetch()}}. } \details{ -\code{bold.analyze.diversity} estimates the richness & calculates the Shannon and beta diversity from the BIN counts or presence-absence data. Internally, the function converts the downloaded BCDM data into a community matrix (site X species) which is also generated as a part of the output. \code{taxon_rank} has to be provided by default while \code{taxon_name} if given, will create the matrix for that specific taxon/taxa. \code{site_type} = 'locations' can be used when a profile pertaining to particular geographic category is needed. This category can be specified using the \code{location_type} argument. \code{site_type}= grids creates a grid based on BIN occurrence data (latitude, longitude) with grid size determined by the user in square meters using the \code{gridsize} argument. The the Coordinate Reference System (CRS) of this data is converted to a ‘Mollweide’ projection by which distance-based grid can be correctly specified (Gott III et al. 2007). 
Each grid is also assigned a cell id, with the lowest number given to the lowest latitudinal point in the dataset. The \code{presence_absence} argument converts the counts (or abundances) to 1s and 0s. -The community matrix is then used to create one of the following \code{diversity_profile} using functions from \code{BAT} and \code{vegan} packages: -\itemize{ -\item \code{richness}(\code{BAT::alpha.accum()}) -\item \code{preston}(plots and results)(\code{vegan::prestondistr()}) -\item \code{shannon}(\code{vegan::diversity()}) -\item \code{beta}(\code{BAT::beta()}) -\item \code{all} generates results for all of the above. -} - -\code{BAT::alpha.accum()} currently offers various richness estimators, including Observed diversity (Obs); Singletons (S1); Doubletons (S2); Uniques (Q1); Duplicates (Q2); Jackknife1 abundance (Jack1ab); Jackknife1 incidence (Jack1in); Jackknife2 abundance (Jack2ab); Jackknife2 incidence (Jack2in); Chao1 and Chao2. The results depend on the input data (true abundances vs counts vs incidences) and users should be careful in the subsequent interpretation. -Preston plots feature cyan bars for observed species (or equivalent taxonomic group) and orange dots for expected counts. -\code{BAT::beta()} partitions the data using the Podani & Schmera (2011)/Carvalho et al. (2012) approach partitioning the beta diversity into ’species replacement’ and ’richness difference’ components. These results are stored as distance matrices in the output. The type of beta index can be specified using the \code{beta_index} argument. -\emph{Note on the community matrix}: Each cell in this matrix contains the counts (or abundances) of the specimens whose sequences have an assigned BIN, in a given a \code{site_type} (\code{locations} or \code{grids}). These counts can be generated at any taxonomic hierarchical level, applicable to one or multiple taxa including ’bin_uri’. 
\code{location_type} can refer to any geographic field, and metadata on these fields can be checked using the \code{bold.fields.info()}. Rows lacking latitude and longitude data are removed when \code{site_type} = 'grids', while NULL entries for \code{site_type} = 'locations' are allowed if they have a latitude and longitude value. This is because grids are drawn based on the bounding boxes which only use latitude and longitude values. -\emph{Important Note}: Results, including species counts, adapt based on taxon_rank argument although the output label remains ‘species’ in case of \code{preston} results. +\code{bold.analyze.diversity} estimates the richness, Shannon diversity and beta diversity from the BIN counts or presence-absence data. Internally, the function converts the downloaded BCDM data into a community matrix (site X species) which is also generated as a part of the output. grids.cat converts the Coordinate Reference System (CRS) of the data to a ‘Mollweide’ projection by which distance-based grid can be correctly specified (Gott III et al. 2007). Each grid is assigned a cell id, with the lowest number given to the lowest latitudinal point in the dataset. The community matrix generated by the function is used to create richness profiles using \code{BAT::alpha.accum()} and Preston and Shannon diversity analyses using \code{vegan::prestondistr()} and \code{vegan::diversity()} respectively. The \code{BAT::alpha.accum()} currently offers various richness estimators, including Observed diversity (Obs); Singletons (S1); Doubletons (S2); Uniques (Q1); Duplicates (Q2); Jackknife1 abundance (Jack1ab); Jackknife1 incidence (Jack1in); Jackknife2 abundance (Jack2ab); Jackknife2 incidence (Jack2in); Chao1 and Chao2. The results depend on the input data (true abundances vs counts vs incidences) and users should be careful in the subsequent interpretation. 
+Preston plots are generated using the data from the \code{prestondistr} results in \code{ggplot2} featuring cyan bars for observed species (or equivalent taxonomic group) and orange dots for expected counts. The \code{presence_absence} argument converts the counts (or abundances) to 1s and 0s. This dataset can then be directly used as input data for biodiversity analysis functions from packages like vegan. +Beta diversity values are calculated using the \code{BAT::beta()} function, which partitions the data using the Podani & Schmera (2011)/Carvalho et al. (2012) approach partitioning the beta diversity into ’species replacement’ and ’richness difference’ components. These results are stored as distance matrices in the output. +\emph{Note on the community matrix}: Each cell in this matrix contains the counts (or abundances) of the specimens whose sequences have an assigned BIN, in a given site category (\code{site.cat}) or a grid (\code{grids.cat}). These counts can be generated at any taxonomic hierarchical level, applicable to one or multiple taxa including ’bin_uri’. The \code{site.cat} can refer to any geographic field, and metadata on these fields can be checked using the \code{bold.fields.info()}. If \code{grids.cat} = TRUE, grids are generated based on BIN occurrence data (latitude, longitude) with grid size determined by the user in square meters using the \code{gridsize} argument. Rows lacking latitude and longitude data are removed, while NULL entries for site.cat are allowed if they have a latitude and longitude value. This is because grids are drawn based on the bounding boxes which only use latitude and longitude values. +\emph{Important Note}: Results, including species counts, adapt based on the taxon_rank argument although the output label remains ‘species’ in some instances (\code{preston.res}). } \examples{ \dontrun{ # Search for ids comm.mat.data <- bold.public.search(taxonomy = "Poecilia") -#1. 
api_key must be obtained from BOLD support before usage -#2. The function `bold.apikey` should be used to set the apikey +# Fetch the data using the ids. +#1. api_key must be obtained from BOLD support before using `bold.fetch` function. +#2. Use the `bold.apikey` function to set the apikey in the global env. + bold.apikey('apikey') BCDMdata <- bold.fetch(get_by = "processid", diff --git a/man/bold.analyze.map.Rd b/man/bold.analyze.map.Rd index 2dc291f..36a6c34 100644 --- a/man/bold.analyze.map.Rd +++ b/man/bold.analyze.map.Rd @@ -31,9 +31,10 @@ This function creates basic maps of BIN occurrence data at different scales. #Download the ids geo_data.ids <- bold.public.search(taxonomy = "Musca domestica") -# Fetch the data using the ids -#1. api_key must be obtained from BOLD support before usage -#2. The function `bold.apikey` should be used to set the apikey +# Fetch the data using the ids. +#1. api_key must be obtained from BOLD support before using `bold.fetch` function. +#2. Use the `bold.apikey` function to set the apikey in the global env. + bold.apikey('apikey') geo_data <- bold.fetch(get_by = "processid", diff --git a/man/bold.analyze.tree.Rd b/man/bold.analyze.tree.Rd index 87182b9..66356b0 100644 --- a/man/bold.analyze.tree.Rd +++ b/man/bold.analyze.tree.Rd @@ -54,8 +54,10 @@ Calculates genetic distances and performs a Neighbor Joining tree estimation of #Download the data ids seq.data.ids <- bold.public.search(taxonomy = c("Eulimnadia")) -#1. api_key must be obtained from BOLD support before usage -#2. The function `bold.apikey` should be used to set the apikey +# Fetch the data using the ids. +#1. api_key must be obtained from BOLD support before using `bold.fetch` function. +#2. Use the `bold.apikey` function to set the apikey in the global env. 
+ bold.apikey('apikey') seq.data <- bold.fetch(get_by = "processid", diff --git a/man/bold.apikey.Rd b/man/bold.apikey.Rd index d6e14a7..e8013ae 100644 --- a/man/bold.apikey.Rd +++ b/man/bold.apikey.Rd @@ -13,8 +13,19 @@ bold.apikey(apikey) Token saved as 'apikey' } \description{ -Generates an 'apikey' variable in the R session +Stores the BOLD-provided access token ‘api key’ in a variable, making it available for use in other functions within the R session. } \details{ -\code{bold.apikey} creates a variable called \code{apikey} that stores the access token provided by BOLD (The access token has to be procured from BOLD before using the function/package). The \code{apikey} variable is then used by the \code{\link[=bold.fetch]{bold.fetch()}} internally so that the user does not have to input it again. The token must be provided as an input for the function before any other functions are used so as to set the apikey. The \code{api_key} is a UUID v4 hexadecimal string obtained upon request from BOLD at \code{support@boldsystems.org} and is valid for one year, requiring renewal thereafter. +\code{bold.apikey} creates a variable called \code{apikey} that stores the access token provided by BOLD. This apikey variable is then used internally by the \code{\link[=bold.fetch]{bold.fetch()}} function, so that the user does not need to input it again. To set the \code{apikey}, the token must be provided as an input for the function before any other functions are called. The api_key is a UUID v4 hexadecimal string obtained upon request from BOLD at \code{support@boldsystems.org} and is valid for one year, after which it must be renewed. 
+} +\examples{ +\dontrun{ + +# The example below is for documentation only + +bold.apikey('00000000-0000-0000-0000-000000000000') + +} + + } diff --git a/man/bold.data.summarize.Rd b/man/bold.data.summarize.Rd index 73d19b2..8cf1bce 100644 --- a/man/bold.data.summarize.Rd +++ b/man/bold.data.summarize.Rd @@ -42,8 +42,10 @@ The function is used to obtain a detailed summary of the data obtained by \code{ bold_data.ids <- bold.public.search(taxonomy = "Oreochromis") # Fetch the data using the ids. -#1. api_key must be obtained from BOLD support before usage -#2. the function `bold.apikey` should be used to set the apikey in the global env +#1. api_key must be obtained from BOLD support before using `bold.fetch` function. +#2. Use the `bold.apikey` function to set the apikey in the global env. + +bold.apikey('apikey') bold.data <- bold.fetch(get_by = "processid", identifiers = bold_data.ids$processid) diff --git a/man/bold.export.Rd b/man/bold.export.Rd index 5bc41e8..a4d76db 100644 --- a/man/bold.export.Rd +++ b/man/bold.export.Rd @@ -21,7 +21,7 @@ bold.export( \item{cols_for_fas_names}{A single or multiple character vector indicating the column headers that should be used to name each sequence for the unaligned FASTA file. Default is NULL; in this case, only the processid is used as the name.} -\item{export_to}{A character value specifying the data path and the name for the file. Extension should not be provided} +\item{export_to}{A character value specifying the data file path and the name for the file. Extension should not be included.} } \value{ It exports a .fas or a tsv file based on the export argument. @@ -48,16 +48,21 @@ Only one preset can be used at a time. The name for individual sequences in the # Download the records data_for_export_ids <- bold.public.search(taxonomy = "Poecilia reticulata") +# Fetch the data using the ids. +#1. api_key must be obtained from BOLD support before using `bold.fetch` function. +#2. 
Use the `bold.apikey` function to set the apikey in the global env. + +bold.apikey('apikey') + # Fetching the data using the ids data_for_export <- bold.fetch(get_by = "processid", identifiers = data_for_export_ids$processid) #1. Export the BCDM data using 'presets' -# (Using getwd() as the path and trial_export as the name) bold.export(bold_df=data_for_export, export_type = "preset_df", presets = 'taxonomy', - export_to = paste(getwd(),'/','trial_export',sep="")) + export_to = "file_path_with_intended_name") #2. Export multiple sequence alignment #a. Align the data @@ -71,14 +76,14 @@ seq_align<-bold.analyze.align(data_for_export, # Note the input data here is the modified BCDM data (seq_align) bold.export(bold_df=seq_align, export_type = "msa", - export_to = paste(getwd(),'/','trial_export',sep="")) + export_to = "file_path_with_intended_name") #3. Export the fasta file (unaligned) # Note that input data here is the original BCDM data (data_for_export) bold.export(bold_df = data_for_export, export_type = "fas", cols_for_fas_names = c("bin_uri","genus","species"), - export_to = paste(getwd(),'/','trial_export',sep="")) + export_to = "file_path_with_intended_name") } } diff --git a/man/bold.fetch.Rd b/man/bold.fetch.Rd index b898100..b8f19ad 100644 --- a/man/bold.fetch.Rd +++ b/man/bold.fetch.Rd @@ -26,7 +26,7 @@ bold.fetch( ) } \arguments{ -\item{get_by}{The parameter on which the data should be fetched (“processid”, “sampleid”, "bin_uri", "dataset_codes" or "project_codes")} +\item{get_by}{The parameter used to fetch data (“processid”, “sampleid”, "bin_uri", "dataset_codes" or "project_codes")} \item{identifiers}{A vector (or a data frame column) pointing to the \code{get_by} parameter specified.} @@ -77,8 +77,10 @@ Retrieves public and private user data based on different parameters (processid, #Test data with processids data(test.data) -#1. api_key must be obtained from BOLD support before usage -#2. 
The function `bold.apikey` should be used to set the apikey +# Fetch the data using the ids. +#1. api_key must be obtained from BOLD support before using `bold.fetch` function. +#2. Use the `bold.apikey` function to set the apikey in the global env. + bold.apikey('apikey') # With processids diff --git a/man/bold.public.search.Rd b/man/bold.public.search.Rd index 8b3be3c..68d455b 100644 --- a/man/bold.public.search.Rd +++ b/man/bold.public.search.Rd @@ -30,7 +30,7 @@ A data frame containing all the processids and sampleids related to the query se Retrieves record ids for publicly available data based on taxonomy, geography, bin_uris or datasets/project codes search. } \details{ -\code{bold.public.search} searches publicly available data on BOLD, retrieving associated proccessids and sampleids, which can then be accessed using \code{bold.fetch}. Search parameters can include one or a combination of taxonomy, geography, bin uris, dataset or project codes. While there is no limit on the amount of ID data that can be downloaded, complex combinations of the search parameters may exceed the predetermined weburl character length (2048 characters). Searches using a single parameter are not subject to this limit. For multiparameter searches (e.g. taxonomy + geography + bins; see the example: Taxonomy + Geography + BIN id), it’s crucial that the parameters are logically combined to ensure accurate and non-empty results. +\code{bold.public.search} searches publicly available data on BOLD, retrieving associated processids and sampleids, which can then be accessed using \code{bold.fetch}. Search parameters can include one or a combination of taxonomy, geography, bin uris, dataset or project codes. While there is no limit on the amount of ID data that can be downloaded, complex combinations of the search parameters may exceed the predetermined web URL character length (2048 characters). Searches using a single parameter are not subject to this limit. 
For multiparameter searches (e.g. taxonomy + geography + bins; see the example: Taxonomy + Geography + BIN id), it’s important to logically combine the parameters to ensure accurate and non-empty results. } \examples{