Title: | Input/Output 'SomaScan' Data |
---|---|
Description: | Load and export 'SomaScan' data via the 'Standard BioTools, Inc.' structured text file called an ADAT ('*.adat'). For file format see <https://github.com/SomaLogic/SomaLogic-Data/blob/main/README.md>. The package also exports auxiliary functions for manipulating, wrangling, and extracting relevant information from an ADAT object once in memory. |
Authors: | Stu Field [aut] |
Maintainer: | Caleb Scheidel <[email protected]> |
License: | MIT + file LICENSE |
Version: | 6.2.0.9000 |
Built: | 2025-02-07 14:20:35 UTC |
Source: | https://github.com/somalogic/somadataio |
Retrieve elements of the HEADER
attribute of a soma_adat
object:
getAdatVersion()
determines the the ADAT version
number from a parsed ADAT header.
getSomaScanVersion()
determines the original SomaScan assay version
that generated RFU measurements within a soma_adat
object.
checkSomaScanVersion()
determines if the version of
is a recognized version of SomaScan.
Table of SomaScan assay versions:
Version | Commercial Name | Size |
V4 |
5k | 5284 |
v4.1 |
7k | 7596 |
v5.0 |
11k | 11083 |
getSignalSpace()
determines the current signal space of
the RFU values, which may differ from the original SomaScan
signal space if the data have been lifted. See lift_adat()
and
vignette("lifting-and-bridging", package = "SomaDataIO")
.
getSomaScanLiftCCC()
accesses the lifting Concordance Correlation
Coefficients between various SomaScan versions. For more about
CCC metrics see lift_adat()
.
getAdatVersion(x) getSomaScanVersion(adat) getSignalSpace(adat) checkSomaScanVersion(ver) getSomaScanLiftCCC(matrix = c("plasma", "serum"))
getAdatVersion(x) getSomaScanVersion(adat) getSignalSpace(adat) checkSomaScanVersion(ver) getSomaScanLiftCCC(matrix = c("plasma", "serum"))
x |
Either a |
adat |
A |
ver |
|
matrix |
Character. A string of (usually) either
|
\link[=getAdatVersion]{getAdatVersion()} |
The key-value of the |
\link[=getSomaScanVersion]{getSomaScanVersion()} |
The key-value of the |
\link[=getSignalSpace]{getSignalSpace()} |
The key-value of the |
\link[=checkSomaScanVersion]{checkSomaScanVersion()} |
Returns |
\link[=getSomaScanLiftCCC]{getSomaScanLiftCCC()} |
Returns a tibble of either the
|
Stu Field
Lin, Lawrence I-Kuei. 1989. A Concordance Correlation Coefficient to Evaluate Reproducibility. Biometrics. 45:255-268.
getAdatVersion(example_data) attr(example_data, "Header.Meta")$HEADER$Version <- "99.9" getAdatVersion(example_data) ver <- getSomaScanVersion(example_data) ver rfu_space <- getSignalSpace(example_data) rfu_space is.null(checkSomaScanVersion(ver)) # plasma (default) getSomaScanLiftCCC() # serum getSomaScanLiftCCC("serum")
getAdatVersion(example_data) attr(example_data, "Header.Meta")$HEADER$Version <- "99.9" getAdatVersion(example_data) ver <- getSomaScanVersion(example_data) ver rfu_space <- getSignalSpace(example_data) rfu_space is.null(checkSomaScanVersion(ver)) # plasma (default) getSomaScanLiftCCC() # serum getSomaScanLiftCCC("serum")
Utility to convert a SomaLogic soma_adat
object to an
ExpressionSet
object via the Biobase package
from Bioconductor:
https://www.bioconductor.org/packages/release/bioc/html/Biobase.html.
adat2eSet(adat)
adat2eSet(adat)
adat |
A |
The Biobase package is required and must be installed from Bioconductor via the following at the R console:
if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("Biobase", version = remotes::bioc_version())
A Bioconductor object of class ExpressionSet
.
Stu Field
https://bioconductor.org/install/
Other eSet:
pivotExpressionSet()
eSet <- adat2eSet(example_data) class(eSet) eSet ft <- Biobase::exprs(eSet) head(ft[, 1:10L], 10L)
eSet <- adat2eSet(example_data) class(eSet) eSet ft <- Biobase::exprs(eSet) head(ft[, 1:10L], 10L)
soma_adat
ObjectsAdds a set of attributes, typically "Header.Meta" and "Col.Meta",
to a data.frame
, tibble
, soma_adat
or similar tabular object.
Existing attributes data
are not over-written.
Typically untouched are:
names
class
row.names
addAttributes(data, new.atts)
addAttributes(data, new.atts)
data |
The receiving |
new.atts |
A named |
A data frame object corresponding to data
but with the
attributes of new.atts
grafted on to it.
Existing attribute names are not over-written.
Stu Field
Utility to add (prepend) a class(es) to existing objects.
addClass(x, class)
addClass(x, class)
x |
The object to receive new class(es). |
class |
Character. The name of additional class(es). |
An object with new classes.
Stu Field
class()
, typeof()
, structure()
class(iris) addClass(iris, "new") |> class() addClass(iris, c("A", "B")) |> class() # 2 classes addClass(iris, c("A", "data.frame")) |> class() # no duplicates addClass(iris, c("data.frame", "A")) |> class() # re-orders if exists
class(iris) addClass(iris, "new") |> class() addClass(iris, c("A", "B")) |> class() # 2 classes addClass(iris, c("A", "data.frame")) |> class() # no duplicates addClass(iris, c("data.frame", "A")) |> class() # re-orders if exists
Calculate the estimated limit of detection (eLOD) for SOMAmer reagent analytes in the provided input data. The input data should be filtered to include only buffer samples desired for eLOD calculation.
calc_eLOD(data)
calc_eLOD(data)
data |
A |
eLOD is calculated using the following steps:
For each SOMAmer, the median and adjusted median absolute
deviation () are calculated, where
The 1.4826 is a set constant used to adjust the MAD to be reflective of the standard deviation of the normal distribution.
For each SOMAmer, calculate
Note: The eLOD is useful for non-core matrices, including cell lysate and CSF, but should be used carefully for evaluating background signal in plasma and serum.
A tibble
object with 2 columns: SeqId and eLOD.
Caleb Scheidel, Christopher Dimapasok
# filter data frame using vector of SampleId controls df <- withr::with_seed(101, { data.frame( SampleType = rep(c("Sample", "Buffer"), each = 10), SampleId = paste0("Sample_", 1:20), seq.20.1.100 = runif(20, 1, 100), seq.21.1.100 = runif(20, 1, 100), seq.22.2.100 = runif(20, 1, 100) ) }) sample_ids <- paste0("Sample_", 11:20) selected_samples <- df |> filter(SampleId %in% sample_ids) selected_elod <- calc_eLOD(selected_samples) head(selected_elod) ## Not run: # filter `soma_adat` object to buffer samples buffer_samples <- example_data |> filter(SampleType == "Buffer") # calculate eLOD buffer_elod <- calc_eLOD(buffer_samples) head(buffer_elod) # use eLOD to calculate signal to noise ratio of samples samples_median <- example_data |> dplyr::filter(SampleType == "Sample") |> dplyr::summarise(across(starts_with("seq"), median, .names = "median_{col}")) |> tidyr::pivot_longer(starts_with("median_"), names_to = "SeqId", values_to = "median_signal") |> dplyr::mutate(SeqId = gsub("median_seq", "seq", SeqId)) # analytes with signal to noise > 2 ratios <- samples_median |> dplyr::mutate(signal_to_noise = median_signal / buffer_elod$eLOD) |> dplyr::filter(signal_to_noise > 2) |> dplyr::arrange(desc(signal_to_noise)) head(ratios) ## End(Not run)
# filter data frame using vector of SampleId controls df <- withr::with_seed(101, { data.frame( SampleType = rep(c("Sample", "Buffer"), each = 10), SampleId = paste0("Sample_", 1:20), seq.20.1.100 = runif(20, 1, 100), seq.21.1.100 = runif(20, 1, 100), seq.22.2.100 = runif(20, 1, 100) ) }) sample_ids <- paste0("Sample_", 11:20) selected_samples <- df |> filter(SampleId %in% sample_ids) selected_elod <- calc_eLOD(selected_samples) head(selected_elod) ## Not run: # filter `soma_adat` object to buffer samples buffer_samples <- example_data |> filter(SampleType == "Buffer") # calculate eLOD buffer_elod <- calc_eLOD(buffer_samples) head(buffer_elod) # use eLOD to calculate signal to noise ratio of samples samples_median <- example_data |> dplyr::filter(SampleType == "Sample") |> dplyr::summarise(across(starts_with("seq"), median, .names = "median_{col}")) |> tidyr::pivot_longer(starts_with("median_"), names_to = "SeqId", values_to = "median_signal") |> dplyr::mutate(SeqId = gsub("median_seq", "seq", SeqId)) # analytes with signal to noise > 2 ratios <- samples_median |> dplyr::mutate(signal_to_noise = median_signal / buffer_elod$eLOD) |> dplyr::filter(signal_to_noise > 2) |> dplyr::arrange(desc(signal_to_noise)) head(ratios) ## End(Not run)
Often the names, particularly within soma_adat
objects,
are messy due to varying inputs, this function attempts to remedy this by
removing the following:
trailing/leading/internal whitespace
non-alphanumeric strings (except underscores)
duplicated internal dots (..
), (...
), etc.
SomaScan normalization scale factor format
cleanNames(x)
cleanNames(x)
x |
Character. String to clean up. |
A cleaned up character string.
Stu Field
cleanNames(" sdkfj...sdlkfj.sdfii4994### ") cleanNames("Hyb..Scale")
cleanNames(" sdkfj...sdlkfj.sdfii4994### ") cleanNames("Hyb..Scale")
In a standard SomaLogic ADAT, the section of information that
sits directly above the measurement data (RFU data matrix) is
the column meta data (Col.Meta
), which contains detailed information
and annotations about the analytes, SeqId()
s, and their targets.
See section below for further information about available
fields and their descriptions. Use getAnalyteInfo()
to
obtain an object containing this information for programmatic analyses,
and use getMeta()
to obtain the column names representing the
row-specific meta data about the samples (see section below).
Information describing the analytes is found to the above the data matrix in a standard SomaLogic ADAT. This information may consist of the any or all of the following:
Field | Description | Example |
SeqId | SomaLogic sequence identifier | 2182-54_1 |
SeqidVersion | Version of SOMAmer sequence | 2 |
SomaId | Target identifier, of the form SLnnnnnn (8 characters in length) | SL000318 |
TargetFullName | Target name curated for consistency with UniProt name | Complement C4b |
Target | SomaLogic Target Name | C4b |
UniProt | UniProt identifier(s) | P0C0L4 P0C0L5 |
EntrezGeneID | Entrez Gene Identifier(s) | 720 721 |
EntrezGeneSymbol | Entrez Gene Symbol names | C4A C4B |
Organism | Protein Source Organism | Human |
Units | Relative Fluorescence Units | RFU |
Type | SOMAmer target type | Protein |
Dilution | Dilution mix assignment | 0.01% |
PlateScale_Reference | PlateScale reference value | 1378.85 |
CalReference | Calibration sample reference value | 1378.85 |
medNormRef_ReferenceRFU | Median normalization reference value | 490.342 |
Cal_V4_<YY>_<SSS>_<PPP> |
Calibration scale factor (for given Year_Study_Plate) | 0.64 |
ColCheck | QC acceptance criteria across all plates/sets | PASS |
QcReference_<LLLLL> |
QC sample reference value (for given QC lot) | PASS |
CalQcRatio_V4_<YY>_<SSS>_<PPP> |
Post calibration median QC ratio to reference (for given Year_Study_Plate) | 1.04 |
Information describing the samples is typically found to the left of the data matrix in a standard SomaLogic ADAT. This information may consist of clinical information provided by the client, or run-specific diagnostic information included for assay quality control. Below are some examples of what may be present in this section:
Field | Description | Examples |
PlateId | Plate identifier | V4-18-004_001, V4-18-004_002 |
ScannerID | Scanner used to analyze slide | SG12064173, SG14374437 |
PlatePosition | Location on 96 well plate (A1-H12) | A1, H12 |
SlideId | Agilent slide barcode | 2.58E+11 |
Subarray | Agilent subarray (1 – 8) | 1,8 |
SampleId | 1st form is Subject Identifier, 2nd form (calibrators, buffers) | 2031 |
SampleType | 1st form for clinical samples (Sample), 2nd form as above | Sample, QC, Calibrator, Buffer |
PercentDilution | Highest concentration the SOMAmer dilution groups | 20 |
SampleMatrix | Sample matrix | Plasma-PPT |
Barcode | 1D Barcode of aliquot | S622225 |
Barcode2d | 2D Barcode of aliquot | 1.91E+08 |
SampleNotes | Assay team sample observation | Cloudy, Low sample volume, Reddish |
SampleDescription | Supplemental sample information | Plasma QC 1 |
AssayNotes | Assay team run observation | Beads aspirated, Leak/Hole, Smear |
TimePoint | Sample time point | Baseline |
ExtIdentifier | Primary key for Subarray | EXID40000000032037 |
SsfExtId | Primary key for sample | EID102733 |
SampleGroup | Sample group | A, B |
SiteId | Collection site | SomaLogic, Covance |
TubeUniqueID | Unique tube identifier | 1.12E+11 |
CLI | Cohort definition identifier | CLI6006F001 |
HybControlNormScale | Hybridization control scale factor | 0.948304 |
RowCheck | Normalization acceptance criteria for all row scale factors | PASS, FLAG |
NormScale_0_5 | Median signal normalization scale factor (0.5% mix) | 1.02718 |
NormScale_0_005 | Median signal normalization scale factor (0.005% mix) | 1.119754 |
NormScale_20 | Median signal normalization scale factor (20% mix) | 0.996148 |
# Annotations/Col.Meta tbl <- getAnalyteInfo(example_data) tbl # Row/sample Meta r_m <- getMeta(example_data) head(r_m) # Normalization Scale Factors grep("NormScale", r_m, value = TRUE) # adat subset example_data[1:3, head(r_m)]
# Annotations/Col.Meta tbl <- getAnalyteInfo(example_data) tbl # Row/sample Meta r_m <- getMeta(example_data) head(r_m) # Normalization Scale Factors grep("NormScale", r_m, value = TRUE) # adat subset example_data[1:3, head(r_m)]
Diff tool for the differences between two soma_adat
objects.
When diffs of the table values are interrogated, only
the intersect of the column meta data or feature data is considered
diffAdats(adat1, adat2, tolerance = 1e-06)
diffAdats(adat1, adat2, tolerance = 1e-06)
adat1 , adat2
|
Two |
tolerance |
Numeric |
NULL
, invisibly. Called for side effects.
Only diffs of the column name intersect are reported.
Stu Field
# subset `example_data` for speed # all SeqIds from 2000 -> 2999 seqs <- grep("^seq\\.2[0-9]{3}", names(example_data), value = TRUE) ex_data_small <- head(example_data[, c(getMeta(example_data), seqs)], 10L) dim(ex_data_small) # no diff to itself diffAdats(ex_data_small, ex_data_small) # remove random column rm <- withr::with_seed(123, sample(1:ncol(ex_data_small), 1)) diffAdats(ex_data_small, ex_data_small[, -rm]) # randomly shuffle Subarray diffAdats(ex_data_small, dplyr::mutate(ex_data_small, Subarray = sample(Subarray))) # modify 2 RFUs randomly new <- ex_data_small new[5L, c(rm, rm + 1L)] <- 999 diffAdats(ex_data_small, new)
# subset `example_data` for speed # all SeqIds from 2000 -> 2999 seqs <- grep("^seq\\.2[0-9]{3}", names(example_data), value = TRUE) ex_data_small <- head(example_data[, c(getMeta(example_data), seqs)], 10L) dim(ex_data_small) # no diff to itself diffAdats(ex_data_small, ex_data_small) # remove random column rm <- withr::with_seed(123, sample(1:ncol(ex_data_small), 1)) diffAdats(ex_data_small, ex_data_small[, -rm]) # randomly shuffle Subarray diffAdats(ex_data_small, dplyr::mutate(ex_data_small, Subarray = sample(Subarray))) # modify 2 RFUs randomly new <- ex_data_small new[5L, c(rm, rm + 1L)] <- 999 diffAdats(ex_data_small, new)
Uses the Col.Meta
attribute (analyte annotation data that appears above
the protein measurements in the *.adat
text file) of a soma_adat
object,
adds the AptName
column key, conducts a few sanity checks, and
generates a "lookup table" of analyte data that can be used for simple
manipulation and indexing of analyte annotation information.
Most importantly, the analyte column names of the soma_adat
(e.g. seq.XXXX.XX
) become the AptName
column of the lookup table and
represents the key index between the table and soma_adat
from which it comes.
getAnalyteInfo(adat) getTargetNames(tbl) getFeatureData(adat)
getAnalyteInfo(adat) getTargetNames(tbl) getFeatureData(adat)
adat |
A |
tbl |
A |
A tibble
object with columns corresponding
to the column meta data entries in the soma_adat
. One row per analyte.
getTargetNames()
: creates a lookup table (or dictionary) as a named list object of AptNames
and Target names in key-value pairs.
This is a convenient tool to quickly access a TargetName
given
the AptName
in which the key-value pairs map the seq.XXXX.XX
to its corresponding TargetName
in tbl
.
This structure which provides a convenient auto-completion mechanism at
the command line or for generating plot titles.
getFeatureData()
: . Please now use
getAnalyteInfo()
.
Stu Field
getAnalytes()
, is_intact_attr()
, read_adat()
# Get Aptamer table anno_tbl <- getAnalyteInfo(example_data) anno_tbl # Use `dplyr::group_by()` dplyr::tally(dplyr::group_by(anno_tbl, Dilution)) # print summary by dilution # Columns containing "Target" anno_tbl |> dplyr::select(dplyr::contains("Target")) # Rows of "Target" starting with MMP anno_tbl |> dplyr::filter(grepl("^MMP", Target)) # Target names tg <- getTargetNames(anno_tbl) # how to use for plotting feats <- sample(anno_tbl$AptName, 6) op <- par(mfrow = c(2, 3)) sapply(feats, function(.x) plot(1:10, main = tg[[.x]])) par(op)
# Get Aptamer table anno_tbl <- getAnalyteInfo(example_data) anno_tbl # Use `dplyr::group_by()` dplyr::tally(dplyr::group_by(anno_tbl, Dilution)) # print summary by dilution # Columns containing "Target" anno_tbl |> dplyr::select(dplyr::contains("Target")) # Rows of "Target" starting with MMP anno_tbl |> dplyr::filter(grepl("^MMP", Target)) # Target names tg <- getTargetNames(anno_tbl) # how to use for plotting feats <- sample(anno_tbl$AptName, 6) op <- par(mfrow = c(2, 3)) sapply(feats, function(.x) plot(1:10, main = tg[[.x]])) par(op)
Return the feature names (i.e. the column names for
SOMAmer reagent analytes) from a soma_adat
.
S3 methods also exist for these classes:
#> [1] getAnalytes.character getAnalytes.data.frame getAnalytes.default #> [4] getAnalytes.list getAnalytes.matrix getAnalytes.recipe #> [7] getAnalytes.soma_adat #> see '?methods' for accessing help and source code
getMeta()
returns the inverse, a character vector of string
names of non-analyte feature columns/variables, which typically
correspond to the clinical ("meta") data variables.
S3 methods exist for these classes:
#> [1] getMeta.character getMeta.data.frame getMeta.default getMeta.list #> [5] getMeta.matrix getMeta.soma_adat #> see '?methods' for accessing help and source code
getAnalytes(x, n = FALSE, rm.controls = FALSE) getMeta(x, n = FALSE) getFeatures(x, n = FALSE, rm.controls = FALSE)
getAnalytes(x, n = FALSE, rm.controls = FALSE) getMeta(x, n = FALSE) getFeatures(x, n = FALSE, rm.controls = FALSE)
x |
Typically a |
n |
Logical. Return an integer corresponding to the length of the features? |
rm.controls |
Logical. Should all control and non-human analytes
(e.g. |
getAnalytes()
: a character vector of ADAT feature ("analyte") names.
getMeta()
: a character vector of ADAT clinical ("meta") data names.
For both, if n = TRUE
, an integer corresponding to the
length of the character vector.
getFeatures()
: . Please now use
getAnalytes()
.
Stu Field
# RFU feature variables apts <- getAnalytes(example_data) head(apts) getAnalytes(example_data, n = TRUE) # vector string bb <- getAnalytes(names(example_data)) all.equal(apts, bb) # create some control sequences # ~~~~~~~~~ Spuriomer ~~~ HybControl ~~~ apts2 <- c("seq.2053.2", "seq.2171.12", head(apts)) apts2 no_crtl <- getAnalytes(apts2, rm.controls = TRUE) no_crtl setdiff(apts2, no_crtl) # clinical variables mvec <- getMeta(example_data) head(mvec, 10) getMeta(example_data, n = TRUE) # test 'data.frame' and 'character' S3 methods are identical identical(getMeta(example_data), getMeta(names(example_data))) # TRUE
# RFU feature variables apts <- getAnalytes(example_data) head(apts) getAnalytes(example_data, n = TRUE) # vector string bb <- getAnalytes(names(example_data)) all.equal(apts, bb) # create some control sequences # ~~~~~~~~~ Spuriomer ~~~ HybControl ~~~ apts2 <- c("seq.2053.2", "seq.2171.12", head(apts)) apts2 no_crtl <- getAnalytes(apts2, rm.controls = TRUE) no_crtl setdiff(apts2, no_crtl) # clinical variables mvec <- getMeta(example_data) head(mvec, 10) getMeta(example_data, n = TRUE) # test 'data.frame' and 'character' S3 methods are identical identical(getMeta(example_data), getMeta(names(example_data))) # TRUE
soma_adat
Class ObjectsS3 group generic methods to apply group specific prototype functions
to the RFU data only of soma_adat
objects.
The clinical meta data are not transformed and remain unmodified in
the returned object (Math()
and Ops()
) or are ignored for the
Summary()
group. See groupGeneric()
.
## S3 method for class 'soma_adat' Math(x, ...) antilog(x, base = 10) ## S3 method for class 'soma_adat' Ops(e1, e2 = NULL) ## S3 method for class 'soma_adat' Summary(..., na.rm = FALSE) ## S3 method for class 'soma_adat' e1 == e2
## S3 method for class 'soma_adat' Math(x, ...) antilog(x, base = 10) ## S3 method for class 'soma_adat' Ops(e1, e2 = NULL) ## S3 method for class 'soma_adat' Summary(..., na.rm = FALSE) ## S3 method for class 'soma_adat' e1 == e2
x |
The |
... |
Additional arguments passed to the various group generics as appropriate. |
base |
A positive or complex number: the base with respect to which logarithms are computed. |
e1 , e2
|
Objects. |
na.rm |
Logical. Should missing values be removed? |
A soma_adat
object with the same dimensions of the input
object with the feature columns transformed by the specified generic.
antilog()
: performs the inverse or anti-log transform for a numeric vector of
soma_adat
object. note: default is base = 10
, which differs from
the log()
default base e.
Ops(soma_adat)
: performs binary mathematical operations on class soma_adat
. See Ops()
.
Summary(soma_adat)
: performs summary calculations on class soma_adat
. See Summary()
.
==
: compares left- and right-hand sides of the operator unless the RHS
is also a soma_adat
, in which case diffAdats()
is invoked.
Group members:
#> [1] "abs" "acos" "acosh" "asin" "asinh" "atan" #> [7] "atanh" "ceiling" "cos" "cosh" "cospi" "cummax" #> [13] "cummin" "cumprod" "cumsum" "digamma" "exp" "expm1" #> [19] "floor" "gamma" "lgamma" "log" "log10" "log1p" #> [25] "log2" "sign" "sin" "sinh" "sinpi" "sqrt" #> [31] "tan" "tanh" "tanpi" "trigamma" "trunc"
Commonly used generics of this group include:
log()
, log10()
, log2()
, antilog()
,
abs()
, sign()
, floor()
, sqrt()
, exp()
Group members:
#> [1] "+" "-" "*" "^" "%%" "%/%" "/" "==" ">" "<" "!=" "<=" #> [13] ">="
Note that for the `==`
method if the RHS is also a soma_adat
,
diffAdats()
is invoked which compares LHS vs. RHS.
Commonly used generics of this group include:
+
, -
, *
, /
, ^
, ==
, >
, <
Group members:
#> [1] "all" "any" "max" "min" "prod" "range" "sum"
Commonly used generics of this group include:
max()
, min()
, range()
, sum()
, any()
Stu Field
groupGeneric()
, getGroupMembers()
, getGroup()
# subset `example_data` for speed # all SeqIds from 2000 -> 2999 seqs <- grep("^seq\\.2[0-9]{3}", names(example_data), value = TRUE) ex_data_small <- head(example_data[, c(getMeta(example_data), seqs)], 10L) dim(ex_data_small) ex_data_small$seq.2991.9 # Math Generics: # ------------- # log-transformation a <- log(ex_data_small) a$seq.2991.9 b <- log10(ex_data_small) b$seq.2991.9 isTRUE(all.equal(b, log(ex_data_small, base = 10))) # floor c <- floor(ex_data_small) c$seq.2991.9 # square-root d <- sqrt(ex_data_small) d$seq.2991.9 # rounding e <- round(ex_data_small) e$seq.2991.9 # inverse log antilog(1:4) alog <- antilog(b) all.equal(ex_data_small, alog) # return `b` -> linear space # Ops Generics: # ------------- plus1 <- ex_data_small + 1 times2 <- ex_data_small * 2 sq <- ex_data_small^2 all.equal(sqrt(sq), ex_data_small) gt100k <- ex_data_small > 100000 gt100k ex_data_small == ex_data_small # invokes diffAdats() # Summary Generics: # ------------- sum(ex_data_small) any(ex_data_small < 100) # low RFU analytes sum(ex_data_small < 100) # how many min(ex_data_small) min(ex_data_small, 0) max(ex_data_small) max(ex_data_small, 1e+7) range(ex_data_small)
# subset `example_data` for speed # all SeqIds from 2000 -> 2999 seqs <- grep("^seq\\.2[0-9]{3}", names(example_data), value = TRUE) ex_data_small <- head(example_data[, c(getMeta(example_data), seqs)], 10L) dim(ex_data_small) ex_data_small$seq.2991.9 # Math Generics: # ------------- # log-transformation a <- log(ex_data_small) a$seq.2991.9 b <- log10(ex_data_small) b$seq.2991.9 isTRUE(all.equal(b, log(ex_data_small, base = 10))) # floor c <- floor(ex_data_small) c$seq.2991.9 # square-root d <- sqrt(ex_data_small) d$seq.2991.9 # rounding e <- round(ex_data_small) e$seq.2991.9 # inverse log antilog(1:4) alog <- antilog(b) all.equal(ex_data_small, alog) # return `b` -> linear space # Ops Generics: # ------------- plus1 <- ex_data_small + 1 times2 <- ex_data_small * 2 sq <- ex_data_small^2 all.equal(sqrt(sq), ex_data_small) gt100k <- ex_data_small > 100000 gt100k ex_data_small == ex_data_small # invokes diffAdats() # Summary Generics: # ------------- sum(ex_data_small) any(ex_data_small < 100) # low RFU analytes sum(ex_data_small < 100) # how many min(ex_data_small) min(ex_data_small, 0) max(ex_data_small) max(ex_data_small, 1e+7) range(ex_data_small)
This function runs a series of checks to determine
if a soma_adat
object has a complete
set of attributes. If not, this indicates that the object has
been modified since the initial read_adat()
call.
Checks for the presence of both "Header.Meta" and "Col.Meta" in the
attribute names. These entries are added during the
read_adat()
call. Specifically, within these sections
it also checks for the presence of the following entries:
"HEADER", "COL_DATA", and "ROW_DATA"
"SeqId", "Target", "Units", and "Dilution"
If any of the above they are altered or missing, FALSE
is returned.
is.intact.attributes()
is .
It remains for backward compatibility and may be removed in the future.
You are encouraged to shift your code to
is_intact_attr()
.
is_intact_attr(adat, verbose = interactive()) is.intact.attributes(adat, verbose = interactive())
is_intact_attr(adat, verbose = interactive()) is.intact.attributes(adat, verbose = interactive())
adat |
A |
verbose |
Logical. Should diagnostic information about failures
be printed to the console? If the default, see |
Logical. TRUE
if all checks pass, otherwise FALSE
.
# checking attributes my_adat <- example_data is_intact_attr(my_adat) # TRUE is_intact_attr(my_adat[, -303L]) # doesn't break atts; TRUE attributes(my_adat)$Col.Meta$Target <- NULL # break attributes is_intact_attr(my_adat) # FALSE (Target missing)
# checking attributes my_adat <- example_data is_intact_attr(my_adat) # TRUE is_intact_attr(my_adat[, -303L]) # doesn't break atts; TRUE attributes(my_adat)$Col.Meta$Target <- NULL # break attributes is_intact_attr(my_adat) # FALSE (Target missing)
AptName
FormatTest whether an object is in the new seq.XXXX.XX
format.
is_seqFormat(x)
is_seqFormat(x)
x |
The object to be tested. |
A logical indicating whether x
contains AptNames
consistent
with the new format, beginning with a seq.
prefix.
Stu Field, Eduardo Tabacman
# character S3 method is_seqFormat(names(example_data)) # no; meta data not ^seq. is_seqFormat(tail(names(example_data), -20L)) # yes # soma_adat S3 method is_seqFormat(example_data)
# character S3 method is_seqFormat(names(example_data)) # no; meta data not ^seq. is_seqFormat(tail(names(example_data), -20L)) # yes # soma_adat S3 method is_seqFormat(example_data)
The SomaScan platform continually improves its technical processes between assay versions. The primary change of interest is content expansion, and other protocol changes may be implemented including: changing reagents, liquid handling equipment, and well volumes.
Table of SomaScan assay versions:
Version | Commercial Name | Size |
V4 |
5k | 5284 |
v4.1 |
7k | 7596 |
v5.0 |
11k | 11083 |
However, for a given analyte, these technical upgrades can result
in minute measurement signal differences,
requiring a calibration (aka "lifting" or "bridging") to bring RFUs into a
comparable signal space.
This is accomplished by applying an analyte-specific scalar,
a linear transformation, to each analyte RFU measurement (column).
If you have an annotations file (*.xlsx
) and wish to examine the
bridging scalars themselves, please see read_annotations()
.
Lifting between SomaScan versions no longer requires an
annotations file containing lifting scalars. We now enable users to pass
a bridge
parameter, indicating the direction of the bridge.
For example, to "lift" between 11k
-> 7k
, you must be acting on
SomaScan data in 11k
RFU space and would pass bridge = "11k_to_7k"
.
Likewise, 7k
-> 5k
requires bridge = "7k_to_5k"
.
Lastly, you may also lift directly from 11k
-> 5k
(aka "double-bridge") with bridge = "11k_to_5k"
.
See below for all options for the bridge
argument.
lift_adat( adat, bridge = c("11k_to_7k", "11k_to_5k", "7k_to_11k", "7k_to_5k", "5k_to_11k", "5k_to_7k"), anno.tbl = deprecated() ) is_lifted(adat)
lift_adat( adat, bridge = c("11k_to_7k", "11k_to_5k", "7k_to_11k", "7k_to_5k", "5k_to_11k", "5k_to_7k"), anno.tbl = deprecated() ) is_lifted(adat)
adat |
A |
bridge |
The direction of the lift (i.e. bridge). |
anno.tbl |
Matched samples across assay versions are used to calculate bridging
scalars. For each analyte, this scalar is computed as the ratio of
population medians across assay versions.
Please see the lifting vignette
vignette("lifting-and-bridging", package = "SomaDataIO")
for more details.
lift_adat()
: A "lifted" soma_adat
object corresponding to
the scaling requested in the bridge
parameter. RFU values are
rounded to 1 decimal place to match standard SomaScan delivery format.
is_lifted()
: Logical. Whether the RFU values in a soma_adat
have been lifted from its original signal space to a new signal space.
The Lin's Concordance Correlation Coefficient (CCC) is calculated
by computing the correlation between post-lift RFU values and the
RFU values generated on the original SomaScan version.
This CCC estimate is a measure of how well an analyte can be bridged
across SomaScan versions.
See vignette("lifting-and-bridging", package = "SomaDataIO")
.
As with the lifting scalars, if you have an annotations file
you may view the analyte-specific CCC values via read_annotations()
.
Alternatively, getSomaScanLiftCCC()
retrieves these values
from an internal object for both "serum"
and "plasma"
.
Newer versions of SomaScan typically have additional content, i.e.
new reagents added to the multi-plex assay that bind to additional proteins.
When lifting to a previous SomaScan version, new reagents that do not
exist in the "earlier" assay version assay are scaled by 1.0, and thus
maintained, unmodified in the returned object. Users may need to drop
these columns in order to combine these data with a previous study
from an earlier SomaScan version, e.g. with collapseAdats()
.
In the inverse scenario, lifting "forward" from a previous, lower-plex
version, there will be extra reference values that are unnecessary
to perform the lift, and a warning is triggered. The resulting data
consists of RFU data in the "new" signal space, but with fewer analytes
than would otherwise be expected (e.g. 11k
space with only 5284
analytes; see example below).
Lin, Lawrence I-Kuei. 1989. A Concordance Correlation Coefficient to Evaluate Reproducibility. Biometrics. 45:255-268.
# `example_data` is SomaScan (V4, 5k) adat <- head(example_data, 3L) dim(adat) getSomaScanVersion(adat) getSignalSpace(adat) # perform 'lift' lift_11k <- lift_adat(adat, "5k_to_11k") # warning is_lifted(lift_11k) dim(lift_11k) # attributes updated to reflect the 'lift' attr(lift_11k, "Header")$HEADER$SignalSpace attr(lift_11k, "Header")$HEADER$ProcessSteps
# `example_data` is SomaScan (V4, 5k) adat <- head(example_data, 3L) dim(adat) getSomaScanVersion(adat) getSignalSpace(adat) # perform 'lift' lift_11k <- lift_adat(adat, "5k_to_11k") # warning is_lifted(lift_11k) dim(lift_11k) # attributes updated to reflect the 'lift' attr(lift_11k, "Header")$HEADER$SignalSpace attr(lift_11k, "Header")$HEADER$ProcessSteps
Load a series of ADATs and return a list of soma_adat
objects, one for each ADAT file.
collapseAdats()
concatenates a list of ADATs from loadAdatsAsList()
,
while maintaining the relevant attribute entries (mainly the HEADER
element). This makes writing out the final object possible without the
loss of HEADER
information.
loadAdatsAsList(files, collapse = FALSE, verbose = interactive(), ...) collapseAdats(x)
loadAdatsAsList(files, collapse = FALSE, verbose = interactive(), ...) collapseAdats(x)
files |
A character string of files to load. |
collapse |
Logical. Should the resulting list of ADATs be collapsed into a single ADAT object? |
verbose |
Logical. Should the function call be run in verbose mode. |
... |
Additional arguments passed to |
x |
A list of |
The default behavior is to "vertically bind"
(rbind()
) on the intersect of the column variables, with
unique columns silently dropped.
If "vertically binding" on the column union is
desired, use dplyr::bind_rows()
, however this results in NAs
in
non-intersecting columns. For many files with little variable
intersection, a sparse RFU-matrix will result
(and will likely break ADAT attributes):
adats <- loadAdatsAsList(files) union_adat <- dplyr::bind_rows(adats, .id = "SourceFile")
A list of ADATs named by files
, each a soma_adat
object
corresponding to an individual file in files
. For collapseAdats()
,
a single, collapsed soma_adat
object.
Stu Field
Other IO:
parseHeader()
,
read_adat()
,
soma_adat
,
write_adat()
# only 1 file in directory dir(system.file("extdata", package = "SomaDataIO")) files <- system.file("extdata", package = "SomaDataIO") |> dir(pattern = "[.]adat$", full.names = TRUE) |> rev() adats <- loadAdatsAsList(files) class(adats) # collapse into 1 ADAT collapsed <- collapseAdats(adats) class(collapsed) # Alternatively use `collapse = TRUE` loadAdatsAsList(files, collapse = TRUE)
# only 1 file in directory dir(system.file("extdata", package = "SomaDataIO")) files <- system.file("extdata", package = "SomaDataIO") |> dir(pattern = "[.]adat$", full.names = TRUE) |> rev() adats <- loadAdatsAsList(files) class(adats) # collapse into 1 ADAT collapsed <- collapseAdats(adats) class(collapsed) # Alternatively use `collapse = TRUE` loadAdatsAsList(files, collapse = TRUE)
Occasionally, additional clinical data is obtained after samples
have been submitted to SomaLogic, or even after 'SomaScan'
results have been delivered.
This requires the new clinical variables, i.e. non-proteomic, data to be
merged with 'SomaScan' data into a "new" ADAT prior to analysis.
merge_clin()
easily merges such clinical variables into an
existing soma_adat
object and is a simple wrapper around dplyr::left_join()
.
merge_clin(x, clin_data, by = NULL, by_class = NULL, ...)
merge_clin(x, clin_data, by = NULL, by_class = NULL, ...)
x |
A |
clin_data |
One of 2 options:
|
by |
A character vector of variables to join by.
See |
by_class |
If |
... |
Additional parameters passed to |
This functionality also exists as a command-line tool (R script) contained
in merge_clin.R
that lives in the cli/merge
system file directory.
Please see:
dir(system.file("cli/merge", package = "SomaDataIO"), full.names = TRUE)
vignette("cli-merge-tool", package = "SomaDataIO")
A soma_adat
with new clinical variables merged.
Stu Field
# retrieve clinical data clin_file <- system.file("cli/merge", "meta.csv", package = "SomaDataIO", mustWork = TRUE) clin_file # view clinical data to be merged: # 1) `group` # 2) `newvar` clin_df <- read.csv(clin_file, colClasses = c(SampleId = "character")) clin_df # create mini-adat apts <- withr::with_seed(123, sample(getAnalytes(example_data), 2L)) adat <- head(example_data, 9L) |> # 9 x 2 dplyr::select(SampleId, all_of(apts)) # merge clinical variables merged <- merge_clin(adat, clin_df, by = "SampleId") merged # Alternative syntax: # 1) pass file path # 2) merge on different variable names # 3) convert join type on-the-fly clin_file2 <- system.file("cli/merge", "meta2.csv", package = "SomaDataIO", mustWork = TRUE) id_type <- typeof(adat$SampleId) merged2 <- merge_clin(adat, clin_file2, # file path by = c(SampleId = "ClinKey"), # join on 2 variables by_class = c(ClinKey = id_type)) # match types merged2
# retrieve clinical data clin_file <- system.file("cli/merge", "meta.csv", package = "SomaDataIO", mustWork = TRUE) clin_file # view clinical data to be merged: # 1) `group` # 2) `newvar` clin_df <- read.csv(clin_file, colClasses = c(SampleId = "character")) clin_df # create mini-adat apts <- withr::with_seed(123, sample(getAnalytes(example_data), 2L)) adat <- head(example_data, 9L) |> # 9 x 2 dplyr::select(SampleId, all_of(apts)) # merge clinical variables merged <- merge_clin(adat, clin_df, by = "SampleId") merged # Alternative syntax: # 1) pass file path # 2) merge on different variable names # 3) convert join type on-the-fly clin_file2 <- system.file("cli/merge", "meta2.csv", package = "SomaDataIO", mustWork = TRUE) id_type <- typeof(adat$SampleId) merged2 <- merge_clin(adat, clin_file2, # file path by = c(SampleId = "ClinKey"), # join on 2 variables by_class = c(ClinKey = id_type)) # match types merged2
The parameters below are commonly used throughout the SomaDataIO package.
adat |
A |
x |
A |
matrix |
Character. A string of (usually) either
|
A soma_adat
class object.
Parses the header section of an ADAT file.
parseHeader(file)
parseHeader(file)
file |
Character. The elaborated path and file name of the
|
A list of relevant file information required by read_adat()
in order to complete loading the ADAT file, including:
Header.Meta |
list of notes and other information about the adat |
Col.Meta |
list of vectors that contain the column meta data about individual analytes, includes information about the target name and calibration and QC ratios |
file_specs |
list of values of the file parsing specifications |
row_meta |
character vector of the clinical variables; assay information that is included in the adat output along with the RFU data |
Stu Field
Other IO:
loadAdatsAsList()
,
read_adat()
,
soma_adat
,
write_adat()
f <- system.file("extdata", "example_data10.adat", package = "SomaDataIO", mustWork = TRUE) header <- parseHeader(f) names(header) header$Header.Meta header$file_specs header$row_meta head(as.data.frame(header$Col.Meta))
f <- system.file("extdata", "example_data10.adat", package = "SomaDataIO", mustWork = TRUE) header <- parseHeader(f) names(header) header$Header.Meta header$file_specs header$row_meta head(as.data.frame(header$Col.Meta))
Utility to convert an ExpressionSet
class object
from the "wide" data format to the "long" format via tidyr::pivot_longer()
.
The Biobase package is required for this function.
pivotExpressionSet(eSet) meltExpressionSet(eSet)
pivotExpressionSet(eSet) meltExpressionSet(eSet)
eSet |
An |
A tibble
consisting of the long format
conversion of an ExpressionSet
object.
meltExpressionSet()
: . Please now use
pivotExpressionSet()
.
Stu Field
Other eSet:
adat2eSet()
# subset into a reduced mini-ADAT object # 10 samples (rows) # 5 clinical variables and 3 features (cols) sub_adat <- example_data[1:10, c(1:5, 35:37)] ex_set <- adat2eSet(sub_adat) # convert ExpressionSet object to long format adat_long <- pivotExpressionSet(ex_set)
# subset into a reduced mini-ADAT object # 10 samples (rows) # 5 clinical variables and 3 features (cols) sub_adat <- example_data[1:10, c(1:5, 35:37)] ex_set <- adat2eSet(sub_adat) # convert ExpressionSet object to long format adat_long <- pivotExpressionSet(ex_set)
The parse and load a *.adat
file as a data.frame
-like object into
an R workspace environment. The class of the returned object is
a soma_adat
object.
read.adat()
is .
For backward compatibility it will likely never go away completely,
but you are strongly encouraged to shift your code to use
read_adat()
.
is.soma_adat()
checks whether an object is of class soma_adat
.
See inherits()
.
read_adat(file, debug = FALSE, verbose = getOption("verbose"), ...) read.adat(file, debug = FALSE, verbose = getOption("verbose"), ...) is.soma_adat(x)
read_adat(file, debug = FALSE, verbose = getOption("verbose"), ...) read.adat(file, debug = FALSE, verbose = getOption("verbose"), ...) is.soma_adat(x)
file |
Character. The elaborated path and file name of the |
debug |
Logical. Used for debugging and development of an ADAT that fails to load, particularly out-of-spec, poorly modified, or legacy ADATs. |
verbose |
Logical. Should the function call be run in verbose mode, printing relevant diagnostic call information to the console. |
... |
Additional arguments passed ultimately to
|
x |
An |
A data.frame
-like object of class soma_adat
consisting of SomaLogic RFU (feature) data and clinical meta data as
columns, and samples as rows. Row names are labeled with the unique ID
"SlideId_Subarray" concatenation. The sections of the ADAT header (e.g.,
"Header.Meta", "Col.Meta", ...) are stored as attributes (e.g.
attributes(x)$Header.Meta
).
Logical. Whether x
inherits from class soma_adat
.
Stu Field
Other IO:
loadAdatsAsList()
,
parseHeader()
,
soma_adat
,
write_adat()
# path to *.adat file # replace with your file path adat_path <- system.file("extdata", "example_data10.adat", package = "SomaDataIO", mustWork = TRUE) adat_path my_adat <- read_adat(adat_path) is.soma_adat(my_adat)
# path to *.adat file # replace with your file path adat_path <- system.file("extdata", "example_data10.adat", package = "SomaDataIO", mustWork = TRUE) adat_path my_adat <- read_adat(adat_path) is.soma_adat(my_adat)
Import a SomaLogic Annotations File
read_annotations(file)
read_annotations(file)
file |
A path to an annotations file location.
This is a sanctioned, versioned file provided by
Standard BioTools, Inc. and should be an unmodified
|
A tibble
containing analyte-specific annotations and
related (e.g. lift/bridging) information, keyed on SomaLogic
SeqId, the unique SomaScan analyte identifier.
## Not run: # for example file <- "~/Downloads/SomaScan_11K_Annotated_Content.xlsx" anno_tbl <- read_annotations(file) ## End(Not run)
## Not run: # for example file <- "~/Downloads/SomaScan_11K_Annotated_Content.xlsx" anno_tbl <- read_annotations(file) ## End(Not run)
Easily move row names to a column and vice-versa without the unwanted
side-effects to object class and attributes. Drop-in replacement for
tibble::rownames_to_column()
and tibble::column_to_rownames()
which
can have undesired side-effects to complex object attributes.
Does not import any external packages, modify the environment, or change
the object (other than the desired column). When using col2rn()
, if
explicit row names exist, they are overwritten with a warning. add_rowid()
does not affect row names, which differs from tibble::rowid_to_column()
.
rn2col(data, name = ".rn") col2rn(data, name = ".rn") has_rn(data) rm_rn(data) set_rn(data, value) add_rowid(data, name = ".rowid")
rn2col(data, name = ".rn") col2rn(data, name = ".rn") has_rn(data) rm_rn(data) set_rn(data, value) add_rowid(data, name = ".rowid")
data |
An object that inherits from class |
name |
Character. The name of the column to move. |
value |
Character. The new set of names for the data frame.
If duplicates exist they are modified on-the-fly via |
All functions attempt to return an object of the same class as
the input with fully intact and unmodified attributes (aside from those
required by the desired action). has_rn()
returns a scalar logical.
rn2col()
: moves the row names of data
to an explicit column
whether they are explicit or implicit.
col2rn()
: is the inverse of rn2col()
. If row names exist, they
will be overwritten (with warning).
has_rn()
: returns a boolean indicating whether the data frame
has explicit row names assigned.
rm_rn()
: removes existing row names, leaving only "implicit" row names.
set_rn()
: sets (and overwrites) existing row names for data frames only.
add_rowid()
: adds a sequential integer row identifier; starting at 1:nrow(data)
.
It does not remove existing row names currently, but may in the future
(please code accordingly).
df <- data.frame(a = 1:5, b = rnorm(5), row.names = LETTERS[1:5]) df rn2col(df) # default name is `.rn` rn2col(df, "AptName") # pass `name =` # moving columns df$mtcars <- sample(names(mtcars), 5) col2rn(df, "mtcars") # with a warning # Move back and forth easily # Leaves original object un-modified identical(df, col2rn(rn2col(df))) # add "id" column add_rowid(mtcars) # remove row names has_rn(mtcars) mtcars2 <- rm_rn(mtcars) has_rn(mtcars2)
df <- data.frame(a = 1:5, b = rnorm(5), row.names = LETTERS[1:5]) df rn2col(df) # default name is `.rn` rn2col(df, "AptName") # pass `name =` # moving columns df$mtcars <- sample(names(mtcars), 5) col2rn(df, "mtcars") # with a warning # Move back and forth easily # Leaves original object un-modified identical(df, col2rn(rn2col(df))) # add "id" column add_rowid(mtcars) # remove row names has_rn(mtcars) mtcars2 <- rm_rn(mtcars) has_rn(mtcars2)
The SeqId
is the cornerstone used to uniquely identify
SomaLogic analytes.
SeqIds
follow the format <Pool>-<Clone>_<Version>
, for example
"1234-56_7"
can be represented as:
Pool | Clone | Version |
1234 |
56 |
7
|
See Details below for the definition of each sub-unit.
The <Pool>-<Clone>
combination is sufficient to uniquely identify a
specific analyte and therefore versions are no longer provided (though
they may be present in legacy ADATs).
The tools below enable users to extract, test, identify, compare,
and manipulate SeqIds
across assay runs and/or versions.
getSeqId(x, trim.version = FALSE) regexSeqId() locateSeqId(x, trailing = TRUE) seqid2apt(x) apt2seqid(x) is.apt(x) is.SeqId(x) matchSeqIds(x, y, order.by.x = TRUE) getSeqIdMatches(x, y, show = FALSE)
getSeqId(x, trim.version = FALSE) regexSeqId() locateSeqId(x, trailing = TRUE) seqid2apt(x) apt2seqid(x) is.apt(x) is.SeqId(x) matchSeqIds(x, y, order.by.x = TRUE) getSeqIdMatches(x, y, show = FALSE)
x |
Character. A vector of strings, usually analyte/feature column
names, |
trim.version |
Logical. Whether to remove the version number, i.e. "1234-56_7" -> "1234-56". Primarily for legacy ADATs. |
trailing |
Logical. Should the regular expression explicitly specify
trailing |
y |
Character. A second vector of |
order.by.x |
Logical. Order the returned character string by
the |
show |
Logical. Return the data frame visibly? |
Pool: | ties back to the original well during SELEX |
Clone: | ties to the specific sequence within a pool |
Version: | refers to custom modifications (optional/defunct) |
AptName
a SeqId
combined with a string, usually a GeneId
- or
seq.
-prefix, for convenient, human-readable
manipulation from within R
.
getSeqId()
: a character vector of SeqIds
captured from a string.
regexSeqId()
: a regular expression (regex
) string
pre-defined to match SomaLogic the SeqId
pattern.
locateSeqId()
: a data frame containing the start
and stop
integer positions for SeqId
matches at each value of x
.
seqid2apt()
: a character vector with the seq.*
prefix, i.e.
the inverse of getSeqId()
.
apt2seqid()
: a character vector of SeqIds
. is.SeqId()
will
return TRUE
for all elements.
is.apt()
, is.SeqId()
: Logical. TRUE
or FALSE
.
matchSeqIds()
: a character string corresponding to values
in y
of the intersect of x
and y
. If no matches are
found, character(0)
.
getSeqIdMatches()
: a data frame, where
n
is the
length of the intersect of the matching SeqIds
.
The data frame is named by the passed arguments, x
and y
.
getSeqId()
: extracts/captures the the SeqId
match from an analyte column identifier,
i.e. column name of an ADAT loaded with read_adat()
. Assumes the
SeqId
pattern occurs at the end of the string, which for
the vast majority of cases will be true. For edge cases, see the
trailing
argument to locateSeqId()
.
regexSeqId()
: generates a pre-formatted regular expression for
matching of SeqIds
. Note the trailing match, which is most
commonly required, but locateSeqId()
offers
an alternative to mach anywhere in a string.
Used internally in many utility functions
locateSeqId()
: generates a data frame of the positional SeqId
matches. Specifically
designed to facilitate SeqId
extraction via substr()
.
Similar to stringr::str_locate()
.
seqid2apt()
: converts a SeqId
into anonymous-AptName format, i.e.
1234-56
-> seq.1234.56
. Version numbers (1234-56_ver
)
are always trimmed when present.
apt2seqid()
: converts an anonymous-AptName into SeqId
format, i.e.
seq.1234.56
-> 1234-56
. Version numbers (seq.1234.56.ver
)
are always trimmed when present.
is.apt()
: regular expression match to determine if a string contains
a SeqId
, and thus is probably an AptName
format string. Both
legacy EntrezGeneSymbol-SeqId
combinations or newer
so-called "anonymous-AptNames"
formats (seq.1234.45
) are matched.
is.SeqId()
: tests for SeqId
format, i.e. values returned from getSeqId()
will always return TRUE
.
matchSeqIds()
: matches two character vectors on the basis of their
intersecting SeqIds
. Note that elements in y
not
containing a SeqId
regular expression are silently dropped.
getSeqIdMatches()
: matches two character vectors on the basis of their intersecting SeqIds
only (irrespective of the GeneID
-prefix). This produces a two-column
data frame which then can be used as to map between the two sets.
The final order of the matches/rows is by the input
corresponding to the first argument (x
).
By default the data frame is invisibly returned to
avoid dumping excess output to the console (see the show =
argument.)
Stu Field
x <- c("ABDC.3948.48.2", "3948.88", "3948.48.2", "3948-48_2", "3948.48.2", "3948-48_2", "3948-88", "My.Favorite.Apt.3948.88.9") tibble::tibble(orig = x, SeqId = getSeqId(x), SeqId_trim = getSeqId(x, TRUE), AptName = seqid2apt(SeqId)) # Logical Matching is.apt("AGR2.4959.2") # TRUE is.apt("seq.4959.2") # TRUE is.apt("4959-2") # TRUE is.apt("AGR2") # FALSE # SeqId Matching x <- c("seq.4554.56", "seq.3714.49", "PlateId") y <- c("Group", "3714-49", "Assay", "4554-56") matchSeqIds(x, y) matchSeqIds(x, y, order.by.x = FALSE) # vector of features feats <- getAnalytes(example_data) match_df <- getSeqIdMatches(feats[1:100], feats[90:500]) # 11 overlapping match_df a <- utils::head(feats, 15) b <- withr::with_seed(99, sample(getSeqId(a))) # => SeqId & shuffle (getSeqIdMatches(a, b)) # sorted by first vector "a"
x <- c("ABDC.3948.48.2", "3948.88", "3948.48.2", "3948-48_2", "3948.48.2", "3948-48_2", "3948-88", "My.Favorite.Apt.3948.88.9") tibble::tibble(orig = x, SeqId = getSeqId(x), SeqId_trim = getSeqId(x, TRUE), AptName = seqid2apt(SeqId)) # Logical Matching is.apt("AGR2.4959.2") # TRUE is.apt("seq.4959.2") # TRUE is.apt("4959-2") # TRUE is.apt("AGR2") # FALSE # SeqId Matching x <- c("seq.4554.56", "seq.3714.49", "PlateId") y <- c("Group", "3714-49", "Assay", "4554-56") matchSeqIds(x, y) matchSeqIds(x, y, order.by.x = FALSE) # vector of features feats <- getAnalytes(example_data) match_df <- getSeqIdMatches(feats[1:100], feats[90:500]) # 11 overlapping match_df a <- utils::head(feats, 15) b <- withr::with_seed(99, sample(getSeqId(a))) # => SeqId & shuffle (getSeqIdMatches(a, b)) # sorted by first vector "a"
soma_adat
Class and S3 MethodsThe soma_adat
data structure is the primary internal R
representation
of SomaScan data. A soma_adat
is automatically created via read_adat()
when loading a *.adat
text file. It consists of a data.frame
-like
object with leading columns as clinical variables and SomaScan RFU data
as the remaining variables. Two main attributes corresponding to analyte
and SomaScan run information contained in the *.adat
file are added:
Header.Meta
: information about the SomaScan run, see parseHeader()
or attr(x, "Header.Meta")
Col.Meta
: annotations information about the SomaScan reagents/analytes,
see getAnalyteInfo()
or attr(x, "Col.Meta")
file_specs
: parsing specifications for the ingested *.adat
file
row_meta
: the names of the non-RFU fields. See getMeta()
.
See groupGenerics()
for a details on Math()
, Ops()
, and Summary()
methods that dispatch on class soma_adat
.
See reexports()
for a details on re-exported S3 generics from other
packages (mostly dplyr
and tidyr
) to enable S3 methods to be
dispatched on class soma_adat
.
Below is a list of all currently available S3 methods that dispatch on
the soma_adat
class:
#> [1] [ [[ [[<- [<- #> [5] == $ $<- anti_join #> [9] arrange count filter full_join #> [13] getAdatVersion getAnalytes getMeta group_by #> [17] inner_join is_seqFormat left_join Math #> [21] median merge mutate Ops #> [25] print rename right_join row.names<- #> [29] sample_frac sample_n select semi_join #> [33] separate slice_sample slice summary #> [37] Summary transform ungroup unite #> see '?methods' for accessing help and source code
The S3 print()
method returns summary information parsed from the object
attributes, if present, followed by a dispatch to the tibble::tibble()
print method. Rownames are printed as the first column in the print method
only.
The S3 summary()
method returns the following for each column of the ADAT
object containing SOMAmer data (clinical meta data is excluded):
Target (if available)
Minimum value
1st Quantile
Median
Mean
3rd Quantile
Maximum value
Standard deviation
Median absolute deviation (mad()
)
Interquartile range (IQR()
)
The S3 Extract()
method is used for sub-setting a soma_adat
object and relies heavily on the [
method that maintains the soma_adat
attributes intact and subsets the Col.Meta
so that it is consistent
with the newly created object.
S3 extraction via $
is fully supported, however,
as opposed to the data.frame
method, partial matching
is not allowed for class soma_adat
.
S3 extraction via [[
is supported, however, we restrict
the usage of [[
for soma_adat
. Use only a numeric index (e.g. 1L
)
or a character identifying the column (e.g. "SampleID"
).
Do not use [[i,j]]
syntax with [[
, use [
instead.
As with $
, partial matching is not allowed.
S3 assignment via [
is supported for class soma_adat
.
S3 assignment via $
is fully supported for class soma_adat
.
S3 assignment via [[
is supported for class soma_adat
.
S3 median()
is not currently supported for the soma_adat
class,
however a dispatch is in place to direct users to alternatives.
## S3 method for class 'soma_adat' print(x, show_header = FALSE, ...) ## S3 method for class 'soma_adat' summary(object, tbl = NULL, digits = max(3L, getOption("digits") - 3L), ...) ## S3 method for class 'soma_adat' x[i, j, drop = TRUE, ...] ## S3 method for class 'soma_adat' x$name ## S3 method for class 'soma_adat' x[[i, j, ..., exact = TRUE]] ## S3 replacement method for class 'soma_adat' x[i, j, ...] <- value ## S3 replacement method for class 'soma_adat' x$i, j, ... <- value ## S3 replacement method for class 'soma_adat' x[[i, j, ...]] <- value ## S3 method for class 'soma_adat' median(x, na.rm = FALSE, ...)
## S3 method for class 'soma_adat' print(x, show_header = FALSE, ...) ## S3 method for class 'soma_adat' summary(object, tbl = NULL, digits = max(3L, getOption("digits") - 3L), ...) ## S3 method for class 'soma_adat' x[i, j, drop = TRUE, ...] ## S3 method for class 'soma_adat' x$name ## S3 method for class 'soma_adat' x[[i, j, ..., exact = TRUE]] ## S3 replacement method for class 'soma_adat' x[i, j, ...] <- value ## S3 replacement method for class 'soma_adat' x$i, j, ... <- value ## S3 replacement method for class 'soma_adat' x[[i, j, ...]] <- value ## S3 method for class 'soma_adat' median(x, na.rm = FALSE, ...)
x , object
|
A |
show_header |
Logical. Should all the |
... |
Ignored. |
tbl |
An annotations table. If |
digits |
Integer. Used for number formatting with |
i , j
|
Row and column indices respectively. If |
drop |
Coerce to a vector if fetching one column via |
name |
A name or a string. |
exact |
Ignored with a |
value |
A value to store in a row, column, range or cell. |
na.rm |
a logical value indicating whether |
The set of S3 methods above return the soma_adat
object with
the corresponding S3 method applied.
Other IO:
loadAdatsAsList()
,
parseHeader()
,
read_adat()
,
write_adat()
# S3 print method example_data # show the header info (no RFU data) print(example_data, show_header = TRUE) # S3 summary method # MMP analytes (4) mmps <- c("seq.2579.17", "seq.2788.55", "seq.2789.26", "seq.4925.54") mmp_adat <- example_data[, c("Sex", mmps)] summary(mmp_adat) # Summarize by group mmp_adat |> split(mmp_adat$Sex) |> lapply(summary) # Alternatively pass annotations with Target info anno <- getAnalyteInfo(mmp_adat) summary(mmp_adat, tbl = anno)
# S3 print method example_data # show the header info (no RFU data) print(example_data, show_header = TRUE) # S3 summary method # MMP analytes (4) mmps <- c("seq.2579.17", "seq.2788.55", "seq.2789.26", "seq.4925.54") mmp_adat <- example_data[, c("Sex", mmps)] summary(mmp_adat) # Summarize by group mmp_adat |> split(mmp_adat$Sex) |> lapply(summary) # Alternatively pass annotations with Target info anno <- getAnalyteInfo(mmp_adat) summary(mmp_adat, tbl = anno)
These functions have either been
or
in the current version of SomaDataIO package.
They may eventually be completely removed, so
please re-code your scripts accordingly based on the
suggestions below:
Function | Now Use | |
getSomamers() |
|
getAnalytes() |
getSomamerData() |
|
getAnalyteInfo() |
Some badges you may see in SomaDataIO:
The example_data
object is intended to provide existing and prospective
SomaLogic customers with example data to enable analysis preparation prior
to receipt of SomaScan data, and also for those generally curious about the
SomaScan data deliverable. It is not intended to be used as a control
group for studies or provide any metrics for SomaScan data in general.
a soma_adat
parsed via read_adat()
containing
192 samples (see below for breakdown of sample type). There are 5318
columns containing 5284 analyte features and 34 clinical meta data fields.
These data have been pre-processed via the following steps:
hybridization normalized (all samples)
calibrators and buffers median normalized
plate scaled
calibrated
Adaptive Normalization by Maximum Likelihood (ANML) of QC and clinical samples
Note1: The Age
and Sex
(M
/F
) fields contain simulated values
designed to contain biological signal.
**Note2:** The `SampleType` column contains sample source/type information and usually the `SampleType == Sample` represents the "client" samples. **Note3:** The original source file can be found at \url{https://github.com/SomaLogic/SomaLogic-Data}.
character string of the analyte features contained
in the soma_adat
object, derived from a call to getAnalytes()
.
a lookup table corresponding to a
transposed data frame of the "Col.Meta" attribute of an ADAT, with an
index key field AptName
included in column 1, derived from a call to
getAnalyteInfo()
.
A lookup table mapping SeqId
feature names ->
target names contained in example_data
. This object (or one like it) is
convenient at the console via auto-complete for labeling and/or creating
plot titles on the fly.
The example_data
object contains a SomaScan V4 study from healthy
normal individuals. The RFU measurements themselves and other identifiers
have been altered to protect personally identifiable information (PII),
but also retain underlying biological signal as much as possible.
There are 192 total EDTA-plasma samples across two 96-well plate runs
which are broken down by the following types:
170 clinical samples (client study samples)
10 calibrators (replicate controls for combining data across runs)
6 QC samples (replicate controls used to assess run quality)
6 Buffer samples (no protein controls)
The standard V4 data normalization procedure for EDTA-plasma samples was applied to this dataset. For more details on the data standardization process see the Data Standardization and File Specification Technical Note. General details are outlined above.
https://github.com/SomaLogic/SomaLogic-Data
Standard BioTools, Inc.
# S3 print method example_data # print header info print(example_data, show_header = TRUE) class(example_data) # Features/Analytes head(ex_analytes, 20L) # Feature info table (annotations) ex_anno_tbl # Search via `filter()` dplyr::filter(ex_anno_tbl, grepl("^MMP", Target)) # Lookup table -> targets # MMP-9 ex_target_names$seq.2579.17 # gender hormone FSH tapply(example_data$seq.3032.11, example_data$Sex, median) # gender hormone LH tapply(example_data$seq.2953.31, example_data$Sex, median) # Target lookup ex_target_names$seq.2953.31 # tab-completion at console # Sample Type/Source table(example_data$SampleType) # Sex/Gender Variable table(example_data$Sex) # Age Variable summary(example_data$Age)
# S3 print method example_data # print header info print(example_data, show_header = TRUE) class(example_data) # Features/Analytes head(ex_analytes, 20L) # Feature info table (annotations) ex_anno_tbl # Search via `filter()` dplyr::filter(ex_anno_tbl, grepl("^MMP", Target)) # Lookup table -> targets # MMP-9 ex_target_names$seq.2579.17 # gender hormone FSH tapply(example_data$seq.3032.11, example_data$Sex, median) # gender hormone LH tapply(example_data$seq.2953.31, example_data$Sex, median) # Target lookup ex_target_names$seq.2953.31 # tab-completion at console # Sample Type/Source table(example_data$SampleType) # Sex/Gender Variable table(example_data$Sex) # Age Variable summary(example_data$Age)
soma_adat
Columns/RowsScale the i-th row or column of a soma_adat
object by the i-th
element of a vector. Designed to facilitate linear transformations
of only the analyte/RFU entries by scaling the data matrix.
If scaling the analytes/RFU (columns), v
must have
getAnalytes(adat, n = TRUE)
elements.
If scaling the samples (rows), v
must
have nrow(_data)
elements.
## S3 method for class 'soma_adat' transform(`_data`, v, dim = 2L, ...)
## S3 method for class 'soma_adat' transform(`_data`, v, dim = 2L, ...)
_data |
A |
v |
A numeric vector of the appropriate length corresponding to |
dim |
Integer. The dimension to apply elements of |
... |
Currently not used but required by the S3 generic. |
Performs the following operations (quickly):
Columns:
Rows:
A modified value of _data
with either the rows or columns
linearly transformed by v
.
This method in intentionally naive, and assumes the user has
ordered v
to match the columns/rows of _data
appropriately.
This must be done upstream.
# simplified example of underlying operations M <- matrix(1:12, ncol = 4) M v <- 1:4 M %*% diag(v) # transform columns v <- 1:3 diag(v) %*% M # transform rows # dummy ADAT example: v <- c(2, 0.5) # double seq1; half seq2 adat <- data.frame(sample = paste0("sample_", 1:3), seq.1234.56 = c(1, 2, 3), seq.9999.88 = c(4, 5, 6) * 10) adat # `soma_adat` to invoke S3 method dispatch class(adat) <- c("soma_adat", "data.frame") trans <- transform(adat, v) data.frame(trans)
# simplified example of underlying operations M <- matrix(1:12, ncol = 4) M v <- 1:4 M %*% diag(v) # transform columns v <- 1:3 diag(v) %*% M # transform rows # dummy ADAT example: v <- c(2, 0.5) # double seq1; half seq2 adat <- data.frame(sample = paste0("sample_", 1:3), seq.1234.56 = c(1, 2, 3), seq.9999.88 = c(4, 5, 6) * 10) adat # `soma_adat` to invoke S3 method dispatch class(adat) <- c("soma_adat", "data.frame") trans <- transform(adat, v) data.frame(trans)
One can write an existing modified internal ADAT
(soma_adat
R object) to an external file.
However the ADAT object itself must have intact
attributes, see is_intact_attr()
.
write_adat(x, file)
write_adat(x, file)
x |
A |
file |
Character. File path where the object should be written.
For example, extensions should be |
The ADAT specification no longer requires Windows
end of line (EOL) characters ("\r\n"
).
The current EOL spec is "\n"
which is commonly used in POSIX systems,
like MacOS and Linux.
Since the EOL affects the resulting checksum, ADATs written on
other systems generate slightly differing files.
Standardizing to "\n"
attempts to solve this issue.
For reference, see the EOL encoding for operating systems below:
Symbol | Platform | Character |
LF | Linux | "\n" |
CR | MacOS | "\r" |
CRLF | DOS/Windows | "\r\n"
|
Invisibly returns the input x
.
Stu Field
Other IO:
loadAdatsAsList()
,
parseHeader()
,
read_adat()
,
soma_adat
# trim to 1 sample for speed adat_out <- head(example_data, 1L) # attributes must(!) be intact to write is_intact_attr(adat_out) write_adat(adat_out, file = tempfile(fileext = ".adat"))
# trim to 1 sample for speed adat_out <- head(example_data, 1L) # attributes must(!) be intact to write is_intact_attr(adat_out) write_adat(adat_out, file = tempfile(fileext = ".adat"))