| Title: | Input/Output 'SomaScan' Data |
|---|---|
| Description: | Load and export 'SomaScan' data via the 'SomaLogic Operating Co., Inc.' structured text file called an ADAT ('*.adat'). For file format see <https://github.com/SomaLogic/SomaLogic-Data/blob/main/README.md>. The package also exports auxiliary functions for manipulating, wrangling, and extracting relevant information from an ADAT object once in memory. |
| Authors: | Stu Field [aut] (ORCID: <https://orcid.org/0000-0002-1024-5859>), Caleb Scheidel [cre], SomaLogic Operating Co., Inc. [cph, fnd] |
| Maintainer: | Caleb Scheidel <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 6.6.1 |
| Built: | 2026-05-18 15:31:36 UTC |
| Source: | https://github.com/somalogic/somadataio |
Retrieve elements of the HEADER attribute of a soma_adat object:
getAdatVersion() determines the ADAT version
number from a parsed ADAT header.
getSomaScanVersion() determines the original SomaScan assay version
that generated RFU measurements within a soma_adat object.
checkSomaScanVersion() determines if the supplied version
is a recognized version of SomaScan.
Table of SomaScan assay versions:
| Version | Commercial Name | Size |
|---|---|---|
| V4 | 5k | 5284 |
| v4.1 | 7k | 7596 |
| v5.0 | 11k | 11083 |
getSignalSpace() determines the current signal space of
the RFU values, which may differ from the original SomaScan
signal space if the data have been lifted. See lift_adat() and
vignette("lifting-and-bridging", package = "SomaDataIO").
getSomaScanLiftCCC() accesses the lifting Concordance Correlation
Coefficients between various SomaScan versions. For more about
CCC metrics see lift_adat().
getAdatVersion(x)

getSomaScanVersion(adat)

getSignalSpace(adat)

checkSomaScanVersion(ver)

getSomaScanLiftCCC(matrix = c("plasma", "serum"))
| x | Either a |
| adat | A |
| ver | |
| matrix | Character. A string of (usually) either |
| getAdatVersion() | The key-value of the |
| getSomaScanVersion() | The key-value of the |
| getSignalSpace() | The key-value of the |
| checkSomaScanVersion() | Returns |
| getSomaScanLiftCCC() | Returns a tibble of either the |
Stu Field
Lin, Lawrence I-Kuei. 1989. A Concordance Correlation Coefficient to Evaluate Reproducibility. Biometrics. 45:255-268.
getAdatVersion(example_data)

attr(example_data, "Header.Meta")$HEADER$Version <- "99.9"
getAdatVersion(example_data)

ver <- getSomaScanVersion(example_data)
ver

rfu_space <- getSignalSpace(example_data)
rfu_space

is.null(checkSomaScanVersion(ver))

# plasma (default)
getSomaScanLiftCCC()

# serum
getSomaScanLiftCCC("serum")
Utility to convert a SomaLogic soma_adat object to an
ExpressionSet object via the Biobase package
from Bioconductor:
https://www.bioconductor.org/packages/release/bioc/html/Biobase.html.
adat2eSet(adat)
| adat | A |
The Biobase package is required and must be installed from Bioconductor via the following at the R console:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("Biobase", version = remotes::bioc_version())
A Bioconductor object of class ExpressionSet.
Stu Field
https://bioconductor.org/install/
Other eSet:
pivotExpressionSet()
eSet <- adat2eSet(example_data)
class(eSet)
eSet

ft <- Biobase::exprs(eSet)
head(ft[, 1:10L], 10L)
Adds a set of attributes, typically "Header.Meta" and "Col.Meta",
to a data.frame, tibble, soma_adat or similar tabular object.
Existing attribute data are not overwritten.
Typically untouched are:
names
class
row.names
addAttributes(data, new.atts)
| data | The receiving |
| new.atts | A named |
A data frame object corresponding to data but with the
attributes of new.atts grafted on to it.
Existing attribute names are not over-written.
Stu Field
Utility to add (prepend) a class(es) to existing objects.
addClass(x, class)
| x | The object to receive new class(es). |
| class | Character. The name of additional class(es). |
An object with new classes.
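The prepend-and-deduplicate behaviour can be approximated in base R. This is a minimal sketch, not the package's implementation: `add_class_sketch()` is a hypothetical name, and the use of `union()` is an assumption chosen because it reproduces the "no duplicates" and "re-orders if exists" behaviour noted in the examples.

```r
# Hypothetical stand-in for addClass(); assumes union() semantics.
add_class_sketch <- function(x, class) {
  stopifnot(is.character(class))
  # union() keeps the new classes first and drops any later duplicates
  structure(x, class = union(class, base::class(x)))
}

class(add_class_sketch(iris, "new"))
class(add_class_sketch(iris, c("A", "data.frame")))
class(add_class_sketch(iris, c("data.frame", "A")))
```

Because the new classes come first in the union, an existing class that is named again moves to the position requested by the caller.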
Stu Field
class(), typeof(), structure()
class(iris)

addClass(iris, "new") |> class()

addClass(iris, c("A", "B")) |> class()    # 2 classes

addClass(iris, c("A", "data.frame")) |> class()    # no duplicates

addClass(iris, c("data.frame", "A")) |> class()    # re-orders if exists
Calculate the estimated limit of detection (eLOD) for SOMAmer reagent analytes in the provided input data. The input data should be filtered to include only buffer samples desired for eLOD calculation.
calc_eLOD(data)
| data | A |
eLOD is calculated using the following steps:
For each SOMAmer, the median and adjusted median absolute
deviation ($MAD_{Adjusted}$) are calculated, where
$$MAD_{Adjusted} = 1.4826 \times MAD$$
The 1.4826 is a set constant used to adjust the MAD to be reflective of the standard deviation of the normal distribution.
For each SOMAmer, calculate
$$eLOD = median + 3.3 \times MAD_{Adjusted}$$
Note: The eLOD is useful for non-core matrices, including cell lysate and CSF, but should be used carefully for evaluating background signal in plasma and serum.
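The calculation steps can be sketched in base R. The helper name `elod_sketch()` and the buffer values are made up for illustration, and the 3.3 multiplier applied to the adjusted MAD is an assumption of this sketch rather than something confirmed by the text here.

```r
# Hypothetical helper; the 3.3 multiplier is an assumption of this sketch.
elod_sketch <- function(x, k = 3.3) {
  med <- median(x)
  # adjusted MAD: raw MAD scaled by 1.4826 to approximate a normal SD
  mad_adj <- 1.4826 * median(abs(x - med))
  med + k * mad_adj
}

# made-up buffer RFU values for two analytes
buffer <- data.frame(
  seq.20.1.100 = c(1, 2, 3, 4, 100),
  seq.21.1.100 = c(50, 52, 49, 51, 50)
)

elod <- vapply(buffer, elod_sketch, numeric(1))
data.frame(SeqId = names(elod), eLOD = unname(elod))
```

Using the median and MAD rather than the mean and standard deviation keeps a single extreme buffer reading (the 100 in the first column) from inflating the estimate.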
A tibble object with 2 columns: SeqId and eLOD.
Caleb Scheidel, Christopher Dimapasok
library(dplyr) # provides filter(), across(), starts_with(), and desc() used below

# filter data frame using vector of SampleId controls
df <- withr::with_seed(101, {
  data.frame(
    SampleType = rep(c("Sample", "Buffer"), each = 10),
    SampleId = paste0("Sample_", 1:20),
    seq.20.1.100 = runif(20, 1, 100),
    seq.21.1.100 = runif(20, 1, 100),
    seq.22.2.100 = runif(20, 1, 100)
  )
})
sample_ids <- paste0("Sample_", 11:20)
selected_samples <- df |> filter(SampleId %in% sample_ids)
selected_elod <- calc_eLOD(selected_samples)
head(selected_elod)

## Not run:
# filter `soma_adat` object to buffer samples
buffer_samples <- example_data |> filter(SampleType == "Buffer")

# calculate eLOD
buffer_elod <- calc_eLOD(buffer_samples)
head(buffer_elod)

# use eLOD to calculate signal to noise ratio of samples
samples_median <- example_data |>
  dplyr::filter(SampleType == "Sample") |>
  dplyr::summarise(across(starts_with("seq"), median, .names = "median_{col}")) |>
  tidyr::pivot_longer(starts_with("median_"), names_to = "SeqId", values_to = "median_signal") |>
  dplyr::mutate(SeqId = gsub("median_seq", "seq", SeqId))

# analytes with signal to noise > 2
ratios <- samples_median |>
  dplyr::mutate(signal_to_noise = median_signal / buffer_elod$eLOD) |>
  dplyr::filter(signal_to_noise > 2) |>
  dplyr::arrange(desc(signal_to_noise))
head(ratios)
## End(Not run)
Calculate outlier measurements in an ADAT using median absolute deviation
(statistical) and fold-change criteria. Two values are required for the
calculation: median absolute deviation (MAD) and fold-change (FC). Outliers
are determined based on the result of both 6*MAD and x*FC, where x
is the number of fold changes defined.
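The combined criterion can be illustrated on a single analyte in base R. This is a sketch under stated assumptions, not the package internals: `flag_outliers_sketch()` is a hypothetical name, the MAD is taken to be the scaled `stats::mad()`, and the fold change is measured against the analyte median in either direction.

```r
# Hypothetical helper, not the package internals.
# Assumptions: MAD is the scaled stats::mad(), and fold-change is measured
# against the analyte median in either direction.
flag_outliers_sketch <- function(x, n.mad = 6, fc.crit = 5) {
  med <- median(x)
  stat_out <- abs(x - med) > n.mad * mad(x)               # statistical criterion
  fold_out <- (x / med > fc.crit) | (med / x > fc.crit)   # fold-change criterion
  stat_out & fold_out                                      # both must hold
}

rfu <- c(10, 11, 9, 10, 12, 10, 200)   # made-up RFU values for one analyte
which(flag_outliers_sketch(rfu))
```

Requiring both criteria means a value that is statistically far from the median but within the fold-change limit (for example 30 in place of 200) is not flagged.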
calcOutlierMap(
  data,
  anno_tbl = NULL,
  apt.order = c(NA, "dilution", "signal"),
  sample.order = NULL,
  fc.crit = 5
)

## S3 method for class 'outlier_map'
print(x, ...)
| data | A |
| anno_tbl | An annotations table produced via |
| apt.order | Character. How should the columns/features be ordered? Options include: by dilution mix ("dilution"), by median overall signal ("signal"), or as-is in |
| sample.order | Either a character string indicating the column name with entries to be used to order the data frame rows, or a numeric vector representing the order of the data frame rows. The default ( |
| fc.crit | Integer. The fold change criterion to evaluate. Defaults to 5x. |
| x | An object of class |
| ... | Arguments for S3 print methods. |
For the S3 plotting method, see plot.Map().
A list of class c("outlier_map", "Map") containing:
| matrix | A boolean matrix of |
| x.lab | A character string containing the plot x-axis label. |
| title | A character string containing the plot title. |
| rows.by.freq | A logical indicating if the samples are ordered by outlier frequency. |
| class.tab | A table containing the frequencies of each class if input |
| sample.order | A numeric vector representing the order of the data frame rows. |
| legend.sub | A character string containing the plot legend subtitle. |
print(outlier_map): There is an S3 print method for class "outlier_map".
Stu Field
Other Calc Map:
getOutlierIds(),
plot.Map()
dat <- example_data |> dplyr::filter(SampleType == "Sample")
om <- calcOutlierMap(dat)
class(om)

# S3 print method
om

# `sample.order = "frequency"` orders samples by outlier frequency
om <- calcOutlierMap(dat, sample.order = "frequency")
om$rows.by.freq
om$sample.order

# order samples field in Adat
om <- calcOutlierMap(dat, sample.order = "Sex")
om$sample.order
Names, particularly within soma_adat objects, are often messy
due to varying inputs. This function attempts to remedy this by
removing the following:
trailing/leading/internal whitespace
non-alphanumeric strings (except underscores)
duplicated internal dots (..), (...), etc.
SomaScan normalization scale factor format
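The first three clean-up rules can be approximated with a few regular expressions. This is a rough sketch only: `clean_names_sketch()` is a hypothetical name, it assumes that anything other than letters, digits, and underscores is turned into a single dot, and it does not attempt the scale factor renaming.

```r
# Hypothetical approximation; it ignores the scale factor renaming.
clean_names_sketch <- function(x) {
  x <- trimws(x)                       # leading/trailing whitespace
  x <- gsub("[^A-Za-z0-9_]+", ".", x)  # runs of other characters -> one dot
  x <- gsub("\\.+", ".", x)            # collapse any remaining repeated dots
  gsub("^\\.|\\.$", "", x)             # drop a leading or trailing dot
}

clean_names_sketch(" sdkfj...sdlkfj.sdfii4994### ")
clean_names_sketch("Sample  Id")
```

Applying the substitutions in this order matters: trailing symbols such as `###` first become a dot, which the final step then removes.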
cleanNames(x)
| x | Character. String to clean up. |
A cleaned up character string.
Stu Field
cleanNames("    sdkfj...sdlkfj.sdfii4994###   ")

cleanNames("Hyb..Scale")
In a standard SomaLogic ADAT, the section of information that
sits directly above the measurement data (RFU data matrix) is
the column meta data (Col.Meta), which contains detailed information
and annotations about the analytes, SeqId()s, and their targets.
See section below for further information about available
fields and their descriptions. Use getAnalyteInfo() to
obtain an object containing this information for programmatic analyses,
and use getMeta() to obtain the column names representing the
row-specific meta data about the samples (see section below).
Information describing the analytes is found above the data matrix in a standard SomaLogic ADAT. This information may consist of any or all of the following:
| Field | Description | Example |
| SeqId | SomaLogic sequence identifier | 2182-54_1 |
| SeqidVersion | Version of SOMAmer sequence | 2 |
| SomaId | Target identifier, of the form SLnnnnnn (8 characters in length) | SL000318 |
| TargetFullName | Target name curated for consistency with UniProt name | Complement C4b |
| Target | SomaLogic Target Name | C4b |
| UniProt | UniProt identifier(s) | P0C0L4 P0C0L5 |
| EntrezGeneID | Entrez Gene Identifier(s) | 720 721 |
| EntrezGeneSymbol | Entrez Gene Symbol names | C4A C4B |
| Organism | Protein Source Organism | Human |
| Units | Relative Fluorescence Units | RFU |
| Type | SOMAmer target type | Protein |
| Dilution | Dilution mix assignment | 0.01% |
| PlateScale_Reference | PlateScale reference value | 1378.85 |
| CalReference | Calibration sample reference value | 1378.85 |
| medNormRef_ReferenceRFU | Median normalization reference value | 490.342 |
| Cal_V4_<YY>_<SSS>_<PPP> | Calibration scale factor (for given Year_Study_Plate) | 0.64 |
| ColCheck | QC acceptance criteria across all plates/sets | PASS |
| QcReference_<LLLLL> | QC sample reference value (for given QC lot) | PASS |
| CalQcRatio_V4_<YY>_<SSS>_<PPP> | Post calibration median QC ratio to reference (for given Year_Study_Plate) | 1.04 |
Information describing the samples is typically found to the left of the data matrix in a standard SomaLogic ADAT. This information may consist of clinical information provided by the client, or run-specific diagnostic information included for assay quality control. Below are some examples of what may be present in this section:
| Field | Description | Examples |
| PlateId | Plate identifier | V4-18-004_001, V4-18-004_002 |
| ScannerID | Scanner used to analyze slide | SG12064173, SG14374437 |
| PlatePosition | Location on 96 well plate (A1-H12) | A1, H12 |
| SlideId | Agilent slide barcode | 2.58E+11 |
| Subarray | Agilent subarray (1 – 8) | 1,8 |
| SampleId | 1st form is Subject Identifier, 2nd form (calibrators, buffers) | 2031 |
| SampleType | 1st form for clinical samples (Sample), 2nd form as above | Sample, QC, Calibrator, Buffer |
| PercentDilution | Highest concentration the SOMAmer dilution groups | 20 |
| SampleMatrix | Sample matrix | Plasma-PPT |
| Barcode | 1D Barcode of aliquot | S622225 |
| Barcode2d | 2D Barcode of aliquot | 1.91E+08 |
| SampleNotes | Assay team sample observation | Cloudy, Low sample volume, Reddish |
| SampleDescription | Supplemental sample information | Plasma QC 1 |
| AssayNotes | Assay team run observation | Beads aspirated, Leak/Hole, Smear |
| TimePoint | Sample time point | Baseline |
| ExtIdentifier | Primary key for Subarray | EXID40000000032037 |
| SsfExtId | Primary key for sample | EID102733 |
| SampleGroup | Sample group | A, B |
| SiteId | Collection site | SomaLogic, Covance |
| TubeUniqueID | Unique tube identifier | 1.12E+11 |
| CLI | Cohort definition identifier | CLI6006F001 |
| HybControlNormScale | Hybridization control scale factor | 0.948304 |
| RowCheck | Normalization acceptance criteria for all row scale factors | PASS, FLAG |
| NormScale_0_5 | Median signal normalization scale factor (0.5% mix) | 1.02718 |
| NormScale_0_005 | Median signal normalization scale factor (0.005% mix) | 1.119754 |
| NormScale_20 | Median signal normalization scale factor (20% mix) | 0.996148 |
# Annotations/Col.Meta
tbl <- getAnalyteInfo(example_data)
tbl

# Row/sample Meta
r_m <- getMeta(example_data)
head(r_m)

# Normalization Scale Factors
grep("NormScale", r_m, value = TRUE)

# adat subset
example_data[1:3, head(r_m)]
Diff tool that reports the differences between two soma_adat objects.
When diffs of the table values are interrogated, only
the intersect of the column meta data or feature data is considered.
diffAdats(adat1, adat2, tolerance = 1e-06)
| adat1, adat2 | Two |
| tolerance | Numeric |
NULL, invisibly. Called for side effects.
Only diffs of the column name intersect are reported.
Stu Field
# subset `example_data` for speed
# all SeqIds from 2000 -> 2999
seqs <- grep("^seq\\.2[0-9]{3}", names(example_data), value = TRUE)
ex_data_small <- head(example_data[, c(getMeta(example_data), seqs)], 10L)
dim(ex_data_small)

# no diff to itself
diffAdats(ex_data_small, ex_data_small)

# remove random column
rm <- withr::with_seed(123, sample(1:ncol(ex_data_small), 1))
diffAdats(ex_data_small, ex_data_small[, -rm])

# randomly shuffle Subarray
diffAdats(ex_data_small, dplyr::mutate(ex_data_small, Subarray = sample(Subarray)))

# modify 2 RFUs randomly
new <- ex_data_small
new[5L, c(rm, rm + 1L)] <- 999
diffAdats(ex_data_small, new)
Uses the Col.Meta attribute (analyte annotation data that appears above
the protein measurements in the *.adat text file) of a soma_adat object,
adds the AptName column key, conducts a few sanity checks, and
generates a "lookup table" of analyte data that can be used for simple
manipulation and indexing of analyte annotation information.
Most importantly, the analyte column names of the soma_adat
(e.g. seq.XXXX.XX) become the AptName column of the lookup table and
represent the key index between the table and the soma_adat from which it comes.
getAnalyteInfo(adat)

getTargetNames(tbl)

getFeatureData(adat)
| adat | A |
| tbl | A |
A tibble object with columns corresponding
to the column meta data entries in the soma_adat. One row per analyte.
getTargetNames(): creates a lookup table (or dictionary) as a named list object of AptNames
and Target names in key-value pairs.
This is a convenient tool to quickly access a TargetName given
the AptName in which the key-value pairs map the seq.XXXX.XX
to its corresponding TargetName in tbl.
This structure provides a convenient auto-completion mechanism at
the command line or for generating plot titles.
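The key-value structure described for getTargetNames() can be mimicked with a plain named list. This is a minimal sketch: the two-row table and its values are made up, and `tg_sketch` is a hypothetical object rather than the value the package returns.

```r
# made-up two-row annotation table; values are for illustration only
tbl <- data.frame(
  AptName = c("seq.1234.56", "seq.7890.12"),
  Target  = c("Target A", "Target B"),
  stringsAsFactors = FALSE
)

# named list: AptName is the key, Target is the value
tg_sketch <- stats::setNames(as.list(tbl$Target), tbl$AptName)

tg_sketch[["seq.1234.56"]]   # programmatic lookup, e.g. for plot titles
tg_sketch$seq.7890.12        # `$` access is what enables auto-completion
```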
getFeatureData(): Please now use
getAnalyteInfo().
Stu Field
getAnalytes(), is_intact_attr(), read_adat()
# Get Aptamer table
anno_tbl <- getAnalyteInfo(example_data)
anno_tbl

# Use `dplyr::group_by()`
dplyr::tally(dplyr::group_by(anno_tbl, Dilution))  # print summary by dilution

# Columns containing "Target"
anno_tbl |>
  dplyr::select(dplyr::contains("Target"))

# Rows of "Target" starting with MMP
anno_tbl |>
  dplyr::filter(grepl("^MMP", Target))

# Target names
tg <- getTargetNames(anno_tbl)

# how to use for plotting
feats <- sample(anno_tbl$AptName, 6)
op <- par(mfrow = c(2, 3))
sapply(feats, function(.x) plot(1:10, main = tg[[.x]]))
par(op)
Return the feature names (i.e. the column names for
SOMAmer reagent analytes) from a soma_adat.
S3 methods also exist for these classes:
#> [1] getAnalytes.character  getAnalytes.data.frame getAnalytes.default
#> [4] getAnalytes.list       getAnalytes.matrix     getAnalytes.recipe
#> [7] getAnalytes.soma_adat
#> see '?methods' for accessing help and source code
getMeta() returns the inverse, a character vector of string
names of non-analyte feature columns/variables, which typically
correspond to the clinical ("meta") data variables.
S3 methods exist for these classes:
#> [1] getMeta.character  getMeta.data.frame getMeta.default    getMeta.list
#> [5] getMeta.matrix     getMeta.soma_adat
#> see '?methods' for accessing help and source code
getAnalytes(x, n = FALSE, rm.controls = FALSE)

getMeta(x, n = FALSE)

getFeatures(x, n = FALSE, rm.controls = FALSE)
| x | Typically a |
| n | Logical. Return an integer corresponding to the length of the features? |
| rm.controls | Logical. Should all control and non-human analytes (e.g. |
getAnalytes(): a character vector of ADAT feature ("analyte") names.
getMeta(): a character vector of ADAT clinical ("meta") data names.
For both, if n = TRUE, an integer corresponding to the
length of the character vector.
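The analyte/meta split can be illustrated on a plain character vector of column names. This is a sketch, assuming analyte columns are recognised by their `seq.` prefix; the column names and the `*_sketch` objects are made up for illustration.

```r
# Assumption: analyte columns are recognised by their "seq." prefix
cols <- c("PlateId", "SampleId", "SampleType", "seq.1234.56", "seq.7890.12")

is_analyte <- grepl("^seq\\.", cols)
analytes_sketch <- cols[is_analyte]    # feature ("analyte") names
meta_sketch     <- cols[!is_analyte]   # the inverse: clinical ("meta") names

analytes_sketch
meta_sketch
length(analytes_sketch)   # analogous to `n = TRUE`
```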
getFeatures(): Please now use
getAnalytes().
Stu Field
# RFU feature variables
apts <- getAnalytes(example_data)
head(apts)
getAnalytes(example_data, n = TRUE)

# vector string
bb <- getAnalytes(names(example_data))
all.equal(apts, bb)

# create some control sequences
# ~~~~~~~~~ Spuriomer ~~~ HybControl ~~~
apts2 <- c("seq.2053.2", "seq.2171.12", head(apts))
apts2
no_crtl <- getAnalytes(apts2, rm.controls = TRUE)
no_crtl
setdiff(apts2, no_crtl)

# clinical variables
mvec <- getMeta(example_data)
head(mvec, 10)
getMeta(example_data, n = TRUE)

# test 'data.frame' and 'character' S3 methods are identical
identical(getMeta(example_data), getMeta(names(example_data))) # TRUE
Return the IDs of flagged samples for objects of the outlier_map class.
Samples are flagged based on the percentage of analytes (RFU columns) for a given
sample that were identified as outliers using the median absolute
deviation (MAD).
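The flagging rule can be sketched on a small logical outlier matrix. The matrix is made up, and treating `flags` as an inclusive threshold on the per-sample proportion of outlier analytes is an assumption of this sketch.

```r
# made-up logical outlier matrix: 3 samples (rows) x 4 analytes (columns)
om_matrix <- matrix(
  c(FALSE, FALSE, FALSE, FALSE,
    TRUE,  TRUE,  TRUE,  FALSE,
    FALSE, TRUE,  FALSE, FALSE),
  nrow = 3, byrow = TRUE
)

flags <- 0.5                      # flag samples with >= 50% outlier analytes
prop_out <- rowMeans(om_matrix)   # proportion of analytes flagged per sample
data.frame(idx = which(prop_out >= flags))
```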
getOutlierIds(x, flags = 0.05, data = NULL, include = NULL)
| x | An object of class: |
| flags | Numeric in |
| data | Optional. The data originally used to create the map |
| include | Optional. Character vector of column name(s) in |
A data.frame of the indices (idx) of flagged samples, along
with any additional variables as specified by include.
Caleb Scheidel
Other Calc Map:
calcOutlierMap(),
plot.Map()
# flagged outliers
# create a single sample outlier (12)
out_adat <- example_data
apts <- getAnalytes(out_adat)
out_adat[12, apts] <- out_adat[12, apts] * 10

om <- calcOutlierMap(out_adat)
getOutlierIds(om, out_adat, flags = 0.05, include = c("Sex", "Subarray"))
S3 group generic methods to apply group specific prototype functions
to the RFU data only of soma_adat objects.
The clinical meta data are not transformed and remain unmodified in
the returned object (Math() and Ops()) or are ignored for the
Summary() group. See groupGeneric().
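How a `Math()` group method can transform only the RFU columns is sketched here on a toy class. Everything in this block is illustrative: `toy_adat` is a hypothetical class, RFU columns are assumed to be those with a `seq.` prefix, and the method is not the package's implementation.

```r
# toy object: one meta column and one made-up RFU column
toy <- structure(
  data.frame(SampleId = c("a", "b"), seq.1234.56 = c(100, 1000),
             stringsAsFactors = FALSE),
  class = c("toy_adat", "data.frame")
)

# Group method: called for log(), log10(), sqrt(), floor(), etc.
Math.toy_adat <- function(x, ...) {
  rfu <- grepl("^seq\\.", names(x))
  # .Generic holds the name of the function that was called, e.g. "log10"
  f <- get(.Generic, mode = "function")
  for (nm in names(x)[rfu]) {
    x[[nm]] <- f(x[[nm]], ...)   # transform RFU columns only
  }
  x                              # meta columns are returned untouched
}

res <- log10(toy)
res$seq.1234.56
res$SampleId
```

A single group method covers every member of the Math group, which is why one definition is enough for `log()`, `sqrt()`, `floor()`, and the rest.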
## S3 method for class 'soma_adat'
Math(x, ...)

antilog(x, base = 10)

## S3 method for class 'soma_adat'
Ops(e1, e2 = NULL)

## S3 method for class 'soma_adat'
Summary(..., na.rm = FALSE)

## S3 method for class 'soma_adat'
e1 == e2
| x | The |
| ... | Additional arguments passed to the various group generics as appropriate. |
| base | A positive or complex number: the base with respect to which logarithms are computed. |
| e1, e2 | Objects. |
| na.rm | Logical. Should missing values be removed? |
A soma_adat object with the same dimensions of the input
object with the feature columns transformed by the specified generic.
antilog(): performs the inverse or anti-log transform for a numeric vector or
soma_adat object. Note: the default is base = 10, which differs from
the log() default base e.
Ops(soma_adat): performs binary mathematical operations on class soma_adat. See Ops().
Summary(soma_adat): performs summary calculations on class soma_adat. See Summary().
== : compares left- and right-hand sides of the operator unless the RHS
is also a soma_adat, in which case diffAdats() is invoked.
Group members:
#>  [1] "abs"      "acos"     "acosh"    "asin"     "asinh"    "atan"
#>  [7] "atanh"    "ceiling"  "cos"      "cosh"     "cospi"    "cummax"
#> [13] "cummin"   "cumprod"  "cumsum"   "digamma"  "exp"      "expm1"
#> [19] "floor"    "gamma"    "lgamma"   "log"      "log10"    "log1p"
#> [25] "log2"     "sign"     "sin"      "sinh"     "sinpi"    "sqrt"
#> [31] "tan"      "tanh"     "tanpi"    "trigamma" "trunc"
Commonly used generics of this group include:
log(), log10(), log2(), antilog(),
abs(), sign(), floor(), sqrt(), exp()
Group members:
#>  [1] "+"   "-"   "*"   "^"   "%%"  "%/%" "/"   "=="  ">"   "<"   "!="  "<="
#> [13] ">="
Note that for the `==` method if the RHS is also a soma_adat,
diffAdats() is invoked which compares LHS vs. RHS.
Commonly used generics of this group include:
+, -, *, /, ^, ==, >, <
Group members:
#> [1] "all" "any" "max" "min" "prod" "range" "sum"
Commonly used generics of this group include:
max(), min(), range(), sum(), any()
Stu Field
groupGeneric(), getGroupMembers(), getGroup()
# subset `example_data` for speed
# all SeqIds from 2000 -> 2999
seqs <- grep("^seq\\.2[0-9]{3}", names(example_data), value = TRUE)
ex_data_small <- head(example_data[, c(getMeta(example_data), seqs)], 10L)
dim(ex_data_small)

ex_data_small$seq.2991.9

# Math Generics:
# -------------
# log-transformation
a <- log(ex_data_small)
a$seq.2991.9

b <- log10(ex_data_small)
b$seq.2991.9

isTRUE(all.equal(b, log(ex_data_small, base = 10)))

# floor
c <- floor(ex_data_small)
c$seq.2991.9

# square-root
d <- sqrt(ex_data_small)
d$seq.2991.9

# rounding
e <- round(ex_data_small)
e$seq.2991.9

# inverse log
antilog(1:4)

alog <- antilog(b)
all.equal(ex_data_small, alog)    # return `b` -> linear space

# Ops Generics:
# -------------
plus1 <- ex_data_small + 1
times2 <- ex_data_small * 2

sq <- ex_data_small^2
all.equal(sqrt(sq), ex_data_small)

gt100k <- ex_data_small > 100000
gt100k

ex_data_small == ex_data_small   # invokes diffAdats()

# Summary Generics:
# -------------
sum(ex_data_small)

any(ex_data_small < 100)  # low RFU analytes

sum(ex_data_small < 100)  # how many

min(ex_data_small)

min(ex_data_small, 0)

max(ex_data_small)

max(ex_data_small, 1e+7)

range(ex_data_small)
This function runs a series of checks to determine
if a soma_adat object has a complete
set of attributes. If not, this indicates that the object has
been modified since the initial read_adat() call.
Checks for the presence of both "Header.Meta" and "Col.Meta" in the
attribute names. These entries are added during the
read_adat() call. Specifically, within these sections
it also checks for the presence of the following entries:
"HEADER", "COL_DATA", and "ROW_DATA"
"SeqId", "Target", "Units", and "Dilution"
If any of the above are altered or missing, FALSE is returned.
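The checks listed above can be mirrored on a toy object in base R. This is a sketch only: `is_intact_sketch()` is a hypothetical name, the attribute contents are made up, and it tests for the presence of the named entries without any of the further validation the package may perform.

```r
# Hypothetical check on a toy object; attribute contents are made up
toy <- structure(
  data.frame(seq.1234.56 = c(100, 200)),
  Header.Meta = list(HEADER = list(), COL_DATA = list(), ROW_DATA = list()),
  Col.Meta = data.frame(SeqId = "1234-56", Target = "Target A",
                        Units = "RFU", Dilution = "20",
                        stringsAsFactors = FALSE)
)

is_intact_sketch <- function(x) {
  atts <- attributes(x)
  all(c("Header.Meta", "Col.Meta") %in% names(atts)) &&
    all(c("HEADER", "COL_DATA", "ROW_DATA") %in% names(atts$Header.Meta)) &&
    all(c("SeqId", "Target", "Units", "Dilution") %in% names(atts$Col.Meta))
}

before <- is_intact_sketch(toy)
attr(toy, "Col.Meta")$Target <- NULL   # break the attributes
after <- is_intact_sketch(toy)
c(before = before, after = after)
```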
is.intact.attributes() remains for backward compatibility and may be removed in the future.
You are encouraged to shift your code to
is_intact_attr().
is_intact_attr(adat, verbose = interactive())

is.intact.attributes(adat, verbose = interactive())
| adat | A |
| verbose | Logical. Should diagnostic information about failures be printed to the console? If the default, see |
Logical. TRUE if all checks pass, otherwise FALSE.
# checking attributes
my_adat <- example_data
is_intact_attr(my_adat)           # TRUE
is_intact_attr(my_adat[, -303L])  # doesn't break atts; TRUE
attributes(my_adat)$Col.Meta$Target <- NULL  # break attributes
is_intact_attr(my_adat)           # FALSE (Target missing)
Test whether an object is in the new seq.XXXX.XX format.
is_seqFormat(x)
| x | The object to be tested. |
A logical indicating whether x contains AptNames consistent
with the new format, beginning with a seq. prefix.
Stu Field, Eduardo Tabacman
# character S3 method
is_seqFormat(names(example_data))   # no; meta data not ^seq.
is_seqFormat(tail(names(example_data), -20L))   # yes

# soma_adat S3 method
is_seqFormat(example_data)
The SomaScan platform continually improves its technical processes between assay versions. The primary change of interest is content expansion, and other protocol changes may be implemented including: changing reagents, liquid handling equipment, and well volumes.
Table of SomaScan assay versions:
| Version | Commercial Name | Size |
|---|---|---|
| V4 | 5k | 5284 |
| v4.1 | 7k | 7596 |
| v5.0 | 11k | 11083 |
However, for a given analyte, these technical upgrades can result
in minute measurement signal differences,
requiring a calibration (aka "lifting" or "bridging") to bring RFUs into a
comparable signal space.
This is accomplished by applying an analyte-specific scalar,
a linear transformation, to each analyte RFU measurement (column).
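Applying an analyte-specific scalar to each RFU column can be sketched in base R. The RFU values and scalars are made up, and the rounding to 1 decimal place simply mirrors the documented return value of lift_adat(); this is not the package's implementation.

```r
# made-up RFU columns and made-up scalars, for illustration only
rfu <- data.frame(seq.1234.56 = c(100, 200), seq.7890.12 = c(50, 80))
scalars <- c(seq.1234.56 = 1.1, seq.7890.12 = 0.9)

lifted <- rfu
for (apt in names(scalars)) {
  # one scalar per analyte column, rounded to 1 decimal place
  lifted[[apt]] <- round(rfu[[apt]] * scalars[[apt]], 1)
}
lifted
```

Matching scalars to columns by name, rather than by position, guards against applying a scalar to the wrong analyte when the column order differs.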
If you have an annotations file (*.xlsx) and wish to examine the
bridging scalars themselves, please see read_annotations().
Lifting between SomaScan versions no longer requires an
annotations file containing lifting scalars. We now enable users to pass
a bridge parameter, indicating the direction of the bridge.
For example, to "lift" between 11k -> 7k, you must be acting on
SomaScan data in 11k RFU space and would pass bridge = "11k_to_7k".
Likewise, 7k -> 5k requires bridge = "7k_to_5k".
Lastly, you may also lift directly from 11k -> 5k
(aka "double-bridge") with bridge = "11k_to_5k".
See below for all options for the bridge argument.
    lift_adat(
      adat,
      bridge = c("11k_to_7k", "11k_to_5k", "7k_to_11k", "7k_to_5k", "5k_to_11k", "5k_to_7k"),
      anno.tbl = deprecated()
    )

    is_lifted(adat)
adat |
A |
bridge |
The direction of the lift (i.e. bridge). |
anno.tbl |
Matched samples across assay versions are used to calculate bridging
scalars. For each analyte, this scalar is computed as the ratio of
population medians across assay versions.
Please see the lifting vignette
vignette("lifting-and-bridging", package = "SomaDataIO")
for more details.
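The calculation described above can be sketched in base R on made-up data. All values and object names below are hypothetical, and the direction of the ratio (target version over source version) is an assumption made for illustration; see the vignette for the actual procedure.

```r
# Toy matched samples: the same 4 samples measured on two assay versions
# (made-up RFU values; columns are analytes)
rfu_from <- data.frame(seq.1234.56 = c(100, 120, 110, 130),
                       seq.3948.48 = c(2000, 2200, 2100, 1900))
rfu_to   <- data.frame(seq.1234.56 = c(110, 132, 121, 143),
                       seq.3948.48 = c(1000, 1100, 1050, 950))

# Assumed direction: scalar = median(target version) / median(source version)
scalars <- vapply(names(rfu_from), function(a) {
  median(rfu_to[[a]]) / median(rfu_from[[a]])
}, numeric(1))

# Apply each analyte-specific scalar to its column, then round to 1 decimal
lifted <- rfu_from
for (a in names(scalars)) {
  lifted[[a]] <- round(rfu_from[[a]] * scalars[[a]], 1)
}

scalars
lifted
```

Because each analyte has a single scalar, the lift is a column-wise multiplication and does not change the ordering of samples within an analyte.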
lift_adat(): A "lifted" soma_adat object corresponding to
the scaling requested in the bridge parameter. RFU values are
rounded to 1 decimal place to match standard SomaScan delivery format.
is_lifted(): Logical. Whether the RFU values in a soma_adat
have been lifted from its original signal space to a new signal space.
Lin's Concordance Correlation Coefficient (CCC) is calculated
by computing the correlation between post-lift RFU values and the
RFU values generated on the original SomaScan version.
This CCC estimate is a measure of how well an analyte can be bridged
across SomaScan versions.
See vignette("lifting-and-bridging", package = "SomaDataIO").
As with the lifting scalars, if you have an annotations file
you may view the analyte-specific CCC values via read_annotations().
Alternatively, getSomaScanLiftCCC() retrieves these values
from an internal object for both "serum" and "plasma".
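For readers unfamiliar with the metric, Lin's CCC for two paired vectors can be computed directly from its definition (Lin 1989). The sketch below uses made-up RFU values for a single analyte; `lin_ccc()` is a hypothetical helper and is not how the package stores or retrieves its CCC values.

```r
# Lin's concordance correlation coefficient for two paired numeric vectors,
# using population (1/n) moments as in Lin (1989)
lin_ccc <- function(x, y) {
  stopifnot(is.numeric(x), is.numeric(y), length(x) == length(y))
  mx  <- mean(x)
  my  <- mean(y)
  sxx <- mean((x - mx)^2)
  syy <- mean((y - my)^2)
  sxy <- mean((x - mx) * (y - my))
  2 * sxy / (sxx + syy + (mx - my)^2)
}

# Made-up values: post-lift RFUs vs. RFUs measured on the target version
post_lift <- c(1010, 1525, 1980, 2490, 3020)
native    <- c(1000, 1500, 2000, 2500, 3000)

lin_ccc(post_lift, native)
```

Unlike Pearson correlation, the CCC is penalized by shifts in location and scale, so it equals 1 only when the two sets of values agree exactly.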
Newer versions of SomaScan typically have additional content, i.e.
new reagents added to the multi-plex assay that bind to additional proteins.
When lifting to a previous SomaScan version, new reagents that do not
exist in the "earlier" assay version are scaled by 1.0, and are thus
maintained, unmodified, in the returned object. Users may need to drop
these columns in order to combine these data with a previous study
from an earlier SomaScan version, e.g. with collapseAdats().
In the inverse scenario, lifting "forward" from a previous, lower-plex
version, there will be extra reference values that are unnecessary
to perform the lift, and a warning is triggered. The resulting data
consists of RFU data in the "new" signal space, but with fewer analytes
than would otherwise be expected (e.g. 11k space with only 5284
analytes; see example below).
Lin, Lawrence I-Kuei. 1989. A Concordance Correlation Coefficient to Evaluate Reproducibility. Biometrics. 45:255-268.
    # `example_data` is SomaScan (V4, 5k)
    adat <- head(example_data, 3L)
    dim(adat)

    getSomaScanVersion(adat)

    getSignalSpace(adat)

    # perform 'lift'
    lift_11k <- lift_adat(adat, "5k_to_11k")  # warning

    is_lifted(lift_11k)

    dim(lift_11k)

    # attributes updated to reflect the 'lift'
    attr(lift_11k, "Header")$HEADER$SignalSpace

    attr(lift_11k, "Header")$HEADER$ProcessSteps
Load a series of ADATs and return a list of soma_adat
objects, one for each ADAT file.
collapseAdats() concatenates a list of ADATs from loadAdatsAsList(),
while maintaining the relevant attribute entries (mainly the HEADER
element). This makes writing out the final object possible without the
loss of HEADER information.
    loadAdatsAsList(files, collapse = FALSE, verbose = interactive(), ...)

    collapseAdats(x)
files |
A character string of files to load. |
collapse |
Logical. Should the resulting list of ADATs be collapsed into a single ADAT object? |
verbose |
Logical. Should the function call be run in verbose mode? |
... |
Additional arguments passed to |
x |
A list of |
The default behavior is to "vertically bind"
(rbind()) on the intersect of the column variables, with
unique columns silently dropped.
If "vertically binding" on the column union is
desired, use dplyr::bind_rows(), however this results in NAs in
non-intersecting columns. For many files with little variable
intersection, a sparse RFU-matrix will result
(and will likely break ADAT attributes):
    adats <- loadAdatsAsList(files)
    union_adat <- dplyr::bind_rows(adats, .id = "SourceFile")
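The difference between binding on the column intersect and on the column union can be seen without any ADAT files. The sketch below uses two small, made-up data frames in place of `soma_adat` objects and plain base R; it illustrates the two strategies only and does not reproduce the attribute handling of `collapseAdats()`.

```r
# Two toy data frames standing in for ADATs with partly different columns
a <- data.frame(SampleId = c("S1", "S2"),
                seq.1234.56 = c(100, 110),
                seq.9999.1  = c(5, 6))
b <- data.frame(SampleId = "S3",
                seq.1234.56 = 120,
                seq.8888.2  = 7)
adats <- list(a = a, b = b)

# Intersect: keep only the columns shared by every element, then rbind()
common       <- Reduce(intersect, lapply(adats, names))
by_intersect <- do.call(rbind, lapply(adats, function(d) d[, common, drop = FALSE]))

# Union: add the missing columns as NA before binding
all_cols <- Reduce(union, lapply(adats, names))
by_union <- do.call(rbind, lapply(adats, function(d) {
  d[setdiff(all_cols, names(d))] <- NA
  d[, all_cols, drop = FALSE]
}))

by_intersect
by_union
```

With only two small inputs the union already contains `NA` cells; with many files and little column overlap these accumulate into the sparse matrix described above.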
A list of ADATs named by files, each a soma_adat object
corresponding to an individual file in files. For collapseAdats(),
a single, collapsed soma_adat object.
Stu Field
Other IO:
parseHeader(),
read_adat(),
soma_adat,
write_adat()
    # only 1 file in directory
    dir(system.file("extdata", package = "SomaDataIO"))

    files <- system.file("extdata", package = "SomaDataIO") |>
      dir(pattern = "[.]adat$", full.names = TRUE) |>
      rev()

    adats <- loadAdatsAsList(files)
    class(adats)

    # collapse into 1 ADAT
    collapsed <- collapseAdats(adats)
    class(collapsed)

    # Alternatively use `collapse = TRUE`
    loadAdatsAsList(files, collapse = TRUE)
Performs median normalization on a soma_adat object that has
already undergone standard data processing for array-based SomaScan studies.
Median normalization is a common, scale-based normalization technique that corrects for assay-derived technical variation by applying sample-specific linear scaling to expression measurements. Typical sources of assay variation include robotic and manual liquid handling, manufactured consumables such as buffers and plastic goods, laboratory instrument calibration, ambient environmental conditions, inter-operator differences, and other sources of technical variation. Median normalization can improve assay precision and reduce technical variation that can mask true biological signal.
The method scales each sample so that the center of the within-sample analyte distribution aligns to a defined reference, thereby correcting global intensity shifts without altering relative differences between measurements within a sample. For assay formats with multiple dilution groups (e.g., 1:5 or 20%; 1:200 or 0.5%; 1:20,000 or 0.005%), separate scale factors are calculated for each dilution because each dilution group is processed separately during the assay. For each sample, the ratio of reference RFU / observed RFU is calculated for every SeqId. The median ratio within each dilution group is selected as the scale factor and applied to all SeqIds for that sample within the associated dilution bin.
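The scale-factor calculation described above can be sketched for a single sample in base R. Everything below is made-up illustration data (SeqIds, dilution labels, reference and observed RFUs), and the object names are hypothetical; the package function additionally handles references, grouping, and header updates.

```r
# Toy inputs: one sample, six SeqIds split across two dilution groups
seq_ids   <- c("1234-56", "2345-67", "3456-78", "4567-89", "5678-90", "6789-01")
dilution  <- c("20", "20", "20", "0.5", "0.5", "0.5")
reference <- c(1000, 2000, 4000, 500, 800, 1000)   # reference RFU per SeqId
observed  <- c(1100, 2000, 5000, 250, 400, 400)    # observed RFU for the sample
names(observed) <- seq_ids

# Ratio of reference RFU / observed RFU for every SeqId
ratios <- reference / observed

# Scale factor = median ratio within each dilution group
scale_factors <- tapply(ratios, dilution, median)

# Apply each group's scale factor to all SeqIds in that dilution bin
normalized <- observed * as.vector(scale_factors[dilution])

scale_factors
normalized
```

Each SeqId is multiplied by the scale factor of its own dilution group, so relative differences between analytes within a dilution group are preserved.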
    medianNormalize(adat, reference = NULL, by = NULL, verbose = TRUE)
adat |
A |
reference |
Optional. Reference for median normalization. Can be:
When providing an external reference data.frame it must contain:
|
by |
Character vector. Grouping variable(s) for grouped median
normalization. Must be column name(s) in the ADAT. Normalization will be
performed within each group separately. Default is |
verbose |
Logical. Should progress messages be printed? Default is |
A soma_adat object with median normalization applied and RFU values
adjusted. The existing NormScale_* columns are updated to include the
effects of both plate scale normalization and median normalization.
This function is designed for data in standard SomaLogic deliverable formats. Specific ADAT file requirements:
Intact ADAT file, with available data processing information
in the header section. Specifically, the ProcessSteps field must be
present and correctly represent the data processing steps present in
the data table.
Minimal standard processing, the function assumes a standard SomaScan data deliverable with minimally standard HybNorm and PlateScale steps applied.
Primary use cases:
Combine data sets from the same overarching experiment or sample population that were originally processed separately (each normalized "within study") and normalize them to a common reference.
Normalize fundamentally different types of samples separately (by group). For instance, lysate samples from different cell lines that will be analyzed separately should likely be median normalized within each cell type. Lysis buffer background samples would also be expected to be normalized separately.
A core assumption of median normalization is that the majority of analytes are not differentially expressed; consequently, users should validate this assumption by inspecting scale-factor distributions for systematic bias between the biological groups intended for comparison.
Note this function does not perform the adaptive normalization by maximum likelihood (ANML) method which leverages a population-based reference that iteratively down-selects the set of analytes to include for the normalization calculation.
This function requires unnormalized data as input. If study samples
have already undergone median normalization (ANML or standard), first use
reverseMedianNormalize() to remove existing normalization.
    ## Not run: 
    # Starting with unnormalized ADAT
    unnormalized_adat <- read_adat("unnormalized_study_data.adat")

    # Internal reference from study samples (default - all samples normalized together)
    med_norm_adat <- medianNormalize(unnormalized_adat)

    # Reference from another ADAT
    ref_adat <- read_adat("reference_file.adat")
    med_norm_adat <- medianNormalize(unnormalized_adat, reference = ref_adat)

    # External reference as a data.frame - requires `SeqId` and `Reference` columns
    ref_data <- read.csv("reference_file.csv")
    med_norm_adat <- medianNormalize(unnormalized_adat, reference = ref_data)

    # Custom grouping by biological variables
    # Use when samples should be normalized separately by group
    med_norm_adat <- medianNormalize(unnormalized_adat, by = "Sex")
    med_norm_adat <- medianNormalize(unnormalized_adat, by = c("Sex", "Age_Group"))

    # If you already have normalized data, first reverse the normalization
    normalized_adat <- read_adat("normalized_study_data.adat")
    unnormalized_adat <- reverseMedianNormalize(normalized_adat)
    custom_norm_adat <- medianNormalize(unnormalized_adat, reference = new_reference)
    ## End(Not run)
Occasionally, additional clinical data is obtained after samples
have been submitted to SomaLogic, or even after 'SomaScan'
results have been delivered.
This requires the new clinical variables, i.e. non-proteomic, data to be
merged with 'SomaScan' data into a "new" ADAT prior to analysis.
merge_clin() easily merges such clinical variables into an
existing soma_adat object and is a simple wrapper around dplyr::left_join().
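The left-join behavior can be illustrated in base R with `match()`. The data frames below are made up, and this sketch is not the package's implementation (which wraps `dplyr::left_join()`); it only shows that every row of the proteomic data is kept, in its original order, with `NA` where no clinical record exists.

```r
# Toy stand-ins for an ADAT and a clinical data table
adat_like <- data.frame(SampleId = c("3", "1", "2"),
                        seq.1234.56 = c(310, 105, 220))
clin      <- data.frame(SampleId = c("1", "2", "4"),
                        group = c("a", "b", "a"))

# Left join on SampleId: look up each ADAT row in the clinical table
idx      <- match(adat_like$SampleId, clin$SampleId)
new_vars <- setdiff(names(clin), "SampleId")
merged   <- cbind(adat_like, clin[idx, new_vars, drop = FALSE])
rownames(merged) <- NULL

merged
```

Note that the key columns must be of the same type for the lookup to succeed; here both `SampleId` columns are character.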
    merge_clin(x, clin_data, by = NULL, by_class = NULL, ...)
x |
A |
clin_data |
One of 2 options:
|
by |
A character vector of variables to join by.
See |
by_class |
If |
... |
Additional parameters passed to |
This functionality also exists as a command-line tool (R script) contained
in merge_clin.R that lives in the cli/merge system file directory.
Please see:
dir(system.file("cli/merge", package = "SomaDataIO"), full.names = TRUE)
vignette("cli-merge-tool", package = "SomaDataIO")
A soma_adat with new clinical variables merged.
Stu Field
    # retrieve clinical data
    clin_file <- system.file("cli/merge", "meta.csv",
                             package = "SomaDataIO",
                             mustWork = TRUE)
    clin_file

    # view clinical data to be merged:
    # 1) `group`
    # 2) `newvar`
    clin_df <- read.csv(clin_file, colClasses = c(SampleId = "character"))
    clin_df

    # create mini-adat
    apts <- withr::with_seed(123, sample(getAnalytes(example_data), 2L))
    adat <- head(example_data, 9L) |>   # 9 x 2
      dplyr::select(SampleId, all_of(apts))

    # merge clinical variables
    merged <- merge_clin(adat, clin_df, by = "SampleId")
    merged

    # Alternative syntax:
    # 1) pass file path
    # 2) merge on different variable names
    # 3) convert join type on-the-fly
    clin_file2 <- system.file("cli/merge", "meta2.csv",
                              package = "SomaDataIO",
                              mustWork = TRUE)

    id_type <- typeof(adat$SampleId)
    merged2 <- merge_clin(adat, clin_file2,                # file path
                          by = c(SampleId = "ClinKey"),    # join on 2 variables
                          by_class = c(ClinKey = id_type)) # match types
    merged2
The parameters below are commonly used throughout the SomaDataIO package.
adat |
A |
x |
A |
matrix |
Character. A string of (usually) either
|
A soma_adat class object.
Parses the header section of an ADAT file.
    parseHeader(file)
file |
Character. The elaborated path and file name of the
|
A list of relevant file information required by read_adat()
in order to complete loading the ADAT file, including:
| Header.Meta | list of notes and other information about the adat |
| Col.Meta | list of vectors that contain the column meta data about individual analytes, including information about the target name and calibration and QC ratios |
| file_specs | list of values of the file parsing specifications |
| row_meta | character vector of the clinical variables; assay information that is included in the adat output along with the RFU data |
Stu Field
Other IO:
loadAdatsAsList(),
read_adat(),
soma_adat,
write_adat()
    f <- system.file("extdata", "example_data10.adat",
                     package = "SomaDataIO", mustWork = TRUE)
    header <- parseHeader(f)
    names(header)

    header$Header.Meta

    header$file_specs

    header$row_meta

    head(as.data.frame(header$Col.Meta))
Utility to convert an ExpressionSet class object
from the "wide" data format to the "long" format via tidyr::pivot_longer().
The Biobase package is required for this function.
    pivotExpressionSet(eSet)

    meltExpressionSet(eSet)
eSet |
An |
A tibble consisting of the long format
conversion of an ExpressionSet object.
meltExpressionSet(): Please now use
pivotExpressionSet().
Stu Field
Other eSet:
adat2eSet()
    # subset into a reduced mini-ADAT object
    # 10 samples (rows)
    # 5 clinical variables and 3 features (cols)
    sub_adat <- example_data[1:10, c(1:5, 35:37)]
    ex_set   <- adat2eSet(sub_adat)

    # convert ExpressionSet object to long format
    adat_long <- pivotExpressionSet(ex_set)
Plotting function for objects of the outlier_map class. Produces a
heatmap-style image using ggplot2 syntax, for objects
produced by calcOutlierMap().
    ## S3 method for class 'Map'
    plot(
      x,
      color.scheme = NULL,
      legend.ticks = 7,
      gridlines = NULL,
      gridlinecol = "red",
      gridlinelwd = 0.5,
      gridlinelty = 2,
      main = NULL,
      y.lab = NULL,
      x.lab = NULL,
      flags = NULL,
      legend.width = 1,
      legend.height = 2,
      filename = NULL,
      plot.width = 14,
      plot.height = 8,
      plot.scale = 1,
      ...
    )
x |
An object of class: |
color.scheme |
Which color scheme to use. Typical choices include: |
legend.ticks |
How many ticks to place on the color legend. |
gridlines |
Numeric vector or logical. Indicates where to draw the horizontal
grid lines that can be used to separate samples (rows). This
should be a vector of the cumulative sum of the horizontal lines to be
drawn, typically something like |
gridlinecol |
Color of the gridlines. |
gridlinelwd |
Width of the gridlines. |
gridlinelty |
Line type of the gridlines. |
main |
Character. Main title for the plot.
See |
y.lab |
Character. Optional string for the y-axis. Otherwise one is automatically generated (default). |
x.lab |
Character. Optional string for the x-axis. Otherwise one is automatically generated (default). |
flags |
Numeric in |
legend.width |
Width for the color legend. |
legend.height |
Height for the color legend. |
filename |
Optional. If provided, the plot will be written to a file.
The file name must also include the desired file type extension;
this will be used to determine the file type,
e.g. a file named |
plot.width |
If |
plot.height |
If |
plot.scale |
If |
... |
Arguments required by the |
Plot an image of the passed matrix.
Stu Field, Amanda Hiser
ggplot2::ggplot(), ggplot2::geom_raster()
Other Calc Map:
calcOutlierMap(),
getOutlierIds()
    example_data |>
      dplyr::filter(SampleType == "Sample") |>
      head(10) |>
      calcOutlierMap() |>
      plot(flags = 0.05)
Pre-process an ADAT file containing raw analyte RFU values in preparation for analysis. For more details, please refer to the pre-processing how-to article.
    preProcessAdat(
      adat,
      filter.features = TRUE,
      filter.controls = TRUE,
      filter.rowcheck = TRUE,
      filter.qc = deprecated(),
      filter.outliers = FALSE,
      data.qc = NULL,
      log.10 = FALSE,
      center.scale = FALSE
    )
adat |
A |
filter.features |
Logical. Should non-human protein features (SeqIds) be
dropped? Default is |
filter.controls |
Logical. Should SomaScan technical control samples
be dropped? If |
filter.rowcheck |
Logical. If |
filter.qc |
|
filter.outliers |
Logical. Should the |
data.qc |
Character. Character vector of variable names for which data
QC plots are desired. Default is |
log.10 |
Logical. Should the RFU values be log10 transformed?
Default is |
center.scale |
Logical. Should the RFU values be Z-transformed
(centered and scaled)? Default is |
The soma_adat object is pre-processed with the following steps:
Filter features -> down to human protein analytes
Filter samples -> by the following order and criteria: a) Retain study samples only (dropping buffer, calibrator, and QC samples) b) Only those that pass default normalization acceptance criteria c) Those not identified as outliers.
Data QC -> plots of normalization scale factors by clinical covariates
Transformations -> log10, center, and scale analyte RFU values
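The final transformation step can be illustrated in base R. This is a minimal sketch on made-up RFU values, assuming the Z-transform is applied to each analyte column after the log10 transform; it is not the package's internal code.

```r
# Toy RFU data: 3 samples (rows) x 2 analytes (columns)
rfu <- data.frame(seq.1234.56 = c(100, 1000, 10000),
                  seq.3948.48 = c(250, 500, 1000))

# log10-transform, then center and scale each analyte column (Z-transform)
log_rfu <- log10(as.matrix(rfu))
z_rfu   <- scale(log_rfu, center = TRUE, scale = TRUE)

colMeans(z_rfu)        # approximately 0 for each analyte
apply(z_rfu, 2, sd)    # 1 for each analyte
```

After this step every analyte is on a common, unitless scale, which is often convenient for downstream modeling.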
A soma_adat object.
Caleb Scheidel
    preProcessAdat(example_data, data.qc = c("Age", "Sex"))
Parse and load a *.adat file as a data.frame-like object into
an R workspace environment. The returned object is of class
soma_adat.
read.adat() is superseded.
For backward compatibility it will likely never go away completely,
but you are strongly encouraged to shift your code to use
read_adat().
is.soma_adat() checks whether an object is of class soma_adat.
See inherits().
    read_adat(file, debug = FALSE, verbose = getOption("verbose"), ...)

    read.adat(file, debug = FALSE, verbose = getOption("verbose"), ...)

    is.soma_adat(x)
file |
Character. The elaborated path and file name of the |
debug |
Logical. Used for debugging and development of an ADAT that fails to load, particularly out-of-spec, poorly modified, or legacy ADATs. |
verbose |
Logical. Should the function call be run in verbose mode, printing relevant diagnostic call information to the console? |
... |
Additional arguments passed ultimately to
|
x |
An |
A data.frame-like object of class soma_adat
consisting of SomaLogic RFU (feature) data and clinical meta data as
columns, and samples as rows. Row names are labeled with the unique ID
"SlideId_Subarray" concatenation. The sections of the ADAT header (e.g.,
"Header.Meta", "Col.Meta", ...) are stored as attributes (e.g.
attributes(x)$Header.Meta).
Logical. Whether x inherits from class soma_adat.
Stu Field
Other IO:
loadAdatsAsList(),
parseHeader(),
soma_adat,
write_adat()
    # path to *.adat file
    # replace with your file path
    adat_path <- system.file("extdata", "example_data10.adat",
                             package = "SomaDataIO", mustWork = TRUE)
    adat_path

    my_adat <- read_adat(adat_path)

    is.soma_adat(my_adat)
Import a SomaLogic Annotations File
    read_annotations(file)
file |
A path to an annotations file location.
This should be a SomaLogic annotations file in
|
A tibble containing analyte-specific annotations and
related (e.g. lift/bridging) information, keyed on SomaLogic
SeqId, the unique SomaScan analyte identifier.
    ## Not run: 
    # for example
    file <- "~/Downloads/SomaScan_11K_v5.0_Plasma_Serum_Annotated_Menu.xlsx"
    anno_tbl <- read_annotations(file)
    ## End(Not run)
Reverses median normalization (including ANML) that was
previously applied to study samples (SampleType == "Sample"). This function
is designed to work with standard SomaScan deliverable ADAT files where
study samples have undergone median normalization as the final processing step.
This function validates that:
Study samples have a median normalization step applied
The normalization was the last transformation applied to study samples
The correct reversal method is applied based on the normalization type
    reverseMedianNormalize(adat, verbose = TRUE)
adat |
A |
verbose |
Logical. Should progress messages be printed? Default is |
A soma_adat object with median normalization reversed for study samples.
QC, Calibrator, and Buffer samples retain their original normalization.
The ProcessSteps header is updated to reflect the reversal operation,
and median normalization-specific metadata fields are cleared.
Converting from normalized ADAT to unnormalized ADAT for custom normalization
Preparing normalized delivery data for use with medianNormalize() function
Backing out normalization to apply different normalization strategies
ADAT file with study samples (SampleType == "Sample") that have been
median normalized (either standard median normalization or ANML)
Intact header metadata with ProcessSteps field indicating the
normalization history
Median normalization must be the last processing step applied to study samples
    ## Not run: 
    # Reverse normalization from a delivered ADAT file
    normalized_adat <- read_adat("normalized_study_data.adat")
    unnormalized_adat <- reverseMedianNormalize(normalized_adat)
    ## End(Not run)
Easily move row names to a column and vice-versa without the unwanted
side-effects to object class and attributes. Drop-in replacement for
tibble::rownames_to_column() and tibble::column_to_rownames() which
can have undesired side-effects to complex object attributes.
Does not import any external packages, modify the environment, or change
the object (other than the desired column). When using col2rn(), if
explicit row names exist, they are overwritten with a warning. add_rowid()
does not affect row names, which differs from tibble::rowid_to_column().
    rn2col(data, name = ".rn")

    col2rn(data, name = ".rn")

    has_rn(data)

    rm_rn(data)

    set_rn(data, value)

    add_rowid(data, name = ".rowid")
data |
An object that inherits from class |
name |
Character. The name of the column to move. |
value |
Character. The new set of names for the data frame.
If duplicates exist they are modified on-the-fly via |
All functions attempt to return an object of the same class as
the input with fully intact and unmodified attributes (aside from those
required by the desired action). has_rn() returns a scalar logical.
rn2col(): moves the row names of data to an explicit column
whether they are explicit or implicit.
col2rn(): is the inverse of rn2col(). If row names exist, they
will be overwritten (with warning).
has_rn(): returns a boolean indicating whether the data frame
has explicit row names assigned.
rm_rn(): removes existing row names, leaving only "implicit" row names.
set_rn(): sets (and overwrites) existing row names for data frames only.
add_rowid(): adds a sequential integer row identifier; starting at 1:nrow(data).
It does not remove existing row names currently, but may in the future
(please code accordingly).
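The attribute-preserving behavior described above can be sketched in base R. `rn_to_col()` below is a hypothetical helper written for illustration, not the package's `rn2col()`; it copies the row names into a leading column and then restores any custom attributes that the column bind would otherwise drop.

```r
# Hypothetical helper: move row names to a leading column, keeping custom attributes
rn_to_col <- function(data, name = ".rn") {
  stopifnot(is.data.frame(data))
  custom <- attributes(data)
  custom <- custom[setdiff(names(custom), c("names", "row.names", "class"))]
  out <- cbind(stats::setNames(data.frame(rownames(data)), name), data)
  rownames(out) <- NULL                 # leave only implicit row names
  for (a in names(custom)) {
    attr(out, a) <- custom[[a]]         # restore custom attributes
  }
  out
}

df <- data.frame(a = 1:3, b = c(2.5, 3.5, 4.5), row.names = c("A", "B", "C"))
attr(df, "Header.Meta") <- list(note = "toy attribute")

out <- rn_to_col(df)
out
attr(out, "Header.Meta")
```

The explicit save-and-restore step is what distinguishes this approach from a plain `cbind()`, which rebuilds the data frame without the extra attributes.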
    df <- data.frame(a = 1:5, b = rnorm(5), row.names = LETTERS[1:5])
    df
    rn2col(df)              # default name is `.rn`
    rn2col(df, "AptName")   # pass `name =`

    # moving columns
    df$mtcars <- sample(names(mtcars), 5)
    col2rn(df, "mtcars")   # with a warning

    # Move back and forth easily
    # Leaves original object un-modified
    identical(df, col2rn(rn2col(df)))

    # add "id" column
    add_rowid(mtcars)

    # remove row names
    has_rn(mtcars)
    mtcars2 <- rm_rn(mtcars)
    has_rn(mtcars2)
The SeqId is the cornerstone used to uniquely identify
SomaLogic analytes.
SeqIds follow the format <Pool>-<Clone>_<Version>, for example
"1234-56_7" can be represented as:
| Pool | Clone | Version |
|---|---|---|
| 1234 | 56 | 7 |
See Details below for the definition of each sub-unit.
The <Pool>-<Clone> combination is sufficient to uniquely identify a
specific analyte and therefore versions are no longer provided (though
they may be present in legacy ADATs).
The tools below enable users to extract, test, identify, compare,
and manipulate SeqIds across assay runs and/or versions.
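To make the `<Pool>-<Clone>_<Version>` layout concrete, the sketch below splits SeqIds into their sub-units with base R string functions. `split_seqid()` is a hypothetical helper that assumes well-formed inputs; it is not one of the package tools listed below.

```r
# Hypothetical helper: split "<Pool>-<Clone>_<Version>" into its sub-units
split_seqid <- function(x) {
  parts <- strsplit(x, "[-_]")
  data.frame(
    SeqId   = x,
    Pool    = vapply(parts, `[`, character(1), 1L),
    Clone   = vapply(parts, `[`, character(1), 2L),
    Version = vapply(parts, `[`, character(1), 3L)   # NA when no version is present
  )
}

ids <- c("1234-56_7", "1234-56")
split_seqid(ids)

# Dropping the optional version leaves the <Pool>-<Clone> identifier
sub("_[0-9]+$", "", ids)
```

Both inputs reduce to the same `<Pool>-<Clone>` value once the version is removed, which is why the version is not needed to identify an analyte.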
    getSeqId(x, trim.version = FALSE)

    regexSeqId()

    locateSeqId(x, trailing = TRUE)

    seqid2apt(x)

    apt2seqid(x)

    is.apt(x)

    is.SeqId(x)

    is.AptName(x)

    matchSeqIds(x, y, order.by.x = TRUE)

    getSeqIdMatches(x, y, show = FALSE)
x |
Character. A vector of strings, usually analyte/feature column
names, |
trim.version |
Logical. Whether to remove the version number, i.e. "1234-56_7" -> "1234-56". Primarily for legacy ADATs. |
trailing |
Logical. Should the regular expression explicitly specify
trailing |
y |
Character. A second vector of |
order.by.x |
Logical. Order the returned character string by
the |
show |
Logical. Return the data frame visibly? |
| Pool: | ties back to the original well during SELEX |
| Clone: | ties to the specific sequence within a pool |
| Version: | refers to custom modifications (optional/defunct) |
AptName: a SeqId combined with a string, usually a GeneId- or
seq.-prefix, for convenient, human-readable
manipulation from within R.
getSeqId(): a character vector of SeqIds captured from a string.
regexSeqId(): a regular expression (regex) string
pre-defined to match the SomaLogic SeqId pattern.
locateSeqId(): a data frame containing the start and stop
integer positions for SeqId matches at each value of x.
seqid2apt(): a character vector with the seq.* prefix, i.e.
the inverse of getSeqId().
apt2seqid(): a character vector of SeqIds. is.SeqId() will
return TRUE for all elements.
is.apt(), is.SeqId(): Logical. TRUE or FALSE.
matchSeqIds(): a character string corresponding to values
in y of the intersect of x and y. If no matches are
found, character(0).
getSeqIdMatches(): an n x 2 data frame, where n is the
length of the intersect of the matching SeqIds.
The data frame is named by the passed arguments, x and y.
getSeqId(): extracts/captures the SeqId match from an analyte column identifier,
i.e. column name of an ADAT loaded with read_adat(). Assumes the
SeqId pattern occurs at the end of the string, which for
the vast majority of cases will be true. For edge cases, see the
trailing argument to locateSeqId().
regexSeqId(): generates a pre-formatted regular expression for
matching of SeqIds. Note the trailing match, which is most
commonly required, but locateSeqId() offers
an alternative to match anywhere in a string.
Used internally in many utility functions.
locateSeqId(): generates a data frame of the positional SeqId matches. Specifically
designed to facilitate SeqId extraction via substr().
Similar to stringr::str_locate().
seqid2apt(): converts a SeqId into anonymous-AptName format, i.e.
1234-56 -> seq.1234.56. Version numbers (1234-56_ver)
are always trimmed when present.
apt2seqid(): converts an anonymous-AptName into SeqId format, i.e.
seq.1234.56 -> 1234-56. Version numbers (seq.1234.56.ver)
are always trimmed when present.
is.apt(): regular expression match to determine if a string contains
a SeqId, and thus is probably an AptName format string. Both
legacy EntrezGeneSymbol-SeqId combinations and newer
so-called "anonymous-AptName" formats (seq.1234.45) are matched.
is.SeqId(): tests for SeqId format, i.e. values returned from getSeqId()
will always return TRUE.
is.AptName(): tests for AptName format, i.e. values returned from seqid2apt()
will always return TRUE. This function will only match AptNames, not
SeqIds, and is therefore more strict than is.apt().
matchSeqIds(): matches two character vectors on the basis of their
intersecting SeqIds. Note that elements in y not
containing a SeqId regular expression are silently dropped.
getSeqIdMatches(): matches two character vectors on the basis of their intersecting SeqIds
only (irrespective of the GeneID-prefix). This produces a two-column
data frame which can then be used to map between the two sets.
The final order of the matches/rows is by the input
corresponding to the first argument (x).
By default the data frame is invisibly returned to
avoid dumping excess output to the console (see the show = argument).
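The matching described for matchSeqIds() and getSeqIdMatches() can be approximated in base R as below. This is only a sketch of the idea (normalize both vectors to SeqIds, intersect, then map back), not the package's implementation; to_seqid() is a hypothetical helper and its regular expression is a simplified assumption that ignores version suffixes.

```r
# Hypothetical helper: pull a trailing SeqId and normalize it to "1234-56".
# Elements without a SeqId become NA so positions stay aligned with the input.
to_seqid <- function(s) {
  pat <- "^.*?([0-9]{4,5})[-.]([0-9]{1,3})$"
  out <- rep(NA_character_, length(s))
  ok  <- grepl(pat, s, perl = TRUE)
  out[ok] <- sub(pat, "\\1-\\2", s[ok], perl = TRUE)
  out
}

x <- c("seq.4554.56", "seq.3714.49", "PlateId")
y <- c("Group", "3714-49", "Assay", "4554-56")

sx <- to_seqid(x)
sy <- to_seqid(y)

# intersect() keeps the order of its first argument, i.e. ordered by x
common <- intersect(sx[!is.na(sx)], sy[!is.na(sy)])

# values of y that share a SeqId with x, in x order
y_matched <- y[match(common, sy)]

# two-column map between the two sets, one row per shared SeqId
map <- data.frame(x = x[match(common, sx)],
                  y = y[match(common, sy)],
                  stringsAsFactors = FALSE)
map
```

Entries such as "PlateId" and "Group" drop out because they carry no SeqId, which mirrors the silent dropping noted above.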
Stu Field
x <- c("ABDC.3948.48.2", "3948.88", "3948.48.2", "3948-48_2", "3948.48.2",
       "3948-48_2", "3948-88", "My.Favorite.Apt.3948.88.9")

tibble::tibble(orig       = x,
               SeqId      = getSeqId(x),
               SeqId_trim = getSeqId(x, TRUE),
               AptName    = seqid2apt(SeqId))

# Logical Matching
is.apt("AGR2.4959.2") # TRUE
is.apt("seq.4959.2")  # TRUE
is.apt("4959-2")      # TRUE
is.apt("AGR2")        # FALSE

# SeqId Matching
x <- c("seq.4554.56", "seq.3714.49", "PlateId")
y <- c("Group", "3714-49", "Assay", "4554-56")
matchSeqIds(x, y)
matchSeqIds(x, y, order.by.x = FALSE)

# vector of features
feats <- getAnalytes(example_data)

match_df <- getSeqIdMatches(feats[1:100], feats[90:500])  # 11 overlapping
match_df

a <- utils::head(feats, 15)
b <- withr::with_seed(99, sample(getSeqId(a)))   # => SeqId & shuffle
(getSeqIdMatches(a, b))                          # sorted by first vector "a"
soma_adat Class and S3 Methods
The soma_adat data structure is the primary internal R representation
of SomaScan data. A soma_adat is automatically created via read_adat()
when loading a *.adat text file. It consists of a data.frame-like
object with leading columns as clinical variables and SomaScan RFU data
as the remaining variables. Two main attributes corresponding to analyte
and SomaScan run information contained in the *.adat file are added:
Header.Meta: information about the SomaScan run, see parseHeader()
or attr(x, "Header.Meta")
Col.Meta: annotation information about the SomaScan reagents/analytes,
see getAnalyteInfo() or attr(x, "Col.Meta")
file_specs: parsing specifications for the ingested *.adat file
row_meta: the names of the non-RFU fields. See getMeta().
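For readers less familiar with R attributes, the toy example below shows how named attributes like those listed above can ride along on a data.frame and be retrieved with attr(). It is a sketch only: the object toy and every value stored in it are invented placeholders, and a real soma_adat created by read_adat() holds far richer content.

```r
# Toy stand-in for an ADAT: one clinical column plus two RFU columns.
toy <- data.frame(SampleId    = c("S1", "S2"),
                  seq.1234.56 = c(100.5, 220.1),
                  seq.9999.88 = c(3100.0, 2950.7),
                  stringsAsFactors = FALSE)

# Attach illustrative attributes under the names listed above.
# The contents are invented for this example.
attr(toy, "Header.Meta") <- list(HEADER = list(Title = "toy-run"))
attr(toy, "Col.Meta")    <- data.frame(SeqId  = c("1234-56", "9999-88"),
                                       Target = c("Target A", "Target B"),
                                       stringsAsFactors = FALSE)
attr(toy, "row_meta")    <- "SampleId"

# Attributes are retrieved by name
attr(toy, "Header.Meta")$HEADER$Title
attr(toy, "Col.Meta")
attr(toy, "row_meta")
```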
See groupGenerics() for details on Math(), Ops(), and Summary()
methods that dispatch on class soma_adat.
See reexports() for details on re-exported S3 generics from other
packages (mostly dplyr and tidyr) to enable S3 methods to be
dispatched on class soma_adat.
Below is a list of all currently available S3 methods that dispatch on
the soma_adat class:
#>  [1] [              [[             [[<-           [<-
#>  [5] ==             $              $<-            anti_join
#>  [9] arrange        count          filter         full_join
#> [13] getAdatVersion getAnalytes    getMeta        group_by
#> [17] inner_join     is_seqFormat   left_join      Math
#> [21] median         merge          mutate         Ops
#> [25] print          rename         right_join     row.names<-
#> [29] sample_frac    sample_n       select         semi_join
#> [33] separate       slice_sample   slice          summary
#> [37] Summary        transform      ungroup        unite
#> see '?methods' for accessing help and source code
The S3 print() method returns summary information parsed from the object
attributes, if present, followed by a dispatch to the tibble::tibble()
print method. Rownames are printed as the first column in the print method
only.
The S3 summary() method returns the following for each column of the ADAT
object containing SOMAmer data (clinical meta data is excluded):
Target (if available)
Minimum value
1st Quartile
Median
Mean
3rd Quartile
Maximum value
Standard deviation
Median absolute deviation (mad())
Interquartile range (IQR())
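The per-analyte statistics listed above can be reproduced by hand in base R, which may help when checking results. The sketch below uses an invented toy data frame and selects RFU columns by an assumed "seq." name prefix; it is not the summary() method itself and omits the Target lookup.

```r
# Toy data: one clinical column and two RFU columns (invented values)
toy <- data.frame(Sex         = c("F", "M", "F", "M"),
                  seq.1234.56 = c(1, 2, 3, 4),
                  seq.9999.88 = c(10, 20, 30, 100),
                  stringsAsFactors = FALSE)

# clinical columns are excluded; only the analyte columns are summarized
rfu_cols <- grep("^seq\\.", names(toy), value = TRUE)

smry <- vapply(toy[rfu_cols], function(v) {
  q <- stats::quantile(v, c(0.25, 0.75), names = FALSE)
  c(Min    = min(v),
    Q1     = q[1L],
    Median = stats::median(v),
    Mean   = mean(v),
    Q3     = q[2L],
    Max    = max(v),
    sd     = stats::sd(v),
    MAD    = stats::mad(v),
    IQR    = stats::IQR(v))
}, numeric(9))

smry   # statistics in rows, analytes in columns
```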
The S3 Extract() method is used for subsetting a soma_adat
object and relies heavily on the [ method, which keeps the soma_adat
attributes intact and subsets the Col.Meta so that it is consistent
with the newly created object.
S3 extraction via $ is fully supported, however,
as opposed to the data.frame method, partial matching
is not allowed for class soma_adat.
S3 extraction via [[ is supported, however, we restrict
the usage of [[ for soma_adat. Use only a numeric index (e.g. 1L)
or a character identifying the column (e.g. "SampleID").
Do not use the x[[i, j]] form with [[; use x[i, j] instead.
As with $, partial matching is not allowed.
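For context on why partial matching is disallowed, the snippet below shows the default behavior on a plain data.frame in base R, where $ silently completes an abbreviated column name. The data frame df is a toy example; nothing here calls the soma_adat methods.

```r
df <- data.frame(SampleId = c("S1", "S2"),
                 Age      = c(40, 52),
                 stringsAsFactors = FALSE)

# `$` on a plain data.frame partially matches "Sample" to "SampleId"
partial <- df$Sample

# `[[` matches exactly by default, so the abbreviated name finds nothing
exact <- df[["Sample"]]

# partial matching with `[[` has to be requested explicitly
loose <- df[["Sample", exact = FALSE]]
```

With several thousand similarly named analyte columns, a silent partial match could easily return the wrong column, so requiring exact names is the safer design.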
S3 assignment via [ is supported for class soma_adat.
S3 assignment via $ is fully supported for class soma_adat.
S3 assignment via [[ is supported for class soma_adat.
S3 median() is not currently supported for the soma_adat class,
however a dispatch is in place to direct users to alternatives.
## S3 method for class 'soma_adat'
print(x, show_header = FALSE, ...)

## S3 method for class 'soma_adat'
summary(object, tbl = NULL, digits = max(3L, getOption("digits") - 3L), ...)

## S3 method for class 'soma_adat'
x[i, j, drop = TRUE, ...]

## S3 method for class 'soma_adat'
x$name

## S3 method for class 'soma_adat'
x[[i, j, ..., exact = TRUE]]

## S3 replacement method for class 'soma_adat'
x[i, j, ...] <- value

## S3 replacement method for class 'soma_adat'
x$i, j, ... <- value

## S3 replacement method for class 'soma_adat'
x[[i, j, ...]] <- value

## S3 method for class 'soma_adat'
median(x, na.rm = FALSE, ...)
x, object
|
A |
show_header |
Logical. Should all the |
... |
Ignored. |
tbl |
An annotations table. If |
digits |
Integer. Used for number formatting with |
i, j
|
Row and column indices respectively. If |
drop |
Coerce to a vector if fetching one column via |
name |
A name or a string. |
exact |
Ignored with a |
value |
A value to store in a row, column, range or cell. |
na.rm |
a logical value indicating whether |
The set of S3 methods above return the soma_adat object with
the corresponding S3 method applied.
Other IO:
loadAdatsAsList(),
parseHeader(),
read_adat(),
write_adat()
# S3 print method
example_data

# show the header info (no RFU data)
print(example_data, show_header = TRUE)

# S3 summary method
# MMP analytes (4)
mmps <- c("seq.2579.17", "seq.2788.55", "seq.2789.26", "seq.4925.54")
mmp_adat <- example_data[, c("Sex", mmps)]
summary(mmp_adat)

# Summarize by group
mmp_adat |>
  split(mmp_adat$Sex) |>
  lapply(summary)

# Alternatively pass annotations with Target info
anno <- getAnalyteInfo(mmp_adat)
summary(mmp_adat, tbl = anno)
These functions have either been superseded or deprecated
in the current version of the SomaDataIO package.
They may eventually be completely removed, so
please re-code your scripts accordingly based on the
suggestions below:
| Function | Now Use |
| getSomamers() | getAnalytes() |
| getSomamerData() | getAnalyteInfo() |
The example_data object is intended to provide existing and prospective
SomaLogic customers with example data to enable analysis preparation prior
to receipt of SomaScan data, and also for those generally curious about the
SomaScan data deliverable. It is not intended to be used as a control
group for studies or provide any metrics for SomaScan data in general.
example_data: a soma_adat parsed via read_adat() containing
192 samples (see below for breakdown of sample type). There are 5318
columns containing 5284 analyte features and 34 clinical meta data fields.
These data have been pre-processed via the following steps:
hybridization normalized (all samples)
calibrators and buffers median normalized
plate scaled
calibrated
Adaptive Normalization by Maximum Likelihood (ANML) of QC and clinical samples
Note 1: The Age and Sex (M/F) fields contain simulated values
designed to contain biological signal.
Note 2: The SampleType column contains sample source/type information;
rows with SampleType == "Sample" usually represent the "client" samples.
Note 3: The original source file can be found at
https://github.com/SomaLogic/SomaLogic-Data.
ex_analytes: a character vector of the analyte features contained
in the soma_adat object, derived from a call to getAnalytes().
ex_anno_tbl: a lookup table corresponding to a
transposed data frame of the "Col.Meta" attribute of an ADAT, with an
index key field AptName included in column 1, derived from a call to
getAnalyteInfo().
ex_target_names: a lookup table mapping SeqId feature names ->
target names contained in example_data. This object (or one like it) is
convenient at the console via auto-complete for labeling and/or creating
plot titles on the fly.
ex_clin_data: a table containing SampleId, smoking_status, and
alcohol_use fields for each clinical sample in example_data used to
demonstrate how to merge sample annotation information to an existing
soma_adat object.
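The kind of merge mentioned above, joining a sample annotation table onto measured data by SampleId, can be sketched in base R as follows. Both data frames are invented toy stand-ins (they are not the shipped example objects), and the code operates on plain data frames rather than on a soma_adat.

```r
# Toy stand-ins; all values are invented for illustration only.
samples <- data.frame(SampleId    = c("S1", "S2", "S3"),
                      seq.1234.56 = c(100.5, 220.1, 180.3),
                      stringsAsFactors = FALSE)
clin <- data.frame(SampleId       = c("S3", "S1"),
                   smoking_status = c("Never", "Current"),
                   alcohol_use    = c("Yes", "No"),
                   stringsAsFactors = FALSE)

# Left join on the shared key so every measured sample is kept;
# samples without annotation (S2) get NA in the new columns.
merged <- merge(samples, clin, by = "SampleId", all.x = TRUE, sort = FALSE)

# merge() does not promise the original row order, so restore it explicitly
merged <- merged[match(samples$SampleId, merged$SampleId), ]
merged
```

For an actual soma_adat, the join and merge methods that dispatch on the class are likely the better route, since the object's attributes need to survive the operation.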
The example_data object contains a SomaScan V4 study from healthy
normal individuals. The RFU measurements themselves and other identifiers
have been altered to protect personally identifiable information (PII),
but also retain underlying biological signal as much as possible.
There are 192 total EDTA-plasma samples across two 96-well plate runs
which are broken down by the following types:
170 clinical samples (client study samples)
10 calibrators (replicate controls for combining data across runs)
6 QC samples (replicate controls used to assess run quality)
6 Buffer samples (no protein controls)
The standard V4 data normalization procedure for EDTA-plasma samples was applied to this dataset. For more details on the data standardization process see the Data Standardization and File Specification Technical Note. General details are outlined above.
https://github.com/SomaLogic/SomaLogic-Data
SomaLogic Operating Co., Inc.
# S3 print method
example_data

# print header info
print(example_data, show_header = TRUE)

class(example_data)

# Features/Analytes
head(ex_analytes, 20L)

# Feature info table (annotations)
ex_anno_tbl

# Search via `filter()`
dplyr::filter(ex_anno_tbl, grepl("^MMP", Target))

# Lookup table -> targets
# MMP-9
ex_target_names$seq.2579.17

# gender hormone FSH
tapply(example_data$seq.3032.11, example_data$Sex, median)

# gender hormone LH
tapply(example_data$seq.2953.31, example_data$Sex, median)

# Target lookup
ex_target_names$seq.2953.31     # tab-completion at console

# Sample Type/Source
table(example_data$SampleType)

# Sex/Gender Variable
table(example_data$Sex)

# Age Variable
summary(example_data$Age)
soma_adat Columns/Rows
Scale the i-th row or column of a soma_adat object by the i-th
element of a vector. Designed to facilitate linear transformations
of only the analyte/RFU entries by scaling the data matrix.
If scaling the analytes/RFU (columns), v must have
getAnalytes(adat, n = TRUE) elements.
If scaling the samples (rows), v must
have nrow(_data) elements.
## S3 method for class 'soma_adat'
transform(`_data`, v, dim = 2L, ...)
_data |
A |
v |
A numeric vector of the appropriate length corresponding to |
dim |
Integer. The dimension to apply elements of |
... |
Currently not used but required by the S3 generic. |
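Performs the following operations (quickly), where A is the n x p
matrix of RFU values and M is the transformed result:
Columns: $M_{n \times p} = A_{n \times p} \, \mathrm{diag}(v)_{p \times p}$
Rows: $M_{n \times p} = \mathrm{diag}(v)_{n \times n} \, A_{n \times p}$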
A modified value of _data with either the rows or columns
linearly transformed by v.
This method is intentionally naive, and assumes the user has
ordered v to match the columns/rows of _data appropriately.
This must be done upstream.
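Because the scaling is an element-wise operation in disguise, the diagonal-matrix products can be cross-checked against sweep() and vector recycling in base R. The sketch below works on a bare matrix with made-up values and says nothing about how the method is implemented internally.

```r
M <- matrix(1:12, ncol = 4)   # 3 rows (samples) x 4 columns (analytes)

# Column scaling: column j is multiplied by v_col[j]
v_col        <- 1:4
by_col_diag  <- M %*% diag(v_col)
by_col_sweep <- sweep(M, 2L, v_col, `*`)

# Row scaling: row i is multiplied by v_row[i]
v_row        <- 1:3
by_row_diag  <- diag(v_row) %*% M
by_row_sweep <- M * v_row     # recycling runs down columns, so this scales rows

all.equal(by_col_diag, by_col_sweep)
all.equal(by_row_diag, by_row_sweep)
```

Either form depends entirely on v being in the same order as the columns or rows it scales, which is the ordering caveat noted above.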
# simplified example of underlying operations
M <- matrix(1:12, ncol = 4)
M

v <- 1:4
M %*% diag(v)    # transform columns

v <- 1:3
diag(v) %*% M    # transform rows

# dummy ADAT example:
v    <- c(2, 0.5)      # double seq1; half seq2
adat <- data.frame(sample      = paste0("sample_", 1:3),
                   seq.1234.56 = c(1, 2, 3),
                   seq.9999.88 = c(4, 5, 6) * 10)
adat

# `soma_adat` to invoke S3 method dispatch
class(adat) <- c("soma_adat", "data.frame")
trans <- transform(adat, v)
data.frame(trans)
Utility to update a provided soma_adat object's column
metadata to match the annotations object.
updateColMeta(adat, anno)
adat |
A |
anno |
A |
Attempts to update the following column metadata in the adat:
SomaId
Target
TargetFullName
UniProt
Type
Organism
EntrezGeneSymbol
EntrezGeneID
An identical object to adat with Col.Meta updated
to match those in anno.
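Conceptually, an update of this kind amounts to a keyed overwrite: rows of the column metadata are matched to the annotations table and selected fields are replaced. The base R sketch below illustrates that idea on invented toy tables, using SeqId as the assumed key; it is not the package's implementation, and all field values are placeholders.

```r
# Toy column metadata and annotations (all values invented)
col_meta <- data.frame(SeqId   = c("1234-56", "9999-88", "4554-56"),
                       Target  = c("old A", "old B", "old C"),
                       UniProt = c("U1", "U2", "U3"),
                       stringsAsFactors = FALSE)
anno <- data.frame(SeqId   = c("9999-88", "1234-56"),
                   Target  = c("new B", "new A"),
                   UniProt = c("U2", "U9"),
                   stringsAsFactors = FALSE)

fields <- c("Target", "UniProt")   # fields to refresh

# Align annotations to the metadata rows by key
idx <- match(col_meta$SeqId, anno$SeqId)
hit <- !is.na(idx)

# Overwrite only rows that have a matching annotation entry;
# unmatched rows (here "4554-56") keep their existing values
updated <- col_meta
updated[hit, fields] <- anno[idx[hit], fields]
updated
```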
Caleb Scheidel
## Not run:
anno_tbl <- read_annotations("path/to/annotations.xlsx")
adat     <- read_adat("path/to/adat_file.adat")
updated_adat <- updateColMeta(adat, anno_tbl)

## End(Not run)
One can write an existing modified internal ADAT
(soma_adat R object) to an external file.
However, the ADAT object itself must have intact
attributes; see is_intact_attr().
write_adat(x, file)
x |
A |
file |
Character. File path where the object should be written.
For example, extensions should be |
The ADAT specification no longer requires Windows
end of line (EOL) characters ("\r\n").
The current EOL spec is "\n" which is commonly used in POSIX systems,
like MacOS and Linux.
Since the EOL affects the resulting checksum, ADATs written on
other systems generate slightly differing files.
Standardizing to "\n" attempts to solve this issue.
For reference, see the EOL encoding for operating systems below:
| Symbol | Platform | Character |
| LF | Linux, macOS | "\n" |
| CR | Classic Mac OS | "\r" |
| CRLF | DOS/Windows | "\r\n" |
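The effect of the EOL choice on checksums is easy to demonstrate in base R. The sketch below writes the same arbitrary lines twice in binary mode, once with "\n" and once with "\r\n", and compares file sizes and MD5 sums; write_with_eol() is a hypothetical helper written for this example and is unrelated to write_adat().

```r
lines <- c("alpha", "beta", "gamma")   # arbitrary content

# Hypothetical helper: write `lines` to a temp file with a chosen EOL
write_with_eol <- function(eol) {
  path <- tempfile(fileext = ".txt")
  con  <- file(path, open = "wb")      # binary mode: no EOL translation
  on.exit(close(con))
  writeLines(lines, con, sep = eol)
  path
}

lf_file   <- write_with_eol("\n")
crlf_file <- write_with_eol("\r\n")

# CRLF adds one extra byte per line ...
sizes <- file.size(c(lf_file, crlf_file))
sizes

# ... so the checksums differ although the text is identical
sums <- unname(tools::md5sum(c(lf_file, crlf_file)))
sums[1L] == sums[2L]
```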
Invisibly returns the input x.
Stu Field
Other IO:
loadAdatsAsList(),
parseHeader(),
read_adat(),
soma_adat
# trim to 1 sample for speed
adat_out <- head(example_data, 1L)

# attributes must(!) be intact to write
is_intact_attr(adat_out)

write_adat(adat_out, file = tempfile(fileext = ".adat"))