scifer 1.4.0
Have you ever index sorted cells in a 96 or 384-well plate and then sequenced using Sanger sequencing? If so, you probably had some struggles to either check the chromatogram of each cell sequenced manually or to identify which cell was sorted where after sequencing the plate. Scifer was developed to solve this issue by performing basic quality control of Sanger sequences and merging flow cytometry data from probed single-cell sorted B cells with sequencing data. Scifer can export summary tables, fasta files, chromatograms for visual inspection, and generate reports.
Single-cell sorting of probed B/T cells for Sanger sequencing of their receptors is widely used, either for identifying antigen-specific antibody sequences or studying antigen-specific B and T cell responses. For this reason, scifer R package was developed to facilitate the integration and QC of flow cytometry data and sanger sequencing.
This vignette aims to show one example of how to process your own samples based on a test dataset. This dataset contains raw flow cytometry index files (file extension: .fcs
) and raw sanger sequences (file extension: .ab1
). These samples are of antigen-specific B cells that were probed and single-cell sorted in a plate to have their B cell receptors (BCR) sequenced through sanger sequencing. This package can also be used for T cell receptors but you should have extra attention selecting the QC parameters according to your intended sequence. The sorted cells had their RNA reverse transcribed into cDNA and PCR amplified using a set of primer specific for rhesus macaques (sample origin), the resulting PCR products were sequenced using an IgG specific primer designed to capture the entire VDJ fragment of the BCRs.
Regardless of where is your data, you should have two folders, one for flow cytometry data and a second one for sanger sequences. The nomenclature of the .fcs
files and the sanger sequence subfolders should be matching, this is fundamental for merging both datasets.
If you want to have a more detailed explanation of installation steps and folder organization, check the README file in the package github here.
Get the latest stable R
release from CRAN. Then install scifer
using from Bioconductor the following code:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("scifer")
library(ggplot2)
library(scifer)
It is important to check your flow data to check how is your data being processed, if it is already compensated, and if the cells were probed which thresholds you should use.
Here is an example of a poor threshold using the forward and side scatter (cell size and granularity).
fcs_data <- fcs_processing(
folder_path = system.file("/extdata/fcs_index_sorting",package = "scifer"),
compensation = FALSE, plate_wells = 96,
probe1 = "FSC.A", probe2 = "SSC.A",
posvalue_probe1 = 600, posvalue_probe2 = 400)
fcs_plot(fcs_data)
You can play around with the threshold and the different channels available. You can check the name of each channel using this:
colnames(fcs_data)
#> NULL
You can see that the well position was already extracted from the file and a column named specificity
was added. This specificity is named based on your selected channels and their thresholds.
If you did not probe your cells for a specific antigen, you can just use the following and ignore the specificity
column. This approach will add all of your cells, regardless of the detected fluorescence in a channel.
fcs_data <- fcs_processing(
folder_path = system.file("/extdata/fcs_index_sorting",package = "scifer"),
compensation = FALSE, plate_wells = 96,
probe1 = "FSC.A", probe2 = "SSC.A",
posvalue_probe1 = 0, posvalue_probe2 = 0)
fcs_plot(fcs_data)
If you have probed your cells based on a specific marker, you can use the name of the channel or the custom name you have added during the sorting to that channel.
fcs_data <- fcs_processing(
folder_path = system.file("/extdata/fcs_index_sorting",package = "scifer"),
compensation = FALSE, plate_wells = 96,
probe1 = "Pre.F", probe2 = "Post.F",
posvalue_probe1 = 600, posvalue_probe2 = 400)
fcs_plot(fcs_data)
Finally, the above data used compensation as set to FALSE
, which is not usually the case since you probably have compensated your samples before sorting. You can set it to TRUE
and the compensation matrix within the index files will be already automatically applied.
fcs_data <- fcs_processing(
folder_path = system.file("/extdata/fcs_index_sorting",package = "scifer"),
compensation = TRUE, plate_wells = 96,
probe1 = "Pre.F", probe2 = "Post.F",
posvalue_probe1 = 600, posvalue_probe2 = 400)
fcs_plot(fcs_data)
The specificity
column uses these thresholds to name your sorted cells. In this example, you would have a Pre.F
single-positive, Post.F
single-positive, double-positive cells named asDP
, and double-negative cells named as DN
.
unique(fcs_data$specificity)
#> NULL
Here is just an example of if you would like to process a single sanger sequence
## Read abif using sangerseqR package
abi_seq <- sangerseqR::read.abif(
system.file("/extdata/sorted_sangerseq/E18_C1/A1_3_IgG_Inner.ab1",
package="scifer"))
## Summarise using summarise_abi_file()
summarised <- summarise_abi_file(abi_seq)
head(summarised[["summary"]])
#> raw.length trimmed.length trim.start
#> 465 0 0
#> trim.finish raw.secondary.peaks trimmed.secondary.peaks
#> 0 347 0
head(summarised[["quality_score"]])
#> score position
#> 1 25 1
#> 2 7 2
#> 3 6 3
#> 4 3 4
#> 5 2 5
#> 6 5 6
Most of the time, if you have sequenced an entire plate, you want to automate this processing.
Here you would process recursively all the .ab1
files within the chosen folder.
*Note: To speed up, you can increase the processors
parameter depending on your local computer’s number of processors or set it to NULL
which will try to detect the max number of cores.
sf <- summarise_quality(
folder_sequences=system.file("extdata/sorted_sangerseq", package="scifer"),
secondary.peak.ratio=0.33, trim.cutoff=0.01, processor = 1)
#> Looking for .ab1 files...
#> Found 14 .ab1 files...
#> Loading reads...
#> Calculating read summaries...
#> Cleaning up
Here are the columns from the summarised data.frame
:
## Print names of all the variables
colnames(sf[["summaries"]])
#> [1] "file.path" "folder.name"
#> [3] "file.name" "raw.length"
#> [5] "trimmed.length" "trim.start"
#> [7] "trim.finish" "raw.secondary.peaks"
#> [9] "trimmed.secondary.peaks" "raw.mean.quality"
#> [11] "trimmed.mean.quality" "raw.min.quality"
#> [13] "trimmed.min.quality"
Here is the example data.frame
with the summarised results of all the files within the selected path:
## Print table
head(sf[["summaries"]][4:10])
#> raw.length trimmed.length trim.start trim.finish raw.secondary.peaks
#> 1 463 433 16 448 6
#> 2 500 467 17 483 9
#> 3 464 153 68 220 342
#> 4 465 256 1 256 347
#> 5 473 444 15 458 9
#> 6 456 428 13 440 4
#> trimmed.secondary.peaks raw.mean.quality
#> 1 3 52.69677
#> 2 0 55.47117
#> 3 114 12.34698
#> 4 187 13.17345
#> 5 3 51.98109
#> 6 1 55.72429
Finally, the function you will use to integrate both datasets and export data from scifer
is quality_report()
. This function aims to basically merge the two datasets, assign sorting specificity based on the selected thresholds for each channel/probe and write different files.
This function generated the following files:
quality_report(
folder_sequences = system.file("extdata/sorted_sangerseq", package="scifer"),
outputfile = "QC_report.html", output_dir = "~/full/path/to/your/location",
folder_path_fcs = system.file("extdata/fcs_index_sorting",
package = "scifer"),
probe1 = "Pre.F", probe2 = "Post.F",
posvalue_probe1 = 600, posvalue_probe2 = 400)
sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] XVector_0.42.0 IRanges_2.36.0 scifer_1.4.0 ggplot2_3.4.4
#> [5] BiocStyle_2.30.0
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.0 viridisLite_0.4.2 farver_2.1.1
#> [4] dplyr_1.1.3 blob_1.2.4 Biostrings_2.70.0
#> [7] bitops_1.0-7 fastmap_1.1.1 RCurl_1.98-1.12
#> [10] promises_1.2.1 digest_0.6.33 mime_0.12
#> [13] lifecycle_1.0.3 ellipsis_0.3.2 RSQLite_2.3.1
#> [16] magrittr_2.0.3 compiler_4.3.1 rlang_1.1.1
#> [19] sass_0.4.7 tools_4.3.1 utf8_1.2.4
#> [22] yaml_2.3.7 data.table_1.14.8 knitr_1.44
#> [25] bit_4.0.5 plyr_1.8.9 xml2_1.3.5
#> [28] withr_2.5.1 RProtoBufLib_2.14.0 BiocGenerics_0.48.0
#> [31] grid_4.3.1 stats4_4.3.1 fansi_1.0.5
#> [34] xtable_1.8-4 colorspace_2.1-0 MASS_7.3-60
#> [37] scales_1.2.1 isoband_0.2.7 cli_3.6.1
#> [40] rmarkdown_2.25 crayon_1.5.2 generics_0.1.3
#> [43] rstudioapi_0.15.0 httr_1.4.7 DBI_1.1.3
#> [46] cachem_1.0.8 flowCore_2.14.0 stringr_1.5.0
#> [49] zlibbioc_1.48.0 rvest_1.0.3 parallel_4.3.1
#> [52] BiocManager_1.30.22 matrixStats_1.0.0 vctrs_0.6.4
#> [55] webshot_0.5.5 jsonlite_1.8.7 cytolib_2.14.0
#> [58] bookdown_0.36 S4Vectors_0.40.0 bit64_4.0.5
#> [61] magick_2.8.1 systemfonts_1.0.5 jquerylib_0.1.4
#> [64] glue_1.6.2 sangerseqR_1.38.0 stringi_1.7.12
#> [67] gtable_0.3.4 later_1.3.1 GenomeInfoDb_1.38.0
#> [70] munsell_0.5.0 tibble_3.2.1 pillar_1.9.0
#> [73] htmltools_0.5.6.1 GenomeInfoDbData_1.2.11 R6_2.5.1
#> [76] evaluate_0.22 kableExtra_1.3.4 shiny_1.7.5.1
#> [79] Biobase_2.62.0 memoise_2.0.1 DECIPHER_2.30.0
#> [82] httpuv_1.6.12 bslib_0.5.1 Rcpp_1.0.11
#> [85] svglite_2.1.2 gridExtra_2.3 xfun_0.40
#> [88] pkgconfig_2.0.3