Bi-order multimodal integration of single-cell data

a, b Cells from CD19-CAR NK, IL15 NK, and NT-NK products, clustered independently by CyTOF (a) and scRNA (b) data. Among 9383 detected high-quality nuclei, 1276 are gated as BCs by known markers for further analysis. The fetal muscle sci-RNA-seq dataset was downloaded from the Descartes database https://descartes.brotmanbaty.org/bbi/human-gene-expression-during-development/ [66], and the fetal muscle sci-ATAC-seq dataset was downloaded from https://descartes.brotmanbaty.org/bbi/human-chromatin-during-development/ [5]. L2-normalization is performed to remove global differences in scale. Li L, Vakul M, Dou J, Huang Y, Chen K, Rezvani K: Gene Expression Omnibus; 2022, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE190976. The final protein matrix includes 28,609 cells with 25 protein abundance levels. The cell-type level Pearson correlation was calculated by using entries of X and Z from a specific cell type. Dou J, Liang S, Chen K, Chen R. biCCA: bi-order multimodal integration of single cell data: Gene Expression Omnibus; 2022. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE201402. DOI: https://doi.org/10.1186/s13059-022-02679-x. Recent technological advances in single-cell genomics have enabled the measurement of multiple facets of cellular state, producing datasets with millions of single-cell observations. In comparison, Seurat v3.0 tended to misalign all cell types to BC1 and had difficulties separating BC8 and BC9. 
The Silhouette score assesses the separation of cell types, where a high score suggests that cells of the same cell type are close together and far from cells of a different type. iCluster 2 is circled with dashed lines. We select parameters that generally give the best performance in Silhouette score and alignment mixing score. Ten clear clusters were identified and annotated unambiguously as BC1-10 (Fig. The p values shown are from the Wilcoxon test. It also substantially simplifies (or loses) multifactorial relations between transcription factors (TFs) and target genes [27]. The imputed profile has high correlation with the true protein levels (Pearson's R = 0.6) and low correlation with the initial gene scores (Pearson's R < 0.3) (Additional file 1: Fig. After performing standard quality control, we obtained 1276 BC nuclei of high-quality matched ATAC+RNA profiles, which serve as an objective ground truth for quantifying the success of in silico integration. The x-axis shows the Pearson correlation between the ground-truth protein level and the RNA level of its coding gene. Integrating single-cell RNA with protein data produced by a CITE-seq assay. The modality fusion matrix Z in bindSC can be considered as the imputed profiles of cells from Y on the first modality. For example, protein levels for CD19, CD14, and CD11c, markers overexpressed on B cells, monocytes, and DCs, are not highly correlated with the observed RNA expression levels in the same cells (Fig. Importantly, this subset of cells can be identified from neither the scRNA-seq clusters (Fig. Silhouette score and alignment mixing score are calculated for each scenario. 
The left term identifies cells of similar (aligned) features, while the right term identifies features shared by the (aligned) cells. Label transfer accuracy achieved by various methods on these imbalanced datasets is shown for scenario 1 (i) and scenario 2 (j). The general optimization framework can be formulated as follows: subject to \({\boldsymbol{U}}^{\prime}\boldsymbol{U}=\boldsymbol{I},{\boldsymbol{S}}^{\prime}\boldsymbol{S}=\boldsymbol{I},{\boldsymbol{T}}^{\prime}\boldsymbol{T}=\boldsymbol{I},{\boldsymbol{V}}^{\prime}\boldsymbol{V}=\boldsymbol{I}\). We randomized the order of the cells in the RNA matrix and the protein matrix, then tested the ability of each method to generate meaningful co-embeddings and recover the correct pairing. The ARI values are labeled in each panel. This project was also partially supported by the Single Cell Genomics Core at Baylor College of Medicine funded by the NIH shared instrument grants (S10OD023469, S10OD025240) and P30EY002520. Genome Biology 2023 BioMed Central Ltd unless otherwise stated. Supplementary Notes for "Bi-order multimodal integration of single-cell data": Jinzhuang Dou (1,2), jdou1@mdanderson.org; Shaoheng Liang (1,2), sliang3@mdanderson.org; Vakul Mohanty (1), vmohanty@mdanderson.org; Qi Miao (1), qmiao1@mdanderson.org; Yuefan Huang (1), yhuang15@mdanderson.org; Xuesen Cheng (3), xuesenc@bcm.edu; Sangbae Kim (3), Sangbae.Kim@bcm.edu. These two assumptions are met under many scenarios of current investigations; however, they could be violated when an insufficient number of cells is obtained from a rapidly developing cell population. For example, the fetal muscle cell ATAC dataset, consisting of 27,181 cells, has a cluster of cells (3.55% abundance) labeled as unknown (Additional file 1: Fig. CD8+ and CD4+ T cells were partly blended together in the RNA data (ARI = 0.43) but separated clearly in the protein data (ARI = 0.82). 
Transient retroviral supernatant was produced, collected, and used for transduction of NK cells. 4g), nor the CyTOF clusters alone (Additional file 1: Fig. S2. Data were analyzed using bcl2fastq (to generate FASTQ files) and the Cell Ranger pipeline (10x Genomics). They can be grouped into 25 clusters. The human bone marrow dataset, generated using the CITE-seq technology, was downloaded from the Seurat website https://satijalab.org/seurat/v4.0/weighted_nearest_neighbor_analysis.html. Thus, additional manual review and examination of marker gene expression are likely required to ensure accurate annotation results. We also excluded cells that were CD3+ to focus on NK cells only. The protein-homologous RNA matrix includes 28,609 cells with the RNA levels of the 25 genes encoding the 25 proteins. Recent advances in multimodal single-cell technologies have enabled simultaneous acquisition of multiple omics data from the same cell, providing deeper insights into cellular states and dynamics. The key algorithm implemented in bindSC is bi-CCA, the concept of which extends traditional CCA [20, 27, 55] to both rows and columns to enable capturing of correlated variables in cells and features simultaneously. The update steps (4) to (7) are repeated until convergence. 
Although further experimentation is clearly required, the clean definition of CCA may warrant relatively straightforward interpretation of the complex integration results. Remarkably, the majority of the cells from the two modalities became well aligned in the co-embedding (Fig. However, gene activity scores generated by them did not improve integration results in our benchmarking experiments (Additional file 2: Supplementary Note 5 and Additional file 1: Fig. This finding may help quantify the therapeutic value of a CAR-NK cell product and reveal mechanisms that can be further leveraged to improve the efficacy of the treatment. For cell i, assuming proportions of cells from two modalities are \(p_{1i}\) and \(p_{2i}\), respectively, the alignment mixing score is calculated as. As in previous integration methods, E is very important for cell type classification. Our approach can further improve integration performance by leveraging distal regulatory relations [8] (Fig. Clusters 7 and 8, which are classified differently by bindSC, are highlighted by black boxes. JD designed the bindSC tool and implemented the software. We also removed cells with fewer than 1000 expressed peaks and peaks present in fewer than 500 cells from the sci-ATAC-seq data. The result at iteration 0 corresponds to the traditional CCA method. It typically requires O(MNL) floating-point operations to construct the \(M \times N\) cell-cell distance matrix used as input to SVD, where M and N are the cell numbers of the two modalities and L is the number of overlapping features. The cluster of each cell is defined using the cell type labels assigned from single-dataset clustering. 
a, b UMAPs generated from bindSC co-embedding of the sci-RNA (a) and the sci-ATAC (b) data. For cell i from the first dataset, we calculate its Euclidean distance to all cells in the second dataset as \(D_i\), and its distance to cell i in the second dataset as \(d_i\). Bi-CCA is implemented as an open-source R package, bindSC, available at https://github.com/KChen-lab/bindSC. Moreover, the final modality fusion Z and Y can generate a consensus multiomic profile for cells from Y directly, thus enabling (1) characterizing gene and chromatin-accessibility relations from aligned scRNA-seq and scATAC-seq data, (2) associating transcriptomic profiles with proteomic profiles from aligned scRNA-seq and CyTOF data, (3) associating transcriptomic profiles with spatial locations from aligned scRNA-seq and spatial transcriptomic data, and so on (Fig. g Gene expression levels of CCL4 and CCL3 in iCluster 2 and rClusters R0 and R2. The shared nearest neighbor (SNN) graph is constructed by calculating the l-nearest neighbors (20 by default) based on the Euclidean distance in the L2-normalized space. Mass cytometry data were saved in FCS files by a CyTOF instrument (Helios). DEGs specific for clusters 0, 7, and 8 are highlighted by a black box. Previous approaches based on shared features have only provided approximate solutions. 
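The SNN construction mentioned above (l nearest neighbors, 20 by default, Euclidean distance in the L2-normalized space) can be illustrated with a minimal NumPy sketch. bindSC itself is an R package; the function names here are hypothetical, and the Jaccard weighting of shared neighbors is an assumption made for illustration, since the text does not state how edges are weighted.

```python
import numpy as np


def l2_normalize(embedding):
    """Scale each cell (row) to unit Euclidean length."""
    norms = np.linalg.norm(embedding, axis=1, keepdims=True)
    return embedding / np.maximum(norms, 1e-12)


def snn_graph(embedding, k=20):
    """Return a dense SNN weight matrix for cells in a co-embedding.

    Assumption: the edge weight is the Jaccard overlap of the two cells'
    k-nearest-neighbor sets (not confirmed by the source text).
    """
    E = l2_normalize(np.asarray(embedding, dtype=float))
    n = E.shape[0]
    k = min(k, n - 1)  # a cell cannot have more than n - 1 neighbors

    # Squared Euclidean distances between all pairs of cells.
    sq = (E ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (E @ E.T)
    np.fill_diagonal(d2, np.inf)  # exclude each cell from its own neighbors

    # Indices of the k nearest neighbors of every cell.
    nn = np.argsort(d2, axis=1)[:, :k]

    # member[i, j] = 1 if j is among the k nearest neighbors of i.
    member = np.zeros((n, n), dtype=int)
    member[np.repeat(np.arange(n), k), nn.ravel()] = 1

    # shared[i, j] = number of neighbors that cells i and j have in common.
    shared = member @ member.T
    weights = shared / (2 * k - shared)  # Jaccard index of the two sets
    np.fill_diagonal(weights, 0.0)
    return weights
```

A dense matrix is used only to keep the sketch short; for datasets with many thousands of cells a sparse neighbor search would be needed.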
c, d Integrated clusters (iClusters) after running bindSC on the CyTOF and the scRNA data. Notably, the bulk of CD4+ and CD8+ T cells that were mixed together in the RNA data became well separated in the co-embedding. Libraries were quantified, loaded on a NovaSeq 6000, and run with the following parameters: 151, 8, 8, and 151 bp. Then, the updated matrix \({\boldsymbol{P}}_{\boldsymbol{z}}\) could be solved based on the following equation: subject to \({\boldsymbol{U}}^{\prime}\boldsymbol{U}=\boldsymbol{I},{\boldsymbol{S}}^{\prime}\boldsymbol{S}=\boldsymbol{I},{\boldsymbol{T}}^{\prime}\boldsymbol{T}=\boldsymbol{I},{\boldsymbol{V}}^{\prime}\boldsymbol{V}=\boldsymbol{I},{\left\Vert {\boldsymbol{P}}_{\boldsymbol{z}}\right\Vert}_F^2=1\), in which \({n}_l={\left\Vert {\boldsymbol{P}}_{\boldsymbol{X}}\boldsymbol{U}{\boldsymbol{S}}^{\prime}\right\Vert}_F^2\) and \({n}_r={\left\Vert \boldsymbol{T}{\boldsymbol{V}}^{\prime }{\boldsymbol{P}}_{\boldsymbol{Y}}\right\Vert}_F^2\) represent scale factors for the two objectives, and \({\boldsymbol{P}}_{\boldsymbol{z}\mathbf{0}}\) is normalized as \({\boldsymbol{P}}_{\boldsymbol{z}\mathbf{0}}:= {\boldsymbol{P}}_{\boldsymbol{z}\mathbf{0}}/{\left\Vert {\boldsymbol{P}}_{\boldsymbol{z}\mathbf{0}}\right\Vert}_F^2\). Taking the integration of genes with ATAC peaks as an example, \(g_1, \ldots, g_M\) represent the gene expression levels and \(p_1, \ldots, p_N\) represent the ATAC peaks, with MN. It is worth noting that mathematically, Z may be defined in two ways depending on which modality is used as X. Finally, we obtained an scRNA-seq data matrix (1341 cells × 33,538 genes) and a CyTOF data matrix (59,510 cells × 29 proteins) from the three groups. Peak-gene links inferred from bipolar cells clustered by subtypes. 
Coupled NMF shares a similar concept with LIGER [28]; Harmony projects cells onto a shared embedding using principal component analysis (PCA) and removes batch effects iteratively [21]. Motivation: The advent of multi-modal single-cell sequencing techniques has shed new light on molecular mechanisms by simultaneously inspecting transcriptomes, epigenomes and proteomes of the same cell. To address this challenge, bi-CCA introduces a modality fusion matrix Z to link X and Y (Fig. Step 1: initializing a modality fusion matrix Z linking the two modalities (Methods). However, it is challenging to learn the joint representations from the multimodal data, model the relationship between modalities, and, more importantly, incorporate the vast amount of single . (11) is solved with a direct analogy of Eq. LIGER and Harmony have worse accuracy. To address this computational challenge, bindSC implements the divide-and-conquer eigenvalue algorithm. This approach quantifies accessibility variation across single cells by aggregating accessible regions containing a specific TF motif. To examine bindSC's scalability on large-scale datasets, we created ten benchmark datasets with cell numbers ranging from 22,552 to 834,424 by resampling cells in the fetal muscle atlas (Additional file 1: Fig. Validations using co-assayed ground truth data. c, d Cell types identified respectively using bindSC (c) and based on gene activity scores in the original publication (d). 
We present a novel mathematical solution named bi-order canonical correlation analysis (bi-CCA), which extends the widely used CCA approach to iteratively align the rows and the columns between data matrices and is generally applicable to combinations of any two single-cell modalities. As shown in the co-embedding UMAPs (Fig. We then examined the ATAC profiles and found that cells of the same cell type were largely clustered together (ARI = 0.71), although the clusters were not as distinct. Take a subproblem from Eq. For visualization purposes, we further keep no more than five genes for each individual protein. Each dot corresponds to the accuracy of one known marker gene (The full gene list is shown in Additional file 1: Fig. In addition, although we did not observe obviously mismatched clusters because most datasets we studied are derived from biological samples of limited heterogeneity, it is possible to observe modality-specific clusters that cannot be well aligned by bi-CCA. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. Data from all samples were divided by these maximum values. Based on pre-aligned features generated by such empirical rules, Seurat integration (referred to as Seurat hereafter; not to be confused with the weighted nearest neighbor (WNN) approach introduced in Seurat v4 for clustering co-assayed data) applies canonical correlation analysis (CCA) and mutual nearest neighbors (MNNs) to identify cells anchoring the two data matrices [20]; LIGER uses an integrative non-negative matrix factorization (iNMF) to delineate shared and dataset-specific features [22]. The authors would like to thank Yuanxin Wang, Linghua Wang, Tapsi Kumar, Runmin Wei, Nicholas Navin, Traver Hart, John Weinstein, and Hussein Abbas for their comments. 
Selection of allows us to adjust the size of modality-specific effects to reflect the divergence of the datasets being analyzed, and selection of couple coefficient depends on whether the initial \(Z^{(0)}\) can represent the true gene score of Y. The feature dimensionality of the matrix Y is usually more than 100,000 for single-cell epigenetic profiles, which requires more time and memory for integration. Therefore, the maximal memory usage of bindSC is independent of the total cell number (Additional file 1: Fig. 10). For each surface marker, the maximum intensity observed over the 99.5th percentile across all samples was excluded to avoid high-intensity outliers. In addition, for scRNA-seq and scATAC-seq data, \(Z^{(0)}\) can be inferred differently using the regulatory potential (RP) model in MAESTRO [19], or the co-accessibility model in Cicero (Additional file 1: Fig. Joint clustering, label transfer and network inference can be done in the latent space (Fig. This highlights the challenge of performing de novo gene regulation network inference from scATAC-seq data. The scale factor is introduced to balance the importance of each modality and it ranges from 0 to 1. RC, YL, XC, SK, and JC contributed to mouse retina 10x Genomics ATAC+RNA data generation and curation. The standard singular value decomposition (SVD) can be implemented to obtain the canonical correlation vectors (CCVs) (Algorithm 1) at the cell level to approximate CCVs for the left term: Once CCV pairs (U, S) and (T, V) are obtained, the modality fusion matrix Z can be updated as follows: In Eq. 
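To make the cell-level SVD step mentioned above more concrete, here is a minimal NumPy sketch of obtaining canonical correlation vectors from the cross-product of two matrices that share the same features. It follows the simplified "diagonal" CCA commonly used in single-cell integration (standardize features, then take the SVD of the cell-by-cell cross-product); that this matches Algorithm 1 of bindSC is an assumption, and the function names are hypothetical. It covers only one SVD step, not the iterative update of Z.

```python
import numpy as np


def standardize_features(mat):
    """Center and scale each feature (row) across cells (columns)."""
    centered = mat - mat.mean(axis=1, keepdims=True)
    sd = centered.std(axis=1, keepdims=True)
    return centered / np.where(sd > 0, sd, 1.0)


def cell_level_cca(X, Z, n_components=10):
    """Approximate cell-level CCVs for two matrices with matched features.

    X : array of shape (L features, M cells), first modality.
    Z : array of shape (L features, N cells), second modality expressed
        in the feature space of the first (e.g. a gene score matrix).
    Returns (U, s, V) with U of shape (M, k) and V of shape (N, k).
    """
    Xs = standardize_features(np.asarray(X, dtype=float))
    Zs = standardize_features(np.asarray(Z, dtype=float))

    # M x N cell-cell cross-product; building it costs O(M * N * L).
    K = Xs.T @ Zs

    # Left and right singular vectors act as the canonical vectors.
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    k = min(n_components, s.shape[0])
    return U[:, :k], s[:k], Vt[:k].T
```

The explicit M x N matrix is what makes this step expensive for large cell numbers, which is the cost the text describes as O(MNL).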
The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. Advances in high-throughput single-cell technology such as single-cell RNA-sequencing (scRNA-seq) [1] and mass cytometry [2] have enabled systematic delineation of cell types based on thousands to millions of cells sampled from developing organisms or patient biopsies [3, 4]. Integrating single-cell RNA and protein data from the same sample can potentially achieve higher resolution characterization and enable discovery of novel cellular states and associated regulatory signatures. The divide part first splits cells into different blocks specified by users, which can be solved in parallel with lower memory usage (Additional file 1: Fig. After 3 iterations, the R value increased to 0.5; meanwhile, the value between the imputed and the initial profile decreased to as low as 0.2, indicating the power of associating full peak profiles to genes in a de novo fashion, rather than utilizing reduced profiles (Fig. Bi-CCA makes two assumptions: (1) the two sets of cells are sampled uniformly from the same biological sample and (2) the features of the two datasets are linearly correlated. Because bindSC works with the full ATAC profile, it has the power to better establish the relationship between the RNA and the ATAC features, including potentially distal relationships. We then tested bindSC on this task. CD56+ NK cells were isolated from cord blood units, which were provided by the MDACC Cord Blood Bank. 
A normalized cell-type score (ranges from 0 to 1 and sums up to 1) for each cell is returned. Motif-based transcription factor (TF) analysis of bipolar cells (BCs) based on bindSC integration. CyTOF (a) and scRNA-seq (b) are emphasized, respectively, for better visualization. To further understand the utility of bindSC, we applied it to integrate scRNA-seq and CyTOF data generated from an immunotherapy study. f Consistency between cell types computationally inferred from ATAC profiles by bindSC, Seurat v3.0, LIGER, and Harmony, respectively, with the ground truth. This dataset was generated using the CITE-seq technology [45] and included 30,672 cells that have joint profiles of RNA and a panel of 25 antibodies. S8d), again indicating the power of associating two modalities de novo. As a result, intensity values for each marker ranged from 0 to 1. KC conceptualized and supervised the project. The DEGs are also enriched in biological processes related to extracellular matrix organization (\(p<10^{-4}\)), regulation of exocytosis (\(p<10^{-4}\)) and platelet degranulation (\(p<10^{-4}\)). volume 23, Article number: 112 (2022). Bi-CCA outputs canonical correlation vectors (CCVs), which project cells from two datasets onto a shared latent space (hereafter co-embedding). Let a(i) be the average distance of cell i to all other cells within its cluster and b(i) the average distance of i to all cells in the nearest cluster to which cell i does not belong. 
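The quantities a(i) and b(i) defined above are the ingredients of the standard Silhouette score, s(i) = (b(i) - a(i)) / max(a(i), b(i)). The following is a minimal NumPy sketch of that standard definition, given here for illustration; the function name is hypothetical and the text does not say which implementation the authors used.

```python
import numpy as np


def silhouette_scores(coords, labels):
    """Per-cell Silhouette scores from coordinates and cluster labels.

    coords : array of shape (n_cells, n_dims), e.g. a co-embedding.
    labels : array of length n_cells with at least two distinct clusters.
    """
    coords = np.asarray(coords, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    if clusters.size < 2:
        raise ValueError("at least two clusters are required")

    # Pairwise Euclidean distances between all cells.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    scores = np.zeros(coords.shape[0])
    for i in range(coords.shape[0]):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton cluster: score is 0 by convention
        # a(i): mean distance to the other cells of the same cluster.
        a = dist[i, own].sum() / (n_own - 1)
        # b(i): smallest mean distance to the cells of any other cluster.
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return scores
```

Averaging the returned values over all cells gives a single summary score; values close to 1 indicate well separated cell types, and negative values indicate cells that sit closer to another cluster than to their own.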
Protein marker expression level for integrated clusters from bindSC. Unsupervised clustering of the RNA profiles revealed cell types largely consistent with those in the protein profiles, except for some noticeable differences (Fig. Briefly, scRNA-seq data were pre-processed using the default Cell Ranger pipeline recommended by 10x Genomics. After incubation for 1 h at 37 °C, the transposed nuclei were combined with barcoded gel beads, RT mix, and partition oil on Chromium to generate gel beads in emulsion (GEMs). That means the two modalities may not have evenly represented molecular heterogeneity in the sample, violating the second assumption. Jinzhuang Dou and Shaoheng Liang contributed equally to this work. Correspondence to Ken Chen. When the two sets of cells are sampled uniformly from the same biological sample, it is safe to assume that there exists an optimal alignment of them. a-c UMAPs of 30,672 human bone marrow cells based on abundance of the 25 surface proteins (a), RNA expression levels of 3000 highly variable genes (b), and RNA expression levels of the 25 protein-coding genes (c). Steps 2 and 3 are performed iteratively to optimize Z. c Based on canonical correlation vectors (CCVs) in the derived latent space, bindSC can (1) jointly cluster cells in both modalities to define cell types and (2) transfer labels from one modality to another modality. Barbara Cheifet and Stephanie McClelland were the primary editors of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team. The overall Pearson correlation was calculated by treating X and Z as vectors. Affiliations: 1, Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, USA. 
However, they only use intramodal pairwise cell-cell distance information and discard intermodal, trans-acting feature interaction. Dou J, Liang S, Chen K. biCCA: bi-order multimodal integration of single-cell data: GitHub; 2022. https://github.com/KChen-lab/bindSC.git. The bindSC workflow for creating in silico single-cell multi-omics embeddings consists of four steps: individual dataset preprocessing, including variable feature selection and cell clustering; initializing feature matching across modalities (i.e., constructing the modality fusion matrix); identifying cell correspondence using the bi-CCA algorithm. In the single-cell domain, bindSC can clearly be applied to align cells and features simultaneously, which is important for ongoing investigations in the Human Cell Atlas [50], the NIH HubMap [51], the Human Tumor Cell Network [52], and on remodeling of the tumor microenvironment [53]. Given that we use the RNA annotation as the gold standard, the ARI for RNA clustering in (a) is 1. d, e UMAPs generated from the bindSC-integrated snRNA and snATAC co-embeddings. We used a CITE-seq dataset consisting of 30,672 human bone marrow cells with a panel of 25 proteins [20].
