Tutorial 1: data integration for human lymph node (10x Genomics Visium, in-house data)

In this tutorial, we demonstrate how to apply SpatialGlue to integrate human lymph node data to decipher tissue heterogeneity. The data is generated using 10x Genomics Visium RNA and protein co-profiling technology. As a example, we analyse a human lyph node dataset. The dataset is publicly assessible (https://zenodo.org/records/10362607). According to histological image, our clinician collaborator manually annotated the tissue structures.

It’s important to note that the SpatialGlue model is based on PyTorch framework and might exhibit slight sensitivity to random seed or GPU enviroment. However, with larger input data (referring to the number of spots), the impact of these variations dimished. Furthermore, under consistent implementation enviroments, SpatialGlue yields reproducible results.

Before running the model, please download the input data via https://drive.google.com/drive/folders/1RlU3JmHg_LZM1d-o6QORvykYPoulWWMI. With the link, we can also obtain the annotation file.

Loading package

[50]:

import os
import torch
import pandas as pd
import scanpy as sc

[51]:

import SpatialGlue

[52]:

# Environment configuration. SpatialGlue pacakge can be implemented with either CPU or GPU. GPU acceleration is highly recommend for imporoved efficiency.
device = torch.device('cuda:2' if torch.cuda.is_available() else 'cpu')

# the location of R, which is required for the 'mclust' algorithm. Please replace the path below with local R installation path
os.environ['R_HOME'] = '/scbio4/tools/R/R-4.0.3_openblas/R-4.0.3'

Loading data

[53]:

# read data
file_fold = '/home/yahui/anaconda3/work/SpatialGlue_revision/data/Dataset11_Lymph_Node_A1/' #please replace 'file_fold' with the download path

adata_omics1 = sc.read_h5ad(file_fold + 'adata_RNA.h5ad')
adata_omics2 = sc.read_h5ad(file_fold + 'adata_ADT.h5ad')

adata_omics1.var_names_make_unique()
adata_omics2.var_names_make_unique()

/home/yahui/anaconda3/envs/STGAT/lib/python3.8/site-packages/anndata/_core/anndata.py:1830: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
/home/yahui/anaconda3/envs/STGAT/lib/python3.8/site-packages/anndata/_core/anndata.py:1830: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")

[54]:

# Specify data type
data_type = '10x'

# Fix random seed
from SpatialGlue.preprocess import fix_seed
random_seed = 2022
fix_seed(random_seed)

Pre-processing data

SpatialGlue adopts standard pre-processing steps for the transcriptomic and protein data. Specifically,for the transcriptomics data,the gene expression counts are log-transformed and normalized by library size via the SCANPY package. The top 3,000 highly variable genes (HVGs) are selected as input of PCA for dimension reduction. To ensure a consistent input dimension with the protein data, the first k (number of proteins) principal components are retained and used as the input of the model. The protein expresssion counts are normliazed using CLR (Centered Log Ratio). All principal components after PCA dimension reduction are used as the input of the model.

[55]:

from SpatialGlue.preprocess import clr_normalize_each_cell, pca


# RNA
sc.pp.filter_genes(adata_omics1, min_cells=10)
sc.pp.highly_variable_genes(adata_omics1, flavor="seurat_v3", n_top_genes=3000)
sc.pp.normalize_total(adata_omics1, target_sum=1e4)
sc.pp.log1p(adata_omics1)
sc.pp.scale(adata_omics1)

adata_omics1_high =  adata_omics1[:, adata_omics1.var['highly_variable']]
adata_omics1.obsm['feat'] = pca(adata_omics1_high, n_comps=adata_omics2.n_vars-1)

# Protein
adata_omics2 = clr_normalize_each_cell(adata_omics2)
sc.pp.scale(adata_omics2)
adata_omics2.obsm['feat'] = pca(adata_omics2, n_comps=adata_omics2.n_vars-1)

Constructing neighbor graph

To preserve the physical proximity between spots and phenotypic proximity of spots which have the same cell types but are spatially non-adjacent to each other, we construct two neighbors graphs, i.e., spatial graph and feature graph. Spatial graph is constructed based on spatial information while feature graph is constructed based on expression profiles.

[56]:

from SpatialGlue.preprocess import construct_neighbor_graph
data = construct_neighbor_graph(adata_omics1, adata_omics2, datatype=data_type)

Training the model

The SpatialGlue model aims to learn an integrated latent representation by adaptively integrating expression profiles of different omics modalities in a spatially aware manner.

After model training, SpatialGlue returns ‘output’ file. The ‘output’ file include multiple output results. Let’s go through each of the results in more detail:

Latent Representations:

‘emb_latent_omics1’: latent representation for the first omics modality.
‘emb_latent_omics2’: latent representation for the second omics modality.
‘SpatialGlue’: joint representation learned by incorporating expression data and spatial location information.

The joint representation can be used for downstream analysis such as clustering, visualization, or identifying differentially expressed genes (DEGs).

Attention Weight Values:

‘alpha_omics1’: intra-modality attention weight for the first omics modality, explaining the contribution of each graph to each cluster.
‘alpha_omics2’: intra-modality attention weight for the second omics modality, explaining the contribution of each graph to each cluster.
‘alpha’: inter-modality attention weight explaining the contribution of each modality to each cluster.

These intra- and inter-modality attention weights provide interpretable insights into the importance of each neighborhood graph and modality to each cluster.

[57]:

# define model
from SpatialGlue.SpatialGlue_pyG import Train_SpatialGlue
model = Train_SpatialGlue(data, datatype=data_type, device=device)

# train model
output = model.train()

  0%|                                                                                                                                                                                 | 0/200 [00:00<?, ?it/s]/home/yahui/anaconda3/envs/STGAT/lib/python3.8/site-packages/torch/nn/functional.py:1933: UserWarning: nn.functional.tanh is deprecated. Use torch.tanh instead.
  warnings.warn("nn.functional.tanh is deprecated. Use torch.tanh instead.")
/home/yahui/anaconda3/envs/STGAT/lib/python3.8/site-packages/SpatialGlue/model.py:212: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  self.alpha = F.softmax(torch.squeeze(self.vu) + 1e-6)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:05<00:00, 38.16it/s]

Model training finished!

[58]:

adata = adata_omics1.copy()
adata.obsm['emb_latent_omics1'] = output['emb_latent_omics1'].copy()
adata.obsm['emb_latent_omics2'] = output['emb_latent_omics2'].copy()
adata.obsm['SpatialGlue'] = output['SpatialGlue'].copy()
adata.obsm['alpha'] = output['alpha']
adata.obsm['alpha_omics1'] = output['alpha_omics1']
adata.obsm['alpha_omics2'] = output['alpha_omics2']

Cross-omics integrative analysis

After integration, we perform clustering analysis using the joint representation. Here we provid three optional kinds of tools for clustering, including mclust, leiden, and louvain. In our experiment, we find ‘mclust’ algorithm performs better than ‘leiden’ and ‘louvain’ on spatial data in most cases. Therefore, we recommend using ‘mclust’ algorithm for clustering.

[59]:

# we set 'mclust' as clustering tool by default. Users can also select 'leiden' and 'louvain'
from SpatialGlue.utils import clustering
tool = 'mclust' # mclust, leiden, and louvain
clustering(adata, key='SpatialGlue', add_key='SpatialGlue', n_clusters=6, method=tool, use_pca=True)

fitting ...
  |======================================================================| 100%

[60]:

# visualization
import matplotlib.pyplot as plt
fig, ax_list = plt.subplots(1, 2, figsize=(7, 3))
sc.pp.neighbors(adata, use_rep='SpatialGlue', n_neighbors=10)
sc.tl.umap(adata)

sc.pl.umap(adata, color='SpatialGlue', ax=ax_list[0], title='SpatialGlue', s=20, show=False)
sc.pl.embedding(adata, basis='spatial', color='SpatialGlue', ax=ax_list[1], title='SpatialGlue', s=25, show=False)

plt.tight_layout(w_pad=0.3)
plt.show()

/home/yahui/anaconda3/envs/STGAT/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/home/yahui/anaconda3/envs/STGAT/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(

_images/Tutorial_1_data_integration_for_human_lymph_node_(10x_Genomics_Visium,_in-house_data)_21_1.png

[61]:

# plotting modality weight values.
import pandas as pd
import seaborn as sns
plt.rcParams['figure.figsize'] = (5,3)
df = pd.DataFrame(columns=['RNA', 'protein', 'label'])
df['RNA'], df['protein'] = adata.obsm['alpha'][:, 0], adata.obsm['alpha'][:, 1]
df['label'] = adata.obs['SpatialGlue'].values
df = df.set_index('label').stack().reset_index()
df.columns = ['label_SpatialGlue', 'Modality', 'Weight value']
ax = sns.violinplot(data=df, x='label_SpatialGlue', y='Weight value', hue="Modality",
                split=True, inner="quart", linewidth=1, show=False)
ax.set_title('RNA vs protein')
ax.set_xlabel('SpatialGlue label')
ax.legend(bbox_to_anchor=(1.4, 1.01), loc='upper right')

plt.tight_layout(w_pad=0.05)
#plt.show()

_images/Tutorial_1_data_integration_for_human_lymph_node_(10x_Genomics_Visium,_in-house_data)_22_0.png

[ ]: