Package Tutorial 1

1. Video introduction

scFEA has been highlighted in recomb 2021. For more details, please see our presentation video at link.

2. Installation

scFEA is implemented by Python 3. If you don't have Python, please download Anaconda with Python 3 version.

torch >= 0.4.1
numpy >= 1.15.4
pandas >= 0.23.4
matplotlib >= 3.0.2
magic >= 2.0.4

Download scFEA:

git clone https://github.com/changwn/scFEA

Install requirements:

cd scFEA

conda install --file requirements

conda install pytorch torchvision -c pytorch

pip install --user magic-impute

3. Verify Installation

To ensure that the scFEA is installed correctly, run below code. If you can see help documentation, the software was installed successfully.

%%bash
cd /Users/chang/Documents/work/flux/scFEA
python src/scFEA.py --help

usage: scFEA.py [-h] [--data_dir <data_directory>]
                     [--input_dir <input_directory>] [--res_dir <data_directory>]
                     [--test_file TEST_FILE] [--moduleGene_file MODULEGENE_FILE]
                     [--stoichiometry_matrix STOICHIOMETRY_MATRIX]
                     [--cName_file CNAME_FILE] [--sc_imputation {True,False}]
                     [--output_flux_file OUTPUT_FLUX_FILE]
                     [--output_balance_file OUTPUT_BALANCE_FILE]

scFEA: A graph neural network model to estimate cell-wise metabolic flux using single cell RNA-seq data.

optional arguments:
    -h, --help            show this help message and exit
    --data_dir <data_directory>
                            The data directory for scFEA model files.
    --input_dir <input_directory>
                            The data directory for single cell input data.
    --res_dir <data_directory>
                            The data directory for result [output]. The output of
                            scFEA includes two matrices, predicted metabolic flux
                            and metabolites stress at single cell resolution.
    --test_file TEST_FILE
                            The test SC file [input]. The input of scFEA is a
                            single cell profile matrix, where row is gene and
                            column is cell. Example datasets are provided in
                            /data/ folder. The input can be raw counts or
                            normalised counts. The logarithm would be performed if
                            value larger than 30.
    --moduleGene_file MODULEGENE_FILE
                            The table contains genes for each module. We provide
                            human and mouse two models in scFEA. For human model,
                            please use module_gene_m168.csv which is default. All
                            candidate moduleGene files are provided in /data/
                            folder.
    --stoichiometry_matrix STOICHIOMETRY_MATRIX
                            The table describes relationship between compounds and
                            modules. Each row is an intermediate metabolite and
                            each column is metabolic module. For human model,
                            please use cmMat_171.csv which is default. All
                            candidate stoichiometry matrices are provided in
                            /data/ folder.
    --cName_file CNAME_FILE
                            The name of compounds. The table contains two rows.
                            First row is compounds name and second row is
                            corresponding id.
    --sc_imputation {True,False}
                            Whether perform imputation for SC dataset (recommend
                            set to <True> for 10x data).
    --output_flux_file OUTPUT_FLUX_FILE
                            User defined predicted flux file name.
    --output_balance_file OUTPUT_BALANCE_FILE
                            User defined predicted balance file name.

4. Run scFEA

4.1 example 1 human complete modules

This example run scFEA using Pa03c data on human complete modules.

%%bash
cd /Users/chang/Documents/work/flux/scFEA
python src/scFEA.py --data_dir data --input_dir input \
                    --test_file Pa03c.csv \
                    --moduleGene_file module_gene_m168.csv \
                    --stoichiometry_matrix cmMat_c70_m168.csv

Starting load data...
Load compound name file, the balance output will have compound name.
Load data done.
Starting process data...
Process data done.
Starting train neural network...
Training time:  15.639593124389648
scFEA job finished. Check result in the desired output folder.
100%|██████████| 100/100 [00:15<00:00,  6.40it/s]

4.2 example 2 mouse complete modules

This example run scFEA with a mouse example data on mouse complete modules. Also, we set sc_imputation as True to enable imputation process.

%%bash
cd /Users/chang/Documents/work/flux/scFEA
python src/scFEA.py --data_dir data --input_dir input \
                    --test_file mouse_example_data.csv \
                    --moduleGene_file module_gene_complete_mouse_m168.csv \
                    --stoichiometry_matrix cmMat_complete_mouse_c70_m168.csv \
                    --sc_imputation True \
                    --output_flux_file output/mouse_flux.csv \
                    --output_balance_file output/mouse_balance.csv

Calculating MAGIC...
Starting load data...
    Running MAGIC on 5659 cells and 678 genes.
    Calculating graph and diffusion operator...
    Calculating PCA...
    Calculated PCA in 0.36 seconds.
    Calculating KNN search...
    Calculated KNN search in 4.47 seconds.
    Calculating affinities...
    Calculated affinities in 4.45 seconds.
    Calculated graph and diffusion operator in 9.30 seconds.
    Calculating imputation...
    Calculated imputation in 0.24 seconds.
Calculated MAGIC in 9.56 seconds.
Load compound name file, the balance output will have compound name.
Load data done.
Starting process data...
Process data done.
Starting train neural network...
Training time:  1200.0908679962158
scFEA job finished. Check result in the desired output folder.
100%|██████████| 100/100 [20:00<00:00, 12.00s/it]

4.3 example 3 iron ion modules

This example run scFEA using GSE72056 on small metabolic system which is iron ion modules.

%%bash
cd /Users/chang/Documents/work/flux/scFEA
python src/scFEA.py --data_dir data --input_dir input \
                    --test_file GSE72056.csv \
                    --moduleGene_file module_gene_iron_m15.csv \
                    --stoichiometry_matrix cmMat_iron_c8_m15.csv \
                    --cName_file cName_iron_c8_m15.csv

Starting load data...
Load compound name file, the balance output will have compound name.
Load data done.
Starting process data...
Process data done.
Starting train neural network...
Training time:  151.6518177986145
scFEA job finished. Check result in the desired output folder.
100%|██████████| 100/100 [02:31<00:00,  1.52s/it]

4.4 example 4 glutaminolysis modules

This example run scFEA using GSE72056 on small metabolic system which is glutaminolysis modules.

%%bash
cd /Users/chang/Documents/work/flux/scFEA
python src/scFEA.py --data_dir data --input_dir input \
                    --test_file GSE72056.csv \
                    --moduleGene_file module_gene_glutaminolysis1_m23.csv \
                    --stoichiometry_matrix cmMat_glutaminolysis1_c17_m23.csv \
                    --cName_file cName_glutaminolysis1_c17_m23.csv \
                    --output_flux_file output/GSE72056_gluta_flux.csv \
                    --output_balance_file output/GSE72056_gluta_balance.csv

Starting load data...
Load compound name file, the balance output will have compound name.
Load data done.
Starting process data...
Process data done.
Starting train neural network...
Training time:  154.2574908733368
scFEA job finished. Check result in the desired output folder.
100%|██████████| 100/100 [02:34<00:00,  1.54s/it]

5. Check output files and loss function convergence

We could use ls command to see the output files after we run glutaminolysis experiment.

%%bash
cd /Users/chang/Documents/work/flux/scFEA/output
ls

GSE72056_gluta_balance.csv
GSE72056_gluta_flux.csv
lossValue_20210923-112543.txt
loss_20210923-112543.png
time_20210923-112543.txt

from IPython.display import Image
Image(filename='./scFEA/output/loss_20210923-112543.png') # make sure use correct file with time stamp

6. Downstream Analysis

(R package scFEA coming soon which includes downstream analysis and visualization of flux)

# use rpy2 to enable R and python in one jupyter notebook
%load_ext rpy2.ipython

%%R
library(tidyverse)
library(rstatix)
library(ggpubr)
library(reshape)
library(ggridges)
library(ggplot2)

# load predicted flux
data_c <- read.csv('./scFEA/output/mouse_flux.csv')
data_c0 <- as.matrix(data_c[,-1])
rownames(data_c0) <- as.character(data_c[,1])
data_c0 <- t(data_c0)
ppp_all <-c()

# load cell label of mouse example data
load('./scFEA/input/mouse_example_cell_ident.RData')
yyy <- cell_id[colnames(data_c0)]
for(ii in 1:nrow(data_c0)){
    xxx <- data_c0[ii,]
    final_df <- cbind(paste('X', 1:length(xxx), sep=''), xxx, yyy)
    final_df <- as.data.frame(final_df)
    final_df[,2] <- as.numeric(final_df[,2])
    colnames(final_df) <- c('var', 'flux', 'cellType')
    pp <- sd(final_df$flux)/abs(mean(final_df$flux))
    ppp_all <- c(ppp_all, pp)
}
tg_ids <- which(ppp_all > 1e-10)

# load mouse module info
load('./scFEA/input/mouse_module_info.RData')
if(length(tg_ids) > 0){
    jj = 1 # check module 2
    xxx <- data_c0[tg_ids[jj], ]
    #final_df <- cbind(paste('X', 1:length(xxx), sep=''), xxx, yyy)
    #final_df <- as.data.frame(final_df)
    #final_df[,2] <- as.numeric(final_df[,2])
    #colnames(final_df) <- c('var', 'flux', 'cellType')
    final_df <- data.frame(var = paste('X', 1:length(xxx), sep=''),
                            flux = xxx,
                            cellType = yyy)
        
    title <- mouse_module_info[rownames(data_c0)[tg_ids[jj]], 'M_name']
    aa <- ggplot(final_df, aes(x = flux, y = cellType, fill = cellType)) +
                    geom_density_ridges() +
                    theme_ridges() +
                    theme(legend.position = 'none') +
                    ggtitle(title) +
                    theme(plot.title = element_text(hjust = 0.5))
    plot(aa)
    
}

R[write to console]: Picking joint bandwidth of 0.00152

Package Tutorial 2

1. Introduction

In this tutorial, we will show how to embedding scFEA analysis into the general Seurat pipeline.

2. Demo

First, let's load and check original input.

# use rpy2 to enable R and python in one jupyter notebook
%load_ext rpy2.ipython

%%R
setwd('./scFEA')
gene_data <- read.csv('./input/GSE72056.csv', row.names = 1, header = T)
print(gene_data[1:5,1:2])

       Cy72_CD45_H02_S758_comb CY58_1_CD45_B02_S974_comb
ALDOC                 0.000000                         0
MAOA                  0.000000                         0
ASS1                519.368337                         0
AHCYL2                3.710014                         0
FUT4                  0.000000                         0

%%R
library(Seurat)
library(ggplot2)
obj <- CreateSeuratObject(counts = gene_data, project = "tutorial")
obj <- NormalizeData(obj, verbose = F)
all_genes <- rownames(obj)
obj <- FindVariableFeatures(obj, selection.method = "vst", nfeatures = 2000, verbose = F)
obj <- ScaleData(obj, features = all_genes, verbose = F)
obj <- RunPCA(obj, features = VariableFeatures(object = obj), verbose = F)
#DimPlot(obj, reduction = "pca") + ggtitle('PCA')

%%R
obj <- FindNeighbors(obj, dims = 1:10, verbose = F)
obj <- FindClusters(obj, resolution = 0.5, verbose = F)
obj <- RunTSNE(obj, dims = 1:10, verbose = F)
DimPlot(obj, reduction = "tsne") + ggtitle('tSNE')

Suppose we have finished a general Seurat analysis. We can prepare scFEA input data (gene expression) by extracting gene expression from Seurat object.

%%R
ExportGeneExpr <- obj@assays$RNA@counts
write.csv(ExportGeneExpr, file='./input/Seurat_geneExpr.csv', row.names = T)

%%bash
python src/scFEA.py --data_dir data --input_dir input \
                    --test_file Seurat_geneExpr.csv \
                    --moduleGene_file module_gene_glutaminolysis1_m23.csv \
                    --stoichiometry_matrix cmMat_glutaminolysis1_c17_m23.csv \
                    --cName_file cName_glutaminolysis1_c17_m23.csv \
                    --output_flux_file output/Seurat_gluta_flux.csv \
                    --output_balance_file output/Seurat_gluta_balance.csv

Starting load data...
Load compound name file, the balance output will have compound name.
Load data done.
Starting process data...
Process data done.
Starting train neural network...
Training time:  151.81953692436218
scFEA job finished. Check result in the desired output folder.
100%|██████████| 100/100 [02:31<00:00,  1.52s/it]

After scFEA done, we can read results into Seurat again to perform downstream analysis.

%%R
predFlux <- read.csv('./output/Seurat_gluta_flux.csv', header = T, row.names = 1)
predFlux <- data.matrix(predFlux)
predFlux0 <- t(predFlux)

# add flux as a new assay
obj[["FLUX"]] <- CreateAssayObject(counts = predFlux0)

R[write to console]: Warning:
R[write to console]:  Feature names cannot have underscores ('_'), replacing with dashes ('-')

Change default assay to FLUX to perfom cell clustering based on metabolic flux and plot tSNE.

%%R
DefaultAssay(obj) <- 'FLUX'
obj <- FindVariableFeatures(obj, selection.method = "vst", nfeatures = 2000, verbose = F)
obj <- ScaleData(obj, features = rownames(obj), assay = 'FLUX', verbose = F)
obj <- RunPCA(obj, features = VariableFeatures(object = obj), npcs = 10, reduction.name = 'pca.flux', verbose = F)
obj <- FindNeighbors(obj, dims = 1:2, verbose = F)
obj <- FindClusters(obj, resolution = 0.5, verbose = F)
obj <- RunTSNE(obj, dims = 1:2, assay = 'FLUX', reduction.name = "tsne.flux", verbose = F)
DimPlot(obj, reduction = "tsne.flux") + ggtitle('tSNE of Flux')

Now we plot the results from both RNA-seq and metabolic flux into one figure.

%%R
library(patchwork)
p1 <- DimPlot(obj, reduction = "tsne") + ggtitle('tSNE of Gene')
p2 <- DimPlot(obj, reduction = "tsne.flux") + ggtitle('tSNE of Flux')
p2 + p1

FLUXestimator

A webserver to estimate cell-/sample-wise metabolic fluxome by using scRNA-seq or general transcriptomics data