Preprocessing single-cell BCR sequencing and transcriptomic data

Quick notes on the different stages of processing single-cell sequencing data. Here I am mainly focusing on how we would process B-cell sequencing data using the 10X Genomics platform.

(If you are interested in the sequencing workflow before we get a fastq file, read this amazing blog post by 10X themselves.)

What is in a sequence?
Cell Ranger
Data Cleaning
Transcriptomic Analysis

What is in a sequence?

Diving straight into the processing pipeline after you have got the sequencing reads, I’ll first introduce how a sequence looks like at this point, and some of the quick commands to clean and standardise the sequences.

Each sequence is composed of two Illumina Adapters (tagging each “Reads”), a Unique molecular identifier (UMI), a cell barcode, a sample barcode and the sequence of interest. For each sequencing run, you have 2 reads: forward and backward, marked by the indices in the Illumina adapter/primer. The UMI is provided by the 10X kits, to mark each unique sequence. The cell barcode marks all the sequences from the same cell, whereas the sample barcode tags all the sequences of the same sample/run.

During contig assembly, small fragments of the unique sequence (tagged by the same UMI) in a cell are stitched back together – by aligning to a reference genome. This yields the following terminology:

barcode – unique for each cell
contig – unique sequence
annotation – VDJ information.

Cell Ranger

Cell Ranger provides a lot of functions to process the raw reads. For Immune Profiling,

mkfastq: from raw base call to FASTQ.
vdj: from FASTQ (of the V(D)J libraries) to assemble V(D)J transcripts (using UMIs). Output: Clonotypes and CDR3 sequences in a long list of fasta/fastq, csv, bam and vloupe files.
count: from FASTQ (of the 5′ Gene Expression and Feature Barcode libraries), and performs alignment, filtering, barcode and UMI counting.

Assuming you have downloaded the fastq files from the Sequence Read Archive,

#!/usr/bin/env bash
cellranger vdj 
    --id=D45_BCR 
    --fastqs=fastqs/D45_BCR/ 
    --sample=D45 
    --reference=../refdata-cellranger-vdj-GRCh38-alts-ensembl-3.1.0_human/ 
    --localcores=9 
    --localmem=200

N.B. The pair of fastq.gz files should be in the format of: D45_S1_L001_R1_001.fastq.gz and D45_S1_L001_R2_001.fastq.gz. Sometimes the downloadables from SRA are renamed differently.

To work on residue-level, we need igblastn to translate the output filtered_contig.fasta into amino acids.

#!/usr/bin/env bash
DBPATH="/path/to/ncbi-igblast-1.15.0/yourdb/human"
vdb=${DBPATH}/human_V
ddb=${DBPATH}/human_D
jdb=${DBPATH}/human_J
species="human"

query=/path/to/study/filtered_contig.fasta
output_file=/path/to/study/filtered_contig_output.igblastn

PATH="/path/to/ncbi-igblast-1.15.0/":$PATH

IGBLASTNLOC="/path/to/ncbi-igblast-1.15.0" # Somewhere with a folder called "internal_data" and "optional_file"

igblastn -germline_db_V=${vdb} \
        -germline_db_D=${ddb} \
        -germline_db_J=${jdb} \
        -organism=${species} \
        -show_translation \
        -num_alignments_D=1 \
        -num_alignments_V=1 \
        -num_alignments_J=1 \
        -ig_seqtype=Ig \
        -num_clonotype=0 \
        -outfmt=19 \
        -auxiliary_data=${IGDATA}/optional_file/${species}_gl.aux \
        -query=${query} \
        -out=${output_file} \
        -num_threads=5

For non-BCR and non-transcriptomic sequences, you would carry out the following steps:

Use fastp for quality and length filtering. It will tell you the expected insert size.
Remove redundant sequences (i.e. deduplicate) with fastx_collapser.

#!/usr/bin/env bash
# Find insert size
fastp -i fastqs/D45_AgBarcode_R1_01.fastq.gz 
        -I fastqs/D45_AgBarcode_R2_01.fastq.gz 
        -o fastqs/D45_AgBarcode_R1_01_out.fastq.gz 
        -O fastqs/D45_AgBarcode_R2_01_out.fastq.gz
# Deduplicate
gunzip -c D45_AgBarcode_R1_01_out.fastq.gz | fastx_collapser -v -o D45_AgBarcode_R1_01_out_dedup.fastq
# Zip
gzip D45_AgBarcode_R1_01_out_dedup.fastq

N.B. In the snippet above I prevented the original D45_AgBarcode_R1_01.out.fastq.gz from completely unzipping (large file!), by gunzip -c with a bar (|).

For transcriptomic analysis, we use cellranger count to process the data.

#!/usr/bin/env bash
cellranger count 
    --id=D45_BCR 
    --transcriptome=refdata-gex-GRCh38-2020-A
    --fastqs=fastqs/D45_BCR/ 
    --sample=D45 
    --localcores=9 
    --localmem=200

The output contains filtered_feature_bc_matrix_h5.h5 which will be used in the following pipeline.

Data Cleaning

Disclaimer: Most of the processing code snippets were adapted from https://github.com/felixhorns/SingleBCell-FluVaccine.

A lot of things can introduce noise to the data, so we apply a few filters to remove noise and extract signals.

First we look at the overall summary statistics of each sample in metrics_summary.csv. It tells us the following and we would check if any lanes fall outside of the “normal” range:

“Number of Read Pairs” (log scale)
“Estimated Number of Cells”
“Mean Read Pairs per Cell” (log scale)
“Cells With Productive IGH Contig”

#!/usr/bin/env python
summaryfile = os.path.join(data_dir,"cell_ranger_metrics_summary.csv.gz")
df_summary = pd.read_csv(summaryfile,header=0)

After that we can dive into the all_contig_annotations.csv, to find singlet cells with exactly 1 “productive” heavy chain IGHV, and 1 “productive” light chain: IGKV/IGLV.

#!/usr/bin/env python
infile = os.path.join(data_dir,"all_contig_annotations.csv")
df_all_contig_annotations = pd.read_csv(infile, header=0)
# Count total contigs assembled per cell
df_contig_aggr = df_all_contig_annotations.groupby(["productive", "sample", "barcode", "chain"]).size().unstack(fill_value=0)
df_contig_aggr_filtered = df_contig_aggr.xs("True", level="productive")
# Filter for cells having exactly 1VH+1VL
df_singlets = df_contig_aggr_filtered.loc[(df_contig_aggr_filtered["IGH"] == 1) &
                                 (((df_contig_aggr_filtered["IGL"] == 1) & (df_contig_aggr_filtered["IGK"] == 0)) |
                                  ((df_contig_aggr_filtered["IGL"] == 0) & (df_contig_aggr_filtered["IGK"] == 1)))]

# Filter contigs for these cells
df_all_contig_annotations_valid = df_all_contig_annotations.set_index(["sample", "barcode"]).loc[df_singlets.index]

# Filter contigs for only IGH, IGL, or IGK; as some contains "Multi"
df_all_contig_annotations_valid = df_all_contig_annotations_valid.loc[df_all_contig_annotations_valid["chain"].isin(["IGH", "IGL", "IGK"])]

# Filter contigs for only productive (there can be non-productive sequences with the same contig)
df_all_contig_annotations_valid = df_all_contig_annotations_valid.loc[df_all_contig_annotations_valid["productive"] == "True"]

Transcriptomic analysis

There are many packages that we can use for transcriptomic analysis, such as scanpy.

At a basic level, scanpy comes with codes and visualsation tools that help dissect the patterns in the gene expression data. The analysis pipeline is composed of:

PCA / tSNE to reduce the dimension and find gene expression level that correlates to the developmental states of B-cells.

#!/usr/bin/env python
# tSNE
infile = os.path.join(data_dir,"filtered_gene_bc_matrices_h5.h5")
adata = sc.read_10x_h5(infile, genome="GRCh38")
genes_markers = ["CD19", "CD27", "MS4A1", "CD3E", "CD4", "LYZ", "TCL1A", "IL4R", "AICDA"]

sc.pl.tsne(adata, color=genes_markers[0:3])

Load isotype annotation

A dictionary of barcode: H/L contig_id from cell_annotations_contigs.csv.gz is matched to a dictionary of H/L contig_id: VDJ/C annotation from all_contig_annotations.csv.gz. In this case, we pull out the c_gene because it tells us which isotype the B cell belongs to (e.g. IGHM, IGHG).

#!/usr/bin/env python
# Convert VDJ barcodes to GEX barcodes
# VDJ data: [sample] and [barcode] are separated
# GEX data: [barcode]-[sample ID] in a single column
sample_to_number = {"023-002_D7_Lane9": 1,
                        "023-002_D7_Lane10": 2,
                        "023-002_D7_Lane11": 3,
                        "023-002_D7_Lane12": 4}

def vdj_to_gex(row):
    sample_id = sample_to_number[row['sample']]
    barcode_seq = row['barcode'].split('-')[0]
    return f"{barcode_seq}-{sample_id}"
    
# Load cell barcode: contigs information
cell_contigs = pd.read_csv(os.path.join(data_dir,"cell_annotations_contigs.csv.gz"), header=0)
cell_contigs.rename(columns={'cell_annotations_contigs.csv':'sample'},inplace=True)
cell_contigs = cell_contigs[cell_contigs['sample'].isin(sample_to_number.keys())]
cell_contigs['barcode_GEX'] = cell_contigs[['sample','barcode']].apply(vdj_to_gex,axis=1)

# Load VDJ annotations of each contig
infile = os.path.join(data_dir,"all_contig_annotations.csv.gz")
contig_annotation = pd.read_csv(infile, header=0)
contig_annotation.rename(columns={"all_contig_annotations.csv":"sample"},inplace=True)
contig_annotation = contig_annotation[contig_annotation['sample'].isin(sample_to_number.keys())]
contig_annotation['barcode_GEX'] = contig_annotation[['sample','barcode']].apply(vdj_to_gex,axis=1)
contig_annotation.set_index("contig_id", inplace=True)

# Connect the two dataframes
# C gene contains the isotype information
def find_c_gene(chain_type, adata_obs_index):
    """
    Find C genes of the given chain type and adata.obs.index
    
    Parameters
    ----------
    chain_type : str
        chain type to be determined
        'H' or 'KL'
    adata_obs_index : adata.obs.index object
    
    Output
    ------
    list
        List of barcode-to-c_gene information in the order of the adata_obs_index
    """
    if chain_type == 'H':
        allowed_chains = ['IGH']
        cell_contigs_field = 'contig_id_IGH'
    else:
        allowed_chains = ['IGK','IGL']
        cell_contigs_field = 'contig_id_IGKL'
    # ensure each cell barcode only has one contig
    # barcode --> contig
    barcode_to_contig = dict(cell_contigs[cell_contigs["barcode_GEX"].isin(adata.obs.index)][["barcode_GEX",cell_contigs_field]].values)

    # contig --> vdj annotation
    contig_to_c_gene = dict(contig_annotation[(contig_annotation["barcode_GEX"].isin(adata.obs.index))&(contig_annotation["c_gene"].apply(lambda x: x[:3] in allowed_chains))].reset_index()[["contig_id","c_gene"]].values)
    
    # barcode --> vdj annotation
    barcode_to_c_gene = {b:contig_to_c_gene for b,c in barcode_to_contig.items() if c in contig_to_c_gene}
    return [barcode_to_c_gene[b] if b in barcode_to_c_gene else "Not detected" for b in adata_obs_index]

adata.obs["c_gene_IGH"] = find_c_gene(chain_type='H', adata_obs_index=adata.obs.index)
adata.obs["c_gene_IGKL"] = find_c_gene(chain_type='KL', adata_obs_index=adata.obs.index)

Filters for the following:
- Remove redundancy
- Remove genes that were not expressed
- Count the number of genes observed in each cell
- Normalise the counts.

#!/usr/bin/env python
# Make variable names unique: append a number in case any variable names are redundant
adata.var_names_make_unique()
# Remove genes that are not expressed
sc.pp.filter_genes(adata, min_cells=1)
# Calculate the number of genes observed per cell
sc.pp.filter_cells(adata, min_genes=0)

# Save raw log-transformed counts
adata.raw = sc.pp.log1p(adata, copy=True)
# Normalize total counts per cell
sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e6) # counts per million

Plotting!

You may want to inspect the basics of how many UMIs/genes are in each cell. Then you’d typically plot PCA and tSNE, coloured by different clustering. Some ideas would be their isotypes and class switching, genes markers (e.g. CD19, CD27) and the number of days after infection/vaccination. With PCA, we can also find out how much variance is explained by each principal component.

Here you go with the basic commands and data exploration angles for scBCR and transcriptomic datasets!