BCR repertoire analysis using ProtVec

Published: 20 August 2021 | Version 1 | DOI: 10.17632/37tz3dkzkv.1
Inyoung Kim


Analyzing B-cell receptor (BCR) repertoires is immensely useful in evaluating an individual’s immunological status. Conventionally, repertoire analysis methods have focused on comprehensive assessment of clonal composition, including V(D)J segment usage, nucleotide insertions and deletions, and amino acid distribution. Here, we introduce a novel computational approach that applies deep-learning-based protein embedding techniques to the analysis of BCR repertoires. By selecting the most frequently occurring BCR sequences in a given repertoire and summing the vector representations of these sequences, we represent an entire repertoire as a 100-dimensional vector, that is, as a single data point in vector space. We demonstrate that this approach not only clusters the repertoires of COVID-19 patients and healthy subjects accurately, but also efficiently tracks minute changes in immune status as patients undergo a course of treatment over time. Furthermore, using the distributed representations, we trained an XGBoost classification model that achieved a mean accuracy of over 87% given a repertoire of CDR3 sequences.

Data acquisition and pre-processing: The raw data acquired from the Observed Antibody Space (OAS) BCR dataset included heavy and light chains as well as several different isotypes. We decided to use sequences from the immunoglobulin heavy G chain (IGHG) because of its strong association with SARS-CoV-2. Additionally, the raw data included read counts of unique BCR sequences rather than read counts of unique CDR3 sequences. We therefore added a pre-processing step that counted the unique CDR3 sequences in each repertoire. For the attributes used to search the OAS sequences, we set ‘Chain’ to ‘Heavy’, ‘Isotype’ to ‘IGHG’, ‘Disease’ to ‘SARS-COV-2’ or ‘None’, ‘BSource’ to ‘PBMC’, ‘Vaccine’ to ‘None’, and ‘Species’ to ‘human’. Twenty-five studies, including 3 COVID-19 studies, were downloaded and used for analysis.
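The CDR3 counting step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes each raw record has already been reduced to a hypothetical (CDR3 sequence, read count) pair for one unique BCR sequence, and it assumes that the count for a CDR3 is the sum of the read counts of all BCR sequences sharing it. The function name `count_cdr3` and the toy records are invented for the example.

```python
from collections import Counter


def count_cdr3(records):
    """Aggregate read counts per unique CDR3 sequence.

    `records` is an iterable of (cdr3, read_count) pairs, one pair per
    unique BCR sequence (a hypothetical, simplified input format).
    """
    counts = Counter()
    for cdr3, reads in records:
        if cdr3:  # skip records without a CDR3 annotation
            counts[cdr3] += reads
    return counts


# Hypothetical toy records: two distinct BCR sequences share the CDR3 "ARDYW".
records = [("ARDYW", 3), ("ARDYW", 2), ("ARGGW", 1), ("", 5)]
cdr3_counts = count_cdr3(records)
```

A `Counter` is convenient here because its `most_common` method gives the most frequent CDR3 sequences directly, which is what the repertoire representation needs.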
In total, this study analyzed 106 COVID-19 patients and 349 healthy subjects (the latter reduced to 322 because 27 healthy subjects lacked sufficient unique CDR3 sequences). The resulting repertoire vectors can be used to classify or visualize an individual's immune status.

Data file:
428 vectors of IGHG repertoire representations: 106 COVID-19 repertoire representations, 322 healthy repertoire representations
311 vectors of IGHA repertoire representations: 23 COVID-19 repertoire representations, 288 healthy repertoire representations
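To make the repertoire representation concrete, the sketch below sums the vectors of the most frequent CDR3 sequences in one repertoire. Several details are assumptions rather than facts from this record: each sequence vector is taken to be the sum of the embeddings of its overlapping 3-grams (a common ProtVec-style construction), 3-grams missing from the table are skipped, and the number of sequences kept (`top_n`) is a free parameter. The embedding table here is a hypothetical 2-dimensional toy; the vectors in the data files are 100-dimensional.

```python
from collections import Counter


def sequence_vector(seq, embeddings, dim, k=3):
    """Sum the embeddings of all overlapping k-grams in `seq`.

    k-grams absent from `embeddings` are skipped (an assumption).
    """
    vec = [0.0] * dim
    for i in range(len(seq) - k + 1):
        gram_vec = embeddings.get(seq[i:i + k])
        if gram_vec is not None:
            for j in range(dim):
                vec[j] += gram_vec[j]
    return vec


def repertoire_vector(cdr3_counts, embeddings, dim, top_n):
    """Represent a repertoire as the sum of its top_n most frequent CDR3 vectors."""
    vec = [0.0] * dim
    for cdr3, _count in cdr3_counts.most_common(top_n):
        seq_vec = sequence_vector(cdr3, embeddings, dim)
        vec = [a + b for a, b in zip(vec, seq_vec)]
    return vec


# Hypothetical toy embedding table (2 dimensions instead of 100).
toy_embeddings = {
    "ARD": [1.0, 0.0],
    "RDY": [0.0, 1.0],
    "DYW": [1.0, 1.0],
    "ARG": [2.0, 0.0],
}
toy_counts = Counter({"ARDYW": 5, "ARGGW": 2, "AKDYW": 1})
rep_vec = repertoire_vector(toy_counts, toy_embeddings, dim=2, top_n=2)
```

In this sketch the sum is unweighted, so each selected sequence contributes equally regardless of its read count; weighting by frequency would be a different design choice.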