Prediction of T-cell receptor (TCR) interactions with MHC-peptide complexes remains highly challenging. This challenge is primarily due to three dominant factors: data accuracy, data scarceness, and problem complexity. Here, we showcase that “shallow” convolutional neural network (CNN) architectures are adequate to deal with the problem complexity imposed by the length variations of TCRs. We demonstrate that current public bulk CDR3β-pMHC binding data overall is of low quality and that the development of accurate prediction models is contingent on paired α/β TCR sequence data corresponding to at least 150 distinct pairs for each investigated pMHC. In comparison, models trained on CDR3α or CDR3β data alone demonstrated a variable and pMHC specific relative performance drop. Together these findings support that T-cell specificity is predictable given the availability of accurate and sufficient paired TCR sequence data. NetTCR-2.0 is publicly available at https://services.healthtech.dtu.dk/service.php?NetTCR-2.0.
Bibliographical noteFunding Information:
We would like to thank DTU Multi-Assay Core (DMAC) for sequencing the set of novel paired-chain TCRs. This research was funded in part through the Independent research fund Denmark (DFF-7014-00055 to M.N), the Federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services (under Contract No. HHSN272201200010CERC to M.N), StG NextDART (677268 to S.R.H.), and the Lundbeck Foundation Experiment (R324-2019-1671 to A.K.B.).