Loading…

Accuracy and data efficiency in deep learning models of protein expression

Synthetic biology often involves engineering microbial strains to express high-value proteins. Thanks to progress in rapid DNA synthesis and sequencing, deep learning has emerged as a promising approach to build sequence-to-expression models for strain optimization. But such models need large and co...

Full description

Saved in:
Bibliographic Details
Published in:Nature communications 2022-12, Vol.13 (1), p.7755-7755, Article 7755
Main Authors: Nikolados, Evangelos-Marios, Wongprommoon, Arin, Aodha, Oisin Mac, Cambray, Guillaume, Oyarzún, Diego A.
Format: Article
Language:English
Subjects:
Citations: Items that this one cites
Items that cite this one
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Synthetic biology often involves engineering microbial strains to express high-value proteins. Thanks to progress in rapid DNA synthesis and sequencing, deep learning has emerged as a promising approach to build sequence-to-expression models for strain optimization. But such models need large and costly training data that create steep entry barriers for many laboratories. Here we study the relation between accuracy and data efficiency in an atlas of machine learning models trained on datasets of varied size and sequence diversity. We show that deep learning can achieve good prediction accuracy with much smaller datasets than previously thought. We demonstrate that controlled sequence diversity leads to substantial gains in data efficiency and employed Explainable AI to show that convolutional neural networks can finely discriminate between input DNA sequences. Our results provide guidelines for designing genotype-phenotype screens that balance cost and quality of training data, thus helping promote the wider adoption of deep learning in the biotechnology sector. Synthetic biology often involves engineering microbial strains to express high-value proteins. Here the authors build deep learning predictors of protein expression from sequence that deliver accurate models with fewer data than previously assumed, helping to lower costs of model-driven strain design.
ISSN:2041-1723
2041-1723
DOI:10.1038/s41467-022-34902-5