A diploid assembly-based benchmark for variants in the major histocompatibility complex

Bibliographic Details
Published in: Nature Communications 2020-09, Vol.11 (1), p.4794-4794, Article 4794
Main Authors: Chin, Chen-Shan, Wagner, Justin, Zeng, Qiandong, Garrison, Erik, Garg, Shilpa, Fungtammasan, Arkarachai, Rautiainen, Mikko, Aganezov, Sergey, Kirsche, Melanie, Zarate, Samantha, Schatz, Michael C., Xiao, Chunlin, Rowell, William J., Markello, Charles, Farek, Jesse, Sedlazeck, Fritz J., Bansal, Vikas, Yoo, Byunggil, Miller, Neil, Zhou, Xin, Carroll, Andrew, Barrio, Alvaro Martinez, Salit, Marc, Marschall, Tobias, Dilthey, Alexander T., Zook, Justin M.
Format: Article
Language: English
Description
Summary: Most human genomes are characterized by aligning individual reads to the reference genome, but accurate long reads and linked reads now enable us to construct accurate, phased de novo assemblies. We focus on a medically important, highly variable, 5 million base-pair (bp) region where diploid assembly is particularly useful: the Major Histocompatibility Complex (MHC). Here, we develop a human genome benchmark derived from a diploid assembly for the openly consented Genome in a Bottle sample HG002. We assemble a single contig for each haplotype, align them to the reference, call phased small and structural variants, and define a small variant benchmark for the MHC, covering 94% of the MHC and 22,368 variants smaller than 50 bp, 49% more variants than a mapping-based benchmark. This benchmark reliably identifies errors in mapping-based callsets and enables performance assessment in regions with much denser, more complex variation than regions covered by previous benchmarks. Accurate, phased assemblies are a key tool in understanding the human genome, particularly in highly polymorphic regions like the medically important MHC. Here, the authors provide an assembly-based benchmark for this difficult-to-characterize region.
ISSN: 2041-1723
DOI: 10.1038/s41467-020-18564-9