Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks

Both Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) have shown improvements over Deep Neural Networks (DNNs) across a wide variety of speech recognition tasks. CNNs, LSTMs and DNNs are complementary in their modeling capabilities, as CNNs are good at reducing frequency variat...
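The description field of this record reports a 4-6% relative improvement in WER (word error rate) for the CLDNN over an LSTM. As a reading aid, the sketch below shows how a relative improvement is conventionally computed and how it differs from an absolute one. The function name and the WER values are hypothetical, chosen only for illustration; they are not results from the paper.

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction of new_wer versus baseline_wer, in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers for illustration only (NOT taken from the paper):
# a baseline model at 20.0% WER and a combined model at 19.0% WER.
baseline_wer = 20.0
combined_wer = 19.0

improvement = relative_wer_improvement(baseline_wer, combined_wer)
print(f"absolute: {baseline_wer - combined_wer:.1f} points, relative: {improvement:.1f}%")
```

With these made-up values, a drop of one absolute WER point corresponds to a 5% relative improvement, which is the kind of figure the abstract quotes.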

Bibliographic Details
Main Authors:	Sainath, Tara N.; Vinyals, Oriol; Senior, Andrew; Sak, Hasim
Format:	Conference Proceeding
Language:	English
Subjects:	Context; Hidden Markov models; Neural networks; Noise measurement; Speech; Speech recognition; Training

cited_by	cdi_FETCH-LOGICAL-c407t-3d744788f10664f20ba5544103b00ddb5ae7eae43f290e5e6dfb688b1eff9ecd3
container_end_page	4584
container_start_page	4580
creator	Sainath, Tara N.; Vinyals, Oriol; Senior, Andrew; Sak, Hasim
description	Both Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) have shown improvements over Deep Neural Networks (DNNs) across a wide variety of speech recognition tasks. CNNs, LSTMs and DNNs are complementary in their modeling capabilities, as CNNs are good at reducing frequency variations, LSTMs are good at temporal modeling, and DNNs are appropriate for mapping features to a more separable space. In this paper, we take advantage of the complementarity of CNNs, LSTMs and DNNs by combining them into one unified architecture. We explore the proposed architecture, which we call CLDNN, on a variety of large vocabulary tasks, varying from 200 to 2,000 hours. We find that the CLDNN provides a 4-6% relative improvement in WER over an LSTM, the strongest of the three individual models.
doi_str_mv	10.1109/ICASSP.2015.7178838
format	conference_proceeding
fullrecord	<record><control><sourceid>ieee_CHZPO</sourceid><recordid>TN_cdi_ieee_primary_7178838</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>7178838</ieee_id><sourcerecordid>7178838</sourcerecordid><originalsourceid>FETCH-LOGICAL-c407t-3d744788f10664f20ba5544103b00ddb5ae7eae43f290e5e6dfb688b1eff9ecd3</originalsourceid><addsrcrecordid>eNotkEtOwzAUAA0CiVJ6gm58gKa8Fzt2vEThK8JHapHYVU7yDIUkrpwElNtTia5mNxoNY3OEJSKYy4fsarV6XcaAyVKjTlORHrFzlEoLZYzWx2wSC20iNPB-wiaYxBAplOaMzbruCwBQKy21nLDHzLc_vh76rW9tveC5bz_46tOHPlpTaPgTNT6MC-6Guh556duWyp4qfk204880BFvv0f_68N1dsFNn645mB07Z2-3NOruP8pe7fXEelRJ0H4lKS7lvdghKSRdDYZNESgRRAFRVkVjSZEkKFxughFTlCpWmBZJzhspKTNn837slos0ubBsbxs3hg_gDfj9Qhg</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks</title><source>IEEE Xplore All Conference Series</source><creator>Sainath, Tara N. ; Vinyals, Oriol ; Senior, Andrew ; Sak, Hasim</creator><creatorcontrib>Sainath, Tara N. ; Vinyals, Oriol ; Senior, Andrew ; Sak, Hasim</creatorcontrib><description>Both Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) have shown improvements over Deep Neural Networks (DNNs) across a wide variety of speech recognition tasks. CNNs, LSTMs and DNNs are complementary in their modeling capabilities, as CNNs are good at reducing frequency variations, LSTMs are good at temporal modeling, and DNNs are appropriate for mapping features to a more separable space. In this paper, we take advantage of the complementarity of CNNs, LSTMs and DNNs by combining them into one unified architecture. We explore the proposed architecture, which we call CLDNN, on a variety of large vocabulary tasks, varying from 200 to 2,000 hours. 
We find that the CLDNN provides a 4-6% relative improvement in WER over an LSTM, the strongest of the three individual models.</description><identifier>ISSN: 1520-6149</identifier><identifier>EISSN: 2379-190X</identifier><identifier>EISBN: 1467369977</identifier><identifier>EISBN: 9781467369978</identifier><identifier>DOI: 10.1109/ICASSP.2015.7178838</identifier><language>eng</language><publisher>IEEE</publisher><subject>Context ; Hidden Markov models ; Neural networks ; Noise measurement ; Speech ; Speech recognition ; Training</subject><ispartof>2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, p.4580-4584</ispartof><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c407t-3d744788f10664f20ba5544103b00ddb5ae7eae43f290e5e6dfb688b1eff9ecd3</citedby></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/7178838$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,780,784,789,790,2058,27925,54555,54920,54932</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/7178838$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Sainath, Tara N.</creatorcontrib><creatorcontrib>Vinyals, Oriol</creatorcontrib><creatorcontrib>Senior, Andrew</creatorcontrib><creatorcontrib>Sak, Hasim</creatorcontrib><title>Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks</title><title>2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title><addtitle>ICASSP</addtitle><description>Both Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) have shown improvements over Deep Neural Networks (DNNs) across a wide variety of speech recognition tasks. 
CNNs, LSTMs and DNNs are complementary in their modeling capabilities, as CNNs are good at reducing frequency variations, LSTMs are good at temporal modeling, and DNNs are appropriate for mapping features to a more separable space. In this paper, we take advantage of the complementarity of CNNs, LSTMs and DNNs by combining them into one unified architecture. We explore the proposed architecture, which we call CLDNN, on a variety of large vocabulary tasks, varying from 200 to 2,000 hours. We find that the CLDNN provides a 4-6% relative improvement in WER over an LSTM, the strongest of the three individual models.</description><subject>Context</subject><subject>Hidden Markov models</subject><subject>Neural networks</subject><subject>Noise measurement</subject><subject>Speech</subject><subject>Speech recognition</subject><subject>Training</subject><issn>1520-6149</issn><issn>2379-190X</issn><isbn>1467369977</isbn><isbn>9781467369978</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2015</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><recordid>eNotkEtOwzAUAA0CiVJ6gm58gKa8Fzt2vEThK8JHapHYVU7yDIUkrpwElNtTia5mNxoNY3OEJSKYy4fsarV6XcaAyVKjTlORHrFzlEoLZYzWx2wSC20iNPB-wiaYxBAplOaMzbruCwBQKy21nLDHzLc_vh76rW9tveC5bz_46tOHPlpTaPgTNT6MC-6Guh556duWyp4qfk204880BFvv0f_68N1dsFNn645mB07Z2-3NOruP8pe7fXEelRJ0H4lKS7lvdghKSRdDYZNESgRRAFRVkVjSZEkKFxughFTlCpWmBZJzhspKTNn837slos0ubBsbxs3hg_gDfj9Qhg</recordid><startdate>20150401</startdate><enddate>20150401</enddate><creator>Sainath, Tara N.</creator><creator>Vinyals, Oriol</creator><creator>Senior, Andrew</creator><creator>Sak, Hasim</creator><general>IEEE</general><scope>6IE</scope><scope>6IH</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIO</scope></search><sort><creationdate>20150401</creationdate><title>Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks</title><author>Sainath, Tara N. 
; Vinyals, Oriol ; Senior, Andrew ; Sak, Hasim</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c407t-3d744788f10664f20ba5544103b00ddb5ae7eae43f290e5e6dfb688b1eff9ecd3</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2015</creationdate><topic>Context</topic><topic>Hidden Markov models</topic><topic>Neural networks</topic><topic>Noise measurement</topic><topic>Speech</topic><topic>Speech recognition</topic><topic>Training</topic><toplevel>online_resources</toplevel><creatorcontrib>Sainath, Tara N.</creatorcontrib><creatorcontrib>Vinyals, Oriol</creatorcontrib><creatorcontrib>Senior, Andrew</creatorcontrib><creatorcontrib>Sak, Hasim</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan (POP) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE Xplore</collection><collection>IEEE Proceedings Order Plans (POP) 1998-present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Sainath, Tara N.</au><au>Vinyals, Oriol</au><au>Senior, Andrew</au><au>Sak, Hasim</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks</atitle><btitle>2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</btitle><stitle>ICASSP</stitle><date>2015-04-01</date><risdate>2015</risdate><spage>4580</spage><epage>4584</epage><pages>4580-4584</pages><issn>1520-6149</issn><eissn>2379-190X</eissn><eisbn>1467369977</eisbn><eisbn>9781467369978</eisbn><abstract>Both Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) have shown improvements over Deep Neural Networks (DNNs) across a wide variety of speech recognition 
tasks. CNNs, LSTMs and DNNs are complementary in their modeling capabilities, as CNNs are good at reducing frequency variations, LSTMs are good at temporal modeling, and DNNs are appropriate for mapping features to a more separable space. In this paper, we take advantage of the complementarity of CNNs, LSTMs and DNNs by combining them into one unified architecture. We explore the proposed architecture, which we call CLDNN, on a variety of large vocabulary tasks, varying from 200 to 2,000 hours. We find that the CLDNN provides a 4-6% relative improvement in WER over an LSTM, the strongest of the three individual models.</abstract><pub>IEEE</pub><doi>10.1109/ICASSP.2015.7178838</doi><tpages>5</tpages></addata></record>
fulltext	fulltext_linktorsrc
identifier	ISSN: 1520-6149
ispartof	2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, p.4580-4584
issn	1520-6149; 2379-190X
language	eng
recordid	cdi_ieee_primary_7178838
source	IEEE Xplore All Conference Series
subjects	Context; Hidden Markov models; Neural networks; Noise measurement; Speech; Speech recognition; Training
title	Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks
url	http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-08T03%3A17%3A29IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_CHZPO&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Convolutional,%20Long%20Short-Term%20Memory,%20fully%20connected%20Deep%20Neural%20Networks&rft.btitle=2015%20IEEE%20International%20Conference%20on%20Acoustics,%20Speech%20and%20Signal%20Processing%20(ICASSP)&rft.au=Sainath,%20Tara%20N.&rft.date=2015-04-01&rft.spage=4580&rft.epage=4584&rft.pages=4580-4584&rft.issn=1520-6149&rft.eissn=2379-190X&rft_id=info:doi/10.1109/ICASSP.2015.7178838&rft.eisbn=1467369977&rft.eisbn_list=9781467369978&rft_dat=%3Cieee_CHZPO%3E7178838%3C/ieee_CHZPO%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c407t-3d744788f10664f20ba5544103b00ddb5ae7eae43f290e5e6dfb688b1eff9ecd3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=7178838&rfr_iscdi=true