Smoothing Convolutional Factorizes Inception V3 Labels and Transformers for Image Feature Extraction into Text Segmentation
In computer vision, object detection alone cannot give video understanding a contextual picture in the form of a semantic description of the video or image. For this reason, a mechanism for object detection and feature extraction is needed, together with a technique for converting video and images into text, using the Inception-V3 and Transformer methods. Inception-V3 is a deep convolutional architecture developed from GoogLeNet (Inception-V1). It improves performance by adding factorization at the convolution stage, reducing connections and parameters without shrinking the network, and is used here to extract image features from inputs of 299 x 299 x 3 pixels. The Transformer architecture uses a multi-head self-attention mechanism to predict words, recovering them sequentially with an RNN encoder-decoder architecture. The research used 5-minute videos, which produced a TensorFlow dataset of 1000 images and 5000 sentence captions. The model was evaluated with BLEU (Bilingual Evaluation Understudy); comparing predicted captions against real captions gave average BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores of 0.418, 0.367, 0.245, and 0.165.
Main Authors: Triana Indah, Komang Ayu; Darma Putra, I Ketut Gede; Sudarma, Made; Hartati, Rukmi Sari
Format: Conference Proceeding
Language: English
Subjects: BLEU; Computer architecture; Encoder-Decoder RNN; Feature extraction; Inception-V3; Object detection; Smoothing methods; System performance; Testing; Transformer; Transformers
Online Access: Request full text
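The pipeline the abstract describes starts by running each frame through Inception-V3 with the classification head removed. Below is a minimal TensorFlow sketch of that step, assuming Keras's stock ImageNet weights and JPEG frames; the `extract_features` helper is illustrative, not taken from the paper:

```python
import tensorflow as tf

# Inception-V3 pretrained on ImageNet, classification head removed so the
# network serves purely as an image feature extractor.
extractor = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet")

def extract_features(image_path):
    # Decode and resize to the 299 x 299 x 3 input size the paper specifies.
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    # The head-less network emits an 8 x 8 x 2048 feature map; flatten the
    # spatial grid into 64 vectors of length 2048 for the decoder to attend over.
    features = extractor(tf.expand_dims(img, 0))
    return tf.reshape(features, (64, 2048))
```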
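The abstract pairs a multi-head self-attention decoder with an RNN encoder-decoder, but the record does not give the exact layering, so the block below sketches only the attention side. The layer sizes (d_model=512, 8 heads) and the input shape (batch, seq, 512) are assumptions:

```python
import tensorflow as tf

# One caption-decoder block in the style the abstract describes: masked
# multi-head self-attention over the words generated so far, followed by
# cross-attention over the 64 Inception-V3 feature vectors.
class CaptionDecoderBlock(tf.keras.layers.Layer):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.self_attn = tf.keras.layers.MultiHeadAttention(num_heads, d_model)
        self.cross_attn = tf.keras.layers.MultiHeadAttention(num_heads, d_model)
        self.norm1 = tf.keras.layers.LayerNormalization()
        self.norm2 = tf.keras.layers.LayerNormalization()

    def call(self, tokens, image_features):
        # Causal mask: each position may only attend to earlier words, so the
        # model predicts the next word from the words recovered so far.
        x = self.norm1(tokens + self.self_attn(tokens, tokens, use_causal_mask=True))
        # Ground each word in the image by attending over the feature grid.
        return self.norm2(x + self.cross_attn(x, image_features))
```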
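The BLEU-1 through BLEU-4 scores reported in the abstract weight 1- to 4-gram precision between each predicted caption and its reference captions. A short sketch of that evaluation with NLTK; the smoothing method is an assumption, since the record does not state one:

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

# BLEU-1..4 for one predicted caption against its reference captions.
def bleu_scores(references, prediction):
    refs = [r.split() for r in references]
    hyp = prediction.split()
    smooth = SmoothingFunction().method1  # avoids zero scores on short captions
    weights = [(1, 0, 0, 0), (0.5, 0.5, 0, 0),
               (1 / 3, 1 / 3, 1 / 3, 0), (0.25, 0.25, 0.25, 0.25)]
    return [sentence_bleu(refs, hyp, weights=w, smoothing_function=smooth)
            for w in weights]

# Hypothetical example pair, not from the paper's dataset.
print(bleu_scores(["a man rides a red motorbike"], "a man rides a motorbike"))
```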
Field | Value
---|---
cited_by |
cites |
container_end_page | 144
container_issue |
container_start_page | 139
container_title | 2023 International Conference on Smart-Green Technology in Electrical and Information Systems (ICSGTEIS)
container_volume |
creator | Triana Indah, Komang Ayu; Darma Putra, I Ketut Gede; Sudarma, Made; Hartati, Rukmi Sari
description | In computer vision, object detection alone cannot give video understanding a contextual picture in the form of a semantic description of the video or image. For this reason, a mechanism for object detection and feature extraction is needed, together with a technique for converting video and images into text, using the Inception-V3 and Transformer methods. Inception-V3 is a deep convolutional architecture developed from GoogLeNet (Inception-V1). It improves performance by adding factorization at the convolution stage, reducing connections and parameters without shrinking the network, and is used here to extract image features from inputs of 299 x 299 x 3 pixels. The Transformer architecture uses a multi-head self-attention mechanism to predict words, recovering them sequentially with an RNN encoder-decoder architecture. The research used 5-minute videos, which produced a TensorFlow dataset of 1000 images and 5000 sentence captions. The model was evaluated with BLEU (Bilingual Evaluation Understudy); comparing predicted captions against real captions gave average BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores of 0.418, 0.367, 0.245, and 0.165.
doi_str_mv | 10.1109/ICSGTEIS60500.2023.10424317
format | conference_proceeding
fulltext | fulltext_linktorsrc
identifier | EISSN: 2831-400X; EISBN: 9798350382822
ispartof | 2023 International Conference on Smart-Green Technology in Electrical and Information Systems (ICSGTEIS), 2023, p.139-144
issn | 2831-400X
language | eng
recordid | cdi_ieee_primary_10424317
source | IEEE Xplore All Conference Series
subjects | BLEU; Computer architecture; Encoder-Decoder RNN; Feature extraction; Inception-V3; Object detection; Smoothing methods; System performance; Testing; Transformer; Transformers
title | Smoothing Convolutional Factorizes Inception V3 Labels and Transformers for Image Feature Extraction into Text Segmentation