Smoothing Convolutional Factorizes Inception V3 Labels and Transformers for Image Feature Extraction into Text Segmentation

In computer vision, object detection for video understanding cannot by itself provide a contextual picture in the form of a semantic description of the video/image. For this reason, an object detection and feature extraction mechanism is needed, together with a technique for converting video and images into text, using the Inception-V3 and Transformer methods. Inception-V3 is a deep convolutional architecture developed from GoogLeNet (Inception-V1); it improves performance by adding factorization at the convolution stage, reducing connections and parameters without weakening the network, and is used here to extract image features from inputs of 299 x 299 x 3 pixels. A Transformer architecture with a multi-head self-attention mechanism predicts and recovers words sequentially, combined with an RNN encoder-decoder architecture. The research was carried out on 5-minute videos, which yielded a TensorFlow dataset of 1000 images and 5000 sentence captions. The model was evaluated with BLEU (Bilingual Evaluation Understudy) by comparing predicted captions against reference captions, with average BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores of 0.418, 0.367, 0.245, and 0.165.
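
The record describes the extraction stage only at a high level. As a rough illustration, the sketch below loads a pretrained Inception-V3 backbone in TensorFlow/Keras (the framework the abstract mentions) and extracts features from one 299 x 299 x 3 image; the ImageNet weights, the frozen backbone, and the helper name are illustrative assumptions, not details from the paper.

```python
import tensorflow as tf

# Inception-V3 without its classification head: the remaining convolutional
# stack is the image feature extractor the abstract describes.
backbone = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", input_shape=(299, 299, 3)
)
backbone.trainable = False  # assumed: pretrained weights used as a fixed extractor

def extract_features(image_path):
    """Load one image, resize it to the 299 x 299 x 3 input size cited above,
    and return the Inception-V3 feature map."""
    data = tf.io.read_file(image_path)
    image = tf.image.decode_jpeg(data, channels=3)
    image = tf.image.resize(image, (299, 299))
    image = tf.keras.applications.inception_v3.preprocess_input(image)
    return backbone(tf.expand_dims(image, 0))  # shape: (1, 8, 8, 2048)
```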

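The decoder side is likewise only sketched in the abstract. The snippet below shows the core multi-head self-attention step with a causal mask, which is what lets a decoder predict caption words one at a time; the model width, head count, and sequence length are made-up values, and tf.keras.layers.MultiHeadAttention stands in for whatever implementation the authors used.

```python
import tensorflow as tf

d_model, num_heads, seq_len = 256, 8, 10  # assumed sizes, not from the paper

attention = tf.keras.layers.MultiHeadAttention(
    num_heads=num_heads, key_dim=d_model // num_heads
)

# Embeddings of the caption words generated so far: (batch, seq_len, d_model).
tokens = tf.random.normal((1, seq_len, d_model))

# Lower-triangular (causal) mask so each position attends only to earlier
# words: the mechanism behind sequential word prediction.
causal = tf.cast(tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0), tf.bool)

out = attention(query=tokens, value=tokens, key=tokens,
                attention_mask=causal[tf.newaxis, ...])
print(out.shape)  # (1, 10, 256)
```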

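The reported averages compare predicted captions against reference captions with cumulative n-gram weights. A minimal sketch of that scoring, assuming NLTK and hypothetical example sentences rather than the paper's dataset:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["a", "man", "rides", "a", "bicycle", "on", "the", "road"]]  # hypothetical ground-truth caption
candidate = ["a", "man", "rides", "a", "bike"]                            # hypothetical predicted caption

smooth = SmoothingFunction().method1  # avoids zero scores when a higher n-gram order has no matches
for n in range(1, 5):
    # Cumulative BLEU-n: equal weight 1/n on 1-grams through n-grams.
    weights = tuple(1.0 / n if i < n else 0.0 for i in range(4))
    score = sentence_bleu(reference, candidate, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```
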
Bibliographic Details
Main Authors: Triana Indah, Komang Ayu; Darma Putra, I Ketut Gede; Sudarma, Made; Hartati, Rukmi Sari
Format: Conference Proceeding
Language: English
Published in: 2023 International Conference on Smart-Green Technology in Electrical and Information Systems (ICSGTEIS), 2023, p. 139-144
Publisher: IEEE
Publication Date: 2023-11-02
DOI: 10.1109/ICSGTEIS60500.2023.10424317
EISSN: 2831-400X
EISBN: 9798350382822
Source: IEEE Xplore All Conference Series
Subjects: BLEU; Computer architecture; Encoder-Decoder RNN; Feature extraction; Inception-V3; Object detection; Smoothing methods; System performance; Testing; Transformer; Transformers
Online Access: Request full text