Smoothing Convolutional Factorizes Inception V3 Labels and Transformers for Image Feature Extraction into Text Segmentation
Main Authors: , , ,
Format: Conference Proceeding
Language: English
Subjects:
Online Access: Request full text
Summary: In computer vision, object detection for video understanding cannot by itself provide context in the form of a semantic description of the video or image. This motivates a mechanism for object detection and feature extraction, together with a technique for converting video and images into text, using the Inception-V3 and Transformer methods. Inception-V3 is a deep convolutional architecture that develops GoogLeNet (Inception-V1); it improves performance by adding factorization at the convolution stage, reducing the number of connections and parameters without shrinking the network, and it is used here to extract image features from inputs of 299 x 299 x 3 pixels. The Transformer architecture uses a multi-head self-attention mechanism to predict and generate words sequentially, as an RNN encoder-decoder architecture does. The research used 5-minute videos, which produced a TensorFlow dataset of 1000 images and 5000 sentence captions. The model was evaluated with BLEU (Bilingual Evaluation Understudy) by comparing predicted captions against real captions; the average BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores were 0.418, 0.367, 0.245, and 0.165.
ISSN: 2831-400X
DOI: 10.1109/ICSGTEIS60500.2023.10424317
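
The summary above describes extracting image features with Inception-V3 at an input size of 299 x 299 x 3. A minimal sketch of that step, assuming TensorFlow/Keras with ImageNet weights; the pooled 8 x 8 x 2048 output layer and the preprocessing shown are assumptions, not details taken from the paper:

```python
import tensorflow as tf

# Inception-V3 as a fixed feature extractor: drop the classifier head and
# keep the convolutional trunk, as the summary describes. ImageNet weights
# are an assumption; the paper's weights may differ.
base = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet")

def extract_features(image_path):
    """Load one image, resize it to 299 x 299 x 3, and return Inception-V3
    features reshaped to (64, 2048) for use by an attention decoder."""
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)  # scale to [-1, 1]
    features = base(tf.expand_dims(img, 0))   # (1, 8, 8, 2048)
    return tf.reshape(features, (-1, 2048))   # (64, 2048)
```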
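On the decoding side, the summary mentions a multi-head self-attention mechanism that predicts words sequentially. A minimal sketch of that masked self-attention step, assuming Keras's MultiHeadAttention layer; the head count and key dimension are illustrative, not taken from the paper:

```python
import tensorflow as tf

# Masked multi-head self-attention over a partial caption, as in a
# Transformer decoder. num_heads and key_dim are illustrative assumptions.
mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)

def decoder_self_attention(token_embeddings):
    """token_embeddings: (batch, seq_len, d_model). The causal mask stops
    each position from attending to later words, so the caption can be
    predicted word by word."""
    return mha(query=token_embeddings, value=token_embeddings,
               key=token_embeddings, use_causal_mask=True)

# Example: attend over a batch of 5-token caption prefixes with d_model=512.
x = tf.random.normal((2, 5, 512))
print(decoder_self_attention(x).shape)  # (2, 5, 512)
```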
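Finally, the BLEU evaluation the summary reports can be reproduced in outline with NLTK; the toy captions and whitespace tokenization below are assumptions for illustration:

```python
from nltk.translate.bleu_score import corpus_bleu

# Toy data standing in for the paper's predicted and reference captions
# (the real dataset has 1000 images and 5000 sentence captions).
predicted_captions = ["a man rides a motorbike on the road"]
reference_captions = [["a man is riding a motorcycle down the road",
                       "someone rides a motorbike along the road"]]

hypotheses = [p.split() for p in predicted_captions]
references = [[r.split() for r in refs] for refs in reference_captions]

# BLEU-1 through BLEU-4 with uniform n-gram weights, the metrics the
# summary reports as 0.418, 0.367, 0.245, and 0.165 on the full dataset.
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))
    print(f"BLEU-{n}: {corpus_bleu(references, hypotheses, weights=weights):.3f}")
```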