Finding associations between natural and computer languages: A case-study of bilingual LDA applied to the bleeping computer forum posts
Published in: | The Journal of systems and software 2023-07, Vol.201, p.111651, Article 111651 |
---|---|
Main Authors: | , , , , , |
Format: | Article |
Language: | English |
Summary: | In the context of technical support, trails of technical discussions often contain a mixture of natural language (e.g., English) and software log excerpts. Uncovering latent links between certain problems and log excerpts that are often requested during the discussions of those problems enables the construction of a valuable knowledge base. Nevertheless, uncovering such latent links is challenging because English and software logs are two fundamentally different languages. In this paper, we investigate the suitability of multilingual LDA models to address the problem at hand. We study three models, namely: enriched LDA (M+), two-layer LDA (M2L), and off-the-shelf bilingual LDA (Mbi). We use approximately 8K discussion threads from a Bleeping Computer forum as our dataset. We observe that M2L performs the best overall, although it yields a substantially coarser-grained view of the discussed themes in the threads (20 topics, 0.3% of the documents). We also note that M+ outperforms Mbi, achieving higher coherence, lower perplexity, and a higher cross-lingual coverage ratio. We invite future studies to qualitatively assess the quality of the topics produced by the LDA models, such that the feasibility of employing such models in practice can be better determined.
• We expanded the selection of related studies from the bimodality/dual-channel area.
• We discussed alternative approaches that balance two languages in LDA models.
• We revised certain paragraphs of the paper to improve its soundness and clarity.
• We made proper changes including typo fixing, term rewording, and issue fixing. |
---|---|
ISSN: | 0164-1212 1873-1228 |
DOI: | 10.1016/j.jss.2023.111651 |