Construction of a high-precision general geographical location words dataset

Bibliographic Details
Published in: Computer Standards & Interfaces 2023-03, Vol. 84, p. 103692, Article 103692
Main Authors: Liu, Yimin; Luo, Xiangyang; Tao, Zhiyuan
Format: Article
Language: English
Description
Summary: Geographical location words (GLWs) are words associated with geographical locations. GLWs are a significant foundation for text data processing and social network location inference. In this paper, we propose a framework for constructing high-precision general GLWs datasets and use it to construct a Chinese GLWs dataset named GeoCN. GeoCN partly addresses the lack of a Chinese GLWs lexicon with diverse categories, high accuracy, and robust versatility. GeoCN consists of three parts: a) points of interest (POI) data collected through an electronic map API, b) administrative division data constructed from the national information platform, and c) GLWs data expanded and filtered by automated procedures and manual processing. We establish a GLWs glossary for each administrative region and map each GLW to its location. GeoCN covers 34 provincial-level administrative regions, 392 prefecture-level administrative regions, and 3,160 county-level administrative regions in China. GeoCN contains 1,763,476 GLWs, and the compressed file size is 117 MB.

• A framework for constructing high-precision general GLWs datasets is proposed.
• A Chinese geographical location word dataset (called GeoCN) is constructed.
• GeoCN has more categories of location words, higher precision, and stronger universality.
• GeoCN can support natural language processing and social network location inference.
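
The summary describes a glossary of GLWs per administrative region, with each GLW mapped to its location. A minimal Python sketch of that idea follows. This record does not describe GeoCN's actual file format or schema, so the `Region` and `Glossary` names, the three-level region key, and the substring matching in `locate` are all assumptions made for illustration. The sample entries are illustrative place names, not rows taken from GeoCN.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical key for a county-level administrative region."""
    province: str
    prefecture: str
    county: str


class Glossary:
    """Per-region GLW glossaries plus a reverse index from word to regions."""

    def __init__(self):
        self.by_region = defaultdict(set)  # Region -> set of GLWs
        self.by_word = defaultdict(set)    # GLW -> set of Regions

    def add(self, word, region):
        # One word may belong to several regions, so both sides are sets.
        self.by_region[region].add(word)
        self.by_word[word].add(region)

    def locate(self, text):
        """Return {word: regions} for every known GLW that occurs in text.

        Naive substring matching, assumed here only for illustration.
        """
        return {w: set(rs) for w, rs in self.by_word.items() if w in text}


# Illustrative entries (not taken from GeoCN).
dongcheng = Region("北京市", "北京市", "东城区")
xihu = Region("浙江省", "杭州市", "西湖区")

glossary = Glossary()
glossary.add("故宫", dongcheng)
glossary.add("王府井", dongcheng)
glossary.add("西湖", xihu)

hits = glossary.locate("周末去故宫和王府井逛了一天")
for word, regions in sorted(hits.items()):
    print(word, sorted(r.county for r in regions))
```

Keeping both the per-region glossary and the reverse word index is one possible design: the first mirrors the glossary-per-region structure in the summary, while the second supports the location inference use case, where text is scanned for known GLWs.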
ISSN: 0920-5489, 1872-7018
DOI: 10.1016/j.csi.2022.103692