The construction of Chinese microblog gender-specific thesauruses and user gender classification

Zhu et al. Applied Network Science (2018) 3:47
https://doi.org/10.1007/s41109-018-0104-1

RESEARCH Open Access

Zhiliang Zhu, Zejun Ke, Jiayin Cui, Hai Yu* and Guoqi Liu

*Correspondence: Software College, Northeastern University, Shenyang, China

Abstract

Statistical features show that short text messages published by users of different genders differ in the words and semantics used. In this paper, we first construct a gender-specific thesaurus and then construct two new features. A new classification model is built by combining the traditional statistical features with the improved text implicitness feature. The experimental evaluation performed on the Sina Weibo dataset demonstrated the effectiveness of the gender-specific thesaurus-based features, and the improved text implicitness feature raised the accuracy of gender classification to 84.7%.

Keywords: Gender classification, Statistical feature, Gender-specific thesaurus, Machine learning

Introduction

With the popularization and rapid development of the Internet, social networks are favored and sought after by many Internet users due to their unique virtuality, diversity, innovation, freedom and alienation. Social networks outside China are represented by platforms such as Facebook, Twitter and Instagram, while those in China are represented by Sina Weibo, Tencent Weibo, WeChat, Baidu Post Bar and Zhihu. In particular, anonymity is an important feature of social networks. People may not need to provide their real identities in cyberspace, such as their names, ages, genders, and addresses. However, as social networks grow, the drawbacks of anonymous remarks are constantly being magnified and exploited. Users are vulnerable to anonymous and fraudulent attacks when socializing online, including receiving false information and even suffering mental or physical harm.
In many criminal cases, perpetrators attempt to conceal their addresses by using anonymous servers that hide their true identity. Therefore, it is imperative to design an effective identity tracking method for cyberspace forensics. One of the most important aspects of this is gender classification. Beyond its value for the security of Internet users, gender classification of users in social networks is also crucial to market intelligence. Users' gender information can be used in targeted advertising and product development, thereby improving the accuracy of personalized recommendations and enabling more effective business promotion and more accurate ad serving. In scientific research, this information can provide the foundation for the separation of gender topics, the discovery of gender hot words, behavioral analysis and emotional analysis.

© The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Currently, scholars usually construct features by using statistical analysis methods and semantic analysis methods. There is not much research on gender classification in Chinese because Chinese is much more complex than English. Furthermore, Chinese speakers tend to express themselves euphemistically. In terms of resources, publicly available Chinese thesauruses are relatively limited. Therefore, gender classification research in Chinese is quite difficult. Based on the Sina Weibo dataset, we build a gender-specific thesaurus to provide resources for scholars to perform gender classification research in the future.
Moreover, by improving the text implicitness calculation, this paper allows the implicitness of Chinese text to be calculated more accurately. We then combine some traditional statistics-based text features and expression features to construct the feature vectors for gender classification. Our research focuses on ordinary gender recognition and does not consider gender camouflage (i.e., a user of one gender deliberately presenting the characteristics of another). Because our method recognizes gender from user characteristics, a user who disguises those characteristics prevents us from extracting the correct features, and the user's gender then cannot be identified correctly. Such camouflage accounts for only a small share of cases, so this article does not consider this more complicated situation.

This paper is organized as follows. “Related work” section surveys existing work in gender classification. “The extraction of a gender-specific thesaurus and the construction of feature vectors” section presents the construction of the feature matrix. “Experimental process and result analysis” section describes the model that we build, presents our experimental results, and analyzes the experiment. Finally, “Conclusions” section summarizes our findings and conclusions.

Related work

In recent years, although research on gender classification based on social networks has not been popular, related works have made some progress. Owing to differences between languages, Chinese word segmentation before 2018 operated at the word level. In 2018, some scholars tried to reach the character-like level (Cao 2018), but Chinese carries semantic meaning only above the level of a single character; if the text is divided further, the original meaning is destroyed. Moreover, the basic unit of expression in Chinese is closer to the word, whereas analyses of most other languages can reach the character level. Therefore, Chinese gender classification based on NLP is different from that in other languages.
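The difference between word-level and character-level units discussed above can be illustrated with a minimal sketch. It is not the segmentation method used in this paper or in the cited works: the toy vocabulary `TOY_VOCAB`, the function names, and the choice of forward maximum matching are all assumptions made for illustration only.

```python
def char_level(text):
    """Split text into single characters (character-level units)."""
    return list(text)


def word_level(text, vocab, max_len=4):
    """Forward maximum matching: at each position, take the longest
    vocabulary entry that matches, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"今天", "天气", "真好"}

sentence = "今天天气真好"  # roughly: "The weather is really nice today"
print(char_level(sentence))             # six single-character units
print(word_level(sentence, TOY_VOCAB))  # three two-character words
```

With this toy vocabulary, the word-level output keeps units such as 天气 ("weather") intact, whereas the character-level output separates them into 天 and 气, which no longer carry the same meaning on their own.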
In research on Chinese gender classification, Liu and Niu (2016) proposed a gender identification method based on the feature extraction of emotional words and emotion-related language style. Huang et al. (2014) proposed a microblog message representation model based on a tolerance rough set, constructed a feature vector by extracting gender-based feature differences in rough sets, and finally used a k-NN classifier for classification. Compared with the characteristic term frequency representation model, the accuracy rate is 7% higher. Tang and Lin (2010) achieved gender recognition based on different descriptions of men or women in various aspects. Qi (2017) selected a corpus from Tencent Weibo to extract the vocabulary dependency of short texts and compared it, to some extent, with the vocabulary features of existing documents. This avoided the sparsity of short text feature sets, and the use of machine learning (such as the SVM algorithm) was experimentally verified. S (...truncated)