Classification of Persian News Articles using Machine Learning Techniques

Document Type : Original Article

Authors

1 Department of Computational Linguistics, Regional Information Center for Science and Technology (RICeST), Shiraz, Fars, Iran

2 Department of Design and System Operations, Regional Information Center for Science and Technology (RICeST), Shiraz, Fars, Iran

Abstract

Automatic text classification, the process of automatically assigning texts to predefined categories, has many everyday applications and has recently gained much attention because of the increased number of text documents available in electronic form. Classifying news articles is one such application. Automatic classification is a machine learning technique in which a classifier is built by learning from a set of pre-classified documents. Naïve Bayes and k-Nearest Neighbor are among the most common machine learning algorithms for text classification. In this paper, we suggest a way to improve the performance of a text classifier using the mutual information (MI) and chi-square feature selection algorithms. We observed that the MI feature selection method can improve the accuracy of the Naïve Bayes classifier by up to 10%. Experimental results show that the proposed model achieves an average accuracy of 80% and an average F1-measure of 80%.
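
The two feature selection scores named in the abstract can be illustrated with a minimal sketch. It assumes the standard textbook definitions over a 2x2 term/class contingency table (expected mutual information and the chi-square statistic); the abstract does not give the exact formulation used in the paper, so the formulas, the function names, and the four-document toy corpus below are illustrative assumptions, not the authors' implementation or data.

```python
import math

# Toy corpus (hypothetical): each document is a set of tokens with one label.
DOCS = [
    {"team", "goal", "the"},
    {"match", "goal", "the"},
    {"bank", "market", "the"},
    {"market", "price", "the"},
]
LABELS = ["sport", "sport", "economy", "economy"]


def term_class_counts(docs, labels, term, cls):
    """Return (n11, n10, n01, n00): document counts by
    (term present?, label == cls?)."""
    n11 = n10 = n01 = n00 = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cls = label == cls
        if has_term and in_cls:
            n11 += 1
        elif has_term:
            n10 += 1
        elif in_cls:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00


def mutual_information(n11, n10, n01, n00):
    """Expected mutual information (in bits) between term presence and class."""
    n = n11 + n10 + n01 + n00
    cells = (
        (n11, n11 + n10, n11 + n01),  # term present, in class
        (n10, n11 + n10, n10 + n00),  # term present, not in class
        (n01, n01 + n00, n11 + n01),  # term absent, in class
        (n00, n01 + n00, n10 + n00),  # term absent, not in class
    )
    mi = 0.0
    for n_tc, n_t, n_c in cells:
        if n_tc > 0:  # empty cells contribute nothing (0 * log 0 := 0)
            mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for independence of term presence and class."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denom == 0:  # term occurs in every document or in none
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def top_terms(docs, labels, cls, k, score_fn):
    """Rank the vocabulary by score_fn for class cls and keep the k best terms."""
    vocabulary = sorted(set().union(*docs))
    scored = [
        (score_fn(*term_class_counts(docs, labels, term, cls)), term)
        for term in vocabulary
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by term
    return [term for _, term in scored[:k]]
```

In a pipeline of the kind the abstract describes, only the top-scoring terms would be kept as features before training the classifier. Both scores treat a term that reliably signals the absence of a class as informative too, so in the toy corpus "market" ranks as highly for the "sport" class as "goal" does, while "the", which occurs in every document, scores zero.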
