A new method in text mining using fractional entropy

Document Type: Full-length research paper

Authors

1 Department of Physics, Faculty of Basic Science, Babol Noshirvani University of Technology, Babol, Iran

2 Department of Mathematics, Faculty of Basic Science, Babol Noshirvani University of Technology, Babol, Iran

Abstract

In this paper, we first review some definitions from fractional calculus and fractional entropy, a generalization of Shannon entropy. We then introduce a generalized word importance metric based on fractional entropy and, using this definition, propose a new text mining method. Applied to keyword extraction from the book Statistical Inference by Casella and Berger (1990), the proposed method attains a higher F-measure than the common text mining method based on Shannon entropy. These results indicate that the proposed method based on fractional entropy is more comprehensive than traditional text mining based on Shannon entropy.
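The abstract does not state the exact form of the fractional entropy or of the word importance metric, so the following is only a rough sketch under stated assumptions. It assumes the Ubriaco-type fractional entropy S_q = sum_i p_i (-ln p_i)^q, which reduces to Shannon entropy at q = 1, and scores a word by the entropy of its occurrence distribution over equal-sized parts of the text. That scoring rule and all function names are hypothetical illustrations, not the authors' definitions. The F-measure is the usual harmonic mean of precision and recall.

```python
import math
from collections import Counter


def fractional_entropy(probs, q=1.0):
    """Assumed Ubriaco-type form: sum_i p_i * (ln(1/p_i))**q.

    At q = 1 this is Shannon entropy in nats. The paper may use a
    different fractional generalization.
    """
    return sum(p * math.log(1.0 / p) ** q for p in probs if p > 0)


def word_part_entropy(tokens, word, parts=4, q=1.0):
    """Hypothetical word score: fractional entropy of the word's
    occurrence distribution over `parts` equal-sized chunks of the text.

    A word clustered in few chunks gets a low value; a word spread
    evenly gets a high value.
    """
    n = len(tokens)
    # Map each occurrence position to the index of the chunk it falls in.
    counts = Counter(i * parts // n for i, t in enumerate(tokens) if t == word)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return fractional_entropy([c / total for c in counts.values()], q)


def f_measure(extracted, reference):
    """Harmonic mean of precision and recall for two keyword collections."""
    extracted, reference = set(extracted), set(reference)
    hits = len(extracted & reference)
    if hits == 0:
        return 0.0
    precision = hits / len(extracted)
    recall = hits / len(reference)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    tokens = ["k", "k", "u", "z", "z", "u", "z", "z"]
    for w in ("k", "u"):
        print(w, word_part_entropy(tokens, w, parts=2, q=0.5))
    print(f_measure(["a", "b"], ["b", "c"]))
```

Varying `q` in (0, 1] changes how strongly low-probability chunks contribute to the score, which is one plausible reason a fractional order could rank words differently from the Shannon case.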


 
[1] C.D. Manning, H. Schutze, Foundations of Statistical Natural Language Processing, The MIT Press, Cambridge, (1999).
 
[2] M.W. Berry, J. Kogan, Text Mining Applications and Theory, Wiley, New York, (2010).
 
[3] M. Ortuno, P. Carpena, P. Bernaola-Galvan, E. Munoz, A.M. Somoza, Keyword detection in natural languages and DNA, Europhysics Letters 57 (2002) 759-764. https://doi.org/10.1209/epl/i2002-00528-3
 
[4] H. Zhou, G.W. Slater, A metric to search for relevant words, Physica A 329 (2003) 309-327. https://doi.org/10.1016/S0378-4371(03)00625-3
 
[5] P. Carpena, P. Bernaola-Galvan, M. Hackenberg, A.V. Coronado, J.L. Oliver, Level statistics of words: Finding keywords in literary texts and symbolic sequences, Physical Review E 79 (2009) 035102. https://doi.org/10.1103/PhysRevE.79.035102
 
[6] J.P. Herrera, P.A. Pury, Statistical keyword detection in literary corpora, European Physical Journal B 63 (2008) 135-146.
 
[7] Z. Yang, J. Lei, K. Fan, Y. Lai, Keyword extraction by entropy difference between the intrinsic and extrinsic mode, Physica A 392
 
[8] A. Mehri, A.H. Darooneh, The role of entropy in word ranking, Physica A 390 (2011) 3157-3163. https://doi.org/10.1016/j.physa.2011.04.013
 
[9] A. Mehri, M. Jamaati, H. Mehri, Word ranking in a single document by Jensen-Shannon divergence, Physics Letters A 379 (2015) 1627-1632. https://doi.org/10.1016/j.physleta.2015.04.030
 
[10] R. Mihalcea, Random walks on text structures, CICLing 2006, LNCS 3878, Springer, Heidelberg, (2006) 249-262. https://doi.org/10.1007/11671299_27
 
[11] G. Zipf, Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology, Addison-Wesley Press, Cambridge, (1949).
 
[12] H.P. Luhn, The automatic creation of literature abstracts, IBM Journal of Research and Development 2 (1958) 159-165. https://doi.org/10.1147/rd.22.0159
 
[13] M. Mezard, A. Montanari, Information, Physics and Computation, Oxford University Press, Oxford, (2009).
 
[14] J.T. Machado, Fractional order generalized information, Entropy 16 (2014) 2350-2361. https://doi.org/10.3390/e16042350
 
[15] D. Baleanu, K. Diethelm, E. Scalas, J.J. Trujillo, Fractional Calculus, World Scientific, Singapore, (2012).
 
[16] G.B. Bagci, The third law of thermodynamics and the fractional entropies, Physics Letters A 380 (2016) 2615-2618.
 
[17] G. Casella, R.L. Berger, Statistical Inference, Wadsworth, California, (1990).
 
[18] A. Mehri, H. Agahi, H. Mehri-Dehnavi, A novel word ranking method based on distorted entropy, Physica A 521 (2019) 484-492. https://doi.org/10.1016/j.physa.2019.01.080

[19] D.L. Olson, D. Delen, Advanced Data Mining Techniques, Springer-Verlag, Berlin, (2008).