Long-Range Statistical Correlations in Human Language: A Case Study in Persian Language

Document Type: Full-Length Research Paper

Author

Noshirvani University of Technology

Abstract

The complex structure of human language enables us to exchange highly complicated information. This communication system obeys several common nonlinear statistical regularities. We investigate four important statistical features of the Persian language, performing our calculations on selected works of six famous Persian litterateurs. Zipf’s law and Heaps’ law, which describe well-known power-law behaviors, are shown to hold in this language, and the two laws exhibit a qualitative inverse relation with each other. Furthermore, the information content associated with word ordering is measured using an entropic metric, which can be applied to ranking words by relevance. We also calculate the fractal dimension of words in the text using the box-counting method. The fractal dimension of each word, a positive value less than or equal to one, characterizes its spatial distribution in the text. Overall, we find that the Persian language follows these statistical laws, like the other languages studied in previous works.
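As a rough illustration of the quantities named in the abstract, the following minimal Python sketch estimates a Zipf exponent, a Heaps exponent, and a box-counting dimension for a list of word tokens. It is not the authors' code. The function names (`fit_slope`, `zipf_exponent`, `heaps_exponent`, `box_counting_dimension`) are hypothetical, and two simplifications are assumed: the text is already split into tokens, and each exponent is taken from a plain least-squares fit on a log-log scale over all points.

```python
import math
from collections import Counter


def fit_slope(xs, ys):
    """Least-squares slope of ys against xs (needs at least two distinct xs)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def zipf_exponent(tokens):
    """Estimate alpha in f(r) ~ r**(-alpha) from the rank-frequency list."""
    # Frequencies sorted from most to least common; rank starts at 1.
    freqs = sorted(Counter(tokens).values(), reverse=True)
    log_r = [math.log(r) for r in range(1, len(freqs) + 1)]
    log_f = [math.log(f) for f in freqs]
    return -fit_slope(log_r, log_f)


def heaps_exponent(tokens):
    """Estimate beta in V(n) ~ n**beta from the vocabulary growth curve."""
    seen = set()
    log_n, log_v = [], []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)  # V(n) is the number of distinct words in the first n tokens
        log_n.append(math.log(n))
        log_v.append(math.log(len(seen)))
    return fit_slope(log_n, log_v)


def box_counting_dimension(tokens, word, box_sizes):
    """Slope of log N(s) against log(1/s).

    N(s) is the number of boxes of s consecutive tokens that contain `word`.
    Assumes `word` occurs at least once and box_sizes has two or more sizes.
    """
    positions = [i for i, tok in enumerate(tokens) if tok == word]
    log_inv_s, log_count = [], []
    for s in box_sizes:
        occupied = {p // s for p in positions}  # indices of non-empty boxes
        log_inv_s.append(math.log(1.0 / s))
        log_count.append(math.log(len(occupied)))
    return fit_slope(log_inv_s, log_count)
```

In practice, fits of this kind are usually restricted to a chosen range of ranks or box sizes, since real texts deviate from a pure power law at both ends; the sketch leaves that choice to the caller.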



 