File:Zipf-chin-1 Chinese texts - Red Mansion, Pentateuch, Voice of America.svg

From Wikimedia Commons, the free media repository

Original file (SVG file, nominally 512 × 504 pixels, file size: 908 KB)

Captions

Zipf Law plots for five Chinese texts: Red Mansion, Pentateuch, Voice of America

Summary

Description
English: Zipf's law plot (frequency as a function of frequency rank) for the words in five texts in the Chinese (Mandarin) language. The texts and the word frequency files are:
  • The classical Chinese novel Dream of the Red Chamber or Dream of the Red Mansion (Hong2 Lou2 Meng4) by Cao2 Xue3 Qin2 and Gao E (~1750); with some errors and omissions. In disambiguated pinyin (see below). Sample: ci3 kai1 juan3 di4.2 yi1 hui2 ye3 zuo4.2 zhe3 zi4 yun2 yin1 ceng2 li4.4 [...] dong1 bian1 wu1 nei4.1 guo4 lai2 dai4.1 le5 liu2.1 File chin/red/tot.1/gud.wfr (original 706889 words, truncated/filtered to 35027 words, N = 2420 distinct).
  • A Chinese translation of the first five books (Pentateuch) of the Old Testament Bible, Union Version (1919). In disambiguated pinyin. Sample: qi3 chu1.1 shen2.1 chuang4 zao4 tian1 di4 di4 shi4 kong1 xu1.1 hun4 [...] zhei4 li3 wo3 yao4 jiang1 yi1 qie4 jie4.8 ming4 lü4.1 File chin/ptt/tot.1/gud.wfr (original 174364 words, truncated/filtered to 35027 words, N = 1392 distinct).
  • Another Chinese translation of the Pentateuch, possibly the New Chinese Version (1992). In disambiguated pinyin. Sample: qi3 chu1.1 shen2.1 chuang4 zao4 tian1 di4 di4 shi4 kong1 xu1.1 hun4 [...] he2 hua2 File chin/ptn/tot.1/gud.wfr (original 193319 words, truncated/filtered to 35027 words, N = 1405 distinct).
  • Transcripts of 92 selected Voice of America broadcasts in Chinese 1996-1998. In disambiguated pinyin. Sample: ge4.1 wei4.1 ting1 zhong4 mei3.1 guo2 zheng4.1 fu3 jue2.2 ding4 jin4 yi1 [...] zheng4.1 zhi4.2 fan4.1 zheng4 shi4.13 kang4.1 yi4.2 hen3 File chin/voa/tot.1/gud.wfr (original 58813 words, truncated/filtered to 35027 words, N = 1616 distinct).
  • Transcripts of the same 92 selected Voice of America broadcasts. In ambiguous pinyin. Sample: ge4 wei4 ting1 zhong4 mei3 guo2 zheng4 fu3 jue2 ding4 jin4 yi1 bu4 dong4 [...] guo2 da4 lu4 qi2 File chip/voa/tot.1/gud.wfr (original 59476 words, truncated/filtered to 35027 words, N = 830 distinct).

In all these plots, each character (syllable, logogram) is treated as a separate word. In the first four plots, the Chinese characters of the original text were mapped 1:1 from GB (Guo Biao) to pinyin with tone marks and disambiguating suffixes '.1', '.2', etc., so as to distinguish characters with the same pinyin -- e.g. 'zuo4', 'zuo4.1', 'zuo4.2'. In the last plot, the original file was a version transliterated by Ocrat.com, in pinyin with tone marks but without disambiguating suffixes (e.g. 'zuo4'), so the same pinyin word may represent two or more different characters.
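
The effect of the disambiguating suffixes on the number of distinct words can be illustrated with a minimal Python sketch. The helper name `strip_suffix` and the token list are hypothetical, chosen only for illustration. The ambiguous file described above came from an Ocrat.com transliteration, not from stripping suffixes, so this only approximates the relationship between the two encodings.

```python
import re

# Hypothetical helper: remove a trailing '.N' disambiguating suffix,
# e.g. 'zuo4.2' -> 'zuo4'. Tokens without a suffix are returned unchanged.
_SUFFIX = re.compile(r"\.\d+$")

def strip_suffix(token: str) -> str:
    return _SUFFIX.sub("", token)

# Illustrative tokens in the disambiguated style described above.
tokens = ["zuo4", "zuo4.1", "zuo4.2", "di4", "di4.2"]

distinct_disambiguated = len(set(tokens))
distinct_ambiguous = len({strip_suffix(t) for t in tokens})

print(distinct_disambiguated)  # 5
print(distinct_ambiguous)      # 2
```

Merging characters that share a pinyin reading lowers the distinct-word count, which is consistent with the smaller N reported for the ambiguous Voice of America file.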

The word frequency files '*/*/*/gud.wfr' are available at the UNICAMP website. The original annotated full texts, before truncation/filtering, are in the companion files */*/org/main.src. The truncated/filtered texts -- one word per line, without punctuation -- are in */*/*/gud.tlw.
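
As a rough sketch of how rank-frequency data of this kind could be derived from a one-word-per-line text such as the gud.tlw files, the Python below counts words and sorts the counts in decreasing order. The function name `rank_frequency` is hypothetical, and this is not necessarily how the gud.wfr files were produced; their exact format is not described here.

```python
from collections import Counter

def rank_frequency(lines):
    """Return (rank, frequency) pairs, rank 1 being the most frequent word.

    `lines` is any iterable of strings holding one word per line;
    blank lines are ignored.
    """
    words = (line.strip() for line in lines)
    counts = Counter(w for w in words if w)
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))

# Small in-memory example standing in for a one-word-per-line file.
sample = ["di4", "tian1", "di4", "", "shen2.1", "di4", "tian1"]
print(rank_frequency(sample))  # [(1, 3), (2, 2), (3, 1)]
```

To apply it to a real file, one could pass an open file object, for example `rank_frequency(open(path, encoding="utf-8"))`, where the path and the UTF-8 encoding are assumptions. The number of pairs returned corresponds to the number of distinct words, N.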
Date
Source Own work
Author Jorge Stolfi

Licensing

I, the copyright holder of this work, hereby publish it under the following license:
This file is licensed under the Creative Commons Attribution-Share Alike 4.0 International license.
You are free:
  • to share – to copy, distribute and transmit the work
  • to remix – to adapt the work
Under the following conditions:
  • attribution – You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
  • share alike – If you remix, transform, or build upon the material, you must distribute your contributions under the same or compatible license as the original.

File history

Date/Time | Dimensions | User | Comment
21:10, 15 May 2023 (current) | 512 × 504 (908 KB) | Jorge Stolfi | Rebuilt the file with small changes in dataset, colors
18:21, 9 May 2023 | 512 × 504 (908 KB) | Jorge Stolfi | Uploaded own work with UploadWizard
