Explanation of the Database for the 1,945 Basic Japanese Kanji (J-1945D)

TAMAOKA, Katsuo (Hiroshima University, Japan)
KIRSNER, Kim (University of Western Australia, Australia)
YANASE, Yushi (Ehime University, Japan)
MIYAOKA, Yayoi (Hiroshima University, Japan)
KAWAKAMI, Masahiro (Nagoya University, Japan)

Produced on May 1, 2000

Address for correspondence: Katsuo Tamaoka, Institute for International Education, Hiroshima University, 1-2, 1-chome, Kagamiyama, Higashi-Hiroshima, Japan 739-8523
Tel: 0824-24-6288 (Office)
e-mail: ktamaoka@hiroshima-u.ac.jp

Japanese kanji provide a stimulus-rich environment for research on the perceptual and cognitive processes required for reading, memory and language acquisition in general. There are several potentially important differences between the Japanese writing system and other writing systems.

The first is that there are three different Japanese scripts: kanji, hiragana and katakana. Kanji developed from pictures used by the Chinese thousands of years ago to represent objects and events in the world around them. Some kanji have preserved their pictographic form and are still similar in appearance to the objects which they were intended to represent. Others were designed to represent more abstract ideas, and still others involve kanji combinations which were created to convey information about a related idea. A fourth type of kanji consists of elements that hint at pronunciation. Japanese also has two scripts representing morae (units slightly smaller than syllables), hiragana and katakana, which depict the same set of 46 basic sounds. Hiragana is used for verb endings, parts of speech and to write words not usually written in kanji. Katakana is used to write words and names which are not of Japanese or Chinese origin. No further consideration will be given to hiragana and katakana in this paper or in the associated database.

The second important difference has to do with the number of different kanji characters.
In the version of kanji now in common use there is a total of 1,945 characters, and the pedagogic load associated with mastering this set is evident in the way acquisition is spread over the first nine years of schooling.

A third difference is that kanji are constructed from a set of 214 constituents, or 'radicals'. Each of the 1,945 basic kanji therefore contains one of these 214 radicals. There is a parallel in Indo-European writing systems, where many words have evolved from the same stem or root of a remote language such as Latin. However, the radicals in kanji are 'pictographs' rather than letters.

A fourth difference is that the spoken form associated with an individual kanji is often shared by several other kanji, which are therefore homophones. The English parallel involves words such as BARE and BEAR which, unlike most kanji examples, are visually similar.

A fifth difference is that many kanji have two readings, On and Kun, based on words of Chinese and Japanese origin respectively. Homographs are of course present in the Indo-European languages as well, with the two interpretations of BANK being an obvious example.

The database described and illustrated in this paper was created to facilitate general access to information about these and other distributional characteristics of kanji and, in turn, to facilitate research into the perceptual and cognitive processes that may or may not be unique to kanji.

Kanji

The kanji script used in the Japanese language consists of morphemic units; a morpheme is the smallest unit of meaning in a language. About 70 percent of the 51,962 words listed in a Japanese dictionary are composed of two kanji (Yokosawa & Umeda, 1988). In 1981, the Japanese government published a list of the 1,945 most commonly used basic kanji. Called 'Jooyoo Kanji Hyoo' (常用漢字表), the list established the standard for kanji usage (Ministry of Education, Science and Culture, Government of Japan, 1987, 1998). According to a survey on the frequency of kanji in print conducted by the National Language Research Institute (1976), 2,000 kanji accounted for 99.6 percent of the kanji used in three major Japanese newspapers (Asahi, Mainichi and Yomiuri) published during 1966. Although the 1,945 basic kanji and the 2,000 kanji mentioned above are not identical, it is estimated that the 1,945 basic kanji cover approximately 99 percent of the kanji used in Japanese newspapers.

In 1989, the Ministry of Education, Science and Culture, Government of Japan released a revised version of the Japanese language curriculum (Nihon-go Gakushuu Shidoo Yooryoo; 日本語学習指導要領) which included a list of kanji to be mastered from Grades 1 to 6 (Gakunen-betsu Kanji Haitoo-hyoo; 学年別漢字配当表). Of the 1,006 kanji on the list, 80 are taught in Grade 1, 160 in Grade 2, 200 each in Grades 3 and 4, 185 in Grade 5, and 181 in Grade 6. All of these 1,006 kanji are taken from the 1,945 basic kanji. The remaining 939 kanji are taught from Grades 7 to 9. Because the Education Act in Japan stipulates that all Japanese citizens complete the ninth grade, every Japanese person must study all 1,945 basic kanji by Grade 9. We can, therefore, expect that native Japanese speakers educated in Japan will know at least these 1,945 basic kanji.

For this reason, the 1,945 basic kanji are ideal for experimental use in studies involving the Japanese language. The present database provides 27 cells which cover various aspects of the fundamental characteristics of the 1,945 basic Japanese kanji. This information is stored in a Microsoft Excel 2000 file. Using this database, researchers will be able to conduct planned experiments based on the known characteristics of selected kanji.

The Kanji Table

The database includes 27 variables describing the 1,945 basic kanji.
Regarding the database cells from left to right, the variables are:
(1) ID classification according to the Japanese 50-Sound System (50-Onzu, 五十音図),
(2) kanji orthography,
(3) classification based on six categories provided by Shirakawa (1994),
(4) school grade during which the kanji is taught,
(5) number of strokes,
(6) kanji frequency provided by the National Language Research Institute (1976),
(7) kanji frequency published by Yokoyama, Sasahara, Nozaki and Long (1998),
(8) kanji frequency on CD-ROM provided by Yokoyama et al. (1998),
(9) kanji neighborhood size of the left-hand side position provided by Kawakami (1997),
(10) kanji neighborhood size of the right-hand side position provided by Kawakami (1997),
(11) total kanji neighborhood size, adding the left-hand and right-hand sides together,
(12) accumulative kanji neighborhood of the left-hand side position,
(13) accumulative kanji neighborhood of the right-hand side position,
(14) total accumulative kanji neighborhood, adding the left-hand and right-hand sides together,
(15) name of the radical used for the kanji,
(16) radical frequency in the 1,945 basic kanji,
(17) number of constituents which construct the kanji,
(18) number of kanji homophones,
(19) number of On-readings,
(20) On-reading pronunciations,
(21) English translation of On-readings,
(22) number of Kun-readings,
(23) Kun-reading pronunciations,
(24) English translation of Kun-readings,
(25) On-reading frequency calculated from the index provided by the National Language Research Institute (1976),
(26) On- and Kun-reading accumulative frequency calculated from the index of the National Language Research Institute (1976), and
(27) On-reading ratio (Cell 25 divided by Cell 26).

Cell 1: Kanji identification number as it ranks in the Japanese 50-Sound System.

Cell 2: Kanji character. The character's orthography is presented in this cell.

Cell 3: Kanji classification.
The most common classification system for kanji is one developed by the Chinese and adopted by the Japanese, which divides all kanji into six categories (Rikusho Bunrui, 六書分類). These six categories are (1) 'pictographic kanji' (Shookei, 象形) derived from the shapes of objects, (2) 'ideographic kanji' (Shiji, 指事) which express ideas and qualities, (3) 'compound ideographic kanji' (Kaii, 会意) formed by combining more than one internal constituent to represent ideas and their associations, (4) 'phonetic compound kanji' (Keisei, 形声) constructed from phonetic and semantic components, (5) 'loan kanji' (Kashaku, 仮借) whose original sounds were adopted but not their original meanings, and (6) 'analogous kanji' (Tenchuu, 転注) which are new kanji patterned after old kanji to denote new meanings. To represent these categories, Cell 3 uses these six terms: '象形 Pictographic', '指事 Ideographic', '会意 Com. Ideographic', '形声 Phonetic', '仮借 Loan', and '転注 Analogous'. (Analogous kanji are not found in the 1,945 basic Japanese kanji.) These six categories are based upon the system of categorization provided by Shirakawa (1994). It should be noted that there are five kanji (i.e., ?, ?, ?, ? and ?) that cannot be classified according to this system. These kanji are original Japanese characters (Kokuji, 国字) and are labeled as '国字 Original'. The kanji ? is also an original Japanese character, but because it is also a phonetic compound kanji it is listed as '形声 Phonetic'.

Cell 4: Grade of instruction. This cell specifies the school grade in which a kanji is taught. The figures for the 1,006 kanji follow the Japanese language curriculum as published by the Ministry of Education, Science and Culture, Government of Japan in 1989. Since the remaining kanji are taught in Grades 7-9, these are all indicated with the number '7'.

Cell 5: Number of strokes. This cell specifies the number of strokes in each kanji.
The number of strokes required to write each kanji is taken from a Japanese kanji dictionary edited by Kamata (1991).

Cell 6: Frequency of occurrence for kanji in print (kanji frequency) as indexed in 1976. Kanji frequency was calculated from all words printed in three major newspapers (Asahi, Yomiuri and Mainichi) during the year 1966 and was published by the National Language Research Institute in 1976. The study sampled 991,375 kanji and, for each one, calculated the frequency of appearance per 1,000 kanji.

Cell 7: Kanji frequency as indexed in 1998. Yokoyama, Sasahara, Nozaki and Long (1998) published frequency of occurrence data based on all the kanji in the Tokyo edition of the Asahi newspaper printed in 1993. We used Pearson's correlation to examine the relationship between the frequency of occurrence of the 1,945 basic kanji (n=1945) for 1966, as recorded in Cell 6, and for 1993, as recorded in Cell 7. The correlation was 0.969, a figure which indicates that the frequency of occurrence for printed kanji was stable over a period of 27 years, from 1966 to 1993.

Cell 8: Frequency of occurrence for kanji on CD-ROM (KF on CD). Yokoyama et al. (1998) calculated and published frequency of occurrence data for all the kanji used in the CD-ROM version of the Asahi newspaper (CD-HIASK'93), which contained 110,000 newspaper articles published in 1993. Thus, all the words in the Tokyo edition of the Asahi newspaper used in Cell 7 are included in the CD-ROM version. Pearson's correlation between the kanji frequencies of occurrence for 1966 (Cell 6) and for the 1993 CD-ROM version (Cell 8) was 0.971. Thus, kanji frequency indexes changed little over 27 years. The correlation between the two kanji frequencies for the 1993 Tokyo edition of the Asahi newspaper and its CD-ROM version was 0.996, which suggests that the smaller sample of newspaper text was enough to represent the kanji frequency index.

Cell 9: Kanji neighborhood size of the left-hand side of two-kanji compound words.
An index of kanji neighborhood size is provided by Kawakami (1997). The term 'kanji neighborhood size' refers to the number of two-kanji compound words that a kanji creates when combined with another kanji. Two-kanji compound words are produced by the combination of kanji placed in the left-hand and right-hand side positions of the word. Cell 9 provides the kanji neighborhood size of the left-hand side.

Cell 10: Kanji neighborhood size of the right-hand side of two-kanji compound words.

Cell 11: Total kanji neighborhood size for both the left-hand and right-hand sides of two-kanji compound words.

Cell 12: Accumulative kanji neighborhood of the left-hand side of two-kanji compound words. A kanji neighborhood size is simply a count of the number of two-kanji compound words, with no consideration of word frequency. The accumulative kanji neighborhood, by contrast, is calculated by adding up the frequencies of occurrence of all these words in print, as provided by the National Language Research Institute (1973). Since two-kanji compound words can be produced by the combination of kanji placed in the left-hand or right-hand side of the word, Cell 12 provides the accumulative kanji neighborhood of the left-hand side.

Cell 13: Accumulative kanji neighborhood of the right-hand side of two-kanji compound words.

Cell 14: Total accumulative kanji neighborhood of both the left-hand and right-hand sides of two-kanji compound words.

Cell 15: Name of radical. This cell indicates the name of the radical used in the kanji. Japanese kanji dictionaries traditionally classify kanji by way of 214 radicals. The name used for each radical has been taken from the Japanese kanji dictionary edited by Kamata (1987).

Cell 16: Radical frequency calculated using the 1,945 basic kanji. The radical frequency indicates how many of the 1,945 basic kanji share the same radical. A large body of kanji (1,057 characters or 54.34% of the 1,945 basic kanji) is constructed with only 24 radicals out of a possible 214.
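The counts described for Cells 9 to 14 and Cell 16 can be illustrated with a minimal Python sketch. It is not the procedure used to build the database: the word list, its frequencies, the radical labels and the function names `neighborhoods` and `radical_frequency` are all hypothetical and chosen only to show how a neighborhood size (a count of words) differs from an accumulative neighborhood (a sum of word frequencies), and how a radical frequency is shared by every kanji with the same radical.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (two-kanji compound word, frequency in print).
# The frequencies are invented for illustration and come from no index.
WORDS = [("学校", 120), ("学生", 80), ("大学", 200), ("入学", 40), ("校長", 30)]

# Hypothetical toy mapping from kanji to a radical label.
RADICALS = {"休": "ninben", "体": "ninben", "学": "ko"}


def neighborhoods(words):
    """Compute Cell 9-14 style figures for each kanji in two-kanji words."""
    stats = defaultdict(lambda: {"left_size": 0, "right_size": 0,
                                 "left_accum": 0, "right_accum": 0})
    for word, freq in words:
        if len(word) != 2:
            raise ValueError(f"not a two-kanji word: {word!r}")
        left, right = word
        # Neighborhood size counts words; the accumulative figure sums
        # the frequencies of those same words.
        stats[left]["left_size"] += 1
        stats[left]["left_accum"] += freq
        stats[right]["right_size"] += 1
        stats[right]["right_accum"] += freq
    for s in stats.values():
        s["total_size"] = s["left_size"] + s["right_size"]
        s["total_accum"] = s["left_accum"] + s["right_accum"]
    return dict(stats)


def radical_frequency(kanji_to_radical):
    """For each kanji, count how many kanji in the set share its radical."""
    counts = Counter(kanji_to_radical.values())
    return {kanji: counts[rad] for kanji, rad in kanji_to_radical.items()}


if __name__ == "__main__":
    print(neighborhoods(WORDS)["学"])
    print(radical_frequency(RADICALS))
```

In this toy set the kanji 学 appears on the left of two words and on the right of two words, so its total neighborhood size is 4, while its total accumulative neighborhood is the sum of the four invented word frequencies. The radical counts here are relative to the three-kanji toy mapping only, not to the 1,945 basic kanji.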
Cell 17: Number of constituents. Single kanji are often composed of two or more constituents: a radical and secondary elements. We conducted a survey of visually complex kanji to identify how native Japanese speakers divide up a single kanji character. Japanese speakers were asked to divide kanji into smaller constituents by circling them. Our survey found that subjects were likely to divide kanji first into radical units and then into other elements. Our database follows this survey in defining the number of constituents in each kanji.

Cell 18: Number of kanji homophones. A single kanji pronunciation is often shared by multiple kanji. For example, the sound /yoku/ can be written with five different characters among the 1,945 basic kanji. Each of these five kanji is therefore assigned the number 5 in this cell. Both On- and Kun-readings were included when counting kanji homophones. The majority of kanji homophones were found in On-readings.

Cell 19: Number of On-readings. Japanese kanji often have multiple On-readings. On-readings were adopted from the original Chinese pronunciations during the Chinese dynastic periods when contact occurred between Japan and China. The count of On-readings was taken from the On-readings listed in the kanji dictionary edited by Kamata (1991). Of the 1,945 basic kanji, 779 (40.05%) had only one type of reading, either On or Kun.

Cell 20: Pronunciation of On-readings. The pronunciation of On-readings is written in the Hepburn system of romanization provided by Nelson (1992). However, when the pronunciation of an On-reading ends with a geminate sound, the symbol /Q/ is used. This phonemic symbol is commonly used to represent Japanese special sounds. In order to describe Japanese sounds precisely, long vowels are transcribed by repeating the same vowel twice, as in /oo/ and /uu/. Some rare pronunciations of On-readings were excluded from the database.

Cell 21: English translation of On-readings.

Cell 22: Number of Kun-readings.
Kun-readings originated in Japan. The count of Kun-readings was taken from the Kun-readings listed in the kanji dictionary edited by Kamata (1987). Rare Kun-readings were not included.

Cell 23: Pronunciation of Kun-readings. Kun-readings are written in the Hepburn system of romanization provided by Nelson (1992). Again, long vowels are transcribed by repeating the same vowel. No Kun-reading ends with the geminate sound /Q/. Rare pronunciations of Kun-readings were excluded from the database.

Cell 24: English translation of Kun-readings.

Cell 25: Accumulative frequency for kanji with On-readings. Using the kanji frequency index of 1976 provided by the National Language Research Institute, we calculated the frequency of occurrence for the On-reading(s) of each kanji. The resulting figures therefore give frequencies of occurrence as they appeared in the three newspapers published in 1966. For example, the kanji 育, which has an On-reading of /iku/, appeared in 19 different words used 918 times in the sample taken from the newspapers. Its accumulative frequency is therefore listed as 918. The names of persons and places were not included in this frequency. When the overall accumulative kanji frequency was less than 9, the index of 1976 did not provide detailed frequencies of either On- or Kun-readings. Accumulative frequency for these kanji is indicated by a hyphen '-'.

Cell 26: Accumulative frequency for combined On- and Kun-readings. Using the kanji frequency index of 1976, the total accumulative frequency was calculated for each kanji using both its On- and Kun-readings. In the process of calculation, Cell 26 excluded kanji used for proper nouns. Because of this exclusion, the figures in this cell differ slightly from the kanji frequency figures of Cell 6.

Cell 27: On-reading ratio. A database consisting of 996 kanji published by Kaiho and Nomura (1983) provided On-reading ratios.
Kaiho and Nomura calculated this ratio by working with kanji which have both On- and Kun-readings. For each of these kanji, they totaled the number of subjects who had employed only the On-reading and then divided this figure by the number of subjects who had applied both readings. The shortcoming of this approach is that it disregards whether all subjects were familiar with both readings. For our database, we calculated the On-reading ratio by dividing the accumulative On-reading frequency in Cell 25 by the accumulative frequency for the combined On- and Kun-readings in Cell 26. The On-reading ratios in the present database are therefore objective figures based on kanji frequency of occurrence. When a kanji has only one type of reading (either On or Kun), the figure is indicated by a hyphen '-'.

References

Kaiho, H., & Nomura, Y. (1983). Kanji joohoo shori no shinrigaku [Psychology of kanji information processing]. Tokyo: Kyouiku Shuppan. (In Japanese)

Kamata, T. (1991). Kuwashii shoogakkoo kanji jiten [The detailed elementary school kanji dictionary]. Tokyo: Buneido. (In Japanese)

Kawakami, M. (1997). JIS 1-shu kanji 2965 ji wo mochiite sakusei sareru kanji niji jukugosuu hyoo [Tables of two-kanji compound words constructed with 2,965 JIS 1st kanji characters]. School of Education Bulletin, Nagoya University, 44, 243-299. (In Japanese)

Kokuritsu Kokugo Kenkyuujo [National Language Research Institute]. (1973). Shinbun no goi chosa (IV) [A study of Japanese word usage in newspapers]. Tokyo: National Language Research Institute. (In Japanese)

Kokuritsu Kokugo Kenkyuujo [National Language Research Institute]. (1976). Gendai Shinbun no Kanji [Japanese kanji characters in modern newspapers]. Tokyo: National Language Research Institute. (In Japanese)

Ministry of Education, Science and Culture, Government of Japan. (1978). Chuugakkoo shidoosho - Kokugohen [The Japanese language - the course of study at junior high-school]. Tokyo: Tokyo Shoseki. (In Japanese)

Ministry of Education, Science and Culture, Government of Japan. (1987). Shoogakkoo shidoosho - Kokugohen [The Japanese language - the course of study at elementary school]. Osaka: Osaka Shoseki. (In Japanese)

Ministry of Education, Science and Culture, Government of Japan. (1998). Monbushoo kokuji - Shoogakkoo gakushuu shidoo yooryoo [The announcement of the elementary school course of study by the Ministry of Education, Science and Culture, Government of Japan]. Tokyo: Gyoosei. (In Japanese)

Nelson, A. N. (1992). The modern reader's Japanese-English character dictionary (2nd revised edition, 35th printing). Tokyo: Charles E. Tuttle Company.

Shirakawa, S. (1994). Jitoo [Kanji etymology]. Tokyo: Heibonsha. (In Japanese)

Yokosawa, K., & Umeda, M. (1988). Processes in human Kanji-word recognition. Proceedings of the 1988 IEEE international conference on systems, man, and cybernetics (pp. 377-380). August 8-12, 1988, Beijing and Shenyang, China.

Yokoyama, S., Sasahara, H., Nozaki, H., & Long, E. (1998). Shinbun denshi media no kanji – Asahi shinbun CD-ROM niyoru kanji hindo hyoo [Japanese kanji in the newspaper media – Kanji frequency index from the Asahi newspaper on CD-ROM]. Tokyo: Sanseidoo. (In Japanese)