Explanation of the Database for the 1,945 Basic Japanese Kanji
(J-1945D)

TAMAOKA, Katsuo (Hiroshima University, Japan)
KIRSNER, Kim (University of Western Australia, Australia)
YANASE, Yushi (Ehime University, Japan)
MIYAOKA, Yayoi (Hiroshima University, Japan)
KAWAKAMI, Masahiro (Nagoya University, Japan)

Produced on May 1, 2000

Address for correspondence:
Katsuo Tamaoka, Institute for International Education, Hiroshima University, 1-2, 1-chome,
Kagamiyama, Higashi-Hiroshima, Japan 739-8523
Tel: 0824-24-6288 (Office)
e-mail: ktamaoka@hiroshima-u.ac.jp

Japanese kanji provide a stimulus-rich environment for research focusing
on the perceptual and cognitive processes required for reading, memory and
language acquisition in general.
There are several potentially important differences between the Japanese
writing system and other writing systems. The first is that there are three
different Japanese scripts: kanji, hiragana and katakana. Kanji developed from
pictures used by the Chinese thousands of years ago to represent objects and
events in the world around them. Some kanji have preserved their pictographic
form and are still similar in appearance to the objects which they were intended
to represent. Others were designed to represent more abstract ideas, and still
others involve kanji combinations which were created to convey information
about a related idea. A fourth type of kanji consists of elements that hint at
pronunciation. Japanese also has two scripts, hiragana and katakana, which
represent morae (units slightly smaller than syllables) and depict the same
set of 46 basic sounds. Hiragana is used for verb endings, parts of speech and
to write words not usually written in kanji. Katakana is used to write words and
names which are not of Japanese or Chinese origin. No further consideration
will be given to hiragana and katakana in this paper or in the associated
database.
The second important difference has to do with the number of different kanji
characters. In the version of kanji now in common use there is a total of 1,945
characters, and the pedagogic load associated with mastery of this set of
characters is evident in the way acquisition is spread out over the first nine
years of schooling. A third difference is the fact that kanji are constructed from
a set of 214 constituents, or 'radicals'. Accordingly, each of the 1,945 basic
kanji is built on one of these 214 radicals. There is a parallel in
Indo-European writing systems where many words have evolved from the
same stem or root of a remote language such as Latin. However, the radicals
in kanji are 'pictographs' rather than letters.
A fourth difference involves the fact that the spoken forms associated with
an individual kanji character are often shared by several other kanji, and are
therefore homophones. The English parallel involves words such as BARE and
BEAR which, unlike most kanji examples, are visually similar. A fifth
difference involves the fact that many kanji have two types of reading, On and
Kun, based on words of Chinese and Japanese origin respectively.
Homographs are of course present in the Indo-European languages as well,
with the two interpretations of BANK being an obvious example. The database
described and illustrated in this paper was created to facilitate general access
to information about these and other distributional characteristics of kanji and,
in turn, to facilitate research into the perceptual and cognitive processes that
may or may not be unique to kanji.

Kanji

The kanji script used in the Japanese language consists of morphemic units,
the smallest units of meaning in a language. About 70 percent of the 51,962
words listed in a Japanese dictionary are composed of two kanji (Yokosawa
& Umeda, 1988). In 1981, the Japanese government published a list of the
1,945 most commonly used basic kanji. Called 'Jooyoo Kanji Hyoo' (常用漢字表),
the list established the standard for kanji usage (Ministry of Education,
Science and Culture, Government of Japan, 1987, 1998). According to a
survey on frequency of kanji in print conducted by the National Language
Research Institute (1976), 2,000 kanji encompassed 99.6 percent of the kanji
used in three major Japanese newspapers (Asahi, Mainichi and Yomiuri)
published during 1966. Although the 1,945 basic kanji and the 2,000 kanji
mentioned above were not identical in each case, it is estimated that the 1,945
basic kanji cover approximately 99 percent of kanji used in Japanese
newspapers.
In 1989, the Ministry of Education, Science and Culture, Government of Japan,
released a revised version of the Japanese language curriculum (Nihon-go
Gakushuu Shidoo Yooryoo; 日本語学習指導要領) which included a list of kanji
to be mastered from Grades 1 to 6 (Gakunen-betsu Kanji Haitoo-hyoo;
学年別漢字配当表). Of the 1,006 kanji on the list, 80 are taught in Grade 1, 160 in
Grade 2, 200 each in Grades 3 and 4, 185 in Grade 5, and 181 in Grade
6. All of these 1,006 kanji are taken from the 1,945 basic kanji. The remaining
939 kanji are taught from Grades 7 to 9. Because the Education Act in Japan
stipulates that all Japanese citizens complete the ninth grade, every Japanese
person must study all 1,945 basic kanji by Grade 9. We can, therefore,
expect that native Japanese speakers educated in Japan will know at least
these 1,945 basic kanji.
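
The grade-by-grade figures in this paragraph can be checked with a few lines
of arithmetic. The following is a minimal sketch in Python; the variable names
are our own and are not part of the database.

```python
# Number of kanji taught in each elementary grade, as stated in the text.
KANJI_PER_GRADE = {1: 80, 2: 160, 3: 200, 4: 200, 5: 185, 6: 181}

TOTAL_BASIC_KANJI = 1945

# Kanji assigned to Grades 1 to 6.
elementary_total = sum(KANJI_PER_GRADE.values())
# Kanji left over for Grades 7 to 9.
secondary_total = TOTAL_BASIC_KANJI - elementary_total

print(elementary_total, secondary_total)
```
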
The 1,945 basic kanji are ideal for experimental use in studies involving the
Japanese language. The present database provides 27 cells which cover
various aspects of the fundamental characteristics of the 1,945 basic Japanese
kanji. This information is stored in a Microsoft Excel 2000 file. Using this
database, researchers will be able to conduct planned experiments based on
the known characteristics of selected kanji.

The Kanji Table

The database includes 27 variables describing the 1,945 basic kanji.
Regarding the database cells from left to right, the variables are (1) ID
classification according to the Japanese 50-Sound System (50-Onzu, 五十音図),
(2) kanji orthography, (3) classification based on six categories provided
by Shirakawa (1994), (4) school grade during which the kanji is taught, (5)
number of strokes, (6) kanji frequency provided by the National Language
Research Institute (1976), (7) kanji frequency published by Yokoyama,
Sasahara, Nozaki and Long (1998), (8) kanji frequency on CD-ROM provided
by Yokoyama et al. (1998), (9) kanji neighborhood size of the left-hand side
position provided by Kawakami (1997), (10) kanji neighborhood size of the
right-hand side position provided by Kawakami (1997), (11) total kanji
neighborhood size, adding the left-hand and right-hand sides together, (12)
accumulative kanji neighborhood of the left-hand side position, (13)
accumulative kanji neighborhood of the right-hand side position, (14) total
accumulative kanji neighborhood, adding the left-hand and right-hand sides
together, (15) name of radicals used for the kanji, (16) radical frequency in the
1,945 basic kanji, (17) number of constituents which construct the kanji, (18)
number of kanji homophones, (19) number of On-readings, (20) On-reading
pronunciations, (21) English translation of On-readings, (22) number of
Kun-readings, (23) Kun-reading pronunciations, (24) English translation of
Kun-readings, (25) On-reading frequency calculated from the index provided
by the National Language Research Institute (1976), (26) On- and Kun-reading
accumulative frequency calculated from the index of the National Language
Research Institute (1976), and (27) On-reading ratio (Cell 25 divided by Cell
26).
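
For readers who export the sheet and process it programmatically, the 27
variables listed above can be given short field names. The following Python
sketch is illustrative only: the field names are hypothetical labels of our
own, not headers taken from the Excel file, and the sample row holds
placeholder values rather than real data.

```python
# Hypothetical short names for the 27 cells, in left-to-right order.
# These labels are assumptions for illustration; the Excel file may differ.
FIELD_NAMES = [
    "id", "kanji", "classification", "grade", "strokes",
    "freq_1976", "freq_1998", "freq_cd_rom",
    "nbh_left", "nbh_right", "nbh_total",
    "acc_nbh_left", "acc_nbh_right", "acc_nbh_total",
    "radical_name", "radical_freq", "constituents", "homophones",
    "on_count", "on_readings", "on_english",
    "kun_count", "kun_readings", "kun_english",
    "on_acc_freq", "on_kun_acc_freq", "on_ratio",
]


def row_to_record(row):
    """Pair one row of 27 cell values with the field names above."""
    if len(row) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} cells, got {len(row)}")
    return dict(zip(FIELD_NAMES, row))


# Placeholder row: cell k simply holds the string "cell k".
example = row_to_record([f"cell {k}" for k in range(1, 28)])
print(example["grade"])  # value of the fourth cell
```

Keeping the names in a single ordered list means that the cell numbers used
in this paper (Cell 1 to Cell 27) map directly onto list positions 0 to 26.
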

Cell 1: Kanji identification number as it ranks in the Japanese 50-Sound
System.
Cell 2: Kanji character. This cell presents the character's orthography.
Cell 3: Kanji classification. The most common classification system for kanji is
one developed by the Chinese and adopted by the Japanese which divides all
kanji into six categories (Rikusho Bunrui, 六書分類). These six categories are
(1) 'pictographic kanji' (Shookei, 象形) derived from the shapes of objects,
(2) 'ideographic kanji' (Shiji, 指事) which express ideas and qualities, (3)
'compound ideographic kanji' (Kaii, 会意) formed by combining more than
one internal constituent to represent ideas and their associations, (4)
'phonetic compound kanji' (Keisei, 形声) constructed from phonetic and
semantic components, (5) 'loan kanji' (Kashaku, 仮借) whose original
sounds were adopted but not their original meaning and (6) 'analogous
kanji' (Tenchuu, 転注) which are new kanji patterned after old kanji to denote
new meanings. To represent these categories, Cell 3 uses these six terms:
'象形 Pictographic', '指事 Ideographic', '会意 Com. Ideographic',
'形声 Phonetic', '仮借 Loan', and '転注 Analogous'. (Analogous
kanji are not found in the 1,945 basic Japanese kanji.) These six categories
are based upon the system of categorization provided by Shirakawa (1994). It
should be noted that there are five kanji (i.e., ?, ?, ?, ? and ?) that
cannot be classified according to this system of categorization. These kanji are
original Japanese characters (Kokuji, 国字). They are labeled as '国字
Original'. The kanji ? is also an original Japanese character but it is also a
phonetic compound kanji, so it is listed as '形声 Phonetic'.
Cell 4: Grade of instruction. This cell specifies the school grade in which a kanji
is taught. The figures for the 1,006 kanji follow the Japanese language
curriculum as published by the Ministry of Education, Science and Culture,
Government of Japan in 1989. Since the remaining kanji are taught in
Grades 7-9, these are all indicated with the number '7'.
Cell 5: Number of strokes. This cell specifies the number of strokes in each
kanji. The strokes required to write a kanji are taken from a Japanese kanji
dictionary edited by Kamata (1991).
Cell 6: Frequency of occurrence for kanji in print (kanji frequency) as indexed
in 1976. Kanji frequency was calculated from all words printed in three major
newspapers (Asahi, Yomiuri and Mainichi) during the year 1966 and was
published by the National Language Research Institute in 1976. The study
sampled 991,375 kanji occurrences and, for each kanji character, calculated
the frequency of appearance per 1,000 kanji.
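
As an illustration of the per-1,000 scaling described for Cell 6, the
following Python sketch converts a raw count into a rate. The sample size is
the figure reported above; the raw count is hypothetical, and the function
name is our own.

```python
SAMPLE_SIZE = 991_375  # total kanji sampled, as reported in the text


def per_thousand(raw_count, sample_size=SAMPLE_SIZE):
    """Convert a raw occurrence count into occurrences per 1,000 kanji."""
    return raw_count / sample_size * 1000


# Hypothetical raw count, chosen only to illustrate the conversion.
rate = per_thousand(4957)
print(round(rate, 3))
```
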
Cell 7: Kanji frequency as indexed in 1998. Yokoyama, Sasahara, Nozaki and
Long (1998) published frequency of occurrence data based on all the kanji in
the Tokyo edition of the Asahi newspaper printed in 1993. We used
Pearson's correlation to find the relationship between frequency of
occurrence among the 1,945 basic kanji (n=1945) for 1966, as recorded in
Cell 6, and for 1993, as recorded in Cell 7. The correlation was
0.969, a figure which indicates that the frequency of occurrence for printed
kanji was stable over a period of 27 years, from 1966 to 1993.
Cell 8: Frequency of occurrence for kanji on CD-ROM (KF on CD). Yokoyama
et al. (1998) calculated and published frequency of occurrence data for all the
kanji used in the CD-ROM version of the Asahi newspaper (CD-HIASK'93). It
contained 110,000 newspaper articles published in 1993. Thus, all the words
in the Tokyo edition of the Asahi newspaper used in Cell 7 are included in the
CD-ROM version. Pearson's correlation between the kanji frequency of
occurrence established for 1966 (Cell 6) and for the 1993 CD-ROM version
(Cell 8) was 0.971. Thus, kanji frequency indexes changed little over 27
years. The correlation between the two kanji frequencies for the 1993 Tokyo
edition of the Asahi newspaper (Cell 7) and its CD-ROM version (Cell 8) was
0.996. This suggests that the smaller sample of newspaper text was
sufficient to represent the kanji frequency index.
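
The correlations reported for Cells 6 to 8 are Pearson product-moment
correlations between pairs of frequency columns. The sketch below shows the
computation in plain Python. The two short lists are hypothetical values, not
data from the database, and the function name is our own.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares of the deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical frequencies for five kanji in two different samples.
freq_a = [10.0, 20.0, 30.0, 40.0, 50.0]
freq_b = [12.0, 18.0, 33.0, 41.0, 48.0]
print(round(pearson_r(freq_a, freq_b), 3))
```

Applied to two full columns of the database (n=1945), the same function would
be expected to reproduce the coefficients quoted above, provided the columns
are read as numbers and contain no missing entries.
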
Cell 9: Kanji neighborhood size of the left-hand side of two-kanji compound
words. An index of kanji neighborhood size is provided by Kawakami (1997).
The term 'kanji neighborhood size' refers to the number of two-kanji
compound words that a kanji forms in combination with other kanji. Two-kanji
compound words are produced by combining a kanji in the left-hand position
with a kanji in the right-hand position of the word. Cell 9 provides the kanji
neighborhood size for the left-hand position.
Cell 10: Kanji neighborhood size of the right-hand side of two-kanji compound
words.
Cell 11: Total kanji neighborhood size for both the left-hand and right-hand
sides of two-kanji compound words.
Cell 12: Accumulative kanji neighborhood of the left-hand side of two-kanji
compound words. A kanji neighborhood size is simply a count of the number of
two-kanji compound words with no consideration of word frequency. In
contrast, the accumulative kanji neighborhood is calculated by adding up the
frequencies of occurrence in print, as provided by the National Language
Research Institute (1973), of all these words. Since two-kanji compound
words can be produced by placing a kanji in the left-hand or right-hand
position of the word, Cell 12 provides the accumulative kanji neighborhood
of the left-hand side.
Cell 13: Accumulative kanji neighborhood of the right-hand side of two-kanji
compound words.
Cell 14: Total accumulative kanji neighborhood for both the left-hand and
right-hand sides of two-kanji compound words.
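
The difference between neighborhood size (Cells 9 to 11) and accumulative
neighborhood (Cells 12 to 14) can be sketched as follows. This is a minimal
Python illustration under two assumptions of ours: that 'left-hand side'
means the target kanji occupies the left position of the compound, and that
word frequencies are available as a simple mapping. The words and
frequencies below are hypothetical and are not taken from Kawakami (1997) or
the National Language Research Institute (1973).

```python
# Hypothetical word frequencies for a handful of two-kanji compound words.
WORD_FREQ = {"学校": 120, "学生": 80, "大学": 150, "入学": 40, "校長": 30}


def neighborhood(kanji, word_freq):
    """Return neighborhood sizes and accumulative frequencies for one kanji."""
    # Compounds in which the target kanji is the left or the right character.
    left = [w for w in word_freq if len(w) == 2 and w[0] == kanji]
    right = [w for w in word_freq if len(w) == 2 and w[1] == kanji]
    left_acc = sum(word_freq[w] for w in left)
    right_acc = sum(word_freq[w] for w in right)
    return {
        "left_size": len(left),                 # corresponds to Cell 9
        "right_size": len(right),               # corresponds to Cell 10
        "total_size": len(left) + len(right),   # corresponds to Cell 11
        "left_acc": left_acc,                   # corresponds to Cell 12
        "right_acc": right_acc,                 # corresponds to Cell 13
        "total_acc": left_acc + right_acc,      # corresponds to Cell 14
    }


result = neighborhood("学", WORD_FREQ)
print(result)
```
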
Cell 15: Name of radical. This cell indicates the name of the radical used in
kanji. Japanese kanji dictionaries traditionally classify kanji by way of 214
radicals. The name used for each radical has been taken from the Japanese
kanji dictionary edited by Kamata (1991).
Cell 16: Radical frequency calculated using the 1,945 basic kanji. The radical
frequency indicates how many of the 1,945 basic kanji share the same radical.
A large body of kanji (1,057 characters or 54.34% of the 1,945 basic kanji) is
constructed with only 24 radicals out of a possible 214.
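
Radical frequency as defined for Cell 16 is a count of kanji sharing a
radical. The Python sketch below shows one way to compute it. The six-kanji
sample and its romanized radical names are our own illustration and do not
reproduce the labels used in Cell 15. The last lines recompute the
percentage quoted above from the reported figures.

```python
from collections import Counter

# Hypothetical sample: each kanji mapped to the name of its radical.
KANJI_RADICAL = {
    "海": "sanzui", "池": "sanzui", "湖": "sanzui",
    "林": "kihen", "村": "kihen",
    "明": "hihen",
}

# How many kanji in the sample fall under each radical.
radical_counts = Counter(KANJI_RADICAL.values())

# Radical frequency per kanji: how many kanji in the set share its radical.
radical_freq = {k: radical_counts[r] for k, r in KANJI_RADICAL.items()}
print(radical_freq)

# Share of the 1,945 basic kanji covered by the 24 radicals mentioned in
# the text, using the figures reported there.
share = 1057 / 1945 * 100
print(f"{share:.2f}%")
```
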
Cell 17: Number of constituents. Single kanji are often composed of two or
more constituents: a radical and one or more secondary elements. We conducted a
survey of visually complex kanji to identify how native Japanese speakers
divide up a single kanji character. Japanese speakers were asked to divide up
kanji into smaller constituents by circling them. Our survey found that subjects
were likely to divide up kanji based on radical units and then other elements.
Our database followed this survey in defining the number of components in
each kanji.
Cell 18: Number of kanji homophones. A single kanji pronunciation is often
shared by multiple kanji. For example, the sound /yoku/ can be written with
five different characters of the 1,945 basic kanji. Each of these five kanji is
therefore indicated by the number 5. Both On- and Kun-readings were counted
when identifying kanji homophones. A majority of kanji homophones were found
in On-readings.
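
The homophone count in Cell 18 can be sketched in Python as follows. The
paper does not state how kanji with several readings are counted, so the
rule used here (the number of kanji, the kanji itself included, that share
at least one reading) is our assumption. The five kanji in the sample all
have the On-reading /yoku/, but whether they are the five counted in the
database is also an assumption.

```python
from collections import defaultdict

# Sample readings for illustration only.
READINGS = {
    "抑": ["yoku"], "浴": ["yoku"], "欲": ["yoku"],
    "翌": ["yoku"], "翼": ["yoku"],
    "育": ["iku"],
}


def homophone_counts(readings):
    """Map each kanji to the number of kanji (itself included) sharing a reading."""
    # Index the kanji by sound.
    by_sound = defaultdict(set)
    for kanji, sounds in readings.items():
        for sound in sounds:
            by_sound[sound].add(kanji)
    # For each kanji, collect every kanji reachable through any of its sounds.
    counts = {}
    for kanji, sounds in readings.items():
        sharing = set()
        for sound in sounds:
            sharing |= by_sound[sound]
        counts[kanji] = len(sharing)
    return counts


counts = homophone_counts(READINGS)
print(counts)
```
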
Cell 19: Number of On-readings. Japanese kanji often have multiple
On-readings. On-readings were adopted from the original Chinese
pronunciation during the Chinese dynastic periods when contact occurred
between Japan and China. The count for On-readings was taken from the
On-readings listed in the kanji dictionary edited by Kamata (1991). There were
779 kanji (40.05%) out of the 1,945 basic kanji which had only one type of
reading, either On or Kun.
Cell 20: Pronunciation of On-readings. The pronunciation of On-readings is
written in the Hepburn system of romanization provided by Nelson (1992).
However, when the pronunciation of an On-reading ends with a geminate sound,
the symbol /Q/ is used. This phonemic symbol is commonly used to represent
Japanese special sounds. In order to describe Japanese sounds precisely,
long vowels are transcribed by repeating the same vowel, as in /oo/
and /uu/. Some rare pronunciations of On-readings were excluded from the
database.
Cell 21: English translation of On-readings.
Cell 22: Number of Kun-readings. Kun-readings originated in Japan. The count
for the Kun-readings was taken from the Kun-readings listed in the kanji
dictionary edited by Kamata (1991). Rare Kun-readings were not included.
Cell 23: Pronunciation of Kun-readings. Kun-readings are written in the
Hepburn system of romanization provided by Nelson (1992). Again, long
vowels are transcribed by repeating the same vowel. No Kun-reading ends with
the geminate sound /Q/. Rare pronunciations of Kun-readings
were excluded from the database.
Cell 24: English translation of Kun-readings.
Cell 25: Accumulative frequency for kanji with On-readings. Using the kanji
frequency index of 1976 provided by the National Language Research Institute,
we calculated the frequency of occurrence for the On-reading(s) of each kanji.
The resulting figures actually give frequency of occurrence as it appeared in
the three newspapers published in 1966. For example, the kanji 育, which
has an On-reading of /iku/, appeared in 19 different words used 918 times in
the sample taken from the newspapers. Accumulative frequency is therefore
listed as 918. The names of persons and places were not included in this
frequency. When the overall accumulative kanji frequency was less than 9,
the index of 1976 did not provide detailed frequencies of either On- or
Kun-readings. Accumulative frequency for these kanji is indicated by a hyphen
'-'.
Cell 26: Accumulative frequency for combined On- and Kun-readings. Using
the kanji frequency index of 1976, the total accumulative frequency was
calculated for each kanji using both its On- and Kun-readings. In the process of
calculation, Cell 26 excluded kanji used for proper nouns. This exclusion
slightly alters the figures in this cell from the kanji frequency figures of Cell 6.
Cell 27: On-reading ratio. A database consisting of 996 kanji published by
Kaiho and Nomura (1983) provided On-reading ratios. Kaiho and Nomura
calculated this ratio by working with kanji which have both On- and
Kun-readings. For each of these kanji, they totaled the number of subjects
who had employed only the On-reading and then divided this figure
by the number of subjects who had applied both readings. The shortcoming
of this approach is that it disregards the question of whether all subjects
were familiar with both readings. For our database, we calculated the
On-reading ratio by dividing the accumulative On-reading frequency in Cell 25
by the accumulative frequency for the combined On- and Kun-readings in Cell
26. The On-reading ratios in the present database are objective figures based on
kanji frequency of occurrence. When a kanji has only one type of reading
(either On or Kun), the figure is indicated by a hyphen '-'.
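
The calculation for Cell 27 can be sketched in Python as follows. The
function name and its arguments are our own. The value 918 is the On-reading
frequency quoted in the example for Cell 25, while the combined frequency of
1020 is hypothetical and serves only to make the division concrete.

```python
def on_reading_ratio(on_freq, total_freq, has_both_readings=True):
    """Return Cell 25 divided by Cell 26, or '-' when no ratio is given.

    Either frequency may be the string '-' (no detailed frequency available).
    """
    # Kanji with only one type of reading are marked with a hyphen.
    if not has_both_readings:
        return "-"
    # Missing frequencies propagate as a hyphen.
    if on_freq == "-" or total_freq == "-":
        return "-"
    # Guard against division by zero.
    if total_freq == 0:
        return "-"
    return on_freq / total_freq


# 918 is quoted in the text; 1020 is a hypothetical combined frequency.
print(on_reading_ratio(918, 1020))
print(on_reading_ratio("-", 1020))
print(on_reading_ratio(918, 918, has_both_readings=False))
```
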

References

Kaiho, H., & Nomura, Y. (1983). Kanji joohoo shori no shinrigaku [Psychology
of kanji information processing]. Tokyo: Kyouiku Shuppan. (In Japanese)

Kamata, T. (1991). Kuwashii shoogakkoo kanji jiten [The detailed elementary
school kanji dictionary]. Tokyo: Buneido. (In Japanese)

Kawakami, M. (1997). JIS 1-shu kanji 2965 ji wo mochiite sakusei sareru kanji
niji jukugosuu hyoo [Tables of two-kanji compound words constructed with
2,965 JIS 1st kanji characters]. School of Education Bulletin, Nagoya
University, 44, 243-299. (In Japanese)

Kokuritsu Kokugo Kenkyuujo [National Language Research Institute]. (1973).
Shinbun no goi chosa (IV) [A study of Japanese word usage in
newspapers]. Tokyo: National Language Research Institute. (In Japanese)

Kokuritsu Kokugo Kenkyuujo [National Language Research Institute]. (1976).
Gendai Shinbun no Kanji [Japanese kanji characters in modern
newspapers]. Tokyo: National Language Research Institute. (In Japanese)

Ministry of Education, Science and Culture, Government of Japan. (1978).
Chuugakkoo shidoosho - Kokugohen [The Japanese language - the
course of study at junior high-school]. Tokyo: Tokyo Shoseki. (In
Japanese)

Ministry of Education, Science and Culture, Government of Japan. (1987).
Shoogakkoo shidoosho - Kokugohen [The Japanese language - the
course of study at elementary school]. Osaka: Osaka Shoseki. (In
Japanese)

Ministry of Education, Science and Culture, Government of Japan. (1998).
Monbushoo kokuji - Shoogakkoo gakushuu shidoo yooryoo [The
announcement of the elementary school course of study by the Ministry of
Education, Science and Culture, Government of Japan]. Tokyo: Gyoosei.
(In Japanese)

Nelson, A. N. (1992). The modern reader's Japanese-English character
dictionary (2nd revised edition, 35th printing). Tokyo: Charles E. Tuttle
Company.

Shirakawa, S. (1994). Jitoo [Kanji etymology]. Tokyo: Heibonsha. (In
Japanese)

Yokosawa, K., & Umeda, M. (1988). Processes in human Kanji-word
recognition. Proceedings of the 1988 IEEE international conference on
systems, man, and cybernetics (pp. 377-380). August 8-12, 1988, Beijing
and Shenyang, China.

Yokoyama, S., Sasahara, H., Nozaki, H., & Long, E. (1998). Shinbun denshi
media no kanji - Asahi shinbun CD-ROM niyoru kanji hindo hyoo
[Japanese kanji in the newspaper media - Kanji frequency index from the
Asahi newspaper on CD-ROM]. Tokyo: Sanseidoo. (In Japanese)