An online database of phonological representations for Mandarin Chinese
PING LI
Pennsylvania State University, University Park, Pennsylvania
A Web-based database has been developed to provide psycholinguists with a large-scale phonological representation system for all Mandarin Chinese monosyllables. The system is built on the slot-based phonological pattern generator (PatPho), with due consideration of the language-specific features of Chinese phonology. Users can retrieve the relevant phonological representations through an interactive query system on the Web. The query outcomes can be saved in a number of formats, such as Excel spreadsheets, for further analyses. This representation system can be used for a variety of purposes, in particular connectionist language modeling and, more generally, the study of Chinese phonology.
Researchers in connectionist modeling of language
have for some time been concerned with the issue of
phonological representations of the relevant linguistic input
to the model. How to faithfully represent the phonological
patterns of words and the differences between words in a
language has been discussed since the pioneering work of
Rumelhart and McClelland (1986) on the acquisition of
the English past tense. Recent work in this field
favors an approach in which a word's pronunciation is
coded in a slot-based representation, while taking into
consideration the articulatory features of phonemes in
the word (Joanisse & Seidenberg, 1999; MacWhinney &
Leinbach, 1991; Plunkett & Juola, 1999). In particular,
the phonology of a word is encoded in terms of a template
with a fixed set of slots; each phoneme of the word is
assigned to a different slot, depending on which syllable it
belongs to and at which position it appears in the syllable,
such as the onset, nucleus, or coda.
Most recently, on the basis of this idea of syllabic
templates, Li and MacWhinney (2002) introduced a
phonological pattern generator (PatPho) for
connectionist modeling. PatPho is able to represent English words
with variable length (up to three syllables) in a syllabic
template of CCCVVCCCVVCCCVVCCC, with Cs
representing consonants, Vs representing vowels, and
each CCCVVCCC representing one syllable. This
system accurately captures the phonological features of
English words and has been successfully applied in our
connectionist models of child language development (Li,
Farkas, & MacWhinney, 2004; Li, Zhao, & MacWhinney,
2007; Zhao & Li, in press).
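As a rough illustration of the slot idea, the following Python sketch fills a single CCCVVCCC syllable template with phoneme symbols. The function name `fill_syllable`, the `_` filler for unused slots, and the left-justified placement within each slot group are assumptions made for this illustration; they are not taken from PatPho itself, which codes phonemes by their articulatory features rather than by symbols.

```python
# Minimal sketch of filling one CCCVVCCC syllable template.
# Assumptions (not from PatPho): the "_" filler and left-justified placement.

SYLLABLE_TEMPLATE = "CCCVVCCC"  # one syllable, as described in the text
EMPTY = "_"                     # hypothetical marker for an unused slot


def fill_syllable(onset, nucleus, coda):
    """Place phoneme symbols into onset (CCC), nucleus (VV), and coda (CCC) slots."""
    groups = [("onset", onset, 3), ("nucleus", nucleus, 2), ("coda", coda, 3)]
    slots = []
    for name, phonemes, width in groups:
        if len(phonemes) > width:
            raise ValueError(f"{name} has {len(phonemes)} phonemes; only {width} slots")
        # Fill from the left and pad the remaining slots of this group.
        slots.extend(list(phonemes) + [EMPTY] * (width - len(phonemes)))
    return slots


if __name__ == "__main__":
    # An English-like syllable with a three-consonant onset and a one-consonant coda.
    print("".join(fill_syllable(["s", "t", "r"], ["i"], ["t"])))
```

Because every syllable occupies the same number of slots, phonemes in the same syllabic position line up across words, which is the property the template approach relies on.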
The phonological representation of words is also
an important issue in the connectionist study of other
languages. For example, Chinese has an ideographic writing
system, and it has always been a difficult problem for
connectionist models to correctly represent the phonology
of Chinese characters. To solve this problem, different
researchers have developed different representational
systems (e.g., Hsiao & Shillcock, 2004; Xing, Shu, & Li,
2004). Although these systems have greatly improved our
understanding of language acquisition and language
processing in Chinese, there are some problems with these
systems, notably in terms of their generalizability to
computational models other than their own.
In Hsiao and Shillcock's (2004) work, the pronunciation
of Chinese monosyllabic characters was represented by a
27-dimensional binary vector. In their coding, the first 14
dimensions of the vector represent the phonetic features
of an initial consonant, the next 8 dimensions represent
those of a nucleus vowel, 3 other dimensions represent the
final consonant, and the final 2 dimensions represent four
tones in Mandarin Chinese. A significant advantage of
their system is the parsimony of the binary codes (0 or 1),
which allows their computational model to be tractable.
The parsimony, however, introduces certain problems
that may limit the accuracy of their representations. For
example, only a single nucleus vowel can be represented
in their system, whereas Chinese phonology allows two or
even three vowels to be clustered together (i.e.,
diphthongs or triphthongs). Hsiao and Shillcock's
representations therefore cannot capture the vowel
structure in Chinese. Another problem is related
to the tones in Mandarin Chinese. Because there are five
tones (including a neutral tone) in Mandarin Chinese, and
two binary nodes can distinguish at most four patterns, the
two-node binary representations in Hsiao and Shillcock's
system are unable to represent all five tones.
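The arithmetic behind these two observations can be checked with a short Python sketch. The dimension counts come from the description above; the dictionary keys and the helper name `max_binary_patterns` are labels chosen for this illustration only.

```python
# Dimension counts as described in the text for Hsiao and Shillcock's (2004) coding.
# The key names are illustrative labels, not identifiers from their model.
LAYOUT = {
    "initial_consonant": 14,
    "nucleus_vowel": 8,
    "final_consonant": 3,
    "tone": 2,
}

MANDARIN_TONES = 5  # four lexical tones plus the neutral tone


def max_binary_patterns(n_nodes):
    """Number of distinct patterns that n binary (0/1) nodes can take."""
    return 2 ** n_nodes


total_dimensions = sum(LAYOUT.values())
tone_capacity = max_binary_patterns(LAYOUT["tone"])

print("total dimensions:", total_dimensions)
print("tone patterns available:", tone_capacity)
print("enough for all tones:", tone_capacity >= MANDARIN_TONES)
```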
Xing et al.'s (2004) phonological representation
of Chinese characters was based on PatPho. It splits
Chinese monosyllables into three parts (initial, final,
and tone) and uses six slots to represent the tone and
the phonemes that can occur in different positions of the
syllable. Each slot consists of five units, and each unit can
be assigned a real value between 0.0 and 1.0 to represent
a specific articulatory feature of the phoneme. In total, a
30-dimensional feature vector with real values can be used
to represent the pronunciation of a Chinese character.
Unlike Hsiao and Shillcock's (2004) system, this system
can code diphthongs and triphthongs in its representation
and is able to capture the phonetic features of Chinese
syllables.
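The layout just described (six slots of five real-valued units each) can be sketched in Python as follows. This is only an illustration of the vector shape: the helper names `empty_representation` and `set_slot` are hypothetical, and the feature values in the usage example are placeholders, not values from Xing et al. (2004).

```python
# Sketch of a 6-slot x 5-unit real-valued vector, as described in the text.
# Helper names and example values are hypothetical.

N_SLOTS = 6
UNITS_PER_SLOT = 5


def empty_representation():
    """Return a flat vector with every unit set to 0.0."""
    return [0.0] * (N_SLOTS * UNITS_PER_SLOT)


def set_slot(vector, slot_index, values):
    """Write five feature values (each between 0.0 and 1.0) into one slot."""
    if not 0 <= slot_index < N_SLOTS:
        raise IndexError("slot_index out of range")
    if len(values) != UNITS_PER_SLOT:
        raise ValueError("each slot holds exactly five units")
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("unit values must lie between 0.0 and 1.0")
    start = slot_index * UNITS_PER_SLOT
    vector[start:start + UNITS_PER_SLOT] = list(values)
    return vector


if __name__ == "__main__":
    vec = empty_representation()
    set_slot(vec, 1, [0.1, 0.2, 0.3, 0.4, 0.5])  # placeholder feature values
    print(len(vec))
```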
One minor problem with Xing et al.'s (2004) system is
that five units are used to represent a phoneme or a tone.
However, as we will discuss below, three nodes are sufficient
to represent the features of a phoneme, and a single unit
with varying real numbers is able to represent all five
tones. As such, Xing et al.'s system has some redundancy,
and there is room to reduce its computational complexity.
This representation also heavily relies on the Pinyin system
(the standard romanization system for Mandarin Chinese;
Institute of Linguistics of the Chinese Academy of Social
Sciences, 2002). The Pinyin system is simple and easy to
learn, but its simplicity also causes the problem that many
different phonemes have to be represented by the same
letter. For example, the Pinyin letter "i" could represent
three phonemes that are similar but distinct, according to
its varying positions in a syllable. A similar situation holds
for "a," "o," "e," and so on, since phonemic differences are
not clearly represented in the system.
Although connectionist modeling of Chinese has
become an increasingly important topic in psycholinguistic
research, there has not yet been a convenient tool with
which investigators can accurately generate large-scale
phonological representations of Chinese characters. The
issue is even more serious for researchers who are not
familiar with the Chinese language but, nevertheless, want
to do comparative studies, as well as for investigators
whose native language is Chinese but who are not trained
in the Pinyin system. It would be convenient for these
investigators to obtain simple, easily accessible, and
vector-based representations of Chinese pronunciations.
Our online phonological database of Chinese characters
is designed to help researchers to do just that.
Here, we introduce a phonological re (...truncated)