NIPS'97 Workshop

Neural Models of Concept Learning

Saturday, Dec 6, 1997; Breckenridge, CO


Abstracts

[Browne] [Clouse] [Gasser, Colunga] [Harris] [de Heaulme] [Honkela] [Li] [Lund, Burgess] [Raijmakers] [Regier] [Scheler] [Schyns] [Tenenbaum] [Thornton]

The following is a list of the abstracts we have received.

A neural state machine approach to concept and language learning

C. Browne
Neural Systems Group
Imperial College, London

Previous work [4][3] has presented a schematic single neural architecture capable of representational redescription. This paper reviews the issues, outlined below, surrounding psychological theories of concept learning and associated connectionist models. It then presents a state space approach to the developmental, cognitive theory of representational redescription. A schematic neural architecture is described and expanded into a definite, quantitative proposal for a single neural architecture capable of learning concepts by the autonomous redescription of its own internal states. The proposed system is based on a synthesis of proven neural techniques. The implications of the system for the emergence of language are considered.

About a decade ago Fodor and Pylyshyn presented a useful challenge to the neural information processing community [6]. The challenge was to demonstrate that connectionist networks were capable of representing compositional, systematic information hierarchies. Such structures are known in cognitive psychology as concepts. Fodor and Pylyshyn's gauntlet is currently widely disregarded, but it is important for the numerous refutations [13][12][15] which it has elicited from the connectionist community.

Viewed from a slightly different perspective, Fodor and Pylyshyn's publication raises an important, as yet unanswered, issue for all cognitive models. The issue at stake is that of concept learning, as described by cognitive psychology. Two salient theories of concept learning exist, one within cognitive psychology proper, the other within developmental, cognitive psychology. The former is Harnad's theory of Categorical Perception (CP) [7]; the latter is Karmiloff-Smith's proposed process of representational redescription (RR) [11].

Clark and Karmiloff-Smith [5] even go so far as to posit that the capacity for RR defines the dividing line between cognizers and noncognizers. Their analyses of the capacity of various artificial neural networks to perform RR suggest that no architecture to date, including those cited in refutations of Fodor and Pylyshyn's challenge, has achieved this goal [11][5]. Computational models are disregarded for two related reasons. First, symbol-processing systems are arguably subject to the symbol grounding problem [8]. Secondly, both RR and CP describe bottom-up systems of concept learning. Connectionist models are particularly adept at describing bottom-up processes, in contrast to the top-down constraints of Good Old-Fashioned Artificial Intelligence.

To date, with notable exceptions [14][2], little work exists on constructing a neural network model of the RR process. Aleksander has described an interesting approach based on a state space theory of cognition [1], which has recently been extended and contrasted with CP [3]. In brief, Aleksander suggests that representational redescription occurs through the amalgamation of simple states at one cognitive level into larger, more complex states at a higher cognitive level. For example, all the sensory states representing individual experiences of particular glasses can be amalgamated to form a larger, complex state which represents the concept ``glass''. Both Harnad [9] and Aleksander [1] suggest that language emerges from learned linguistic tags for category representations.

The simple states can be grouped into the higher-level complex states in two ways. The original suggestion [1] was that complex states were formed by associating states with a common linguistic tag. In the case of the example, this would mean that all the glasses had been attributed the name tag ``glass'' through the intervention of an instructor. The construction of the complex state representing the concept ``glass'' is then easily envisaged. The more recent work [3] has identified some limitations in the linguistic naming scheme and argued that it is not true representational redescription, in Karmiloff-Smith's terms [11]. A scheme of representation similar to that of Harnad [7] is proposed instead, by which category representations are constructed from the unique features which differentiate category members from nonmembers in the context of all the previously encountered confusable alternatives. However, as Harnad later describes, word origin can be attributed to the labelling of existing categories [9], and such an account is consistent with evolutionary theory [10].

References

1 I. Aleksander. Impossible Minds: My Neurons, My Consciousness. IC Press, 1996.

2 G. Bartfai. Personal communication, 1997.

3 C. J. Browne, R. Evans, N. Sales, and I. Aleksander. Consciousness and neural cognizers. Neural Networks, In press. Special Issue on Consciousness.

4 C. J. Browne and S. Parfitt. Iconic learning and epistemology. In Proceedings of the International Conference New Trends in Cognitive Science, Vienna, Austria, 1997. ASoCS Technical Report 97-01.

5 A. Clark and A. Karmiloff-Smith. The cognizer's innards. Mind and Language, 8(4):487-519, 1993.

6 J. A. Fodor and Z. W. Pylyshyn. Connectionism and cognitive architecture: a critical analysis. Cognition, 28:3-71, 1988.

7 S. Harnad. Category induction and representation. Cambridge University Press, New York, 1987.

8 S. Harnad. The symbol grounding problem. Physica D, 42:335-346, 1990.

9 S. Harnad. The origins of words: A psychological hypothesis. Nodus Publishers, Muenster, 1995.

10 S. Harnad. On the virtues of theft over honest toil: Grounding language and thought in sensorimotor categories. In Proceedings of the Hang Seng Centre Conference on Language and Thought, 1996.

11 A. Karmiloff-Smith. Beyond Modularity. MIT Press, 1995.

12 J. Pollack. Recursive distributed representations. Artificial Intelligence, 46(1-2):77-105, 1990.

13 P. Smolensky. The constituent structure of connectionist mental states: a reply to Fodor and Pylyshyn. The Southern Journal of Philosophy, XXVI, Supplement:137-161, 1987.

14 C. Thornton. A general model of implicit/explicit transition in representational redescription. In Proceedings of the First Australian Conference on Cognitive Science, 1995.

15 T. Van Gelder. Compositionality: a connectionist variation on a classical theme. Cognitive Science, 14:355-384, 1990.

Constraints on Representation of Lexical Semantics in Attractor Network Models

Dan Clouse
University of California, San Diego

Attractor network models of the lexicon are fairly popular these days, claiming to account for numerous effects in reaction time studies with human subjects -- frequency effects, priming effects, orthographic and phonological neighborhood effects. These models have also been used to account for deficits seen in deep and surface dyslexia. We are interested in extending these models to account for the effect of word concreteness on reaction time in lexical decision. Carlton James (1975) found that reaction times for concrete nouns are faster than for abstract nouns for low frequency words. We are also interested in the effect of semantic neighborhood density (i.e. the number of words which have meanings similar to a word of interest) on reaction time.

In this talk, we present the results of simulating a reaction time study using an attractor network model. In this study, we trained an attractor network to map from orthography to semantics, where both orthography and semantics are represented using random bit patterns. The representation of semantics is manipulated systematically to simulate concrete and abstract words, and several levels of semantic neighborhood density. One might expect that, in general, the semantic representation of a concrete word would contain more features than the representation of an abstract word. Therefore, in our representation, abstract words have fewer bits active than do concrete words. Two parameters control semantic neighborhood density. One parameter controls the radius of a neighborhood, while the other determines the number of neighbors within that radius.

The results of the simulation show an interference effect of neighborhood density. The network takes longer to settle for words which have many close neighbors suggesting that semantic neighborhood effects may be interesting to look for in a study of human subjects. However, there is no reliable effect of concreteness in the simulation results. We propose two alternative explanations for this missing concreteness effect. Either attractor networks are a bad processing model of the lexicon, or sparse patterns are a bad representation of abstract words in an attractor network model.
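As an illustration of the kind of measurement described above, here is a minimal Python sketch of settling time in a Hopfield-style attractor network. The pattern sizes, noise levels, and Hebbian storage rule are our assumptions for illustration, not details of the actual model, and whether the neighborhood interference effect appears will depend on such parameter choices.

  import numpy as np

  rng = np.random.default_rng(0)
  N = 200          # units
  # "Neighbors" are noisy copies of a base pattern; isolated words are random.
  def make_neighborhood(base, n_neighbors, flip=0.1):
      out = []
      for _ in range(n_neighbors):
          p = base.copy()
          p[rng.random(N) < flip] *= -1
          out.append(p)
      return out

  base = rng.choice([-1, 1], size=N)
  dense_nbhd = [base] + make_neighborhood(base, 6)                 # close neighbors
  sparse_nbhd = [rng.choice([-1, 1], size=N) for _ in range(7)]    # isolated words

  def settle_time(patterns, cue_noise=0.2, max_sweeps=100):
      """Sweeps of asynchronous updates until the network stops changing."""
      W = sum(np.outer(p, p) for p in patterns) / len(patterns)    # Hebbian storage
      np.fill_diagonal(W, 0)
      x = patterns[0].astype(float)
      x[rng.random(N) < cue_noise] *= -1                           # noisy cue
      for sweep in range(max_sweeps):
          changed = False
          for i in rng.permutation(N):                             # async updates
              s = 1.0 if W[i] @ x >= 0 else -1.0
              if s != x[i]:
                  x[i] = s
                  changed = True
          if not changed:
              return sweep + 1
      return max_sweeps

  print("settling, dense neighborhood :", settle_time(dense_nbhd))
  print("settling, isolated patterns  :", settle_time(sparse_nbhd))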

Linguistic Relativism and the Acquisition of Spatial Relations

Michael Gasser
Eliana Colunga
Indiana University

Recently there has been renewed interest in linguistic relativism, the idea that language influences thought (e.g., Gumperz & Levinson, 1996). One reason for this interest is the replacement of vague early speculation with more concrete proposals concerning specific effects. In our view, the issues would be clarified further if specific learning mechanisms were factored into the picture. Any complete account of the mutual effects of language and thought should accommodate the development of pre-linguistic cognition and show how pre-linguistic cognition supports the learning of language, as well as how the learning of language affects cognition. We consider the implications of a particular neural network model of the development of spatial cognition for the relativism debate.

Specifically, we focus on one area for which there is recent relevant evidence, the effect of semantic differences between languages on the course of semantic development. If the concepts that are the building blocks of thought are already in place when the semantics of natural language is acquired (either because they are innate or because they have already been learned on the basis of non-linguistic experience), then the particular way in which the target language slices up conceptual space should not have a significant effect on the order in which words and structures are learned. If, on the other hand, language-specific semantics has a lot to do with the emergence of concepts, the kinds of generalizations which learners make should depend strongly on the nature of the target language itself. Recent work by Bowerman and colleagues (Bowerman, 1996) suggests that this is the case, at least in the domain of space. In particular, children learning different languages over-generalize spatial terms in markedly different ways. Their linguistic behavior seems to be based on the semantics of the target language rather than on any pre-existing spatial concepts. The implication is that whatever concepts are in place at the beginning of language acquisition are modified further as language is acquired.

We consider these findings in the context of Playpen, an evolving neural network model of the development of spatial concepts and spatial language. Playpen is a generalized Hopfield network with separate WHAT and WHERE modules to handle objects and nouns on the one hand and relations and spatial terms on the other. As a solution to the binding problem, units have relative phase angles in addition to activation; units which are in phase represent features of the same object. Our current focus is on the representation of spatial relations in high-level vision and language. Three features of the model which distinguish it from other "grounded" models of language acquisition (e.g., Dorffner, forthcoming; Regier, 1996) are relevant for the issues considered here; each is motivated on grounds independent of Bowerman's results.

  1. Linguistic meaning and non-linguistic concepts are not rigidly distinguished.
  2. Spatial relations take the form of patterns of activation across a layer of relation units, which explicitly represent micro-relations.
  3. Spatial relations are learned as the system discovers correlations of micro-relations across different domains relevant to spatial cognition (vision, proprioception, touch, language, etc.).
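The phase-binding idea described above can be illustrated with a toy Python sketch; the feature names, phase values, and grouping tolerance below are invented for illustration and are not part of the Playpen model itself.

  import numpy as np

  # Each "unit" carries an activation and a relative phase angle; units whose
  # phases (nearly) coincide are bound, i.e. they describe the same object.
  units = {
      "round":  (0.9, 0.00),   # (activation, phase in radians)
      "red":    (0.8, 0.05),
      "ABOVE":  (0.7, 0.03),   # relation unit bound to the first object
      "square": (0.9, 3.10),
      "blue":   (0.8, 3.14),
  }

  def bound_groups(units, tol=0.5):
      """Group unit names whose phase angles lie within tol of each other."""
      groups = []
      for name, (act, phase) in units.items():
          for g in groups:
              ref = units[g[0]][1]
              # wrap-around-safe phase difference
              if abs(np.angle(np.exp(1j * (phase - ref)))) < tol:
                  g.append(name)
                  break
          else:
              groups.append([name])
      return groups

  print(bound_groups(units))
  # [['round', 'red', 'ABOVE'], ['square', 'blue']]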

We propose the following account of the acquisition of spatial relations. During an initial, non-linguistic phase, the child is exposed to particular spatial configurations (e.g., a cup on a table) visually, as well as through other sensory modalities. Connections into and within the layer of spatial relation units are strengthened or weakened in response to correlations among input features. Early spatial relations are highly context-specific; they apply to particular objects or object categories occurring regularly in the same relationship. Spatial relations become more general and abstract as the child recognizes that certain features fail to correlate. For example, the shape of the supported object in a SUPPORT relationship is more or less irrelevant, and the child comes closer to an understanding of abstract SUPPORT as she factors this out of the relation. During the linguistic phase of learning, the child is exposed to language (e.g., the cup is on the table) together with non-linguistic perceptual input. This linguistic input represents another feature which correlates in particular ways with the non-linguistic features. The effect is to highlight certain features of pre-linguistic relations and to downplay others as the weights within clusters of spatial relation units are modified. Different languages will highlight or downplay different sorts of features, resulting in differing spatial relations. This in turn leads to different sorts of generalizations. When novel non-linguistic perceptual inputs are presented to the system, they may activate spatial relation units which overlap with those which have become associated with words. This results in the activation of inappropriate relation words. Thus the course of word learning depends both on pre-linguistic learning and on the semantic structure of the target language, as Bowerman argues it does.

In addition to language production, we consider the implications of the model for comprehension and for non-linguistic spatial tasks.

References

Bowerman, M. (1996). Learning how to structure space for language: a crosslinguistic perspective. In P. Bloom, M. A. Peterson, L. Nadel, & M. F. Garrett (Eds.), Language and space. Cambridge, MA: MIT Press.

Dorffner, G. (forthcoming). Categorization in early language acquisition -- accounts from a connectionist model. Language and Cognitive Processes.

Gumperz, J. J. & Levinson, S. C. (Eds.) (1996). Rethinking linguistic relativity. Cambridge: Cambridge University Press.

Regier, T. (1996). The human semantic potential: Spatial language and constrained connectionism. Cambridge, MA: MIT Press.

Distributed Representations and Mixed Schemas

Catherine L. Harris
Boston University

charris@bu.edu

Overview

In this talk, I introduce the notion of a mixed schema, which is a representation containing both a linguistic category (like noun) and an overtly occurring expression (a word, such as first). An example of a mixed schema is first+noun. Patterns which instantiate this schema include first time, first name, first place, first lady, first man to walk on the moon. I discuss how mixed schemas facilitate word and concept learning and may also underlie creative uses of language. Applied to the domain of familiar two-word combinations, a connectionist model shows how mixed schemas emerge when a system self-organizes in the course of extracting regularities in a corpus of utterances. Measurements of the representational strength of two-word patterns in the network are compared to recognition data from human subjects.

I focus on familiar two-word combinations because pattern strength can be quantified via text counts, and language users' reactions to them can be measured via recognition experiments. However, schemas of intermediate abstraction are to be expected in virtually all areas where humans acquire knowledge via regularity extraction.

Mixed Schemas, the Schematicity Continuum, and the Rule-List Fallacy

The Rule-List Fallacy (so named by Langacker, 1987) is the assumption that there are two types of mental representations: abstract descriptions of patterns (such as the descriptions of rules in a grammar) and lists of exceptions. A phrase or sentence which fits a rule would not also be separately memorized. The division into rules and lists of exceptions was originally designed to prevent redundant encoding and achieve economy of description (Chomsky, 1965). It makes less sense today, given the modern recognition that the brain has vast storage resources and may use massively redundant encodings.

Evidence against the division into rules and lists includes examples where the meaning of a sentence has probably been stored as a memorized entity, yet is not an exception to rules of compositional semantics, as in She felt the baby kick.

The alternative to dividing linguistic units into generalizations over phrases and listings of phrases is the schematicity continuum (see also the proponents of construction grammar: Fillmore, 1988; Goldberg, 1992; Harris, 1994). This is the proposal that generalizations over linguistic patterns are not limited to the maximally general level of linguistic categories, but occur at a variety of levels of abstraction and with varying degrees of productivity.

The schematicity continuum only makes sense in a system with dynamic data structures. Data structures are dynamic when they are not a fixed part of the overall system, but are emergent, or implicit in the working of the system. For example, consider the lose+noun pattern, instantiated by lose track, lose sight, lose touch. We can identify the lose+noun generalization using statistical analysis tools on the hidden units of a simple recurrent network, but the generalization does not exist independently of the representations of its members.

The foregoing reminds us that there is nothing special or marked about mixed schemas within the schematicity continuum. The reason to focus on them is that they provide a concrete case where rival theories make different predictions. For convenience I will refer to the two rival theories as the "multi-schemas" view and the "rule+list" view.

Linguistic Questions

Important questions in linguistics and language learning are what principles govern the ease of learning a generalization R and how easily R can be creatively extended. The connectionist framework used here suggests several such factors.

Experiment Predictions

Experimental participants recognize familiar word combinations (plaid skirt) more easily than merely legal combinations (green skirt), while legal combinations are recognized more easily than anomalous combinations (Harris, 1997). In work reported in this paper, these same techniques have been applied to phrases which can be assimilated to both an adjective+noun schema as well as a mixed schema, such as first+noun.

According to the multi-schemas view, speed of recognition of a phrase is influenced both by the phrase's frequency and by its schema's frequency, where the schema can be any pattern less specific than the phrase. Consider the phrase high rule. Assuming that this phrase is unfamiliar, its recognition will be facilitated by the strength of mixed schemas such as high+noun as well as by the general adjective+noun schema. According to the rule+list view, only the adjective+noun schema will be relevant.

These predictions can be tested using unfamiliar phrases which can be assimilated to either a low or a high frequency schema. For example, text counts reveal fewer instances of low+noun phrases than of high+noun phrases. The multi-schemas view predicts better recognition of phrases such as high bicycle compared to low bicycle, while the rule+list view predicts no difference.

Advocates of the rule+list view could claim that high is a better adjective than low; that a feature of words is their adjectival goodness, and that adjectival goodness speeds assimilation to the adjective+noun schema. To refute these objections, phrases were collected in which the initial word could be used as either a noun or an adjective (fire town vs fire around).

Design of Simulations

Familiar word combinations were selected from Lund & Burgess (1996) and embedded in sentences generated by a phrase structure grammar, thus creating a corpus of language-like sequences. A simple recurrent network was used to predict the next word in the sentence, following Elman (1993).

To investigate learning and generalization, different training corpora were constructed in which the frequency and diversity of word phrase patterns were systematically controlled. (In the initial simulations, diversity was investigated by varying type-token ratios; in later planned simulations, semantics will be used so that diversity will refer to conceptual domains.) Ease of learning was measured as the number of training cycles until the identity and/or grammatical category of the second word of a phrase could be predicted to low error. The networks' reaction to novel phrases such as fire town and fire around was measured by the amount of error generated at the second word.
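The following is a minimal Python sketch of an Elman-style simple recurrent network trained to predict the next word, in the spirit of the simulations described; the toy corpus, layer sizes, and learning rate are placeholder assumptions.

  import numpy as np

  rng = np.random.default_rng(1)
  corpus = "the first time the first name the first place a green skirt".split()
  vocab = sorted(set(corpus))
  V, H = len(vocab), 16
  w2i = {w: i for i, w in enumerate(vocab)}

  Wxh = rng.normal(0, 0.1, (H, V))   # input -> hidden
  Whh = rng.normal(0, 0.1, (H, H))   # context (copied hidden) -> hidden
  Why = rng.normal(0, 0.1, (V, H))   # hidden -> output
  lr = 0.1

  def one_hot(i):
      v = np.zeros(V); v[i] = 1.0; return v

  for epoch in range(300):
      h = np.zeros(H)                                # context starts empty
      for cur, nxt in zip(corpus, corpus[1:]):
          x, t = one_hot(w2i[cur]), one_hot(w2i[nxt])
          h_new = np.tanh(Wxh @ x + Whh @ h)         # hidden = f(input + context)
          z = Why @ h_new
          y = np.exp(z - z.max()); y /= y.sum()      # softmax next-word prediction
          dy = y - t                                 # cross-entropy gradient
          dh = (Why.T @ dy) * (1 - h_new ** 2)
          Why -= lr * np.outer(dy, h_new)
          Wxh -= lr * np.outer(dh, x)
          Whh -= lr * np.outer(dh, h)
          h = h_new                                  # copy-back, as in Elman (1993)

  # Prediction error at the second word of a test phrase indexes how well the
  # phrase fits the schemas the network has extracted.
  h = np.tanh(Wxh @ one_hot(w2i["first"]) + Whh @ np.zeros(H))
  z = Why @ h
  y = np.exp(z - z.max()); y /= y.sum()
  print({w: round(float(y[w2i[w]]), 2) for w in ("time", "name", "place")})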

Differential versus referential definition of concepts: the notion of "conceptual reservoir"

Michel de Heaulme, MD, PhD
Medical Informatics
CHU Pitié-Salpêtrière
Paris, France
E-mail: mdeh@pratique.fr

Around 1985 we had to face the problem of generalizing medical information systems, though the medical field and the documentary purpose are of no importance here: we came to conclude that generalizing information processing by itself raises a fundamental obstacle, one rooted in what is called a "concept" in computer processing. In fact all existing solutions, including the connectionist ones, work only through an ad hoc usage of concepts, and thus fail when this usage shifts. The fact is that usage shifts continually in natural expression.

On the other hand, attractors, or any patterns among terms, can be detected, but no formal technique can give them any relevant meaning except in certain predetermined situations. Worse, the reality of an attractor rests on predetermined meanings fixed by some formal semantics, which condemns a system to detect only what it already knows.

In fact the usage of "learning" is thoroughly ambiguous, for a system only "learns" what it has been directed to detect. After isolating this difficulty from those properly belonging to intentionality, linguistics, or pragmatics, we finally concluded that the definition of formal systems itself is implicated in the question of conceptual learning when the field taken into account grows, i.e. in the situation of generalization, when formal semantics is no longer sufficient to describe a given field tightly.

This is related to the fact that concepts are logically defined as something which must terminate in a reference (given, or constructed from given atoms), as stated by G. Frege in defining formal logic. But precisely, a new concept can always still appear, and if it has to be referred to what is already known, it cannot be new. We propose to identify the referential forms of concepts necessary for information processing from a differential reaction of words to each other, according to the general view of F. de Saussure. Obviously this will not identify a completely new intentionality, but it can considerably improve the flexibility in detecting possible meanings attached to a set of terms detected by one of the technologies used for conceptual extraction. Thus our idea is not to replace the present technologies by some "hyper" alternative, but only to complement them with an agent able to specify what the meanings of the concepts in question could be. Formal semantics does the same, but only within an ad hoc usage. Differential here means that basic concepts have no definition by themselves, contrary to the referential approach.

Let us take nomenclatures as a clear example: our approach would process not the code of an item, as computers do, but its textual definition, as humans do. The differential aspect starts with the fact that natural words are themselves character strings. But if we continue to consider these words as previously referenced, as formal semantics does, we do not leave the formal definition of words, and we are then certain to meet the generalization problem. The surprising conclusion is that a "formal word" does not have the power of a "natural word", contrary to what is usually stated. Taking a natural word as a formal word is a restriction necessary only for stating referential systems.

Now what is the "natural" property to pick up? To introduce a differential behavior of concepts, we have to relativize the absolute part of the definitions of formal words. For that we can use the definition of F. de Saussure: a word has a value only in the presence of all the other words of a given language, not by itself, as referenced formal words do. The key of a differential system is to organize this "presence of all the other words": natural words must not be represented by formal words with a referential definition, as is done in any formal (referential) semantics, but must be organized only among themselves. In other terms, they are their own means of representation. Our solution is to build what we call a "conceptual reservoir" using natural words exclusively. The entries are all the words of a language, and the conceptual definition of each entry uses only these words, no special meta-words. It is this kind of re-entrance which realizes the differential aspect: a word never has a signification by itself, only in relation to (all) the other words present in its conceptual definition, which are themselves entries of the system. This is obviously also a kind of connectionist property. One can show that this structure cannot be reduced to a semantic network: it is a flexible combination of semantic networks, each of which turns out to be only one fixed solution among all those that a conceptual reservoir provides.
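A toy Python sketch may make the re-entrant structure concrete: every entry below is defined only in terms of other entries, and interpretation is a differential matching of data words against definitions rather than a lookup of fixed references. The entries and definitions are invented for illustration.

  # Toy "conceptual reservoir": every entry is defined only by other entries
  # (no meta-words), so definitions re-enter the lexicon -- the differential
  # property described above. Entries and definitions are invented examples.
  reservoir = {
      "lesion": {"tissue", "damage"},
      "damage": {"change", "tissue"},
      "tissue": {"body", "matter"},
      "tumor":  {"tissue", "growth"},
      "growth": {"change", "matter"},
      "change": {"matter"},
      "body":   {"matter"},
      "matter": {"body"},
  }

  def interpret(data_words, reservoir):
      """Rank entries by overlap between the data and each entry's definition,
      i.e. a differential matching rather than a lookup of fixed references."""
      scores = {}
      for entry, definition in reservoir.items():
          overlap = definition & data_words
          if overlap:
              scores[entry] = len(overlap) / len(definition)
      return sorted(scores.items(), key=lambda kv: -kv[1])

  report = {"tissue", "growth"}        # words detected in an incoming report
  print(interpret(report, reservoir))  # 'tumor' ranks highest for these data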

In brief, concepts are no longer predetermined, but are determined only dynamically: any concept is then the result of an interpretation, not the detection of a predetermined reference. Formally, this kind of system is incomplete: it is completed only in the presence of the words provided by the data. The conceptual reservoir then reacts to these words and provides the appropriate references according to the usage of all the other words encountered in the data and in the conceptual reservoir. Words of a natural language carry an intrinsic information which formal words lack: a natural language appears as a system of self-consistent norms ruling the usage of words. It is this kind of self-consistency that we try to capture through the conceptual definitions attached to each entry, and it cannot be predetermined because it results from a differential matching of all the words used to express it.

On this basis we are building a conceptual reservoir for neuroradiology, using both the textbooks and the reports of the field (about 4000 entries). Of course neuroradiology is not by itself relevant to the problem of generalization; in fact it rather adds specific problems.

We propose to link such a conceptual reservoir, which humans alone can construct from their own language, to any pattern of information provided by the automatic technologies used for conceptual learning. If a data pattern matches one of the possible consistencies belonging to the reservoir, the resulting concept will be taken as a reference and processing continues as usual. We are interested in discussing the theory of differential systems, especially its connectionist aspects, up to the kind of complexity it introduces.

Self-Organizing Maps for Concept Learning

Timo Honkela
Helsinki University of Technology
Neural Networks Research Center
P.O.Box 2200
FIN-02015 HUT, Finland
e-mail: Timo.Honkela@hut.fi

Kohonen's Self-Organizing Map (SOM) can be used to analyze the use of words with their contexts in natural language texts. In the resulting map, called a word category map, conceptually interrelated words tend to fall into the same or neighboring nodes. The map also reflects the hierarchical relations between conceptually organized areas. The area of nouns may, for instance, be automatically divided into the areas of animate and inanimate objects.

The overall organization of a word category map reflects the syntactic categorization of the words. In a typical map of the most common words, the verbs and the nouns form areas of their own on opposite sides of the map. That seems to be in line with the results of neurophysiological studies. The text input as such, with the statistical properties of its contextual relations, is already sufficient for the automatic creation, i.e. emergence, of meaningful implicit categories when the unsupervised learning paradigm is adopted.
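A minimal Python sketch of such a word category map, assuming a toy corpus, simple left/right context counts, and a small one-dimensional SOM (the real experiments use far larger corpora and two-dimensional maps):

  import numpy as np

  rng = np.random.default_rng(2)
  corpus = ("the dog runs the cat sleeps the dog sleeps the cat runs "
            "a dog eats a cat eats").split()
  vocab = sorted(set(corpus))
  w2i = {w: i for i, w in enumerate(vocab)}
  V = len(vocab)

  # Context vector for each word: counts of immediate left and right neighbors.
  ctx = np.zeros((V, 2 * V))
  for i, w in enumerate(corpus):
      if i > 0:
          ctx[w2i[w], w2i[corpus[i - 1]]] += 1
      if i < len(corpus) - 1:
          ctx[w2i[w], V + w2i[corpus[i + 1]]] += 1
  ctx /= np.linalg.norm(ctx, axis=1, keepdims=True)

  # Train a small 1-D SOM; nearby nodes end up with similar word contexts.
  nodes = rng.normal(0, 0.1, (6, 2 * V))
  for t in range(2000):
      x = ctx[rng.integers(V)]
      win = np.argmin(((nodes - x) ** 2).sum(axis=1))       # best-matching node
      for j in range(len(nodes)):
          h = np.exp(-((j - win) ** 2) / 2.0)               # neighborhood kernel
          nodes[j] += 0.1 * h * (x - nodes[j])              # move toward input

  for w in vocab:
      print(w, "-> node", np.argmin(((nodes - ctx[w2i[w]]) ** 2).sum(axis=1)))

Words with similar contexts (the nouns dog and cat; the verbs runs, sleeps, eats) should land on the same or neighboring nodes.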

The presentation will discuss how the use of the SOM for concept learning relates to systems theory and statistical induction. In addition, unsupervised learning can be considered as a model of individuality or subjectivity in creating a conceptual model. This model will be contrasted with (a) symbolic models of knowledge representation such as semantic networks and (b) multilayered neural network models that use predetermined categories and are intended to learn to classify in a supervised manner.

The map nodes in a word category map can be considered as adaptive prototypes of corresponding classes. The scheme of combining symbol and context information can be used in creating a model of how predetermined classifications may influence the organization of the conceptual structure. The possibility of using various types of contexts for words will be considered with references to studies in which multimodal information is used.

An Emergentist Approach to Lexical Semantic Acquisition

Ping Li
University of Richmond
Richmond, Virginia 23173

Human behavior is often guided by intuitions. Sometimes we can use explicit rules to guide our decisions and categorizations, but often we confront complex patterns of data that cannot be summarized by a single, explicit rule. In language processing, we often have intuitions about why a linguistic marker or construction is or is not appropriate without being able to convert these intuitions into precisely formulated rules. Because these intuitions are so hard to formalize, language researchers tend to ignore those aspects of language that are guided by intuition, choosing instead to study patterns and rules that are most accessible to conscious reflection. To illustrate how linguistic behavior is guided by intuitions, Whorf (1956) drew our attention to the use of the English reversive prefix "un-". There is no single label or rule that tells us when we can use "un-" and when we cannot. If anything, such a rule is hidden, or "intangible" in Whorf's terms. Whorf proposed that there is a "cryptotype", a hidden or covert semantic category that governs the proper use of the "un-" prefix. He further reminded us that, despite the difficulty linguists have in characterizing this cryptotype, native speakers of English do have an intuitive feel for which verbs can be prefixed with "un-" and which cannot.

In this project, we propose that linguistic intuitions about the meaning of words or cryptotypic semantic representations emerge out of the learning of the relationships among a network of weighted semantic features. Whorf's cryptotypes, and semantic categories in general, can be precisely described with mechanisms of this emergentist approach that is embodied largely in current connectionist principles. Traditional methods that rely on rule systems or categorical structures are less effective in analyzing these problems and in describing lexical semantic acquisition in general, since the semantic properties that unite different members of a category constitute an intricate network, varying in how many features are relevant to the category members, how strongly each feature is activated in the representation of the category, and how features overlap with each other across category members. Our simulated model (Li, 1993; Li & MacWhinney, 1996, in progress) provides an explanation for some semantic problems that have been traditionally considered subtle, elusive, or even "intangible". These include: (a) Whorf's cryptotype that governs the use and learning of the English reversive prefix "un-"; (b) lexical aspect marking for the resultative, telic, and change of state verbs in crosslinguistic studies of the acquisition of Chinese, English, French, Italian, and Japanese; and (c) the use and acquisition of nominal classifiers in Mandarin Chinese. In sum, our model provides a "tangible" understanding of the "intangible" semantic structures of the sort discussed by Whorf. It also provides a formal account of the processes underlying the learning of semantic intuitions in relation to morphological productivity. In the problems that we studied, semantic features interact collaboratively in terms of summed activation to support a category that licenses the use of a particular morphological device, such as reversive prefixes, aspect markers, or classifiers.
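A schematic Python sketch of the summed-activation idea: weighted semantic features jointly license a morphological device such as the reversive prefix. The features, weights, and threshold below are invented placeholders, not the trained values of the actual model.

  # Invented semantic micro-features and weights standing in for a learned
  # cryptotype; summed activation above threshold licenses the "un-" prefix.
  features = {
      "covering":  0.9, "enclosing": 0.8, "attachment": 0.7,
      "surface":   0.4, "motion":   -0.6, "creation":  -0.8,
  }

  verbs = {
      "tie":   {"attachment", "enclosing"},
      "wrap":  {"covering", "enclosing", "surface"},
      "build": {"creation"},
      "run":   {"motion"},
  }

  def licenses_un(verb_feats, threshold=0.8):
      return sum(features[f] for f in verb_feats) >= threshold

  for v, fs in verbs.items():
      print("un" + v if licenses_un(fs) else v + " (no un-)")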

Related to this project is our study of the organizational and reorganizational processes in lexical semantic learning. As learning progresses, the learner has to identify meaningful clusters among the increasing number of lexical items, organize them into a coherent structure, and also be able to adapt the structure as new input comes in. Depending on the similarities and consistencies between the old structure and the new input, the learner may make generalizations that are overly general (i.e., overgeneralization). Over time, the overgeneralization errors will be weeded out by the fine-tuning of the system (i.e., recovery from overgeneralizations). An emergentist account of the structural changes in the learning process provides mechanistic terms and a formal system for Piagetian notions of cognitive development.

REFERENCES

Bowerman, M. (1983). Hidden meanings: the role of covert conceptual structures in children's development of language. In D. Rogers & J. Sloboda (Eds.), The acquisition of symbolic skills. New York: Plenum.
Clark, E. (1973). What is in a word? On the child's acquisition of semantics in his first language. In T.E. Moore (Ed.), Cognitive development and the acquisition of language. New York: Academic Press.
Clark, E., Carpenter, K., & Deutsch, W. (1995). Reference states and reversals: Undoing actions with verbs. Journal of Child Language, 22, 633-662.
Cottrell, G., & Plunkett, K. (1994). Acquiring the mapping from meanings to sounds. Connection Science, 6, 379-412.
Elman, J. (1993). Learning and development in neural networks: the importance of starting small. Cognition, 48, 71-99.
Lakoff, G. (1987). Women, fire, and dangerous things. Chicago: The University of Chicago Press.
Li, P. (1993). Cryptotypes, form-meaning mappings, and overgeneralizations. In E. V. Clark (Ed.), Proceedings of the 24th Child Language Research Forum, Center for the Study of Language and Information, Stanford University, 162-178.
Li, P., & MacWhinney, B. (1996). Cryptotype, overgeneralization, and competition: A connectionist model of the learning of English reversive prefixes. Connection Science, 8, 1-28.
Li, P., & Bowerman, M. (in press). The acquisition of grammatical and lexical aspect in Chinese. First Language.
MacWhinney, B. (1996). Lexical connectionism. In P. Broeder & J. Murre (eds.) Cognitive approaches to language learning. Cambridge, MA: The MIT Press.
Whorf, B. (1956). Thinking in primitive communities. In J. B. Carroll (Ed.), Language, thought, and reality. The MIT Press.

Recurrent neural networks and global co-occurrence models: Developing contextual representations of word-meaning

Kevin Lund
Curt Burgess (curt@cassandra.ucr.edu)
Department of Psychology
University of California, Riverside

One prominent approach to deriving meaningful representations for words was developed by Osgood (1957) and involves obtaining judgements from human subjects using a semantic differential procedure: ratings for each word are collected on a number of bipolar adjective scales (e.g., good-bad, big-small), which constitute semantic feature dimensions. These ratings form a vector of scores which can be compared in order to compute word similarity. While these human-judgement based approaches can yield good results, they have two serious problems. First, they require the experimenter to choose an appropriate set of features on which words are to be rated; these features may not be appropriate for a wide range of word representations (in particular, it is problematic to devise a set of features which will represent both abstract and concrete words, and different grammatical classes). The other drawback is that they require very large numbers of human judgements for a relatively small number of words.

High-dimensional semantics

More recently, a theory of word meaning has appeared which holds that much information pertaining to word meaning exists in the contexts in which words are found. Our model of meaning, HAL (the Hyperspace Analog to Language), is an implementation of this theory (Lund & Burgess, 1996; Burgess & Lund, 1997); others exist as well (Landauer & Dumais, 1997; Foltz, 1996).

HAL operates by forming a matrix of word co-occurrences based on a corpus of text. Each axis of the matrix is indexed by the list of vocabulary items to be tracked, so that each word has its own row and column in the matrix. As the corpus is processed, words within a certain distance of each other are scored as co-occurring; their cells in the matrix are incremented by an amount inversely related to the number of words separating them. This process can be likened to sliding a "window" along the corpus; the word in the middle of the window has co-occurrences recorded with the other words. If, for instance, the window were five words wide, the window would contain a total of eleven words: the central word, the previous five words, and the following five words. The central word would score a co-occurrence of five with each of the two immediately adjacent words, and a one with each of the words on the ends of the window.

Table 1 shows an example co-occurrence matrix; the sentence at the top of the table was processed using a five-word window, with the resulting matrix at the bottom.

This window is moved one word at a time over the entire corpus. When the text has been processed, co-occurrence vectors can be extracted from the matrix for individual words; this process consists simply of concatenating the row and column for a particular word. These vectors can be compared for similarity, much like Osgood's vectors using the semantic differential technique.
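A minimal Python sketch of this windowed co-occurrence procedure; the ramped weighting (window - distance + 1) is inferred from the example above, and with the sentence of Table 1 it reproduces that matrix (with zero rows and columns kept).

  import numpy as np

  def hal_matrix(tokens, window=5):
      """Row entries record weighted co-occurrence with the preceding words,
      column entries (the transpose) with the following words, as in HAL."""
      vocab = sorted(set(tokens))
      idx = {w: i for i, w in enumerate(vocab)}
      M = np.zeros((len(vocab), len(vocab)), dtype=int)
      for i, w in enumerate(tokens):
          for d in range(1, window + 1):        # look back up to `window` words
              if i - d < 0:
                  break
              M[idx[w], idx[tokens[i - d]]] += window - d + 1
      return M, vocab, idx

  tokens = "the horse raced past the barn fell".split()
  M, vocab, idx = hal_matrix(tokens)
  print(vocab)
  print(M)                                      # reproduces Table 1

  def word_vector(word):
      """A word's vector is its row and column concatenated."""
      i = idx[word]
      return np.concatenate([M[i, :], M[:, i]])

  print(word_vector("barn"))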

Recurrent networks

Another approach to automated acquisition of semantics was pioneered by Elman (1990). This approach uses a recurrent neural network which is trained to predict upcoming words from a corpus. When the network has been trained, hidden unit activation values for each input word are used as word representations. This technique may not appear to have much in common with statistical analyses of word co-occurrence, but both form their representations based on word context, so they should yield similar results.

Experiment 1

Elman (1990) used an SRN (simple recurrent network) to develop semantic representations for words. The corpus was constructed using a small grammar which formed two- to three-word sentences from a 29-word lexicon; this grammar was used to generate a set of sentences totaling ~29,000 words.

This corpus was fed into a neural network consisting of the three standard layers plus a fourth context layer which echoed the hidden layer. The network was trained to predict the next word, given the current word and whatever historical information was contained in the context layer. At the end of training, the hidden layer activation values for each word were taken as word representations. A clustering of these representations is shown in Figure 1a; the network was clearly able to develop plausible representations for the words. Note that not only were "semantic" distinctions made, but also words were separated by grammatical class.

Our approach to replicating this used the HAL model and its global co-occurrence algorithm. A co-occurrence matrix was constructed for the Elman corpus using a window size of one word. As the context represented in Elman's neural network consisted of only prior items, word vectors were extracted from the co-occurrence matrix using only matrix rows (representing prior co-occurrence), yielding twenty-nine vectors of twenty-nine elements each. These vectors were normalized to constant length in order to account for varying word frequency in the corpus. A hierarchical clustering of these vectors is shown in Figure 1b.
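Continuing the Python sketch above, row vectors can be extracted, length-normalized, and clustered hierarchically; the seven-word sentence stands in for the Elman corpus, and scipy's clustering routines stand in for whatever software was actually used.

  import numpy as np
  from scipy.cluster.hierarchy import linkage, fcluster

  # Assume hal_matrix and tokens come from the sketch above, here with window=1.
  M, vocab, _ = hal_matrix(tokens, window=1)

  rows = M.astype(float)                      # rows only: prior co-occurrence
  norms = np.linalg.norm(rows, axis=1, keepdims=True)
  rows = np.divide(rows, norms, out=np.zeros_like(rows), where=norms > 0)

  Z = linkage(rows, method="average")         # hierarchical clustering
  print(dict(zip(vocab, fcluster(Z, t=2, criterion="maxclust"))))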

The cluster diagram using the global co-occurrence procedure is very similar to that obtained by Elman. Verbs and nouns are separated, as are animates. Within animates, words are grouped into small animals, dangerous animals, and people. Our diagram differs from Elman's mainly in a slightly different organization of the verbs.

Discussion

The recurrent neural network approach and the global co-occurrence approach to the generation of semantic structure have been compared and their results found to be similar. Why should such apparently dissimilar approaches yield the same results? We believe the answer is that both techniques capitalize on the similarity of context between semantically and/or grammatically similar words in order to construct representations of their meanings. Virtually the only thing which the two approaches have in common, in fact, is that they both have context information available to them. That they both find the same basic structure within the vocabulary argues strongly that context is a valid and fundamental carrier of information pertaining to word meaning - both at the semantic and grammatical level.

When should one approach be preferred to another? The recurrent network technique appears to be more sensitive to grammatical nuances. It also produces more compact representations, as the vectors are shorter than the vocabulary size (one element per hidden unit){footnote 1}. However, it has a drawback in that it doesn't scale well to real-world vocabularies - if tens of thousands of words are to be tracked, not only would the network be huge, but training it would be difficult and time consuming due to the sparseness of the representations to be learned.

These techniques are generally useful beyond making theoretical points about the relationship between meaning and context. Co-occurrence vectors from HAL have been used to account for the dissociation of semantic and associative priming with normals (Lund, Burgess, & Atchley, 1995; Lund, Burgess, & Audet, 1996) and with deep dyslexic patients (Buchanan, Burgess, & Lund, 1996), cerebral asymmetries in normals (Burgess & Lund, 1997a), grammatical and syntactic effects (Burgess & Lund, 1997b) and the semantics of proper names (Burgess, Livesay, & Lund, 1996). The present results suggest that the HAL model can be used without some of the overhead associated with connectionist models. This may be particularly important for large scale applications such as database retrieval systems or large scale memory models.

Table 1: Sample matrix, produced by applying a five-word co-occurrence window to the sentence "The horse raced past the barn fell." (Cells containing zeroes have been omitted, as have rows or columns containing all zeroes).

        barn    horse   past    raced   the
barn            2       4       3       6
fell    5       1       3       2       4
horse                                   5
past            4               5       3
raced           5                       4
the             3       5       4       2

Figure 1a: see Elman (1990)

Figure 1b: cluster tree from the one-word window co-occurrence matrix. (note: this should be printed in 80 columns to avoid wrapping the figure)

  ___________|-> glass
 |           |-> plate
 |                       ________|-----> eat
 |                      |        |-----> smash
 |                      |                |-----> break
-|           |----------|        |-------|      ___|---> chase
 |           |          |        |       |-----|   |---> like
 |           |          |--------|             |___|---> move
 |           |                   |                 |---> smell
 |           |                   |       |---> see
 |           |                   |-------|    __|-> exist
 |-----------|                           |---|  |-> sleep
             |                               |--> think
             |                     |-----> book
             |          |----------|      _|-> bread
             |          |          |-----| |-> cookie
             |          |                |-> sandwich
             |----------|           _________|-> car
                        |          |         |-> rock
                        |----------|          _______|---> dragon
                                   |         |       |___|--> lion
                                   |         |           |-->monster
                                   |---------|        ___|--> cat
                                             |       |   |__|-> dog
                                             |-------|      |-> mouse
                                                     |       _|-> boy
                                                     |   |--| |-> girl
                                                     |---|  |-> man
                                                         |--> woman

FOOTNOTES********************************

{1} Co-occurrence matrices can, however, be reduced in size. Sections with low variance can simply be eliminated with little impact on practical results. Alternatively, variance decomposition techniques can be (and have been) applied with excellent results (Landauer & Dumais, 1997).
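A Python sketch of the variance-decomposition option, assuming a co-occurrence matrix M as in the earlier sketch:

  import numpy as np

  # Keep the k leading singular directions of a (V x V) co-occurrence matrix M
  # to obtain short, dense word vectors (cf. Landauer & Dumais, 1997).
  k = 3
  U, S, Vt = np.linalg.svd(M.astype(float), full_matrices=False)
  reduced = U[:, :k] * S[:k]          # one k-dimensional vector per word
  print(reduced.shape)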

Maartje Raijmakers
Developmental Psychology, University of Amsterdam
Roetersstraat 15, 1018 WB Amsterdam,
The Netherlands

The acquisition of qualitatively new cognitive capacities is an important subject of child psychology, both theoretically (the nature-nurture discussion) and empirically (Piagetian research). According to the learning paradox, which implies a nativist position, it is impossible to acquire qualitatively new cognitive capacities. The learning paradox is often used as a counterargument against epigenetic theories of development, like the cognitive stage theory of Piaget. Recent theories of self-organization, like catastrophe theory and synergetics, might solve the learning paradox. The most important feature of non-linear dynamic systems, in this context, is the appearance of self-organization. Some simulation models of cognitive development, especially neural networks, are non-linear dynamic systems, which might have the potential to show qualitative cognitive development. However, precise criteria for establishing the acquisition of more powerful structures by neural network models appear to be difficult to obtain. I will present formal and precise criteria that can be applied to simulation models of learning and cognitive development (Raijmakers, 1996). These criteria concern three aspects of a developing neural network: the dynamics, the power of a structure, and the functionality of the system. The operationalization of the criteria is based on catastrophe and bifurcation theory, complexity measures, and the discrimination-shift task.

In addition, I developed a complete implementation of an Adaptive Resonance Theory (ART) network, including all regulatory and logical functions, as one system of ordinary differential equations capable of stand-alone running in real time (Raijmakers & Molenaar, 1997). This means that transient behavior is kept intact. This implementation of ART, which is called Exact ART, is based on ART 2 (Carpenter & Grossberg, 1987) and Grossberg's (1980) original ideas. Exact ART includes an implementation of a Gated Dipole Field and an implementation of the Orienting Sub-System. The representation layer of Exact ART (F2) is subjected to a numerical bifurcation analysis (Raijmakers, Van der Maas & Molenaar, 1996). Fold bifurcations are found in the activity of the representation layer under variation of several architecture parameters. The stability of the ART network with the F2 layer in different dynamic regimes is maintained, and the behavior remains functional in Exact ART. Through a bifurcation, the learning behavior of Exact ART may change from forming local representations to forming distributed representations. The appearance of bifurcations is one of the criteria of qualitative cognitive development mentioned above.
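The flavor of such a numerical bifurcation analysis can be conveyed by a Python sketch on a single shunting unit with sigmoidal self-excitation; the equation and parameter values below are illustrative stand-ins for the full Exact ART F2 dynamics, not the published model.

  import numpy as np

  # Equilibrium count of a single shunting unit,
  #   dx/dt = -A*x + (B - x) * (I + g * f(x)),  f(x) = x**2 / (0.25 + x**2).
  # An intermediate range of the feedback gain g has three equilibria
  # (two stable, one unstable), bounded by fold bifurcations.
  A, B, I = 1.0, 1.0, 0.05

  def dxdt(x, g):
      f = x ** 2 / (0.25 + x ** 2)
      return -A * x + (B - x) * (I + g * f)

  xs = np.linspace(0.0, 1.0, 20001)
  for g in (0.5, 1.0, 1.5, 2.0, 2.5):
      signs = np.sign(dxdt(xs, g))
      crossings = np.count_nonzero(signs[:-1] * signs[1:] < 0)
      print(f"g = {g:3.1f}: {crossings} equilibria")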

The third criterion concerns the cognitive plausibility of the learning process. In the sixties and seventies, the development of discrimination learning was examined extensively within the discrimination-shift paradigm. Recently, the discrimination-shift paradigm has been receiving new attention for two reasons. A key issue for research on computational models of learning and memory concerns the nature of representations of input-output relations. The discrimination-shift paradigm has been proposed as an empirical test to characterize the nature of the representations that are formed in a neural network (Raijmakers, Van Koten & Molenaar, 1996). An issue concerning the development of human discrimination learning has been the possible discontinuity of the developmental process. Based on catastrophe theory, new tools to test the nature of the developmental process have been proposed (Van der Maas & Molenaar, 1992). To test one of the catastrophe flags, modality, we fit mixtures of distributions derived from Markov models to empirical data of a cross-sectional developmental study. In addition, a nonparametric test is applied to test the modality of the distribution. Currently, I am extending the Exact ART model so that it can perform a discrimination-shift learning task. In addition, the developmental process will be modeled by varying architectural parameters in addition to learning by means of changing weights. One of the main objectives is to model the qualitative properties of the empirical data, particularly the bimodal distributions of the total number of errors before a solution.

References:

Carpenter, G. & Grossberg, S. (1987). ART 2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics, 26 (23), 4919-4930.
Grossberg, S. (1980). How does a brain build a cognitive code. Psychological Review, 87, 1-51.
Raijmakers, M.E.J. & Molenaar, P.C.M. (1997). EXACT ART: A Complete Implementation of an ART Network. Neural Networks, 10 (4): 649-669.
Raijmakers, M.E.J. (1996). Epigenesis in neural network models of cognitive development: bifurcations, more powerful structures, and cognitive concepts. Ph.D. Thesis at the University of Amsterdam.
Raijmakers, M.E.J., Van Der Maas, H.L.J. & Molenaar, P.C.M. (1996). Numerical Bifurcation Analysis of Distance-Dependent On-center Off-surround Shunting Neural Networks. Biological Cybernetics, 75 (6): 495-507.
Raijmakers, M.E.J., Van Koten, S. & Molenaar, P.C.M. (1996) On the validity of simulating stagewise development by means of PDP-networks: Application of catastrophe analysis and an experimental test of rule-like network performance. Cognitive Science 20-1: 101-136.
Van Der Maas, H.L.J. & Molenaar, P. (1992). Stagewise cognitive development: an application of catastrophe theory. Psychological Review 99 (3): 395-417.

A Connectionist Account of the Mutual Exclusivity Bias for Word-Learning

Terry Regier
Psychology Department
University of Chicago

Introduction

Young children tend to assume that if an object has one name, it cannot have another. This "mutual exclusivity bias" may assist children in learning the meanings of words (Markman, 1989), by providing a source of implicit negative evidence. If the child assumes that something called a 'dog' cannot also be a 'cat', that helps to delineate the semantic extent of the word 'cat'. Children appear to receive little explicit negative evidence when learning word meanings, so implicit negative evidence of this sort would be helpful.

Two problems with this line of argumentation have been brought up in the literature, which we may think of as the weakness problem and the fallibility problem. The weakness problem is that the mutual exclusivity bias is weak just when it is most urgently needed for word-learning. The bias is weakest in children 2-2.5 years of age (Liittschwager and Markman, 1994; Merriman and Stevenson, in press; Mervis et al., 1994) - and this is the age when children are starting to build up their vocabularies. The bias grows stronger in later years - but by then children have already acquired a sizeable lexicon (Merriman and Bowman, 1989). The fallibility problem is that the bias is clearly incorrect much of the time - the same referent can be 'Fido', 'the dog', 'our pet', and so on. Thus the bias can present the learner with *false* implicit negative evidence.

In this presentation I suggest that the solution to each of these two problems lies in the other. I present a connectionist model that uses mutual exclusivity to acquire word meanings, and demonstrate that the bias is most effective if it is weak early in learning. This fact in turn stems from, and compensates for, the fallibility of the bias. The simulations thus suggest a rationale for the observed early weakness of the bias. Further simulations demonstrate that a weak form of mutual exclusivity can also account for overgeneralization patterns that have been observed in children learning spatial terms in Dutch and English.

The Simulations

The simulations examine mutual exclusivity in the domain of spatial relations naming. The connectionist model is a structured multi-layer perceptron (Regier, 1996), trained under back-propagation. It accepts a perceptual scene as input, showing a located object in some spatial configuration relative to a ground object. The network is then trained to classify the scene into one of the spatial categories being learned - there is one output node for each of these categories. No explicit negative evidence is provided - only positive examples for each of the categories, and implicit negative evidence derived from that positive evidence through mutual exclusivity. Thus, a positive example of 'above' is implicitly taken as a negative example of 'below', 'in', 'outside', etc. The purpose of the simulations was to examine the efficacy of learning given different strengths for mutual exclusivity.

Implementing a weak mutual exclusivity bias is straightforward. Since weight changes in the network are a function of the error, we can weaken the bias by deliberately attenuating the error caused by implicit negative evidence. We let beta be a "seriousness" parameter, and incorporate it into the usual error calculation:

E(i,p) = 1/2 ( (t(i,p) - o(i,p)) x beta )^2

For positive evidence, beta is always 1: positive evidence is taken seriously. For implicit negative evidence, beta can be set to either 1 (mutual exclusivity at full strength), 0 (equivalent to no negative evidence), or some value in between (weakened mutual exclusivity). The work presented here demonstrates that when mutual exclusivity is employed in weak or attenuated form, it may effectuate learning despite the presence of false implicit negative evidence -- that is, despite the fallibility of the bias. However, the bias is ineffective when taken at full strength throughout learning: the false implicit negative evidence introduced through the fallible bias impedes learning. In the simulations, learning is most effective when the bias is weak early in learning, and then eventually increases in power, matching the empirically observed profile. This weak-to-strong progression works since initial learning with a weak bias yields a preliminary grasp of the extension of each of the concepts. Based on this, the model may then decide, on the basis of extensional overlap in these initial meanings, which words should not provide implicit negative evidence for each other -- that is, which word-pairs appear to be exceptions to the mutual exclusivity rule. Once the child has identified such ``troublemakers'', and stopped deriving implicit negative evidence from them, the child's confidence in the bias may grow, resulting in a strengthened bias for all remaining words. Thus, the central intuition behind this proposal is that children *bootstrap* themselves up from an initial understanding of a word's meaning, obtained using mutual exclusivity in weak form, to a more complete understanding, obtained using a stronger version of the bias once extensional overlaps have been identified. The significance of this idea is that it provides a rationale for the otherwise puzzling weak-to-strong transition that is empirically observed. Further support for this account is given by the fact that it correctly predicts overgeneralization patterns in children learning spatial terms in Dutch and English.
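A minimal Python sketch of the attenuated error computation defined above; the spatial terms, output values, and the particular beta settings are invented for illustration.

  import numpy as np

  def weighted_error(target, output, is_implicit_negative, beta=0.3):
      """Error from the abstract: E = 1/2 * ((t - o) * beta)^2, with beta = 1
      for positive evidence and beta < 1 for implicit negative evidence."""
      b = np.where(is_implicit_negative, beta, 1.0)
      return 0.5 * ((target - output) * b) ** 2

  # One training example: a positive instance of 'above' also serves as an
  # attenuated implicit negative for every other spatial term.
  terms = ["above", "below", "in", "outside"]
  target = np.array([1.0, 0.0, 0.0, 0.0])
  output = np.array([0.6, 0.4, 0.2, 0.3])
  implicit = np.array([False, True, True, True])

  print(weighted_error(target, output, implicit))           # weak bias (beta=0.3)
  print(weighted_error(target, output, implicit, beta=1.0)) # full-strength bias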

References

Liittschwager, J. and Markman, E. (1994). Sixteen and 24-month-olds' use of mutual exclusivity as a default assumption in second label learning. Developmental Psychology, 30:955-968.

Markman, E. (1989). Categorization and Naming in Children: Problems of Induction. MIT Press, Cambridge.

Merriman, W. and Bowman, L. (1989). The mutual exclusivity bias in children's word learning. Monographs of the Society for Research in Child Development, 54(220).

Merriman, W. and Stevenson, C. (in press). Restricting a familiar name in response to learning a new one: Evidence for the mutual exclusivity bias in young 2-year-olds. Child Development.

Mervis, C. B., Golinkoff, R. M., and Bertrand, J. (1994). Two-year-olds readily learn multiple labels for the same basic-level category. Child Development, 65:1163-1177.

Regier, T. (1996). The Human Semantic Potential: Spatial Language and Constrained Connectionism. MIT Press, Cambridge.

A dynamic model of word concept acquisition

Gabriele Scheler
Institut für Informatik
TU München
D 80290 München
scheler@informatik.tu-muenchen.de

Classificational Learning of Word Meaning

When we look at processes of conceptual learning, we may examine the formation of linguistic categories, and in particular the establishment of word concepts, for several reasons:

In the domain of the mental lexicon, primary processes of associational learning are obviously counterbalanced by structuring and re-structuring processes. At the same time, learning appears to be continuous and not tied to the development of the individual (learning new words is possible at any age), so restructuring during learning cannot be relegated to a purely developmental perspective. Chomsky [2] already noted that associational learning, i.e. the principles of habituation (learning frequent patterns) and novelty detection (learning rare, unexpected patterns), has to be balanced by active principles of structuring the input, in order to achieve independence from the stochastic peculiarities of the input. In [4] we examined the possibility of understanding the formation of semantic representations as a form of data compression, i.e. leading from single instances (``cases'', ``exceptions'') to partial rules capturing important generalizations (``cryptotypes'') and on to complete analogical rules. By incorporating some ``observational biases'' into the algorithm, the tolerance for stochastic variation in the input data could be improved -- i.e. different datasets led to the same result structure. Since then we have developed a more complete model of lexical acquisition ([13]) which incorporates the important principle of active understanding (cf. [3]), i.e. an intrinsic selection by the learning system of input data from an unstructured repository.

Changing Representations during Learning

In the following we present a model of lexical acquisition which uses the principle of ``re-description'' or active understanding in the following way: co-occurrence data are sampled, weighted, and filtered by selective attention mechanisms in order to obtain a set of associational data. These data may be interpreted as similarity relations and used for clustering (classification) of words. Similarly, these data can be used to obtain multidimensional feature sets as distributed semantic representations of lexical words.

The feature descriptions obtained for individual words may, however, be used to define new contextual classes. Thus the set of associational input data changes from the perspective of the system, since the classification of contexts has changed. This is the basis for a ``dynamic'' process of concept formation.

What does this mean in practice?

While we are sampling a text for co-occurrence data, we may use an imposed structure to improve the significance of the data obtained. We use this principle, for instance, when we adhere to sentence boundaries, phrase structure, or head-modifier relations. Similarly, we may define a lexical context by the abstract semantic features (classes) that a surface word possesses (belongs to). (Classes are, of course, just an alternative way of describing features.) For instance, we may classify occurrences of ``color terms'' (blue, mauve, orange) according to the type of noun they occur with: toys, clothes, cars, etc. For that we need semantic feature data on a number of nouns to determine which of the abstract classes they belong to. On the other hand, co-occurrence with color terms adds a dimension to the semantic descriptions of nouns, which may affect their overall classification. This means that the definition of contextual classes may change in further samplings of the data.

It is clear, then, that discovering semantic relations is a dynamic process, in which the representations change during acquisition. We have applied an acquisition mechanism which uses contextual substitution and feature decay (fading of unsupported features) to medium-sized texts in order to extract prepositional meanings and ``cryptotypes'' in the domain of spatial relations. The results are interesting in that we can observe a transition from a predominance of verb-preposition subclassification concerning motion, direction and locality to preposition-noun subclassifications concerning shape and function of concrete as well as abstract nouns (in the domain of prepositional meaning). Changing the various parameters produces differently shaped learning curves.
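By way of illustration, the following is a self-contained sketch of such a dynamic sampling loop. It is my construal, not the model's implementation: the toy corpus, the greedy clustering, and the relabeling step are all invented, and the model itself redefines only the contextual classes rather than relabeling every word.

    from collections import Counter, defaultdict

    def cooccurrence_vectors(sentences):
        """Count co-occurrences within sentence boundaries (the imposed
        structure used to improve the significance of the data)."""
        vecs = defaultdict(Counter)
        for sent in sentences:
            for i, w in enumerate(sent):
                for j, v in enumerate(sent):
                    if i != j:
                        vecs[w][v] += 1
        return vecs

    def cosine(a, b):
        dot = sum(a[k] * b[k] for k in set(a) | set(b))
        na = sum(x * x for x in a.values()) ** 0.5
        nb = sum(x * x for x in b.values()) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    def classify(vecs, threshold=0.5):
        """Greedy clustering of words by co-occurrence similarity; each
        cluster plays the role of an abstract semantic class."""
        classes, labels = [], {}
        for w, v in vecs.items():
            for name, proto in classes:
                if cosine(v, proto) >= threshold:
                    labels[w] = name
                    break
            else:
                name = "class%d" % len(classes)
                classes.append((name, v))
                labels[w] = name
        return labels

    def redescribe(sentences, labels):
        """Replace words by their current class, so that the co-occurrence
        data seen on the next pass is defined over changed contexts."""
        return [[labels.get(w, w) for w in sent] for sent in sentences]

    # The dynamic loop: representations and contextual classes co-evolve.
    sentences = [["the", "blue", "car"], ["the", "mauve", "coat"],
                 ["a", "blue", "coat"]]
    for _ in range(3):
        labels = classify(cooccurrence_vectors(sentences))
        sentences = redescribe(sentences, labels)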

A significant unanswered question within this model is the relation between the actual, contextual meaning that a word may have in a given utterance context and the general, lexical representation that is built up during the acquisition process and with which a word is instantiated at each point in time. Rather than using the full lexical representation to define a particular context for a target word, we may want to use a selection of features which conform to the actual, contextual meaning. This implies a model of word meaning assignment which is essentially a feature selection process. The principle of meaning assignment as selection of contextually relevant features has been explored before ([12], [11]). However, the corresponding experiments in this setting have not yet been performed.

Discussion

A number of models in connectionist linguistics ([7, 8, 10]) have shown that categorization processes are prominent in language, not only in phonology, but also in the domain of morphological and lexical meaning. The related processes of categorical perception ([5]) provide a link to neurobiological theories of perception and cortical processing in general. Purely associational accounts of categorical learning, however, fall short both of producing high-performance systems of lexical acquisition (cf. [1] for a state-of-the-art review) and of providing psychologically adequate models (cf. [9]).

We have made some suggestions concerning the first problem ([4]), and we have provided a model which incorporates a form of dynamic learning in answer to the second one ([13]). These models, however, are poorly integrated at present, and it seems clear that much work remains to be done to overcome the limitations of categorization learning. Since neurobiology does not provide obvious answers here, new theoretical models may be expected to emerge.

References

[1] B. Boguraev and J. Pustejovsky, editors. Corpus Processing for Lexical Acquisition. MIT Press, 1996.

[2] N. Chomsky. Aspects of the Theory of Syntax. MIT Press, 1965.

[3] P. Churchland, V. Ramachandran, and T. Sejnowski. A critique of pure vision. In C. Koch and J. Davis, editors, Large-Scale Neuronal Theories of the Brain. MIT Press, 1994.

[4] Niels Fertig and Gabriele Scheler. Constructing semantic representations using the MDL principle. In Proceedings of HELNET '97, 2-5 October, Montreux, Switzerland, 1997.

[5] S. Harnad, editor. Categorical Perception: The Groundwork of Cognition. Cambridge University Press, 1987.

[6] B. MacWhinney. The CHILDES Manual. Lawrence Erlbaum, 2nd edition, 1995.

[7] J. L. McClelland and A. Kawamoto. Mechanisms of sentence processing: Assigning roles to constituents. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, pages 77-109. MIT Press, Cambridge, MA, 1986.

[8] Risto Miikkulainen. Subsymbolic Natural Language Processing: An Integrated Model of Scripts, Lexicon, and Memory. Neural Network Modeling and Connectionism Series. MIT Press, 1993.

[9] Reinhard Rapp. Die Berechnung von Assoziationen (The Computation of Associations). Georg Olms, 1996.

[10] Terry Regier. The Human Semantic Potential: Spatial Language and Constrained Connectionism. MIT Press, 1996.

[11] Gabriele Scheler and Kerstin Fischer. The many functions of discourse particles: A computational model. In Proceedings of the Nineteenth Annual Conference of the Cognitive Science Society, Stanford, August 7-10, 1997.

[12] Gabriele Scheler. Learning the semantics of aspect. In Harold Somers, editor, New Methods in Language Processing. University College London Press, 1997.

[13] Gabriele Scheler. Feature discovery for semantic representations. Cognitive Science, 1998. (in preparation)

The Necessity of Flexible Features for Concept Learning

Philippe Schyns
Univ. of Glasgow

Most current models of category learning leave aside the issue of feature learning and feature development. Their feature set is fixed and unaffected by the classification and learning processes. Classification and learning processes, however, operate on a stable featural analysis (a perceptual organization) of the ever-changing retinal input. Even though our sophisticated visual apparatus probably comes equipped with a priori ways of analyzing and organizing retinal images, there are occasions when a relevant perceptual analysis is not readily available. For example, complete novices at reading chest X-rays (e.g., Christensen, Murry, Holland, Reynolds, Landay & Moore, 1981), sexing chickens (Biederman & Shiffrar, 1987), and categorizing dermatoses (Norman, Brooks, Coblenz & Babcock, 1992) have little understanding of the relevant dimensional structure of these categories. Even when told what the signs of the different diagnoses are, novices are not always able to see the features experts use to organize the input. If one takes a developmental perspective, it seems clear that infants and young children are not always able to analyze objects using all the stimulus dimensions that are used by adults (Smith, Carey, & Wiser, 1985; Smith & Kemler, 1978; Ward, 1983).

Thus, there is suggestive evidence that features are flexible -- i.e., that they adjust to the perceptual experience and the categorization history of the individual. Flexible features open the possibility that the same input is differently perceived and analyzed before being categorized. Hence, a complete theory of categorization and conceptual development should not only explain the ways in which object features are combined to form concepts; it should also explain the origin and the development of the features participating in the analysis of the input.

An important assumption of the fixed-set stance is that primitives are the lowest, non-decomposable building blocks of object representation (see, e.g., Biederman, 1987). Using set notation, if Ft = {f1, f2, ..., fn} defines the fixed set of basic features available at time t, then Rt, the powerset of Ft, defines the repertoire of all object representations expressible with, for example, feature conjunctions. That is, Rt = {-, f1, f2, ..., fn, f1 & f2, ..., f1 & fn, f2 & f3, ..., f2 & fn, ..., f1 & f2 & ... & fn}, where "&" represents logical conjunction and "-" the empty description. Although very simple, this characterization captures the idea that a set of fixed elements and construction rules can generate a large number of distinct object descriptions.

Componential theories of category learning should offer powerful principles for developing new representations, if only to account for the flexibility of conceptual development and the diversity of object categories the individual may encounter. Fixed-set models would limit new concepts to new combinations of fixed features -- i.e., to elements of Rt that are not yet actual concepts in memory. Consequently, if a categorization requires a feature that is not present in the fixed set (e.g., if the features of Ft must distinguish between the two object descriptions f1 & fn and f1 & fn & fn+1), the categorization cannot be learned, because learning it would violate the assumption that the set is fixed. Similarly, if a categorization requires the decomposition of a basic feature (e.g., if f1 of Ft must be decomposed into f1* and f1** to distinguish between the two objects f1* & f1** & fn and f1** & fn), the categorical difference cannot be represented, because representing it would violate the assumption that basic features are not decomposable.
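The two difficulties can be made concrete in a few lines of code. The sketch below is my own, using Python frozensets for conjunctive descriptions; nothing in it comes from the paper.

    from itertools import chain, combinations

    def repertoire(Ft):
        """Rt: the powerset of the fixed feature set Ft, i.e. every object
        description expressible as a conjunction of basic features."""
        return {frozenset(c)
                for c in chain.from_iterable(combinations(Ft, r)
                                             for r in range(len(Ft) + 1))}

    Ft = {"f1", "f2", "f3"}
    Rt = repertoire(Ft)

    # Difficulty 1: a description requiring a feature outside Ft (here f4)
    # is simply not an element of Rt, so the categorization is inexpressible.
    assert frozenset({"f1", "f4"}) not in Rt

    # Difficulty 2: descriptions built from the decomposition of f1 into
    # f1* and f1** likewise fall outside Rt, since basic features are
    # assumed to be non-decomposable.
    assert frozenset({"f1*", "f1**", "f3"}) not in Rt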

Of course, a clever theorist could always handcraft a fixed basis of features that is capable of solving a few, or even many instances of these two generic difficulties. These, however, do not challenge the competence of category learning theorists, but the competence of category learning theories. In sum, I am claiming that a fixed feature stance does not have a way of accounting for the creation of new features, and I will argue and empirically demonstrate that such a process will have to be included in a complete account of category learning.


Joshua Tenenbaum
MIT Department of Brain and Cognitive Sciences

People routinely learn concepts from only a few positive examples, while machine pattern recognition systems generally require many examples of both positive and negative instances. I will present work on a Bayesian theory of concept learning aimed at bringing computational models of concept learning closer to the potential of human learning. The chief goals of this work are (1) a better understanding of the computational basis of human concept learning, and (2) more human-like computer learning algorithms, to support more adaptive and more natural paradigms for human-computer interaction.

Specifically, I will discuss how theories of concept learning may account for the phenomenon of bounded generalization from positive examples. Given a domain of objects and a set of examples of a new concept in that domain, people are usually willing to generalize the concept beyond those examples, but not indiscriminately to all objects in the domain. Often people's generalizations do not respect their a priori judgments of similarity; a new concept may dictate its own concept-specific notion of similarity that emphasizes features most relevant for that concept. Thus the problem of bounded generalization is not just a matter of "how far" to generalize a concept, but also "in what ways" to generalize.

Understanding how the generalization of a concept is bounded after only a *small number* of positive examples have been encountered is a central issue in both human and machine learning. First, consider a child learning the words "dog" or "grandmother." After seeing only a few instances of dogs or grandmothers, she will productively but very selectively (if not always by normal adult criteria) apply those terms to new entities. Most dogs will probably qualify as "dogs" to her; dachshunds or poodles may not; some cats or bears may; giraffes and elephants and horses and cows and ants and mice and people most likely do not. Receiving negative feedback on counterexamples, as when she calls a doggish-looking cat "dog" and then hears "No, cat", may be helpful in establishing the precise extensions of new concepts, but is definitely not necessary before she is willing to call, on some nonarbitrary grounds, certain things "dogs" and not others.

Now consider a computer system for interactive scene analysis, which learns to extract image regions satisfying a particular visual concept such as "leaves", "sand", or even "dogs". As a human user, I would like this system to automatically label all "leaves" or "dogs" regions in an image or set of images, given a few regions that I have labeled as examples of "leaves" or "dogs". Ideally, I would like the computer to be just as productive yet selective as the child above in generalizing these concepts to new image regions, because that is how I am used to interacting with other human learners. Providing more than a few examples of "leaves" regions quickly gets tedious, and providing good negative examples, good instances of "non-leaves", may be difficult for the average user.

Currently, neither machine learning models nor models from cognitive psychology have an adequate account of these kinds of phenomena, even for very simple domains. Most models of concept learning in these fields assume that both positive *and* negative examples are available to the learner, and that the negative examples are essential for solving the problems of "how far" and "in what ways" to generalize. Typically, it is assumed that a concept C should be generalized to objects which are in some sense closer to the positive examples of C than to the negative examples of C ("how far"), as measured by a distance function which usually gives highest weight to the feature(s) that best *discriminate* between the positive and negative examples of C ("in what ways"). This assumption shows up in some form both in standard pattern recognition techniques (K-nearest-neighbor classification, linear or nonlinear discriminant analysis) and in standard models of concept learning in cognitive psychology (including most similarity-to-exemplar and similarity-to-prototype models).

The "novelty detection" approach to concept learning, based on estimating the probability density function (PDF) for a concept's positive instances, does not explicitly require negative examples and is becoming more popular in both machine learning and cognitive science. However, this approach has two major drawbacks as an explanation for how people generalize a concept from only a few positive examples. First, PDF estimation has primarily been studied in the limit of infinite training data. Second, even if the PDF of a concept is known exactly, an arbitrary threshold is still required in order to classify new objects, leaving unanswered the problem of "how far" to generalize a learned concept.

In order to give a principled account of bounded generalization in human learning, I have developed a theory of concept learning as Bayesian inference. This theory is related to previous psychological models of category-based inductive inference proposed by R. Shepard, J. Feldman, and D. Osherson et al., which emerge as special cases or approximations to the Bayesian formulation.

Given a universe U of objects, the learner observes one or more examples E = {E_1, ..., E_n} of a concept C. Crucially, the examples E are assumed to be randomly sampled (e.g. independently according to a uniform density) from the extension (subset of positive instances) of C. This random sampling assumption embodies the idea that successful concept learning requires a "representative" set of examples. The learner is then required to estimate P(x in C|E), the probability that C should be generalized to apply to some new object x in U given the observed examples E. In the Bayesian theory, the learner estimates P(x in C|E) by treating the examples E as statistical evidence for different possible extensions of C and then integrating the predictions of all possible extensions h_i, weighted by their posterior probability:

    P(x in C|E) = Sum_i [ P(x in C|h_i) p(h_i|E) ].

The possible extensions h_i are just subsets of the universe U, and P(x in C|h_i) is simply 1 if h_i contains x and 0 otherwise. The posterior probability p(h_i|E) of hypothesis h_i is calculated from Bayes' rule, assuming a prior density p(h_i) over hypotheses and a likelihood p(E|h_i) corresponding to the random sampling assumption mentioned above.
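In code, the hypothesis average can be written down directly. The sketch below is mine, with a toy universe of integers, interval hypotheses, and a uniform prior assumed purely for illustration; the likelihood p(E|h) = 1/|h|^n implements the random sampling assumption.

    def generalization_prob(x, examples, hypotheses, prior):
        """P(x in C | E): average the predictions of all hypothesis
        extensions, weighted by their posterior p(h|E)."""
        n = len(examples)
        # p(E|h) = |h|^(-n) if h contains all the examples, else 0.
        post = {h: (prior[h] / len(h) ** n
                    if all(e in h for e in examples) else 0.0)
                for h in hypotheses}
        z = sum(post.values())
        return sum(p for h, p in post.items() if x in h) / z

    # Toy universe: the integers 0..9, with interval hypotheses.
    hyps = [frozenset(range(a, b)) for a in range(10) for b in range(a + 1, 11)]
    prior = {h: 1.0 / len(hyps) for h in hyps}
    print(generalization_prob(5, [2, 3], hyps, prior))

Because p(E|h) shrinks as |h| grows, smaller hypotheses consistent with the examples receive higher posterior probability, which is the version of Occam's Razor discussed next.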

All other things being equal, this statistical model assigns higher posterior probability to smaller, more specific hypotheses than to larger, more general hypotheses that are equally consistent with the observed examples. To see how this version of Occam's Razor leads to reasonably bounded generalizations, consider a simple one-dimensional universe of objects, such as micro-organisms varying in size. Suppose you are trying to learn a concept C which might reasonably be assumed to correspond to some connected interval in this space, such as micro-organisms that cause a certain intestinal illness. Suppose further that your only other knowledge about C consists of having observed examples E = {E_1, ..., E_n} of organisms that cause this illness, with sizes between s_min and s_max. Now you need to estimate the probability P(x in C|E) that a new and larger organism x, of size s_x > s_max, would also cause this illness. According to the Bayesian theory described above, P(x in C|E) would be given by:

    P(x in C|E) = [ 1 / (1 + (s_x - s_max)/(s_max - s_min)) ]^(n-1).

Here, the probability of generalization becomes a function of the distance from x to the nearest exemplar s_max, scaled by the variability of observed examples (s_max - s_min) and the number of observed examples n. Thus we have a principled answer to the question of "how far" to generalize the concept C; generalizations become both more conservative and confident as n increases or the variability (s_max - s_min) decreases. Reasonable bounds on generalization may be obtained by thresholding at P(x in C|E) = .5; for the case of n=2 examples with sizes s_1 and s_2, a new organism x is then predicted to be a positive instance of C if the distance between x and the nearer of s_1 or s_2 is no more than |s_1 - s_2|. In the case of multidimensional stimuli, a similar analysis gives a principled answer to the question of "in what ways" to generalize a concept.
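A direct transcription of this closed form makes the threshold behavior easy to check. The symmetric treatment of test items below s_min is my addition; the formula in the text is stated for s_x > s_max.

    def p_generalize(s_x, sizes):
        """Closed-form generalization for the 1-D interval model."""
        s_min, s_max = min(sizes), max(sizes)
        n = len(sizes)
        if s_min <= s_x <= s_max:
            return 1.0            # every consistent hypothesis contains x
        d = s_x - s_max if s_x > s_max else s_min - s_x  # distance to nearest exemplar
        return (1.0 / (1.0 + d / (s_max - s_min))) ** (n - 1)

    # With n = 2 examples, P(x in C|E) falls to .5 exactly when x lies
    # |s_1 - s_2| beyond the nearer example, as stated above.
    assert abs(p_generalize(3.0, [1.0, 2.0]) - 0.5) < 1e-12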

Based on this theory of concept learning, two lines of empirical research are in progress. We are studying the generalization bounds of human subjects in concept learning experiments, to establish the extent to which this simple theory can explain real behavior. We are also developing a computer system for interactive visual scene analysis, in which the computer learns to extract image regions defined by a particular concept (e.g. a certain texture or object class) after seeing a small sample of that concept selected by the user. I hope to report preliminary results on these projects at the workshop.

Elephants for the Blind
Learning Deep Concepts by Contrastive Redescription

Chris Thornton
University of Sussex
UK

Introduction

In the parable of the Blind Men and the Elephant, `six men of Indostan' take it in turns to examine an elephant. The first pronounces it to be `very like a wall', the second `very like a spear', the third `very like a tree', and so on (quotations from the poem by John Godfrey Saxe). None of the blind men produces an appropriate comparison. Each one is `partly right' but all are `in the wrong.'

The story is evidently very old, but for anyone interested in artificial intelligence the message seems decidedly up-to-date. The comparisons produced by the blind men are just like the `definitions' typically produced by machine learning methods. Each one latches onto some gross feature of the relevant data which happens to be distinctive in the case considered. But all turn out to be `in the wrong', reminding us of the way in which machine-learned definitions frequently turn out to yield poor generalisation. The story is particularly reminiscent of the neural network which learned to `correctly' distinguish images of tanks from images of non-tanks by noticing that tank images were -- within the data presented -- noticeably lighter than non-tank images. This latter-day Blind Man effectively proclaimed the tank to be `very like a sunny day'.

The tendency of learners to latch onto gross, statistical features of the input data is naturally seen as a problem, particularly when target concepts are known *not* to be manifest in this form. But there are reasons to believe that learners inevitably behave this way [Clark and Thornton, Behavioral and Brain Sciences, 1997], and it may therefore be more profitable to examine the ways in which `blind man' definitions can be better utilised. In the work described in the full version of this paper, I have investigated contrastive redescription, a weak method for deep concept learning based on Karmiloff-Smith's psychological model of representational redescription [Karmiloff-Smith, Beyond Modularity, 1992]. In contrastive redescription, blind-man definitions are not used `as is', e.g., for purposes of classification. Rather, they are used for *redescribing* the input data. The redescription process may be iterated recursively, leading to the production of a structure of data layers. Provided that reasonable feedback is available (i.e., provided the classification of input points accurately reflects the target concept), this process has the effect of gradually re-moulding the input data so as to bring salient details to the statistical surface. Eventually, the target concept becomes manifest in the gross statistical form which is readily exploited by conventional learning methods.
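To make the layering mechanics concrete, here is a highly schematic sketch. It is my construal of the idea, not the paper's method: the `blind man' learner and the toy data are invented, and nothing in the sketch guarantees that the target regularity eventually surfaces.

    def blind_man(data, labels):
        """A deliberately crude learner: it latches onto the single input
        dimension whose class-conditional means differ most, i.e. a gross
        statistical feature of the data."""
        def mean(cls, d):
            vals = [x[d] for x, y in zip(data, labels) if y == cls]
            return sum(vals) / len(vals)
        d = max(range(len(data[0])), key=lambda d: abs(mean(1, d) - mean(0, d)))
        thr = (mean(0, d) + mean(1, d)) / 2.0
        hi = 1 if mean(1, d) > mean(0, d) else 0
        return lambda x: hi if x[d] > thr else 1 - hi

    def redescribe(data, hypothesis):
        """The blind man's verdict is not used for classification; it is
        appended to the input as a new dimension for the next layer."""
        return [x + [float(hypothesis(x))] for x in data]

    # Iterated redescription builds a stack of data layers.
    data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
    labels = [0, 1, 1, 0]       # XOR: invisible to any single gross feature
    for _ in range(3):
        data = redescribe(data, blind_man(data, labels))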

The paper describes the contrastive redescription method in full and gives experimental details relating to its performance on a range of problems.