
Computing Point-of-View: Modeling and Simulating Judgments of Taste



3.2 Culture mining


Culture mining is a technique for acquiring the connectedness of a cultural topology by text-processing a cultural corpus. The technique was developed to support person modeling in the realms of cultural taste and taste for food. It depends on, and is compatible with, work in text mining and information-theoretic measures of similarity. What it adds to the literature is an end-to-end macroscopic account of mining—from corpus selection, to normalization, to statistical analysis, to post-processing—resulting in a cultural topology as a semantic fabric. The rest of this section 1) reviews related work in text mining and situates our approach; and 2) narrates an “algorithm” for culture mining.
§
Related work. I implicate three thematics in related work—musical artist similarity, social network analysis, and text mining.

Musical artist similarity. Whitman and Lawrence (2002) inferred a similarity-matrix model of musical artists using both a text mining approach and a peer-to-peer similarity approach. For the text mining approach, musical artists were annotated with “community metadata”—adjectives and noun phrases scraped by typing the artist’s name into a web search engine—and artist similarity was assessed as the degree to which their community metadata overlapped. For the peer-to-peer approach, they harvested users’ file folders on the peer-to-peer network OpenNap and learned artist-to-artist similarity using the TF-IDF metric (Salton & McGill 1983), treating same-folder artists as co-occurrences. Related is Whitman and Smaragdis (2002), which reported that cultural signatures for music genres could be used, in conjunction with the auditory signal, to classify unknown artists by style similarity. Ellis et al. (2002) inferred a similarity matrix over 400 artists, evaluating the performance of several metrics including community metadata and variations on the Erdös measure (cf. six degrees of Kevin Bacon). Baumann and Hummel (2005) continued the measurement of artist similarity based on shared text in web search results, and essayed a characterization of this method as ‘cultural metadata’.

Social network analysis. Much research has examined the explicit structure of social networks and studied their topologies via graph theory. Newman (2001) mined scientific coauthorship networks and found that collaborations ‘funneled’ through gatekeeper scientists; in our taste fabric, identity hubs constitute a similar topological feature. Jensen and Neville (2002) mined structured metadata relations from the Internet Movie Database (imdb.com) and learned a Bayesian network model to represent and predict item distances probabilistically. They also modeled the relational semantics of social network relations implied between movie actors from the Internet Movie Database and the Hollywood Stock Exchange (www.hsx.com). Finin et al. (2005) examined how the FOAF (“friend-of-a-friend”) ontology applies Semantic Web concepts to enable efficient exchange of, and search over, social information, illustrating how social networks could develop with their semantics already explicit. Finally, one work that considers the semantic content entailments of social network users is McCallum, Corrada-Emmanuel, and Wang’s (2005) modeling of Author-Recipient-Topic correlations in a social network messaging system. Given the topic distributions of email conversations, their ART model could predict the role-relationships of authors and recipients. The work considered group clusters and dyadic relationship dynamics but did not consider cultural aggregates.

Text mining. Work in text mining has employed information-theoretic metrics of similarity (Lin 1998) to perform unsupervised feats such as characterizing the meanings of words by their surrounding contexts. Latent semantic analysis (LSA) (Landauer & Dumais 1997) is a popular method for comparing the meanings of texts using the singular value decomposition (SVD). Pointwise mutual information (PMI) (Church & Hanks 1990) is the similarity metric currently used for culture mining. Turney (2001) adapted PMI into PMI-IR (similar to Bayes) to support the mining of synonyms based on PMI interpretations of search engine results. The machine learning step of ‘culture mining’ can be most directly compared with the induction of term-term similarity by assuming document closure (Brin et al. 1997); that work used a metric called interest—a measure of variable dependence, i.e., PMI without the logarithm.
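To make the two metrics concrete, PMI and interest can both be estimated from document-level co-occurrence counts. The sketch below is illustrative only—the function names and count-based estimation are assumptions, not the original system’s implementation:

```python
import math

def pmi(count_x, count_y, count_xy, n_docs):
    """Pointwise mutual information between descriptors x and y,
    estimated from document-level occurrence and co-occurrence counts."""
    p_x = count_x / n_docs    # P(x): fraction of documents mentioning x
    p_y = count_y / n_docs    # P(y): fraction of documents mentioning y
    p_xy = count_xy / n_docs  # P(x, y): fraction mentioning both
    return math.log2(p_xy / (p_x * p_y))

def interest(count_x, count_y, count_xy, n_docs):
    """The 'interest' metric (Brin et al. 1997): PMI without the logarithm."""
    return (count_xy / n_docs) / ((count_x / n_docs) * (count_y / n_docs))
```

Under this formulation, descriptors that co-occur exactly as often as chance predicts score 0 under PMI (interest of 1), and positively associated descriptors score above 0.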

Culture mining. The technique introduced here prescribes a method for modeling cultural topology from a cultural corpus of semi-structured texts—in this case, 100,000 social network profiles. The premise that a large-scale social network community can support taste modeling assumes an emergent semantics is possible in this domain (Staab et al. 2002; Aberer et al. 2004). Our systematic finding of pairwise similarity between a large number of items in a space can be compared with (Brin et al. 1997; Whitman & Lawrence 2002; Ellis et al. 2002). Our method advocates recall-maximizing heuristic normalization of text fragments into ontological descriptors and surrounding metadata. The addition of metadata annotations to descriptors can be compared with Whitman and Lawrence’s (2002) community metadata annotation approach. This work’s item-item rather than user-user approach to modeling social networks distinguishes it from user-centric analyses (Newman 2001; Jensen & Neville 2002; McCallum, Corrada-Emmanuel, & Wang 2005). The result of culture mining is a semantic fabric that can be used for recommendation without further preserving original user vectors or item vectors, making the technique an alternative to user-user collaborative filtering (Shardanand & Maes 1995) and item-item collaborative filtering (Sarwar et al. 2001).

§

Algorithm. An “algorithm” for end-to-end modeling of cultural topology from cultural corpora is presented. Unlike a real algorithm, the one presented here prescribes not only the machine learning metric but also a walk through the various decision spaces encountered in the modeling process. The algorithm generalizes from experience in mining taste fabric from social network profiles. The steps are—1) considering the fitness of a cultural corpus for mining; 2) assembling knowledge resources and disambiguation heuristics for maximizing extraction of descriptors from the cultural corpus; and 3) machine learning and post-processing.


Cultural corpus selection. To judge the fitness of a cultural corpus, we need to define culture and decide which of its aspects are most important to capture. In semiotics, Barthes (1964) proposed culture to be the set of symbols and interpretive strategies salient to the unconscious of a population, plus the system of connections that organize these symbols, and the valence of each symbol—its degree of privilege. Similarly, Geertz (1973) described cultures as ‘webs of significance’ spun by people. What sort of textual corpus can fairly capture the symbolism, connotations, and values of a culture? Two promising candidates can be considered—1) a very large sampling of the everyday texts of participants in a culture; or 2) a few prototypal texts that epitomize the breadth and depth of the culture. The latter kind of corpus is authoritative—it assumes that a culture’s symbols, connotations, and values can be idealized, and that there are essential texts that also fairly represent the culture. Authoritative corpora have the advantage of being clean and having a high signal-to-noise ratio; disadvantages are that it can be hard to find enough of this kind of text for mining, and that the intuitions leading to selections of these texts may encapsulate a bias. In contrast, the former kind of corpus is populist—it does not presuppose that the culture has been essentialized in textual form. Such a corpus is appropriate for modeling motivated by ethnography because it is more likely to yield new insight; disadvantages are that the sample embodied by the corpus may represent a skewed distribution, and it may be harder to assess the systematic biases of a large sampling than the bias of a few authoritative texts. Other considerations for corpus gathering include noting the time-span of texts collected into the corpus, and noting how amenable the text is to normalization via semantic recognition.
Heuristic normalization. A large and multi-layered lexicon of cultural symbols needs to be assembled—preferably one that exhaustively enumerates all the symbols that are possible and mentioned in the cultural corpus. For example, a hierarchy of 22,000 interest and identity descriptors was assembled from numerous folksonomies such as Wikipedia, DMOZ, Allmusic, and Allrecipes, to normalize social network profiles. Identity descriptors were distinguished from interest descriptors because they are possible semantic mediators, which belong at a higher level of abstraction; this harks back to Barthes’s idea that cultural symbols form heterogeneous rather than homogeneous systems. Using these 22,000 descriptors and accounting for error associated with natural language processing, 68% of segmented profile tokens were recognized into this ontology, with 8% false positives—which we suggest is a high number considering that a person’s enumerated interests are not constrained by any vocabulary in the profile elicitation form. To boost machine-learning performance, recognition rates should be maximized heuristically. One heuristic is to make pragmatic assumptions about ambiguous terms in the corpus—for example, popularity data associated with the hierarchy of 22,000 descriptors was used to disambiguate abridgements into their most likely referent, e.g., “bach” → “J.S. Bach.” Another heuristic is to gather initial statistics on frequencies of words and phrases in the raw corpus to guide ontology gathering—for example, a percentage of the 1,000 assembled identity descriptors were templates of the form “____ lovers,” filled with what corpus statistics said were the most prevalent interest descriptors.
This allowed the handling of identities premised on prominent interests, e.g., “Star Trek lovers,” “chocolate lovers.” A third heuristic to improve the coverage of machine learning is to add metadata annotations to recognized descriptors, plus some uncertainty discount. So instead of submitting only ‘The Unbearable Lightness of Being’ to machine learning, its uncertain metadata (‘Milan Kundera’, 0.5), (‘fiction’, 0.25), (‘philosophy’, 0.25) should also be submitted; thus, robustness is improved and more general regularities can be uncovered.
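The normalization heuristics above can be sketched as follows. This is a simplified illustration, not the actual system: the ontology fragment, popularity scores, and function names are hypothetical stand-ins for the real 22,000-descriptor hierarchy and its popularity data.

```python
# Hypothetical fragment of the descriptor ontology, with popularity scores
# (for abridgement disambiguation) and discounted metadata annotations.
ONTOLOGY = {
    "j.s. bach": {"popularity": 0.9, "metadata": []},
    "bach (band)": {"popularity": 0.1, "metadata": []},
    "the unbearable lightness of being": {
        "popularity": 0.5,
        "metadata": [("milan kundera", 0.5), ("fiction", 0.25), ("philosophy", 0.25)],
    },
}

# Ambiguous abridgements mapped to their candidate referents.
ABRIDGEMENTS = {"bach": ["j.s. bach", "bach (band)"]}

def normalize(token):
    """Map a raw profile token to a list of (descriptor, weight) pairs."""
    token = token.strip().lower()
    # Heuristic 1: disambiguate an abridgement to its most popular referent.
    if token in ABRIDGEMENTS:
        token = max(ABRIDGEMENTS[token], key=lambda d: ONTOLOGY[d]["popularity"])
    if token not in ONTOLOGY:
        return []  # unrecognized token; falls outside the ontology's coverage
    # Heuristic 3: expand the descriptor with its discounted metadata.
    expanded = [(token, 1.0)]
    expanded.extend(ONTOLOGY[token]["metadata"])
    return expanded
```

For instance, under this sketch, normalizing “Bach” resolves to the J.S. Bach descriptor at full weight, while normalizing the Kundera novel also emits its author and genre annotations at their uncertainty discounts.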
Machine learning. Similarities between descriptors can be calculated using a variety of metrics such as TF-IDF (Salton & McGill 1983), pointwise mutual information (PMI) (Church & Hanks 1990), Bayes, or any number of other metrics and variations (Lin 1998). One decision is—will similarity be learned between two descriptors, or between multiple descriptors? PMI is a symmetric measure of similarity, and we assumed that descriptors were symmetrically similar; however, Tversky (1977) pointed out cases in which similarity is asymmetric (e.g., when two descriptors are not peers). Finally, in order to learn the connectedness between descriptors, examples of connectedness are needed, and this requires the identification of set identities in the corpus. In the case of social network profiles, we assumed that a person’s whole profile had a set identity, namely, its descriptors were bound by aesthetic closure. This assumption was consistent with cultural theories suggesting personal taste and style coherence (Bourdieu 1984; Solomon & Assael 1987; McCracken 1988). Alternative assumptions would have been closure within interest categories, or closure within tightly knit groups of people in the social network. Once the set identity was chosen, all pairwise combinations resulting from that identity were collected as examples for machine learning via the PMI statistic. After an n-by-n similarity matrix is learned, a post-processing step should threshold away links whose affinities fall below the minimum score that could be considered significant. What is left is a semantic fabric that captures the cultural topology. To produce visualizable subsets of this dense network, further thresholding needs to be performed; the remaining links may be used to produce self-organizing maps and spring-loaded graphs, e.g., Figures 1-3.
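Under the aesthetic-closure assumption, the learning step reduces to counting pairwise co-occurrences within each profile, scoring each pair with PMI, and thresholding away weak links. The following is a minimal sketch under those assumptions—the profile data, function name, and threshold value are illustrative, not the original system:

```python
import math
from collections import Counter
from itertools import combinations

def learn_fabric(profiles, threshold=0.0):
    """Learn a thresholded descriptor-descriptor affinity network via PMI,
    treating each profile as a closed set of co-occurring descriptors."""
    n = len(profiles)
    doc_freq = Counter()   # how many profiles mention each descriptor
    pair_freq = Counter()  # how many profiles mention each descriptor pair
    for descriptors in profiles:
        items = sorted(set(descriptors))
        doc_freq.update(items)
        pair_freq.update(combinations(items, 2))  # aesthetic closure: all pairs
    fabric = {}
    for (x, y), c_xy in pair_freq.items():
        # PMI = log2( P(x,y) / (P(x) P(y)) ), with probabilities as c / n
        score = math.log2((c_xy * n) / (doc_freq[x] * doc_freq[y]))
        if score >= threshold:  # keep only links above the significance floor
            fabric[(x, y)] = score
    return fabric

profiles = [
    {"indie rock", "thrift stores", "vegetarianism"},
    {"indie rock", "thrift stores"},
    {"nascar", "country music"},
    {"nascar", "country music", "bbq"},
]
fabric = learn_fabric(profiles)
```

On this toy corpus, descriptors that repeatedly co-occur within the same profiles (e.g., “indie rock” and “thrift stores”) receive positive PMI and survive the threshold, while descriptors never appearing in the same profile yield no link at all.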
To summarize this section, an approach to ‘culture mining’ was compared to related work in musical similarity, social network analysis, and text mining. Then, an “algorithm” for ‘culture mining’ was narrated—the decision space of assembling a cultural corpus was described, a strategy for robust normalization was given, and machine learning and its assumptions were discussed. In the next two sections, two technologies that support person modeling are discussed.


