People have names. People have nicknames. People get called names and sometimes names get called people even when they’re not, because some natural language processing tool or statistical model has hiccoughed. Some people’s names might be names that are the same as other people’s names. Some names have no common spelling and some names become overly common when names are transliterated into Latin characters.
This many-to-many set of problems, focused around named-entity recognition and name resolution is one of the things I’ve been working on recently. The first problem, that of recognition, is one that I’ve mostly been leaving to others, which leaves me with the problems of splitting and merging. I’ll focus on the former for today.
The aim of splitting is to find all the appearances of a given name – let’s take the exact phrase “Michael Jackson” as an example – and to work out how many different people these appearances actually refer to. A quick glance at Wikipedia suggests that there are over 30 “Michael Jackson”s that we may encounter in an average document, but from my experience it’s likely to be constrained to about 7 distinct people in mainstream news. Remember, we’re only interested in “Michael Jackson” and so the slew of “Mike Jackson”s, always referred to in that manner, are not of interest.
Whilst there is always a good amount of existing work in this area, one of the advantages at my disposal is that I have a simply staggering amount of news documents available for my research. With a little magic, I can see all of the organisations, people, places and more that appear alongside any reference to “Michael Jackson” in a document, and – sure enough – some patterns emerge.
There is, for example, a large set of documents with mentions of Jackson Five and Epic, but there is also a substantial set mentioning Department of Homeland Security and George Bush. Whilst not every document about the pop star will contain a reference to Epic (a record label), they will generally contain a reference to some other feature that has previously occurred with “Michael Jackson” and Epic, or some feature strongly related to Epic. Similar connections hold true for the American politician, the BBC executive and other unique persons.
There may be some small overlaps, due to the prevalence of small-world networks (like the below) in media-worthy systems. Perhaps a Homeland Security employee was once an intern at Epic. She is therefore what we refer to as a critical point, the node which joins together two distinct clusters in a network, but these blips fade into insignificance once a moderately-sized collection of evidence is available.
It is these clusters that can be used to split a name into multiple separate entities in an unsupervised fashion. If I know nothing about any Michael Jackson, I can still tell you that there are seven separate contexts in which “Michael Jackson” appears (one related to Epic, one related to George Bush and so on) and therefore split the term appropriately. Obviously, if I did have any knowledge available in advance, the job simply becomes easier.
The remaining hard part for splitting is simply working out the best way to identify the clusters. Try it by eye in the above, starting from the chap on his own — it’s harder than it seems. I’ve had answers as low as 4 and as high as 14, but I’ll let you decide what feels best, based on intuition.
Then I’ll ask you to solve for the general case. And patent your answer as my own.
Xx
Could you weight references by age, so the year in which either a name first appeared & disappeared, or the age the thing they are related to appeared. And use age as some sort of axis to map out a 3 dimensional graph, it’s quite hard to see distinct groups when they are on a 2d plane, maybe if they could be represented in 3d space it would be easier to differentiate them. It’s late and I’m not sure if I made any sense, but I’m going to bed.
Sure, that makes a lot of sense.
A Dirichet Process Mixture (DPM) is a good way of expressing an unknown and potentially infinite number of clusters – this isn’t so important when clustering people but can be useful with more generic requirements.
I used to work with probability theory, specifically on hidden time-sensitive Dirichlet distributions and I came across some research from 2005 which was more related to clustering, but is very relevant to your point.
It’s online at http://pages.cs.wisc.edu/~jerryzhu/pub/tdpmTR.pdf and I haven’t yet tried it out. It (like some views of modern physics) adds time as a separate dimension to an unknown number of spatial dimensions.
Currently, I treat the time histogram (or aggregates/features from it) as simply an attribute of an entity, and use that as a measure of similarity to members in a cluster rather than a contributing factor towards the clustering coefficient. A combination of similarity and clustering is then used to give final results.