People have names. People have nicknames. People get called names and sometimes names get called people even when they’re not, because some natural language processing tool or statistical model has hiccoughed. Some people’s names might be names that are the same as other people’s names. Some names have no common spelling and some names become overly common when names are transliterated into Latin characters.
This many-to-many set of problems, focused around named-entity recognition and name resolution is one of the things I’ve been working on recently. The first problem, that of recognition, is one that I’ve mostly been leaving to others, which leaves me with the problems of splitting and merging. I’ll focus on the former for today.
The aim of splitting is to find all the appearances of a given name – let’s take the exact phrase “Michael Jackson” as an example – and to work out how many different people these appearances actually refer to. A quick glance at Wikipedia suggests that there are over 30 “Michael Jackson”s that we may encounter in an average document, but from my experience it’s likely to be constrained to about 7 distinct people in mainstream news. Remember, we’re only interested in “Michael Jackson” and so the slew of “Mike Jackson”s, always referred to in that manner, are not of interest.
Whilst there is always a good amount of existing work in this area, one of the advantages at my disposal is that I have a simply staggering amount of news documents available for my research. With a little magic, I can see all of the organisations, people, places and more that appear alongside any reference to “Michael Jackson” in a document, and – sure enough – some patterns emerge.
There is, for example, a large set of documents with mentions of Jackson Five and Epic, but there is also a substantial set mentioning Department of Homeland Security and George Bush. Whilst not every document about the pop star will contain a reference to Epic (a record label), they will generally contain a reference to some other feature that has previously occurred with “Michael Jackson” and Epic, or some feature strongly related to Epic. Similar connections hold true for the American politician, the BBC executive and other unique persons.
There may be some small overlaps, due to the prevalence of small-world networks (like the below) in media-worthy systems. Perhaps a Homeland Security employee was once an intern at Epic. She is therefore what we refer to as a critical point, the node which joins together two distinct clusters in a network, but these blips fade into insignificance once a moderately-sized collection of evidence is available.
It is these clusters that can be used to split a name into multiple separate entities in an unsupervised fashion. If I know nothing about any Michael Jackson, I can still tell you that there are seven separate contexts in which “Michael Jackson” appears (one related to Epic, one related to George Bush and so on) and therefore split the term appropriately. Obviously, if I did have any knowledge available in advance, the job simply becomes easier.
The remaining hard part for splitting is simply working out the best way to identify the clusters. Try it by eye in the above, starting from the chap on his own — it’s harder than it seems. I’ve had answers as low as 4 and as high as 14, but I’ll let you decide what feels best, based on intuition.
Then I’ll ask you to solve for the general case. And patent your answer as my own.