Sunday, April 20, 2008

Mocking Intelligence

I had to write this stuff down. It is taken straight from a presentation by Peter Norvig, Director of Research at Google. The talk covered training on data sets, machine learning algorithms, clustering of data, and finding potential patterns and outliers.

One nice problem he explained was string segmentation. I won't try to create my own version and will instead cite some of the examples mentioned during the presentation. Text written in languages such as Chinese often has no spaces between words. A human reader can easily recognize the pattern and work out from the context what the text is trying to convey.

But think about the computer brain. It goes nuts interpreting these combinations. Some cases are easy, and it is obvious what the following line is trying to say:

livelisteningparty :: live listening party

But a combination like smallandinsignificant becomes 'small and in significant' when it should have been 'small and insignificant'. Hence we conclude that semantic training on the data set is required. Even then, a word that fits the context may not rank high enough in the training set to make an impact. One of the training sets he mentioned was 1.7B in size, yet it would still fail to recognize an uncommon dictionary word and would break it into highly ranked separate pieces that mean nothing together.
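To make the failure concrete, here is a minimal sketch of a segmenter that scores each split by word frequency alone. This is my own toy version, not code from the presentation: the counts in COUNTS are invented for illustration, and the unknown-word penalty in log_prob is an assumption.

```python
import math
from functools import lru_cache

# Toy unigram counts, invented for illustration (NOT from the presentation).
# "insignificant" is deliberately rare, like an uncommon dictionary word.
COUNTS = {
    "live": 500, "listening": 200, "party": 300,
    "small": 400, "and": 5000, "in": 8000,
    "significant": 300, "insignificant": 1,
}
TOTAL = sum(COUNTS.values())


def log_prob(word):
    """Log-probability of one word under the toy unigram model."""
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    # Assumed penalty: unknown strings get less likely the longer they are.
    return math.log(1 / (TOTAL * 10 ** len(word)))


@lru_cache(maxsize=None)
def segment(text):
    """Return (score, words) for the highest-scoring split of text."""
    if not text:
        return (0.0, ())
    candidates = []
    for i in range(1, len(text) + 1):
        first, rest = text[:i], text[i:]
        rest_score, rest_words = segment(rest)
        candidates.append((log_prob(first) + rest_score, (first,) + rest_words))
    return max(candidates)


for s in ("livelisteningparty", "smallandinsignificant"):
    print(s, "::", " ".join(segment(s)[1]))
```

With these made-up counts, two very common words ('in' and 'significant') together outscore the single rare word 'insignificant', so the frequency-only model picks the wrong split. That is the ranking problem described above, in miniature.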

OK, now the fun part. The next examples show the particular nuisance this parsing creates. Each one is a website hosted somewhere on the web. See what the computer makes of it while tagging it.

www.whorepresents.com : who represents, provides contact info for celebrities etc. :: whore presents (Now imagine what the similar searches would lead to)

www.therapistfinder.com : Finds you a Therapist in California :: the rapist finder (Gosh! The Dept. of Investigation would buy this one out!!)

Now, this one we all use for something or the other (Cached results remember :)

www.experts-exchange.com : Provides answers to your queries :: expert sexchange (Yeah, I know you would say that the delimiter should not be ignored, but then do you know the very reason for having it! ... Yeah :)

www.penisland.net : pen island provides Custom made pens on internet :: (Aha! Pop Quiz Time .... Left for you)
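The ambiguity in these names can be sketched by listing every way a string splits into known words. This is only an illustration: the function name all_segmentations and the tiny VOCAB set are my own assumptions, chosen by hand to cover the examples above, and a real tagger would use a far larger word list plus some ranking.

```python
def all_segmentations(text, vocabulary):
    """Return every split of text into words found in vocabulary."""
    if not text:
        return [[]]  # one way to segment nothing: no words at all
    results = []
    for i in range(1, len(text) + 1):
        head = text[:i]
        if head in vocabulary:
            for tail in all_segmentations(text[i:], vocabulary):
                results.append([head] + tail)
    return results


# Hypothetical hand-picked vocabulary, just enough for the examples above.
VOCAB = {
    "who", "whore", "represents", "presents",
    "the", "therapist", "rapist", "finder",
    "expert", "experts", "exchange", "sex", "change",
}

for name in ("whorepresents", "therapistfinder", "expertsexchange"):
    readings = [" ".join(words) for words in all_segmentations(name, VOCAB)]
    print(name, "::", " | ".join(readings))
```

Each name comes back with two valid readings. A dictionary alone cannot tell the computer which one the site owner meant, so something has to rank the readings by context.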

So we still need things out here to evolve. Computers need to socialize more, I guess, and learn what fits in where.


1 comment:

Harish said...

lets have a quiz it would be interesting :)
