Question Detector for Twitter & Instant Messages


I looked online for a ready-made question detector but couldn’t find one, so I decided to code my own and post it online.

This question detector (in Python) can work with any sentence, but it’s designed specifically for Twitter or instant messages (IM). It mainly relies on the following:

  • An assumption: any sentence containing a question mark is considered a question.
  • A Twitter word tokenizer from NLTK, used to make sure every sentence classified as a question contains at least one of these keywords: “what, why, how, when, where, did, do, does, have, has, am, is, are, can, could, may, would, will, ?, etc.”. It’s highly unlikely that a question does not have one of these tokens.
  • A Naive Bayes classifier trained on the NLTK corpus ‘nps_chat’, which on its own reaches an accuracy of 67% under cross-validation.

I have tested this detector on a small data set, getting an accuracy of 93%.

import nltk.corpus
from nltk.corpus import nps_chat
from nltk.tokenize import TweetTokenizer

class QuestionDetector():

    #Class Initializer:
    #- Creates naive bayes classifier using nltk nps_chat corpus.
    #- Initializes Tweet tokenizer
    #- Initializes question words set to be used
    def __init__(self):
        posts = nltk.corpus.nps_chat.xml_posts()
        featuresets = [(self.__dialogue_act_features(post.text), post.get('class')) for post in posts]
        size = int(len(featuresets) * 0.1)
        train_set, test_set = featuresets[size:], featuresets[:size]
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)
        Question_Words = ['what', 'where', 'when', 'how', 'why', 'did', 'do', 'does', 'have', 'has', 'am', 'is', 'are', 'can', 'could', 'may', 'would', 'will', 'should',
                          "didn't", "doesn't", "haven't", "isn't", "aren't", "can't", "couldn't", "wouldn't", "won't", "shouldn't", '?']
        self.Question_Words_Set = set(Question_Words)
        self.tknzr = TweetTokenizer()
    #Private method, builds the word-presence feature dict for a sentence
    def __dialogue_act_features(self,sentence):
         features = {}
         for word in nltk.word_tokenize(sentence):
             features['contains({})'.format(word.lower())] = True
         return features
    #Public method, returns True if the sentence is predicted to be a question, returns False otherwise
    def IsQuestion(self,sentence):
        if "?" in sentence:
            return True
        tokens = self.tknzr.tokenize(sentence.lower())
        if not self.Question_Words_Set.intersection(tokens):
            return False
        predicted = self.classifier.classify(self.__dialogue_act_features(sentence))
        if predicted == 'whQuestion' or predicted == 'ynQuestion':
            return True
        
        return False
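
The keyword gate and an accuracy measurement can also be tried in isolation, without NLTK. The sketch below is not the detector above: `simple_tokens` is a hypothetical regex stand-in for the tweet tokenizer, `QUESTION_WORDS` is a shortened copy of the keyword list, and the labeled sentences in the usage note are made up for illustration.

```python
import re

# Assumption: a shortened copy of the post's keyword list, for illustration only.
QUESTION_WORDS = {
    "what", "where", "when", "how", "why", "did", "do", "does", "have",
    "has", "am", "is", "are", "can", "could", "may", "would", "will",
    "should", "?",
}


def simple_tokens(sentence):
    """Lowercase tokens: runs of letters/apostrophes, or a lone '?'.

    Hypothetical stand-in for NLTK's tweet tokenizer.
    """
    return re.findall(r"[a-z']+|\?", sentence.lower())


def passes_keyword_gate(sentence):
    """Return True if the sentence contains at least one question keyword."""
    # An empty set is falsy but does not compare equal to False,
    # so the emptiness check is done with bool() rather than "== False".
    return bool(QUESTION_WORDS.intersection(simple_tokens(sentence)))


def accuracy(predict, labeled):
    """Fraction of (sentence, is_question) pairs that predict() gets right."""
    correct = sum(1 for sentence, label in labeled if predict(sentence) == label)
    return correct / len(labeled)


if __name__ == "__main__":
    # Made-up labeled sentences, not the data set used in the post.
    sample = [
        ("how are you", True),
        ("see you later", False),
        ("it is late", False),
        ("can you help", True),
    ]
    print(accuracy(passes_keyword_gate, sample))
```

On this tiny sample the gate alone mislabels “it is late” because it contains “is”, which is why the gate only rules sentences out and the classifier makes the final call.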

Question vs Query

Note that a question is simply an interrogative sentence and does not necessarily imply a query. According to this paper from Google Research [1], there are six different types of interrogative sentences:

  1. Advertisement. These tweets ask the reader a question and then deliver an advertisement. E.g., ‘Incorporating your business this year? Call us today for a free consultation with one of our attorneys. 855- 529-8753. http://buz.tw/FjJCV’
  2. Article or News Title on the Web. These tweets post article names or news titles together with the links to the webpage. E.g., ‘New post: Pregnancy Miracle – A Miracle or a Scam? http://articlescontentonline.com/pregnancy-miracle-amiracle-or-a-scam’
  3. Question with Answer. These tweets contain questions followed by their answers. E.g., ‘ I even tried staying away from my using my Internet for a couple hours. The result? Insanity’
  4. Question as Quotation. These tweets contain questions in quoted sentences as references to what other people said. E.g., ‘I think Brian’s been drinking in there because I’m hearing him complain about girls, and then he goes “Wright, are you sure you’re not gay?”’
  5. Rhetorical Question. These tweets include rhetorical questions, which look like questions but are asked without expecting any answer. In other words, these tweets encourage readers to think about the obvious answers. E.g., ‘You ruined my life and I’m supposed to like you’
  6. Qweet (query). These tweets ask for some information or help. E.g., ‘What’s your favorite Harry Potter scene?’
    The tweet author posts a question asked by someone on the web, e.g., on CQA portals, forums, etc. The following is an example: ‘Questions about panda update. When will the effect end? http://goo.gl/fb/iiRjn’
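
To make the question/query distinction concrete, here is a minimal sketch that encodes the six types as an enum and keeps only the queries. The names `InterrogativeType`, `is_query` and `keep_queries` are hypothetical, and the sketch assumes each tweet already carries a type label (from manual annotation, say); it does not try to predict the type.

```python
from enum import Enum


class InterrogativeType(Enum):
    """The six types of interrogative tweets listed above."""
    ADVERTISEMENT = 1
    ARTICLE_OR_NEWS_TITLE = 2
    QUESTION_WITH_ANSWER = 3
    QUESTION_AS_QUOTATION = 4
    RHETORICAL_QUESTION = 5
    QWEET = 6


def is_query(kind):
    """Only qweets ask for information or help; the other five types do not."""
    return kind is InterrogativeType.QWEET


def keep_queries(labeled_tweets):
    """Return the text of every (text, InterrogativeType) pair labeled as a qweet."""
    return [text for text, kind in labeled_tweets if is_query(kind)]


if __name__ == "__main__":
    # Labels assigned by hand for illustration; the texts come from the list above.
    tweets = [
        ("What's your favorite Harry Potter scene?", InterrogativeType.QWEET),
        ("The result? Insanity", InterrogativeType.QUESTION_WITH_ANSWER),
    ]
    print(keep_queries(tweets))
```

Both example tweets would be flagged by a question detector, but only the first one is a query.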

References

[1] Baichuan Li, Xiance Si, Michael R. Lyu, Irwin King, and Edward Y. Chang. Question Identification on Twitter. 2011.


AI-Related Careers


Introduction

Welcome back! Today I talk about the job titles related to AI. To gather this information I ran an informal survey: I looked through the LinkedIn profiles of many people working in the AI industry, noted their job titles, then googled each title to learn more about it. Note that these job titles may overlap; for example, a Game AI Programmer could also be a C++ Programmer.

So Let’s Choose…

1-Search Engineer

You engineer search engines (such as Google, Yahoo, Bing and many others). The work can involve a good deal of AI, such as NLP (Natural Language Processing) and a lot more. It’s no coincidence that the director of research at Google, Peter Norvig, is an AI scientist.

More About Search Engineers here in the official Google Blog

2-Game AI Programmer / Gameplay Programmer

Most people working in the AI field fall under this category. To enter its world, visit AiGameDev.com

3-Freelance Web Intelligence Developer

This is one attractive title. Freelance means you don’t work for anybody except yourself. You receive an AI project request and decide whether to accept it or not. I think this takes a real professional when it comes to AI software. Here is an example.

4-Robotics Systems Engineer

I think this is clear enough. You program the robots used in manufacturing, the military, entertainment and other fields. You don’t necessarily need to be a mechanics or electronics expert.

5-NLP Architect/Scientist

Work related to Natural Language Processing (NLP). This is an extremely wide area serving search engines, intelligent marketing, Strong AI, etc.

6-Computational Linguist

Computational Linguistics serves NLP.

7-Image Processing & Computer Vision Engineer

Your main task here is to make the computer understand the content of pictures and videos.

8-Visual Computing Expert

Visual Computing is the mother science of Graphics.  So you can work on projects with intelligent graphics behavior.

9-Speech Architect/Engineer

You work on Automated Speech Recognition and Understanding, i.e., making the computer understand people’s speech. There are hundreds of languages around the world, so I think there’s a lot of work left here.

10-Knowledge Engineer

Knowledge Engineering is an engineering discipline that involves integrating knowledge into computer systems in order to solve complex problems normally requiring a high level of human expertise. You will maintain and develop knowledge-based systems.

11-Trading Systems Researcher

You bring AI into trading. You use machine learning and statistical methods to make the computer help humans make better trading decisions.

12-Statistical Analyst, Data Miner, Data Warehouse Architect, Database Analyst

You apply data mining techniques to databases in order to make better decisions about the future of the business.

13-Data Conversion Engineer

Data conversion is the conversion of computer data from one format to another. When data is compressed, some of it may be lost; as a result, artificial intelligence techniques are used to predict the lost data.

14-Expert Systems Developer

An Expert System is software that attempts to provide an answer to a problem, or clarify uncertainties where normally one or more human experts would need to be consulted. Your job is to develop it.

15-System Test Engineer

Your job is to test software, and that software could be AI-based. It’s well known that testing AI-based software is often hard because of its stochastic behavior.

16-Lisp/Prolog Programmer

These are the 2 favored languages for AI. According to my information, Lisp is widely used in the USA, while Prolog is widely used in Europe. The rest of the world follows one of them.

17-C++/Java Programmer

A lot of AI-based software is programmed in C++/Java, mainly because of their efficiency. For example, most Game AI developers use C++/Java because they are efficient and games need a lot of computer resources.

18-Artificial Intelligence Developer/Programmer

This is a very general AI job name, but many people use it as their occupation.

19-Research Assistant/Professor/Lecturer/Scientist/Researcher

AI is a science that still needs a lot more research, and I have observed that a big portion of AI-related jobs are research related.

20-AI Consultant

You provide consultancy concerning AI for companies.