Question Detector for Twitter & Instant Messages


I looked online for a ready-made question detector but couldn’t find one, so I decided to code my own and post it online.

This question detector (in Python) works with any sentence, but it’s designed specifically for Twitter and instant messages (IM). It relies mainly on the following:

  • Any sentence containing a question mark is assumed to be a question.
  • A Twitter-aware word tokenizer from NLTK is used to make sure that every sentence classified as a question contains at least one key word such as “what, why, how, when, where, did, do, does, have, has, am, is, are, can, could, may, would, will, ?”, etc. It’s highly unlikely that a question has none of these tokens.
  • A naive Bayes classifier is trained on the NLTK corpus ‘nps_chat’. On its own, this classifier reaches an accuracy of 67% under cross-validation.
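
To make the order of these three checks concrete, here is a minimal sketch of the decision cascade in plain Python. It is not the detector itself: `simple_tokens` is a crude regex stand-in for NLTK's tweet tokenizer, and `starts_with_question_word` is a hypothetical placeholder for the trained naive Bayes classifier. Both names are assumptions made for illustration.

```python
import re

# Key words taken from the list above.
QUESTION_WORDS = {"what", "why", "how", "when", "where", "did", "do", "does",
                  "have", "has", "am", "is", "are", "can", "could", "may",
                  "would", "will"}

def simple_tokens(sentence):
    # Assumption: a crude regex stand-in for NLTK's TweetTokenizer.
    return re.findall(r"[a-z']+|\?", sentence.lower())

def is_question(sentence, classify):
    # Check 1: a question mark settles it.
    if "?" in sentence:
        return True
    # Check 2: with no question word present, reject without asking the classifier.
    if not QUESTION_WORDS.intersection(simple_tokens(sentence)):
        return False
    # Check 3: otherwise defer to the classifier callable.
    return classify(sentence)

def starts_with_question_word(sentence):
    # Hypothetical placeholder for the trained classifier.
    tokens = simple_tokens(sentence)
    return bool(tokens) and tokens[0] in QUESTION_WORDS

for text in ["you there?", "nice weather today", "what time does it start", "i do like it"]:
    print(text, "->", is_question(text, starts_with_question_word))
```

The keyword check only filters sentences out. A sentence such as "i do like it" passes it and is still rejected by the classifier step.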

I have tested this detector on a small data set, getting an accuracy of 93%.
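
A measurement like this can be sketched as follows. The labeled sentences and the `question_mark_only` baseline are made up for illustration and are not the data set I used. To score the real detector, you would pass its `IsQuestion` method in place of the baseline.

```python
def accuracy(detector, labeled_sentences):
    """Fraction of sentences for which detector(sentence) matches the label."""
    if not labeled_sentences:
        raise ValueError("need at least one labeled sentence")
    correct = sum(1 for sentence, label in labeled_sentences
                  if detector(sentence) == label)
    return correct / len(labeled_sentences)

# Hypothetical hand-labeled examples (True means "is a question").
SAMPLE = [
    ("where are you", True),
    ("see you tomorrow", False),
    ("is it raining?", True),
    ("lol ok", False),
]

def question_mark_only(sentence):
    # Hypothetical baseline: only the question-mark rule, nothing else.
    return "?" in sentence

print(accuracy(question_mark_only, SAMPLE))  # 0.75: it misses "where are you"
```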

import nltk
from nltk.corpus import nps_chat
from nltk.tokenize import TweetTokenizer

# Requires the NLTK data packages 'nps_chat' and 'punkt'
# (newer NLTK releases may ask for 'punkt_tab' instead).

class QuestionDetector:

    # Class initializer:
    # - Trains a naive Bayes classifier on the NLTK nps_chat corpus.
    # - Initializes the tweet tokenizer.
    # - Initializes the set of question words.
    def __init__(self):
        posts = nps_chat.xml_posts()
        featuresets = [(self.__dialogue_act_features(post.text), post.get('class')) for post in posts]
        # The first 10% of the posts is held out; only the remaining 90% is used for training.
        size = int(len(featuresets) * 0.1)
        train_set, test_set = featuresets[size:], featuresets[:size]
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)
        # Note the comma after 'should': without it, Python joins it with "didn't" into one string.
        Question_Words = ['what', 'where', 'when', 'how', 'why', 'did', 'do', 'does', 'have', 'has', 'am', 'is', 'are', 'can', 'could', 'may', 'would', 'will', 'should',
                          "didn't", "doesn't", "haven't", "isn't", "aren't", "can't", "couldn't", "wouldn't", "won't", "shouldn't", '?']
        self.Question_Words_Set = set(Question_Words)
        self.tknzr = TweetTokenizer()

    # Private method: builds the bag-of-words feature dictionary for a sentence.
    def __dialogue_act_features(self, sentence):
        features = {}
        for word in nltk.word_tokenize(sentence):
            features['contains({})'.format(word.lower())] = True
        return features

    # Public method: returns True if the sentence is predicted to be a question, False otherwise.
    def IsQuestion(self, sentence):
        if "?" in sentence:
            return True
        tokens = self.tknzr.tokenize(sentence.lower())
        # An empty intersection means no question word is present.
        # (Comparing a set with "== False" is never true, so test for emptiness instead.)
        if not self.Question_Words_Set.intersection(tokens):
            return False
        predicted = self.classifier.classify(self.__dialogue_act_features(sentence))
        if predicted == 'whQuestion' or predicted == 'ynQuestion':
            return True

        return False
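
The class above holds out 10% of the posts but never scores the classifier on them, and the 67% figure comes from cross-validation, which the code doesn't show. Here is a rough sketch of how k-fold cross-validation could look, not the exact procedure I ran. The function name `k_fold_accuracy` and the toy majority-label model are assumptions for illustration. With NLTK you could pass `nltk.NaiveBayesClassifier.train` as `train` and `lambda model, features: model.classify(features)` as `classify`.

```python
from collections import Counter

def k_fold_accuracy(examples, k, train, classify):
    """Average held-out accuracy over k folds.

    examples: list of (features, label) pairs
    train:    callable taking a list of examples and returning a model
    classify: callable taking (model, features) and returning a label
    """
    if k < 2 or k > len(examples):
        raise ValueError("k must be between 2 and len(examples)")
    scores = []
    for fold in range(k):
        # Every k-th example, offset by the fold index, is held out for testing.
        test_part = examples[fold::k]
        train_part = [ex for i, ex in enumerate(examples) if i % k != fold]
        model = train(train_part)
        hits = sum(1 for features, label in test_part
                   if classify(model, features) == label)
        scores.append(hits / len(test_part))
    return sum(scores) / k

# Hypothetical toy model: always predict the most common training label.
def train_majority(examples):
    return Counter(label for _, label in examples).most_common(1)[0][0]

def classify_majority(model, features):
    return model

toy = [({}, "Statement")] * 4 + [({}, "ynQuestion"), ({}, "Statement")]
print(k_fold_accuracy(toy, 2, train_majority, classify_majority))
```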

Question vs Query

Note that a question is simply an interrogative sentence and does not necessarily imply a query. According to this paper from Google Research [1], below are the 6 different types of interrogative sentences:

  1. Advertisement. This kind of tweet asks the reader a question and then delivers an advertisement. E.g., ‘ Incorporating your business this year? Call us today for a free consultation with one of our attorneys. 855- 529-8753. http://buz.tw/FjJCV’
  2. Article or News Title on the Web. These tweets post article names or news titles together with the links to the webpage. E.g., ‘New post: Pregnancy Miracle – A Miracle or a Scam? http://articlescontentonline.com/pregnancy-miracle-amiracle-or-a-scam’
  3. Question with Answer. These tweets contain questions followed by their answers. E.g., ‘ I even tried staying away from my using my Internet for a couple hours. The result? Insanity’
  4. Question as Quotation. These tweets contain questions in quoted sentences as references to what other people said. E.g., ‘I think Brian’s been drinking in there because I’m hearing him complain about girls, and then he goes “Wright, are you sure you’re not gay?’
  5. Rhetorical Question. This kind of tweet includes a rhetorical question, which looks like a question but is asked without expecting an answer. In other words, these tweets encourage readers to think about the obvious answers. E.g., ‘ You ruined my life and I’m supposed to like you’
  6. Qweet (query). This kind of tweet asks for some information or help. E.g., ‘ What’s your favorite Harry Potter scene?’
    Tweet author posts a question asked by someone on the web, e.g., CQA portals, forums, etc. The following is an example: ‘Questions about panda update. When will the effect end? http://goo.gl/fb/iiRjn’

References

[1] Baichuan Li, Xiance Si, Michael R. Lyu, Irwin King, and Edward Y. Chang – Question Identification on Twitter – 2011
