Sentiment Analysis | Kazi Shahrukh Omar

Access this project

GitHub repo: https://github.com/komar41/twitter-sentiment-analysis
Tools used: Python, NumPy, Pandas, nltk, scikit-learn, matplotlib, seaborn, TensorFlow, Keras.

The primary goal of this project is to classify sentiments expressed in tweets regarding the 2012 US election into positive and negative classes. Sentiment analysis, also known as opinion mining, is the process of determining and categorizing the emotions or opinions conveyed in a piece of text.

Data Wrangling

First, we clean the tweets: lowercase the text, strip URLs, usernames, HTML tags, punctuation, and numbers, then remove stopwords and stem the remaining words.

import re
import nltk

def dataClean(tweets_raw):
    # Requires the NLTK stopword list: nltk.download("stopwords")
    stop_words = set(nltk.corpus.stopwords.words("english"))
    ps = nltk.stem.PorterStemmer()
    cleanTweets = []
    for tweet in tweets_raw:
        tweet = tweet.lower()
        tweet = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', tweet)  # remove URLs
        tweet = re.sub(r'(\s)@\w+', '', tweet)  # remove usernames preceded by whitespace
        tweet = re.sub(r'@\w+', '', tweet)  # remove any remaining usernames
        tweet = re.sub(r'<[^<]+?>', '', tweet)  # remove HTML-style tags
        tweet = re.sub(r'[^A-Za-z0-9 ]+', '', tweet)  # remove punctuation and other symbols
        tweet = re.sub(r' \d+', ' ', tweet)  # remove numbers that follow a space
        words = [w for w in tweet.split() if w not in stop_words]  # remove stopwords
        stemmedTweet = " ".join(ps.stem(word) for word in words)  # stem each remaining word
        cleanTweets.append(stemmedTweet)
    return cleanTweets

Exploratory Data Analysis

After cleaning the data, we moved on to analysis. We examined the data distribution, the number of words per tweet, and the unique words in both the Obama and Romney tweets. The following images show a few of the highlights.
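As a rough illustration of the statistics mentioned above, the sketch below computes class counts, mean words per tweet, and vocabulary size per candidate. It uses plain Python, and the sample rows, the label encoding (1 for positive, -1 for negative), and the summarize helper are all made up for this example; the notebooks may compute these differently.

```python
from collections import Counter

# Hypothetical cleaned tweets: (candidate, text, label), with 1 = positive
# and -1 = negative. These rows are invented and are not project data.
samples = [
    ("obama", "great speech tonight", 1),
    ("obama", "bad debat perform", -1),
    ("romney", "strong debat win", 1),
    ("romney", "weak plan bad idea", -1),
    ("romney", "great plan", 1),
]

def summarize(rows):
    """Return class counts, mean words per tweet, and vocabulary size."""
    class_counts = Counter(label for _, label in rows)
    lengths = [len(text.split()) for text, _ in rows]
    vocab = {word for text, _ in rows for word in text.split()}
    return {
        "class_counts": dict(class_counts),
        "mean_words": sum(lengths) / len(lengths),
        "unique_words": len(vocab),
    }

# Group the rows by candidate, then summarize each group.
by_candidate = {}
for candidate, text, label in samples:
    by_candidate.setdefault(candidate, []).append((text, label))

stats = {name: summarize(rows) for name, rows in by_candidate.items()}
for name, summary in stats.items():
    print(name, summary)
```

The same per-candidate dictionaries could feed matplotlib or seaborn bar charts of the kind the images show.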

Data Modeling and Results

After preprocessing and analysis, we trained eight machine learning algorithms on the cleaned data for sentiment classification. We evaluated them using both an 80-20 train-test split and 10-fold cross-validation, and compared precision, recall, F1-score, and accuracy. Overall, the support vector machine performed better than the other models. The image below shows model accuracy on both the Obama and Romney tweets.
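The evaluation setup described above can be sketched with scikit-learn as follows. This is a minimal sketch and not the project's actual pipeline: the TF-IDF features, the choice of LinearSVC as the support vector machine, the random seed, and the toy corpus are all assumptions made for illustration. The toy corpus repeats the same ten tweets, so its scores say nothing about real performance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus (invented, not project data): 1 = positive, -1 = negative.
pos = ["great speech", "strong plan", "love this candid", "good debat win", "best leader"]
neg = ["bad speech", "weak plan", "hate this candid", "poor debat loss", "worst leader"]
texts = (pos + neg) * 10
labels = ([1] * len(pos) + [-1] * len(neg)) * 10

# Assumed feature extraction and model: TF-IDF vectors fed to a linear SVM.
model = make_pipeline(TfidfVectorizer(), LinearSVC())

# 80-20 train-test split, stratified so both classes appear in each part.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
holdout_accuracy = accuracy_score(y_test, y_pred)

# Per-class precision, recall, and F1-score on the held-out 20%.
print(classification_report(y_test, y_pred))

# 10-fold cross-validation accuracy on the full corpus.
cv_scores = cross_val_score(model, texts, labels, cv=10, scoring="accuracy")
print("hold-out accuracy:", holdout_accuracy)
print("mean 10-fold accuracy:", cv_scores.mean())
```

Wrapping the vectorizer and classifier in one pipeline means the TF-IDF vocabulary is refit inside each cross-validation fold, which avoids leaking test-fold text into the features.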

For more information, see the notebooks inside the Sentiment Analysis folder.

Team

Kazi Shahrukh Omar
Soham Pradhan