Personality Traits Analysis among Social Media Influencers


The idea of this project is to build a sentiment analysis model that detects the emotions underlying a tweet. It learns associations between words and emotions, and the aim is to classify tweets into sentiments such as anger, happiness, sadness and enthusiasm, rather than the usual sentiment classification that involves only the two truly contrasting labels of Positive and Negative.

We have considered three different approaches to build this Social IQ analysis pipeline:

  • Universal Sentence Encoder
  • Doc2Vec
  • LSTM

The performance of each of these approaches is explained below.

The Data:

The data is in tabular form, and each row is divided into 4 columns: tweet_id, sentiment, author and content.

The tweets in the data were pulled from Twitter using the Twitter API as additional training data. Each tweet is labeled using its own hashtags (for example, “#happy”), since a hashtag should be an appreciably good, though far from perfect, representation of the sentiment of the tweet.

Original Dataframe of the Tabular data

Exploring the Data:

The dataset contains 40,000 tweets belonging to 13 sentiment labels in total, including Neutral, Worry, Happiness, Sadness, Love, Surprise, Fun, Relief, Hate, Enthusiasm, Boredom and Anger.

The initial spread of Data

As we can see from the above data visualization, the sentiment classes are heavily imbalanced, and several of them are in fact extremely similar. We therefore combined several of those classes into four final classes; the 4 sentiments considered for analysis are Happiness, Sadness, Neutral and Hate.

Spread of the combined Data
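This merging step can be sketched with pandas. The article does not spell out which fine-grained label went into which final class, so the `label_map` below is one plausible grouping, not the project's exact mapping:

```python
import pandas as pd

# Toy stand-in for the tweets table described above.
df = pd.DataFrame({
    "content": ["so excited!!", "ugh, awful week", "nothing much today", "i can't stand this"],
    "sentiment": ["enthusiasm", "worry", "neutral", "anger"],
})

# Hypothetical grouping of the fine-grained labels into the four final classes.
label_map = {
    "happiness": "happiness", "love": "happiness", "fun": "happiness",
    "enthusiasm": "happiness", "relief": "happiness", "surprise": "happiness",
    "sadness": "sadness", "worry": "sadness", "boredom": "sadness",
    "hate": "hate", "anger": "hate",
    "neutral": "neutral",
}
df["sentiment"] = df["sentiment"].map(label_map)
print(df["sentiment"].value_counts())
```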


Dropping the irrelevant columns (tweet_id, author)

The content of the tweets contains special characters such as @ and #, which are removed, and all words are converted to lowercase.

Pre-processed Data output
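The pre-processing described above can be sketched as follows. The column names follow the article; the cleaning rules (strip @mentions, drop the '#' marker, remove punctuation, lowercase) are an assumption about the details:

```python
import re
import pandas as pd

df = pd.DataFrame({
    "tweet_id": [1, 2],
    "sentiment": ["happiness", "hate"],
    "author": ["a", "b"],
    "content": ["@friend LOVED it! #blessed", "This is #awful..."],
})

# Drop the columns that carry no signal for sentiment.
df = df.drop(columns=["tweet_id", "author"])

def clean(text: str) -> str:
    text = re.sub(r"@\w+", "", text)        # remove @mentions entirely
    text = re.sub(r"#", "", text)           # keep the hashtag word, drop the '#'
    text = re.sub(r"[^a-zA-Z\s]", "", text) # strip remaining special characters
    return re.sub(r"\s+", " ", text).strip().lower()

df["content"] = df["content"].apply(clean)
print(df)
```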

Models Analysis:

Universal Sentence Encoder:



USE compares two bodies of text and returns a similarity score between them. One body of text is the tweets; the other is a hardcoded synonyms list. The synonyms are chosen because they have the highest semantic score with the sentiments considered, which makes them the best features for comparing against the underlying sentiment of a tweet. Each tweet is scored against the synonyms using the USE model from the “tensorflow_hub” library, and the sentiment with the highest semantic score is taken as the predicted sentiment label.

Hardcoding the Synonym List & dividing it into chunks

Universal Sentence Encoder Model
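A minimal sketch of this scoring scheme. In the real pipeline, `embed` would be the USE model (for example `hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")` from tensorflow_hub); to keep the sketch self-contained and runnable, a toy bag-of-words embedder is injected instead, and the synonym lists are illustrative:

```python
import numpy as np

def predict_sentiment(embed, tweet, synonyms_by_label):
    """Return the label whose synonym list is semantically closest to the tweet."""
    tweet_vec = embed(tweet)
    best_label, best_score = None, -np.inf
    for label, synonyms in synonyms_by_label.items():
        # Average cosine similarity between the tweet and each synonym chunk.
        scores = [
            np.dot(tweet_vec, embed(s))
            / (np.linalg.norm(tweet_vec) * np.linalg.norm(embed(s)))
            for s in synonyms
        ]
        score = float(np.mean(scores))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy embedder for demonstration only; with USE this would be the tensorflow_hub model.
VOCAB = ["joy", "glad", "happy", "sad", "cry", "miserable"]
def toy_embed(text):
    words = text.lower().split()
    return np.array([words.count(w) + 0.01 for w in VOCAB])

synonyms = {"happiness": ["joy glad"], "sadness": ["sad cry"]}
print(predict_sentiment(toy_embed, "so happy and glad today", synonyms))
```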


Pros:

  • It is an unsupervised learning model, and hence there is no need for training data.
  • Because it works on semantic similarity between statements, it performs well at capturing the core sentiment of a statement (Positive or Negative), even though these aren't the exact labels.


Cons:

  • The programmer must be well versed in the sentiment labels themselves in order to hardcode the synonyms list for each label without redundancy.
  • Performance depends on the quality of the synonyms.
  • It is not efficient at distinguishing sentiment labels that are closely related to each other.

USE Confusion Matrix

Accuracy : 47.5%


Doc2Vec:


Doc2Vec is a document embedding method: it creates a fixed-length numeric representation of a document, regardless of its length. The pre-processed data is split into train and test sets and the features are vectorized. The Doc2Vec module of the “gensim.models” library is used to fit the model, which is trained for 30 epochs with the vectorized contents of the tweets as X and their respective vectorized sentiments as y.

Doc2Vec Model


Pros:

  • Requires only a basic understanding of the sentiment labels.


Cons:

  • Requires a large amount of training data.
  • The model's performance depends heavily on the size of the training dataset.

Doc2Vec Confusion Matrix

Accuracy : 45%


LSTM:


After pre-processing the data, the contents of the tweets are tokenized and fed into a sequential LSTM model with 4 layers. The model is trained for 15 epochs with a batch size of 64, using “categorical_crossentropy” as the loss function and the “adam” optimizer. The model is fit with the contents of the tweets as X and their respective sentiments as y, both in vector form.

LSTM Model
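A minimal sketch of this architecture in Keras. The loss, optimizer, 4-layer structure and softmax over the 4 classes follow the article; the vocabulary size, sequence length and layer widths are illustrative assumptions:

```python
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import TextVectorization, Embedding, LSTM, Dense

tweets = ["so happy today", "this is terrible", "just a normal day", "i hate this"]

# Turn each tweet into a fixed-length sequence of token ids.
vectorize = TextVectorization(max_tokens=5000, output_sequence_length=30)
vectorize.adapt(tweets)
X = vectorize(np.array(tweets))

# Sequential model with 4 layers, ending in a softmax over the 4 sentiment classes.
model = Sequential([
    Embedding(input_dim=5000, output_dim=64),
    LSTM(64),
    Dense(32, activation="relu"),
    Dense(4, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

out = model(X)  # forward pass on the 4 sample tweets: one probability per class each
print(out.shape)
# model.fit(X, y_onehot, epochs=15, batch_size=64) would train as described,
# where y_onehot holds the one-hot encoded sentiment of each tweet.
```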


Pros:

  • Memory cells store information from previous steps, and the learning rate can be adjusted flexibly.


Cons:

  • Requires a large amount of training data.
  • The model's performance depends heavily on the size of the training dataset.

LSTM Confusion Matrix

Accuracy : 52.5%



Conclusion:

The USE model delivers decent performance and efficiency, given that it requires no training.

We must be careful that the sentiment labels considered are not redundant.
