Friday, 29 September 2017

Daughters Of India

This post is not about feminist movements in India or anything similar; rather, it is about using Python to apply basic NLP (Natural Language Processing) techniques to tweets.

The four Twitter users whose tweets I analysed are: Nidhi Razdan, Rupa Subramanya, Shubhrastha, and Rana Ayyub. They are among the very few media persons I like to hear or read, because I find them insightful and interesting in presenting their information and arguments; most other journalists are either not insightful or simply boring. In terms of political inclination, Nidhi and Rana are left of centre, whereas Rupa and Shubhrastha are to the right.

The analysis has a limited purpose: to put a set of numbers around the tweets and get a basic quantified view of these journalists' output.

By "basic" NLP, I mean applying the following techniques:
Word Density: This is the simplest one; it calculates the average number of words per tweet.
Lexical Diversity: This is an interesting statistic, which the author Matthew A. Russell explains, "is defined as the number of unique words divided by the number of total words in a corpus; by definition, a lexical diversity of 1.0 would mean that all words in a corpus were unique, while a lexical diversity that approaches a 0.0 implies more duplicate words.
In the Twitter sphere, lexical diversity might be interpreted in a similar fashion if comparing two Twitter users, but it might also suggest a lot about the relative diversity of overall content being discussed, as might be the case with someone who talks only about technology versus someone who talks about a much wider range of topics."[1]
Top Words: The words used most frequently. My program prints the top five.
Popularity: This is the sum of the number of retweets and likes received by a tweet. My program prints the top five.
Sentiment: A piece of text can be classified as positive, neutral or negative. My program calculates the number and percentage of positive, neutral and negative tweets.
Clustering: Clustering is a way of dividing a set of entities into groups whose membership is not determined in advance but emerges organically as we process the data. I used the K-means clustering algorithm, an unsupervised machine learning algorithm that divides 'n' data points into 'K' clusters based on some measure of similarity. My program uses K=5, so it divides the tweets into five clusters (topics) and prints 10 words from each cluster as an indicative sample.
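As a sketch, the first three statistics can be computed in a few lines of Python. The tweet strings below are made-up examples for illustration, not from the actual dataset:

```python
from collections import Counter

def lexical_diversity(words):
    # unique words divided by total words; 1.0 means every word is unique
    return len(set(words)) / len(words)

def average_words(tweets):
    # mean number of whitespace-separated words per tweet
    return sum(len(t.split()) for t in tweets) / len(tweets)

def top_words(words, top=5):
    # the 'top' most frequently occurring words with their counts
    return Counter(words).most_common(top)

tweets = ["modi speaks on economy today",
          "economy numbers out today",
          "cricket india wins today"]
words = [w for t in tweets for w in t.split()]

print(average_words(tweets))              # 13 words over 3 tweets
print(round(lexical_diversity(words), 2)) # 10 unique out of 13
print(top_words(words, 2))                # [('today', 3), ('economy', 2)]
```

The real program applies these same functions after the cleaning step described below.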

I wrote the code in Python. The program has a few helper functions that are used by the more functional routines:
clean_text_and_tokenize: This function takes a line (string) as input, cleans it and returns the words as a list. The cleaning consists of removing hyperlinks, punctuation marks and stop words, and lemmatizing the remaining words. Stop words are common words like 'a' and 'an' that we do not want in the analysis; lemmatizing replaces a word with its base form.
clean_tweet: This function takes a line (string), gets the clean words by calling clean_text_and_tokenize and returns a string by joining the cleaned words.
getCleanedWords: This function takes a list of lines (strings), cleans each line and returns all words from all the lines.
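A simplified, dependency-free sketch of the first two helpers is below. To keep it self-contained it uses a tiny hand-rolled stop-word set and skips lemmatization; assume the real program uses a proper stop-word list and lemmatizer (for example, NLTK's):

```python
import re
import string

# tiny illustrative stop-word set; the actual program would use a full list
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "to", "of"}

def clean_text_and_tokenize(line):
    # 1. strip hyperlinks, 2. strip punctuation, 3. drop stop words
    line = re.sub(r"https?://\S+", "", line)
    line = line.translate(str.maketrans("", "", string.punctuation))
    return [w for w in line.lower().split() if w not in STOP_WORDS]

def clean_tweet(line):
    # re-join the cleaned tokens into a single string
    return " ".join(clean_text_and_tokenize(line))

print(clean_tweet("The economy is growing! https://t.co/abc123"))
# -> economy growing
```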

The key functional routines are:
lexical_diversity: This function takes a list of words, and returns the number of unique words divided by the total number of words.
average_words: This function takes an array of strings, splits each into words and returns the average number of words per string.
top_words: This function takes in a list of words, stores the frequency of each word and returns the most frequently used words. If the 'top' number is not passed as an argument, it defaults to five.
popular_tweets: This function adds the retweet count and like count of every tweet to calculate its popularity. It uses a priority queue to identify the most popular tweets. If the 'top' number is not passed as an argument, it defaults to five.
sentiment_analysis_basic: This function uses the sentiment method of the TextBlob library to calculate the polarity of a tweet. The tweet is classified as positive, neutral or negative depending on whether the polarity is greater than, equal to, or less than zero.
clusterTweetsKmeans: This function uses the gensim library to create a model of vectors from the cleaned tweets. After training the model, it invokes the KMeans routine of the sklearn library. Tweets are clustered into five topics.
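To illustrate the priority-queue approach of popular_tweets, here is a minimal sketch using Python's heapq module. The tweet dicts and the field names retweets/likes are made up for the example; my program reads these counts from the csv file:

```python
import heapq

def popular_tweets(tweets, top=5):
    # popularity = retweets + likes; heapq.nlargest serves as the priority queue
    return heapq.nlargest(top, tweets, key=lambda t: t["retweets"] + t["likes"])

sample = [
    {"text": "tweet A", "retweets": 120, "likes": 300},
    {"text": "tweet B", "retweets": 40,  "likes": 90},
    {"text": "tweet C", "retweets": 500, "likes": 700},
]

for t in popular_tweets(sample, top=2):
    print(t["text"], t["retweets"] + t["likes"])
# prints:
# tweet C 1200
# tweet A 420
```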

The code is available on my github repository python-misc. The input to the program is a file named <twitter_user>.csv. This file has to be generated first by running the excellent program available in the github repository GetOldTweets-python. The ultra-cool feature of this module is that you don't have to register an app on Twitter or embed authorization tokens and passwords in the code.

For this article, I have fetched tweets from 01-Jan-2015 to 25-Sep-2017. To get the tweets csv file of @Nidhi, the command is:
$ python --Nidhi --since 2015-01-01 --until 2017-09-25
This creates a file named output_got.csv, which I renamed to Nidhi.csv. The command to run my program is:
$ python tweets_analysis Nidhi

The program opens the csv file and reads all the records into a list of strings. It skips the first line as it is the header. It then calls the functional routines one by one. The output generated for running with Nidhi.csv is:
Total no. of tweets: 3120
Average Number of words per tweet = 10.4330128205
Lexical diversity = 0.252425418385
| Words | Count |
|-------|------:|
| thank |   320 |
| india |   162 |
| say   |   124 |
| yes   |   109 |
| also  |   101 |
Printing top 5 tweets
1. I don't know who killed Gauri Lankesh. But I do see who is celebrating her death and vilifying her.
Popularity = 17679
Link =
2. A message to those in the media who are still independent and do their job by fearlessly asking questions. We won't be intimidated https:// s/871593196953849856 …
Popularity = 10653
Link =
3. It's now fairly clear demonetisation was a purely political move. Brilliant actually. Economy got hit but hey, U.P. was won
Popularity = 9018
Link =
4. Hello people, Ramdev is not buying NDTV. Thank you
Popularity = 8892
Link =
5. Honoured to present my book 'Left,Right &Centre,The Idea of India' to the President @RashtrapatiBhvn @PenguinIndia
Popularity = 7233
Link =

No. of positive tweets = 1043 Percentage = 33.4294871795
No. of neutral tweets = 1616 Percentage = 51.7948717949
No. of negative tweets = 461 Percentage = 14.7756410256

Topic 1 has words: income tax department sends notice harsh mander institute via httweets
Topic 2 has words: anyone bjp condemned language today actually first one anything else
Topic 3 has words: wonder took long life short live fruit covered story well
Topic 4 has words: lol sigh never according yes saying mention press cog corner
Topic 5 has words: hiv but thank actually thank thank sephora actually french thank

I have captured the output of the runs against the four files in the following Google sheet:
tweet_analysis output
For your ready reference here is a screenshot:

Some observations
Rupa is the most prolific, averaging about 45 tweets per day, whereas the most popular tweet is from Nidhi Razdan. Shubhrastha uses the most words per tweet among the four. The highest lexical diversity is Nidhi's, indicative of a larger vocabulary. Rupa's value is very low, probably because her total word count (the denominator) is so high across such a large number of tweets. The highest positive sentiment is from Rupa and the highest negative sentiment is from Shubhrastha, both right-leaning. Sentiment neutrality is lowest in Rupa's tweets, indicating that she takes a stand most of the time.

Program improvement & enhancement
For the lexical diversity calculation we should perhaps consider an equal number of tweets from each user.
Sentiment analysis can be done with a more advanced algorithm like Naive Bayes; that would require a corpus of pre-classified tweets (the training data, as it is technically called), preferably from Indian users.
Once we have a larger dataset of twitter analyses, this program could be used to classify a twitter user's political orientation as left, centre or right by analysing their tweets. This could be done either by comparing their tweets with a political-ideology corpus or by measuring similarity with one of the already analysed twitter users.
Just showing the words in a cluster is not meaningful. I need to experiment with the number of clusters and analyse each cluster separately to derive some semantic meaning. Topic clustering can also be done with a probabilistic algorithm like LDA.

[1] Mining the Social Web, 2nd Edition by Matthew A. Russell. O'Reilly Media.

Friday, 8 September 2017

Ideas From Another Field

Applying concepts from one field or book in another field has been a common pattern in modern technological development. The spirit of antifragility, a recently coined word, has found its way into the implementation of microservices: in the quest to make software programs antifragile, the way forward is to build intelligence into them. A couple of other examples: i) the law of diminishing returns from economics, applied to parallel computing, becomes Amdahl's Law; ii) I surmise that the Agile board in the Scrum methodology is an application of the Hawthorne Effect. Cross-pollination of ideas is one of the mechanisms of innovation. And, as we found out recently, learning transfer is how Elon Musk has become such a prolific technocrat and businessman.
My essay ends here.

Friday, 11 August 2017

I Dare You Not To Fall In Love

On the best part of using Ruby on Rails for software development, thus spake its creator David Hansson:
You get to use Ruby, which, even in a world that has rediscovered the benefits of functional programming and immutability, remains the most extraordinarily beautiful and luxurious language I’ve yet to encounter. Just look at some code. I dare you not to fall in love.[1]
Well, this is my exact opinion of Ruby, my favourite programming language, but I couldn't have articulated it any better than Hansson.

Friday, 2 June 2017

Applied Rails : Gems I Use

In this article, I discuss key gems that I have used in my Rails application. For each gem, I state what it is used for, a brief description of how I used it and the code snippet(s) pertaining to my use case.

Friday, 12 May 2017

Book Review : The Rise Of The Robots

The technology world faces new trends frequently. New technologies typically promise to get things done faster, reduce costs, and open new market segments, thus improving the financials of a lot of firms. From Service-oriented Architecture to mobile and cloud computing, all of them stake their claim to deliver these benefits.

However, the advent of Artificial Intelligence (AI) will impact the world economy in a way that no technology in the past was able to. This is the core thesis of Martin Ford’s 2015 book, 'The Rise of the Robots', sub-titled 'Technology and the Threat of Mass Unemployment.' If AI can unleash a tsunami of unemployment, how should America deal with it?

Friday, 14 April 2017

Applied Rails: An Algorithmic Perspective

The Human Resources manager informed us that employees working at the corporate headquarters (CHQ) will have the second and fourth Saturdays of the month off. This affected the leave balance calculation in my Rails application: if a CHQ employee applies for leave and a second or fourth Saturday falls between the start date and the end date, that Saturday should not be deducted from their leave balance.

I was already using the Rails date library: given a date, I could get the beginning and end of the month in which it falls. Before I could figure out how to proceed with this information, I put up a question online[1].
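The second/fourth-Saturday check itself needs only plain date arithmetic. Here is a sketch of the logic in Python (my application implements this in Ruby on Rails; the function name is illustrative):

```python
from datetime import date, timedelta

def special_saturdays(start, end):
    # count second and fourth Saturdays falling within [start, end]
    count = 0
    d = start
    while d <= end:
        if d.weekday() == 5:  # Monday=0 ... Saturday=5
            nth = (d.day - 1) // 7 + 1  # which Saturday of the month this is
            if nth in (2, 4):
                count += 1
        d += timedelta(days=1)
    return count

# April 2017 began on a Saturday, so 8 Apr was the second Saturday
print(special_saturdays(date(2017, 4, 8), date(2017, 4, 14)))  # -> 1
```

Each such Saturday found in the leave period is then added back to the employee's leave balance.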

Friday, 17 March 2017

Applied Rails: Bulleted Text With Prawn

I use the prawn gem to generate pdf documents in my Rails application. It has a drawback: there is no built-in support for displaying bulleted text. So I wrote a simple function that took a string parameter and printed it with an indent and a leading asterisk.