Unveiling the Hidden Trends:
Analyzing Twitter Activity in India using Topic Models
Introduction
With over 396.5 million users, Twitter has established itself as a leading social media platform, enabling users to share their thoughts and engage with a vast audience through short text messages known as tweets. This microblogging service attracts users from diverse backgrounds, fostering discussions on a wide range of topics. As a result, Twitter serves as a valuable data pool for text mining, offering insights into the prevailing themes and discussions at any given time. However, manually sifting through the massive volume of tweets to identify relevant topics can be impractical. To address this challenge, adopting a text mining approach, such as topic modeling, can distill the information into a more usable and actionable format.
As a data science enthusiast passionate about Twitter's potential, I set myself the task of enhancing Twitter's search algorithm by incorporating more context-based results instead of relying solely on query-based outcomes. To achieve this, I use topic models to discover latent themes within tweets originating from India. By analyzing the Twitter activity of Indian users over two years, from September 2019 to September 2021, I aim to identify trends and correlations, particularly in relation to the COVID-19 pandemic's impact on social media interactions. Because topic modeling is an unsupervised machine learning technique, the analysis and results depend on how well the topic models capture relevant topics.
In this post, I delve into the process of analyzing Twitter activity in India, exploring the application of topic models as a means to uncover the underlying themes and trends. By doing so, I aim to shed light on the latent discussions and correlations that have emerged on the platform, especially considering the significant influence of the COVID-19 pandemic. Furthermore, I discuss the potential implications of this analysis for improving Twitter's search engine performance, driving user engagement, and ultimately facilitating user monetization.
Datasets
To analyze Twitter activity in India, I used snscrape, a scraper for social networking services such as Twitter, Facebook, and Instagram. Its Python module for Twitter provides extensive functionality, including support for users, user profiles, hashtags, searches, threads, and list posts.
One of the key advantages of using Snscrape over Tweepy is that it does not require an API key to scrape tweets. This simplifies the process and eliminates the need for authentication. Additionally, Snscrape's search function operates in a similar way to Twitter's search, making it easier to retrieve relevant tweets.
To filter tweets by location, I used the 'near' and 'within' geo operators of Twitter's search syntax, which snscrape passes through unchanged. This allowed me to focus on tweets from the 25 most populous cities in India. For example, a typical search query to scrape tweets around a specific city follows this format: f'near:{city} within:200km since:{since} until:{until} lang:en -filter:retweets -filter:replies filter:verified'. Here, 'city', 'since', and 'until' represent the city name, start date, and end date of the search query, respectively.
To ensure the dataset's quality and keep the search results manageable, I applied additional filters. These filters limit the number of search results, exclude retweets and replies, and restrict the results to original English-language tweets from verified accounts.
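The query format above can be assembled programmatically for each city and date range. The following is a minimal sketch, not the exact script used for this post: the helper names `build_query` and `scrape`, the `limit` parameter, and the choice of fields to keep are assumptions made for illustration. The scraping function relies on the third-party snscrape package, whose tweet attribute names have changed between versions.

```python
from datetime import date


def build_query(city: str, since: date, until: date, radius_km: int = 200) -> str:
    """Build a search query for original English tweets from verified accounts near a city."""
    return (
        f"near:{city} within:{radius_km}km "
        f"since:{since.isoformat()} until:{until.isoformat()} "
        "lang:en -filter:retweets -filter:replies filter:verified"
    )


def scrape(query: str, limit: int = 100) -> list[dict]:
    """Collect up to `limit` tweets for a query; requires the snscrape package."""
    # Imported lazily so build_query stays usable without snscrape installed.
    import snscrape.modules.twitter as sntwitter

    rows = []
    for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
        if i >= limit:
            break
        # Assumption: newer snscrape releases expose `rawContent`, older ones `content`.
        text = getattr(tweet, "rawContent", None) or getattr(tweet, "content", "")
        rows.append({"date": tweet.date, "user": tweet.user.username, "text": text})
    return rows


query = build_query("Mumbai", date(2019, 9, 1), date(2019, 10, 1))
print(query)
```

Keeping the query builder separate from the scraper makes it easy to loop over the 25 cities and over monthly date windows without touching the scraping code.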
After the scraping process, the final dataset consisted of approximately 400,000 tweets. This dataset served as the foundation for subsequent analysis and topic modeling to uncover meaningful insights into Twitter activity in India.
EDA
During the exploratory data analysis (EDA) phase, the unstructured text data underwent iterative cleaning steps, including:
Converting the unstructured data into a structured dataframe format.
Performing quality checks (QC) on the scraped data to identify inconsistencies or irregularities.
Filtering out tweets with missing attributes or incomplete information.
Utilizing regular expressions and the SwachData API to clean individual tweets.
Generating visualizations to identify any remaining issues with the scraped data.
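The regular-expression part of the cleaning steps above might look like the sketch below. It is an illustration under assumptions, not the exact rules used for this post: the patterns and the function name `clean_tweet` are hypothetical, and the SwachData step is not shown.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and strip URLs, mentions, and non-letter characters."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)  # also drops '#', digits, punctuation, emoji
    return " ".join(text.split())       # collapse runs of whitespace


print(clean_tweet("Stay home! #COVID19 updates via @WHO https://t.co/abc123"))
# -> stay home covid updates via
```

Note that this sketch keeps the text of hashtags (minus digits) because hashtags often carry the topic of a tweet; whether to keep or drop them is a design choice.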
Analyzing user activity:
The plot depicting the distribution of tweet lengths reveals interesting patterns. The distribution exhibits a peak around 80 characters, indicating that a significant number of tweets tend to be relatively short. As the tweet length increases beyond 80 characters, there is a gradual decline in the count of tweets.
However, a notable shift occurs around 280 characters, the expanded limit that Twitter introduced in 2017 (the original limit was 140 characters). At this point, there is a sharp inflection in the distribution, suggesting that a considerable number of users write right up to the expanded limit to express more detailed thoughts and opinions.
Surprisingly, after the 280-character mark, the tweet count drops and then rises to an even bigger peak around 320 characters. This could indicate that some users work around the limit, for example through thread linking or by splitting longer texts across multiple tweets. It may also be an artifact of scraping: Twitter counts every URL at a fixed shortened length, so a tweet whose links appear in full in the scraped text can exceed 280 characters.
The significance of this plot lies in understanding how users adapt to changes in character limits and how it affects the nature of conversations on Twitter. It highlights the diverse usage patterns and the need for platforms like Twitter to continuously evaluate and adjust their character limits to accommodate users' communication preferences effectively.
The distribution plot represents the number of tweets originating from various cities in India, with Mumbai and Delhi emerging as the top two contributors.
Mumbai and Delhi, being highly populated and bustling metropolitan cities, demonstrate their significance in the Twitter landscape. The substantial number of tweets originating from these cities suggests a vibrant online presence and active engagement of users in these regions.
The prominence of Mumbai and Delhi in terms of tweet count highlights their influential role in shaping discussions and conversations on Twitter in India. This may be attributed to factors such as their large population, diverse demographics, and prominent cultural, social, and economic significance.
Understanding the distribution of tweets across different cities provides valuable insights into regional trends, preferences, and conversations. It enables further analysis to uncover specific topics or themes that are prevalent in certain cities and facilitates a deeper understanding of the dynamics of Twitter activity across India.
The plot depicting the distribution of tweets per user reveals an interesting pattern. Out of roughly 19,000 unique users, approximately 12,000 posted only 1 or 2 tweets. The remaining 7,000 users account for the majority of the 400,000 tweets.
This distribution raises a question regarding its potential impact on the modeling effort. It is currently unclear whether this distribution will positively or negatively influence the modeling process.
On one hand, the presence of a large number of users who have only posted a few tweets could introduce noise or sparse data, making it challenging to extract meaningful patterns or themes from their limited contributions.
On the other hand, the smaller group of users who have posted a significant number of tweets may offer more consistent and substantial data, potentially providing richer insights and enhancing the modeling effort.
Further analysis and exploration are necessary to determine the implications of this distribution. Evaluating the content, quality, and relevance of tweets from both groups of users can provide a clearer understanding of their impact on the modeling process and help make informed decisions on how to handle or weight the contributions from different user groups.
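One way to quantify the split between light and heavy posters described above is sketched below. The function name `activity_summary`, the threshold parameter `light_max`, and the toy author IDs are assumptions for illustration only; the real input would be one author ID per scraped tweet.

```python
from collections import Counter


def activity_summary(user_ids: list[str], light_max: int = 2) -> dict:
    """Split users into light (<= light_max tweets) and heavy posters."""
    per_user = Counter(user_ids)
    light = {user for user, n in per_user.items() if n <= light_max}
    light_tweets = sum(per_user[user] for user in light)
    total = len(user_ids)
    return {
        "users": len(per_user),
        "light_users": len(light),
        "heavy_users": len(per_user) - len(light),
        "heavy_tweet_share": (total - light_tweets) / total if total else 0.0,
    }


# Toy data: one author ID per tweet.
sample = ["a", "a", "a", "a", "a", "a", "b", "c", "c", "d"]
summary = activity_summary(sample)
print(summary)
```

A summary like this makes it possible to compare models trained with and without the light posters, or to down-weight the heaviest accounts.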
Model Building
In this project, a three-step modeling approach is adopted for the LDA (Latent Dirichlet Allocation) model using the Gensim library. The steps involved are as follows:
Base LDA model for the entire corpus: Initially, a base LDA model is built for the entire corpus of text data. This involves creating a dictionary object to assign a unique ID to each token in the text, constructing a document-term matrix (a bag-of-words corpus) to represent the frequency of words in each document, and training the LDA model using the corpus, dictionary, and the desired number of topics. Gensim's LdaMulticore implementation is used for efficient parallel processing.
Tuning the Hyperparameters: To improve upon the base model, the hyperparameters of the LDA model are tuned. This involves adjusting parameters such as the number of topics, alpha, beta (called eta in Gensim), and the number of iterations to find the configuration that maximizes topic coherence.
Build a model for each month and tune their hyperparameters: To capture the temporal dynamics of the data, a separate LDA model is built for each month. The corpus is limited to the tweets posted within each month, and the hyperparameters are tuned specifically for that month's model. This approach allows for a more granular analysis of topics and their evolution over time.
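The three steps above can be sketched as follows. This is a hedged outline rather than the code behind this post: the helper names `group_by_month` and `train_lda`, the toy tweets, and the `passes` and `random_state` values are assumptions. Training requires the third-party Gensim package, which is imported inside the function so the month-grouping helper runs on its own.

```python
from collections import defaultdict
from datetime import datetime


def group_by_month(tweets: list[tuple[datetime, list[str]]]) -> dict[str, list[list[str]]]:
    """Bucket tokenized tweets by 'YYYY-MM' so each month gets its own corpus."""
    buckets = defaultdict(list)
    for posted_at, tokens in tweets:
        buckets[posted_at.strftime("%Y-%m")].append(tokens)
    return dict(buckets)


def train_lda(token_lists: list[list[str]], num_topics: int = 25, decay: float = 0.5):
    """Train one LDA model; requires the gensim package."""
    from gensim.corpora import Dictionary
    from gensim.models import LdaMulticore

    dictionary = Dictionary(token_lists)                       # token -> integer ID
    corpus = [dictionary.doc2bow(doc) for doc in token_lists]  # bag-of-words per tweet
    model = LdaMulticore(
        corpus=corpus, id2word=dictionary, num_topics=num_topics,
        decay=decay, passes=5, random_state=42,  # passes and seed are assumptions
    )
    return model, dictionary, corpus


# Toy tokenized tweets, for illustration only.
tweets = [
    (datetime(2020, 3, 24), ["stay", "home", "lockdown"]),
    (datetime(2020, 3, 30), ["vaccine", "trial"]),
    (datetime(2020, 4, 2), ["lockdown", "extend"]),
]
monthly = group_by_month(tweets)
print({month: len(docs) for month, docs in monthly.items()})
# -> {'2020-03': 2, '2020-04': 1}

# Step 1 would call train_lda on all tweets; step 3 calls it once per month:
# models = {month: train_lda(docs) for month, docs in monthly.items()}
```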
Once the models are trained, various steps are taken to analyze and evaluate the results. This includes investigating the topics and the words associated with each topic, calculating model metrics to assess the model's performance, and visualizing the topics using pyLDAvis, a Python library for interactive topic model visualization.
Model Evaluation
In unsupervised learning, such as in the case of topic modeling using LDA, model building and evaluation are intertwined and often overlap with each other. Unlike in supervised learning where there are clear labels or ground truth to evaluate against, unsupervised learning relies on the inherent structure or patterns within the data to generate insights and make sense of the information.
In the context of topic modeling, the process of building the model involves training it on the text data and extracting latent topics. However, without evaluation, it becomes challenging to determine the effectiveness or quality of the generated topics. Therefore, evaluation becomes an integral part of the model-building process to gauge its performance and assess the relevance and coherence of the identified topics.
During evaluation, various metrics and techniques are employed to measure the model's performance. This may include assessing topic coherence, calculating perplexity scores, conducting qualitative analysis, or using visualization tools like pyLDAvis. These evaluation methods aim to determine the effectiveness of the model in capturing meaningful topics, identifying distinct clusters of words, and aligning with the underlying semantics of the data.
Furthermore, the insights gained from the evaluation phase often inform the iterative refinement of the model. If the evaluation results indicate suboptimal performance or unclear topics, adjustments can be made to hyperparameters, preprocessing steps, or even the model architecture itself. This iterative process of building, evaluating, and refining the model continues until satisfactory results are achieved.
Therefore, in unsupervised learning scenarios like topic modeling, model building and evaluation are interconnected, and the feedback loop between them is crucial for generating meaningful and accurate insights from unstructured data. The overlap between the two helps ensure that the model's outputs align with the desired objectives, leading to more reliable and actionable results.
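To make the idea of topic coherence more concrete, here is a small pure-Python sketch of one coherence measure, UMass, which rewards topics whose top words tend to appear in the same documents. This is an illustration only: the post does not state which coherence measure was used, UMass scores are typically negative and so are not on the same scale as the scores reported here, and the toy documents are made up. Gensim's CoherenceModel offers this measure (coherence='u_mass') alongside others such as 'c_v'.

```python
import math
from itertools import combinations


def umass_coherence(topic_words: list[str], documents: list[set[str]]) -> float:
    """UMass coherence: mean log co-document frequency over ranked word pairs."""
    def doc_freq(*words: str) -> int:
        return sum(1 for doc in documents if all(w in doc for w in words))

    scores = []
    # combinations() preserves ranking order, so `earlier` always outranks `later`.
    for earlier, later in combinations(topic_words, 2):
        denom = doc_freq(earlier)
        if denom == 0:
            continue  # word never appears; skip rather than divide by zero
        scores.append(math.log((doc_freq(earlier, later) + 1) / denom))
    return sum(scores) / len(scores) if scores else 0.0


# Toy corpus: each document is the set of tokens in one tweet.
docs = [
    {"vaccine", "hospital", "case"},
    {"vaccine", "hospital"},
    {"film", "release"},
    {"film", "song"},
]
coherent = umass_coherence(["vaccine", "hospital"], docs)
mixed = umass_coherence(["vaccine", "film"], docs)
print(coherent > mixed)  # words that co-occur score higher
```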
Base Model Investigation:
In the base model, 25 topics were used as starting input. To evaluate the performance of these LDA models, the pyLDAvis library was utilized. This library provides an interactive visualization called the Intertopic Distance Map, which helps assess the quality of the model's topics.
The goal is to observe circles in the plot that are spread out and minimally overlapped. This indicates that the model has successfully identified distinct topics within the corpus. When hovering over a specific bubble (representing a topic), the plot on the right displays the associated words and their occurrences in that particular topic. In LDA, every topic is a probability distribution over the whole vocabulary, so each word has some probability under every topic, but those probabilities vary widely. Sorting the words by their probabilities shows which words are most likely to be associated with a specific topic.
The color red represents the word distribution within the chosen topic, sorted from high to low, while the color blue represents the word distribution across the entire corpus. This visualization aids in interpreting the topics based on clusters of associated words.
For example, the top topic (Topic 1) consists of words such as "government," "vaccine," "student," "issue," and "health." This suggests that the topic being discussed involves aspects related to government intervention in COVID-19, vaccination efforts, student-related issues, and health concerns. By analyzing the composition of words associated with a topic, human interpretation can provide deeper insights and understanding of the underlying themes.
It is important to note that the evaluation of LDA models involves a human element, as subjective interpretation is required to make sense of the topics generated. The combination of automated analysis and human judgment helps validate and refine the model's outputs, ensuring that the identified topics align with the intended context and goals of the analysis.
In the base model with 25 topics and a coherence score of 0.45, the top topics can provide insights into the discussions happening within the corpus. Here is a summary of the interpretations for some of the top topics:
Topic 1: This topic appears to involve discussions related to government intervention in COVID-19, prioritizing vaccines, and issues concerning students, possibly indicating the impact of the pandemic on educational institutions.
Topic 2: This topic revolves around political discussions, mentioning parties like BJP and Congress, along with leaders, indicating discussions related to national political parties.
Topic 3: The words in this topic suggest positive sentiments, best wishes, and encouragement to stay home during the pandemic, indicating discussions about personal well-being and taking precautions.
Topic 4: This topic involves discussions about an app, film releases, and user excitement, particularly around Bollywood films. It signifies promotional activity or anticipation of a film release.
Topic 5: The words in this topic, such as "new," "view," "change," and "million," do not indicate a specific topic or theme, making it challenging to associate them with a particular discussion.
Topic 6: This topic primarily revolves around COVID-19 cases, police reports, deaths, and testing, suggesting discussions related to the pandemic's impact, spread, and related statistics.
Hyperparameter Tuning
In the hyperparameter tuning process, a manual grid search was adopted instead of scikit-learn's GridSearchCV, because the models were evaluated by their coherence scores. Two key hyperparameters were tuned: the number of topics and the decay parameter. The models were trained with different combinations of these hyperparameters, and each model was saved to disk for later evaluation and comparison.
To determine the best LDA model, the saved models were loaded, and coherence scores were calculated for each model. The models were then ranked based on their coherence scores, with the highest-ranked models considered as the best LDA models. These top-ranked models were selected for further evaluation of the results.
The chosen best LDA model(s) were subsequently analyzed by generating visualizations to investigate the topics extracted by the model. This allowed for a deeper understanding of the content and themes captured within the corpus.
By systematically training and evaluating multiple models with different hyperparameter combinations, the aim was to identify the LDA model that exhibited the highest coherence score. This best-performing model would provide more meaningful and coherent topics, enhancing the overall quality of the topic modeling results.
After performing the manual grid search, the best LDA model was identified with a num_topics value of 15 and a decay value of 0.5. This model demonstrated an improved coherence score compared to the base model. The coherence score serves as an evaluation metric for the quality and interpretability of the topics generated by the model.
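The manual grid search described above can be sketched as a loop over every combination of hyperparameters. In this sketch the names `grid_search` and `fake_coherence` and the grid values are hypothetical. The scorer is a stand-in that is deliberately rigged to peak at num_topics=15 and decay=0.5 so the example runs without training anything; it does not reproduce real results. In practice the scorer would train an LDA model with the given parameters and return its coherence score.

```python
from itertools import product
from typing import Callable


def grid_search(
    grid: dict[str, list],
    score_fn: Callable[..., float],
) -> list[tuple[dict, float]]:
    """Score every hyperparameter combination and rank by score, best first."""
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((params, score_fn(**params)))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Stand-in scorer (assumption): replace with "train LDA, return coherence".
def fake_coherence(num_topics: int, decay: float) -> float:
    return 1.0 - abs(num_topics - 15) / 50 - abs(decay - 0.5)


ranked = grid_search({"num_topics": [5, 15, 25], "decay": [0.5, 0.7, 0.9]}, fake_coherence)
best_params, best_score = ranked[0]
print(best_params)
# -> {'num_topics': 15, 'decay': 0.5}
```

Passing the scorer in as a function keeps the search loop independent of how models are trained, saved, or evaluated.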
Best LDA Model Evaluation
Topics Identified
Topic | Keywords |
---|---|
Topic 0 | help let family come fight way go right want think |
Topic 1 | thank team good film look come new win release congratulation |
Topic 2 | new flight business city development market company price global country |
Topic 3 | bjp party justice police congress case leader bapu sant asharamji |
Topic 4 | crore vaccine lakh student read school bank open free vaccination |
Topic 5 | look kashmir boy jammu old hyderabad picture beautiful home shoot |
Topic 6 | song temple hai music listen ram beautiful video king art |
Topic 7 | food help blood army eat distribute donate camp needy singh |
Topic 8 | morning road west bengal pic south north fly east black |
Topic 9 | block pakistan dist river earth sri chandigarh meet forest network |
Topic 10 | happy wish god book bless sant bapu asharamji health happiness |
Topic 11 | post mumbai photo police pradesh tamil maharashtra station village railway |
Topic 12 | hindu religion hindus language culture muslims muslim idea boss religious |
Topic 13 | case test hospital report number death patient positive update new |
Topic 14 | minister app president modi nation anniversary gandhi hon tribute prime |
Dominant Topic Distribution
LDA models each document as a mixture of topics. For tweets, however, we are generally interested in assigning a single topic to each one, since a short tweet most likely talks about one particular topic. To do this, I extract all the topics and their weights for a given document, sort them by weight in descending order, and report the top topic together with the keywords that constitute it.
Document_No | Dominant_Topic | Topic_Perc_Contrib | Keywords | Text |
---|---|---|---|---|
0 | 10.0 | 0.4437 | happy, wish, god, book, bless, sant, bapu, asharamji, health, happiness | [bhai, celebrate, generation, house, lal, phonta, pishi, ray, roma] |
1 | 1.0 | 0.5443 | thank, team, good, film, look, come, new, win, release, congratulation | [accept, available, challenge, delighted, difficult, game, goal, hyderabad, improve, lot, new, p... |
2 | 5.0 | 0.6888 | look, kashmir, boy, jammu, old, hyderabad, picture, beautiful, home, shoot | [artist, old] |
3 | 1.0 | 0.5821 | thank, team, good, film, look, come, new, win, release, congratulation | [career, crazy, enjoy, grill, guy, hey, include, interview, know, link, listen, playlist, podcas... |
4 | 0.0 | 0.4315 | help, let, family, come, fight, way, go, right, want, think | [clear, course, imagine, lose, seat, tories, want, win] |
5 | 1.0 | 0.3727 | thank, team, good, film, look, come, new, win, release, congratulation | [bangladesh, confirm, eden, gardens, go, test, well] |
6 | 10.0 | 0.7583 | happy, wish, god, book, bless, sant, bapu, asharamji, health, happiness | [celebrate, blood, bond, brother, cousin, festival, good, miss, sister, special, time, unique, w... |
7 | 0.0 | 0.9348 | help, let, family, come, fight, way, go, right, want, think | [avoid, break, ego, forgive, laugh, live, loudly, make, quickly, short, smile, truly] |
8 | 4.0 | 0.4382 | crore, vaccine, lakh, student, read, school, bank, open, free, vaccination | [good, wish, class, kharagpur, sketch, student] |
9 | 3.0 | 0.4741 | bjp, party, justice, police, congress, case, leader, bapu, sant, asharamji | [agitate, bengal, decent, delegation, demand, governer, job, lead, leader, long, marginalized, m... |
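Rows like those in the table above could be produced along these lines. This is a sketch under assumptions: the function names are hypothetical, the topic weights are made-up toy values, and the input is assumed to be a list of (topic_id, weight) pairs per document, which is the shape returned by Gensim's get_document_topics.

```python
def dominant_topic(topic_weights: list[tuple[int, float]]) -> tuple[int, float]:
    """Return the (topic_id, weight) pair with the largest weight."""
    return max(topic_weights, key=lambda pair: pair[1])


def dominant_topic_rows(
    doc_topics: list[list[tuple[int, float]]],
    topic_keywords: dict[int, list[str]],
) -> list[dict]:
    """Build one summary row per document, mirroring the table's columns."""
    rows = []
    for doc_no, weights in enumerate(doc_topics):
        topic_id, weight = dominant_topic(weights)
        rows.append({
            "Document_No": doc_no,
            "Dominant_Topic": topic_id,
            "Topic_Perc_Contrib": round(weight, 4),
            "Keywords": ", ".join(topic_keywords[topic_id]),
        })
    return rows


# Toy inputs (made-up weights) for two documents.
doc_topics = [
    [(1, 0.30), (10, 0.44), (4, 0.26)],
    [(0, 0.93), (3, 0.07)],
]
topic_keywords = {0: ["help", "let", "family"], 10: ["happy", "wish", "god"]}
rows = dominant_topic_rows(doc_topics, topic_keywords)
for row in rows:
    print(row)
```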
Most Representative Sentence for Each Topic
In topic modeling, each document is typically assigned multiple topics with corresponding weights that represent the document's distribution across different topics. To understand the underlying representative text driving a particular topic, we can identify the most representative sentence for each topic based on these topic distributions.
The most representative sentence for a specific topic is the one that has the highest contribution or weight for that topic within its document. By analyzing this sentence, we can gain insights into the key themes or concepts associated with that topic.
This is valuable because it ties each topic's mixture of words back to actual text. Reading the tweet that drives a topic makes it easier to interpret and describe that topic, and to understand the content it captures within a large collection of documents.
Row | Topic_Num | Topic_Perc_Contrib | Keywords |
---|---|---|---|
0 | 0.0 | 0.9718 | help, let, family, come, fight, way, go, right, want, think |
1 | 1.0 | 0.9749 | thank, team, good, film, look, come, new, win, release, congratulation |
2 | 2.0 | 0.9719 | new, flight, business, city, development, market, company, price, global, country |
3 | 3.0 | 0.9705 | bjp, party, justice, police, congress, case, leader, bapu, sant, asharamji |
4 | 4.0 | 0.9598 | crore, vaccine, lakh, student, read, school, bank, open, free, vaccination |
5 | 5.0 | 0.9611 | look, kashmir, boy, jammu, old, hyderabad, picture, beautiful, home, shoot |
6 | 6.0 | 0.9651 | song, temple, hai, music, listen, ram, beautiful, video, king, art |
7 | 7.0 | 0.9650 | food, help, blood, army, eat, distribute, donate, camp, needy, singh |
8 | 8.0 | 0.9547 | morning, road, west, bengal, pic, south, north, fly, east, black |
9 | 9.0 | 0.9647 | block, pakistan, dist, river, earth, sri, chandigarh, meet, forest, network |
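Selecting the most representative document per topic amounts to keeping, for each dominant topic, the row with the highest contribution. The sketch below is illustrative: the function name `most_representative` is hypothetical, and the sample rows reuse the document numbers, dominant topics, and contributions of a few rows from the dominant-topic table earlier in this post.

```python
def most_representative(rows: list[dict]) -> dict[int, dict]:
    """For each dominant topic, keep the document with the highest contribution."""
    best: dict[int, dict] = {}
    for row in rows:
        topic = row["Dominant_Topic"]
        if topic not in best or row["Topic_Perc_Contrib"] > best[topic]["Topic_Perc_Contrib"]:
            best[topic] = row
    return best


# A few rows taken from the dominant-topic table (topics 0 and 1 only).
sample_rows = [
    {"Document_No": 1, "Dominant_Topic": 1, "Topic_Perc_Contrib": 0.5443},
    {"Document_No": 3, "Dominant_Topic": 1, "Topic_Perc_Contrib": 0.5821},
    {"Document_No": 4, "Dominant_Topic": 0, "Topic_Perc_Contrib": 0.4315},
    {"Document_No": 5, "Dominant_Topic": 1, "Topic_Perc_Contrib": 0.3727},
    {"Document_No": 7, "Dominant_Topic": 0, "Topic_Perc_Contrib": 0.9348},
]
representatives = most_representative(sample_rows)
for topic, row in sorted(representatives.items()):
    print(topic, row["Document_No"], row["Topic_Perc_Contrib"])
```

Across the full corpus, the winning contributions are much higher than in this small sample, which is why the table above shows values above 0.95.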
Frequency Distribution of Word Counts in Documents
We see that the word counts of the documents are very small. This is because most of the recurring words, which did not really add context to the topics, were added to the stopword list.
Distribution of document word counts by dominant topic
The word count distribution is similar across all topics. Topics 0 to 3 account for the bulk of the documents.
Word Clouds of Top Keywords in Each Topic
The word cloud shows the words and their relative weights within a given topic. This helps us understand which words contribute most to each topic.
Word Counts of Topic Keywords
This plot is similar to the word cloud, but shows the actual word counts alongside each word's corresponding weight in the topic.
Most Discussed Topics
We see that topics 0 to 3 are the most dominant topics in the tweets.
Modeling Trending Topics
In traditional topic modeling approaches, the entire corpus of tweets spanning a two-year period is used to cluster topics into 5-50 categories. However, social media conversations are highly dynamic, with different topics trending at different times. To capture these evolving trends, I propose a monthly-based approach to analyze how discussions have changed over the two-year period.
In this approach, I train individual topic models for each month's tweets, resulting in a total of 24 models. To determine the best topics for a given month, I employ a grid search, which explores different combinations of hyperparameters such as the number of topics and the decay rate. By fitting the model to the specific month's tweets, I can better capture the relevant and timely topics of that period.
Once the models are trained, I calculate coherence scores to assess the quality and interpretability of the topics generated. Coherence measures how coherent and meaningful the words within each topic are, ensuring the identified topics are clear and informative. The models are then ranked based on their coherence scores, and the model with the highest coherence is selected as the best model for that month.
To track the evolving trends over the two-year period, I save the best model for each month. These models serve as snapshots of the most relevant topics during that specific time frame. By analyzing the topics and associated words from the best models, I can visualize and understand the top topics that emerged over the entire two-year period.
By adopting this approach, I aim to overcome the limitation of capturing evolving trends in social media conversations. Instead of relying on a single model for the entire corpus, analyzing topics on a monthly basis provides a more accurate representation of the changing discussions and allows for a comprehensive analysis of the trends over time.
The above plot shows the number of topics for the best candidate LDA model for each month after hyperparameter tuning, where the best model is the one with the highest coherence score for that month. We can see wide variation in the number of topics identified for each month, ranging from a low of 3 to a high of 30. The coherence scores of all the best models are above 0.4, except for the March 2020 model.
WordCloud of Top Topic Over Time
The topics that can be interpreted for each month are as follows:
Month-Year | Topic |
---|---|
Sep-2019 | Very small dataset, not a coherent topic |
Oct-2019 | Diwali wishes |
Nov-2019 | Thanking for Birthday wishes |
Dec-2019 | Student protests |
Jan-2020 | Film promotion |
Feb-2020 | Film promotion |
Mar-2020 | Stay home COVID messages |
Apr-2020 | Lockdown |
May-2020 | Lockdown and Migrant worker crisis |
Jun-2020 | China-India border skirmishes |
Jul-2020 | Family+Student, possibly related to school shutdowns due to COVID |
Aug-2020 | Family+Student, possibly related to school shutdowns due to COVID |
Sep-2020 | Thank you tweets |
Oct-2020 | Thank you tweets |
Nov-2020 | Diwali wishes |
Dec-2020 | Farmers protest |
Jan-2021 | Farmers protest |
Feb-2021 | Birthday wishes to guru |
Mar-2021 | Woman+Thank you messages |
Apr-2021 | COVID 2nd wave, Oxygen and Hospital help |
May-2021 | COVID 2nd wave, Oxygen and Hospital help |
Jun-2021 | End of 2nd COVID wave, Happy messages |
Jul-2021 | Birthday wishes |
Aug-2021 | India England Cricket Series |
It is worth noting that many of the topics identified in the analysis contain words such as "happy," "thank," "wish," and "good," which appear repeatedly across almost all months. These topics are likely related to tweets expressing birthday wishes to famous individuals and the subsequent gratitude expressed by those individuals towards their followers. However, it is important to acknowledge that this pattern is potentially an artifact of how Twitter displays past search results when tweets are scraped by location.
I observed that the search results obtained from Twitter do not capture all the tweets matching a specific search parameter. Twitter seems to hide certain search results from its public query database. While I assume that Twitter's enterprise API (accessed, for example, through Tweepy) would provide more representative results, this remains speculative at this point. A visual comparison between the search results displayed on Twitter's webpage and the tweets scraped using Snscrape revealed that both methods retrieved a similar number of tweets for a given set of search criteria.
Therefore, it is important to consider the limitations of the data collection process and the potential bias introduced by Twitter's search result display when interpreting the identified topics and their evolution over time.
Conclusions
In conclusion, the topic modeling approach utilized in this analysis has demonstrated its ability to identify distinct topics and clusters of words associated with each topic. This provides valuable insights into the content and context of the tweets within the dataset. By analyzing the topics generated by the model, we can gain a better understanding of the conversations and trends that have taken place over a two-year period.
Furthermore, this topic modeling technique can be leveraged to enhance search query results by incorporating contextual information. By considering the identified topics and their associated words, search engines and recommendation systems can deliver more relevant and personalized results to users. This context-based approach has the potential to improve the user experience and provide more meaningful and accurate information.
However, it is important to acknowledge the limitations and challenges in the data collection process, particularly in relation to Twitter's search result display. The presence of certain artifacts, such as recurring phrases in birthday wishes and expressions of gratitude, should be taken into account when interpreting the identified topics. Future research could explore alternative data collection methods or APIs to overcome these limitations and ensure a more comprehensive representation of the underlying conversations.
Overall, the topic modeling approach presented here offers a valuable tool for understanding evolving trends and capturing the context of discussions on social media platforms. It opens up opportunities for more sophisticated information retrieval and recommendation systems that can cater to the specific needs and interests of users.