The Student Room Group

Need help with this Statistics problem

I am doing research to find out whether there's a correlation between Twitter sentiment and sales, and I'm doing it for two different companies. They are in the same industry and are direct competitors. The data is quarterly, from Q1 2014 to Q4 2021 (28 data points each). I found the quarterly sales numbers easily because they are public companies. For Twitter, I collected the tweets with the Twitter Premium API v1.1 full archive search. I will not say the method of sentiment analysis.

My H0 is "there is no correlation between Twitter sentiment and sales" and my H1 is "there is a correlation between Twitter sentiment and sales".

For company A, my p-value is < 0.05 and for company B, my p-value is > 0.05. How should I write the conclusion when I reject the H0 for one company and fail to reject the H0 for the other?

Thank you.
Is this a university project, or ....

Must admit, I'd be very wary about reading too much into any relatively simple analysis on 28 data points, given the number of variables that would affect sales. Trying to correlate sales changes with Twitter posts is going to be problematic at best. Your data could easily be affected by
* it includes the covid pandemic period at the end
* if the companies are in competition, do one company's sales affect the other's
* how would you recognise an A company versus a B company if they give different results
* ....
never mind all the other explanatory variables for real world data, which could completely change the meaning you read into a correlation (conditional dependence/independence).
Reply 2
Yes, it's a dissertation.
Original post by reyjusuf
Yes, it's a dissertation.

Can you provide any more info regarding what you did / how you resolved problems like the above? Even a scatter plot of the data would be useful.
Reply 4
Original post by mqb2766
Can you provide any more info regarding what you did / how you resolved problems like the above? Even a scatter plot of the data would be useful.

Basically, for each company I did the following:
* Collected quarterly sales data from their websites.
* Collected tweets from the corresponding quarters (28 times) using the hashtag #(name of the company) as the keyword. I used the Twitter full_archive_search Premium API: https://developer.twitter.com/en/docs/twitter-api/premium/search-api/quick-start/premium-full-archive
* Performed sentiment analysis using Multinomial NB (I previously tested multiple methods and found Multinomial NB to have the highest accuracy) to classify the tweets as positive, negative or neutral.
* For each quarter, counted the number of positive, negative and neutral tweets and scored them using the formula (Positive - Negative)/(Positive + Negative + Neutral). This score is the number I'm correlating with sales, and I used Excel's CORREL function to find the correlation coefficient.

For company 1, it's 0.39
For company 2, it's 0.24.
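
As a rough sketch of the scoring and correlation steps above (not the exact workflow used, which was done in Excel), the same calculation could look like this in Python. The function names and the tweet counts are made up for illustration; pearson_r computes the Pearson coefficient, which is what CORREL returns.

```python
import math

def sentiment_score(positive, negative, neutral):
    """Net sentiment for one quarter: (P - N) / (P + N + Neutral)."""
    total = positive + negative + neutral
    if total == 0:
        raise ValueError("no tweets in this quarter")
    return (positive - negative) / total

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up counts for three quarters: (positive, negative, neutral)
counts = [(60, 20, 20), (30, 30, 40), (10, 50, 40)]
scores = [sentiment_score(p, n, u) for p, n, u in counts]
```

In practice scores would hold 28 values per company and be passed to pearson_r together with the 28 quarterly sales figures.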

I needed to see if it's significant or not so I found the p-value using this:
https://opentextbc.ca/introstatopenstax/chapter/testing-the-significance-of-the-correlation-coefficient/

Since I have 28 data points, my degrees of freedom in the t-table are 26, so my t-value needs to be above 1.706 (one-tailed test with 95% confidence level).
For company 1, t is 2.18 and for company 2, t is 1.27.

Why one-tailed?
Because I want to test whether Twitter sentiment has a POSITIVE correlation with sales (a negative correlation has no practical use).
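
For reference, the significance check can be sketched in Python as below. This assumes the usual test statistic t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom, and takes the critical value 1.706 from the t-table as stated above. The function name is made up for illustration. Plugging in the rounded coefficients 0.39 and 0.24 gives t values of about 2.16 and 1.26, slightly below the 2.18 and 1.27 quoted above, presumably because those were computed from unrounded coefficients.

```python
import math

def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# One-tailed, 5% level, 26 degrees of freedom (value read from the t-table)
T_CRITICAL = 1.706

for name, r in [("company 1", 0.39), ("company 2", 0.24)]:
    t = t_statistic(r, 28)
    verdict = "reject H0" if t > T_CRITICAL else "fail to reject H0"
    print(name, round(t, 2), verdict)
```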

I'm not familiar with how you've developed/used the naive Bayes classifier to encode the tweets each quarter, but my reservations were about a simple correlation of tweets and sales that doesn't address things like: other explanatory variables (factored in or not), being able to distinguish the companies in the first place, and the linear correlation assumptions (have you eyeballed the scatter plots you're trying to correlate and the time series data, and are there any outliers which would have high leverage on the correlation values?). I'd probably spend more time explaining those things in the conclusion than discussing a difference in the results. If you're claiming some form of significance between tweet sentiment and sales, how can you have any confidence that it's not due to some other effect, such as an advertising push? One well-known example is the correlation between ice cream sales and murders.
https://slate.com/news-and-politics/2013/07/warm-weather-homicide-rates-when-ice-cream-sales-rise-homicides-rise-coincidence.html

You don't say whether you're predicting the actual sales value or some form of percentage rise/fall. If it's the latter, perhaps you could combine the data to represent some form of generic company model? But given what you've described, if I were writing a conclusion, I would spend more time on justifying/analysing the data (does it have an underlying linear structure, what are the noise/errors like, are there many outliers in the time series or residual correlation plots, why did you not look at other explanatory variables for sales change in the analysis ...) and honestly spend little time discussing the final significances, as it will be easy to argue about them. State the figures, say they're a bit inconclusive, but if there are outliers (for instance), see if you can tell a story around them / discuss their leverage on the solution.
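
One simple way to get a feel for the leverage of individual points is to recompute the correlation with each quarter left out in turn. A minimal Python sketch follows. The function names and the five data points are made up for illustration, and this is only one of several possible diagnostics.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def leave_one_out_r(xs, ys):
    """Correlation recomputed with each point dropped in turn."""
    return [
        pearson_r(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        for i in range(len(xs))
    ]

# Made-up data where the last point is an outlier in both variables
xs = [1, 2, 3, 4, 10]
ys = [2, 1, 2, 1, 10]
full_r = pearson_r(xs, ys)
loo = leave_one_out_r(xs, ys)
```

If dropping a single quarter changes the coefficient a lot (here the sign flips when the last point is removed), the headline correlation is being driven by that quarter and it deserves its own discussion.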
