import math
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as st
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
a_data = pd.read_csv('amazon_bestsellers.csv')
a_data.index += 1
a_data.rename(columns={'Name': 'name',
                       'Author': 'author',
                       'User Rating': 'rating',
                       'Reviews': 'reviews',
                       'Price': 'price',
                       'Year': 'year',
                       'Genre': 'genre'}, inplace=True)
The internet has allowed consumers to easily rate many different kinds of products and services, including restaurants, Uber drivers, pop albums, and hotel rooms. These ratings let other consumers decide whether a particular product or service is worth its price. During the COVID-19 pandemic, the combination of extra free time and the need to spend most of that time at home has led many people to read more books than they previously did. A recent survey by Global English Editing suggests that 35% of people worldwide have read more books than usual over the past year, while 14% have read significantly more than usual (https://geediting.com/world-reading-habits-2020/). Accordingly, it has become more important for readers to be able to accurately judge whether they will enjoy a book before committing to buy it.
In this project, we will analyze open-source data about the top 50 bestselling books on Amazon every year from 2009 to 2019. We want to investigate the relationship between book rating and book price, as well as how prices and ratings have changed over that timeframe.
We restrict our data to the top 50 bestselling books to eliminate outlier data consisting of many books that are poorly rated. Our data set includes book metadata (name, author, genre, year), average user rating, how many reviews created that rating, and the price of that book on Amazon. Genre is actually a simplified category identifying each book as either fiction or non-fiction. Lastly, some of the authors in the dataset are not individuals but are organizations that have written and published books, such as the American Psychological Association. Below is a sample of the dataset.
a_data.head()
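As a quick sanity check on the schema described above, one can confirm that the genre column really is a two-level category. The sketch below uses a small toy frame standing in for `a_data`; the values are illustrative, not from the real dataset.

```python
import pandas as pd

# Toy stand-in for a_data with the same columns described above
# (values are made up for illustration).
toy = pd.DataFrame({
    'name': ['Book A', 'Book B', 'Book C'],
    'author': ['X', 'Y', 'X'],
    'rating': [4.7, 4.2, 4.8],
    'reviews': [1200, 800, 4300],
    'price': [12, 8, 20],
    'year': [2009, 2010, 2010],
    'genre': ['Fiction', 'Non Fiction', 'Fiction'],
})

# Genre is a simplified two-level category, so value_counts() should
# contain only 'Fiction' and 'Non Fiction'.
genre_counts = toy['genre'].value_counts()
print(genre_counts)
```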
First, we will try to determine if there is any correlation between rating and price. We hypothesize that more highly rated books would be in higher demand and would thus demand higher prices. Here is a violin plot of book price versus book rating.
fig, ax = plt.subplots(figsize=(14,8.65))
ax = sns.violinplot(x="rating", y="price", data=a_data, color='lightseagreen')
plt.xlabel("User Rating", size=14)
plt.ylabel("Price ($)", size=14)
plt.title("Top 50 Amazon Book Prices vs. Ratings (2009 - 2019)", size=16)
plt.show()
The violin plot does not actually confirm or deny our hypothesis. To determine whether a correlation exists between price and rating, we'll perform regression analysis. We'll also go ahead and make a scatter plot of the data with the regression line.
X = a_data['rating'].values
y = a_data['price'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train.reshape(-1,1), y_train)
coef50 = round(model.coef_[0], 3)
intercept50 = round(model.intercept_, 3)
print('Linear regression model given by:')
print('y = {}x'.format(coef50), end='')
if intercept50 < 0:
    print(' - {}'.format(abs(intercept50)))
else:
    print(' + {}'.format(intercept50))
fig, ax = plt.subplots(figsize=(14,8.65))
plt.xlabel("User Rating", size=14)
plt.ylabel("Price ($)", size=14)
plt.title("Top 50 Amazon Book Prices vs. Ratings (2009 - 2019) w/ Regression", size=16)
ax.scatter(X,y)
x_lin = np.linspace(np.amin(X), np.amax(X), 100)
y_lin = intercept50 + (coef50 * x_lin)
plt.plot(x_lin, y_lin, color='red')
plt.show()
So, our hypothesis was wrong! Even though there is a correlation between rating and price, it is actually in the opposite direction of what we expected. For each 1-point increase in average user rating, book prices decrease by \$5.64.
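To quantify how strong this negative relationship is, one could also compute a Pearson correlation coefficient with `scipy.stats.pearsonr`. The sketch below uses synthetic rating/price data with a built-in negative slope standing in for the real columns, so the specific numbers are illustrative only.

```python
import numpy as np
import scipy.stats as st

# Synthetic stand-ins for a_data['rating'] and a_data['price'] with a
# built-in negative relationship (slope chosen near the fitted -5.64).
rng = np.random.default_rng(0)
rating = rng.uniform(3.3, 4.9, size=200)
price = 40 - 5.64 * rating + rng.normal(0, 2, size=200)

# Pearson r measures both the direction and strength of the linear
# association; its p-value tests r against zero.
r, p = st.pearsonr(rating, price)
print(f"Pearson r = {r:.3f} (p = {p:.3g})")
```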
A convenient consequence of this graphic is that we can quickly determine which books are good deals. Data points located below the regression line represent books that have a lower-than-expected price given their rating while points above the line represent books with a higher-than-expected price given their rating. Therefore, books below the line can be considered bargains while those above the line can be considered overpriced.
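This bargain test is easy to sketch in code: a book whose actual price falls below the price predicted by the regression line is flagged as a bargain. The slope, intercept, and book rows below are hypothetical placeholders, not the fitted values.

```python
import pandas as pd

# Hypothetical slope/intercept standing in for the fitted regression line.
coef, intercept = -5.64, 34.0

# Two made-up books with the same rating but different prices.
books = pd.DataFrame({
    'name': ['Cheap Gem', 'Pricey Tome'],
    'rating': [4.8, 4.8],
    'price': [5.0, 25.0],
})

# Bargain: actual price below the regression prediction for that rating.
books['predicted'] = intercept + coef * books['rating']
books['bargain'] = books['price'] < books['predicted']
print(books[['name', 'price', 'predicted', 'bargain']])
```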
Even though our data only covers 10 years, it may be worthwhile to look at how the top 50 book prices changed over those ten years. While it is unlikely that any drastic change in price occurred during that time period, it's possible that the recovery from the '07-'08 financial crisis may have caused prices to increase slightly. Again, we'll plot the data using a violin plot and then calculate a linear regression model.
fig, ax = plt.subplots(figsize=(14,8.65))
ax = sns.violinplot(x="year", y="price", data=a_data, color='lightseagreen')
plt.xlabel("Year", size=14)
plt.ylabel("Price ($)", size=14)
plt.title("Top 50 Amazon Book Prices by Year (2009 - 2019)", size=16)
plt.show()
X = a_data['year'].values
y = a_data['price'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train.reshape(-1,1), y_train)
coef = round(model.coef_[0], 3)
intercept = round(model.intercept_, 3)
print('Linear regression model given by:')
print('y = {}x'.format(coef), end='')
if intercept < 0:
    print(' - {}'.format(abs(intercept)))
else:
    print(' + {}'.format(intercept))
So, as we predicted, prices changed very little over the ten years covered by our dataset. However, instead of a slight increase there was actually a slight decrease of \$0.38 per year, an amount that is small relative to the \$0 to \$105 range of book prices.
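Another way to see the flatness of the trend is to average prices within each year via a groupby. The sketch below uses a toy year/price frame standing in for `a_data`, with made-up values.

```python
import pandas as pd

# Toy year/price frame standing in for a_data; the mean price per year
# summarizes the (weak) trend discussed above.
toy = pd.DataFrame({
    'year': [2009, 2009, 2010, 2010, 2011],
    'price': [15, 11, 14, 10, 12],
})
yearly_mean = toy.groupby('year')['price'].mean()
print(yearly_mean)
```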
We also hypothesize that people would be less critical of a book that they paid less money for. In other words, if I pay a lot of money for a book it would have to be amazing for me to leave a 5-star review. However, I might leave a 5-star review for a cheap book that was just ok. To test for this, we are going to compare the number of reviews a book receives versus its price, and the overall rating versus the number of reviews. The idea is that cheap books would be bought more often and thus reviewed more, and books with more reviews would be higher rated indicating that cheap books receive a higher rating.
reviews_df = a_data.copy(deep=True)
# Express review counts in thousands for readability.
reviews_df['reviews'] = reviews_df['reviews'] / 1000
fig, ax = plt.subplots(figsize=(14,8.65))
ax = sns.violinplot(x="reviews", y="price", data=reviews_df, color='lightseagreen')
plt.xlabel("Number of Reviews (in 1000s)", size=14)
plt.ylabel("Price ($)", size=14)
plt.title("Top 50 Amazon Book Prices vs. Number of Reviews (2009 - 2019)", size=16)
plt.show()
Based on this violin plot it does not look like the price of the book has a huge effect on the number of reviews a book receives. However, just to confirm, we will also do some regression analysis.
X = reviews_df['reviews'].values
y = reviews_df['price'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train.reshape(-1,1), y_train)
coef_count1 = round(model.coef_[0], 3)
intercept_count1 = round(model.intercept_, 3)
print('Linear regression model given by:')
print('y = {}x'.format(coef_count1), end='')
if intercept_count1 < 0:
    print(' - {}'.format(abs(intercept_count1)))
else:
    print(' + {}'.format(intercept_count1))
fig, ax = plt.subplots(figsize=(14,8.65))
plt.xlabel("Number of Reviews (in 1000s)", size=14)
plt.ylabel("Price ($)", size=14)
plt.title("Top 50 Amazon Book Prices vs. Number of Reviews (2009 - 2019) w/ Regression", size=16)
ax.scatter(X,y)
x_lin = np.linspace(np.amin(X), np.amax(X), 100)
y_lin = intercept_count1 + (coef_count1 * x_lin)
plt.plot(x_lin, y_lin, color='red')
plt.show()
This shows that there is only a very slight correlation, if any, between the number of reviews a book receives and its price. This alone is enough to disprove the hypothesis, but out of curiosity, we also want to check whether there is any correlation between a book's rating and the number of reviews it receives.
fig, ax = plt.subplots(figsize=(14,8.65))
ax = sns.violinplot(x="reviews", y="rating", data=reviews_df, color='lightseagreen')
plt.xlabel("Number of Reviews (in 1000s)", size=14)
plt.ylabel("Rating", size=14)
plt.title("Top 50 Amazon Book Rating vs. Number of Reviews (2009 - 2019)", size=16)
plt.show()
X = reviews_df['rating'].values
y = reviews_df['reviews'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train.reshape(-1,1), y_train)
coef_count2 = round(model.coef_[0], 3)
intercept_count2 = round(model.intercept_, 3)
print('Linear regression model given by:')
print('y = {}x'.format(coef_count2), end='')
if intercept_count2 < 0:
    print(' - {}'.format(abs(intercept_count2)))
else:
    print(' + {}'.format(intercept_count2))
fig, ax = plt.subplots(figsize=(14,8.65))
plt.xlabel("Rating", size=14)
plt.ylabel("Number of Reviews (in 1000s)", size=14)
plt.title("Top 50 Amazon Book Ratings vs. Number of Reviews (2009 - 2019) w/ Regression", size=16)
ax.scatter(X,y)
x_lin = np.linspace(np.amin(X), np.amax(X), 100)
y_lin = intercept_count2 + (coef_count2 * x_lin)
plt.plot(x_lin, y_lin, color='red')
plt.show()
Surprisingly, there also doesn't appear to be much of a correlation between the number of reviews that a book receives and the rating that it gets. This completely shatters the hypothesis that cheaper books would tend to get more reviews and thus a higher rating.
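One compact way to double-check all three pairwise relationships at once is a correlation matrix via `DataFrame.corr()`. The sketch below uses synthetic data standing in for the real columns, built with a deliberately negative rating-price relationship and review counts that are independent of both.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a_data: rating-price is built to be negatively
# related, while review counts are independent of both.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'rating': rng.uniform(3.3, 4.9, size=300),
    'reviews': rng.integers(100, 80000, size=300).astype(float),
})
toy['price'] = 40 - 5.64 * toy['rating'] + rng.normal(0, 2, size=300)

# .corr() reports every pairwise Pearson r in one table.
corr = toy[['price', 'rating', 'reviews']].corr()
print(corr.round(3))
```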
To determine who the best rated and most/least expensive authors are, we will narrow our dataset to include only the top 20 authors based on how many books they have had on the top 50 lists for the given time period. This ensures that authors with few data points do not pollute our data. To make this easy, we add a new column which represents the number of top 50 appearances for the author on a given row. Then we create a new data frame of the top 20 authors based on that metric. For this dataset, that is equivalent to the authors who have had 6 or more appearances on the top 50 lists.
Once we have this new DataFrame, we will create yet another DataFrame containing only three columns: each author from the top 20 list, the average rating for that author, and the average book price for that author.
# Count each author's Top 50 appearances.
a_data['count'] = a_data.groupby('author')['author'].transform('count')
top20 = a_data[a_data['count'] >= 6]
top20 = top20.drop(columns=['reviews', 'year', 'count'])
top20_avgs = top20.groupby('author')[['rating', 'price']].mean()
top20_avgs.rename(columns={'rating': 'avg_rating', 'price': 'avg_price'}, inplace=True)
top20_avgs
The simplicity of this DataFrame will make it easy to work with. Similar to what we did in Part 1, we will now plot the average price versus the average rating and create a regression model.
X = top20_avgs['avg_rating'].values
y = top20_avgs['avg_price'].values
labels = top20_avgs.index.values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train.reshape(-1,1), y_train)
coef20 = round(model.coef_[0], 3)
intercept20 = round(model.intercept_, 3)
print('Linear regression model given by:')
print('y = {}x'.format(coef20), end='')
if intercept20 < 0:
    print(' - {}'.format(abs(intercept20)))
else:
    print(' + {}'.format(intercept20))
fig, ax = plt.subplots(figsize=(14,8.65))
plt.xlabel("Average Rating", size=14)
plt.ylabel("Average Price ($)", size=14)
plt.title("Top 20 Amazon Author Average Book Prices vs. Average Ratings (2009 - 2019) w/ Regression", size=16)
ax.scatter(X,y)
for index, author in enumerate(labels):
    ax.annotate(author, (X[index] + 0.01, y[index]))
x_lin = np.linspace(np.amin(X), np.amax(X), 100)
y_lin = intercept20 + (coef20 * x_lin)
plt.plot(x_lin, y_lin, color='red')
plt.show()
There is a little bit to unpack in this graphic. First, given our axes, authors are sorted by rating along the x-axis and by price along the y-axis, so the best and worst authors in both categories stick out right away.
The American Psychological Association and Gallup are not traditional individual authors, so let's take a look at what they actually publish to see what makes them the most expensive and lowest-rated authors, respectively.
print('American Psychological Association:')
for book in a_data.loc[a_data['author'] == 'American Psychological Association']['name'].unique():
    print(book)

print('\nGallup:')
for book in a_data.loc[a_data['author'] == 'Gallup']['name'].unique():
    print(book)
A quick web search reveals that the APA publication is a style manual often used for professional bibliographies and citations, while the Gallup publication is a management book used to determine the reader's professional strengths.
Our graphic also confirms something we discovered earlier: the inverse correlation between rating and price. Here, however, that correlation is even more pronounced. For every 1-point increase in average rating, there is an expected decrease in average price of \$16.65.
Lastly, like we showed in Part 1, given their average rating, authors above the regression line are more expensive than expected and those below are less expensive than expected. If we calculate the residuals for each author, we can order them by residual and determine the least expensive authors given their ratings.
# Residual: actual average price minus the price predicted by the regression.
top20_avgs['res'] = top20_avgs['avg_price'] - (coef20 * top20_avgs['avg_rating'] + intercept20)
top20_avgs.sort_values(by='res')
The resulting DataFrame shows which authors with 6 or more appearances on Amazon's Top 50 lists have the lowest average price given their average rating.
The method used to determine the best-priced authors given their ratings can also be used to determine the best-priced books given their ratings. Since we have already calculated a linear model for price versus rating over all of the books in our dataset, we can simply add a residual column to the DataFrame and calculate.
# Residual for each book relative to the rating-vs-price regression.
a_data['res'] = a_data['price'] - (coef50 * a_data['rating'] + intercept50)
a_data.sort_values(by='res')
The resulting DataFrame tells us that the best-priced book in our dataset is Disney's Journey to the Ice Palace, while the worst-priced book is yet another publication from the APA.
Lastly, we want to check how our regression model compares to models based on other datasets. To do this, we will first import a dataset gathered from Google Books. Like the Amazon data, the Google Books data lists book prices as well as an average rating for each book on a scale from 1 to 5. Since we only need price and rating data, we are going to remove most of the other columns.
One important thing to note is that the original dataset lists prices in Saudi Arabian Riyals (SAR). We will convert prices to US Dollars (USD) using the rate of 1 SAR to 0.27 USD, as quoted by Morningstar on December 16, 2020 at 15:37 UTC.
g_data = pd.read_csv('google_books.csv')
g_data.index += 1
g_data = g_data.drop(columns=['Unnamed: 0', 'description', 'publisher', 'page_count', 'generes', 'ISBN', 'language'])
g_data['price'] *= 0.27
g_data.dropna(subset=['rating', 'price'], inplace=True)
g_data.head()
Like we did with the Amazon data, we will create a linear regression model and determine the slope and intercept of the regression line. We are also going to rename some variables so that they match the notation used in the calculations below.
b1 = coef50
X1 = a_data['rating'].values
Y1 = a_data['price'].values
a_intercept = intercept50
X2 = g_data['rating'].values
Y2 = g_data['price'].values
X_train, X_test, y_train, y_test = train_test_split(X2, Y2, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train.reshape(-1,1), y_train)
b2 = round(model.coef_[0], 3)
g_intercept = round(model.intercept_, 3)
print('Linear regression model given by:')
print('y = {}x'.format(b2), end='')
if g_intercept < 0:
    print(' - {}'.format(abs(g_intercept)))
else:
    print(' + {}'.format(g_intercept))
fig, ax = plt.subplots(figsize=(14,8.65))
plt.xlabel("User Rating", size=14)
plt.ylabel("Price ($)", size=14)
plt.title("Google Book Prices vs. Ratings w/ Regression", size=16)
ax.scatter(X2, Y2)
x_lin = np.linspace(np.amin(X2), np.amax(X2), 100)
y_lin = g_intercept + (b2 * x_lin)
plt.plot(x_lin, y_lin, color='red')
plt.show()
The Google data and regression look very similar to those from the Amazon data. However, the slopes (-5.64 for Amazon and -6.82 for Google) differ somewhat. To determine whether the true slope for the Amazon Top 50 books is indeed greater (less steeply negative) than that for the Google Books list, we will perform a two-sample hypothesis test for slopes. This will be conducted in four steps: establishing the null and alternate hypotheses, calculating the test statistic, calculating the p-value, and comparing the p-value to our significance level. We will use a fairly standard significance level of α = 0.05.
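The test statistic for comparing two independently estimated slopes takes the form

$$ z = \frac{b_1 - b_2}{\sqrt{SE_{b_1}^2 + SE_{b_2}^2}}, \qquad SE_{b_i}^2 = \frac{\sum_j \left(y_j - \hat{y}_j\right)^2}{(n_i - 2)\sum_j \left(x_j - \bar{x}\right)^2}, $$

where $b_i$ are the fitted slopes and $n_i$ the sample sizes of the two datasets.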
# Calculate SEb1 (the squared standard error of the Amazon slope)
m = len(a_data['rating'])
x1_bar = X1.sum() / m
SEb1_num = ((Y1 - (b1*X1 + a_intercept))**2).sum()
SEb1_den = (m - 2) * ((X1 - x1_bar)**2).sum()
SEb1 = SEb1_num / SEb1_den
# Calculate SEb2 (the squared standard error of the Google slope)
n = len(g_data['rating'])
x2_bar = X2.sum() / n
SEb2_num = ((Y2 - (b2*X2 + g_intercept))**2).sum()
SEb2_den = (n - 2) * ((X2 - x2_bar)**2).sum()
SEb2 = SEb2_num / SEb2_den
SEb2
# Calculate test statistic
z_cal = (b1 - b2) / math.sqrt(SEb1 + SEb2)
z_cal
p_val = 1 - st.norm.cdf(z_cal)
p_val
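For reference, the whole two-sample slope test can be sketched end-to-end on synthetic data. The helper name `slope_and_se2` and the two-sided p-value here are illustrative choices, not the exact procedure above, and the two samples are stand-ins for the Amazon and Google data.

```python
import math
import numpy as np
import scipy.stats as st

def slope_and_se2(x, y):
    """OLS slope and its squared standard error (illustrative helper)."""
    b, a = np.polyfit(x, y, 1)  # slope, intercept
    resid = y - (b * x + a)
    se2 = (resid ** 2).sum() / ((len(x) - 2) * ((x - x.mean()) ** 2).sum())
    return b, se2

# Two synthetic samples with the same true slope, standing in for the
# Amazon and Google rating/price data.
rng = np.random.default_rng(0)
x1 = rng.uniform(3, 5, 200)
y1 = 40 - 5.6 * x1 + rng.normal(0, 2, 200)
x2 = rng.uniform(3, 5, 300)
y2 = 40 - 5.6 * x2 + rng.normal(0, 2, 300)

b1, se2_1 = slope_and_se2(x1, y1)
b2, se2_2 = slope_and_se2(x2, y2)

# z-statistic for the difference in slopes; a two-sided p-value here,
# unlike the one-sided test above.
z = (b1 - b2) / math.sqrt(se2_1 + se2_2)
p = 2 * (1 - st.norm.cdf(abs(z)))
print(f"z = {z:.3f}, p = {p:.3f}")
```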
To summarize, the analysis of our datasets resulted in a number of conclusions that help to clarify the relationships between book prices, ratings, numbers of reviews, time, and even how different datasets relate to one another. The main takeaways from our analysis are: