In [1]:
import numpy as np, pandas as pd, seaborn as sns, scipy.stats as st, math
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

a_data = pd.read_csv('amazon_bestsellers.csv')
a_data.index += 1
a_data.rename(columns = {'Name'        : 'name',
                         'Author'      : 'author',
                         'User Rating' : 'rating',
                         'Reviews'     : 'reviews',
                         'Price'       : 'price',
                         'Year'        : 'year',
                         'Genre'       : 'genre'}, inplace = True) 

Analysis of Book Data from Amazon

Clayton Durkin, Hyen-Tyae Jeong, Thomas Geisler

The internet has allowed consumers to easily rate many different kinds of products and services including restaurants, Uber drivers, pop albums, and hotel rooms. These ratings allow other consumers to determine whether or not to pay a particular price for these products and services. During the current COVID-19 pandemic, the combination of more free time and having to spend much of that extra free time at home has resulted in many people reading more books than they previously have. A recent survey by Global English Editing suggests that 35% of people worldwide have read more books than usual over the past year while 14% have read significantly more than usual (https://geediting.com/world-reading-habits-2020/). Accordingly, it has become more important for people to be able to accurately determine whether they will enjoy a book before committing to buy it.

In this project, we will analyze open-source data about the top 50 bestselling books on Amazon every year from 2009 - 2019. We want to investigate the relationship between book rating and book price as well how prices and ratings have changed over the relevant timeframe.

We restrict our data to the top 50 bestselling books to eliminate outlier data consisting of many books that are poorly rated. Our data set includes book metadata (name, author, genre, year), average user rating, how many reviews created that rating, and the price of that book on Amazon. Genre is actually a simplified category identifying each book as either fiction or non-fiction. Lastly, some of the authors in the dataset are not individuals but are organizations that have written and published books, such as the American Psychological Association. Below is a sample of the dataset.

In [2]:
a_data.head()
Out[2]:
name author rating reviews price year genre
1 10-Day Green Smoothie Cleanse JJ Smith 4.7 17350 8 2016 Non Fiction
2 11/22/63: A Novel Stephen King 4.6 2052 22 2011 Fiction
3 12 Rules for Life: An Antidote to Chaos Jordan B. Peterson 4.7 18979 15 2018 Non Fiction
4 1984 (Signet Classics) George Orwell 4.7 21424 6 2017 Fiction
5 5,000 Awesome Facts (About Everything!) (Natio... National Geographic Kids 4.8 7665 12 2019 Non Fiction

Part 1: Price Versus Rating

First, we will try to determine if there is any correlation between rating and price. We hypothesize that more highly rated books would be in higher demand and would thus demand higher prices. Here is a violin plot of book price versus book rating.

In [3]:
fig, ax = plt.subplots(figsize=(14,8.65))
ax = sns.violinplot(x="rating", y="price", data=a_data, color='lightseagreen')
plt.xlabel("User Rating", size=14)
plt.ylabel("Price ($)", size=14)
plt.title("Top 50 Amazon Book Prices vs. Ratings (2009 - 2019)", size=16)
plt.show()

The violin plot does not actually confirm or deny our hypothesis. To determine whether a correlation exists between price and rating, we'll perform regression analysis. We'll also go ahead and make a scatter plot of the data with the regression line.

In [4]:
X = a_data['rating'].values
y = a_data['price'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train.reshape(-1,1), y_train)
coef50 = round(model.coef_[0], 3)
intercept50 = round(model.intercept_, 3)
print('Linear regression model given by:')
print('y = {}x'.format(coef50), end='')
if (intercept50 < 0):
    print(' - {}'.format(abs(intercept50)))
else:
    print(' + {}'.format(intercept50))
fig, ax = plt.subplots(figsize=(14,8.65))
plt.xlabel("User Rating", size=14)
plt.ylabel("Price ($)", size=14)
plt.title("Top 50 Amazon Book Prices vs. Ratings (2009 - 2019) w/ Regression", size=16)
ax.scatter(X,y)
x_lin = np.linspace(np.amin(X), np.amax(X), 100)
y_lin = intercept50 + (coef50 * x_lin)
plt.plot(x_lin, y_lin, color='red')
plt.show()
Linear regression model given by:
y = -5.64x + 38.734

So, our hypothesis was wrong! Even though there is a correlation between rating and price, it is actually in the opposite direction of what we expected. For each 1-point increase in average user rating, book prices decrease by \$5.64.

A convenient consequence of this graphic is that we can quickly determine which books are good deals. Data points located below the regression line represent books that have a lower-than-expected price given their rating while points above the line represent books with a higher-than-expected price given their rating. Therefore, books below the line can be considered bargains while those above the line can be considered overpriced.

Part 2A: Price versus Year

Even though our data only covers 10 years, it may be worthwhile to look at how the top 50 book prices changed over those ten years. While it is unlikely that any drastic change in price occurred during that time period, it's possible that the recovery from the '07-'08 financial crisis may have caused prices to increase slightly. Again, we'll plot the data using a violin plot and then calculate a linear regression model.

In [5]:
fig, ax = plt.subplots(figsize=(14,8.65))
ax = ax = sns.violinplot(x="year", y="price", data=a_data, color='lightseagreen')
plt.xlabel("User Rating", size=14)
plt.ylabel("Price ($)", size=14)
plt.title("Top 50 Amazon Book Prices vs. Ratings (2009 - 2019)", size=16)
plt.show()
X = a_data['year'].values
y = a_data['price'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train.reshape(-1,1), y_train)
coef = round(model.coef_[0], 3)
intercept = round(model.intercept_, 3)
print('Linear regression model given by:')
print('y = {}x'.format(coef), end='')
if (intercept < 0):
    print(' - {}'.format(abs(intercept)))
else:
    print(' + {}'.format(intercept))
Linear regression model given by:
y = -0.38x + 778.358

So, as we predicted, prices changed very little over the ten years covered by our dataset. However, instead of a slight increase, there was actually a slight decrease of \$0.38 per year. However, that amount is not very significant considering that book prices range from \$0 to \$105.

Part 2B: Price versus Number of Reviews

We also hypothesize that people would be less critical of a book that they paid less money for. In other words, if I pay a lot of money for a book it would have to be amazing for me to leave a 5-star review. However, I might leave a 5-star review for a cheap book that was just ok. To test for this, we are going to compare the number of reviews a book receives versus its price, and the overall rating versus the number of reviews. The idea is that cheap books would be bought more often and thus reviewed more, and books with more reviews would be higher rated indicating that cheap books receive a higher rating.

In [6]:
reviews_df = a_data.copy(deep=True)

for index, row in reviews_df.iterrows():
    temp = reviews_df.loc[index]['reviews']
    temp = temp/1000
    reviews_df.at[index, 'reviews'] = temp
    
fig, ax = plt.subplots(figsize=(14,8.65))
ax = sns.violinplot(x="reviews", y="price", data=reviews_df, color='lightseagreen')
plt.xlabel("Number of Reviews (in 1000s)", size=14)
plt.ylabel("Price ($)", size=14)
plt.title("Top 50 Amazon Book Prices vs. Number of Reviews (2009 - 2019)", size=16)
plt.show()

Based on this violin plot it does not look like the price of the book has a huge effect on the number of reviews a book receives. However, just to confirm, we will also do some regression analysis.

In [7]:
X = reviews_df['reviews'].values
y = reviews_df['price'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train.reshape(-1,1), y_train)
coef_count1 = round(model.coef_[0], 3)
intercept_count1 = round(model.intercept_, 3)
print('Linear regression model given by:')
print('y = {}x'.format(coef_count1), end='')
if (intercept_count1 < 0):
    print(' - {}'.format(abs(intercept_count1)))
else:
    print(' + {}'.format(intercept_count1))
fig, ax = plt.subplots(figsize=(14,8.65))
plt.xlabel("Number of Reviews (in 1000s)", size=14)
plt.ylabel("Price ($)", size=14)
plt.title("Top 50 Amazon Book Prices vs. Number of Reviews (2009 - 2019) w/ Regression", size=16)
ax.scatter(X,y)
x_lin = np.linspace(np.amin(X), np.amax(X), 100)
y_lin = intercept_count1 + (coef_count1 * x_lin)
plt.plot(x_lin, y_lin, color='red')
plt.show()
Linear regression model given by:
y = -0.078x + 13.579

This shows that there is only a very slight, if any correlation between the number of reviews a book receives and the price of the book. This is alone is enough to disprove the hypothesis, but just for curiosity, we want to check if there is any correlation between the rating of the book and the number of reviews it receives.

In [8]:
fig, ax = plt.subplots(figsize=(14,8.65))
ax = sns.violinplot(x="reviews", y="rating", data=reviews_df, color='lightseagreen')
plt.xlabel("Number of Reviews (in 1000s)", size=14)
plt.ylabel("Rating", size=14)
plt.title("Top 50 Amazon Book Rating vs. Number of Reviews (2009 - 2019)", size=16)
plt.show()

X = reviews_df['rating'].values
y = reviews_df['reviews'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train.reshape(-1,1), y_train)
coef_count2 = round(model.coef_[0], 3)
intercept_count2 = round(model.intercept_, 3)
print('Linear regression model given by:')
print('y = {}x'.format(coef_count2), end='')
if (intercept_count2 < 0):
    print(' - {}'.format(abs(intercept_count)))
else:
    print(' + {}'.format(intercept_count2))
fig, ax = plt.subplots(figsize=(14,8.65))
plt.xlabel("Rating", size=14)
plt.ylabel("Number of Reviews (in 1000s)", size=14)
plt.title("Top 50 Amazon Book Ratings vs. Number of Reviews (2009 - 2019) w/ Regression", size=16)
ax.scatter(X,y)
x_lin = np.linspace(np.amin(X), np.amax(X), 100)
y_lin = intercept_count2 + (coef_count2 * x_lin)
plt.plot(x_lin, y_lin, color='red')
plt.show()
Linear regression model given by:
y = -1.286x + 17.572

Surprisingly, there also doesn't appear to be much of a correlation between the number of views that a book receives and the rating that it gets. This completely shatters he hypothesis that cheaper books would tend to get more reviews and thus a higher rating.

Part 3: Best Rated and Most Expensive Authors

To determine who the best rated and most/least expensive authors are, we will narrow our dataset to include only the top 20 authors based on how many books they have had on the top 50 lists for the given time period. This ensures that authors with few data points do not pollute our data. To make this easy, we add a new column which represents the number of top 50 appearances for the author on a given row. Then we create a new data frame of the top 20 authors based on that metric. For this dataset, that is equivalent to the authors who have had 6 or more appearances on the top 50 lists.

Once we have this new DataFrame, we will create yet another DataFrame containing only three columns: each author from the top 20 list, the average rating for that author, and the average book price for that author.

In [9]:
a_data['count'] = 0
for index,row in a_data.iterrows():
    a_data.at[index, 'count'] = (a_data['author']==row['author']).sum()
top20 = a_data[a_data['count'] >= 6]
top20 = top20.drop(columns=['reviews', 'year', 'count'])
top20_avgs = top20.groupby('author').mean()
top20_avgs.rename(columns={'rating': 'avg_rating', 'price': 'avg_price'}, inplace=True)
top20_avgs
Out[9]:
avg_rating avg_price
author
American Psychological Association 4.500000 46.000000
Bill O'Reilly 4.642857 10.571429
Dav Pilkey 4.900000 6.285714
Don Miguel Ruiz 4.700000 6.000000
Dr. Seuss 4.877778 8.666667
E L James 4.233333 15.333333
Eric Carle 4.900000 5.000000
Gallup 4.000000 17.000000
Gary Chapman 4.736364 17.181818
Harper Lee 4.600000 4.333333
J.K. Rowling 4.450000 20.166667
Jeff Kinney 4.800000 9.250000
Rick Riordan 4.772727 9.909091
Rob Elliott 4.562500 4.000000
Sarah Young 4.900000 8.000000
Stephen R. Covey 4.642857 20.571429
Stephenie Meyer 4.657143 19.857143
Stieg Larsson 4.600000 9.500000
Suzanne Collins 4.663636 13.363636
The College Board 4.383333 39.333333

The simplicity of this DataFrame will make it easy to work with. Similar to what we did in Part 1, we will now plot the average price versus the average rating and create a regression model.

In [10]:
X = top20_avgs['avg_rating'].values
y = top20_avgs['avg_price'].values
labels = top20_avgs.index.values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train.reshape(-1,1), y_train)
coef20 = round(model.coef_[0], 3)
intercept20 = round(model.intercept_, 3)
print('Linear regression model given by:')
print('y = {}x'.format(coef20), end='')
if (intercept20 < 0):
    print(' - {}'.format(abs(intercept20)))
else:
    print(' + {}'.format(intercept20))
fig, ax = plt.subplots(figsize=(14,8.65))
plt.xlabel("Average Rating", size=14)
plt.ylabel("Average Price ($)", size=14)
plt.title("Top 20 Amazon Author Average Book Prices vs. Average Ratings (2009 - 2019) w/ Regression", size=16)
ax.scatter(X,y)
for index,author in enumerate(labels):
    ax.annotate(author, (X[index] + 0.01, y[index]))
x_lin = np.linspace(np.amin(X), np.amax(X), 100)
y_lin = intercept20 + (coef20 * x_lin)
plt.plot(x_lin, y_lin, color='red')
plt.show()
Linear regression model given by:
y = -16.652x + 90.232

There is a little bit to unpack with this graphic. First, given our axes, authors are sorted by rating along the x-axis and are sorted by price along the y-axis. Thus, the best and worst authors for both categories stick out right off the bat:

  • Most expensive author: American Psychological Association
  • Least expensive author: Rob Elliot
  • Highest rated author: Sarah Young, Dav Pilkey, Eric Carle (Tie)
  • Lowest rated author: Gallup

The American Psychological Association and Gallup are traditional individual authors, so lets take a look at what they're actually publishing to determine what makes them the most expensive and least rated authors respectively.

In [11]:
print('American Psychological Association:')
for book in a_data.loc[a_data['author'] == 'American Psychological Association']['name'].unique():
    print(book)
print('\nGallup:')
for book in a_data.loc[a_data['author'] == 'Gallup']['name'].unique():
    print(book)
American Psychological Association:
Publication Manual of the American Psychological Association, 6th Edition

Gallup:
StrengthsFinder 2.0

A quick web search reveals that the APA publication is style manual often used for professional bibliographies and citations while the Gallup publication is a management book used to determine the reader's professional strengths.

Our graphic also confirms something we discovered earlier, which is the existence of an inverse correlation between rating and price. Here, however, that correlation is even more pronounced. For every 1 unit increase in average rating, there is an expected decrease in average price of $16.65.

Lastly, like we showed in Part 1, given their average rating, authors above the regression line are more expensive than expected and those below are less expensive than expected. If we calculate the residuals for each author, we can order them by residual and determine the least expensive authors given their ratings.

In [12]:
top20_avgs['res'] = 0.0
for index,row in top20_avgs.iterrows():
    top20_avgs.at[index, 'res'] = row['avg_price'] - ((coef20 * row['avg_rating']) + intercept20)
top20_avgs.sort_values(by='res')
Out[12]:
avg_rating avg_price res
author
Rob Elliott 4.562500 4.000000 -10.257250
Harper Lee 4.600000 4.333333 -9.299467
Gallup 4.000000 17.000000 -6.624000
Don Miguel Ruiz 4.700000 6.000000 -5.967600
E L James 4.233333 15.333333 -4.405200
Stieg Larsson 4.600000 9.500000 -4.132800
Eric Carle 4.900000 5.000000 -3.637200
Dav Pilkey 4.900000 6.285714 -2.351486
Bill O'Reilly 4.642857 10.571429 -2.347714
Jeff Kinney 4.800000 9.250000 -1.052400
Rick Riordan 4.772727 9.909091 -0.847455
Sarah Young 4.900000 8.000000 -0.637200
Dr. Seuss 4.877778 8.666667 -0.340578
Suzanne Collins 4.663636 13.363636 0.790509
J.K. Rowling 4.450000 20.166667 4.036067
Gary Chapman 4.736364 17.181818 5.819745
Stephenie Meyer 4.657143 19.857143 7.175886
Stephen R. Covey 4.642857 20.571429 7.652286
The College Board 4.383333 39.333333 22.092600
American Psychological Association 4.500000 46.000000 30.702000

This resulting DataFrame shows us, for those who have appeared on Amazon's Top 50 list more than 6 times, the authors whose average price is lowest given their average rating.

Part 4: Best Books for the Price

The method used to determine the best priced authors given the rating can also be used to determine the best price books given the rating. Since we have already calculated a linear model for price versus rating for all of the books in our dataset, we can just add a column for residuals to the DataFrame and calculate.

In [13]:
a_data['res'] = 0.0
for index,row in a_data.iterrows():
    a_data.at[index, 'res'] = row['price'] - ((coef50 * row['rating']) + intercept50)
a_data.sort_values(by='res')
Out[13]:
name author rating reviews price year genre count res
194 JOURNEY TO THE ICE P RH Disney 4.6 978 0 2014 Fiction 2 -12.790
462 The Short Second Life of Bree Tanner: An Eclip... Stephenie Meyer 4.6 2122 0 2010 Fiction 7 -12.790
92 Eat This Not That! Supermarket Survival Guide:... David Zinczenko 4.5 720 1 2009 Non Fiction 2 -12.354
117 Frozen (Little Golden Book) RH Disney 4.7 3642 0 2014 Fiction 2 -12.226
390 The Girl with the Dragon Tattoo (Millennium Se... Stieg Larsson 4.4 10559 2 2010 Fiction 6 -11.918
... ... ... ... ... ... ... ... ... ...
347 The Book of Basketball: The NBA According to T... Bill Simmons 4.7 858 53 2009 Non Fiction 1 40.774
152 Hamilton: The Revolution Lin-Manuel Miranda 4.9 5867 54 2016 Non Fiction 1 42.902
474 The Twilight Saga Collection Stephenie Meyer 4.7 3801 82 2009 Fiction 7 69.774
70 Diagnostic and Statistical Manual of Mental Di... American Psychiatric Association 4.5 6679 105 2013 Non Fiction 2 91.646
71 Diagnostic and Statistical Manual of Mental Di... American Psychiatric Association 4.5 6679 105 2014 Non Fiction 2 91.646

550 rows × 9 columns

The resulting DataFrame tells us that the best priced book in our dataset is Disney's Journey to the Ice Palace while the worst priced book is yet another publication from the APA.

Part 5: Comparing Regression Between Datasets

Lastly, we want to check to see how our regressions models compares to models based on other datasets. To do this, we will first import a dataset gathered from Google Books. Like the Amazon data, the Google Books data lists book prices as well as an average rating for each book on a scale from 1 to 5. Since we only need price and rating data, we are going to remove most of the other columns.

One important thing to note is that the original dataset lists prices in Saudi Arabian Riyals (SAR). We will convert prices to US Dollars (USD) using the rate of 1 SAR to 0.27 USD. This rate was acquired by Morningstar for Currency and Coinbase for Cryptocurrency on December 16, 2020 at 15:37 UTC.

In [14]:
g_data = pd.read_csv('google_books.csv')
g_data.index += 1
g_data = g_data.drop(columns=['Unnamed: 0', 'description', 'publisher', 'page_count', 'generes', 'ISBN', 'language'])
g_data['price'] *= 0.27
g_data.dropna(subset=['rating', 'price'], inplace=True)
g_data.head()
Out[14]:
title author rating voters price currency published_date
1 Attack on Titan: Volume 13 Hajime Isayama 4.6 428 11.6856 SAR Jul 31, 2014
2 Antiques Roadkill: A Trash 'n' Treasures Mystery Barbara Allan 3.3 23 7.0605 SAR Jul 1, 2007
3 The Art of Super Mario Odyssey Nintendo 3.9 9 36.1395 SAR Nov 5, 2019
4 Getting Away Is Deadly: An Ellie Avery Mystery Sara Rosett 4.0 10 7.0605 SAR Mar 1, 2009
5 The Painted Man (The Demon Cycle, Book 1) Peter V. Brett 4.5 577 7.7058 SAR Jan 8, 2009

Like we did with the Amazon data, we will create a linear regression model and determine the slope and intercept of the regression line. We are also going to change variable names so that they are more in line with the math we are going to execute.

In [15]:
b1 = coef50
X1 = a_data['rating'].values
Y1 = a_data['price'].values
a_intercept = intercept50

X2 = g_data['rating'].values
Y2 = g_data['price'].values
X_train, X_test, y_train, y_test = train_test_split(X2, Y2, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train.reshape(-1,1), y_train)
b2 = round(model.coef_[0], 3)
g_intercept = round(model.intercept_, 3)
print('Linear regression model given by:')
print('y = {}x'.format(b2), end='')
if (g_intercept < 0):
    print(' - {}'.format(abs(g_intercept)))
else:
    print(' + {}'.format(g_intercept))
fig, ax = plt.subplots(figsize=(14,8.65))
plt.xlabel("User Rating", size=14)
plt.ylabel("Price ($)", size=14)
plt.title("Google Book Prices vs. Ratings w/ Regression", size=16)
ax.scatter(X,y)
x_lin = np.linspace(np.amin(X), np.amax(X), 100)
y_lin = g_intercept + (b2 * x_lin)
plt.plot(x_lin, y_lin, color='red')
plt.show()
Linear regression model given by:
y = -6.82x + 41.844

The Google data and regression look very similar to those that we saw from the Amazon data. However, the slopes (-5.64 for Amazon and -6.82 for Google) are a little off. To determine whether the true slope of the books on the Amazon Top 50 list is indeed less than that of the Google Books list, we will perform a two-sample hypothesis test for slopes. This will be conducted in 4 steps: establishing the null and alternate hypotheses, calculating the test statistic, calculating the p-value, and comparing the p-value to our significance level. We will use a fairly standard significance level of α = 0.05.

  1. We will let β1 and β2 be the true correlation coefficients for the Amazon and Google datasets respectively while b1 and b2 will be the respective sample coefficients. Since b1 = -5.64 > -6.82 = b2, we will test the alternate hypothesis that β1 > β2. Thus, our null and alternate hypotheses are as follows:
    H0:   β1 - β2 = 0
    Ha:   β1 - β2 > 0

  2. Next, we will calculate our test statistic using the formula

    We already have the terms in the numerator. The terms in the square root in the denominator can be found via

In [16]:
# Calculate SEb1
m = len(a_data['rating'])
x1_bar = X1.sum() / m
SEb1_num = ((Y1 - (b1*Y1 + a_intercept))**2).sum()
SEb1_den = (m - 2) * ((X1 - x1_bar)**2).sum()
SEb1 = SEb1_num / SEb1_den

# Calculate SEb2
n = len(g_data['rating'])
x2_bar = X2.sum() / n
SEb2_num = ((Y2 - (b2*Y2 + g_intercept))**2).sum()
SEb2_den = (n - 2) * ((X2 - x2_bar)**2).sum()
SEb2 = SEb2_num / SEb2_den
SEb2

# Calculate test statistic
z_cal = (b1 - b2) / math.sqrt(SEb1 + SEb2)
z_cal
Out[16]:
0.0671105598681493

  • Thus, our test statistic is zcal = 0.0671

  1. The next step is to use our test statistic to find our p-value. This step is relatively simple since SciPy has built-in tables for z-critical values. For our alternate hypothesis β1 - β2 > 0, the p-value is given by

In [17]:
p_val = 1 - st.norm.cdf(z_cal)
p_val
Out[17]:
0.4732468436452486
  • Thus, we have p-value = 0.4732.
  1. The last thing we must do is compare our p-value to our significance level of 0.05. We therefore have

    Since our p-value is greater than our level of significance, we fail to reject the null hypothesis. Given our sample data, there is insufficient evidence to suggest that the inverse correlation between price and rating of books on Amazon's Top 50 bestsellers list is greater than that for books on Google Books.

Conclusion

To summarize, the analysis of our datasets resulted in a number of conclusions that help to clarify various relationships between book prices, ratings, the number of reviewers, time, and even how different datasets relate to one another. The main takeaways from our analysis are:

  • There is a clear negative correlation between book price and book rating. I.e., for every 1-point increase in rating, the expected price of a book decreases by $5.64.
  • For the time period 2009 - 2019, book prices actually decreased slightly over time by about 38¢ per year.
  • People do not tend to review books differently based on cost. Expensive books do not seem to be held to a different standard than cheap ones.
  • The best authors for the price are Rob Elliot and Harper Lee while the worst are the American Psychological Association and The College Board.
  • The best book for the price is Journey to the Ice Palace by RH Disney.
  • Despite a naive analysis of linear regression suggesting otherwise, the negative correlation between book price and book rating for the Google dataset is not any more pronounced than that for the Amazon dataset.