Sentiment Analysis Of Yelp User Review Data

Social data provides important, real-time insight into consumer opinion – on lifestyle, habits, brands, and preferences. Because these opinions are unsolicited, they offer a genuine view of consumer feelings, and as such they should be valued. Yelp provides restaurant details including name, price, rating, address, and reviews. A restaurant's star rating says how good it is, but is the rating alone enough to give the full picture? Probably not: people who truly hated a restaurant will describe that experience in a review, and the same goes for a great experience. Thus, one would expect sentiment analysis of the reviews to give better insight into the public's opinion of a restaurant.

Web Scraping:

Restaurant data was scraped from yelp.com using the Python package BeautifulSoup. The data consists of the restaurant name, rating, price, number of reviews, address, and user reviews. The scraping was split into two modules: the first scrapes the restaurant name, rating, price, number of reviews, and address; the second scrapes the restaurant name and user reviews. The two datasets were then merged.

Module 1:

The code used in the first module is as follows.
Scraping First Module - Code
from __future__ import print_function
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.yelp.com/search?find_desc=Restaurants&find_near=times-square-new-york-2"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
#print(soup.prettify())
names = []
reviews = []
price = []
address = []
links = soup.find_all("h3",{"class":"search-result-title"})

for link in links:
    # join the stripped string fragments into one restaurant name,
    # then append once per link (not once per fragment)
    text = ''.join(s.strip() for s in link.strings if s.strip())
    names.append(text)
     
     
links1 = soup.find_all("span",{"class":"review-count"}) # rating-qualifier

for link in links1:
    reviews.append(link.text)
    print(link.text)
     
     

links2 = soup.find_all("span",{"class":"business-attribute price-range"})

for link in links2:
    price.append(link.text)
    print(link.text)
     
links3 = soup.find_all("address")

for link in links3:
    address.append(link.text)
    print(link.text)
i = 0     
while (i < 20):
    try:
        i += 1
        ttag = soup.find_all('a', {'class':'u-decoration-none next pagination-links_anchor'})
    
        next_url = 'https://www.yelp.com/'+ttag[0].get('href')
        
        q = requests.get(next_url)
        soup = BeautifulSoup(q.content, "html.parser")
        
        
        links = soup.find_all("h3",{"class":"search-result-title"})

        for link in links:
            text = ''.join(s.strip() for s in link.strings if s.strip())
            names.append(text)
            print(text)
     
        
        links1 = soup.find_all("span",{"class":"review-count rating-qualifier"})

        for link in links1:
            reviews.append(link.text)
            print(link.text)
     
     

        links2 = soup.find_all("span",{"class":"business-attribute price-range"})

        for link in links2:
            price.append(link.text)
            print(link.text)
     
        links3 = soup.find_all("address")

        for link in links3:
            address.append(link.text)            
            print(link.text)
    
    except (IndexError, requests.RequestException):
        # no "next" pagination link found (or request failed) – stop paging
        break
review = [x.strip() for x in reviews]
price = [x.strip() for x in price]
address = [x.strip() for x in address]
name = [x.strip() for x in names]


df_l = pd.DataFrame(review)
df_a = pd.DataFrame(price)
df = pd.concat([df_l, df_a], axis=1)
df_b = pd.DataFrame(address)
df1 = pd.concat([df, df_b], axis=1)
df_n = pd.DataFrame(name)
data = pd.concat([df1, df_n], axis=1)
data.columns = ["reviews", "price", "address", "name"]

data.to_csv('C:/Users/venkatesh/Documents/s.csv', sep='\t', encoding='utf-8')

import codecs
with codecs.open("ss.csv", "w", encoding="utf-8") as writer:
    # zip stops at the shortest list, avoiding a hard-coded row count
    for rev, pr, addr in zip(review, price, address):
        writer.write(rev + "," + pr + "," + addr + "\n")
    



with open("hotel_name.txt", "w") as log:
    for n in name:
        print(n, file=log)
									   

Module 2:

The code used in the second module is as follows.
Scraping Second Module - Code
# -*- coding: utf-8 -*-
"""
Created on Fri Aug 12 16:04:55 2016
@author: venkatesh
"""

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=Times+Square%2C+NY&ns=1"
r = requests.get(url)
info = {}  # restaurant name -> list of review texts
soup = BeautifulSoup(r.content, "html.parser")
#print(soup.prettify())
links = soup.find_all("h3",{"class":"search-result-title"})
for link in links[1:]:
    res_name = ''.join(s.strip() for s in link.strings)
    print(res_name)
    info[res_name] = []
    # follow the link to the restaurant's own page and collect its reviews
    res_url = 'https://www.yelp.com/' + link.find('a').get('href')
    res = requests.get(res_url)
    soup1 = BeautifulSoup(res.content, "html.parser")
    reviews = soup1.find_all("p", {"itemprop": "description"})
    for review in reviews:
        info[res_name].append(review.text)
        


i = 0     
while (i < 20): 
    try:
        i += 1
        ttag = soup.find_all('a', {'class':'u-decoration-none next pagination-links_anchor'})
        next_url = 'https://www.yelp.com/' + ttag[0].get('href')
        q = requests.get(next_url)
        soup = BeautifulSoup(q.content, "html.parser")
        links = soup.find_all("h3",{"class":"search-result-title"})
        for link in links[1:]:
            res_name = ''.join(s.strip() for s in link.strings)
            print(res_name)
            info[res_name] = []
            res_url = 'https://www.yelp.com/' + link.find('a').get('href')
            res = requests.get(res_url)
            soup1 = BeautifulSoup(res.content, "html.parser")
            reviews = soup1.find_all("p", {"itemprop": "description"})
            for review in reviews:
                info[res_name].append(review.text)

    except (IndexError, requests.RequestException):
        # no "next" pagination link found (or request failed) – stop paging
        break

print(info)

x = pd.DataFrame(list(info.items()), columns=['Restaurant', 'Review'])

#x.to_csv('C:/Users/venkatesh/Documents/review.csv', sep='\t', encoding='utf-8')


import codecs
writer = codecs.open("daa.csv","w", encoding = "utf-8")
for key, val in info.items():
    writer.write(key)
    writer.write(",")
    value = str(val).replace(",", "")
    writer.write(value)
    writer.write("\n")
writer.close()
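With both modules done, the two datasets can be merged on the restaurant name. A minimal sketch using `pandas.merge`; the two small DataFrames here are hypothetical stand-ins for the CSVs written above:

```python
import pandas as pd

# hypothetical frames standing in for the two modules' outputs;
# in practice these would be read back from the CSV files written above
details = pd.DataFrame({
    "Restaurant": ["Carmine's", "Junior's"],
    "Rating": [4.0, 4.5],
    "Price": ["$$", "$$"],
})
reviews = pd.DataFrame({
    "Restaurant": ["Carmine's", "Junior's"],
    "Review": ["Great pasta!", "Best cheesecake in town."],
})

# an inner join on the name keeps only restaurants present in both datasets
merged = pd.merge(details, reviews, on="Restaurant", how="inner")
print(merged.shape)  # (2, 4)
```

An inner join is the safe default here, since a restaurant scraped in one module but missed in the other would otherwise produce rows with missing fields.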
									   

Exploratory Data Analysis:

Scraping was restricted to restaurants within a 2-mile radius of Times Square. The distribution of ratings shows that most of the restaurants in the first 20 pages of Yelp results have a rating of 4. The distribution of review counts shows that the majority of restaurants have between 0 and 1,000 reviews.
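Distribution plots like those described above can be produced with a short matplotlib sketch; the rating and review-count values below are illustrative placeholders, not the scraped data:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# illustrative placeholder values, not the actual scraped data
ratings = [4.0, 4.5, 3.5, 4.0, 4.0, 5.0, 3.0, 4.0]
review_counts = [120, 850, 40, 300, 990, 60, 210, 480]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(ratings, bins=[2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5])
ax1.set_xlabel("Rating")
ax1.set_ylabel("Number of restaurants")
ax2.hist(review_counts, bins=range(0, 1100, 100))
ax2.set_xlabel("Number of reviews")
fig.savefig("distributions.png")
```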

Sentiment Analysis:

Sentiment analysis was performed using the Natural Language Toolkit (NLTK), specifically the VADER sentiment package. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and it also works well on texts from other domains. Here it is applied at the word level, classifying each word as positive, negative, or neutral; the analysis concentrates on the positive and negative words, since neutral words add little value. The results show some interesting cases where reviews and ratings contradict each other. One restaurant with a rating of 4 has a roughly equal mixture of negative and positive words, so it is safe to say its reviews are mixed. Another restaurant also has a rating of 4, but the sentiment analysis shows more negative reviews than positive ones.

Future Scope:

The algorithm can be combined with text mining so that dish names mentioned in the reviews are incorporated into the sentiment analysis, producing an output that says whether a particular dish is associated with positive or negative sentiment, along with an overall score.

About the author: Venkatesh Subramaniam
