Yelp Rating Predictor Cumulative Project


Hi,

I completed the project above.
It runs fine on Colaboratory; however, on my own computer I ran into several issues:

1. Loading the dataset:

businesses = pd.read_json('yelp_business.json', lines=True)

raises a MemoryError:
MemoryError                               Traceback (most recent call last)
<ipython-input-1-2f80f790b198> in <module>
     27 
     28 
---> 29 businesses = pd.read_json('yelp_business.json',lines=True)
     30 reviews = pd.read_json('yelp_review.json',lines=True)
     31 users = pd.read_json('yelp_user.json',lines=True)

~\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    197                 else:
    198                     kwargs[new_arg_name] = new_arg_value
--> 199             return func(*args, **kwargs)
    200 
    201         return cast(F, wrapper)

~\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    294                 )
    295                 warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 296             return func(*args, **kwargs)
    297 
    298         return wrapper

~\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\json\_json.py in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines, chunksize, compression, nrows)
    595     )
    596 
--> 597     json_reader = JsonReader(
    598         filepath_or_buffer,
    599         orient=orient,

~\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\json\_json.py in __init__(self, filepath_or_buffer, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines, chunksize, compression, nrows)
    678 
    679         data = self._get_data_from_filepath(filepath_or_buffer)
--> 680         self.data = self._preprocess_data(data)
    681 
    682     def _preprocess_data(self, data):

~\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\json\_json.py in _preprocess_data(self, data)
    689         """
    690         if hasattr(data, "read") and (not self.chunksize or not self.nrows):
--> 691             data = data.read()
    692         if not hasattr(data, "read") and (self.chunksize or self.nrows):
    693             data = StringIO(data)

~\AppData\Local\Programs\Python\Python38-32\lib\codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

MemoryError: 

I found a workaround: I wrote a load_json function that builds the expected pandas DataFrame.
I also delete any DataFrame that is no longer needed, using the del statement.

2. Creating a model on all numeric features:

model_these_features(numeric_features)

raises a MemoryError:
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-41-0d1b217517bc> in <module>
      1 # create a model on all numeric features here
----> 2 model_these_features(numeric_features)

<ipython-input-38-e0f16dfbdef3> in model_these_features(feature_list)
     46     model = LinearRegression()
     47     #print(type(X_train))
---> 48     model.fit(np.array(X_train), np.array(y_train))
     49 
     50     #

MemoryError: Unable to allocate 21.9 MiB for an array with shape (150874, 19) and data type float64

I was unable to resolve this error.
I tried changing the datatype when calling the .fit() method, but the memory issue still appears:
model.fit(np.array(X_train), np.array(y_train))
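For reference, the kind of dtype change I tried looks roughly like this (a sketch only: the rng-generated X_train/y_train below are random stand-ins for the project's real train split, kept to the same 19 columns as the failing array). Casting to float32 halves the memory of the default float64 arrays:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Random stand-ins for the real train split (same column count as the
# failing (150874, 19) array, but far fewer rows).
rng = np.random.default_rng(0)
X_train = rng.random((100, 19))
y_train = rng.random(100)

# Pass float32 copies instead of the default float64 arrays.
model = LinearRegression()
model.fit(np.asarray(X_train, dtype=np.float32),
          np.asarray(y_train, dtype=np.float32))
```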

If anyone has had a similar issue before and knows how to resolve it, please reply.

Link to Colaboratory

Link to GitHub (workable version - run on Colaboratory)

Link to GitHub (Jupyter Notebook - run on my own computer with memory error)

My developer environment:

  • Op. system: Windows 10
  • Python: 3.8.6 (32-bit, per the traceback paths)
  • pandas: 1.1.3
  • scikit-learn: 0.23.2
  • numpy: 1.19.3
Here is the full packages-list:
Package                Version
---------------------- ---------
absl-py                0.10.0
astor                  0.8.1
astroid                2.4.2
attrs                  19.3.0
backcall               0.2.0
bleach                 1.5.0
certifi                2020.6.20
chardet                3.0.4
colorama               0.4.3
cvxopt                 1.2.5
cycler                 0.10.0
decorator              4.4.2
defusedxml             0.6.0
entrypoints            0.3
future                 0.18.2
googlemaps             4.4.1
grpcio                 1.33.1
h5py                   2.10.0
html5lib               0.9999999
ibm-cloud-sdk-core     3.3.0
idna                   2.9
ipykernel              5.3.0
ipython                7.15.0
ipython-genutils       0.2.0
isort                  4.3.21
jedi                   0.17.0
Jinja2                 2.11.2
joblib                 0.17.0
json5                  0.9.5
jsonschema             3.2.0
jupyter-client         6.1.3
jupyter-core           4.6.3
jupyterlab             2.1.4
jupyterlab-server      1.1.5
Keras                  2.4.3
kiwisolver             1.3.0
lazy-object-proxy      1.4.3
lxml                   4.5.2
Markdown               3.3.3
MarkupSafe             1.1.1
matplotlib             3.3.2
mccabe                 0.6.1
mistune                0.8.4
nbconvert              5.6.1
nbformat               5.0.7
notebook               6.0.3
numpy                  1.19.3
oauthlib               3.1.0
packaging              20.4
pandas                 1.1.3
pandas-datareader      0.9.0
pandocfilters          1.4.2
parso                  0.7.0
pickleshare            0.7.5
Pillow                 8.0.1
pip                    20.3
prometheus-client      0.8.0
prompt-toolkit         3.0.5
Pygments               2.6.1
PyJWT                  1.7.1
pylint                 2.5.3
pyparsing              2.4.7
pyrsistent             0.16.0
python-dateutil        2.8.1
python-dotenv          0.15.0
python-twitter         3.5
pytz                   2020.1
pywin32                228
pywinpty               0.5.7
PyYAML                 5.3.1
pyzmq                  19.0.1
requests               2.23.0
requests-oauthlib      1.3.0
scikit-learn           0.23.2
scipy                  1.5.2
seaborn                0.11.0
Send2Trash             1.5.0
setuptools             49.2.1
six                    1.15.0
sklearn                0.0
termcolor              1.1.0
terminado              0.8.3
testpath               0.4.4
threadpoolctl          2.1.0
toml                   0.10.1
tornado                6.0.4
traitlets              4.3.3
urllib3                1.25.9
watson-developer-cloud 2.10.1
wcwidth                0.2.4
webencodings           0.5.1
websocket-client       0.48.0
Werkzeug               1.0.1
wheel                  0.35.1
wrapt                  1.12.1

Same problem here. Did you manage to fix it? I tried increasing the amount of memory assigned to PyCharm, but it did not do the trick for me.

Hi,
I did not find a solution…
I use Colaboratory to bypass this issue and to continue learning…