Possible to select features from both a csr matrix and dataframe for machine learning?

On Codecademy I've seen examples of training a machine learning model on data stored as a pandas DataFrame, and other examples where the data is stored as a SciPy csr matrix.

What if I have a pandas DataFrame where some of the features are unstructured data, such as the actual text of a review, while the other features are typical categorical and numerical features? If I vectorize the unstructured data so that a machine learning model can process it, the result is a SciPy csr matrix. Can I make my model look at both the original features in the dataset and the expanded features in the csr matrix? Or can I only train two separate models, one on the csr matrix features and the other on the DataFrame features?

I now have an answer to this:

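  1. Use pd.arrays.SparseArray to convert the (already numeric) feature columns of your dataframe to sparse format.
  2. Use the pandas DataFrame.sparse.to_coo() method to convert your sparse dataframe to a SciPy coo matrix.
  3. Use the coo matrix's tocsr() method to convert it to a csr_matrix.
  4. Use scipy.sparse.hstack() to concatenate that csr_matrix with the csr_matrix from your vectorized text. Both must have the same number of rows, in the same order. Pass format="csr" if you want a csr result back.
  5. Train on the combined matrix. Because everything stays sparse, memory usage should be lower and training can be faster, which makes high-dimensional text vectors manageable.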
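
Here is a minimal sketch of steps 1 to 4, assuming the structured columns are already numeric (categoricals one-hot encoded beforehand). The column names and the small text_csr matrix are made up for illustration; in practice text_csr would be whatever your vectorizer produced.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical structured features, already numeric.
df = pd.DataFrame({"rating": [5.0, 0.0, 3.0], "verified": [1.0, 0.0, 0.0]})

# Hypothetical stand-in for the vectorized text (one row per row of df).
text_csr = sparse.csr_matrix(
    np.array([[0, 1, 0, 2], [1, 0, 0, 0], [0, 0, 3, 0]], dtype=float)
)

# Step 1: rebuild each column as a SparseArray (zeros are not stored).
sparse_df = pd.DataFrame(
    {
        name: pd.arrays.SparseArray(df[name].to_numpy(), fill_value=0.0)
        for name in df.columns
    },
    index=df.index,
)

# Step 2: sparse dataframe -> SciPy coo matrix.
coo = sparse_df.sparse.to_coo()

# Step 3: coo -> csr.
structured_csr = coo.tocsr()

# Step 4: put the two feature blocks side by side; rows must line up.
combined = sparse.hstack([structured_csr, text_csr], format="csr")

print(combined.shape)  # 3 rows, 2 structured + 4 text columns
```

The combined matrix can then be passed to any estimator that accepts sparse input.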

If anyone else tries this, please report back here on whether it worked for you, as additional verification. Thanks.

I found yet another answer to this by looking at mandrucyk’s Gaydar project.
The problem can also be solved with model nesting (often called stacking): the output of a model trained on the unstructured data is used as an additional feature for a second model that trains on the structured data.
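
A rough sketch of the nesting idea with scikit-learn follows. This is my own illustration, not code from that project: the toy data, the column names and the choice of TfidfVectorizer plus LogisticRegression are all assumptions. The text model's scores are produced out-of-fold with cross_val_predict, so the outer model is not trained on scores that the text model computed for rows it had already seen.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: one text column, one numeric column, one label.
df = pd.DataFrame({
    "review": ["great product", "terrible quality", "great value",
               "terrible service", "really great", "really terrible",
               "great great", "terrible terrible"],
    "price": [10.0, 25.0, 12.0, 30.0, 9.0, 27.0, 11.0, 29.0],
    "label": [1, 0, 1, 0, 1, 0, 1, 0],
})

# Inner model: sees only the unstructured text.
text_model = make_pipeline(TfidfVectorizer(), LogisticRegression())

# Out-of-fold probability of the positive class for every row.
text_score = cross_val_predict(
    text_model, df["review"], df["label"], cv=2, method="predict_proba"
)[:, 1]

# Outer model: structured features plus the inner model's score.
features = df[["price"]].assign(text_score=text_score)
final_model = LogisticRegression().fit(features, df["label"])

# Refit the inner model on all rows so it can score new reviews later.
text_model.fit(df["review"], df["label"])
```

To predict on new data, you would score the new reviews with text_model.predict_proba, add that score as the text_score column, and pass the result to final_model.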