Date-A-Scientist - Scikit-Learn Import error

Hello there everyone! I’m getting a strange error while doing Date-A-Scientist. If I try to import sklearn.linear_model.LinearRegression, I get the error below. It’s my first time working with sklearn off-platform. From some research it seems to be a file error, so I’d like to check here whether there is a known issue and maybe start a resourceful thread.

$ C:/Users/Lucas/anaconda3/python.exe "d:/Projetos/Code/Date a Scientist/dating_skeleton.py"
Traceback (most recent call last):
  File "d:/Projetos/Code/Date a Scientist/dating_skeleton.py", line 4, in <module>
    from sklearn.linear_model import LinearRegression
  File "C:\Users\Lucas\anaconda3\lib\site-packages\sklearn\__init__.py", line 80, in <module>
    from .base import clone
  File "C:\Users\Lucas\anaconda3\lib\site-packages\sklearn\base.py", line 21, in <module>
    from .utils import _IS_32BIT
  File "C:\Users\Lucas\anaconda3\lib\site-packages\sklearn\utils\__init__.py", line 23, in <module>
    from .class_weight import compute_class_weight, compute_sample_weight
  File "C:\Users\Lucas\anaconda3\lib\site-packages\sklearn\utils\class_weight.py", line 7, in <module>
    from .validation import _deprecate_positional_args
  File "C:\Users\Lucas\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 25, in <module>
    from .fixes import _object_dtype_isnan, parse_version
  File "C:\Users\Lucas\anaconda3\lib\site-packages\sklearn\utils\fixes.py", line 18, in <module>
    import scipy.stats
  File "C:\Users\Lucas\anaconda3\lib\site-packages\scipy\stats\__init__.py", line 388, in <module>
    from .stats import *
  File "C:\Users\Lucas\anaconda3\lib\site-packages\scipy\stats\stats.py", line 180, in <module>
    from . import distributions
  File "C:\Users\Lucas\anaconda3\lib\site-packages\scipy\stats\distributions.py", line 8, in <module>
    from ._distn_infrastructure import (entropy, rv_discrete, rv_continuous,
  File "C:\Users\Lucas\anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py", line 23, in <module>
    from scipy import optimize
  File "C:\Users\Lucas\anaconda3\lib\site-packages\scipy\optimize\__init__.py", line 387, in <module>
    from .optimize import *
  File "C:\Users\Lucas\anaconda3\lib\site-packages\scipy\optimize\optimize.py", line 36, in <module>
    from ._numdiff import approx_derivative
  File "C:\Users\Lucas\anaconda3\lib\site-packages\scipy\optimize\_numdiff.py", line 6, in <module>
    from scipy.sparse.linalg import LinearOperator
  File "C:\Users\Lucas\anaconda3\lib\site-packages\scipy\sparse\linalg\__init__.py", line 114, in <module>
    from .eigen import *
  File "C:\Users\Lucas\anaconda3\lib\site-packages\scipy\sparse\linalg\eigen\__init__.py", line 9, in <module>
    from .arpack import *
  File "C:\Users\Lucas\anaconda3\lib\site-packages\scipy\sparse\linalg\eigen\arpack\__init__.py", line 20, in <module>
    from .arpack import *
  File "C:\Users\Lucas\anaconda3\lib\site-packages\scipy\sparse\linalg\eigen\arpack\arpack.py", line 43, in <module>
    from . import _arpack
ImportError: DLL load failed while importing _arpack:

I found this which might be helpful(?). You might have to uninstall conda versions and install pip versions of scipy.

https://github.com/conda/conda/issues/6396

Or, this from StackOverflow:
https://stackoverflow.com/questions/54083514/how-to-fix-importerror-dll-load-failed-the-specified-procedure-could-not-be-f

Hey Lisa!!

Thank you very much!
I tried uninstalling and installing through pip before, but I actually had to conda uninstall scikit-learn, numpy and scipy. So my recommended commands would be.

conda uninstall scikit-learn numpy scipy
conda remove --force scikit-learn numpy scipy
pip uninstall scikit-learn numpy scipy
pip install -U scikit-learn numpy scipy --user

If it’s somehow unnecessary or dangerous, I’ll edit out!
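
To double-check that the reinstall took effect, it can help to see where Python would load each package from. This is only a minimal sketch using the standard library; the helper name module_origin is made up for illustration.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None if it is not found."""
    # find_spec only locates a top-level module; it does not import it,
    # so this works even when the import itself would fail with a DLL error.
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


for pkg in ("numpy", "scipy", "sklearn"):
    print(pkg, "->", module_origin(pkg))
```

A path inside the anaconda3 folder would suggest the conda environment's copy is being used, while a path under a per-user site-packages folder would suggest the copy installed with --user.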

You’re welcome.

Wait, so uninstalling conda versions and installing pip versions doesn’t work?
Did you update pip?

I’m not very familiar with that error though… :thinking:

It did! It did. I just got rid of every possible version of each module on my PC and then reinstalled everything.

Excellent! :partying_face:
I had forgotten to ask if you had a Mac or PC.

I don’t know if this was the explicit reason but try and avoid mixing packages installed with conda and with pip where possible. If you’re using conda for obtaining packages and environment management then get your packages from conda wherever possible and install any additions through pip last and only if you have to (i.e. there’s no build from conda).

A little info on mixing the two here- https://docs.conda.io/projects/conda/en/latest/user-guide/configuration/pip-interoperability.html
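
If you want to see which tool installed a given package, the package metadata sometimes records it. A rough sketch, assuming Python 3.8+ for importlib.metadata; the helper name installer_of is made up, and whether the INSTALLER file exists (and what it says) depends on how the package was built, so treat None as "unknown".

```python
from importlib import metadata


def installer_of(dist_name):
    """Return the contents of the INSTALLER metadata file (e.g. 'pip'), or None if unknown."""
    try:
        text = metadata.distribution(dist_name).read_text("INSTALLER")
    except metadata.PackageNotFoundError:
        return None  # the distribution is not installed in this environment
    return text.strip() if text else None


for dist in ("numpy", "scipy", "scikit-learn"):
    print(dist, "->", installer_of(dist))
```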

HA! I forgot to mention that too. Good thinkin’!
:slight_smile:

This is actually great info! Should be lying somewhere around here!
Which one do you prefer? Which one is a more permanent solution as in installing a package once and not having to do anything about it again?

It’s worth noting that there’s no direct equivalence between *conda and pip as they do vastly different things (there’s numerous online discussions about this if you wanted to look into it). See this recent post by @el_cocodrilo for a little more information: Setting Up Conda in Git Bash and have a read of the page on conda myths linked within this post under ‘What is Conda…’ .

Personally I’d boil it down to this: if conda covers all your requirements, then you may as well use it (unless you really dislike it). It performs many different tasks that would otherwise require several pieces of software. If you have the time it’s worth trying your hand with the alternative to see what the benefits and drawbacks are, but there’s plenty of online discussion about that too. It might save you some time to see what other, more experienced folk think.

@tgrtim and @lisalisaj, could you guys give me a hand?
I’m trying to augment my data with [‘religion’] => [‘religion_code’]. Sort of having some hard time. I’ve tried three methods to find if the string contains a substring:

df['religion_code'] = df.religion.apply([lambda x: 0 if (x.str.find('christianity') != -1) | ('catholic' in x) else 1 if (x.find('agnostic') != -1) | ('atheis' in x) else 2]).replace(np.nan, 0, regex=True)

x.str.find() will raise the following error:

AttributeError: 'str' object has no attribute 'str'

x.find():

'float' object has no attribute 'find'

and the substring in string function won’t return anything!
I thought that the in function would work just like (x == 'string') and return labels based on whether it is true.

I’ve also tried ('catholic' in x.value) and ('catholic' in x.str) (just brute forcing; I didn’t expect it to work).
Any ideas? Am I missing something?

I’m none too sure what your dataset is as I’ve not attempted that project so I’m afraid I can only point out a couple of things that stand out to me.

Are you intentionally using bitwise or (the " | " operator) or should that be just a standard or?
Also name.str is not a valid method for most Python types. If you want to cast something as a string, use the function str(). The .find() method is a valid method for a string but not for a float. Unless you converted it beforehand you’d just get errors thrown if you tried to use it. The expression within your .apply method appears to be passing a list rather than a function, double check the placement of those brackets.

This entire thing seems a bit complex at the minute and I think that’s the reason for a couple of errors so maybe break it down into steps. Try making just your first step work e.g. just setting values to 0 with ‘christianity’ in the name (be careful about letter case), testing the output and making sure it makes sense. Once that’s done you can do the next steps.

The .find() method does seem like a bit much for testing if a substring is present and your first comment about 'substring in string' sounds like the right way forward though it is an operator rather than a function. What is the type of x within your expression? If you were to use x.value or x.str what would actually be returned?

If it’s still causing issues try testing single rows from this dataset against a regular function to see if you can make them do what you want. Then you can worry about sticking it into a single line if necessary.
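
As a rough sketch of what I mean by testing single values against a regular function first: the function name religion_code and the sample values are made up for illustration, the 0/1/2 labels follow the one-liner above, and treating missing values as 0 is an assumption based on the replace(np.nan, 0) call.

```python
def religion_code(value):
    """Map one answer to 0 (christian/catholic), 1 (agnostic/atheist) or 2 (anything else)."""
    # Missing answers show up as float NaN, which has no string methods,
    # so handle non-strings before doing any substring test.
    if not isinstance(value, str):
        return 0  # assumed default for missing values
    text = value.lower()  # guard against differences in letter case
    if "christianity" in text or "catholic" in text:
        return 0
    if "agnostic" in text or "atheis" in text:
        return 1
    return 2


samples = ["christianity", "Catholic", "agnostic", "atheism", "something else", float("nan")]
print([religion_code(s) for s in samples])  # [0, 0, 1, 1, 2, 0]
```

Once single values behave, passing the function itself (no list brackets) to .apply, e.g. df.religion.apply(religion_code), should give one code per row.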

  1. Is | the same as or, or should I use or instead?

  2. I made another column successfully with apply([lambda x]), so I thought I would be following the same path. I don’t think that I actually need to stringify it! I just need the string value from the data since I’m working with a pandas dataframe.

df['plural_code'] = df.ethnicity.apply([lambda x: 0 if (x == 'white') else 1 if (x == 'black') | (x == 'asian') | (x == 'hispanic / latin') else 2]).replace(np.nan, 0, regex=True)

  3. To clarify things: when performing a print(df.religion.value_counts()) we get the following:
agnosticism                                   2724
other                                         2691
agnosticism but not too serious about it      2636
agnosticism and laughing about it             2496
catholicism but not too serious about it      2318
atheism                                       2175
other and laughing about it                   2119
atheism and laughing about it                 2074
christianity                                  1957
christianity but not too serious about it     1952
other but not too serious about it            1554
judaism but not too serious about it          1517
atheism but not too serious about it          1318
catholicism                                   1064
christianity and somewhat serious about it     927
atheism and somewhat serious about it          848
other and somewhat serious about it            846
catholicism and laughing about it              726
judaism and laughing about it                  681
buddhism but not too serious about it          650
agnosticism and somewhat serious about it      642
judaism                                        612
christianity and very serious about it         578
atheism and very serious about it              570
catholicism and somewhat serious about it      548
other and very serious about it                533
buddhism and laughing about it                 466
buddhism                                       403
christianity and laughing about it             373
buddhism and somewhat serious about it         359
agnosticism and very serious about it          314
judaism and somewhat serious about it          266
hinduism but not too serious about it          227
hinduism                                       107
catholicism and very serious about it          102
buddhism and very serious about it              70
hinduism and somewhat serious about it          58
islam                                           48
hinduism and laughing about it                  44
islam but not too serious about it              40
islam and somewhat serious about it             22
judaism and very serious about it               22
islam and laughing about it                     16
hinduism and very serious about it              14
islam and very serious about it                 13
Name: religion, dtype: int64
  4. Yes! An operator, I got things mixed up. If I use x.str I get AttributeError: 'str' object has no attribute 'str', and with x.value I get AttributeError: 'str' object has no attribute 'value' and AttributeError: 'Series' object has no attribute 'value'.

If I run everything with substring in string I get no errors and religion_code’s entire column is NaN! Just got this info now. That seems really weird since I’m performing a .replace(np.nan, 0, regex=True)

I was never the biggest fan of pandas as I got away with numpy for most of my work so I apologise if I make a mistake on the syntax.

  1. For your standard logical OR yes you must use or. Have a quick online search to see what the bitwise one is for instead.
  2. From what I understand this should be passing a function to the apply method. This could be a lambda expression or a standard function but I don’t see how wrapping it in brackets helps. Something like df['column'].apply(lambda x: x + 1) makes sense to me, ([lambda x: x + 1]) does not (even if it accepts it, unless that’s a piece of pandas syntax I’ve missed).
  3. N/A
  4. From what I can tell that’s still using a method .str that doesn’t exist for that data-type, hence the attribute error. I’m still not sure what your intention is at this point. You seem to be trying to use methods that don’t exist for the given datatype so make sure you know exactly what type you are dealing with.
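
On point 1, here is a small plain-Python sketch of the difference (the function names are made up for illustration): and/or stop as soon as the answer is known, while & and | always evaluate both sides and bind more tightly than comparisons.

```python
def has_catholic_and(value):
    # `and` short-circuits: the substring test never runs for a non-string.
    return isinstance(value, str) and "catholic" in value


def has_catholic_bitwise(value):
    # `&` evaluates both operands, so the substring test runs even for a float.
    return isinstance(value, str) & ("catholic" in value)


missing = float("nan")
print(has_catholic_and(missing))  # False

try:
    has_catholic_bitwise(missing)
except TypeError as err:
    print("TypeError:", err)

x = 5
print((x > 3) | (x > 10))  # True
print(x > 3 | 10)          # False: parsed as x > (3 | 10), i.e. 5 > 11
```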

I’m not quite sure at what point it is messing up but once again try breaking it down. The basic syntax would be something along the following-

import pandas as pd

folks = {'Name': ['Mike', 'Bob', 'Basil'], 'Age': [20, 30, 40]}
myframe = pd.DataFrame(folks)
# Pass the lambda itself to .apply; it is called once per value in the column
outp = myframe['Name'].apply(lambda x: 'Found Him!' if 'Bob' in x else x)

# outp looks like the following-
# 0          Mike
# 1    Found Him!
# 2         Basil
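
On the brackets question: as far as I can tell pandas does accept a list of functions in .apply, but it then returns a DataFrame with one column per function rather than a Series. A quick sketch to compare the two forms, using made-up data; I have only reasoned about this for a simple case, so check it against your own pandas version.

```python
import pandas as pd

names = pd.Series(["Mike", "Bob", "Basil"])

# A bare function: one result per value, returned as a Series.
single = names.apply(lambda x: "Found Him!" if "Bob" in x else x)

# The same function wrapped in a list: the result is a DataFrame instead.
wrapped = names.apply([lambda x: "Found Him!" if "Bob" in x else x])

print(type(single).__name__)
print(type(wrapped).__name__)
print(list(wrapped.columns))
```

Assigning that DataFrame to a single new column may not line up the way you expect, which could be one source of an all-NaN column. I haven't verified that against your data.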

Thanks @tgrtim!

| => or :white_check_mark:
Instead of using .apply I’ll .map every case :white_check_mark:
The data type is object!
I’ll keep looking into this because it’s way more malleable to work with .apply in this case. I’ll get back here if I find an answer!

I’d suggest writing your function, the one you’re passing to .apply, as a standard function first. They’re much easier to work with and read. If it’s then suitable to convert it to a lambda then you could but bear in mind whether or not it actually makes things easier.

Unless I’m mistaken each x in that function might well be an object but you want it to be a string? That would make more sense for usage such as (x == 'white') in your example. Do you have mixed datatypes? Depending on your goal you may want to filter out or convert values that don’t fit the pattern to simplify your function. Otherwise using in-built methods is risky because whilst they may exist for the type of one datapoint they might not for the next.
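
A quick way to check for mixed types is to count the Python type of each value. This sketch uses a plain list with made-up values as a stand-in for the column.

```python
from collections import Counter

# Stand-in for a column holding text answers plus missing values (made-up data).
column = ["white", "asian", float("nan"), "black", float("nan")]

# Count how many values there are of each Python type.
type_counts = Counter(type(value).__name__ for value in column)
print(type_counts)
```

On the real column, df.religion.map(type).value_counts() should give a similar breakdown, and df.religion.fillna('') is one way to make every value a string before applying string logic.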