Use factorize:
df['col'] = pd.factorize(df.col)[0]
print (df)
col
0 0
1 1
2 0
3 0
4 1
Docs
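For context, here is a minimal sketch of what `pd.factorize` returns, using the same sample data as the test code in this answer. The names `codes`, `uniques` and `restored` are only illustrative. The `[0]` above picks the integer codes; the second element holds the distinct values in order of first appearance, so the original labels can be recovered if needed.

```python
import pandas as pd

df = pd.DataFrame({"col": ["baked", "beans", "baked", "baked", "beans"]})

# factorize returns a pair: integer codes and the distinct values
codes, uniques = pd.factorize(df["col"])

# Map the codes back to the original labels
restored = uniques.take(codes)

print(codes)
print(list(uniques))
print(list(restored))
```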
EDIT:
As Jeff mentioned in a comment, it is then best to convert the column to categorical, mainly because it uses less memory:
df['col'] = df['col'].astype("category")
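A rough way to check the memory claim, as a sketch: the 5k-row frame and the variable names below are assumptions for illustration, and the exact byte counts depend on the pandas version. The integer codes are still available through `.cat.codes`; note that the categories are sorted, so the codes match the `factorize` output only when sorted order and order of first appearance coincide, as they do here.

```python
import pandas as pd

# Hypothetical 5k-row frame built from the same five sample values
df = pd.DataFrame({"col": ["baked", "beans", "baked", "baked", "beans"] * 1000})

bytes_before = df["col"].memory_usage(deep=True)
df["col"] = df["col"].astype("category")
bytes_after = df["col"].memory_usage(deep=True)

# Integer codes backing the categorical column
cat_codes = df["col"].cat.codes

print(bytes_before, bytes_after)
print(cat_codes.iloc[:5].tolist())
```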
Timings:
Interestingly, on a large df pandas is faster than numpy. I can't believe it.
len(df)=500k:
In [29]: %timeit (a(df1))
100 loops, best of 3: 9.27 ms per loop
In [30]: %timeit (a1(df2))
100 loops, best of 3: 9.32 ms per loop
In [31]: %timeit (b(df3))
10 loops, best of 3: 24.6 ms per loop
In [32]: %timeit (b1(df4))
10 loops, best of 3: 24.6 ms per loop
len(df)=5k:
In [38]: %timeit (a(df1))
1000 loops, best of 3: 274 µs per loop
In [39]: %timeit (a1(df2))
The slowest run took 6.71 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 273 µs per loop
In [40]: %timeit (b(df3))
The slowest run took 5.15 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 295 µs per loop
In [41]: %timeit (b1(df4))
1000 loops, best of 3: 294 µs per loop
len(df)=5:
In [46]: %timeit (a(df1))
1000 loops, best of 3: 206 µs per loop
In [47]: %timeit (a1(df2))
1000 loops, best of 3: 204 µs per loop
In [48]: %timeit (b(df3))
The slowest run took 6.30 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 164 µs per loop
In [49]: %timeit (b1(df4))
The slowest run took 6.44 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 164 µs per loop
Code for testing:
import numpy as np
import pandas as pd

d = {'col': ["baked","beans","baked","baked","beans"]}
df = pd.DataFrame(data=d)
print (df)

# test for 500k rows
df = pd.concat([df]*100000).reset_index(drop=True)
# test for 5k rows
#df = pd.concat([df]*1000).reset_index(drop=True)

# one copy per function, because each function overwrites 'col'
df1,df2,df3, df4 = df.copy(),df.copy(),df.copy(),df.copy()

# pandas: take the codes by position
def a(df):
    df['col'] = pd.factorize(df.col)[0]
    return df

# pandas: unpack the codes
def a1(df):
    idx,_ = pd.factorize(df.col)
    df['col'] = idx
    return df

# numpy: take the inverse indices by position
def b(df):
    df['col'] = np.unique(df['col'],return_inverse=True)[1]
    return df

# numpy: unpack the inverse indices
def b1(df):
    _,idx = np.unique(df['col'],return_inverse=True)
    df['col'] = idx
    return df

print (a(df1))
print (a1(df2))
print (b(df3))
print (b1(df4))
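Outside IPython, a similar comparison can be sketched with the standard `timeit` module. This is only an illustration, not the benchmark used above: the helper names `codes_factorize` and `codes_unique` are hypothetical, and unlike `a` and `b` they do not overwrite the column, so every repetition works on the original string values. Absolute numbers will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({"col": ["baked", "beans", "baked", "baked", "beans"]})
df = pd.concat([base] * 1000).reset_index(drop=True)  # 5k rows

def codes_factorize(frame):
    # pandas: codes in order of first appearance
    return pd.factorize(frame["col"])[0]

def codes_unique(frame):
    # numpy: codes index into the sorted unique values
    values = frame["col"].to_numpy(dtype=object)
    return np.unique(values, return_inverse=True)[1]

t_factorize = timeit.timeit(lambda: codes_factorize(df), number=20)
t_unique = timeit.timeit(lambda: codes_unique(df), number=20)
print("factorize:", t_factorize / 20, "s per call")
print("np.unique:", t_unique / 20, "s per call")

# For this data both approaches yield the same codes
same = np.array_equal(codes_factorize(df), codes_unique(df))
print(same)
```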
If I knew pandas better, I would probably appreciate it more, but this works too! Maybe do something like `idx, _ = pd.factorize(df.col)`, which might be a bit faster? Again, that is just a hunch :) – Divakar
I hope to start learning `numpy` someday; it has a lot of nice functions and it is faster. Thank you. Yes, I am going to run some tests. – jezrael
Hmmm, interesting, on a large `df` `pandas` is faster than `numpy`. – jezrael