python - HDFStore: table.select and RAM usage
I am trying to select random rows from an HDFStore table of about 1 GB. RAM usage explodes when I ask for about 50 random rows.

I am using pandas 0.11-dev, python 2.7, linux64.

In this first case the RAM usage fits the size of the chunk:

    with pd.get_store("train.h5", 'r') as train:
        for chunk in train.select('train', chunksize=50):
            pass

In this second case, it seems like the whole table is loaded into RAM:
    r = np.random.choice(400000, size=40, replace=False)
    train.select('train', pd.Term("index", r))

In this last case, RAM usage fits the equivalent chunk size:

    r = np.random.choice(400000, size=30, replace=False)
    train.select('train', pd.Term("index", r))

I am puzzled: why does moving from 30 to 40 random rows induce such a dramatic increase in RAM usage?
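To make the jump easier to see, here is a minimal sketch of how the peak memory around the two selects could be checked, assuming Linux and the standard-library resource module (the helper peak_rss_mb is my own, not part of the original code; the select calls mirror the ones above):

    import resource

    import numpy as np
    import pandas as pd

    def peak_rss_mb():
        # On Linux, ru_maxrss reports the peak resident set size in kilobytes
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

    with pd.get_store("train.h5", 'r') as train:
        for size in (30, 40):
            r = np.random.choice(400000, size=size, replace=False)
            train.select('train', pd.Term("index", r))
            print("%d rows -> peak RSS so far: %.0f MB" % (size, peak_rss_mb()))

Since ru_maxrss only ever grows, running the 30-row select first and the 40-row select second shows where the peak jumps.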
Note that the table has been indexed at creation time so that index = range(nrows(table)), using the following code:

    def txtfile2hdfstore(infile, storefile, table_name, sep="\t", header=0, chunksize=50000):
        max_len, dtypes0 = txtfile2dtypes(infile, sep, header, chunksize)

        with pd.get_store(storefile, 'w') as store:
            for i, chunk in enumerate(pd.read_table(infile, header=header, sep=sep, chunksize=chunksize, dtype=dict(dtypes0))):
                # give each chunk a global integer index: 0 .. nrows-1
                chunk.index = range(chunksize * i, chunksize * (i + 1))[:chunk.shape[0]]
                store.append(table_name, chunk, min_itemsize={'values': max_len})

Thanks for any insight.
EDIT (to answer Zelazny7):

Here is the file I used to write train.csv to train.h5. I wrote it using elements of Zelazny7's code from How to trouble-shoot HDFStore Exception: cannot find the correct atom type
    import pandas as pd
    import numpy as np
    from sklearn.feature_extraction import DictVectorizer


    def object_max_len(x):
        # length of the longest string in an object column (NaN treated as '')
        if x.dtype != 'object':
            return
        else:
            return len(max(x.fillna(''), key=lambda x: len(str(x))))

    def txtfile2dtypes(infile, sep="\t", header=0, chunksize=50000):
        max_len = pd.read_table(infile, header=header, sep=sep, nrows=5).apply(object_max_len).max()
        dtypes0 = pd.read_table(infile, header=header, sep=sep, nrows=5).dtypes

        for chunk in pd.read_table(infile, header=header, sep=sep, chunksize=chunksize):
            max_len = max((pd.DataFrame(chunk.apply(object_max_len)).max(), max_len))
            for i, k in enumerate(zip(dtypes0[:], chunk.dtypes)):
                if (k[0] != k[1]) and (k[1] == 'object'):
                    dtypes0[i] = k[1]
        # as of pandas-0.11, NaN requires a float64 dtype
        dtypes0.values[dtypes0 == np.int64] = np.dtype('float64')
        return max_len, dtypes0


    def txtfile2hdfstore(infile, storefile, table_name, sep="\t", header=0, chunksize=50000):
        max_len, dtypes0 = txtfile2dtypes(infile, sep, header, chunksize)

        with pd.get_store(storefile, 'w') as store:
            for i, chunk in enumerate(pd.read_table(infile, header=header, sep=sep, chunksize=chunksize, dtype=dict(dtypes0))):
                chunk.index = range(chunksize * i, chunksize * (i + 1))[:chunk.shape[0]]
                store.append(table_name, chunk, min_itemsize={'values': max_len})

Applied as:

    txtfile2hdfstore('train.csv', 'train.h5', 'train', sep=',')
This is a known issue, see the reference here: https://github.com/pydata/pandas/pull/2755
Essentially the query is turned into a numexpr expression for evaluation. There is an issue where you can't pass a lot of or conditions to numexpr (it depends on the total length of the generated expression).

So we limit the expression that is passed to numexpr. If it exceeds a certain number of or conditions, the query is done as a filter rather than an in-kernel selection, which means the table is read and then reindexed.
This is on the enhancements list: https://github.com/pydata/pandas/issues/2391 (17).
As a workaround, just split your queries up into multiple ones and concat the results. This should be much faster, and use a constant amount of memory.
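For instance, a minimal sketch of that workaround (my own illustration, not code from the question or the linked PR; the helper name select_rows and the batch size of 20 are assumptions):

    import numpy as np
    import pandas as pd

    def select_rows(store, key, rows, batch=20):
        # Query a small batch of row labels at a time so each generated
        # numexpr expression stays short, then concatenate the pieces.
        pieces = []
        for start in range(0, len(rows), batch):
            sub = list(rows[start:start + batch])
            pieces.append(store.select(key, pd.Term("index", sub)))
        return pd.concat(pieces)

    r = np.random.choice(400000, size=40, replace=False)
    with pd.get_store("train.h5", 'r') as train:
        sample = select_rows(train, 'train', r)

Each batch stays small enough to be handled as an in-kernel selection, so peak memory stays roughly at the size of one batch instead of the whole table.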