python - HDFStore: table.select and RAM usage
I am trying to select random rows from an HDFStore table of about 1 GB. RAM usage explodes when I ask for about 50 random rows.

I am using pandas 0.11-dev, python 2.7, linux64.

In this first case the RAM usage fits the size of the chunk:

    with pd.get_store("train.h5", 'r') as train:
        for chunk in train.select('train', chunksize=50):
            pass

In this second case, it seems like the whole table is loaded into RAM:
    r = np.random.choice(400000, size=40, replace=False)
    train.select('train', pd.Term("index", r))

In this last case, RAM usage fits the equivalent chunk size:

    r = np.random.choice(400000, size=30, replace=False)
    train.select('train', pd.Term("index", r))

I am puzzled: why does moving from 30 to 40 random rows induce such a dramatic increase in RAM usage?
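To make the jump easier to see, here is a minimal sketch of how the peak memory around the two selects could be checked, assuming Linux and the standard-library resource module (the helper peak_rss_mb is my own, not part of the original code; the select calls mirror the ones above):

    import resource

    import numpy as np
    import pandas as pd

    def peak_rss_mb():
        # On Linux, ru_maxrss reports the peak resident set size in kilobytes
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

    with pd.get_store("train.h5", 'r') as train:
        for size in (30, 40):
            r = np.random.choice(400000, size=size, replace=False)
            train.select('train', pd.Term("index", r))
            print("%d rows -> peak RSS so far: %.0f MB" % (size, peak_rss_mb()))

Since ru_maxrss only ever grows, running the 30-row select first and the 40-row select second shows where the peak jumps.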
Note that the table has been indexed at creation time so that index = range(nrows(table)), using the following code:

    def txtfile2hdfstore(infile, storefile, table_name, sep="\t", header=0, chunksize=50000):
        max_len, dtypes0 = txtfile2dtypes(infile, sep, header, chunksize)

        with pd.get_store(storefile, 'w') as store:
            for i, chunk in enumerate(pd.read_table(infile, header=header, sep=sep, chunksize=chunksize, dtype=dict(dtypes0))):
                # give each chunk a global integer index: 0 .. nrows-1
                chunk.index = range(chunksize * i, chunksize * (i + 1))[:chunk.shape[0]]
                store.append(table_name, chunk, min_itemsize={'values': max_len})

Thanks for any insight.
EDIT (to answer Zelazny7):

Here is the file I used to write train.csv to train.h5. I wrote it using elements of Zelazny7's code from How to trouble-shoot HDFStore Exception: cannot find the correct atom type
    import pandas as pd
    import numpy as np
    from sklearn.feature_extraction import DictVectorizer


    def object_max_len(x):
        # length of the longest string in an object column (NaN treated as '')
        if x.dtype != 'object':
            return
        else:
            return len(max(x.fillna(''), key=lambda x: len(str(x))))

    def txtfile2dtypes(infile, sep="\t", header=0, chunksize=50000):
        max_len = pd.read_table(infile, header=header, sep=sep, nrows=5).apply(object_max_len).max()
        dtypes0 = pd.read_table(infile, header=header, sep=sep, nrows=5).dtypes

        for chunk in pd.read_table(infile, header=header, sep=sep, chunksize=chunksize):
            max_len = max((pd.DataFrame(chunk.apply(object_max_len)).max(), max_len))
            for i, k in enumerate(zip(dtypes0[:], chunk.dtypes)):
                if (k[0] != k[1]) and (k[1] == 'object'):
                    dtypes0[i] = k[1]
        # as of pandas-0.11, NaN requires a float64 dtype
        dtypes0.values[dtypes0 == np.int64] = np.dtype('float64')
        return max_len, dtypes0


    def txtfile2hdfstore(infile, storefile, table_name, sep="\t", header=0, chunksize=50000):
        max_len, dtypes0 = txtfile2dtypes(infile, sep, header, chunksize)

        with pd.get_store(storefile, 'w') as store:
            for i, chunk in enumerate(pd.read_table(infile, header=header, sep=sep, chunksize=chunksize, dtype=dict(dtypes0))):
                chunk.index = range(chunksize * i, chunksize * (i + 1))[:chunk.shape[0]]
                store.append(table_name, chunk, min_itemsize={'values': max_len})

Applied as:

    txtfile2hdfstore('train.csv', 'train.h5', 'train', sep=',')
This is a known issue, see the reference here: https://github.com/pydata/pandas/pull/2755
Essentially the query is turned into a numexpr expression for evaluation. There is an issue where you can't pass a lot of or conditions to numexpr (it depends on the total length of the generated expression).

So we limit the expression that is passed to numexpr. If it exceeds a certain number of or conditions, the query is done as a filter rather than an in-kernel selection, which means the table is read and then reindexed.
This is on the enhancements list: https://github.com/pydata/pandas/issues/2391 (17).
As a workaround, just split your queries up into multiple ones and concat the results. This should be much faster, and use a constant amount of memory.
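For instance, a minimal sketch of that workaround (my own illustration, not code from the question or the linked PR; the helper name select_rows and the batch size of 20 are assumptions):

    import numpy as np
    import pandas as pd

    def select_rows(store, key, rows, batch=20):
        # Query a small batch of row labels at a time so each generated
        # numexpr expression stays short, then concatenate the pieces.
        pieces = []
        for start in range(0, len(rows), batch):
            sub = list(rows[start:start + batch])
            pieces.append(store.select(key, pd.Term("index", sub)))
        return pd.concat(pieces)

    r = np.random.choice(400000, size=40, replace=False)
    with pd.get_store("train.h5", 'r') as train:
        sample = select_rows(train, 'train', r)

Each batch stays small enough to be handled as an in-kernel selection, so peak memory stays roughly at the size of one batch instead of the whole table.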