python - HDFStore: table.select and RAM usage


I am trying to select random rows from an HDFStore table of about 1 GB. RAM usage explodes when I ask for about 50 random rows.

I am using pandas 0.11-dev, Python 2.7, linux64.

In the first case, RAM usage fits the size of the chunk:

with pd.get_store("train.h5",'r') as train:
    for chunk in train.select('train', chunksize=50):
        pass

In the second case, it seems that the whole table is loaded into RAM:

r = random.choice(400000, size=40, replace=False)
train.select('train', pd.Term("index", r))

In the last case, RAM usage again fits the equivalent chunk size:

r = random.choice(400000, size=30, replace=False)
train.select('train', pd.Term("index", r))

I am puzzled why moving from 30 to 40 random rows induces such a dramatic increase in RAM usage.

Note that the table was indexed when created such that index=range(nrows(table)), using the following code:

def txtfile2hdfstore(infile, storefile, table_name, sep="\t", header=0, chunksize=50000):
    max_len, dtypes0 = txtfile2dtypes(infile, sep, header, chunksize)

    with pd.get_store(storefile, 'w') as store:
        for i, chunk in enumerate(pd.read_table(infile, header=header, sep=sep, chunksize=chunksize, dtype=dict(dtypes0))):
            chunk.index = range(chunksize*(i), chunksize*(i+1))[:chunk.shape[0]]
            store.append(table_name, chunk, min_itemsize={'values': max_len})

Thanks for any insight.

Edit: to answer Zelazny7

Here's the file I used to write train.csv to train.h5. I wrote this using elements of Zelazny7's code from How to trouble-shoot HDFStore Exception: cannot find the correct atom type

import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer


def object_max_len(x):
    if x.dtype != 'object':
        return
    else:
        return len(max(x.fillna(''), key=lambda x: len(str(x))))

def txtfile2dtypes(infile, sep="\t", header=0, chunksize=50000):
    max_len = pd.read_table(infile, header=header, sep=sep, nrows=5).apply(object_max_len).max()
    dtypes0 = pd.read_table(infile, header=header, sep=sep, nrows=5).dtypes

    for chunk in pd.read_table(infile, header=header, sep=sep, chunksize=chunksize):
        max_len = max((pd.DataFrame(chunk.apply(object_max_len)).max(), max_len))
        for i, k in enumerate(zip(dtypes0[:], chunk.dtypes)):
            if (k[0] != k[1]) and (k[1] == 'object'):
                dtypes0[i] = k[1]
    # as of pandas 0.11, nan requires a float64 dtype
    dtypes0.values[dtypes0 == np.int64] = np.dtype('float64')
    return max_len, dtypes0


def txtfile2hdfstore(infile, storefile, table_name, sep="\t", header=0, chunksize=50000):
    max_len, dtypes0 = txtfile2dtypes(infile, sep, header, chunksize)

    with pd.get_store(storefile, 'w') as store:
        for i, chunk in enumerate(pd.read_table(infile, header=header, sep=sep, chunksize=chunksize, dtype=dict(dtypes0))):
            chunk.index = range(chunksize*(i), chunksize*(i+1))[:chunk.shape[0]]
            store.append(table_name, chunk, min_itemsize={'values': max_len})

Applied as:

txtfile2hdfstore('train.csv','train.h5','train',sep=',') 

This is a known issue, see the reference here: https://github.com/pydata/pandas/pull/2755

Essentially the query is turned into a numexpr expression for evaluation. There is an issue where you can't pass a lot of or conditions to numexpr (it's dependent on the total length of the generated expression).

So we limit the expression that is passed to numexpr. If it exceeds a certain number of or conditions, then the query is done as a filter, rather than an in-kernel selection. This means the whole table is read and then reindexed.
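For intuition only (this is a hypothetical illustration, not pandas internals): a selection on a list of values expands into roughly one or-condition per value, so the generated expression grows with the number of rows requested, and past the threshold the query falls back to the full-table filter described above.

# Hypothetical sketch of how a list of index values becomes a long chain
# of or-conditions; the expression length grows linearly with len(r).
import numpy as np

r = np.random.choice(400000, size=40, replace=False)
expr = " | ".join("(index == %d)" % i for i in r)
print(len(r), len(expr))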

This is on the enhancements list: https://github.com/pydata/pandas/issues/2391 (17).

As a workaround, split your queries up into multiple ones and concat the results. This should be faster, and use a constant amount of memory.
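A minimal sketch of that workaround, assuming the same train.h5 store and the pd.Term form used in the question; the batch size of 30 is an assumption chosen to stay under the or-condition limit observed above:

# Select the random rows in small batches and concatenate the results,
# so each select stays an in-kernel selection.
import numpy as np
import pandas as pd

r = np.random.choice(400000, size=40, replace=False)
batch = 30  # assumed to be below the or-condition limit

with pd.get_store("train.h5", 'r') as train:
    parts = [train.select('train', pd.Term("index", r[i:i + batch]))
             for i in range(0, len(r), batch)]

result = pd.concat(parts)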

