python - HDFStore: table.select and RAM usage
I am trying to select random rows from an HDFStore table of about 1 GB. RAM usage explodes when I ask for about 50 random rows.
I am using pandas 0.11-dev, python 2.7, linux64.
In this first case, RAM usage fits the size of the chunk:
with pd.get_store("train.h5", 'r') as train:
    for chunk in train.select('train', chunksize=50):
        pass
In this second case, it seems that the whole table is loaded into RAM:
r = random.choice(400000, size=40, replace=False)
train.select('train', pd.Term("index", r))
In this last case, RAM usage fits the equivalent chunk size:
r = random.choice(400000, size=30, replace=False)
train.select('train', pd.Term("index", r))
I am puzzled: why does moving from 30 to 40 random rows induce such a dramatic increase in RAM usage?
Note that the table has been indexed when created, such that index=range(nrows(table)), using the following code:
def txtfile2hdfstore(infile, storefile, table_name, sep="\t", header=0, chunksize=50000):
    max_len, dtypes0 = txtfile2dtypes(infile, sep, header, chunksize)

    with pd.get_store(storefile, 'w') as store:
        for i, chunk in enumerate(pd.read_table(infile, header=header, sep=sep, chunksize=chunksize, dtype=dict(dtypes0))):
            chunk.index = range(chunksize*(i), chunksize*(i+1))[:chunk.shape[0]]
            store.append(table_name, chunk, min_itemsize={'values': max_len})
Thanks for your insight.
EDIT: To answer Zelazny7, here's the file I used to write train.csv to train.h5. I wrote it using elements of Zelazny7's code from "How to trouble-shoot HDFStore Exception: cannot find the correct atom type".
import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer


def object_max_len(x):
    if x.dtype != 'object':
        return
    else:
        return len(max(x.fillna(''), key=lambda x: len(str(x))))


def txtfile2dtypes(infile, sep="\t", header=0, chunksize=50000):
    max_len = pd.read_table(infile, header=header, sep=sep, nrows=5).apply(object_max_len).max()
    dtypes0 = pd.read_table(infile, header=header, sep=sep, nrows=5).dtypes

    for chunk in pd.read_table(infile, header=header, sep=sep, chunksize=chunksize):
        max_len = max((pd.DataFrame(chunk.apply(object_max_len)).max(), max_len))
        for i, k in enumerate(zip(dtypes0[:], chunk.dtypes)):
            if (k[0] != k[1]) and (k[1] == 'object'):
                dtypes0[i] = k[1]
    # as of pandas-0.11, nan requires a float64 dtype
    dtypes0.values[dtypes0 == np.int64] = np.dtype('float64')
    return max_len, dtypes0


def txtfile2hdfstore(infile, storefile, table_name, sep="\t", header=0, chunksize=50000):
    max_len, dtypes0 = txtfile2dtypes(infile, sep, header, chunksize)

    with pd.get_store(storefile, 'w') as store:
        for i, chunk in enumerate(pd.read_table(infile, header=header, sep=sep, chunksize=chunksize, dtype=dict(dtypes0))):
            chunk.index = range(chunksize*(i), chunksize*(i+1))[:chunk.shape[0]]
            store.append(table_name, chunk, min_itemsize={'values': max_len})
Applied as:

txtfile2hdfstore('train.csv', 'train.h5', 'train', sep=',')
This is a known issue, see the reference here: https://github.com/pydata/pandas/pull/2755
Essentially, the query is turned into a numexpr expression for evaluation. There is an issue in that a lot of or conditions cannot be passed to numexpr (it depends on the total length of the generated expression).
So the expression passed to numexpr is limited: if it exceeds a certain number of or conditions, the query is done as a filter rather than an in-kernel selection. This means the table is read and then reindexed.
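For intuition only (this is not the literal pandas internals), a list of requested row indices roughly amounts to a chain of or-ed equality conditions, so the length of the generated expression grows with the number of rows:

# Illustration only: a longer list of rows means a longer chained-or expression.
rows = [3, 17, 42, 99]
expr = " | ".join("(index == %d)" % v for v in rows)
print(expr)       # (index == 3) | (index == 17) | (index == 42) | (index == 99)
print(len(expr))  # grows with the number of requested rows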
This is on the enhancements list: https://github.com/pydata/pandas/issues/2391 (17).
As a workaround, split your queries up into multiple ones and concat the results. This should be faster, and use a constant amount of memory.
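A minimal sketch of that workaround, assuming the same pandas 0.11-era API used in the question (pd.get_store, pd.Term); the helper name select_rows_chunked and the sub-query size of 10 rows are arbitrary:

import numpy as np
import pandas as pd

def select_rows_chunked(store, key, rows, size=10):
    # Hypothetical helper: issue one small query per group of row indices so
    # each Term stays short enough for an in-kernel selection, then
    # concatenate the partial results.
    pieces = []
    for i in range(0, len(rows), size):
        pieces.append(store.select(key, pd.Term("index", list(rows[i:i + size]))))
    return pd.concat(pieces)

r = np.random.choice(400000, size=40, replace=False)
with pd.get_store("train.h5", 'r') as train:
    sample = select_rows_chunked(train, 'train', r, size=10)

Each sub-query then behaves like the 30-row case above: it is small enough for an in-kernel selection, so peak memory is bounded by the sub-query size rather than by the whole table.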