gawk - Using multiple threads/cores for awk performance improvement
I have a directory with ~50k files. Each file has ~700,000 lines. I have written an awk program that reads each line and prints it if there is an error. It runs fine, but the time taken is huge - ~4 days! Is there any way to reduce this time? Can I use multiple cores (processes)? Has anyone tried this before?
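For reference, the kind of single-process scan described here looks roughly like the sketch below. The pattern /ERROR/ and the directory path are placeholders, not the actual program:

    # hypothetical example: scan every file for lines containing ERROR
    # and print the file name alongside the matching line
    awk '/ERROR/ { print FILENAME ": " $0 }' /path/to/files/*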
awk and gawk will not fix this by themselves. There is no magic "make it parallel" switch. You will need to rewrite to some degree:
- Shard by file - the simplest fix is to run multiple awks in parallel, one per file. You will need some sort of dispatch mechanism; "Parallelize Bash script with maximum number of processes" shows how you can write one in shell (see the sketch after this list). It takes more reading, but if you want more features, check out Gearman or Celery, which should be adaptable to this problem.
- Better hardware - it sounds like you may need a faster CPU to make this go faster, but it could also be an I/O issue. Graphs of CPU and I/O from Munin or another monitoring system would help isolate the bottleneck. Have you tried running the job on an SSD-based system? That is often an easy win these days.
- Caching - there may be some amount of duplicate lines or files. If there are enough duplicates, it could help to cache the processing in some way: if you calculate a CRC/md5sum for each file and store it in a database, you can calculate the md5sum for a new file and skip processing if you have already seen it (a sketch follows the parallel example below).
- Complete rewrite - scaling this with awk is going to get ridiculous at some point. Using a map-reduce framework might be a good idea.
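A minimal sketch of the sharding idea, assuming the error check is a simple pattern match and that xargs is available. The pattern, paths, and process count are assumptions to adapt to your setup:

    # run one awk per file, up to 8 at a time; tune -P to your core count
    find /path/to/files -type f -print0 |
      xargs -0 -n 1 -P 8 awk '/ERROR/ { print FILENAME ": " $0 }' > errors.txt

GNU parallel can be used the same way if you prefer it over xargs.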
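And a minimal sketch of the md5sum caching idea, using a plain text file in place of a database. The file names and the awk command are placeholders:

    # skip files whose checksum has already been processed
    # (seen.md5 acts as the "database" of checksums)
    touch seen.md5
    for file in /path/to/files/*; do
        sum=$(md5sum "$file" | cut -d' ' -f1)
        if ! grep -q "$sum" seen.md5; then
            awk '/ERROR/ { print FILENAME ": " $0 }' "$file"
            echo "$sum" >> seen.md5
        fi
    done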