perl - Process two space delimited text files into one by common column
I have 2 text files like:
col1 primary col3 col4
blah 1       blah 4
1    2       5    6
...
and
cola primary colc cold
1    1       7    27
foo  2       11   13
I want to merge them into a single wider table, such as:
primary col1 col3 col4 cola colc cold
1       blah blah 4    1    7    27
2       1    5    6    foo  11   13
I'm pretty new to Perl, so I'm not sure of the best way to do this. Note that the column order doesn't matter, and there are a couple million rows. The files are unfortunately not sorted.
My current plan, unless there's an alternative: given a line in one of the files, scan the other file for the matching row and append them both as necessary to a new file. That sounds slow and cumbersome though.
Thanks!
Solution 1.
Read the smaller of the 2 files line by line, using the standard CPAN delimited-file parser Text::CSV_XS to parse out the columns. Save each record (as an arrayref of columns) in a hash, with the merge column being the hash key.
When done, read the larger of the 2 files line by line, again using Text::CSV_XS to parse out the columns. For each record, find the join key field, look up the matching record in the hash storing the data from file #1, merge the 2 records as needed, and print.
Note: this is pretty memory intensive since the entire smaller file will live in memory, but it won't require reading one of the files millions of times.
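For concreteness, here is a minimal Perl sketch of Solution 1. The file names file1.txt and file2.txt and the assumption that the join column is literally named "primary" in both headers are mine, and it splits on whitespace for brevity instead of configuring Text::CSV_XS; treat it as a starting point rather than a drop-in answer.

    use strict;
    use warnings;

    # Hypothetical input names; file1.txt is assumed to be the smaller file.
    my ($small_file, $big_file) = ('file1.txt', 'file2.txt');

    # Pass 1: load the smaller file into %small, keyed by the join column.
    open my $sfh, '<', $small_file or die "Can't open $small_file: $!";
    my @sheader = split ' ', scalar <$sfh>;
    my ($skey)  = grep { $sheader[$_] eq 'primary' } 0 .. $#sheader;
    my @sdata   = grep { $_ != $skey } 0 .. $#sheader;   # non-key column indexes

    my %small;
    while (my $line = <$sfh>) {
        my @cols = split ' ', $line;
        $small{ $cols[$skey] } = [ @cols[@sdata] ];
    }
    close $sfh;

    # Pass 2: stream the larger file and print merged records.
    open my $bfh, '<', $big_file or die "Can't open $big_file: $!";
    my @bheader = split ' ', scalar <$bfh>;
    my ($bkey)  = grep { $bheader[$_] eq 'primary' } 0 .. $#bheader;
    my @bdata   = grep { $_ != $bkey } 0 .. $#bheader;

    print join(' ', 'primary', @bheader[@bdata], @sheader[@sdata]), "\n";
    while (my $line = <$bfh>) {
        my @cols  = split ' ', $line;
        my $key   = $cols[$bkey];
        my @other = $small{$key} ? @{ $small{$key} } : ();  # empty if no match
        print join(' ', $key, @cols[@bdata], @other), "\n";
    }
    close $bfh;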
Solution 2.
Sort file1 (using Unix sort or simple Perl code) into "file1.sorted".
Sort file2 (using Unix sort or simple Perl code) into "file2.sorted".
Open both files for reading. Loop until both are fully read:
Read 1 line from each file into its buffer if that file's buffer is empty (the buffer being a variable containing the next record).
Compare the join indexes between the 2 buffered lines.
If index1 < index2, write the record from file1 to the output (without merging), empty buffer1, and repeat the loop.
If index1 > index2, write the record from file2 to the output (without merging), empty buffer2, and repeat.
If index1 == index2, merge the 2 records, write the merged record to the output, and empty out both buffers (assuming the join index column is unique; if it is not unique, this step gets more complicated).
Note: this does not require keeping an entire file in memory, aside from sorting the files (which can be done in a memory-constrained way if you need to).
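To make the merge step concrete, here is a minimal Perl sketch of the loop above. It assumes both inputs have already been sorted on the join key (for example with Unix sort) into the hypothetical files file1.sorted and file2.sorted, that the key is the first column, that there is no header line, and that keys are unique within each file; none of those names or layout choices come from the question.

    use strict;
    use warnings;

    # Pre-sorted, headerless inputs with the join key as the first column.
    open my $fh1, '<', 'file1.sorted' or die "Can't open file1.sorted: $!";
    open my $fh2, '<', 'file2.sorted' or die "Can't open file2.sorted: $!";

    my $buf1 = <$fh1>;    # buffer holding the next unconsumed record of file1
    my $buf2 = <$fh2>;    # buffer holding the next unconsumed record of file2

    while (defined $buf1 && defined $buf2) {
        my ($key1, @rest1) = split ' ', $buf1;
        my ($key2, @rest2) = split ' ', $buf2;

        # Keys are compared as strings, matching the default lexicographic
        # order of Unix sort; use <=> (and sort -n) for numeric keys instead.
        if ($key1 lt $key2) {                 # record exists only in file1
            print join(' ', $key1, @rest1), "\n";
            $buf1 = <$fh1>;
        }
        elsif ($key1 gt $key2) {              # record exists only in file2
            print join(' ', $key2, @rest2), "\n";
            $buf2 = <$fh2>;
        }
        else {                                # matching keys: merge, advance both
            print join(' ', $key1, @rest1, @rest2), "\n";
            $buf1 = <$fh1>;
            $buf2 = <$fh2>;
        }
    }

    # Drain whichever file still has records once the other is exhausted.
    while (defined $buf1) { print $buf1; $buf1 = <$fh1>; }
    while (defined $buf2) { print $buf2; $buf2 = <$fh2>; }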