perl - Process two space delimited text files into one by common column

May 15, 2015

This question already has an answer here:

Merge 2 files by key if exists in first file / bash script [duplicate] (2 answers)

I have 2 text files like:

    col1 primary col3 col4
    blah 1       blah 4
    1    2       5    6
    ...

and

    cola primary colc cold
    1    1       7    27
    foo  2       11   13

I want to merge them into a single wider table, such as:

    primary col1 col3 col4 cola colc cold
    1       blah blah 4    1    7    27
    2       1    5    6    foo  11   13

I'm pretty new to Perl, so I'm not sure of the best way to do this. Note that column order does not matter, and there are a couple million rows. The files are unfortunately not sorted.

My current plan, unless there's an alternative: given a line in one of the files, scan the other file for a matching row, and append them both as necessary to a new file. This sounds slow and cumbersome though.

Thanks!

Solution 1.
1. Read the smaller of the 2 files line by line, using a standard CPAN delimited-file parser such as Text::CSV_XS to parse out the columns.
2. Save each record (as an arrayref of columns) in a hash, with the merge column being the hash key.
3. When done, read the larger of the 2 files line by line, using a standard CPAN delimited-file parser such as Text::CSV_XS to parse out the columns.
4. For each record, find the join key field, find the matching record in the hash storing the data from file #1, merge the 2 records as needed, and print.
Note: this is pretty memory intensive, as the entire smaller file will live in memory, but it won't require you to read one of the files a million times.
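
A minimal sketch of solution 1, under some assumptions that are not in the original answer: fields never contain embedded whitespace, so a plain `split ' '` stands in for Text::CSV_XS to keep the example dependency-free; each file starts with a header line; and rows of the larger file with no match in the hash are simply skipped. The names `read_header` and `hash_join` are made up for illustration.

```perl
use strict;
use warnings;

# Read the header line of a file and return its column names.
sub read_header {
    my ($fh) = @_;
    my $line = <$fh>;
    die "missing header line\n" unless defined $line;
    return split ' ', $line;
}

# Hash join: hold the smaller file in memory, stream the larger one.
# $key is the name of the shared column ("primary" in the question).
sub hash_join {
    my ($small_fh, $large_fh, $out_fh, $key) = @_;

    my @small_cols = read_header($small_fh);
    my ($small_idx) = grep { $small_cols[$_] eq $key } 0 .. $#small_cols;
    die "no '$key' column in smaller file\n" unless defined $small_idx;
    splice @small_cols, $small_idx, 1;

    # Steps 1-2: hash of key => arrayref of the remaining columns.
    my %rows;
    while (my $line = <$small_fh>) {
        my @f = split ' ', $line;
        next unless @f;                      # skip blank lines
        my ($k) = splice @f, $small_idx, 1;
        $rows{$k} = \@f;
    }

    my @large_cols = read_header($large_fh);
    my ($large_idx) = grep { $large_cols[$_] eq $key } 0 .. $#large_cols;
    die "no '$key' column in larger file\n" unless defined $large_idx;
    splice @large_cols, $large_idx, 1;

    print {$out_fh} join(' ', $key, @large_cols, @small_cols), "\n";

    # Steps 3-4: stream the larger file and look each key up in the hash.
    while (my $line = <$large_fh>) {
        my @f = split ' ', $line;
        next unless @f;
        my ($k) = splice @f, $large_idx, 1;
        my $match = $rows{$k} or next;       # assumption: drop unmatched rows
        print {$out_fh} join(' ', $k, @f, @$match), "\n";
    }
}

# Demo on the sample rows from the question, using in-memory filehandles.
my $file1 = "col1 primary col3 col4\nblah 1 blah 4\n1 2 5 6\n";
my $file2 = "cola primary colc cold\n1 1 7 27\nfoo 2 11 13\n";
open my $in1, '<', \$file1 or die $!;
open my $in2, '<', \$file2 or die $!;
hash_join($in2, $in1, \*STDOUT, 'primary');
```

With real files you would open them by name instead and pass the smaller one first. Output columns are separated by a single space here; since column order does not matter in the question, the key is simply moved to the front.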

Solution 2.
1. Sort file1 (using Unix sort or some simple Perl code) into "file1.sorted".
2. Sort file2 (using Unix sort or some simple Perl code) into "file2.sorted".
3. Open both files for reading. Loop until both are fully read:
  - Read 1 line from each file into its buffer if the buffer for that file is empty (the buffer being a variable containing the next record).
  - Compare the indexes between the 2 lines.
  - If index1 < index2, write the record from file1 into the output (without merging) and empty buffer1. Repeat step 3.
  - If index1 > index2, write the record from file2 into the output (without merging) and empty buffer2. Repeat.
  - If index1 == index2, merge the 2 records, write the merged record into the output and empty out both buffers (assuming the join index column is unique. If it is not unique, this step will be more complicated).
Note: this does not require you to keep an entire file in memory, aside from sorting the files (which can be done in a memory-constrained way if you need to).
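
Step 3 of solution 2 could look roughly like the sketch below. It assumes things the answer does not spell out: both inputs are already sorted numerically on the join column, header lines have been set aside before sorting, keys are unique integers, and fields contain no embedded whitespace. The names `next_record` and `merge_join` are hypothetical.

```perl
use strict;
use warnings;

# Return the next non-blank record as [key, other fields...],
# or undef at end of file. $idx is the position of the join column.
sub next_record {
    my ($fh, $idx) = @_;
    while (my $line = <$fh>) {
        my @f = split ' ', $line;
        next unless @f;
        my ($k) = splice @f, $idx, 1;
        return [$k, @f];
    }
    return;
}

# Merge join over two inputs sorted numerically by key (assumption).
sub merge_join {
    my ($fh1, $idx1, $fh2, $idx2, $out_fh) = @_;
    my ($buf1, $buf2);    # the "buffers": next unread record of each file

    while (1) {
        # Refill whichever buffer is empty.
        $buf1 //= next_record($fh1, $idx1);
        $buf2 //= next_record($fh2, $idx2);
        last unless defined($buf1) || defined($buf2);

        # An exhausted file always loses the comparison.
        my $cmp = !defined($buf2) ? -1
                : !defined($buf1) ?  1
                : $buf1->[0] <=> $buf2->[0];

        if ($cmp < 0) {           # index1 < index2: file1 record, unmerged
            print {$out_fh} "@$buf1\n";
            undef $buf1;
        }
        elsif ($cmp > 0) {        # index1 > index2: file2 record, unmerged
            print {$out_fh} "@$buf2\n";
            undef $buf2;
        }
        else {                    # same key: merge, then empty both buffers
            my @rest2 = @{$buf2}[1 .. $#{$buf2}];
            print {$out_fh} join(' ', @$buf1, @rest2), "\n";
            undef $buf1;
            undef $buf2;
        }
    }
}

# Demo: pre-sorted sample rows without headers; key is the second column.
my $sorted1 = "blah 1 blah 4\n1 2 5 6\n";
my $sorted2 = "1 1 7 27\nfoo 2 11 13\nbar 3 8 9\n";
open my $s1, '<', \$sorted1 or die $!;
open my $s2, '<', \$sorted2 or die $!;
merge_join($s1, 1, $s2, 1, \*STDOUT);
```

Unmatched records are written as-is, following the steps above, so those output rows have fewer columns than merged ones. If the keys are not numeric, the `<=>` comparison would need to become `cmp`, and the files would have to be sorted in the matching string order.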
