perl - Process two space-delimited text files into one by common column


I have two text files like:

    col1 primary col3 col4
    blah 1       blah 4
    1    2       5    6
    ...

and

    cola primary colc cold
    1    1       7    27
    foo  2       11   13

I want to merge them into a single wider table, such as:

    primary  col1 col3 col4 cola colc cold
    1        blah blah 4    1    7    27
    2        1    5    6    foo  11   13

I'm pretty new to Perl, so I'm not sure of the best way to do this. Note that column order does not matter, and there are a couple million rows. The files are unfortunately not sorted.

My current plan, unless there's an alternative: given a line in one of the files, scan the other file for a matching row and append them both as necessary to a new file. This sounds slow and cumbersome, though.

Thanks!

  • Solution 1.

    1. Read the smaller of the two files line by line, using a standard CPAN delimited-file parser such as Text::CSV_XS to parse out the columns.

    2. Save each record (as an arrayref of columns) in a hash, with the merge column as the hash key.

    3. When done, read the larger of the two files line by line, using the same parser to parse out the columns.

    4. For each record, find the join key field, look up the matching record in the hash that stores the data from file #1, merge the two records as needed, and print.

    Note: this is pretty memory intensive, since the entire smaller file will live in memory, but it won't require you to read one of the files millions of times.
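
    The four steps above could be sketched roughly as follows. This is a minimal sketch under several assumptions: fields never contain embedded whitespace, so a plain `split ' '` stands in for Text::CSV_XS to keep the example dependency-free; each file starts with a header line; the join key is unique in the smaller file; and rows with no match are skipped (an inner join). The names `fields` and `hash_join` are hypothetical helpers made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a whitespace-delimited line into its fields.
sub fields {
    my ($line) = @_;
    return split ' ', ($line // '');
}

# Hypothetical helper: find the index of the key column in a header row.
sub key_index {
    my ($key, @head) = @_;
    my ($idx) = grep { $head[$_] eq $key } 0 .. $#head;
    die "No '$key' column found\n" unless defined $idx;
    return $idx;
}

# Load $small into a hash keyed on the join column, then stream $large
# and print one merged row per match to the filehandle $out.
sub hash_join {
    my ($small, $large, $key, $out) = @_;

    open my $sfh, '<', $small or die "Cannot open $small: $!";
    my @shead = fields(scalar <$sfh>);
    my $skey  = key_index($key, @shead);
    my @srest = grep { $_ != $skey } 0 .. $#shead;   # non-key columns

    my %rows;                                        # key => arrayref of columns
    while (my $line = <$sfh>) {
        my @f = fields($line);
        next unless @f;                              # skip blank lines
        $rows{ $f[$skey] } = [ @f[@srest] ];
    }
    close $sfh;

    open my $lfh, '<', $large or die "Cannot open $large: $!";
    my @lhead = fields(scalar <$lfh>);
    my $lkey  = key_index($key, @lhead);
    my @lrest = grep { $_ != $lkey } 0 .. $#lhead;

    print {$out} join(' ', $key, @lhead[@lrest], @shead[@srest]), "\n";
    while (my $line = <$lfh>) {
        my @f = fields($line);
        next unless @f;
        my $match = $rows{ $f[$lkey] } or next;      # assumption: drop unmatched rows
        print {$out} join(' ', $f[$lkey], @f[@lrest], @$match), "\n";
    }
    close $lfh;
    return;
}

# Usage: perl join.pl smaller.txt larger.txt > merged.txt
hash_join(@ARGV[0, 1], 'primary', \*STDOUT) if @ARGV == 2;
```

    Only the smaller file is held in memory, and each file is read exactly once. If unmatched rows should be kept, the `or next` line is the place to change.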


  • Solution 2.

    1. Sort file1 (using Unix sort or some simple Perl code) into "file1.sorted".

    2. Sort file2 (using Unix sort or some simple Perl code) into "file2.sorted".

    3. Open both files for reading. Loop until both are fully read:

      • Read one line from each file into its buffer if that file's buffer is empty (a buffer being a variable containing the next record).

      • Compare the indexes between the two lines.

      • If index1 < index2, write the record from file1 to the output (without merging) and empty buffer1. Repeat step 3.

      • If index1 > index2, write the record from file2 to the output (without merging) and empty buffer2. Repeat step 3.

      • If index1 == index2, merge the two records, write the merged record to the output, and empty both buffers (assuming the join index column is unique; if it is not unique, this step is more complicated).

    Note: this does not require keeping an entire file in memory, aside from sorting the files (which can be done in a memory-constrained way if you need to).
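
    The loop in step 3 might look something like the sketch below. It assumes that the header lines have already been set aside, that both inputs are sorted on the key column in plain string order (so Perl's `cmp` agrees with the sort), that keys are unique within each file, and that fields contain no embedded whitespace. `next_record` and `merge_join` are hypothetical names, and the key column positions are passed in as zero-based indexes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the next non-blank record from a filehandle
# as an arrayref of fields, or undef once the file is exhausted.
sub next_record {
    my ($fh) = @_;
    while (defined(my $line = <$fh>)) {
        my @f = split ' ', $line;
        return \@f if @f;
    }
    return;
}

# Merge two key-sorted inputs. $k1 and $k2 are the zero-based positions
# of the join column in each file. Unmatched records are written as-is.
sub merge_join {
    my ($fh1, $k1, $fh2, $k2, $out) = @_;
    my ($r1, $r2);                          # one-record buffer per file

    while (1) {
        $r1 //= next_record($fh1);          # refill only an empty buffer
        $r2 //= next_record($fh2);
        last unless $r1 || $r2;             # both files fully read

        # An exhausted file always "loses" so the other one drains.
        my $cmp = !$r2 ? -1
                : !$r1 ?  1
                :         $r1->[$k1] cmp $r2->[$k2];

        if ($cmp < 0) {                     # index1 < index2
            print {$out} join(' ', @$r1), "\n";
            undef $r1;
        }
        elsif ($cmp > 0) {                  # index1 > index2
            print {$out} join(' ', @$r2), "\n";
            undef $r2;
        }
        else {                              # index1 == index2: merge
            my @rest = @{$r2}[ grep { $_ != $k2 } 0 .. $#$r2 ];
            print {$out} join(' ', @$r1, @rest), "\n";
            undef $r1;
            undef $r2;
        }
    }
    return;
}

# Usage: perl merge.pl file1.sorted file2.sorted > merged.txt
if (@ARGV == 2) {
    open my $fh1, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    open my $fh2, '<', $ARGV[1] or die "Cannot open $ARGV[1]: $!";
    merge_join($fh1, 1, $fh2, 1, \*STDOUT);  # key is the second column here
}
```

    For the sort step, something like `LC_ALL=C sort -k2,2 file1 > file1.sorted` should produce a byte-wise ordering on the second column that matches `cmp`, though the header line would need to be removed first. If the keys are meant to be compared numerically, both the sort and the comparison operator would have to change together.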

