[nycbug-talk] Re: Simple sed question
Mikel King
mikel.king
Fri Jan 7 15:42:12 EST 2005
>>506 AllianceBer Intl PremGr B AIPBX 8.29 -2.13 -1.54
>>507 AllianceBer Intl PremGr C AIPCX 8.29 -2.24 -1.54
>>508 AllianceBer Intl PremGrAd AIPYX 8.87 -1.88 -0.67
>>509 AllianceBer Intl Val A ABIAX 14.91 5.59 9.79
>>510 AllianceBer Intl Val A ABIAX 14.91 5.59 9.79
>>511 AllianceBer Intl Val A ABIAX 14.91 5.59 9.79
>>512 AllianceBer Intl Val A ABIAX 14.91 5.59 9.79
>>
>>Basically, what I am trying to do is keep only the text from the
>>description onward.
>>506 through 509 have a tab after the line number,
>>510 has 2 spaces,
>>511 and 512 have a single space.
>>
>>The data is coming from OCR, and basically I am cleaning it up in sed so that
>>by the time I get it to awk it is in good shape. I have figured out all the
>>other cleanups; this is the only one I have not figured out. :-(
>>
>>
>
>The real question is how you split the data into fields: what delimits
>fields, and what delimits separate subfields within a field. From looking at
>the data above, the first field and the last 4 fields are fixed, and
>everything between them is field 2. If that is correct, then it is easy and
>not necessarily a regex problem, and you can now turn it into a safer
>intermediate form (CSV, for example).
>
>untested code:
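>awk '{
>    desc = $2                      # start the description with field 2
>    for (i = 3; i <= NF - 4; i++)  # append fields 3 through NF-4
>        desc = desc " " $i
>    # line number, description, ticker, and the three numeric columns
>    printf "%s,%s,%s,%s,%s,%s\n", $1, desc, $(NF-3), $(NF-2), $(NF-1), $NF
>}' file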
>
>marc
cool, good work Marc... I like the concept...
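For the original sed question (keep only from the description onward), something along these lines might be enough. It is only a sketch, and it assumes every line starts with the line number followed by at least one tab or space, as in the 506 through 512 samples. The function name strip_lineno is made up for illustration.

```shell
# Hypothetical helper (name is illustrative): delete the leading line
# number plus whatever run of tabs/spaces follows it, so the line
# starts at the description. Assumes the number is at column 1.
strip_lineno() {
    sed 's/^[0-9][0-9]*[[:space:]][[:space:]]*//' "$@"
}

# Sample lines from the post: one with a tab, one with two spaces.
printf '506\tAllianceBer Intl PremGr B AIPBX 8.29 -2.13 -1.54\n510  AllianceBer Intl Val A ABIAX 14.91 5.59 9.79\n' | strip_lineno
# AllianceBer Intl PremGr B AIPBX 8.29 -2.13 -1.54
# AllianceBer Intl Val A ABIAX 14.91 5.59 9.79
```

Since [[:space:]] matches tabs and spaces alike, the tab, two-space, and single-space cases all go through the same expression. Lines that do not begin with a number are passed through unchanged.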