[nycbug-talk] Text parsing question

maddaemon at gmail.com
Tue Dec 16 10:56:01 EST 2008


On Tue, Dec 16, 2008 at 12:53 AM, Ray Lai <nycbug at cyth.net> wrote:
> On Tue, Dec 16, 2008 at 6:48 PM, maddaemon at gmail.com
> <maddaemon at gmail.com> wrote:
>> List,
>>
>> I'm hoping someone can help me with this...
>>
>> I'm trying to search for a pattern in a text file that contains login
>> info from a syslog and weed out entries that are duplicated with
>> different IP addresses.
>>
>> For example, here are 2 lines:
>>
>> Dec 15 05:15:56 - abc1234 tried logging in from 192.168.8.17
>> Dec 15 05:15:56 - abc1234 tried logging in from 192.168.18.13
>>
>> where 192.168.8.17 is the Windows DC, and the other is the IP of the
>> webmail server.
>>
>> I need to remove the line that contains the DC _ONLY_WHEN_ there is a
>> duplicate entry (same timestamp) with another IP.  The text file
>> contains hundreds of other entries, including single entries where the
>> DC IP is the only one for that timestamp; those have to stay.  Using
>> the above example, I need to remove the first line and keep only the
>> second:
>>
>> Dec 15 05:15:56 - abc1234 tried logging in from 192.168.18.13
>>
>> Does anyone know how to go about doing this?  I was going to try using
>> sed to compare lines with the same timestamp + username but different
>> IPs, but the logic gave me a headache when I tried to wrap my head
>> around it.
>
> Does "sort -unsk1,9" work? You'd have to split the files according to
> month, though.
>
> -Ray-
>

That collapses everything down to one line (with the DC IP, which is
what I'm trying to get rid of).  I think the -n is the problem: the key
starts with the month name, so every line's key compares as the same
number, and -u then throws out all but the first of the 24 lines:

md at madmartigan [~/scripts/report_temp]$ cat badpass.log | wc -l
      24
md at madmartigan [~/scripts/report_temp]$ cat badpass.log | sort -unsk1,9
Dec 16 01:00:57 - def3456 tried logging in from 192.168.8.3
md at madmartigan [~/scripts/report_temp]$
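
To convince myself the partial duplicates are really in there, I figure
I can strip each line down to timestamp + username and let uniq flag the
repeats (assuming whitespace-separated fields exactly like the sample
lines, with the timestamp in fields 1-3 and the username in field 5):

  awk '{ print $1, $2, $3, $5 }' badpass.log | sort | uniq -d

Every line that prints is a timestamp + username that shows up with more
than one IP.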

I'm doing this every day, so the day/month/year will always be a
constant for that particular day.
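
(If the file ever spans more than one day, a filter up front would also
cover the month-splitting issue Ray mentioned; something like

  grep '^Dec 15 ' badpass.log

with the right day filled in, before any dedup step.)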

I guess what I'm trying to do could be described as finding "almost"
or "partial" duplicates..


