[nycbug-talk] Text parsing question

Marc Spitzer mspitzer at gmail.com
Tue Dec 16 13:40:47 EST 2008


On Tue, Dec 16, 2008 at 10:56 AM, maddaemon at gmail.com
<maddaemon at gmail.com> wrote:
> On Tue, Dec 16, 2008 at 12:53 AM, Ray Lai <nycbug at cyth.net> wrote:
>> On Tue, Dec 16, 2008 at 6:48 PM, maddaemon at gmail.com
>> <maddaemon at gmail.com> wrote:
>>> List,
>>>
>>> I'm hoping someone can help me with this...
>>>
>>> I'm trying to search for a pattern in a text file that contains login
>>> info from a syslog and weed out entries that are duplicated with
>>> different IP addresses.
>>>
>>> For example, here are 2 lines:
>>>
>>> Dec 15 05:15:56 - abc1234 tried logging in from 192.168.8.17
>>> Dec 15 05:15:56 - abc1234 tried logging in from 192.168.18.13
>>>
>>> where 192.168.8.17 is the Windows DC, and the other is the IP of the
>>> webmail server.
>>>
>>> I need to remove the line that contains the DC _ONLY_WHEN_ there is a
>>> duplicate entry (same timestamp) with another IP.  The text file
>>> contains hundreds of other entries, and there are single entries where
>>> the DC IP is the only entry.  Using the above examples, I need to
>>> remove the first line and only retrieve the second line:
>>>
>>> Dec 15 05:15:56 - abc1234 tried logging in from 192.168.18.13
>>>
>>> Does anyone know how to go about doing this?  I was going to try using
>>> sed to compare the lines, looking for the same timestamp + username +
>>> IP1/IP2, but it gave me a headache when I tried to wrap my head around
>>> the logic.
>>
>> Does "sort -unsk1,9" work? You'd have to split the files according to
>> month, though.
>>
>> -Ray-
>>
>
> That cuts everything out and leaves 1 line (with the DC IP, which is
> what I'm trying to get rid of):
>
> md at madmartigan [~/scripts/report_temp]$ cat badpass.log | wc -l
>      24
> md at madmartigan [~/scripts/report_temp]$ cat badpass.log | sort -unsk1,9
> Dec 16 01:00:57 - def3456 tried logging in from 192.168.8.3
> md at madmartigan [~/scripts/report_temp]$
>
> I'm doing this every day, so the day/month/year will always be a
> constant for that particular day.
>
> I guess what I'm trying to do could be described as finding "almost"
> or "partial" duplicates..

No code but some advice:

1: Figure out how to specify your lines of interest in terms of a
regular expression, if possible.
2: From that RE, pull out the pieces you need to verify as unique and
pass them to a test function: concatenate the interesting bits and use
the result as a key into a hash, i.e. if the key has not been seen in
the hash, set it and return true; otherwise return false.
3: In your main loop, if a line matches your RE, run the check and
print or skip it accordingly; otherwise just print it.

3 -> 1 -> 2, done.

The hardest part of this is specifying the RE: deciding exactly what you care about.

Probably best not to do this in shell; perl, ruby, tcl, etc. would be
better.  A rough sketch follows.
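
For illustration, a minimal Perl sketch of those three steps.  The
log format is assumed from the lines quoted above, and the DC address
is hard-coded purely as a stand-in for the real one:

#!/usr/bin/perl
# Sketch only: the log format and the DC address below are
# assumptions taken from the quoted examples.
use strict;
use warnings;

my $dc_ip = '192.168.8.17';    # the address to suppress (assumption)

# step 1: the RE -- capture timestamp, username, and source IP
my $re = qr/^(\w{3}\s+\d+\s+[\d:]+) - (\S+) tried logging in from (\S+)/;

my (@lines, %has_other);

# first pass: remember which timestamp+user keys also appear with a
# non-DC address (this is step 2's hash)
while (my $line = <>) {
    push @lines, $line;
    if (my ($stamp, $user, $ip) = $line =~ $re) {
        $has_other{"$stamp|$user"} = 1 if $ip ne $dc_ip;
    }
}

# second pass (step 3): drop a DC line only when the same
# timestamp+user was also seen from another address
for my $line (@lines) {
    if (my ($stamp, $user, $ip) = $line =~ $re) {
        next if $ip eq $dc_ip && $has_other{"$stamp|$user"};
    }
    print $line;
}

Run it against the file from your transcript, e.g. "perl dedup.pl
badpass.log".  Note it makes two passes rather than a single
first-seen-wins hash: with one pass, whichever line showed up first
would win, and the DC line may come first.  Two passes guarantee the
DC line is dropped whenever a matching non-DC line exists anywhere in
the file.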

back to work,

marc

-- 
Freedom is nothing but a chance to be better.
Albert Camus
