[nycbug-talk] Text parsing question

maddaemon at gmail.com maddaemon at gmail.com
Tue Dec 16 16:31:54 EST 2008


On Tue, Dec 16, 2008 at 1:40 PM, Marc Spitzer <mspitzer at gmail.com> wrote:
> On Tue, Dec 16, 2008 at 10:56 AM, maddaemon at gmail.com
> <maddaemon at gmail.com> wrote:
>> On Tue, Dec 16, 2008 at 12:53 AM, Ray Lai <nycbug at cyth.net> wrote:
>>> On Tue, Dec 16, 2008 at 6:48 PM, maddaemon at gmail.com
>>> <maddaemon at gmail.com> wrote:
>>>> List,
>>>>
>>>> I'm hoping someone can help me with this...
>>>>
>>>> I'm trying to search for a pattern in a text file that contains login
>>>> info from a syslog and weed out entries that are duplicated with
>>>> different IP addresses.
>>>>
>>>> For example, here are 2 lines:
>>>>
>>>> Dec 15 05:15:56 - abc1234 tried logging in from 192.168.8.17
>>>> Dec 15 05:15:56 - abc1234 tried logging in from 192.168.18.13
>>>>
>>>> where 192.168.8.17 is the Windows DC, and the other is the IP of the
>>>> webmail server.
>>>>
>>>> I need to remove the line that contains the DC _ONLY_WHEN_ there is a
>>>> duplicate entry (same timestamp) with another IP.  The text file
>>>> contains hundreds of other entries, and there are single entries where
>>>> the DC IP is the only entry.  Using the above examples, I need to
>>>> remove the first line and only retrieve the second line:
>>>>
>>>> Dec 15 05:15:56 - abc1234 tried logging in from 192.168.18.13
>>>>
>>>> Does anyone know how to go about doing this?  I was going to try using
>>>> sed to compare the lines, looking for the same timestamp + username +
>>>> IP1/IP2, but it gave me a headache when I tried to wrap my head around
>>>> the logic.
>>>
>>> Does "sort -unsk1,9" work? You'd have to split the files according to
>>> month, though.
>>>
>>> -Ray-
>>>
>>
>> That cuts everything out and leaves 1 line (with the DC IP, which is
>> what I'm trying to get rid of):
>>
>> md at madmartigan [~/scripts/report_temp]$ cat badpass.log | wc -l
>>      24
>> md at madmartigan [~/scripts/report_temp]$ cat badpass.log | sort -unsk1,9
>> Dec 16 01:00:57 - def3456 tried logging in from 192.168.8.3
>> md at madmartigan [~/scripts/report_temp]$
>>
>> I'm doing this every day, so the day/month/year will always be a
>> constant for that particular day.
>>
>> I guess what I'm trying to do could be described as finding "almost"
>> or "partial" duplicates...
>
> No code but some advice:
>
> 1: figure out how to specify your lines of interest in terms of a
> regular expression, if possible.

What's making this a challenge is that I'm comparing 2 lines with no
fixed reference (i.e. lines 2 and 3, or 4 and 5), because they're
dynamic.  If it were 1 line, or 2 static lines, it wouldn't be an
issue for me.
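
One way around the moving-lines problem is to read the file twice:
the first pass records which timestamp+user keys show up with a non-DC
address, and the second pass drops a DC line only when its key was
seen with another IP.  A rough, untested awk sketch (assuming the DC
is always 192.168.8.17 and the address is always the last field):

    awk -v dc="192.168.8.17" '
        NR == FNR {                 # pass 1: note keys with a non-DC IP
            if ($NF != dc)
                keep[$1 FS $2 FS $3 FS $5] = 1
            next
        }
        # pass 2: print unless this is a DC line whose key was also
        # seen with another address
        !($NF == dc && ($1 FS $2 FS $3 FS $5) in keep)
    ' badpass.log badpass.log

On the two sample lines above that should print only the
192.168.18.13 entry, while a DC line with no twin survives because
its key never shows up in the first pass.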

> 2: from said RE pull out the pieces that you need to verify as uniq
> and pass them to a test function: concat the interesting bits and use
> that as a key into a hash, i.e. if not already a key in the hash, set
> it and return true, else return false

You lost me after uniq ...
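
(After staring at it for a bit, I think what he means, in awk terms --
with an associative array standing in for the hash -- is the classic
dedup one-liner.  Untested:

    awk '!seen[$1 FS $2 FS $3 FS $5]++' badpass.log

The first line carrying a given timestamp+user key gets printed and
recorded; later lines with the same key are skipped.  The catch is
that it keeps whichever duplicate shows up *first*, which could be the
DC line -- hence the two-pass sketch above.)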

> 3: in your main loop, if a line matches your RE then check it and
> print or skip; otherwise just print
>
> 3->1->2 done.
>
> The hardest part of this is specifying the RE: "exactly" what do I care about?

The timestamp, UID, and IP
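
Something like this, maybe -- a pattern assuming every line follows
the format quoted above, with those three pieces captured (sed -E
syntax, untested):

    sed -nE \
        's/^([A-Z][a-z]{2} +[0-9]+ [0-9:]{8}) - ([A-Za-z0-9]+) tried logging in from ([0-9.]+)$/\1 \2 \3/p' \
        badpass.log

which prints just the timestamp, user, and address -- the key plus
the field to test.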

> probably best not to do this in shell; perl, ruby or tcl etc. would be better.

If I knew those, I'd be set :)

> back to work,
>
> marc


