[nycbug-talk] MD5 Presentation
Jonathan Vanasco
nycbug-list at 2xlp.com
Thu Aug 3 14:44:46 EDT 2006
On Aug 3, 2006, at 12:39 PM, Jonathan Vanasco wrote:
> i think its based off of Benson's law of mathematics. it works
> well. i just decided that i wanted to obfuscate serials to the
> public.
To answer a few offlist questions;
Sorry for the confusion folks, its Benford, not Benson
http://www.ezrstats.com/Benford.htm
http://mathworld.wolfram.com/BenfordsLaw.html
Ironically, the IRS started using it recently to flag returns for audit.
It has to do with leading digit distribution-- 1 is more likely to be
the first digit than a 9-- 30% more on average
if you're storing stuff into buckets off a serial though, you feel it
worse, as you encounter geometic expansion under base 10
ie: 1-3, 11-13 , 111-113
1/
2/
3/
9/
1/1
1/2/
1/3/
1/9/
1/1/1
1/1/2
1/1/3
vs
1/1
2/1
3/1
1/1/1
2/1/1
3/1/1
if you 'front load' your buckets, you'll get a highly uneven ( read:
completely worthless ) distribution -- you only get a new bucket when
you're up by a factor of 10
if you back load your buckets though, you'll get a clean distribution
-- every 10 items you get a new bucket
getting back to the hashing algorithm-- on the OP. i should add:
let's say you're storing a serial
while a GREAT advantage of it is that you're getting a better
representation of a number to store , as it has a better distribution
of characters ( 11 vs alkjkas )
the BEST advantage ( i found) is this:
you get to bucket a base 10 id in a 16bit space ( i use md5_hex )
at what point does directory performance degrade? in the mid
thousands.
2 md5hex chars = 256 buckets / dir
3 md5hex chars = 4096
2 md5 base 64 chars = 4096
3 md5 base 32 = 1024
personally, i think 3 md5 base 32 chars would have been the best
choice for my appllications-- but it was a PITA to implement across
languages. perl/python/php all have base16/base64 as default, so i
just used that.
i while ago i tested. freebsd works great with 4096 directories. i
didn't like the performance of ex2fs w/3 levels of 4096 dirs on my
debian box , but 1024 was on par with freebsd.
you can't realistically implemt it using digits, even taking benford
into account, because you run into the issue where you start having
images in the same dir as buckets:
123.gif
1123.gif
3/2/1/123.gif
3/2/1/1/123.gif
the only way to get past that is to zero-pad your images to be a
certain length. but then you get
0/0/0/0/3/2/1/123.gif
and you'll have a ton of zeros until your site grows. until you hit
the upper limit that you defined with 0s- which means your system is
finitely scalable, not infinitely.
the point of this story being- if you're storing massive things, do
as steve said and map to md5. you can even script most webservers to
translate filenames to an internal mapping, so people can pull the
image based on a serial or some text name, and everything is seamless.
// Jonathan Vanasco
| - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - -
| FindMeOn.com - The cure for Multiple Web Personality Disorder
| Web Identity Management and 3D Social Networking
| - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - -
| RoadSound.com - Tools For Bands, Stuff For Fans
| Collaborative Online Management And Syndication Tools
| - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - -
More information about the talk
mailing list