[nycbug-talk] MD5 Presentation

Jonathan Vanasco nycbug-list at 2xlp.com
Thu Aug 3 14:44:46 EDT 2006


On Aug 3, 2006, at 12:39 PM, Jonathan Vanasco wrote:

> i think its based off of Benson's law of mathematics.  it works
> well.  i just  decided that i wanted to obfuscate serials to the  
> public.

To answer a few offlist questions;

Sorry for the confusion folks, its Benford, not Benson

	http://www.ezrstats.com/Benford.htm
	http://mathworld.wolfram.com/BenfordsLaw.html

Ironically, the IRS started using it recently to flag returns for audit.

It has to do with leading digit distribution-- 1 is more likely to be  
the first digit than a 9-- 30% more on average

if you're storing stuff into buckets off a serial though, you feel it  
worse, as you encounter geometic expansion under base 10

ie: 1-3, 11-13 , 111-113

	1/
	2/
	3/
	9/
	1/1
	1/2/
	1/3/
	1/9/
	1/1/1
	1/1/2
	1/1/3

	vs

	1/1
	2/1
	3/1
	1/1/1
	2/1/1
	3/1/1

		


if you 'front load' your buckets, you'll get a highly uneven ( read:  
completely worthless ) distribution -- you only get a new bucket when  
you're up by a factor of 10

if you back load your buckets though, you'll get a clean distribution  
-- every 10 items you get a new bucket

getting back to the hashing algorithm-- on the OP.  i should add:

let's say you're storing a serial

	while a GREAT advantage of it is that you're getting a better  
representation of a number to store , as it has a better distribution  
of characters ( 11 vs alkjkas )
	the BEST advantage ( i found) is this:
		you get to bucket a base 10 id in a 16bit space ( i use md5_hex )
		at what point does directory performance degrade?  in the mid  
thousands.
			2 md5hex chars = 256 buckets / dir
			3 md5hex chars = 4096
			2 md5 base 64 chars = 4096

			3 md5 base 32 = 1024

	personally, i think 3 md5 base 32 chars would have been the best  
choice for my appllications-- but it was a PITA to implement across  
languages.  perl/python/php all have base16/base64 as default, so i  
just used that.

	i while ago i tested.  freebsd works great with 4096 directories.  i  
didn't like the performance of ex2fs w/3 levels of 4096 dirs on my  
debian box , but 1024 was on par with freebsd.

you can't realistically implemt it using digits, even taking benford  
into account, because you run into the issue where you start having  
images in the same dir as buckets:

	123.gif
	1123.gif

		3/2/1/123.gif
		3/2/1/1/123.gif

the only way to get past that is to zero-pad your images to be a  
certain length.  but then you get
	0/0/0/0/3/2/1/123.gif

and you'll have a ton of zeros until your site grows.  until you hit  
the upper limit that you defined with 0s- which means your system is  
finitely scalable, not infinitely.

the point of this story being- if you're storing massive things, do  
as steve said and map to md5.  you can even script most webservers to  
translate filenames to an internal mapping, so people can pull the  
image based on a serial or some text name, and everything is seamless.


// Jonathan Vanasco

| - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  
- - - - - - - - - - - - - - - -
| FindMeOn.com - The cure for Multiple Web Personality Disorder
| Web Identity Management and 3D Social Networking
| - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  
- - - - - - - - - - - - - - - -
| RoadSound.com - Tools For Bands, Stuff For Fans
| Collaborative Online Management And Syndication Tools
| - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  
- - - - - - - - - - - - - - - -





More information about the talk mailing list