block numbers need a better random number generator

Key: NUTCH122
URL:
http://issues.apache.org/jira/browse/NUTCH122 Project: Nutch
Type: Bug
Components: fetcher, indexer, searcher
Versions: 0.8dev
Reporter: Paul Baclace
In order to support billions of block numbers, a better PRNG than java.util.Random is needed. To reach billions with low probability of collision, 64 bit random numbers are needed (the Birthday Problem is the model for the number of bits needed; the result is that twice as many bits are needed as the number of bits to count the expected number of items.) The builtin java.util.Random keeps only 48 bits of state which is only sufficient for 2^24 items. Using repeated calls to or more than one instance of Random does not increase its total entropy.
Analysis
util.Random is a linear congruential generator (LCG) identical to drand48.
util.Random keeps 48 bits of state and gangs together 2 consecutive values to return 64 bit values.
LCGs suffer from periodicity in the low order bits which would make modular binning less than random.
"low order bits" could mean least significant byte.
LCGs have periods in the range 106 to 109 when using 32 bit words, a range of poor to fair.
seed = ( 0x5DEECE66DL * seed + 0xBL ) & ((1L << 48)  1);
the origin of 0x5DEECE66D, a nonprime, is shrouded in the mists of time.
Results of the Birthday Spacings Test look good.
References
http://www.math.utah.edu/~beebe/java/random/README http://www.pierssen.com/arcview/upload/esoterica/randomizer.htmlRecommended alternative: MersenneTwister
Matsumoto and Nishimura (1998).
Longest period of any known generator 2^19937 or about 10^6001.
A period that exceeds the number of unique values seems ideal; obviously a shorter period than the number of unique values (like util.Random) is a problem).
Faster than java.util.Random (Random was recent tweaked, however).
Excellent result for Diehard Birthday Spacings Test.
Can be seeded with up to 624 32 bit integers.
Doug Cutting wrote on nutchdev:
> It just occurred to me that perhaps we could simply use sequential block numbering.
> All block ids are generated centrally on the namenode.
Response from Paul Baclace:
I'm not sure what the advantage of sequential block numbers would be
since long period PRNG block numbering does not even need to store
it's state, just pick a new starting place.
Sequential block numbering does have the downside that picking a datanode based on (BlockNum % DataNodeCount) would devolve into round robin. Any attempt to pass the sequence through a hash ends up becoming a random number generator.
Sequential numbering provides contiguous numbers, but after G.C. that would be lost, so no advantage there.
When human beings eyeball block numbers, many with small differences are more likely to be misread than many that are totally different.
If block numbering is sequential, then there is a temptation to use 32 bits instead of 64, but 32 bits leads to wraparound and uh oh.
FSNamesystem uses Random to help pick a target datanode, but it could just use the randomness of block numbers.

This message is automatically generated by JIRA.

If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see:
http://www.atlassian.com/software/jira