DmozParser questions

Andy Markham
Following the instructions in the Nutch tutorial, I downloaded the DMOZ file
content.rdf.u8, which is roughly 2GB and has 36.9M entries.  I then ran the
command to grab a subset of those URLs:

bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls

According to the tutorial, the -subset 5000 option tells it to grab one out
of every 5000 URLs.  As such, I was expecting to see somewhere in the
neighborhood of 7000 (36.9M/5000=7380) URLs in the dmoz/urls file.  However,
the file contained only 906 URLs.

For grins, I thought I'd try another run at it using -subset 7500.  The
result was 587 URLs.

FYI, there is nothing relevant in the log files.

When comparing the numbers above it's almost like DmozParser is taking the
subset value, multiplying it by a number slightly larger than 8 and then
grabbing one out of every <subset>*8.2 lines.

36.9M/(5000*8.2) = 900
36.9M/(7500*8.2) = 600
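
To make the comparison explicit, here is a minimal sketch that back-computes the factor implied by each run. It only does arithmetic on the figures reported in this post (36.9M entries, 906 and 587 URLs) and says nothing about how DmozParser actually selects URLs; the function name `implied_factor` is made up for illustration.

```python
# Figures as reported in this post; none of them are independently verified.
TOTAL_ENTRIES = 36_900_000

# -subset value -> number of URLs that ended up in dmoz/urls
observed = {5000: 906, 7500: 587}


def implied_factor(subset, urls_written, total=TOTAL_ENTRIES):
    """Ratio of the naive expectation (total / subset) to the observed count."""
    expected = total / subset
    return expected / urls_written


for subset, urls in observed.items():
    expected = TOTAL_ENTRIES / subset
    factor = implied_factor(subset, urls)
    print(f"-subset {subset}: expected {expected:.0f}, got {urls}, factor {factor:.2f}")
```

The two factors are close to each other but not identical, so the "8.2" is only a rough average over two data points.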

Any hints?

I sure hope I'm not being obtuse, but I've looked around a bit for more
info, to no avail.  Also, I realize this seems like picking nits, because if
I'm looking for 7000 URLs, I can simply adjust the subset number using my
math above, but I'd just like to make sure I understand things...
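
As a rough sketch of that workaround, the empirical factor of about 8.2 can be inverted to pick a -subset value for a target URL count. Both the factor and the 36.9M total are this post's own estimates, and `subset_for_target` is a hypothetical helper, so treat the result as a starting guess rather than a guarantee.

```python
def subset_for_target(target_urls, total=36_900_000, factor=8.2):
    """Estimate the -subset value that would yield roughly target_urls URLs.

    Assumes the observed relation: urls ~= total / (subset * factor),
    where factor is the empirical multiplier guessed in this post.
    """
    return round(total / (target_urls * factor))


# For about 7000 URLs, this suggests a -subset value in the mid 600s.
print(subset_for_target(7000))
```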

Also, as many know, the NutchTutorial page on the Nutch Wiki is not
up to date with 0.8.  Is there any chance that if I rewrite it and send the
diffs to someone, they'll actually get applied to the Wiki?  I'm also more
than willing to change the page directly (it IS a wiki), but can't seem to
figure out how!  Again, could be the obtuse thing...

Best,
Andy