Should the data size in version 0.8 be much smaller than in version 0.7?

Rafi Iz
Hi,

I am running a few fetch cycles on Nutch 0.8, and I notice that the data
size is much smaller than the data size I got with version 0.7 (running the
same cycles at about the same time from different machines): about 5G after the
third cycle, starting with about 72000 URLs.
All the processes ended successfully and everything seems to be fine, but I am
afraid that I'm missing something.


Each cycle includes:
fetch segments/..
updatedb crawldb segments/..
generate crawldb segments
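
To make the order of the steps explicit, here is a minimal sketch of one cycle as a shell function. It assumes the commands are run through the stock bin/nutch launcher from the Nutch directory; the NUTCH and RUN variables, the run and cycle helper names, and the segment name in the usage line are all hypothetical and only there for illustration. By default it prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch of one crawl cycle as listed above.
# Assumption: NUTCH points at the bin/nutch launcher script.
NUTCH="${NUTCH:-bin/nutch}"

# Print the command by default; execute it only when RUN=1 is set.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$@"
  fi
}

# $1 is the segment directory created by the previous generate step,
# e.g. segments/<segment-name> (hypothetical placeholder).
cycle() {
  segment="$1"
  run "$NUTCH" fetch "$segment"
  run "$NUTCH" updatedb crawldb "$segment"
  run "$NUTCH" generate crawldb segments
}
```

Calling `cycle segments/20060101000000` (a made-up segment name) prints the three commands in order; setting RUN=1 would run them instead.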

The configuration in nutch-site.xml is:
<property>
  <name>fs.default.name</name>
  <value>machine1:50000</value>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>machine1:50020</value>
</property>

<property>
  <name>ndfs.name.dir</name>
  <value>/home/nutch_svn/nutch/trunk/ndfs/name</value>
</property>

<property>
  <name>ndfs.data.dir</name>
  <value>/home/nutch_svn/nutch/trunk/ndfs/data</value>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>/home/nutch_svn/nutch/trunk/mapred/local</value>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>/home/nutch_svn/nutch/trunk/mapred/system</value>
</property>

<property>
  <name>mapred.temp.dir</name>
  <value>/home/nutch_svn/nutch/trunk/mapred/temp</value>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>12</value>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>6</value>
</property>

<property>
  <name>generate.max.per.host</name>
  <value>-1</value>
</property>


Thanks,
-Rafi
