Problem with CRC files on NDFS

Problem with CRC files on NDFS

Andrzej Białecki-2
Hi,

I have a problem with the recently added CRC files when "put"-ting
stuff to NDFS. NDFS complains that these files already exist - I suspect
that it creates them on the fly just before they are actually
transmitted from the NDFSClient - and aborts the transfer. I was able to
make the -put operation succeed only by first deleting all .*.crc files.
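
Roughly, the only sequence that worked was (a sketch with placeholder
paths, assuming the stale checksum files sit next to the local files
being uploaded):

rm -f ./mydir/.*.crc               # delete the stale local checksum files first
bin/nutch ndfs -put ./mydir mydir  # then the put goes through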

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



jobdetails.jsp and jobtracker.jsp

Anton Potekhin
How do I use jobtracker.jsp and jobdetails.jsp?
Do they need Tomcat?

When I try to load jobdetails.jsp under Tomcat, it returns this error:
java.lang.NullPointerException
        at org.apache.jsp.m.jobdetails_jsp._jspService(org.apache.jsp.m.jobdetails_jsp:53)
        at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:97)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:322)
        at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:291)
        at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:241)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:856)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:744)
        at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
        at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
        at java.lang.Thread.run(Thread.java:595)



Re: jobdetails.jsp and jobtracker.jsp

Andrzej Białecki-2
[hidden email] wrote:

>How do I use jobtracker.jsp and jobdetails.jsp?
>Do they need Tomcat?

No, but jobdetails.jsp requires a parameter (job_id) - start with
jobtracker.jsp, and then follow the links.
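
In other words (placeholders here, since the host, port, and exact job_id
format depend on your setup):

http://<jobtracker_host>:<info_port>/jobtracker.jsp             <- start here
http://<jobtracker_host>:<info_port>/jobdetails.jsp?job_id=...  <- reached via the job links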

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: jobdetails.jsp and jobtracker.jsp

Anton Potekhin
So they don't need Tomcat? But then, what should we type into the browser's
address bar?

http://<host_jobtracker>:<port_jobtracker>/jobtracker/jobtracker.jsp ?







mapred.map.tasks

Anton Potekhin
Why do we need the parameter mapred.map.tasks to be greater than the number
of available hosts? If we set it equal to the number of hosts, we get the
"negative progress percentages" problem.



Re: jobdetails.jsp and jobtracker.jsp

Andrzej Białecki-2
In reply to this post by Anton Potekhin
[hidden email] wrote:

>Why do we need the parameter mapred.map.tasks to be greater than the number
>of available hosts? If we set it equal to the number of hosts, we get the
>"negative progress percentages" problem.

Because the whole point of MapReduce tasktrackers is that they are able
to run more than one task simultaneously on a single host. For example,
with 2 hosts and mapred.tasktracker.tasks.maximum set to 2, up to 4 tasks
can run at once, so mapred.map.tasks needs to exceed the host count to
keep every slot busy.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: jobdetails.jsp and jobtracker.jsp

Andrzej Białecki-2
In reply to this post by Anton Potekhin
[hidden email] wrote:

>So they don't need Tomcat? But then, what should we type into the browser's
>address bar?

No, they don't - Jobtracker runs an embedded Jetty.

>http://<host_jobtracker>:<port_jobtracker>/jobtracker/jobtracker.jsp ?

You need to use the hostname of the machine that runs the JobTracker, and
whatever port you set for mapred.job.tracker.info.port in your config files.
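
For example - a sketch only, since the port value is whatever you chose
(50030 below is just an illustration) and the exact page path may differ
by version:

<property>
  <name>mapred.job.tracker.info.port</name>
  <value>50030</value>
  <description>The port on which the JobTracker serves its status
  pages.</description>
</property>

and then point a browser at http://<jobtracker_host>:50030/jobtracker.jsp.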

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Problem with CRC files on NDFS

Doug Cutting-2
In reply to this post by Andrzej Białecki-2
Andrzej Bialecki wrote:
> I have a problem with the recently added CRC files when "put"-ting
> stuff to NDFS. NDFS complains that these files already exist - I suspect
> that it creates them on the fly just before they are actually
> transmitted from the NDFSClient - and aborts the transfer. I was able to
> make the -put operation succeed only by first deleting all .*.crc files.

I have not seen this.  Can you tell me more about how to cause this
problem, perhaps by providing the transcript of a session?  Are you
overwriting existing files?

A crc file is created just after a file is opened for output.  It
overwrites any existing crc file.  See NFSDataOutputStream.java line 44.
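
The idea is roughly the following minimal sketch (assumed names and block
size - not the actual Nutch source):

import java.io.*;
import java.util.zip.CRC32;

// Sketch: opening a file for output also creates/overwrites its
// companion ".<name>.crc" file, and a CRC32 checksum is appended to
// that file for every block written to the data file.
public class ChecksummedOutput extends OutputStream {
  private static final int BYTES_PER_SUM = 512;  // assumed block size
  private final OutputStream data;
  private final DataOutputStream sums;
  private final CRC32 crc = new CRC32();
  private int inBlock = 0;

  public ChecksummedOutput(File f) throws IOException {
    data = new FileOutputStream(f);
    // the crc file is created - and any old one overwritten - at open time
    sums = new DataOutputStream(new FileOutputStream(
        new File(f.getParentFile(), "." + f.getName() + ".crc")));
  }

  public void write(int b) throws IOException {
    data.write(b);
    crc.update(b);
    if (++inBlock == BYTES_PER_SUM) flushSum();
  }

  private void flushSum() throws IOException {
    sums.writeInt((int) crc.getValue());  // one checksum per block
    crc.reset();
    inBlock = 0;
  }

  public void close() throws IOException {
    if (inBlock > 0) flushSum();          // checksum for the final partial block
    data.close();
    sums.close();
  }
}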

There are a few cases where things will complain about non-existent .crc
files.  This happens, e.g., when putting a file that was not created by
Nutch tools.

It also notably happens with Lucene indexes, since these are created by
FSDirectory rather than NDFSDirectory: NDFS does not permit overwrites,
and Lucene overwrites in one place (TermInfosWriter.java line 141).  If
we modify Lucene to write the term count at EOF-8, then Lucene indexes
can be written directly through a NutchFileSystem API and will be
correctly checksummed at creation.  Is this change to Lucene justified?
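
The change amounts to something like this toy illustration (assumed file
layout, not Lucene's actual format):

import java.io.*;

public class TrailerCount {
  public static void main(String[] args) throws IOException {
    File f = new File("terms.dat");

    // Append-only writes: the count becomes an 8-byte trailer instead of
    // a header that has to be seeked back to and overwritten.
    DataOutputStream out = new DataOutputStream(new FileOutputStream(f));
    long count = 0;
    for (String term : new String[] {"aardvark", "baobab", "cassowary"}) {
      out.writeUTF(term);
      count++;
    }
    out.writeLong(count);      // written once, at the end - no overwrite needed
    out.close();

    // Readers find the count at EOF-8.
    RandomAccessFile in = new RandomAccessFile(f, "r");
    in.seek(in.length() - 8);
    System.out.println("term count: " + in.readLong());
    in.close();
  }
}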

Doug

Re: mapred.map.tasks

Doug Cutting-2
In reply to this post by Anton Potekhin
[hidden email] wrote:
> Why do we need the parameter mapred.map.tasks to be greater than the number
> of available hosts? If we set it equal to the number of hosts, we get the
> "negative progress percentages" problem.

Can you please post a simple example that demonstrates the "negative
progress" problem?  E.g., the minimal changes to your conf/ directory
required to illustrate this, how you start your daemons, etc.

Thanks,

Doug

RE: mapred.map.tasks

Anton Potekhin
I tried to launch mapred on 2 machines: 192.168.0.250 and 192.168.0.111.

In nutch-site.xml I specified the following parameters:

1) On both machines:
<property>
  <name>fs.default.name</name>
  <value>192.168.0.250:9009</value>
  <description>The name of the default file system.  Either the
  literal string "local" or a host:port for NDFS.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>192.168.0.250:9010</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
  <description>The default number of map tasks per job.  Typically set
  to a prime several times greater than number of available hosts.
  Ignored when mapred.job.tracker is "local".  
  </description>
</property>

<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>2</value>
  <description>The maximum number of tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
  <description>The default number of reduce tasks per job.  Typically set
  to a prime close to the number of available hosts.  Ignored when
  mapred.job.tracker is "local".
  </description>
</property>
 



On 192.168.0.250 I started:
2)       bin/nutch-daemon.sh start datanode
3)       bin/nutch-daemon.sh start namenode
4)       bin/nutch-daemon.sh start jobtracker
5)       bin/nutch-daemon.sh start tasktracker

I created a directory seeds with a file urls in it; urls contained 2 links.
Then I added that directory to NDFS (bin/nutch ndfs -put ./seeds seeds).
The directory was added successfully.

 

Then I launched the command:
bin/nutch crawl seeds -depth 2

As a result I received this log, written by the jobtracker:
....
051123 053118 Adding task 'task_m_z66npx' to set for tracker 'tracker_53845'
051123 053118 Adding task 'task_m_xaynqo' to set for tracker 'tracker_11518'
051123 053130 Task 'task_m_z66npx' has finished successfully.
 

Log written by tasktracker on 192.168.0.111:
......
051110 142607 task_m_z66npx 0.0% /user/root/seeds/urls:0+31
051110 142607 task_m_z66npx 1.0% /user/root/seeds/urls:0+31
051110 142607 Task task_m_z66npx is done.
 

Log written by tasktracker on 192.168.0.250:
....
051123 053125 task_m_xaynqo 0.12903225% /user/root/seeds/urls:31+31
051123 053126 task_m_xaynqo -683.9677% /user/root/seeds/urls:31+31
051123 053127 task_m_xaynqo -2129.9678% /user/root/seeds/urls:31+31
051123 053128 task_m_xaynqo -3483.0322% /user/root/seeds/urls:31+31
051123 053129 task_m_xaynqo -4976.2256% /user/root/seeds/urls:31+31
051123 053130 task_m_xaynqo -6449.1934% /user/root/seeds/urls:31+31
051123 053131 task_m_xaynqo -7898.258% /user/root/seeds/urls:31+31
051123 053132 task_m_xaynqo -9232.193% /user/root/seeds/urls:31+31
051123 053133 task_m_xaynqo -10694.3545% /user/root/seeds/urls:31+31
051123 053134 task_m_xaynqo -12139.226% /user/root/seeds/urls:31+31
051123 053135 task_m_xaynqo -13416.677% /user/root/seeds/urls:31+31
051123 053136 task_m_xaynqo -14885.741% /user/root/seeds/urls:31+31
... and so on - the log kept filling with records showing ever more
negative percentages.

 

I concluded that there was an attempt to split the inject step across the
2 machines, i.e. there were 2 tasks: 'task_m_z66npx' and 'task_m_xaynqo'.
'task_m_z66npx' finished successfully, while 'task_m_xaynqo' caused the
negative-progress problem.

But if I change the parameter mapred.reduce.tasks to 4, all tasks finish
successfully and everything works right.


