Use CrawlDb as a metadata Db?


Use CrawlDb as a metadata Db?

HUYLEBROECK Jeremy RD-ILAB-SSF-2

If I am not wrong, the segments generated by the Generator are some sort
of CrawlDatum collection.
I am putting metadata in the CrawlDb (I keep information that never
changes), and I think the Generator copies it into the segments.

But now I want to access that metadata at the parsing or indexing step
to put some of it in the extracted ParseData (or directly in the index).

I can't find a way to reassociate the Content and the Parse objects with
their respective CrawlDb entries/segments.

Basically, I am trying to use the CrawlDb as a database of metadata for
every URL, and I want to use that metadata at the indexing step to
enrich the ParseData so that I can search against it later on.

Stupid example: I know a URL is associated with the color "blue", but
the page pointed to by this URL doesn't contain that information. "Blue"
would be kept in the CrawlDb metadata, then the generate/fetch/parse
steps would run as usual, but at indexing time "blue" should be
reattached to the ParseData that was extracted from the page.

Is it feasible without changing anything in Nutch? (I use Nutch more or
less as a library and avoid changing things in it; I prefer redoing my
own injector/generator/fetcher/parser and formats etc. if needed.)

I am going through all the different classes in Nutch/Hadoop now to
understand where things are, whether they are read, and what kind of
objects they are put into.
Any pointer to shorten my reading is very welcome ;)

Thanks!


Re: Use CrawlDb as a metadata Db?

Enis Soztutar
HUYLEBROECK Jeremy RD-ILAB-SSF wrote:

> [quoted message trimmed; see the original post above]
hi,

The CrawlDatum keeps crawl status information about every URL that is
fetched. The class has a metadata field which is an instance of
MapWritable and behaves like a HashMap, so I have used the metadata
field for similar purposes. For example, in the fetcher you can set a
property like:

datum.getMetaData().put(<key>, <value>);

and then in the indexing plugin you can retrieve it with:

datum.getMetaData().get(<key>);
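
For reference, here is a minimal sketch of an indexing filter that reads
such a metadata entry and adds it to the Lucene document. The exact
interface signature and the Writable key type (UTF8 vs. Text) vary
across Nutch/Hadoop versions; this assumes the 0.8-era API, and the
"color" field name is just an example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.UTF8;
    import org.apache.hadoop.io.Writable;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.parse.Parse;

    public class MetadataIndexingFilter implements IndexingFilter {

      private Configuration conf;

      public Document filter(Document doc, Parse parse, UTF8 url,
                             CrawlDatum datum, Inlinks inlinks)
          throws IndexingException {
        // Look up the value stored in the CrawlDatum metadata earlier
        // (e.g. by the injector or the fetcher).
        Writable value = datum.getMetaData().get(new UTF8("color"));
        if (value != null) {
          // Store it untokenized so it can be searched as an exact term.
          doc.add(new Field("color", value.toString(),
                            Field.Store.YES, Field.Index.UN_TOKENIZED));
        }
        return doc;
      }

      public void setConf(Configuration conf) { this.conf = conf; }

      public Configuration getConf() { return conf; }
    }

Such a filter would then be enabled like any other indexing plugin via
plugin.includes.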






Fetch error

Anton Potekhin
I updated Hadoop, but now I get the following error at the fetch step
(in the reduce phase):

06/08/29 08:31:20 INFO mapred.TaskTracker: task_0003_r_000000_3 0.33333334% reduce > copy (6 of 6 at 11.77 MB/s)
06/08/29 08:31:20 WARN /: /getMapOutput.jsp?map=task_0003_m_000002_0&reduce=1: java.lang.IllegalStateException
        at org.mortbay.jetty.servlet.ServletHttpResponse.getWriter(ServletHttpResponse.java:561)
        at org.apache.jasper.runtime.JspWriterImpl.initOut(JspWriterImpl.java:122)
        at org.apache.jasper.runtime.JspWriterImpl.flushBuffer(JspWriterImpl.java:115)
        at org.apache.jasper.runtime.PageContextImpl.release(PageContextImpl.java:190)
        at org.apache.jasper.runtime.JspFactoryImpl.internalReleasePageContext(JspFactoryImpl.java:115)
        at org.apache.jasper.runtime.JspFactoryImpl.releasePageContext(JspFactoryImpl.java:75)
        at org.apache.jsp.getMapOutput_jsp._jspService(getMapOutput_jsp.java:100)
        at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324)
        at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
        at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
        at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
        at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
        at org.mortbay.http.HttpServer.service(HttpServer.java:954)
        at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
        at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
        at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
        at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
        at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
        at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)


How can I fix this? The generate step works fine, but at the fetch
reduce step I get this error and the task fails.




RE: Fetch error

Anton Potekhin
The previous error came from the tasktracker log. In the jobtracker log
I now see the following error:

06/08/30 01:04:07 INFO mapred.TaskInProgress: Error from task_0001_r_000000_1: java.lang.AbstractMethodError: org.apache.nutch.fetcher.FetcherOutputFormat.getRecordWriter(Lorg/apache/hadoop/fs/FileSystem;Lorg/apache/hadoop/mapred/JobConf;Ljava/lang/String;Lorg/apache/hadoop/util/Progressable;)Lorg/apache/hadoop/mapred/RecordWriter;
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:297)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1075)



-----Original Message-----
From: [hidden email] [mailto:[hidden email]]
Sent: Wednesday, August 30, 2006 12:17 PM
To: [hidden email]
Subject: Fetch error
Importance: High

[original message and stack trace trimmed; see the previous post]






RE: Use CrawlDb as a metadata Db?

HUYLEBROECK Jeremy RD-ILAB-SSF-2
In reply to this post by HUYLEBROECK Jeremy RD-ILAB-SSF-2

I think at the parser plugin level you can't get back to the original
CrawlDatum; the parsers only get the Content.
What I did is put the data from the CrawlDb into the Content metadata at
fetch time. The parser then gets this metadata and can put it into the
Parse object as needed.

If you do fetching and parsing in a single shot, the Fetcher class could
put the info from the CrawlDatum directly into the Parse.
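
Here is a minimal sketch of those two hops. Class and method names
follow the Nutch 0.8-era APIs as best I recall (the metadata classes
changed between versions), and the "color" key is just an example:

    import org.apache.hadoop.io.UTF8;
    import org.apache.hadoop.io.Writable;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.parse.ParseData;
    import org.apache.nutch.protocol.Content;

    public class MetadataRelay {

      // 1) At fetch time: copy a value from the CrawlDatum onto the
      //    Content, since parsers only ever see the Content.
      public static void datumToContent(CrawlDatum datum, Content content) {
        Writable color = datum.getMetaData().get(new UTF8("color"));
        if (color != null) {
          content.getMetadata().setProperty("color", color.toString());
        }
      }

      // 2) At parse time: read it back from the Content and stash it in
      //    the ParseData metadata so it survives through to indexing.
      public static void contentToParseData(Content content,
                                            ParseData parseData) {
        String color = content.getMetadata().getProperty("color");
        if (color != null) {
          parseData.getMetadata().setProperty("color", color);
        }
      }
    }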


-----Original Message-----
From: Enis Soztutar [mailto:[hidden email]]
Sent: Wednesday, August 30, 2006 1:07 AM
To: [hidden email]
Subject: Re: Use CrawlDb as a metadata Db?

[quoted exchange trimmed; see the messages above]