API for injecting content into Nutch?

Goldschmidt, Dave
Hello,

Is there an API of some sort for injecting content into Nutch *without*
using Nutch's crawler?  Or does anyone have ideas as to how to approach
this problem?  I.e. given a URL, a page of content, metadata about the
page, links, etc., how can I inject this into Nutch without Nutch
performing the crawl?

Thanks in advance for your ideas and insights,

DaveG

Re: API for injecting content into Nutch?

kangas
Dave, you don't want to "inject" anything per se, at least according
to Nutch terminology. Instead, you'll want to create your own synthetic
crawler. Nutch's crawler outputs one "segment file" (a directory of
files, actually) per crawler pass. It is this segment that is
processed by the "nutch index" stage.

So, create a program that iterates through your content and writes it
to a segment file, simulating the crawler's output. Just read the
source for Fetcher.java to see how it uses
org.apache.nutch.segment.SegmentWriter and mimic that. Then follow
the rest of the tutorial as if your segment files had fallen out of
the real crawler.
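
A minimal sketch of the shape such a program might take, under loud assumptions: Page, SegmentSink, InMemorySink and writeAll are hypothetical names made up for illustration and are not part of Nutch. SegmentSink only stands in for whatever Fetcher.java does with SegmentWriter; the real method names and arguments have to be taken from the Nutch source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class SyntheticCrawler {

    /** Hypothetical holder for one page you already have; not a Nutch class. */
    static final class Page {
        final String url;
        final String content;
        final Map<String, String> metadata;
        final List<String> outlinks;

        Page(String url, String content,
             Map<String, String> metadata, List<String> outlinks) {
            this.url = url;
            this.content = content;
            this.metadata = metadata;
            this.outlinks = outlinks;
        }
    }

    /**
     * Hypothetical stand-in for the segment-writing API. A real
     * implementation would wrap the calls Fetcher.java makes on
     * SegmentWriter; those calls are not reproduced here.
     */
    interface SegmentSink {
        void append(Page page);
        void close();
    }

    /** In-memory sink so the sketch runs without Nutch on the classpath. */
    static final class InMemorySink implements SegmentSink {
        final List<Page> written = new ArrayList<>();
        boolean closed = false;

        public void append(Page page) {
            if (closed) {
                throw new IllegalStateException("sink already closed");
            }
            written.add(page);
        }

        public void close() {
            closed = true;
        }
    }

    /** Iterates over the content and writes each page, as a crawler pass would. */
    static int writeAll(Iterable<Page> pages, SegmentSink sink) {
        int count = 0;
        try {
            for (Page page : pages) {
                sink.append(page);
                count++;
            }
        } finally {
            sink.close(); // always close, even if a page fails to write
        }
        return count;
    }

    public static void main(String[] args) {
        List<Page> pages = Arrays.asList(
            new Page("http://example.com/a", "<html>A</html>",
                     Collections.singletonMap("title", "A"),
                     Arrays.asList("http://example.com/b")),
            new Page("http://example.com/b", "<html>B</html>",
                     Collections.singletonMap("title", "B"),
                     Collections.<String>emptyList()));
        InMemorySink sink = new InMemorySink();
        System.out.println("pages written: " + writeAll(pages, sink));
    }
}
```

Keeping the sink behind an interface means the iteration logic can be tested on its own, with the SegmentWriter-backed implementation swapped in once its actual usage has been read out of Fetcher.java.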

--Matt

On Sep 26, 2005, at 2:32 PM, Goldschmidt, Dave wrote:

> Hello,
>
> Is there an API of some sort for injecting content into Nutch *without*
> using Nutch's crawler?  Or does anyone have ideas as to how to approach
> this problem?  I.e. given a URL, a page of content, metadata about the
> page, links, etc., how can I inject this into Nutch without Nutch
> performing the crawl?
>
> Thanks in advance for your ideas and insights,
>
> DaveG

--
Matt Kangas


Re: API for injecting content into Nutch?

Piotr Kosiorowski
In reply to this post by Goldschmidt, Dave
Hi,
I am not sure what you mean by "injecting content into Nutch", but to
create a segment you can use the SegmentWriter class. To update the
WebDB, the IWebDBWriter interface might be useful. The best place to
learn what kind of data is stored in a segment is probably the fetcher
code.
Regards
Piotr

Re: API for injecting content into Nutch?

Jon Shoberg
In reply to this post by Goldschmidt, Dave
You may want to open the source of Fetcher.java and look at the
handleFetch method.  You'll see how content is parsed and how it is
written to a segment.  From there you can discern how to use the API
and how it fits your needs.

-j