Getting the real data not only the segment files/index

Getting the real data not only the segment files/index

Nils Höller-2
Hi,

I've worked with Nutch until last year and
am now trying to do something new with it (about continuous queries).

So far I have only used Nutch for building the index and searching a
generated site map (with the WebDB).

Now I want to use it to keep an archive of a certain number of sites.
I want Nutch to crawl the sites every day (like I used it
before), but also download and save the REAL content of the sites (all
HTML and pictures), so I can work with that content.

Is there a way to make Nutch also save the content as it is
crawled, and not only create the WebDB and index?

At the moment I have a solution with a Perl script, wget, and Lucene, but
it would be perfect if I could use Nutch from now on.
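For reference, the wget-based workflow described above can be sketched as a small daily script. This is only a minimal sketch, not the poster's actual script: the site list file name (sites.txt), the archive/ directory layout, and the exact wget flags are assumptions.

```shell
#!/bin/sh
# Mirror every site listed in sites.txt (one URL per line) into a
# dated directory, fetching each page plus the images/CSS it needs.
ARCHIVE_DIR="archive/$(date +%Y-%m-%d)"
mkdir -p "$ARCHIVE_DIR"
if [ -f sites.txt ]; then
  while IFS= read -r site; do
    wget --mirror --page-requisites --convert-links \
         --directory-prefix="$ARCHIVE_DIR" "$site"
  done < sites.txt
fi
```

Run once a day (e.g. from cron), this keeps one dated snapshot per day; the saved HTML and images can then be indexed separately with Lucene.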

Thanks for your help.

Nils

Re: Getting the real data not only the segment files/index

Arun Sharma-3
Hi Nils,

As far as I know, Nutch does not support this feature to date.
If it does, do let me know. I also need Nutch to support this feature;
otherwise I am planning to move to the same approach as you, using wget
and Lucene.

Keep in touch...
./Arun


On 11/7/06, Nils Höller <[hidden email]> wrote:

>
> Hi,
>
> I ve worked with Nutch till last year and
> I am now trying to do something (about continious queries) new with it.
>
> I have only used nutch for getting the index an searching something in a
> generated site-map (with the WebDB).
>
> Now I want to use it for to get a archive of a certain number of sites.
> So I ll want to nutch to crawl the sites every day (like I used it
> before) but also download and save the REAL content of the sites (all
> html and pictures), so I can work with this real content.
>
> Is there a possibility to make nutch save also the content like it is
> crawled, and not only creating the WebDB and Index?
>
> Actually I have a solution with a perl script, wget, and lucene, but
> it would be perfect if I can use nutch from now on.
>
> Thanks for your help.
>
> Nils
>
>