where nutch store crawled data

where nutch store crawled data

beansproud
Hi,
    I'm new to Nutch. When I use Nutch to crawl pages, I can read the crawled data with the command: nutch readseg.
    My question is: can I get at the data directly? I just can't find where Nutch puts it.
    Can anybody tell me?
    Thanks very much!
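The readseg command mentioned above can dump a segment to plain text. A minimal sketch, assuming a local Nutch install and a crawl directory named crawls/crawl-xyz (the crawl path and the segment timestamp are illustrative, not from this thread):

```shell
# Dump one segment's stored records (raw content, fetch status, parse data)
# into a readable text file. Substitute your own crawl and segment paths.
bin/nutch readseg -dump crawls/crawl-xyz/segments/20080616120000 seg-dump

# The output is written as a plain-text file (typically seg-dump/dump).
less seg-dump/dump
```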

RE: where nutch store crawled data

POIRIER David
When executing a crawl, Nutch creates segments (one per crawl depth, if I'm
not mistaken) in which the fetched content is stored. For example, if you
crawl a web site named site-xyz into the directory
$nutch_home/crawls/crawl-xyz, you will find the segments in
$nutch_home/crawls/crawl-xyz/segments. Inside each segment directory you
will find a content directory.

To be honest, I don't think you can directly access the stored content
found in those directories; the idea is to index it, not necessarily to
store it.

David



-----Original Message-----
From: beansproud [mailto:[hidden email]]
Sent: Monday, 16 June 2008 16:42
To: [hidden email]
Subject: where nutch store crawled data


Hi,
    I'm fresh for nutch.And when I use nutch for crawling pages.I can
get
the crawled data by using the command : nutch readseg.
    My question is can I get the data directly ? I just can't find where
nutch put them.
    Can anybody tell me ?
    Thanks very much!
--
View this message in context:
http://www.nabble.com/where-nutch-store-crawled-data-tp17865961p17865961.html
Sent from the Nutch - User mailing list archive at Nabble.com.


RE: where nutch store crawled data

beansproud
Oh, you are right.
Thanks!


Re: where nutch store crawled data

Winton Davies-4
I have a follow-up question: is it possible to write directly to the crawl
DB? I have several million HTML pages stored in a single concatenated flat
file, and I'd like to just run a utility over them to feed them to Nutch's
parsing/indexing rather than having to dump them as individual files.
Looking at the API documentation, I couldn't find any obvious capabilities.

I've no idea whether the fetch -> crawldb step does the parse and URL
extraction before it writes anyway. If it's not possible, then it doesn't
matter, but if it is, it would save having to write out lots of files.

Winton




Re: where nutch store crawled data

Marcus Herou
In reply to this post by beansproud
You can fetch it, but it is not pretty.

It is just a SequenceFileInputFormat:
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html

Look at the org.apache.nutch.crawl.Crawl class, and specifically at how it
uses the Indexer.

Kindly

//Marcus
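To sketch what that looks like in practice: the fetched pages sit in <segment>/content as map files of <Text url, Content> pairs, and a map file's data part is itself a sequence file, so a plain SequenceFile.Reader can walk it. This is an illustrative sketch only, assuming Nutch/Hadoop APIs of that era and an example segment path; it is not tested code from this thread.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

// Sketch: iterate over the raw fetched content stored in one segment.
// The segment path is an assumption; point it at one of your own segments.
public class DumpSegmentContent {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Each segment's content lives in <segment>/content/part-NNNNN/data.
    Path data = new Path(
        "crawls/crawl-xyz/segments/20080616120000/content/part-00000/data");

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Text url = new Text();
    Content content = new Content();
    while (reader.next(url, content)) {
      // content.getContent() is the raw fetched bytes of the page.
      System.out.println(url + " -> " + content.getContentType()
          + ", " + content.getContent().length + " bytes");
    }
    reader.close();
  }
}
```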



--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[hidden email]
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: where nutch store crawled data

Marcus Herou
Oh sorry, I just saw CrawlDbReader, which has different methods, one in
particular for retrieving content based on a URL.

//Marcus
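The same lookup is also exposed on the command line via the readdb tool; a sketch, with illustrative crawldb path and URL:

```shell
# Print the CrawlDatum (fetch status, fetch time, score, metadata)
# recorded for one URL. Paths and URL are examples.
bin/nutch readdb crawls/crawl-xyz/crawldb -url http://site-xyz/some/page.html

# Or dump the whole crawldb to text for inspection:
bin/nutch readdb crawls/crawl-xyz/crawldb -dump crawldb-dump
```

Note that this returns the crawl status record for the URL, not the fetched page bytes; the raw content itself still lives in the segments.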


Re: where nutch store crawled data

Chris Anderson-11
In reply to this post by Marcus Herou
My team is working on a Streaming.jar for Nutch that outputs the crawled
pages in JSON format. Hopefully we'll be able to share it once we know it
is solid. This way you can send the crawled data to programs written in
any language.




--
Chris Anderson
http://jchris.mfdz.com

Re: where nutch store crawled data

Marcus Herou
And I'm working on a solution that uses HBase as the backend :)


Re: where nutch store crawled data

Otis Gospodnetic-2-2
In reply to this post by beansproud
Hi,

Both of you should open some JIRA issues and upload your patches there as you progress, so others can see the direction you are headed and make suggestions when appropriate.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




Re: where nutch store crawled data

beansproud
In reply to this post by Winton Davies-4
I think you can use a file:// URL as the seed, and you should modify the
URL filter; Nutch uses regex-urlfilter.txt by default.
I haven't tried it, but I think it should work.
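Concretely, the stock conf/regex-urlfilter.txt contains a rule that skips file:, ftp: and mailto: URLs, so crawling local files means editing it along these lines (a sketch against the default filter file, untested here) and also making sure the protocol-file plugin is listed in plugin.includes in nutch-site.xml:

```
# The default rule is "-^(file|ftp|mailto):"; drop "file" from it:
-^(ftp|mailto):

# Explicitly accept file: URLs:
+^file://

# ...keep the remaining default rules, including the final catch-all:
+.
```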


Re: where nutch store crawled data

beansproud
In reply to this post by Winton Davies-4
By the way, when I modified the urlfilter, the change only took effect after I recompiled. I don't know why, but that's just how it is.


Re: where nutch store crawled data

Marcus Herou
In reply to this post by Winton Davies-4
Look at how the Fetcher class writes a segment. Then you can skip the
generate, fetch, and updatedb steps and just run merge segments and index.

/M
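A very rough sketch of that idea: write <url, Content> pairs straight into a segment's content directory, the way Fetcher's output format does. Everything here is an assumption (the paths, the hypothetical loadPagesSortedByUrl() helper, the old-style Hadoop MapFile API), and a segment written this way still lacks the crawl_fetch/crawl_parse/parse_data/parse_text parts that indexing expects, which is why studying Fetcher first is the safer route.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.protocol.Content;

// Very rough sketch: feed pages from a flat file into a segment's
// content directory without fetching them over the network.
public class WriteContentDirectly {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Segment path is an example; Nutch names segments by timestamp.
    String dir = "crawls/crawl-xyz/segments/20080617000000/content/part-00000";

    MapFile.Writer writer =
        new MapFile.Writer(conf, fs, dir, Text.class, Content.class);
    // MapFile keys must be appended in sorted order, so sort by URL first.
    for (String[] page : loadPagesSortedByUrl()) { // hypothetical helper
      String url = page[0], html = page[1];
      Content content = new Content(url, url, html.getBytes("UTF-8"),
          "text/html", new Metadata(), conf);
      writer.append(new Text(url), content);
    }
    writer.close();
  }

  // Placeholder: parse your concatenated flat file into (url, html) pairs.
  static java.util.List<String[]> loadPagesSortedByUrl() {
    return java.util.Collections.emptyList();
  }
}
```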
