crawl db stats


crawl db stats

Stefan Groschupf-2
Hi,
is there any way to read statistics from the Nutch 0.8 crawl db, or a
trick to get an idea of how many pages have already been crawled?
Thanks for the hints.
Stefan


Re: crawl db stats

Michael Ji
Use the DBAdminTool to dump the webdb; you will get the whole list of
Pages in text format.
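
(As a hedged pointer: in the 0.7 line this tool is reachable through
bin/nutch; from memory the invocation is roughly

  bin/nutch admin db -textdump dump

with "db" being the webdb directory, but the exact command and flags may
differ, so check the tool's usage output.)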

Michael Ji,


Re: crawl db stats

Stefan Groschupf-2
Which class do you mean?
There is the old webdbadmin tool, but I guess that will not work for
the new crawl db.
The bin/nutch admin command isn't supported anymore.
Thanks,
Stefan




Re: crawl db stats

Michael Ji
Or you can use segread in bin/nutch to dump a new fetch segment and see
which pages it fetched.
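
(Again hedged, from memory of the 0.7 tools; verify against the usage
output:

  bin/nutch segread -dump segments/20051015123456

which should print the fetched entries of that segment as text.)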

Michael Ji,


Re: crawl db stats

Stefan Groschupf-2
Michael,
I'm afraid segread doesn't exist in the 0.8 branch anymore.
I knew both methods, but with map reduce the file structures are
different; that is why I was asking.
Thanks anyway.
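
(For reference, a rough sketch of the 0.8-era layout: the crawl db is
no longer the monolithic 0.7 webdb but a directory of MapFile
partitions written by the map reduce jobs, which is why the old readers
cannot parse it. Directory names here are from memory:

  crawldb/
    current/
      part-00000/
        data    <- entries of <url, CrawlDatum>
        index   <- MapFile index
      part-00001/
        ...

with one part-NNNNN directory per reduce task that wrote the db.)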
Stefan





Re: crawl db stats

Andrzej Białecki-2
Stefan Groschupf wrote:
> Michael,
> I'm afraid segread doesn't exist in the 0.8 branch anymore.
> I knew both methods, but with map reduce the file structures are
> different; that is why I was asking.

segread / readdb is on the way... it's actually easy to implement; look
at LinkDbReader for inspiration. If you have some time on your hands I'm
pretty sure you could implement it... if not, I'll do it at the
beginning of next month.
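
(A minimal sketch of what such a reader could look like, following
LinkDbReader's pattern of opening the db partitions with
MapFileOutputFormat.getReaders. It is written against the class names
as they later stabilized in Hadoop; the equivalents of the day lived
under Nutch's own packages. The class name, the Text key type, and the
"current" subdirectory are assumptions:

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.MapFile;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapFileOutputFormat;
  import org.apache.nutch.crawl.CrawlDatum;

  /** Sequential crawl db dumper/counter, in the spirit of LinkDbReader. */
  public class CrawlDbDumper {
    public static void main(String[] args) throws IOException {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      // the db's MapFile partitions live under <crawldb>/current
      MapFile.Reader[] readers =
          MapFileOutputFormat.getReaders(fs, new Path(args[0], "current"), conf);
      long total = 0;
      Text url = new Text();            // key type assumed (older dbs used UTF8)
      CrawlDatum datum = new CrawlDatum();
      for (int i = 0; i < readers.length; i++) {
        while (readers[i].next(url, datum)) {
          total++;                                  // count every entry...
          System.out.println(url + "\t" + datum);   // ...and dump it as text
        }
        readers[i].close();
      }
      System.out.println("TOTAL urls: " + total);
    }
  }

Invoked with the crawldb directory as its only argument it dumps every
entry and prints the total; dropping the per-entry println turns it
into a pure counter.)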

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: crawl db stats

Michael Ji
In reply to this post by Stefan Groschupf-2
Really? Currently my development is based on Nutch 0.7.

I will try 0.8; maybe I will write a dumping function for debugging
purposes and we can share it.

By the way, I didn't see 0.8 being released. Did you mean Nutch 0.7.1?

Thanks for the information,

Michael Ji,


Re: crawl db stats

Stefan Groschupf-2
In reply to this post by Andrzej Białecki-2
Andrzej,
thanks for the hint, I will have a look, maybe later today. :-)
Stefan



Re: crawl db stats

Stefan Groschupf-2
In reply to this post by Andrzej Białecki-2
> segread / readdb is on the way... it's actually easy to implement;
> look at LinkDbReader for inspiration. If you have some time on your
> hands I'm pretty sure you could implement it... if not, I'll do it
> at the beginning of next month.
Just using the MapFileOutputFormat and writing a simple class to do
this is easy, but shouldn't such a tool use a reduce to take advantage
of the map reduce mechanism anyway?
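
(To make the question concrete, a rough sketch of the reduce variant:
count urls per CrawlDatum status, with the reducer doubling as the
combiner. Again written against the later-stabilized mapred class
names; the class and job names are made up for illustration:

  import java.io.IOException;
  import java.util.Iterator;

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;
  import org.apache.hadoop.mapred.SequenceFileInputFormat;
  import org.apache.hadoop.mapred.TextOutputFormat;
  import org.apache.nutch.crawl.CrawlDatum;

  /** Map reduce based crawl db stats: count urls per CrawlDatum status. */
  public class CrawlDbStats {

    /** Map: emit one count for the total and one per status value. */
    public static class StatMapper extends MapReduceBase
        implements Mapper<Text, CrawlDatum, Text, LongWritable> {
      private static final LongWritable ONE = new LongWritable(1);
      public void map(Text url, CrawlDatum datum,
          OutputCollector<Text, LongWritable> output, Reporter reporter)
          throws IOException {
        output.collect(new Text("TOTAL urls"), ONE);
        output.collect(new Text("status " + datum.getStatus()), ONE);
      }
    }

    /** Reduce: sum the counts per key; also usable as a combiner. */
    public static class StatReducer extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {
      public void reduce(Text key, Iterator<LongWritable> values,
          OutputCollector<Text, LongWritable> output, Reporter reporter)
          throws IOException {
        long sum = 0;
        while (values.hasNext()) sum += values.next().get();
        output.collect(key, new LongWritable(sum));
      }
    }

    public static void main(String[] args) throws IOException {
      JobConf job = new JobConf(CrawlDbStats.class);
      job.setJobName("crawldb stats " + args[0]);
      FileInputFormat.addInputPath(job, new Path(args[0], "current"));
      job.setInputFormat(SequenceFileInputFormat.class);
      job.setMapperClass(StatMapper.class);
      job.setCombinerClass(StatReducer.class);
      job.setReducerClass(StatReducer.class);
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      job.setOutputFormat(TextOutputFormat.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(LongWritable.class);
      JobClient.runJob(job);
    }
  }

The job's output is then a small text file with one count per status,
which answers the "how many pages are crawled" question directly.)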


I may just write a simple class for now, and we can turn it into a
reduce job as a next step...?
Any thoughts?
Stefan


patch: Re: crawl db stats

Stefan Groschupf-2
In reply to this post by Andrzej Białecki-2
Hi nutch 0.8 geeks,
what do you think about the following solution?
As mentioned, we may later have a map reduce based solution, but this
one is fairly fast for a larger db as well.

If there are no comments I will add this to our issue tracking later
today.

Greetings,
Stefan


Re: patch: Re: crawl db stats

Stefan Groschupf-2
Oh, interesting: the Apache mailing list system filters out
attachments. :-)
That makes sense. I will put everything into the issue tracker...
