Heritrix

Heritrix

Aled Jones
Hi

Has anyone used Heritrix (http://crawler.archive.org/) as a crawler?  How
does it compare with the Nutch crawler?  Can Nutch serve its crawled
results?  The main reason I'm interested is that it has a web UI (WUI)
that might make maintenance easier for the IT guys, although I know that
some of you are working on an interface.

Cheers
Aled


###########################################

This message has been scanned by F-Secure Anti-Virus for Microsoft Exchange.
For more information, connect to http://www.f-secure.com/
************************************************************************
This e-mail and any attachments are strictly confidential and intended solely for the addressee. They may contain information which is covered by legal, professional or other privilege. If you are not the intended addressee, you must not copy the e-mail or the attachments, or use them for any purpose or disclose their contents to any other person. To do so may be unlawful. If you have received this transmission in error, please notify us as soon as possible and delete the message and attachments from all places in your computer where they are stored.

Although we have scanned this e-mail and any attachments for viruses, it is your responsibility to ensure that they are actually virus free.
 

Re: Heritrix

Zaheed Haque
Hi:

Nutch will soon have an admin GUI, thanks to Stefan!

http://issues.apache.org/jira/browse/NUTCH-251

Cheers

On 4/28/06, Aled Jones <[hidden email]> wrote:

> Hi
>
> Anyone used Heritrix (http://crawler.archive.org/) as a crawler?  How
> does it compare with the Nutch crawler?  Can Nutch serve its crawled
> results?   Main reason I'm interested is that it has a WUI interface
> that might make maintenance for the IT guys easier, although I know that
> some of you guys are working on an interface.
>
> Cheers
> Aled

Re: Heritrix

Nuther
But the admin GUI will be in version 0.8.
----- Original Message -----
From: "Zaheed Haque" <[hidden email]>
To: <[hidden email]>
Sent: Friday, April 28, 2006 1:05 PM
Subject: Re: Heritrix


Hi:

Nutch will soon have admin gui thanks to stefan!

http://issues.apache.org/jira/browse/NUTCH-251

Cheers


RE: Heritrix

Dan Morrill-3
In reply to this post by Aled Jones
Aled,

I used Heritrix before moving over to Nutch. While it is an excellent
program with lots of good things to offer, it didn't quite meet my needs,
and when I was designing the architecture it had too many dependencies for
me to be comfortable with.

If you want to run an internet archive, though, Heritrix cannot be beaten;
if you want to run a search engine, Nutch is a good choice.

My personal opinion.
r/d

-----Original Message-----
From: Aled Jones [mailto:[hidden email]]
Sent: Friday, April 28, 2006 1:59 AM
To: [hidden email]
Subject: Heritrix

Hi

Anyone used Heritrix (http://crawler.archive.org/) as a crawler?  How
does it compare with the Nutch crawler?  Can Nutch serve its crawled
results?   Main reason I'm interested is that it has a WUI interface
that might make maintenance for the IT guys easier, although I know that
some of you guys are working on an interface.

Cheers
Aled



ATB: Heritrix

Aled Jones
In reply to this post by Aled Jones
Thanks for your replies, guys.  I hadn't realised that the admin GUI was
already in development.
We should be able to cope until it gets released ;-)

Thanks again
Aled

> -----Original Message-----
> From: Dan Morrill [mailto:[hidden email]]
> Sent: 28 April 2006 14:07
> To: [hidden email]
> Subject: RE: Heritrix
>
> Aled,
>
> I used heritrix before going over to nutch, while it is an
> excellent program, with lots of good things to offer, it
> didn't quite meet my need, and when designing the
> architecture had too many dependencies for me to be comfortable with.
>
> If you want to run an internet archive though, heritrix can
> not be beat, if you want to run a search engine, nutch is a
> good choice.
>
> My personal opinion.
> r/d

Admin Gui beta test (was Re: ATB: Heritrix)

Stefan Groschupf-2
Hi there,

Since building the GUI is somewhat complicated, I was thinking about
providing a ready-to-use binary.
This might help us get some more of the beta testers we are currently
looking for.
Any thoughts?

However, I'm afraid that this would hit my server too hard, and I have to
pay for traffic. :-/
Does anyone have an idea where we could mirror this file for free?
Any volunteers are very welcome.

Thanks.
Stefan




On 28.04.2006 at 15:14, Aled Jones wrote:

> Thanks for your replies guys.  I hadn't realised that the admin gui  
> was
> already in development.
> We should be able to cope till it gets released ;-)
>
> Thanks again
> Aled

---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com



RE: Heritrix

Dan Morrill-3
In reply to this post by Aled Jones
Aled,

I guess the other question is what you are trying to do. For example, if you
need to automate the crawl, you can write a shell script and cron it (well, OK,
I am using task manager). If you want to watch the logs on screen in a
terminal window, you can run tail -f crawl.log (I am using WinTail). I am
more than happy to help if you want to automate your Nutch jobs.
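
As a rough sketch of the log-watching idea above, for anyone without tail or WinTail to hand: the Python below follows a growing log file and prints only the lines that contain a keyword. The file name crawl.log comes from the message; the "ERROR" keyword, the function names and the polling interval are assumptions made for illustration, not anything Nutch defines.

```python
import os
import time
from typing import Iterator, Optional, TextIO


def follow(handle: TextIO, poll_seconds: float = 0.5,
           max_idle_polls: Optional[int] = None) -> Iterator[str]:
    """Yield complete lines as they are appended to an open text file."""
    pending = ""  # holds a partial line until its newline arrives
    idle_polls = 0
    while True:
        chunk = handle.readline()
        if chunk:
            idle_polls = 0
            pending += chunk
            if pending.endswith("\n"):
                yield pending.rstrip("\r\n")
                pending = ""
            continue
        idle_polls += 1
        if max_idle_polls is not None and idle_polls >= max_idle_polls:
            return  # stop after this many empty reads in a row
        time.sleep(poll_seconds)


def watch(path: str, keyword: str = "ERROR") -> None:
    """Print each new line in `path` containing `keyword` (assumed marker)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        handle.seek(0, os.SEEK_END)  # start at the end, like tail -f
        for line in follow(handle):
            if keyword in line:
                print(line)
```

Calling watch("crawl.log") would then run until interrupted. Leaving max_idle_polls unset keeps it following indefinitely; setting it makes the loop stop once the log goes quiet.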

I automated as much as I could of the processes that I wanted Nutch to do,
and it sits quietly in the corner doing all the work: merging, indexing,
rebuilding, and stopping and starting Tomcat. So it is possible to automate
Nutch by scripting so that it is 90% standalone. However, it's all Windows
scripting; I am not running on Linux, so I have no Linux scripts.
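
A minimal, platform-neutral sketch of that kind of unattended job, assuming each stage (merge, index, restarting Tomcat and so on) is already available as its own command: the runner below executes the stages in order and stops at the first one that fails. The step names and commands are placeholders supplied by the caller, not actual Nutch or Tomcat invocations.

```python
import subprocess
import sys
from typing import List, Sequence, Tuple

# A step is a label plus the command line to run for it (both hypothetical).
Step = Tuple[str, Sequence[str]]


def run_steps(steps: Sequence[Step]) -> List[str]:
    """Run steps in order, stop at the first failure, return the names that succeeded."""
    completed: List[str] = []
    for name, command in steps:
        result = subprocess.run(command)
        if result.returncode != 0:
            # Report the failing step so a scheduler log shows where it stopped.
            print(f"step {name!r} failed with exit code {result.returncode}",
                  file=sys.stderr)
            break
        completed.append(name)
    return completed
```

A scheduled job (cron or a Windows scheduled task) could call run_steps with the site's own scripts and compare the returned list with the list it passed in to decide whether the run finished cleanly.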

r/d

-----Original Message-----
From: Aled Jones [mailto:[hidden email]]
Sent: Friday, April 28, 2006 6:14 AM
To: [hidden email]
Subject: ATB: Heritrix

Thanks for your replies guys.  I hadn't realised that the admin gui was
already in development.
We should be able to cope till it gets released ;-)

Thanks again
Aled


ATB: Admin Gui beta test (was Re: ATB: Heritrix)

Aled Jones
In reply to this post by Stefan Groschupf-2
Yeah, I'd definitely beta test it if there were a binary or fairly
easy-to-follow instructions.

> -----Original Message-----
> From: Stefan Groschupf [mailto:[hidden email]]
> Sent: 28 April 2006 14:24
> To: [hidden email]
> Subject: Admin Gui beta test (was Re: ATB: Heritrix)
>
> Hi there,
>
> since building the gui is some how complicated I was thinking
> about providing a ready to use binary.
> This may be would help to get some more beta testers we
> currently looking for.
> Any thoughts?
>
> However I afraid that this would hit my server to hard and I
> have to pay for traffic. :-/ Does any one has an idea where
> we can mirror this file for free?
> Any volunteer is very welcome.
>
> Thanks.
> Stefan

RE: Admin Gui beta test (was Re: ATB: Heritrix)

Dan Morrill-3
In reply to this post by Stefan Groschupf-2
Stefan -

I can host the file at http://www.oaktreesecurity.com if you would like. I
have about 2 gigs of bandwidth a month and I use maybe 10 megs, so I think I
can accommodate it. I am more than happy to host a freestanding binary.

Do you have a Windows-compatible version (or will it run in Cygwin), or is
it Linux only?

r/d

-----Original Message-----
From: Stefan Groschupf [mailto:[hidden email]]
Sent: Friday, April 28, 2006 6:24 AM
To: [hidden email]
Subject: Admin Gui beta test (was Re: ATB: Heritrix)

Hi there,

since building the gui is some how complicated I was thinking about  
providing a ready to use binary.
This may be would help to get some more beta testers we currently  
looking for.
Any thoughts?

However I afraid that this would hit my server to hard and I have to  
pay for traffic. :-/
Does any one has an idea where we can mirror this file for free?
Any volunteer is very welcome.

Thanks.
Stefan





Re: Admin Gui beta test (was Re: ATB: Heritrix)

sudhendra seshachala
In reply to this post by Stefan Groschupf-2
Hi Stefan
  I would be willing to host the app.
  I have a virtual dedicated server from GoDaddy with Fedora Core 2, and the Apache web server and Tomcat running.
  The IP address is http://68.178.249.66. Right now, on the web server side, I only have the default page (hosted by GoDaddy) running,
  but I can make sure the admin GUI is running. I might need some help, but it should not be a problem at all.
   
   
  Thanks
  Sudhi
 

Stefan Groschupf <[hidden email]> wrote:
  Hi there,

since building the gui is some how complicated I was thinking about
providing a ready to use binary.
This may be would help to get some more beta testers we currently
looking for.
Any thoughts?

However I afraid that this would hit my server to hard and I have to
pay for traffic. :-/
Does any one has an idea where we can mirror this file for free?
Any volunteer is very welcome.

Thanks.
Stefan




Am 28.04.2006 um 15:14 schrieb Aled Jones:

> Thanks for your replies guys. I hadn't realised that the admin gui
> was
> already in development.
> We should be able to cope till it gets released ;-)
>
> Thanks again
> Aled
>
>> -----Neges Wreiddiol-----/-----Original Message-----
>> Oddi wrth/From: Dan Morrill [mailto:[hidden email]]
>> Anfonwyd/Sent: 28 April 2006 14:07
>> At/To: [hidden email]
>> Pwnc/Subject: RE: Heritrix
>>
>> Aled,
>>
>> I used heritrix before going over to nutch, while it is an
>> excellent program, with lots of good things to offer, it
>> didn't quite meet my need, and when designing the
>> architecture had too many dependencies for me to be comfortable with.
>>
>> If you want to run an internet archive though, heritrix can
>> not be beat, if you want to run a search engine, nutch is a
>> good choice.
>>
>> My personal opinion.
>> r/d
>>
>> -----Original Message-----
>> From: Aled Jones [mailto:[hidden email]]
>> Sent: Friday, April 28, 2006 1:59 AM
>> To: [hidden email]
>> Subject: Heritrix
>>
>> Hi
>>
>> Anyone used Heritrix (http://crawler.archive.org/) as a
>> crawler? How does it compare with the Nutch crawler? Can
>> Nutch serve its crawled
>> results? Main reason I'm interested is that it has a WUI interface
>> that might make maintenance for the IT guys easier, although
>> I know that some of you guys are working on an interface.
>>
>> Cheers
>> Aled
>>
>>
>>
>>
>>
>>
>
>
>

---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com





  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   


                       
Reply | Threaded
Open this post in threaded view
|

ATB: Heritrix

Aled Jones
In reply to this post by Dan Morrill-3
Our live implementation is still at the spec stage at the moment (I'm the
prototypes guy), but I'm guessing we'll need automation of the crawl,
merge, index, dedup etc. and a way of monitoring progress, checking for
errors etc.  Our development team should be OK with the cron jobs, logs
etc., but ideally for our support team, an easy-to-use GUI would make
their jobs much easier.  For example, a customer might ring up with a
query such as "new content has been put up but I can't find x"; 1st
level support should be able to look at an admin page that quickly
tells them when the content was last crawled, whether there were any
errors etc.
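
To make that concrete, here is a minimal sketch of the kind of wrapper a
cron job (or scheduled task) could call. It is not part of Nutch; the
function names, the tab-separated log format and the log file are all
assumptions made for illustration. Each step is just a name plus a
command line, so the real crawl, merge, index and dedup commands for a
given install would have to be filled in by whoever sets it up.

```python
import datetime
import subprocess
from pathlib import Path


def run_pipeline(steps, log_path):
    """Run (name, command) steps in order and append one log line per step.

    Each log line is "timestamp<TAB>step name<TAB>OK|ERROR" (an assumed format).
    Stops at the first step that exits non-zero and returns False.
    """
    with Path(log_path).open("a", encoding="utf-8") as log:
        for name, command in steps:
            started = datetime.datetime.now().isoformat(timespec="seconds")
            # command is a list such as ["some-program", "arg"]; the actual
            # crawl/merge/index/dedup command lines are not specified here.
            result = subprocess.run(command, capture_output=True, text=True)
            status = "OK" if result.returncode == 0 else "ERROR"
            log.write(f"{started}\t{name}\t{status}\n")
            if result.returncode != 0:
                return False
    return True


def last_status(log_path):
    """Return {step name: (timestamp, status)} for the latest run of each step."""
    latest = {}
    for line in Path(log_path).read_text(encoding="utf-8").splitlines():
        timestamp, name, status = line.split("\t")
        latest[name] = (timestamp, status)  # later lines overwrite earlier ones
    return latest
```

A scheduled job would call run_pipeline with the four steps in order, and
a simple status page for 1st level support could be built on last_status,
which answers "when did each step last run, and did it fail?".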

Cheers
Aled

> -----Original Message-----
> From: Dan Morrill [mailto:[hidden email]]
> Sent: 28 April 2006 14:30
> To: [hidden email]
> Subject: RE: Heritrix
>
> Aled,
>
> I guess the other question is what are you trying to do, for
> example, if you need to automate the crawl you can make a
> shell script and cron it (well ok, I am using task manager).
> If you want to watch the logs on the screen in a terminal
> window, you can tail -f crawl.log it (I am using wintail), I
> am more than happy to help if you want to automate your nutch jobs.
>
> I automated as much as I could on those processes that I
> wanted nutch to do, and it sits quietly in the corner doing
> all the work, merging, indexing, rebuilding, stopping and
> starting tomcat, so it is possible to automate nutch so that
> it is 90% stand alone by scripting. Although, its all windows
> scripting, I am not running on linux, I have no linux scripts.
>
> r/d
>
> -----Original Message-----
> From: Aled Jones [mailto:[hidden email]]
> Sent: Friday, April 28, 2006 6:14 AM
> To: [hidden email]
> Subject: ATB: Heritrix
>
> Thanks for your replies guys.  I hadn't realised that the
> admin gui was already in development.
> We should be able to cope till it gets released ;-)
>
> Thanks again
> Aled
>
> > -----Original Message-----
> > From: Dan Morrill [mailto:[hidden email]]
> > Sent: 28 April 2006 14:07
> > To: [hidden email]
> > Subject: RE: Heritrix
> >
> > Aled,
> >
> > I used heritrix before going over to nutch, while it is an excellent
> > program, with lots of good things to offer, it didn't quite meet my
> > need, and when designing the architecture had too many dependencies
> > for me to be comfortable with.
> >
> > If you want to run an internet archive though, heritrix can not be
> > beat, if you want to run a search engine, nutch is a good choice.
> >
> > My personal opinion.
> > r/d
> >
> > -----Original Message-----
> > From: Aled Jones [mailto:[hidden email]]
> > Sent: Friday, April 28, 2006 1:59 AM
> > To: [hidden email]
> > Subject: Heritrix
> >
> > Hi
> >
> > Anyone used Heritrix (http://crawler.archive.org/) as a crawler?  How
> > does it compare with the Nutch crawler?  Can Nutch serve its crawled
> > results?   Main reason I'm interested is that it has a WUI interface
> > that might make maintenance for the IT guys easier, although I know
> > that some of you guys are working on an interface.
> >
> > Cheers
> > Aled
> >
> >
> >  
> >
> >
> >
>
>
 

Reply | Threaded
Open this post in threaded view
|

Re: Admin Gui beta test (was Re: ATB: Heritrix)

Andrzej Białecki-2
In reply to this post by Stefan Groschupf-2
Stefan Groschupf wrote:

> Hi there,
>
> since building the gui is some how complicated I was thinking about
> providing a ready to use binary.
> This may be would help to get some more beta testers we currently
> looking for.
> Any thoughts?
>
> However I afraid that this would hit my server to hard and I have to
> pay for traffic. :-/
> Does any one has an idea where we can mirror this file for free?
> Any volunteer is very welcome.

I think it should be possible to put your binary at the Apache site,
probably Doug will be the right person to talk to ...

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


vis
Reply | Threaded
Open this post in threaded view
|

Re: Admin Gui beta test (was Re: ATB: Heritrix)

vis
In reply to this post by Stefan Groschupf-2
Sorry, I am on holiday until the 8th of May.

Please contact the [hidden email] for urgent matters.

Kind regards, Herman.

Reply | Threaded
Open this post in threaded view
|

Re: Admin Gui beta test (was Re: ATB: Heritrix)

Doug Cutting
In reply to this post by Andrzej Białecki-2
Andrzej Bialecki wrote:
> I think it should be possible to put your binary at the Apache site,
> probably Doug will be the right person to talk to ...

Have you tried attaching it to a Jira issue?

If that fails, you could attach it to a page on the Wiki, no?

Doug
Reply | Threaded
Open this post in threaded view
|

Re: Admin Gui beta test (was Re: ATB: Heritrix)

Stefan Groschupf-2
In reply to this post by Doug Cutting

>> I think it should be possible to put your binary at the Apache  
>> site, probably Doug will be the right person to talk to ...
>
> Have you tried attaching it to a Jira issue?
The nutch -xxxtar.gz file is 67 MB, but the maximum file upload size is 10.00 MB.
>
> If that fails, you could attach it to a page on the Wiki, no?
Is that a good idea? The file is that big and we already got the  
request to use the apache mirror servers.
Anyway, I have already had some offline offers from people; I just
thought it would be a good idea to keep it under the Nutch
project flag.








Reply | Threaded
Open this post in threaded view
|

Re: Admin Gui beta test (was Re: ATB: Heritrix)

Doug Cutting
In reply to this post by Stefan Groschupf-2
Stefan Groschupf wrote:
>> If that fails, you could attach it to a page on the Wiki, no?
>
> Is that a good idea? The file is that big and we already got the  
> request to use the apache mirror servers.

The Apache mirrors are really for signed, Apache releases, which this is
not.  Is it too big for the wiki?

Doug
Reply | Threaded
Open this post in threaded view
|

Re: Admin Gui beta test (was Re: ATB: Heritrix)

gekkokid
In reply to this post by Doug Cutting
What about putting it on SourceForge?
http://sourceforge.net/projects/nutch



----- Original Message -----
From: "Doug Cutting" <[hidden email]>
To: <[hidden email]>
Sent: Saturday, April 29, 2006 12:18 AM
Subject: Re: Admin Gui beta test (was Re: ATB: Heritrix)


> Stefan Groschupf wrote:
>>> If that fails, you could attach it to a page on the Wiki, no?
>>
>> Is that a good idea? The file is that big and we already got the  
>> request to use the apache mirror servers.
>
> The Apache mirrors are really for signed, Apache releases, which this is
> not.  Is it too big for the wiki?
>
> Doug
>