Strategic Direction of Nutch

Strategic Direction of Nutch

Anthony May-2
Greetings all,

I have just been handed the administration of our Nutch implementation.
We are currently using Nutch 0.7 and it very badly needs updating.
However we are evaluating several options, and I wanted to know about
where nutch is going as a project. I have not been able to find anything
in the wiki or in the mailing list archives with this information
(forgive me if I have missed it).

The central issue is that we need to crawl our own website, about
200,000 pages and documents, from a single machine running Nutch, not
to crawl the web with a massively scalable architecture. I have heard
that Nutch is moving towards the latter and that the former usage has
become very slow in 0.8 compared to 0.7. Is this correct?

Thank you for helping me out.

Regards,


Anthony May
Web Developer
NZQA


Re: Strategic Direction of Nutch

Piotr Kosiorowski
Anthony,
I do not think Nutch can forget about small implementations. They were
one of its strong points, and I do think we will want to support them.
Please report any issues in JIRA and I am sure they will be taken care of.
Regards
Piotr


Re: Strategic Direction of Nutch

Nutch Newbie
Well, I would like to agree with Piotr here, but with current development,
i.e. version 0.8 and onwards, a single-machine Nutch install is not optimal.
There are various Hadoop-related issues, for example

http://issues.apache.org/jira/browse/HADOOP-206

that are important for a single-machine install. I don't think "one size
fits all" is the catchphrase for Nutch either. That's why, Anthony, I would
suggest you look at Solr or Lucene for your installation.

The problem of 0.8 being slow on a single machine is nothing new; just
search the mailing list and you will find many examples of it. 0.8 was
released earlier this year and the problem is still not solved, so I am
sorry to be negative, but I am just stating facts.



Re: Strategic Direction of Nutch

Andrzej Białecki-2
Nutch Newbie wrote:
> Well, I would like to agree with Piotr here but current development
> i.e. 0.8
> version and onwards single machine nutch install is not optimal there
> are various
> hadoop related issue example
>
> http://issues.apache.org/jira/browse/HADOOP-206

Is it really still a valid issue? I'm pretty sure this was already
fixed, or perhaps it was a matter of putting hard limits in
hadoop-site.xml (which overrides even job.xml values).


> The problem regarding 0.8 being slow on single machine is nothing new
> just search the
> mailing list you will find many example for it. 0.8 was released
> earlier this year and the
> problem is still not solved so I am sorry to be negative but I am just
> stating facts.

What Nutch needs at this moment is more developers and contributors.
This and similar issues could be solved by directly addressing each
problem, if we had the human resources to do so. As it is, there are few
active Nutch developers, and issues are being addressed more slowly
than we would wish.

(BTW, Chris Mattmann will be joining the committers group, so you can
expect some improvements in this regard).

But what Piotr stated is that use cases such as yours _are_ important to
us, and this problem will be fixed sooner or later, whenever we have
free resources to do it. If you can help us with debugging and testing,
and providing patches, this process will be much quicker.

I suspect that we (Nutch community) are the only serious user of Hadoop
in local mode - most development efforts in Hadoop project are geared
towards supporting massive clusters and not single machines. So, I would
say it's up to us - the Nutch community - to provide sufficient feedback
to Hadoop to have such issues addressed.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Strategic Direction of Nutch

carmmello
Hi,
Nutch, from version 0.8 on, is really very, very slow at processing data
after the crawl when using a single machine. Compared with Nutch 0.7.2, I
would say, from my experience indexing about 500,000 pages, that it is
roughly 4 to 5 times slower. In addition, there is no longer any way to fix
broken segments (if the crawl is interrupted for some reason).
So I think one possibility for single-machine users is that the Nutch
developers could spend some of their time improving the previous 0.7.2,
adding some new features to it in further releases of that series. I don't
believe that there are many Nutch users, in the real world of searching,
with a farm of computers. I myself have already built an index of more than
one million pages on a single machine, a somewhat old Athlon 2.4+ with 1 GB
of memory, using version 0.7.2, with very good results, including the
actual searching. I gave up on the same task with version 0.8 because of
the large amount of time, which I did not have, required to complete all
the tasks after the pages were fetched.
Thanks,
Wilson Melo




Re: Strategic Direction of Nutch

Sami Siren-2
carmmello wrote:

> So, I think, one of the possibilities for the user of a single machine
> is that the Nutch developers could use some of their time do improve the
> previous 0.7.2, adding to it some new features, with further releases of
> this series.  I don`t belive that there are many Nutch users, in the
> real world of searching, with a farm of computers.  I, for myself, have
> already built an index of more than one million pages in a single
> machine, with an somewhat old Atlhon 2.4+ and 1 gig of memory, using the
> 0.7.2 version, with very good results, including the actual searching,  
> and gave up the same task, using the 0.8 version, because of the large
> amount of time required, time that I did not have,  to complete all the
> tasks, after the fetching of the pages.

How fast do you need to go?

I did a 1 million page crawl today with the trunk version of Nutch patched
with NUTCH-395 [1]. Total fetching time was a little over 7 hrs.

But of course there are still various ways to optimize the fetching
process, for example optimizing the scheduling of URLs to fetch, or
improving the Nutch agent to use the Accept header [2] to fail fast on
content it cannot handle.

[1]http://issues.apache.org/jira/browse/NUTCH-395
[2]http://www.mail-archive.com/nutch-dev@.../msg04344.html

--
  Sami Siren

Re: Strategic Direction of Nutch

carmmello
Dear Sami Siren,

Thank you for your prompt answer, but my problem with 0.8.1 was not the
fetching time itself, which is on par with 0.7.2 (although your fetching
speed is a lot greater than mine). My problem is the time taken by all the
post-fetching processes, which is much longer than with 0.7.2. When I
indexed those million pages, the whole process took about a weekend;
when I tried to index 500,000 pages with 0.8.1, the fetching went OK,
but after that I could not get the job done. The weekend went by
and I just could not wait any more. That's why I think that, in many cases
on a single machine, 0.7.2 could be a better choice, especially if that
version is updated.

Regards


Re: Strategic Direction of Nutch

Uroš Gruber-2
In reply to this post by Sami Siren-2
Sami Siren wrote:

> carmmello wrote:
>> So, I think, one of the possibilities for the user of a single
>> machine is that the Nutch developers could use some of their time do
>> improve the previous 0.7.2, adding to it some new features, with
>> further releases of this series.  I don`t belive that there are many
>> Nutch users, in the real world of searching, with a farm of
>> computers.  I, for myself, have already built an index of more than
>> one million pages in a single machine, with an somewhat old Atlhon
>> 2.4+ and 1 gig of memory, using the 0.7.2 version, with very good
>> results, including the actual searching,  and gave up the same task,
>> using the 0.8 version, because of the large amount of time required,
>> time that I did not have,  to complete all the tasks, after the
>> fetching of the pages.
>
> How fast do you need to go?
>
> I did a 1 million page crawl today with trunk version of nutch patched
> with NUTCH-395 [1]. total time for fetching was little over 7 hrs.
>
How is that even possible?

I have a 3.2GHz Pentium with 2G RAM. I had the same speed problem, and
because of that I set up Nutch with a single node. About an hour ago the
fetcher finished crawling 1.2 million pages. But this took 30 hours.

Kind    Total  Successful  Failed  Start time            Finish time
Map     2      2           0       12-Nov-2006 15:10:35  13-Nov-2006 05:22:16 (14hrs, 11mins, 41sec)
Reduce  2      2           0       12-Nov-2006 15:10:46  13-Nov-2006 21:59:19 (30hrs, 48mins, 33sec)


During the map job I got about 24 pages/s. I didn't test it with this patch.
But then the reduce job was slow as hell. I really don't understand what took
so long. It is almost twice as slow as the map job.
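
For comparison, the figures quoted so far in this thread can be turned into average pages per second. This is only a rough sketch: the page counts are the rounded values given in the posts, Sami's "little over 7 hrs" is taken as exactly 7 hours, and the helper name pages_per_second is made up for illustration.

```python
from datetime import timedelta


def pages_per_second(pages: int, elapsed: timedelta) -> float:
    """Average throughput over a whole job (hypothetical helper)."""
    return pages / elapsed.total_seconds()


# Rounded figures as quoted in the thread; durations from the job summary.
sami_fetch = pages_per_second(1_000_000, timedelta(hours=7))
uros_map = pages_per_second(
    1_200_000, timedelta(hours=14, minutes=11, seconds=41)
)
uros_whole_job = pages_per_second(
    1_200_000, timedelta(hours=30, minutes=48, seconds=33)
)

for label, rate in [
    ("1M pages in ~7 h (fetch only)", sami_fetch),
    ("1.2M pages, map phase", uros_map),
    ("1.2M pages, whole job incl. reduce", uros_whole_job),
]:
    print(f"{label}: {rate:.1f} pages/s")
```

On these assumptions the map phase comes out close to the roughly 24 pages/s mentioned above, while the rate over the whole 30-hour job is less than half of that.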

I think we need to work on that part.

If I use local mode the numbers are even worse.

I can't imagine how long it would take to crawl, let's say, 10 million pages.

I would like to help make Nutch faster, but there are some parts I don't
quite understand. I need to work on that first.

regards

Uros



Re: Strategic Direction of Nutch

Nutch Newbie
Here are some general comments:

The problem is in Hadoop, i.e. map-reduce, i.e. processing. HADOOP-206
is not solved. Have a look:

http://www.mail-archive.com/hadoop-user%40lucene.apache.org/msg00521.html

Well, again, it is wishful thinking to ask for more developers, patches,
bug reports and bug fixes without focusing on the needs of such
developers. Same example again: HADOOP-206 was reported and it is
still not solved. So how do you expect to get more developers when
a developer has just one machine and it takes 3 days to perform any
serious testing, fetching, indexing or any sort of development? Developers
move on...

See, when the focus of development is on solving the 1000-machine / large
install, issues like 206 never get solved. Thus asking for
more developers to provide bug fixes is wishful thinking.

Sorry, if I knew how to solve the map/reduce problem I would fix it and
submit a patch, and I am sure I am not the only one here. Map/reduce
stuff is not really a walk in the park :-).

The current direction of Nutch development is geared towards large
installs, and it is great software. However, let's not pretend or preach
that Nutch is good for small installs. Nutch left that life when it
embraced Map/Reduce, i.e. starting from 0.8.

Regards,

Re: Strategic Direction of Nutch

Andrzej Białecki-2
(Sorry for the long post, but I felt this issue needs to be made very
clear ...)

Nutch Newbie wrote:

> Here is some general comments:
>
> The problem is in Hadoop i.e. map-reduce, i.e. processing. Hadoop-206
> is not solved..Have a look.
>
> http://www.mail-archive.com/hadoop-user%40lucene.apache.org/msg00521.html
>
> Well, again its a wishful thinking to ask for many developers, patch
> and bug reporting and bug fixes - without focusing on the need of such
> developers.  Same example again!  hadoop-206 was reported and it is
> still not solved. So how do you expect to get more developers? when

Before we get carried away, let me state clearly that reporting a
problem and providing a fix for a problem are two different things -
Hadoop-206 is a problem report, but without a fix. If there were a fix
for it, it would most probably have been applied a long time ago. The
reason it's not solved is that it's not a high priority issue for active
developers, and there is no easy fix to be applied.

If this issue is a high priority for you, then fix it and provide a
patch so that others may benefit from it - that's how Open Source
projects work. Pointing fingers and saying "you should have done this or
that long time ago" won't fix the stuff by itself. Are you a developer?
Then fix it. If not, then you should now understand why we kindly _ask_
for more developers to get involved. Reporting problems is very useful
and crucial, but so is having the skilled manpower to fix them.

>
> See when the focus of the development is to solve 1000 machine/ large
> install,  then the issues like 206 is never solved. Thus asking for
> more developer to provide bug fixes is a wishful thinking.

No, we ask because we really need developers who could help us, who take
initiative to fix something if it's broken in their particular use case.

The focus is on large clusters because that's what the majority of active
developers use. If there were more active developers with a focus on small
clusters (or single machine deployments) - hint, hint - the focus would
move in this direction. There is no conspiracy here, nor do we willfully
ignore the needs of people with small deployments - it's just a matter
of what the priority is among active developers.

Complaining about this won't help as much as providing actual patches to
solve issues. Until then, a faster single-machine deployment is a "nice
to have" thing, but not the top priority.

>
> Sorry if I knew how to solve map/reduce problem i would fix it and
> submit patch and I am sure I am not the only one here. Map/reduce
> stuff is not really walk in the park :-).
>
> The current direction of nutch development is geared towards large
> install and its a great software.  However lets not pretend/preach
> Nutch is good for small install, Nutch left that life when it embraced
> Map/Reduce i.e. starting from 0.8.

You need to take into account that this is the first official release of
Nutch after a major brain surgery, so it's no wonder things are a little
bit twitchy ;) There are in fact very few, if any, places in Nutch that
still use the same data models and algorithms as they did in 0.7 era.

Having said that, I just did a crawl of 1 mln pages within ~30 hours, on
a single machine, which should give me a 100 mln collection within 2
months. This speed is acceptable for me, even if it's slower than 0.7,
and if one day I want to go beyond 100 mln pages I know that I will be
able to do it - which _cannot_ be said about 0.7 ... So, you can look at
it as a tradeoff.
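
The extrapolation above can be checked with a little arithmetic. This sketch assumes a constant crawl rate and a machine running around the clock, which a real crawl will not match exactly; the function names are hypothetical.

```python
HOURS_PER_DAY = 24


def days_needed(target_pages: int, batch_pages: int, batch_hours: float) -> float:
    """Days to reach target_pages at a constant rate of batch_pages per batch_hours."""
    batches = target_pages / batch_pages
    return batches * batch_hours / HOURS_PER_DAY


def hours_per_batch_for(target_pages: int, batch_pages: int, target_days: float) -> float:
    """Hours each batch may take if target_pages must be reached in target_days."""
    batches = target_pages / batch_pages
    return target_days * HOURS_PER_DAY / batches


# 1 million pages per 30 hours, extrapolated to 100 million pages.
days = days_needed(100_000_000, 1_000_000, 30)

# Rate a two-month (60-day) target would require, per 1 million pages.
hours_for_two_months = hours_per_batch_for(100_000_000, 1_000_000, 60)

print(f"{days:.0f} days at 30 h per million pages")
print(f"{hours_for_two_months:.1f} h per million pages for a 60-day target")
```

At exactly 30 hours per million pages the 100 million mark would be about 125 days away; a two-month target would correspond to roughly 14.4 hours per million pages.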

(BTW: the issue with slow reduce phase is well known, and people from
the Hadoop project are working on it even as we speak).

Oh, and regarding the subject of this thread - the strategic direction
of Nutch is to provide a viable platform for medium to large scale
search engines, be they Internet-wide or Intranet / constrained to a
specific area. This was the original goal of the project, and it still
reflects our ambitions. HOWEVER, if a significant part of the active
community is focused on small / embedded deployments, then you need to
make your voice heard _and_ start contributing to the project so that it
becomes a viable solution for your needs too.

I hope this long answer helps you to understand why things are the way
they are ... ;)

--
Best regards,
Andrzej Bialecki     <><



Re: Strategic Direction of Nutch

Tomi N/A
In reply to this post by carmmello
2006/11/13, carmmello <[hidden email]>:
> Hi,
> Nutch, from version 0.8 is, really, very, very slow, using a single machine,
> to process data, after the crawling.  Compared with Nutch 0.7.2 I would say,
> ...
> this series.  I don`t believe that there are many Nutch users, in the real
> world of searching, with a farm of computers.  I, for myself, have already

Ditto, on both points.
Furthermore, I'd say I'm much more likely to deliver 10 single machine
nutch setups than a single system with 10 nodes. I believe the same
goes for a number of other users.

I had a look at the Hadoop code and, well, it'd take a week (probably
an optimistic estimate) just to get acquainted with selected points of
interest, leaving a lot unknown. And this is just to get started. At
the moment, I can't justify a possible high-risk, multi-week effort to
investigate where the bottleneck is and find a workable solution - I
can only imagine how this problem would look to someone without any
prior knowledge about distributed systems and/or indexing
technology...
...in the meantime, I suspect we might see something that seems much
more reasonable in the mid-term: a lot of useful code back-ported to
0.7.2, doing an excellent job on installations with one or a
handful of servers.

t.n.a.

Re: Strategic Direction of Nutch

Nutch Newbie
In reply to this post by Andrzej Białecki-2
Actually we are saying the same thing. Sorry, I was not really pointing
any fingers; apologies if it came across that way. I was just stating
why things didn't get solved: as you pointed out, active developers are
on large installs and not on small installs.

However, if the ambition of the project is to address medium-size
installs, then there has to be some effort from committers to make sure
not to introduce code that just benefits the big 1000-machine install
or the active developers. Correct? (Again, no pointing fingers :-)
Otherwise you are just forgetting the little guys and not giving them
the chance to develop and contribute.

I completely understand your view and I am aware of Hadoop work in progress.

Regards,
> >
> > The current direction of nutch development is geared towards large
> > install and its a great software.  However lets not pretend/preach
> > Nutch is good for small install, Nutch left that life when it embraced
> > Map/Reduce i.e. starting from 0.8.
>
> You need to take into account that this is the first official release of
> Nutch after a major brain surgery, so it's no wonder things are a little
> bit twitchy ;) There are in fact very few, if any, places in Nutch that
> still use the same data models and algorithms as they did in 0.7 era.
>
> Having said that, I just did a crawl of 1 mln pages within ~30 hours, on
> a single machine, which should give me a 100 mln collection within 2
> months. This speed is acceptable for me, even if it's slower than 0.7,
> and if one day I want to go beyond 100 mln pages I know that I will be
> able to do it - which _cannot_ be said about 0.7 ... So, you can look at
> it as a tradeoff.
>
> (BTW: the issue with slow reduce phase is well known, and people from
> the Hadoop project are working on it even as we speak).
>
> Oh, and regarding the subject of this thread - the strategic direction
> of Nutch is to provide a viable platform for medium to large scale
> search engines, be they Internet-wide or Intranet / constrained to a
> specific area. This was the original goal of the project, and it still
> reflects our ambitions. HOWEVER, if a significant part of active
> community is focused on small / embedded deployments, then you need to
> make your voice heard _and_ start contributing to the project so that it
> becomes a viable solution also to your needs.
>
> I hope this long answer helps you to understand why things are the way
> they are ... ;)
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
>

Re: Strategic Direction of Nutch

Nitin Borwankar


Hi all,

First an intro. I am another Nutch newbie and am finding 0.7.2 to be
quite an effective single machine crawler.
I am not new to Java or data or the Internet. I run an email list called
'tagdb' for people interested in db problems in creating folksonomy
applications, also a blog called tagschema ( http://tagschema.com )

For completely different reasons I am interested in MapReduce (outside
the Nutch context) so I am also interested in seeing how Hadoop evolves.
Personally I see a lot of value in retaining the 0.7.2 code base while
evolving 0.8 into the medium to high end space as a *separate* code
line.
The ability to keep db formats compatible would be nice to allow reuse
of existing results but is not necessary.

As a potential developer I would like to volunteer for the ongoing
maintenance and evolution of 0.7.2 as an effective single machine
crawler.
I understand that the current developer community is more interested in
moving MapReduce based architecture forward and as I said I am also
interested in that.
But it would be a shame if the just fine 0.7.2 code was orphaned and I
would like to step forward and put my money where my mouth is.
I don't know what it would take to maintain separate versions like the
Tomcat folks do but it seems there is a need.

Consider this a proposal to maintain two separate versions by continuing
bug fix versions of 0.7 until one of two things happens:

a) 0.8 evolves into something satisfactory for use also as a single
machine search engine and everyone is happy moving to it
b) a critical mass of developers steps forward to support the ongoing
development of 0.7.2 into, say, Nutch-lite, always and only meant for
single machine use.

Please feel free to shoot this down if I am "smoking rope", as a famous
newscaster says ....


Nitin Borwankar
http://tagschema.com




--
  Nitin Borwankar
  [hidden email]


Re: Strategic Direction of Nutch

Anthony May-2
In reply to this post by Anthony May-2
This is one of the options that I have suggested for our organisation to
adopt.

Anthony May
Web Developer
NZQA


Re: Strategic Direction of Nutch

Sami Siren-2
In reply to this post by Uroš Gruber-2
Uroš Gruber wrote:

>> How fast do you need to go?
>>
>> I did a 1 million page crawl today with trunk version of nutch patched
>> with NUTCH-395 [1]. total time for fetching was little over 7 hrs.
>>
> How is that even possible.
>
> I have a 3.2GHz Pentium with 2G of RAM. I had the same speed problem; because of
> that I set up nutch with a single node. About an hour ago the fetcher finished
> crawling 1.2 million pages. But this took

I am running on an AMD Athlon 64 3600+ with 1 GB of memory, so it's not even
"high end".
> while map job I have about 24 pages/s. I didn't test it with this patch.
> But then the reduce job was slow as hell. I really don't understand what took
> so long. It is almost twice as slow as the map job.

Please try the trunk version for comparison and check back for results.
(the patch is now applied to trunk)

There are also other things that count (even more?), please see [1]

> If I use local mode numbers are even worse.

my numbers are with local job runner.

> I can't imagine how long it would take to crawl, let's say, 10 million pages.
>
I'll let you know when mine is finished, just started 3rd segment of
size 1 million to test the trunk version (running with local job runner)

--
  Sami Siren


[1]http://www.mail-archive.com/nutch-user@.../msg06533.html
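
The throughput figures quoted in this thread can be put on a common footing with a little arithmetic. The sketch below is only illustrative: the helper names are made up, the inputs are the rough numbers the posters gave (1 million pages in a little over 7 hours, 1 million pages in ~30 hours, about 24 pages/s during the map job), and it ignores the reduce phase and any per-segment overhead.

```python
# Rough fetch-rate comparison using the approximate figures quoted in the thread.
# All helper names are hypothetical; nothing here comes from Nutch itself.

def pages_per_second(pages, hours):
    """Average fetch rate for a crawl of `pages` pages that took `hours` hours."""
    return pages / (hours * 3600.0)


def hours_to_crawl(pages, rate):
    """Hours needed to fetch `pages` pages at a constant `rate` (pages/s)."""
    return pages / rate / 3600.0


reported_rates = {
    "1M pages in ~7 h (trunk + NUTCH-395)": pages_per_second(1_000_000, 7),
    "1M pages in ~30 h (single machine)": pages_per_second(1_000_000, 30),
    "~24 pages/s during the map job": 24.0,
}

if __name__ == "__main__":
    for label, rate in reported_rates.items():
        projected = hours_to_crawl(10_000_000, rate)
        print(f"{label}: {rate:.1f} pages/s, 10M pages in about {projected:.0f} h")
```

These are fetch-only estimates at a constant rate, so a real crawl with a slow reduce phase would take longer than the projection suggests.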

Re: Strategic Direction of Nutch

Andrzej Białecki-2
In reply to this post by Nitin Borwankar
Nitin Borwankar wrote:
> Hi all,
>
> First an intro. I am another Nutch newbie and am finding 0.7.2 to be
> quite an effective single machine crawler.
>  
[..]
> The ability to keep db formats compatible would be nice to allow reuse
> of existing results but is not necessary.
>  


That's probably not going to happen - each branch has specific
requirements from the db and segment formats, which are incompatible.
However, given enough interest we could implement converters, even
bi-directional.


> As a potential developer I would like to volunteer for the ongoing
> maintenance and evolution of 0.7.2 as an effective single machine
> crawler.
>  

That's excellent! I imagine the procedure to get you involved would be
something like this:

* start collecting issues related to maintenance, bugfixes or
improvements of that branch,

* create JIRA issues, plus start collecting patches, tested and ready
for committing. One of the existing developers will commit them on your
behalf.

* after a while we would consider giving you committer rights so that
you could work directly with the code.


> Consider this a proposal to maintain two separate versions by continuing
> bug fix versions of 0.7  until one of two things happen
>
> a) 0.8 evolves to something satisfactory for use as also as a single
> machine search engine and everyone is happy moving to it
> b) a critical mass of developers steps forward to support the ongoing
> development of 0.7.2 into say Nutch-lite always and only meant for
> single machine use.
>  
I do hope that option a) becomes a reality sooner rather than later. But if there is sufficient interest (and enough developers) in developing 0.7 branch, then go for it - keeping in mind, though, that eventually these code bases will diverge so much that maintaining them will require two mostly separate teams ...

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Strategic Direction of Nutch

Piotr Kosiorowski
I agree with Andrzej. For my part, if someone takes on the effort of
preparing and testing patches, I as a committer (not a very active one
recently) may focus on 0.7.2 issues and commit the patches, and in
future prepare a 0.7.3 release.
Regards,
Piotr


Re: Strategic Direction of Nutch

carmmello
Unfortunately I am not a developer. But as a user of Nutch on a single
machine, and very happy with 0.7.2, I think this is good news.
There is also a feature I would like to see in nutch-default.xml:
"db.ignore.external.links". I just don't know how to do it, as the current
"db.max.outlinks.per.page", in my experience, doesn't give results as good
as the former, which I used in 0.8.1.
Thanks,
Carmmello
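
To illustrate why the two settings behave differently, here is a minimal sketch. It is not Nutch code: the function names are hypothetical, and it assumes only what the property descriptions say, namely that "db.ignore.external.links" drops outlinks pointing to other hosts while "db.max.outlinks.per.page" merely caps how many outlinks of a page are processed.

```python
from urllib.parse import urlparse

# Hypothetical helpers that mimic the effect of the two settings on one page's
# outlinks. They are an illustration only, not how Nutch implements them.

def ignore_external_links(page_url, outlinks):
    """Keep only outlinks on the same host as the page (db.ignore.external.links)."""
    host = urlparse(page_url).hostname
    return [link for link in outlinks if urlparse(link).hostname == host]


def cap_outlinks(outlinks, max_per_page):
    """Keep at most `max_per_page` outlinks, whatever their host (db.max.outlinks.per.page)."""
    return outlinks[:max_per_page]


if __name__ == "__main__":
    page = "http://example.org/index.html"
    links = [
        "http://example.org/a",
        "http://other.example.com/b",
        "http://example.org/c",
    ]
    print(ignore_external_links(page, links))  # external host is dropped
    print(cap_outlinks(links, 2))              # external host may still get through
```

A cap on outlinks limits how fast a crawl spreads but does not keep it on the original site, which may be why it gives poorer results for a single-site crawl.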


Re: Strategic Direction of Nutch

Nitin Borwankar
In reply to this post by Andrzej Białecki-2
Andrzej Bialecki wrote:

> Nitin Borwankar wrote:
>
>> Hi all,
>>
>> First an intro. I am another Nutch newbie and am finding 0.7.2 to be
>> quite an effective single machine crawler.  
>
> [..]
>
>> The ability to keep db formats compatible would be nice to allow reuse
>> of existing results but is not necessary.
>>  
>
>
>
> That's probably not going to happen - each branch has specific
> requirements from the db and segment formats, which are incompatible.
> However, given enough interest we could implement converters, even
> bi-directional.
>
>
>> As a potential developer I would like to volunteer for the ongoing
>> maintenance and evolution of 0.7.2 as an effective single machine
>> crawler.
>>  
>
>
> That's excellent! I imagine the procedure to get you involved would be
> something like this:
>
> * start collecting issues related to maintenance, bugfixes or
> improvements of that branch,

What is the mechanism for this collection process? Do we create a
separate email list, a separate alias ... or does everyone just send me
email (this may get messy fast)?

>
> * create JIRA issues, plus start collecting patches, tested and ready
> for committing. One of the existing developers will commit them on
> your behalf.
>
Sounds good.

> * after a while we would consider giving you committer rights so that
> you could work directly with the code.
>
Fair enough. Do we take this offline for further thrashing out, or
continue here?

Nitin Borwankar

>
>> Consider this a proposal to maintain two separate versions by continuing
>> bug fix versions of 0.7  until one of two things happen
>>
>> a) 0.8 evolves to something satisfactory for use as also as a single
>> machine search engine and everyone is happy moving to it
>> b) a critical mass of developers steps forward to support the ongoing
>> development of 0.7.2 into say Nutch-lite always and only meant for
>> single machine use.
>>  
>
> I do hope that option a) becomes a reality sooner rather than later.
> But if there is sufficient interest (and enough developers) in
> developing 0.7 branch, then go for it - keeping in mind, though, that
> eventually these code bases will diverge so much that maintaining them
> will require two mostly separate teams ...
>


Re: Strategic Direction of Nutch

Arun Sharma-3
Hi Nitin,

As I understand it, here are answers to the things you are asking
about:


On 11/16/06, Nitin Borwankar <[hidden email]> wrote:

>
> Andrzej Bialecki wrote:
>
> > Nitin Borwankar wrote:
> >
> >> Hi all,
> >>
> >> First an intro. I am another Nutch newbie and am finding 0.7.2 to be
> >> quite an effective single machine crawler.
> >
> > [..]
> >
> >> The ability to keep db formats compatible would be nice to allow reuse
> >> of existing results but is not necessary.
> >>
> >
> >
> >
> > That's probably not going to happen - each branch has specific
> > requirements from the db and segment formats, which are incompatible.
> > However, given enough interest we could implement converters, even
> > bi-directional.
> >
> >
> >> As a potential developer I would like to volunteer for the ongoing
> >> maintenance and evolution of 0.7.2 as an effective single machine
> >> crawler.
> >>
> >
> >
> > That's excellent! I imagine the procedure to get you involved would be
> > something like this:
> >
> > * start collecting issues related to maintenance, bugfixes or
> > improvements of that branch,
>
> what is the mechanism for this collection process - do we create a
> separate email list, a separate alias ... or everyone just sends me
> email ( this may get messy fast ).


Well, once you create your JIRA user, you can choose among a number of
projects you want to contribute to. You have to filter the request to show
all the issues (including bug fixes, improvements, patches, etc.), or else
you can browse the Nutch project to see its issues. There you can choose
version 0.7.2.


>
> > * create JIRA issues, plus start collecting patches, tested and ready
> > for committing. One of the existing developers will commit them on
> > your behalf.
> >
> sounds good.
>
> > * after a while we would consider giving you committer rights so that
> > you could work directly with the code.
> >
> fair enough.  Do we take this offline for further thrashing out ? Or
> continue here ?
>
> Nitin Borwankar
>
> >
> >> Consider this a proposal to maintain two separate versions by continuing
> >> bug fix versions of 0.7  until one of two things happen
> >>
> >> a) 0.8 evolves to something satisfactory for use as also as a single
> >> machine search engine and everyone is happy moving to it
> >> b) a critical mass of developers steps forward to support the ongoing
> >> development of 0.7.2 into say Nutch-lite always and only meant for
> >> single machine use.
> >>
> >
> > I do hope that option a) becomes a reality sooner rather than later.
> > But if there is sufficient interest (and enough developers) in
> > developing 0.7 branch, then go for it - keeping in mind, though, that
> > eventually these code bases will diverge so much that maintaining them
> > will require two mostly separate teams ...
> >
>
Well, I have also been working on Nutch 0.7.2 occasionally for an in-house
product for the last year. As Nutch 0.8.x is heading in a different
direction, we want to continue enhancing and upgrading the functionality of
0.7.2. I would like to offer my help in this regard and work with you to
improve it.

My JIRA id is "sharma_arun_se". I have created an issue for you.

Keep it up!!!