crawl-urlfilter

crawl-urlfilter

adriano50

Hi,
Thank you for your hints, but I didn't give you the following information:

I modified the file crawl-urlfilter.txt as follows:
#start crawl-urlfilter
# skip file:, ftp:, & mailto: urls
-^(ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$

# skip URLs containing certain characters as probable queries, etc.
-[*!@]

# accept anything else
+.
#end crawl-urlfilter
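For reference, Nutch applies these patterns in order and the first match decides: a "+" rule accepts the URL, a "-" rule rejects it. Below is a minimal sketch of that matching logic (an illustration only, not Nutch's actual RegexURLFilter code; the class name and test URLs are made up). Note that "?" is not among the characters in [*!@], so a query URL such as MyServlet?menu=1 passes this filter:

```java
import java.util.regex.Pattern;

class UrlFilterSketch {
    // The rules from crawl-urlfilter.txt above, in order; "+" accepts, "-" rejects.
    private static final String[][] RULES = {
        {"-", "^(ftp|mailto):"},
        {"-", "\\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$"},
        {"-", "[*!@]"},
        {"+", "."},
    };

    // The first matching pattern decides, as in Nutch's regex URL filter.
    static boolean accepts(String url) {
        for (String[] rule : RULES) {
            if (Pattern.compile(rule[1]).matcher(url).find()) {
                return rule[0].equals("+");
            }
        }
        return false; // no rule matched: reject by default
    }

    public static void main(String[] args) {
        System.out.println(accepts("http://example.com/welcome.html"));     // true
        System.out.println(accepts("http://example.com/logo.gif"));         // false
        System.out.println(accepts("http://example.com/MyServlet?menu=1")); // true: '?' is not in [*!@]
    }
}
```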


I started Nutch with this command line:
bin/nutch crawl urls -dir /home/paul/nutch-searcher.dir -depth 3 -threads 8 >& crawl.log

In the file "urls" there is the URL of the following page:

<HTML>

<HEAD>
<TITLE>  TitleOfSite </TITLE>
</HEAD>

<FRAMESET ROWS="14%, *">

<FRAME NORESIZE NAME="MENU" SRC="MyServlet?menu=1" SCROLLING="AUTO">

<FRAME NAME="PAGE"  SRC="../welcome.html" SCROLLING="AUTO">

</FRAMESET>

</HTML>


Nutch crawls and fetches "welcome.html", but doesn't work with MyServlet?menu=1.
The servlet "MyServlet?menu=1" shows some links, but according to the log Nutch
doesn't fetch any of those links.
I hope the question is clear, and I am looking forward to receiving your answer.

                                         Adriano
Please help me!




indexing is very very very slow

Gal Nitzan
Hi,

I am crawling the web...

my machine:
CPU: 2 x Xeon 2.8 GHz
RAM: 2 GB
HD: RAID, 2 x 160 GB

After fetching (I stopped the fetcher after it had halted, fetching nothing,
for a few hours) I did the following:

1. s1=`ls -d index/segments/2* | tail -1`

2. bin/nutch updatedb index/db/ $s1
    the following are the last few lines of the updatedb output:

--------------------------------------------------------------------
050916 135308 Processing document 127000
050916 135316 Processing document 128000
050916 135317 Unexpected EOF in: index/segments/20050916014401/fetcher
at entry #128116.  Ignoring.
050916 135317 Finishing update
050916 135456 Processing pagesByURL: Sorted 3083939 instructions in
99.536 seconds.
050916 135456 Processing pagesByURL: Sorted 30983.15182446552
instructions/second
050916 135559 Processing pagesByURL: Merged to new DB containing 774610
records in 35.355 seconds
050916 135559 Processing pagesByURL: Merged 21909.489464007922
records/second
050916 135611 Processing pagesByMD5: Sorted 803182 instructions in
11.654 seconds.
050916 135611 Processing pagesByMD5: Sorted 68918.99776900635
instructions/second
050916 135627 Processing pagesByMD5: Merged to new DB containing 774610
records in 14.216 seconds
050916 135627 Processing pagesByMD5: Merged 54488.604389420376
records/second
050916 135633 Processing linksByMD5: Sorted 689997 instructions in 6.038
seconds.
050916 135633 Processing linksByMD5: Sorted 114275.75356078171
instructions/second
050916 135648 Processing linksByMD5: Merged to new DB containing 776849
records in 13.624 seconds
050916 135648 Processing linksByMD5: Merged 57020.62536699941 records/second
050916 135655 Processing linksByURL: Sorted 584963 instructions in 7.056
seconds.
050916 135655 Processing linksByURL: Sorted 82902.91950113379
instructions/second
050916 135711 Processing linksByURL: Merged to new DB containing 776849
records in 14.533 seconds
050916 135711 Processing linksByURL: Merged 53454.13885639579 records/second
050916 135718 Processing linksByMD5: Sorted 671867 instructions in 6.732
seconds.
050916 135718 Processing linksByMD5: Sorted 99801.99049316696
instructions/second
050916 135729 Processing linksByMD5: Merged to new DB containing 776849
records in 9.999 seconds
050916 135729 Processing linksByMD5: Merged 77692.66926692669 records/second
050916 135744 Update finished
--------------------------------------------------------------------

As you can see, the updatedb went fine even though it encountered the point
where the fetcher was stopped.

3. bin/nutch mergesegs -dir index/segments/ -i -ds

From here on is the problem:

--------------------------------------------------------------------
050916 141720 parsing file:/nutch/conf/nutch-default.xml
050916 141720 parsing file:/nutch/conf/nutch-site.xml
050916 141720 No FS indicated, using default:local
050916 141720 * Opening 2 segments:
050916 141720  - segment 20050916013342: 42287 records.
050916 141721  - data in segment index/segments/20050916014401 is
corrupt, using only 128115 entries.
050916 141722  - segment 20050916014401: 128116 records.
050916 141722 * TOTAL 170403 input records in 2 segments.
050916 141722 * Creating master index...
050916 141737  Processed 20000 records (1311.9916 rec/s)
050916 141751  Processed 40000 records (1394.0197 rec/s)
050916 154424  Processed 60000 records (3.851173 rec/s)
--------------------------------------------------------------------
As you can see in the last line, the indexer processes 3.8 records per
second, which means it will take far too long.

Anybody got a clue or a hint, please?

Regards,

Gal


Re: indexing is very very very slow

Doug Cutting-2
The default for indexer.maxMergeDocs was mistakenly set to 50, which can
make indexing really slow.  Try putting the following in your
nutch-site.xml:

<property>
   <name>indexer.maxMergeDocs</name>
   <value>2147483647</value>
</property>
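As a rough illustration of why the bad default hurts (a back-of-the-envelope sketch, using the 170,403-record total reported earlier in this thread; the class name is made up): maxMergeDocs caps how many documents a merged Lucene segment may hold, so a cap of 50 leaves the index scattered across thousands of tiny segments:

```java
class MergeDocsSketch {
    // Lower bound on the number of index segments when each merged
    // segment may hold at most maxMergeDocs documents (ceiling division).
    static long minSegments(long docs, long maxMergeDocs) {
        return (docs + maxMergeDocs - 1) / maxMergeDocs;
    }

    public static void main(String[] args) {
        long docs = 170403; // total input records reported in this thread
        System.out.println("maxMergeDocs=50 -> at least "
            + minSegments(docs, 50) + " segments");          // 3409
        System.out.println("maxMergeDocs=2147483647 -> "
            + minSegments(docs, Integer.MAX_VALUE) + " segment"); // 1
    }
}
```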

Does that help?

I just fixed this in trunk.  We should fix this in the 0.7 release branch.

Doug

Re: indexing is very very very slow

Gal Nitzan
Hi Doug,

Thank you for the prompt reply.

Well, things got much, much faster (I guess about 40% faster), but it
seems that something got really corrupted: everything gets stuck after
40K records.

[root@kunzon nutch]# bin/nutch mergesegs -dir index/segments/ -i -ds
050917 043331 parsing file:/nutch/conf/nutch-default.xml
050917 043331 parsing file:/nutch/conf/nutch-site.xml
050917 043331 No FS indicated, using default:local
050917 043331 * Opening 2 segments:
050917 043332  - segment 20050916013342: 42287 records.
050917 043332  - data in segment index/segments/20050916014401 is
corrupt, using only 128115 entries.
050917 043332  - segment 20050916014401: 128116 records.
050917 043332 * TOTAL 170403 input records in 2 segments.
050917 043332 * Creating master index...
050917 043345  Processed 20000 records (1613.5538 rec/s)
050917 043354  Processed 40000 records (2113.9414 rec/s)

And that is it. I notice memory is still being consumed, but there is no
apparent activity.

Since I'm really a newbie to Nutch, could you give me a tip on how to
rescue the already fetched data and remove the corruption from the
segment? I already tried -fix, but it didn't help.

Regards,

Gal


Doug Cutting wrote:

> The default for indexer.maxMergeDocs was mistakenly set to 50, which
> can make indexing really slow.  Try putting the following in your
> nutch-site.xml:
>
> <property>
>   <name>indexer.maxMergeDocs</name>
>   <value>2147483647</value>
> </property>
>
> Does that help?
>
> I just fixed this in trunk.  We should fix this in the 0.7 release
> branch.
>
> Doug
>
> .
>


Re: indexing is very very very slow

em-13
'segslice' all partial segments into new ones prior to merging.

Gal Nitzan wrote:

> Hi Doug,
>
> Thank you for the prompt reply.
>
> Well things got much much faster (i guess about 40% faster), but it
> seems that something got really corrupted. Everything gets stuck after
> 40K records.

 >050917 043332  - data in segment index/segments/20050916014401 is
corrupt, using only 128115 entries.

Re: indexing is very very very slow

Gal Nitzan
In reply to this post by Doug Cutting-2
Doug,

Should the indexer.maxMergeDocs be set to the same value : 2147483647  ?

Thanks,

Gal

Doug Cutting wrote:

> The default for indexer.maxMergeDocs was mistakenly set to 50, which
> can make indexing really slow.  Try putting the following in your
> nutch-site.xml:
>
> <property>
>   <name>indexer.maxMergeDocs</name>
>   <value>2147483647</value>
> </property>
>
> Does that help?
>
> I just fixed this in trunk.  We should fix this in the 0.7 release
> branch.
>
> Doug
>
> .
>


Re: indexing is very very very slow

Gal Nitzan
In reply to this post by em-13
Hi EM,

After sending that email I looked at segslice, and it worked perfectly!

Thanks,

Gal

EM wrote:

> 'segslice' all partial segments into new ones prior merging.
>
> Gal Nitzan wrote:
>
>> Hi Doug,
>>
>> Thank you for the prompt reply.
>>
>> Well things got much much faster (i guess about 40% faster), but it
>> seems that something got really corrupted. Everything gets stuck
>> after 40K records.
>
> >050917 043332  - data in segment index/segments/20050916014401 is
> corrupt, using only 128115 entries.
>
> .
>


Re: indexing is very very very slow

Gal Nitzan
In reply to this post by em-13
Hi,

Well I still get a very slow mergesegs:

[root@kunzon nutch]# tail -f nutch-mergesegs-kunzon.com.log
050919 171351  Processed 120000 records (1146.5918 rec/s)
050919 171408  Processed 140000 records (1158.2788 rec/s)
050919 171428  Processed 160000 records (1019.8358 rec/s)
050919 171451  Processed 180000 records (879.2368 rec/s)
050919 171510  Processed 200000 records (1054.9636 rec/s)
050919 171528  Processed 220000 records (1069.2328 rec/s)
050919 171547  Processed 240000 records (1099.868 rec/s)
050919 171832  - creating next subindex...
050919 174512  Processed 260000 records (11.328647 rec/s)
050919 200315  Processed 280000 records (2.4145627 rec/s)

It has fallen to 2.4 records per second...

Can somebody help me, please? 400K records is only the beginning; what
will happen when it is 4M?

Regards,

Gal

EM wrote:

> 'segslice' all partial segments into new ones prior merging.
>
> Gal Nitzan wrote:
>
>> Hi Doug,
>>
>> Thank you for the prompt reply.
>>
>> Well things got much much faster (i guess about 40% faster), but it
>> seems that something got really corrupted. Everything gets stuck
>> after 40K records.
>
> >050917 043332  - data in segment index/segments/20050916014401 is
> corrupt, using only 128115 entries.
>
> .
>


Re: indexing is very very very slow

Andrzej Białecki-2
Gal Nitzan wrote:

> Hi,
>
> Well I still get a very slow mergesegs:
>
> [root@kunzon nutch]# tail -f nutch-mergesegs-kunzon.com.log
> 050919 171351  Processed 120000 records (1146.5918 rec/s)
> 050919 171408  Processed 140000 records (1158.2788 rec/s)
> 050919 171428  Processed 160000 records (1019.8358 rec/s)
> 050919 171451  Processed 180000 records (879.2368 rec/s)
> 050919 171510  Processed 200000 records (1054.9636 rec/s)
> 050919 171528  Processed 220000 records (1069.2328 rec/s)
> 050919 171547  Processed 240000 records (1099.868 rec/s)
> 050919 171832  - creating next subindex...
> 050919 174512  Processed 260000 records (11.328647 rec/s)
> 050919 200315  Processed 280000 records (2.4145627 rec/s)
>
> It is falling to 2.4 res per second ...
>
> Can somebody help me please. 400K records is only the beginning what
> will happen when it is 4M?

>> >050917 043332  - data in segment index/segments/20050916014401 is
>> corrupt, using only 128115 entries.

This is the real reason for the slowdown. Technically speaking, a
partially corrupted MapFile is readable and usable. However, random
access is orders of magnitude slower...

The fix is simple: delete the "index" files in each subdirectory of the
20050916014401 segment. Then run "nutch segread -fix 20050916014401".
Then re-run mergesegs - it will now work at full speed.

NB. if there are any more segments which give you this warning, do the
same before you run mergesegs.

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Proposal: refuse to open partially trunc. MapFile, unless forced (Re: indexing is very very very slow)

Andrzej Białecki-2
In reply to this post by Gal Nitzan
Hi all,

> Well I still get a very slow mergesegs:

>>
>> >050917 043332  - data in segment index/segments/20050916014401 is
>> corrupt, using only 128115 entries.

This is a common and recurring problem. What's worse is that an unfixed
segment like this will destroy the performance of the search, too, not
just the backend pre-processing.

I propose to modify MapFile.Reader so that it refuses to open such file,
and throws an Exception, unless a force=true flag is given. Tools that
want to ignore this can do so, but all other tools will be able to make
a conscious decision whether to fix it first, or to use it as such.

If there are no objections, I will change it in the trunk/ in a couple
of days.
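For concreteness, here is a hypothetical sketch of such a guard (the class and exception names are invented for illustration, the entry counts are taken from the log above, and the exception is unchecked here for brevity; this is not the actual MapFile.Reader code):

```java
// Invented exception type for this sketch; the real proposal would
// likely use a checked IOException.
class TruncatedFileException extends RuntimeException {
    TruncatedFileException(String msg) { super(msg); }
}

class GuardedReaderSketch {
    private final long usableEntries;

    // Refuse to open a partially truncated file unless force == true.
    GuardedReaderSketch(long totalEntries, long usableEntries, boolean force) {
        if (usableEntries < totalEntries && !force) {
            throw new TruncatedFileException(
                "only " + usableEntries + " of " + totalEntries
                + " entries are usable; fix the file or open with force=true");
        }
        this.usableEntries = usableEntries;
    }

    long usableEntries() { return usableEntries; }

    public static void main(String[] args) {
        try {
            // Default behaviour: opening the corrupt segment fails fast.
            new GuardedReaderSketch(128116, 128115, false);
        } catch (TruncatedFileException e) {
            System.out.println("refused: " + e.getMessage());
        }
        // A tool that knowingly tolerates truncation passes force=true.
        GuardedReaderSketch r = new GuardedReaderSketch(128116, 128115, true);
        System.out.println("forced open, usable entries: " + r.usableEntries());
    }
}
```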

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: indexing is very very very slow

Gal Nitzan
In reply to this post by Andrzej Białecki-2
Hi Andrzej,

Thank you for your reply.

I have tried twice but the segment is not being fixed:

[root@kunzon nutch]# find index/segments/20050919092227/ -name index -print
index/segments/20050919092227/fetcher/index
index/segments/20050919092227/parse_text/index
index/segments/20050919092227/content/index
index/segments/20050919092227/parse_data/index
[root@kunzon nutch]# rm -rf  index/segments/20050919092227/fetcher/index
[root@kunzon nutch]# rm -rf index/segments/20050919092227/parse_text/index
[root@kunzon nutch]# rm -rf index/segments/20050919092227/content/index
[root@kunzon nutch]# rm -rf index/segments/20050919092227/parse_data/index
[root@kunzon nutch]# bin/nutch segread index/segments/20050919092227 -fix
050920 031844 parsing file:/nutch/conf/nutch-default.xml
050920 031844 parsing file:/nutch/conf/nutch-site.xml
050920 031845 No FS indicated, using default:local
050920 031849  - fixed fetcher
050920 031932  - fixed content
050920 031952  - fixed parse_data
050920 032006  - fixed parse_text
050920 032006 Finished fixing 20050919092227
050920 032006  - data in segment index/segments/20050919092227 is
corrupt, using only 91212 entries.

Thanks,

Gal

Andrzej Bialecki wrote:

> Gal Nitzan wrote:
>> Hi,
>>
>> Well I still get a very slow mergesegs:
>>
>> [root@kunzon nutch]# tail -f nutch-mergesegs-kunzon.com.log
>> 050919 171351  Processed 120000 records (1146.5918 rec/s)
>> 050919 171408  Processed 140000 records (1158.2788 rec/s)
>> 050919 171428  Processed 160000 records (1019.8358 rec/s)
>> 050919 171451  Processed 180000 records (879.2368 rec/s)
>> 050919 171510  Processed 200000 records (1054.9636 rec/s)
>> 050919 171528  Processed 220000 records (1069.2328 rec/s)
>> 050919 171547  Processed 240000 records (1099.868 rec/s)
>> 050919 171832  - creating next subindex...
>> 050919 174512  Processed 260000 records (11.328647 rec/s)
>> 050919 200315  Processed 280000 records (2.4145627 rec/s)
>>
>> It is falling to 2.4 res per second ...
>>
>> Can somebody help me please. 400K records is only the beginning what
>> will happen when it is 4M?
>
>>> >050917 043332  - data in segment index/segments/20050916014401 is
>>> corrupt, using only 128115 entries.
>
> This is the real reason for the slowdown. Technically speaking, a
> partially corrupted MapFile is readable and usable. However, random
> access is orders of magnitude slower...
>
> The fix is simple: delete the "index" files in each subdirectory of
> the 20050916014401 segment. Then run "nutch segread -fix
> 20050916014401". Then re-run mergesegs - it will now work at full speed.
>
> NB. if there are any more segments which give you this warning, do the
> same before you run mergesegs.
>


Re: indexing is very very very slow

em-13
In reply to this post by Andrzej Białecki-2
Andrzej Bialecki wrote:

>
> The fix is simple: delete the "index" files in each subdirectory of
> the 20050916014401 segment. Then run "nutch segread -fix
> 20050916014401". Then re-run mergesegs - it will now work at full speed.
>
> NB. if there are any more segments which give you this warning, do the
> same before you run mergesegs.
>
"segread -fix" doesn't work; "-segslice" solves this, though.



Re: indexing is very very very slow

Gal Nitzan
EM wrote:

> Andrzej Bialecki wrote:
>
>>
>> The fix is simple: delete the "index" files in each subdirectory of
>> the 20050916014401 segment. Then run "nutch segread -fix
>> 20050916014401". Then re-run mergesegs - it will now work at full speed.
>>
>> NB. if there are any more segments which give you this warning, do
>> the same before you run mergesegs.
>>
> "segread -fix" doesn't work, "-segslice" solves this though.
>
>
>
> .
>
Yes, segslice solved it.

Thanks.

Re: Proposal: refuse to open partially trunc. MapFile, unless forced (Re: indexing is very very very slow)

Gal Nitzan
In reply to this post by Andrzej Białecki-2
Andrzej Bialecki wrote:

> Hi all,
>
>> Well I still get a very slow mergesegs:
>
>>>
>>> >050917 043332  - data in segment index/segments/20050916014401 is
>>> corrupt, using only 128115 entries.
>
> This is a common and recurring problem. What's worse is that an
> unfixed segment like this will destroy the performance of the search,
> too, not just the backend pre-processing.
>
> I propose to modify MapFile.Reader so that it refuses to open such
> file, and throws an Exception, unless a force=true flag is given.
> Tools that want to ignore this can do so, but all other tools will be
> able to make a conscious decision whether to fix it first, or to use
> it as such.
>
> If there are no objections, I will change it in the trunk/ in a couple
> of days.
>
Hi,

I think it would be very confusing to old and new users alike to throw
an exception when the segment corruption is actually trivial and can be
fixed easily (now that I know how to do that :-)...

Instead, I would like to suggest building a FAQ for Nutch.

I would like to volunteer to build at least the skeleton for it.

As a new user of Nutch I have run into so many problems, and apart from
this list there was not much information elsewhere. So I have all the
answers fresh in my mind, and with some help from the rest of the
nutch-users it can be done without too much of a hassle.

Besides, many people on this list contribute in their free time; I would
be happy to contribute to the success of this project.

Regards,

Gal





regarding gal's faq proposal

gekkokid
Is there a place where we can search the mailing list? That could be a
short-term solution.

_gk
----- Original Message -----
From: "Gal Nitzan" <[hidden email]>
To: <[hidden email]>
Sent: Monday, September 19, 2005 11:37 PM
Subject: Re: Proposal: refuse to open partially trunc. MapFile, unless
forced (Re: indexing is very very very slow)


> Andrzej Bialecki wrote:
>> Hi all,
>>
>>> Well I still get a very slow mergesegs:
>>
>>>>
>>>> >050917 043332  - data in segment index/segments/20050916014401 is
>>>> corrupt, using only 128115 entries.
>>
>> This is a common and recurring problem. What's worse is that an unfixed
>> segment like this will destroy the performance of the search, too, not
>> just the backend pre-processing.
>>
>> I propose to modify MapFile.Reader so that it refuses to open such file,
>> and throws an Exception, unless a force=true flag is given. Tools that
>> want to ignore this can do so, but all other tools will be able to make a
>> conscious decision whether to fix it first, or to use it as such.
>>
>> If there are no objections, I will change it in the trunk/ in a couple of
>> days.
>>
> Hi,
>
> I think it would be very confusing to old users as well as new users.
> Throwing an exception when actually  a segment corruption is trivial and
> can be fixed easily (now that I know how to do that :-)...
>
> Instead I would like to suggest building a FAQ for Nutch.
>
> I would like to propose myself  to build at least the skeleton for it.
>
> As a new user to Nutch I have run to so many problems and except this list
> there was not much information elsewhere. So, I have all the answers fresh
> in my mind and with some help from the rest of the nutch-users it can be
> done without too much of a hustle.
>
> Besides, many people on this list contribute on their free time, I would
> be happy to contribute to the success of this  project.
>
> Regards,
>
> Gal
>
>
>
>
>



Re: Proposal: refuse to open partially trunc. MapFile, unless forced (Re: indexing is very very very slow)

Andrzej Białecki-2
In reply to this post by Gal Nitzan
Gal Nitzan wrote:

> Andrzej Bialecki wrote:
>
>> Hi all,
>>
>>> Well I still get a very slow mergesegs:
>>
>>
>>>>
>>>> >050917 043332  - data in segment index/segments/20050916014401 is
>>>> corrupt, using only 128115 entries.
>>
>>
>> This is a common and recurring problem. What's worse is that an
>> unfixed segment like this will destroy the performance of the search,
>> too, not just the backend pre-processing.
>>
>> I propose to modify MapFile.Reader so that it refuses to open such
>> file, and throws an Exception, unless a force=true flag is given.
>> Tools that want to ignore this can do so, but all other tools will be
>> able to make a conscious decision whether to fix it first, or to use
>> it as such.
>>
>> If there are no objections, I will change it in the trunk/ in a couple
>> of days.
>>
> Hi,
>
> I think it would be very confusing to old users as well as new users.
> Throwing an exception when actually  a segment corruption is trivial and
> can be fixed easily (now that I know how to do that :-)...

You missed my point - I proposed that we change the API. On the surface,
command-line tools would behave like now, with the benefit that segment
corruption would be fixed automatically by those tools that require
clean segments - unless _prevented_ by a cmd-line switch. So, this is
just to improve the default behaviour, and not to complain even louder
than now.

>
> Instead I would like to suggest building a FAQ for Nutch.
>
> I would like to propose myself  to build at least the skeleton for it.
>
> As a new user to Nutch I have run to so many problems and except this
> list there was not much information elsewhere. So, I have all the
> answers fresh in my mind and with some help from the rest of the
> nutch-users it can be done without too much of a hustle.
>
> Besides, many people on this list contribute on their free time, I would
> be happy to contribute to the success of this  project.

This is always welcome, and there is already a place where we collect
such info. Please see the Nutch Wiki, and feel free to enhance or add
new content there.

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Proposal: refuse to open partially trunc. MapFile, unless forced (Re: indexing is very very very slow)

Gal Nitzan
Andrzej Bialecki wrote:

> Gal Nitzan wrote:
>> Andrzej Bialecki wrote:
>>
>>> Hi all,
>>>
>>>> Well I still get a very slow mergesegs:
>>>
>>>
>>>>>
>>>>> >050917 043332  - data in segment index/segments/20050916014401 is
>>>>> corrupt, using only 128115 entries.
>>>
>>>
>>> This is a common and recurring problem. What's worse is that an
>>> unfixed segment like this will destroy the performance of the
>>> search, too, not just the backend pre-processing.
>>>
>>> I propose to modify MapFile.Reader so that it refuses to open such
>>> file, and throws an Exception, unless a force=true flag is given.
>>> Tools that want to ignore this can do so, but all other tools will
>>> be able to make a conscious decision whether to fix it first, or to
>>> use it as such.
>>>
>>> If there are no objections, I will change it in the trunk/ in a
>>> couple of days.
>>>
>> Hi,
>>
>> I think it would be very confusing to old users as well as new users.
>> Throwing an exception when actually  a segment corruption is trivial
>> and can be fixed easily (now that I know how to do that :-)...
>
> You missed my point - I proposed that we change the API. On the
> surface, command-line tools would behave like now, with the benefit
> that segment corruption would be fixed automatically by those tools
> that require clean segments - unless _prevented_ by a cmd-line switch.
> So, this is just to improve the default behaviour, and not to complain
> even louder than now.
>
>>
>> Instead I would like to suggest building a FAQ for Nutch.
>>
>> I would like to propose myself  to build at least the skeleton for it.
>>
>> As a new user to Nutch I have run to so many problems and except this
>> list there was not much information elsewhere. So, I have all the
>> answers fresh in my mind and with some help from the rest of the
>> nutch-users it can be done without too much of a hustle.
>>
>> Besides, many people on this list contribute on their free time, I
>> would be happy to contribute to the success of this  project.
>
> This is always welcome, and there is already a place where we collect
> such info. Please see the Nutch Wiki, and feel free to enhance or add
> new content there.
>
You are right, I did miss your point. And now that I understand :-) I
think it is a very good idea.

Yes, I found the FAQ hiding in the wiki, and I have started working on it.

Gal

Re: Proposal: refuse to open partially trunc. MapFile, unless forced (Re: indexing is very very very slow)

Matthias Jaekle
In reply to this post by Andrzej Białecki-2
> You missed my point - I proposed that we change the API. On the surface,
> command-line tools would behave like now, with the benefit that segment
> corruption would be fixed automatically by those tools that require
> clean segments - unless _prevented_ by a cmd-line switch. So, this is
> just to improve the default behaviour, and not to complain even louder
> than now.
That would be great!
Matthias

JDK 1.5

Gal Nitzan
In reply to this post by Gal Nitzan
Hi,

I have tried running Nutch with JDK 1.5 and got very weird results: the
fetcher hangs and the merge hangs.

After that I switched to 1.4 and all went well.

Is it just a matter of a rebuild?

Regards,

Gal

Is it possible to change the list of common words without crawling everything again

Gal Nitzan
In reply to this post by Gal Nitzan
This question was in the FAQ, unanswered.

Can someone answer it, please? I shall put the answer in the FAQ.

Regards,

Gal