Parse-html should be enhanced!


Jack.Tang
Hi Nutchers

I think the parse-html plugin should be enhanced. In some of my
projects (an intranet search engine), we only need the content matched
by specified selectors and want to filter out the junk: say, the
content between <div class="start-here"> and </div>, or nodes matched
by selectors such as XPath expressions. Any thoughts on this enhancement?
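The kind of selector-based extraction proposed here can be sketched as follows; the class and method names are hypothetical, and a well-formed (XHTML) page is assumed, since the javax.xml DOM parser does not accept tag-soup HTML:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class DivExtractor {
    // Return the text content of the first <div> with the given class
    // attribute; everything else on the page is ignored.
    // (Naive: the class name is spliced straight into the XPath string.)
    public static String extract(String xhtml, String divClass) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes("UTF-8")));
        XPath xp = XPathFactory.newInstance().newXPath();
        return xp.evaluate("//div[@class='" + divClass + "']", doc).trim();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<div class=\"junk\">navigation junk</div>"
                + "<div class=\"start-here\">the real content</div>"
                + "</body></html>";
        System.out.println(extract(page, "start-here"));
    }
}
```

For real pages, the same XPath query would be run against the DOM that the HTML parser already builds, rather than re-parsing from a string.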

Regards
/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: Parse-html should be enhanced!

Michael Ji
Would an extension from an existing extension point be a solution?

Our ongoing project also needs to deal with site-specific crawling
cases. We are thinking about extending the current Java class to fit
our usage.

Michael Ji,




RE: Parse-html should be enhanced!

Fuad Efendi
In reply to this post by Jack.Tang
The existing parse-html plugin simply stores clean text (without HTML
tags) for future indexing. It stores, for instance, the content of a
huge <option> list, which we don't need at all in 99.99% of cases.

I found this Web-SQL idea very interesting:
http://www.lotontech.com
I bought the book Tony Loton, "Web Content Mining with Java"; 90% of it
is code I don't really need...
However, I am going to implement some kind of Web-SQL plus mathematical
statistics. Web sites usually share 90% of their HTML, and I need only
a subset.

Also, I need to find a point in Nutch where I can replace the Analyzer
with my own "non-analyzer"; I don't need to remove stop-words, etc.

I'd like to use Lucene as a database too... to run a lot of queries and
calculate some statistics...

-Fuad





Re: Parse-html should be enhanced!

Jack.Tang
Wow, Efendi, the features you mentioned sound cool.
Anyway, I hope Nutch will one day handle both DOM tree parsing and
(very high-level) information extraction well. My suggestion is to add
a layer between DOM tree parsing and indexing for information
extraction.
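Such a layer between DOM parsing and indexing could be sketched like this; the interface name and the trivial title extractor are hypothetical, not existing Nutch API:

```java
import java.io.ByteArrayInputStream;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical hook sitting between DOM-tree parsing and indexing:
// each extractor turns the parsed DOM into named fields to index.
interface InformationExtractor {
    Map<String, String> extract(Document dom);
}

public class TitleExtractor implements InformationExtractor {
    public Map<String, String> extract(Document dom) {
        Map<String, String> fields = new HashMap<>();
        NodeList titles = dom.getElementsByTagName("title");
        if (titles.getLength() > 0) {
            fields.put("title", titles.item(0).getTextContent());
        }
        return fields;
    }

    // Helper for the demo: parse a well-formed (XHTML) page string.
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes("UTF-8")));
    }

    public static void main(String[] args) throws Exception {
        Document dom = parse("<html><head><title>Nutch</title></head><body/></html>");
        System.out.println(new TitleExtractor().extract(dom).get("title"));
    }
}
```

The indexer would then add each returned field to the document before it reaches Lucene.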

Comments?

/Jack



--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

RE: Parse-html should be enhanced!

Fuad Efendi
In reply to this post by Jack.Tang
WebDataKit,
http://www.lotontech.com/wdbc.html
is free to download. It is a kind of SQL for HTML (even across
different web sites, with concatenation, etc.). Interesting... I need
to search specific places within HTML...


Some sites have a very good design with explicit meta tags... If you
are working on an intranet, it's the easiest solution:
<title>TOSHIBA TECRA S2 Pentium M 15.0&quot; nVIDIA GeForce Go 6600
NoteBook - Retail at Newegg.com</title>
<meta name="description" content="Buy TOSHIBA TECRA S2 Pentium M
15.0&quot; nVIDIA GeForce Go 6600 NoteBook - Retail Online" />
<meta name="keywords" content="Buy TOSHIBA TECRA S2 Pentium M 15.0&quot;
nVIDIA GeForce Go 6600 NoteBook - Retail Cheap" />
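Extracting those meta tags can be sketched with a simple, deliberately naive regex; the class and method names are illustrative, and real pages would need a proper HTML parser:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaExtractor {
    // Pull the content attribute of a named <meta> tag.
    // Naive: only handles the name="..." content="..." attribute order
    // shown in the examples above.
    public static String metaContent(String html, String name) {
        Pattern p = Pattern.compile(
                "<meta\\s+name=\"" + Pattern.quote(name) + "\"\\s+content=\"([^\"]*)\"",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<meta name=\"description\" content=\"Buy TOSHIBA TECRA S2 NoteBook\" />";
        System.out.println(metaContent(html, "description"));
    }
}
```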






RE: Parse-html should be enhanced!

Fuad Efendi
Sorry guys, on an intranet the easiest way is to use Lucene without
Nutch, and to index databases and file servers...
:)

>>>> If you are working on an intranet, it's the easiest solution:
<meta name="keywords" content="Buy TOSHIBA TECRA S2 Pentium M 15.0&quot;
nVIDIA GeForce Go 6600 NoteBook - Retail Cheap" />



RE: Parse-html should be enhanced!

Fuad Efendi
In reply to this post by Jack.Tang
Hi Jack,

I'd like to have more freedom with Nutch... We have two classes,
ParseText and ParseData, which are stored somewhere (I am a newbie!)
and then indexed by Lucene. ParseText contains the plain text (after
parsing by the existing parse-html plugin), and ParseData holds the
links found on a page, meta tags (not sure), etc.

org.apache.nutch.fetcher.Fetcher downloads something over HTTP, then
calls the parser plugin according to the Content-Type header
(text/html in our case).

I'd like to have more freedom to add more fields to the database before
indexing. Probably I can use ParseData.

I'd like a two-step indexing process: first index the HTML tags and
find the similarities (the usual header, footer, options, menu, (c),
etc.), then do a second parse and a second indexing pass that indexes
only the unique text.
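The first pass of that two-step idea, finding the text blocks shared across many pages, can be sketched with a document-frequency count; everything here is a hypothetical illustration, not Nutch code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class BoilerplateFilter {
    // Pass 1: count how many pages each text block appears on.
    // Pass 2: keep, for one page, only the blocks that are rare across
    // the site (i.e. the page-specific text), dropping shared headers,
    // footers, menus, etc.
    public static List<String> uniqueBlocks(List<List<String>> pages,
                                            List<String> page, int maxDocFreq) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> p : pages) {
            for (String block : new HashSet<>(p)) {  // count each block once per page
                docFreq.merge(block, 1, Integer::sum);
            }
        }
        List<String> unique = new ArrayList<>();
        for (String block : page) {
            if (docFreq.getOrDefault(block, 0) <= maxDocFreq) {
                unique.add(block);
            }
        }
        return unique;
    }
}
```

Only the blocks surviving pass 2 would be handed to the second indexing run.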

-Fuad





Re: Parse-html should be enhanced!

Jack.Tang
Hi Fuad

On 8/19/05, Fuad Efendi <[hidden email]> wrote:

> Hi Jack,
>
> I'd like to have more freedom with Nutch... We have two classes,
> ParseText and ParseData, which are stored somewhere (I am newbie!) and
> then indexed by Lucene. ParseText contains plain text (after parsing by
> existing parse-html plugin), and ParseData - links found on a page,
> metatags (not sure), etc.
>
> org.apache.nutch.fetcher.Fetch - this class downloads smth using HTTP,
> then calls plugin-parser accordingly to "Content" of HTTP header
> (text/html in our case)
>
> I'd like to have more freedom, to add more fields to database before
> indexing. Probably I can use ParseData.
I totally agree with you.
I'd like to store the extracted information in a new map, say an
ExtractedInfo class.



--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Nutch 6.1 running issue

Michael Ji
Hi Andrzej,

I have implemented the NUTCH-61 patch on top of Nutch 0.7:

http://issues.apache.org/jira/browse/NUTCH-61

It compiles successfully.

I created an empty DB and inserted URLs; so far so good.

But when I run the bin/nutch generate command on the segments, I get
the following error. Note the line numbers might be shifted a bit
because I changed the code according to the NUTCH-61 diff file.

"
050910 111335 Overall processing: Sorted 0 entries in 0.0 seconds.
050910 111335 Overall processing: Sorted NaN entries/second
Exception in thread "main" java.lang.NullPointerException
        at org.apache.nutch.tools.FetchListTool.emitFetchList(FetchListTool.java:488)
        at org.apache.nutch.tools.FetchListTool.emitFetchList(FetchListTool.java:319)
        at org.apache.nutch.tools.FetchListTool.main(FetchListTool.java:612)
"

I took a look at FetchListTool.java and focused on the following piece
of new NUTCH-61 code:

"
FetchListEntry value = new FetchListEntry();
Page page = (Page)value.getPage().clone();
"

It seems value is an empty FetchListEntry instance. Will getPage()
return null here and make the clone() call fail?

thanks,

Michael Ji




       
               

Re: Nutch 6.1 running issue

Andrzej Białecki-2
Michael Ji wrote:
> "
> FetchListEntry value = new FetchListEntry();
> Page page = (Page)value.getPage().clone();
> "
>
> Seems value is an empty FetchListEntry instance. Will
> that cause clone getPage failure coz it is NULL?

Please try to replace this logic with the following:

    FetchListEntry value = new FetchListEntry();
    while (topN > 0 && reader.next(key, value)) {
      Page page = value.getPage();
      if (page != null) {
        Page p = new Page();
        p.set(page);
        page = p;
      }
      if (forceRefetch) {
        Page p = value.getPage();
        // reset fetchTime and MD5, so that the content will
        // always be new and unique.
        p.setNextFetchTime(0L);
        p.setMD5(MD5Hash.digest(p.getURL().toString()));
      }
      tables.append(value);
      topN--;
    }


This patchset still needs a lot of thought and work. Even the part that
avoids re-fetching unmodified content needs additional thinking: it's
easy to end up in a state where Nutch cannot be forced to re-fetch a
page, because every time you try it remains unmodified, yet you need to
refetch the actual data because, e.g., you lost that segment data...

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Nutch 6.1 running issue

Michael Ji
Hi Andrzej,

I tried your new code.

1) Since "Page page = value.getPage();" is declared inside the while
loop, the page instance can't be accessed afterwards, which causes a
failure on the following lines ("page.setNextFetchTime..").

So I declared "Page page = value.getPage();" before the while loop.
Is that change OK with you?

2) "forceRefetch" is not visible in FetchListTool; I just replaced it
with "true" to get it to compile. Any suggestions?

Thanks,

Michael




Re: Nutch 6.1 running issue

Michael Ji
In reply to this post by Andrzej Białecki-2
Hi Andrzej,

Thanks for your correction. The patch compiles successfully and runs
well on Nutch 0.7.

Just a curious question. As stated in NUTCH-61:
"...if content is unmodified it doesn't have to be fetched and
processed..."

I tested refetching a page whose content was unmodified, and NUTCH-61
DID parse the page into content/, parse_data/, and parse_text/.

I took a look at the code. In Fetcher.java:

"
ProtocolOutput output = protocol.getProtocolOutput(fle);
ProtocolStatus pstat = output.getStatus();
...
switch ( pstat ) {
    ...
    case ProtocolStatus.NOTMODIFIED:
        handleFetch(fle, output);
        break;
    ...
}
"

Should we just do nothing in the NOTMODIFIED case, which is the status
set when content.MD5 equals page.MD5 in the http protocol plugin?

handleFetch() actually parses the page and writes the output data
structures to segments/.

Thanks,

Michael Ji,








Re: Nutch 6.1 running issue

Andrzej Białecki-2
Michael Ji wrote:

> hi Andrzej:
>
> Thanks for your correction. The patch is compiled
> successfully and running well in Nutch 07.
>
> Just a curious question:
>
> As stated in nutch 61:
> "...if content is unmodified it doesn't have to be
> fetched and processed..."
>
> And I did test for refetching a page without content
> modification and Nutch 6.1 DID parsing this page to
> content/, parse_data/, and parse_text/
>

Are you sure the plugin retrieved the page content again from the
server? I use "If-Modified-Since", which means that if the content is
unmodified the server should NOT send the page again, just a status 304.
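The conditional GET described here can be sketched with plain java.net; this illustrates the If-Modified-Since / 304 handshake in general, not the Nutch protocol plugin:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class ConditionalFetch {
    // Send If-Modified-Since with the time of the last fetch; a server
    // that honors it answers 304 Not Modified with no body to parse.
    // The URL and timestamp are the caller's; names are illustrative.
    public static boolean isUnmodified(String url, long lastFetchMillis)
            throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        conn.setIfModifiedSince(lastFetchMillis); // adds the header
        int code = conn.getResponseCode();
        conn.disconnect();
        return code == HttpURLConnection.HTTP_NOT_MODIFIED; // 304
    }
}
```

A fetcher would call this before downloading, and skip parsing entirely when it returns true.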

> I took look at code:
>
> In Fetcher.java,
> "
> ProtocolOutput output =
> protocol.getProtocolOutput(fle);
> ProtocolStatus pstat = output.getStatus();
> :
> switch ( pstat ) {
> :
> :
>     case ProtocolStatus.NOTMODIFIED:                
>          handleFetch(fle, output);
>     break;
> :
> :
> }
> "
>
> Should we just do nothing in case of NOTMODIFIED,
> which is the flag set when content.MD5 = page.MD5 in
> protocol.http.java?
>

We can't do nothing; we need to report the status. Even when we report
an error, an additional record is written to segments...

> The handleFetch() actually parsing and output data
> structure to segments/.

Yes, that's correct - this was a conscious decision. The reason is that
the server may return other interesting information in headers, which
some of the parsing plugins or FetchSchedule implementations may need.

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Nutch 6.1 running issue

Michael Ji
Hi Andrzej,

Thanks for your reply.

Yes, I saw the unmodified page content stored in parse_data/ and
parse_text/ within the newly fetched segments/. I even printed out the
newly fetched content's MD5 signature in http.java
(56eae3c2556cb10a00e7346738dcb318), and it matches the one associated
with the same URL in the fetchlist.
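A hex MD5 signature over raw content, the kind of value being compared above, can be computed with java.security; the class name here is illustrative (Nutch has its own MD5Hash class for this):

```java
import java.security.MessageDigest;

public class Md5Signature {
    // Digest the raw page bytes and render the 16-byte MD5 as lowercase
    // hex, like the 56eae3c2... signature printed from http.java.
    public static String md5Hex(byte[] content) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest(content)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Equal signatures for the old and new fetch are what trigger the NOTMODIFIED status discussed in this thread.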

Several concerns I have:

1) Where is the "If-Modified-Since" header set? I didn't see it in any
core code of the fetcher or the db.

2) I saw the logic go to "code == 200" in http.java, which is why I can
see the content MD5. Does that mean the server actually sent back the
content, i.e. the If-Modified-Since header was not honored? If it had
skipped the response, as you said, it should have returned 304.

3) While patching NUTCH-61 into Nutch 0.7, the old code didn't exactly
match your NUTCH-61 diff. For example, in Nutch 0.7's

src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java

I didn't see

"
get.setHttp11(false);
get.setMethodRetryHandler(null);
"

but these two lines are in your NUTCH-61 diff. Could that cause the
problem?

Thanks,

Michael Ji




               