ParseFilter and IndexingFilter


ParseFilter and IndexingFilter

Michael Chen
Hi,

Does anyone know how multiple ParseFilters and IndexingFilters work
together? For example, does the first parse affect the second, and does
one index operation affect the next? The factories create multiple
filters in the first place, but I couldn't find a definitive answer in
the docs. It would be great if someone could help answer this question.
Thanks in advance.

Best regards,

Michael



RE: ParseFilter and IndexingFilter

Markus Jelsma-2
Hi,

A ParseFilter can add metadata to parsed records. An IndexingFilter can then read that metadata and process it before the fields are indexed.

If you just want to index the values added by the ParseFilter, you can use the index-metadata plugin to index them directly. Only use an IndexingFilter if you need additional logic.
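
The split described above can be sketched in plain Java. This is only an illustration under stated assumptions, not Nutch code: `ParsedRecord`, `parseFilter`, `copyMetadata`, and `indexingFilter` are hypothetical names, and the real Nutch interfaces have different signatures. It shows a parse stage that only adds metadata, a verbatim copy of selected keys (roughly the role index-metadata plays), and an indexing stage that applies extra logic.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ParseThenIndexSketch {

    // Hypothetical stand-in for a parsed record: page text plus a metadata map.
    static final class ParsedRecord {
        final String text;
        final Map<String, String> metadata = new LinkedHashMap<>();

        ParsedRecord(String text) {
            this.text = text;
        }
    }

    // Parse stage (hypothetical): adds a metadata entry and leaves the text untouched.
    static void parseFilter(ParsedRecord record) {
        record.metadata.put("price", " 19.90 EUR ");
    }

    // No extra logic: copy the named metadata keys verbatim into the index document.
    static Map<String, String> copyMetadata(ParsedRecord record, String... keys) {
        Map<String, String> doc = new LinkedHashMap<>();
        for (String key : keys) {
            String value = record.metadata.get(key);
            if (value != null) {
                doc.put(key, value);
            }
        }
        return doc;
    }

    // Extra logic: split the raw value into separate, cleaned-up index fields.
    static Map<String, String> indexingFilter(ParsedRecord record, Map<String, String> doc) {
        String raw = record.metadata.get("price");
        if (raw != null) {
            String[] parts = raw.trim().split("\\s+");
            doc.put("price_amount", parts[0]);
            if (parts.length > 1) {
                doc.put("price_currency", parts[1]);
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        ParsedRecord record = new ParsedRecord("some page text");
        parseFilter(record);
        System.out.println(copyMetadata(record, "price"));
        System.out.println(indexingFilter(record, new LinkedHashMap<>()));
    }
}
```

In this sketch the copy path is enough when the stored value is already what should be indexed; the indexing stage only earns its place when the value needs to be reshaped first.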

Regards,
Markus

 
 
-----Original message-----

> From:Michael Chen <[hidden email]>
> Sent: Wednesday 2nd August 2017 20:58
> To: [hidden email]
> Subject: ParseFilter and IndexingFilter

Re: ParseFilter and IndexingFilter

Michael Chen
In reply to this post by Michael Chen

Hi Markus,

Thanks for the quick response! Please let me know at any point if I
should just read some part of the code. Judging from the data stored in
HBase (with Nutch 2.x), I'm guessing that "parse" changed the "Document"
(in my case, it cleaned up the HTML tags in "content").

Do you mean that parse only adds metadata somewhere, which then waits
for indexing filters to index it into HBase? Maybe I'm not understanding
"indexing" correctly.

I'm trying to use the new jsoup-extractor to parse (and index) certain
fields with CSS selectors. I also want to keep the indexing done by
index-basic and index-anchor, and preferably the raw HTML/data as well.
Am I on the right track?

Thank you!

Michael



RE: ParseFilter and IndexingFilter

Markus Jelsma-2
In reply to this post by Michael Chen
You only need an IndexingFilter if you didn't do the logic in the ParseFilter, or if you want to do something with metadata added by two or more different ParseFilters.

You can use multiple Indexing- or ParseFilters, not a problem.
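
A rough way to picture several filters working together is the plain-Java sketch below. It is not Nutch code: `runChain` and the lambdas are hypothetical, and it assumes the filters run one after another on the same record, so a later filter can see what an earlier one wrote. The last step stands in for an IndexingFilter that combines metadata added by two different parse filters.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

public class FilterChainSketch {

    // Run the filters in order; each one receives the map returned by the previous one.
    static Map<String, String> runChain(Map<String, String> input,
                                        List<UnaryOperator<Map<String, String>>> filters) {
        Map<String, String> current = input;
        for (UnaryOperator<Map<String, String>> filter : filters) {
            current = filter.apply(current);
            if (current == null) {
                return null; // in this sketch, a filter rejects the record by returning null
            }
        }
        return current;
    }

    static Map<String, String> example() {
        // Hypothetical parse filters, each adding its own metadata key.
        List<UnaryOperator<Map<String, String>>> parseFilters = new ArrayList<>();
        parseFilters.add(md -> { md.put("author", "Jane Doe"); return md; });
        parseFilters.add(md -> { md.put("year", "2017"); return md; });
        // A later parse filter reading metadata written by an earlier one.
        parseFilters.add(md -> {
            md.put("author_lower", md.get("author").toLowerCase(Locale.ROOT));
            return md;
        });
        Map<String, String> parseMetadata = runChain(new LinkedHashMap<>(), parseFilters);

        // Hypothetical indexing filter combining values from two parse filters.
        List<UnaryOperator<Map<String, String>>> indexingFilters = new ArrayList<>();
        indexingFilters.add(doc -> {
            doc.put("byline", parseMetadata.get("author") + " (" + parseMetadata.get("year") + ")");
            doc.put("author_lower", parseMetadata.get("author_lower"));
            return doc;
        });
        return runChain(new LinkedHashMap<>(), indexingFilters);
    }

    public static void main(String[] args) {
        System.out.println(example());
    }
}
```

Because the third parse filter depends on the first, the order of the list matters in this sketch; whether and how the order can be controlled in a real Nutch setup should be checked against the configuration of the version in use.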

 

Re: ParseFilter and IndexingFilter

Michael Chen
Hi Markus,

Thanks for the explanation. I just realized that the fetched content is
not altered by parsers; only new metadata fields are created by the
parse. But can a plugin process metadata that another parser has already added?

Also, I tested jsoup-extractor and it doesn't handle HTML well, only
XML. Do you think there's a relatively easy way to adapt it for all HTML?

Thanks!

Michael

