Exchange documents in indexing job

Exchange documents in indexing job

Roannel Fernández Hernández
Hi folks:

Is there a way in Nutch to send documents to a particular index writer depending on the values of certain fields?

Let me explain. I have documents with a field called "mimetype". I want to send only the documents where this field is "text/plain" to Solr, and send the documents where it is "text/html" to RabbitMQ. How can I do that?

Regards

The @universidad_uci is Fidel. We young people will not fail.
#HastaSiempreComandante
#HastalaVictoriaSiempre

RE: Exchange documents in indexing job

Yossi Tamari
I don't see a good way to do it in configuration, but it should be very easy to override the write method in the two plugins to have it check the mime type and decide whether to call super.write or not.
(One terrible way to do it with configuration only would be to configure only one of the indexers and use mimetype-filter to filter the matching type, and then reconfigure for the other indexer and change mimetype-filter.txt to the other mime type and index again...)
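
A minimal sketch of the override idea, under stated assumptions: Doc and RecordingWriter below are hypothetical stand-ins for a Nutch document and an index writer plugin, not the real Nutch classes, so the snippet runs without Nutch on the classpath. In a real plugin the subclass would extend the actual Solr or RabbitMQ writer and override its write method in the same way.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a Nutch document: a map of field names to values.
class Doc {
    final Map<String, String> fields = new HashMap<>();

    Doc(String url, String mimetype) {
        fields.put("url", url);
        fields.put("mimetype", mimetype);
    }
}

// Hypothetical stand-in for an index writer plugin (Solr, RabbitMQ, ...).
// It only records what it was asked to write.
class RecordingWriter {
    final List<Doc> written = new ArrayList<>();

    void write(Doc doc) {
        written.add(doc);
    }
}

// Overrides write and forwards a document to super.write only when its
// "mimetype" field matches the type this writer accepts.
class MimeTypeFilteringWriter extends RecordingWriter {
    private final String acceptedMimeType;

    MimeTypeFilteringWriter(String acceptedMimeType) {
        this.acceptedMimeType = acceptedMimeType;
    }

    @Override
    void write(Doc doc) {
        if (acceptedMimeType.equals(doc.fields.get("mimetype"))) {
            super.write(doc);
        }
        // Otherwise the document is silently skipped by this writer.
    }
}

class RoutingDemo {
    public static void main(String[] args) {
        MimeTypeFilteringWriter solr = new MimeTypeFilteringWriter("text/plain");
        MimeTypeFilteringWriter rabbit = new MimeTypeFilteringWriter("text/html");

        Doc plain = new Doc("http://example.org/a.txt", "text/plain");
        Doc html = new Doc("http://example.org/b.html", "text/html");

        // Every document is offered to every writer; each writer decides for itself.
        for (RecordingWriter writer : new RecordingWriter[] {solr, rabbit}) {
            writer.write(plain);
            writer.write(html);
        }

        System.out.println("solr documents: " + solr.written.size());
        System.out.println("rabbitmq documents: " + rabbit.written.size());
    }
}
```

The accepted MIME type is hard-wired per writer instance here; in a real plugin it would more likely come from the plugin's configuration.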


RE: Exchange documents in indexing job

Markus Jelsma-2
In reply to this post by Roannel Fernández Hernández
I think the MIME-type filter is a fine method for this; the only drawback is that you need to run the indexer twice.

A better solution, though, would be to support JEXL expressions in IndexWriters and IndexerMapReduce, to allow both global filtering and per-IndexWriter filtering. This would not be very hard to patch in.
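
To make the two levels of filtering concrete, here is a rough sketch. It is not a patch against Nutch: FilteredWriter and FilteringIndexer are hypothetical names, documents are plain maps, and a java.util.function.Predicate stands in for a compiled JEXL expression so that the snippet needs nothing beyond the JDK.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical writer with its own filter; the Predicate plays the role
// that a per-IndexWriter JEXL expression would play.
class FilteredWriter {
    final String name;
    final Predicate<Map<String, String>> filter;
    final List<Map<String, String>> received = new ArrayList<>();

    FilteredWriter(String name, Predicate<Map<String, String>> filter) {
        this.name = name;
        this.filter = filter;
    }
}

// Hypothetical indexer: applies the global filter first, then offers the
// document to each writer whose own filter accepts it.
class FilteringIndexer {
    private final Predicate<Map<String, String>> globalFilter;
    private final List<FilteredWriter> writers = new ArrayList<>();

    FilteringIndexer(Predicate<Map<String, String>> globalFilter) {
        this.globalFilter = globalFilter;
    }

    void addWriter(FilteredWriter writer) {
        writers.add(writer);
    }

    void index(Map<String, String> doc) {
        if (!globalFilter.test(doc)) {
            return; // rejected for all writers
        }
        for (FilteredWriter writer : writers) {
            if (writer.filter.test(doc)) {
                writer.received.add(doc);
            }
        }
    }

    // Helper to build a document; a null mimetype leaves the field unset.
    static Map<String, String> doc(String url, String mimetype) {
        Map<String, String> fields = new HashMap<>();
        fields.put("url", url);
        if (mimetype != null) {
            fields.put("mimetype", mimetype);
        }
        return fields;
    }
}

class FilteringDemo {
    public static void main(String[] args) {
        // Global filter: drop documents that have no mimetype at all.
        FilteringIndexer indexer = new FilteringIndexer(d -> d.containsKey("mimetype"));
        FilteredWriter solr = new FilteredWriter("solr", d -> "text/plain".equals(d.get("mimetype")));
        FilteredWriter rabbit = new FilteredWriter("rabbitmq", d -> "text/html".equals(d.get("mimetype")));
        indexer.addWriter(solr);
        indexer.addWriter(rabbit);

        indexer.index(FilteringIndexer.doc("http://example.org/a.txt", "text/plain"));
        indexer.index(FilteringIndexer.doc("http://example.org/b.html", "text/html"));
        indexer.index(FilteringIndexer.doc("http://example.org/c", null));

        System.out.println(solr.name + ": " + solr.received.size());
        System.out.println(rabbit.name + ": " + rabbit.received.size());
    }
}
```

With filters like these, a single indexing run can serve both writers, which avoids the second pass that the MIME-type filter approach needs.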
 

Re: [MASSMAIL]RE: Exchange documents in indexing job

Roannel Fernández Hernández
Hi.

Thanks for your tips. I like the idea of JEXL expressions. I'm going to create the ticket and start working on it.

Thanks a lot.


RE: [MASSMAIL]RE: Exchange documents in indexing job

Markus Jelsma-2
For examples of JEXL usage, you can look at CrawlDbReader/CrawlDatum and Generator.

Regards,
Markus

 
 