Manipulate stored string in Lucene

Manipulate stored string in Lucene

Pachzelt, Adrian
Dear all,

Currently I am reading text fields that contain XML text. Hence, the Solr input may look like this:

<field name="tagged_text">&lt;sec sec-type="Introduction" id="SECID0E4F"&gt;
&lt;title&gt;Introduction&lt;/title&gt;
&lt;/sec&gt;
</field>

All "<" and ">" are escaped.
I wrote a tokenizer that indexes the tag attributes (e.g. sec-type="Introduction") at the position of the tagged word ("Introduction" in this case), so I need the HTML tags at index time. However, I want to strip the HTML from the stored string that is shown to the user in query results. So far, I have figured out that the index and the stored string are separate. Thus, I thought it should be possible to manipulate the stored string after indexing.

Is there a way to do so? I would prefer to manipulate the stored string rather than introduce a second field with the plain text in the input file.

I am glad for any help!

Best Regards,

Adrian

-------------------------------------------------------
Adrian Pachzelt
- Fachinformationsdienst Biodiversitaetsforschung -
- Hosting von Open Access-Zeitschriften -
Universitaetsbibliothek Johann Christian Senckenberg
Bockenheimer Landstr. 134-138
60325 Frankfurt am Main
Tel. 069/798-39382
[hidden email]
-------------------------------------------------------

Re: Manipulate stored string in Lucene

Uwe Schindler
Hi,

You don't need a second field name. You can add the field once as indexed with stored=false, and then add a second instance with the same field name that carries the stored content but is not indexed. If you want doc values, the same can be done for them. Internally, Lucene handles it that way anyway; adding a single field that is stored and indexed at the same time is just a convenience.

Uwe

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de

Re: Manipulate stored string in Lucene

Uwe Schindler
Oh, it's Solr? Then it is not easily possible. What I described is how plain Lucene works.

Uwe

Re: Manipulate stored string in Lucene

Pachzelt, Adrian
Hi Uwe,

Thanks for the advice. Yes, I use Solr overall, but I thought this would be a Lucene issue.

I had already tried the solution you propose: I set the original field to stored=false indexed=true, created a copyField, and set the copied field to stored=true indexed=false. However, I do not know how to manipulate the stored string in the copyField. Do you have an idea?

Thanks a lot! :)

Adrian

Re: Manipulate stored string in Lucene

Mikhail Khludnev-2
Hello, Adrian.
If I understand you correctly, this is a job for an UpdateRequestProcessor; see
https://lucene.apache.org/solr/guide/7_3/update-request-processors.html
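
As a rough illustration of what such a processor would have to do to the value before it is stored, here is a minimal plain-Java sketch of the tag stripping itself. It assumes the XML parser has already unescaped the field value, so the string contains literal tags. The class name StoredTextCleaner and the method stripTags are made up for this example and are not part of Lucene or Solr. The regular expression is deliberately naive: it ignores comments, CDATA sections, and a ">" inside attribute values, and it does not decode entities. Wiring the method into an actual UpdateRequestProcessor is not shown.

```java
// Hypothetical helper, not a Lucene or Solr class.
final class StoredTextCleaner {

    private StoredTextCleaner() {
        // static utility, no instances
    }

    /**
     * Replaces everything that looks like a tag with a space,
     * then collapses runs of whitespace and trims the result.
     */
    static String stripTags(String tagged) {
        return tagged
                .replaceAll("<[^>]*>", " ")  // drop <sec ...>, </title>, ...
                .replaceAll("\\s+", " ")     // collapse line breaks and spaces
                .trim();
    }

    public static void main(String[] args) {
        // The example field value from the first message, already unescaped.
        String tagged = "<sec sec-type=\"Introduction\" id=\"SECID0E4F\">\n"
                + "<title>Introduction</title>\n"
                + "</sec>";
        System.out.println(stripTags(tagged)); // prints "Introduction"
    }
}
```

Solr also ships CloneFieldUpdateProcessorFactory and HTMLStripFieldUpdateProcessorFactory, which might cover this case through configuration alone: clone the tagged field into a stored-only field and strip the markup from the copy. Whether they treat this particular XML as wanted would need testing.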


--
Sincerely yours
Mikhail Khludnev

Re: Manipulate stored string in Lucene

Pachzelt, Adrian
I will check this out! Thank you, Mikhail! :)
