Copy field a source of copy field


tstusr
Hi,

We want to use a copy field as the source for another copy field, or apply some other kind of post-processing to a field.

Here is the problem. We have text that is captured into a field like this:

<copyField source="attr_content*" dest="species"/>

At the end of processing, the field contains just the words we want to keep.

<field name="species" type="species_type" stored="true" indexed="true" termVectors="true" termPositions="true" termOffsets="true"/>

<fieldType name="species_type" class="solr.TextField" positionIncrementGap="0">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping/mapping-ISOLatin1Accent.txt"/>
      <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[0-9]+|(\-)(\s*)" replacement=""/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
      <filter class="solr.KeepWordFilterFactory" words="species.txt" ignoreCase="true"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="false"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

What we want to do now is implement faceting based on some post-processing of this field, by using it as the source for another field.

<copyField source="species" dest="genus"/>

<fieldType name="genus_type" class="solr.TextField" positionIncrementGap="0">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.KeepWordFilterFactory" words="genus.txt" ignoreCase="true"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>


As far as I understand, we don't get a value in genus because the chain ends there. However, we also haven't found a way to do two processing passes: first capture the words for species, then make a new capture for the genus.

As an example, imagine that species contains:

abies durangensis
abies flinckii

After post-processing, we expect to have only:
abies

which is a word in the genus file.

I have tried to be as clear as possible about the problem, but there may still be gaps in the explanation.

Hope you can help me.


Re: Copy field a source of copy field

Erick Erickson
In a word, "no". Copy fields are not chained together. I'm not at all
sure what you're trying to accomplish with those filter chains anyway.
By shingling _then_ applying the keepwords filter, you'll have some
input like

abies durangensis

become

abies
abies_durangensis
durangensis

Then you put that through your keepwords filter, which presumably only
has species in it, so it would throw out abies and abies_durangensis
unless those are in your keepwords file. Seems a waste.
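
As a rough illustration of that shingle-then-keepwords sequence, here is a small sketch in plain Python. It is not Solr code: the `shingles` and `keep_words` helpers and the one-entry keep list are hypothetical, and the "_" separator is used only to match the token list shown above.

```python
def shingles(tokens, max_size=3, sep="_", output_unigrams=True):
    """Emit every run of adjacent tokens up to max_size, joined with sep."""
    out = []
    first = 1 if output_unigrams else 2
    for start in range(len(tokens)):
        for size in range(first, max_size + 1):
            if start + size <= len(tokens):
                out.append(sep.join(tokens[start:start + size]))
    return out


def keep_words(tokens, allowed):
    """Keep only tokens found in the allowed set, ignoring case."""
    allowed = {w.lower() for w in allowed}
    return [t for t in tokens if t.lower() in allowed]


# Lowercase and split on whitespace, then shingle.
tokens = "abies durangensis".lower().split()
shingled = shingles(tokens)
print(shingled)  # ['abies', 'abies_durangensis', 'durangensis']

# Hypothetical keep list: only the two-word name survives.
print(keep_words(shingled, {"abies_durangensis"}))  # ['abies_durangensis']
```

Every shingle is computed first, and the keep list then discards whatever is not listed. In a real chain the separator is whatever the shingle filter's tokenSeparator is set to, so the keep list entries have to use the same form.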

That aside, you can construct one long analysis chain that combines
the genus and species chains and just copy from attr_content* into
both. You wouldn't get the different tokenization, but presumably you
don't particularly need it on the second part of the chain.

Best,
Erick


Re: Copy field a source of copy field

Shawn Heisey-2
In reply to this post by tstusr
On 7/17/2017 4:26 PM, tstusr wrote:
> We want to use a copy field as a source for another copy field or some kind
> of post processing of a field.
<snip>
> As an example imagine we have on species
>
> abies durangensis
> abies flinckii
>
> so, after post processing, we expect to have only
> abies
>
> which is a word in genus files

Let's say that you have this in your schema, and you index "Test Words"
(note the capital letters) in field a:

    <copyField source="a" dest="b"/>

Let's say that the index analysis on field a has the whitespace
tokenizer, a lowercase filter, and a stopword filter with "test" in the
list.  This means that the search terms for field a on that document
will only have "words" included.

You might be expecting field b to only receive "words" when it gets
copied from field a ... but this is NOT what happens.  Field b receives
the original text sent to field a, which is "Test Words", including both
words and the uppercase letters.

I think that transitive copies *do* work, so that you can copy field a
to b, then field b to c, though I am not 100 percent sure about that.
If that does work, the end field in the chain is still going to receive
"Test Words" like you sent to field a.

Chaining analysis through copyField does not work.
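
The behaviour described above can be modelled with a short Python sketch. This is a toy model, not Solr's implementation: the analyzers, the `index_document` helper, and the choice of a plain whitespace split for field b are assumptions for illustration. The model also assumes copies are not chained, as stated earlier in the thread.

```python
def analyze_a(text):
    """Field a: whitespace split, lowercase, drop the stopword 'test'."""
    return [t for t in text.lower().split() if t not in {"test"}]


def analyze_b(text):
    """Field b: whitespace split only (assumed for illustration)."""
    return text.split()


def index_document(doc, copy_fields, analyzers):
    """Copy raw source values to destinations, then analyze each field on its own."""
    raw = dict(doc)
    for source, dest in copy_fields:
        # Only values present in the incoming document are copied,
        # so a copy never feeds another copy in this model.
        if source in doc:
            raw[dest] = doc[source]
    return {field: analyzers[field](value) for field, value in raw.items()}


indexed = index_document(
    {"a": "Test Words"},
    copy_fields=[("a", "b")],
    analyzers={"a": analyze_a, "b": analyze_b},
)
print(indexed)  # {'a': ['words'], 'b': ['Test', 'Words']}
```

Field b is analyzed from the original "Test Words", not from the tokens that field a's analysis produced.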

Thanks,
Shawn


Re: Copy field a source of copy field

tstusr
In reply to this post by Erick Erickson
OK, I know shingling will join with "_".

But that is the behaviour we want. Imagine we have these entries (contained in the species file):

abarema idiopoda
abutilon bakerianum

Those become:
abarema
idiopoda
abutilon
bakerianum
abarema_idiopoda
abutilon_bakerianum

But now suppose my genus file contains only the word abarema, so we end up with a field that has only that word.

So, the requirements here, are to be able to find all species in species files (step one) and then make a facet with species in file genus, step two.

It seemed reasonable to just chain the fields; I forgot that Solr doesn't change the field content, as Shawn points out (thanks for that).

So what we came up with is two fields, the first one for species:

<fieldType name="species_type" class="solr.TextField" positionIncrementGap="0">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping/mapping-ISOLatin1Accent.txt"/>
      <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[0-9]+|(\-)(\s*)" replacement=""/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
      <filter class="solr.KeepWordFilterFactory" words="species.txt" ignoreCase="true"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="false"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

And the second one (genus), which contains genus that has to be for facet purposes, like this:

<fieldType name="genus_type" class="solr.TextField" positionIncrementGap="0">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping/mapping-ISOLatin1Accent.txt"/>
      <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[0-9]+|(\-)(\s*)" replacement=""/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
      <filter class="solr.KeepWordFilterFactory" words="species.txt" ignoreCase="true"/>
      <filter class="solr.KeepWordFilterFactory" words="genus.txt" ignoreCase="true"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

Nevertheless, the second keep word filter is not applied the way I expect. Am I missing something?



Re: Copy field a source of copy field

Erick Erickson
The code is very simple. At a quick glance, it looks like it just reads
the words in, and then the "accept" method returns true or false based
on whether the text file contains the token.

Are you sure you reloaded your core/collection and pushed the changed
schema to the right place? The admin/analysis page is very helpful
here. Your indexing side should have two keep word filters, and you
should be able to see each transformation (uncheck the "verbose"
checkbox for more readability).

Best,
Erick


Re: Copy field a source of copy field

tstusr
It seems that it is only taking the last keep-words file.

Now, for control purposes, I have this in the genus file:

And it is only taking the compound term, abutilon aurantiacum.

I tested with:
abutilon aurantiacum
abutilon bakerianum

It's not possible to put two tokenizers in a field type, am I right? I think there is a missing split between the two KWFs.

Re: Copy field a source of copy field

tstusr
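Well, I have no idea why those images displayed the way they did.

The correct order is:

Field chain analyzer.
<http://lucene.472066.n3.nabble.com/file/n4346602/1.png>

KWF-genus file
<http://lucene.472066.n3.nabble.com/file/n4346602/3.png>

Test output.
<http://lucene.472066.n3.nabble.com/file/n4346602/2.png>

Sorry for the mistake.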

Re: Copy field a source of copy field

Erick Erickson
Multiple keepwords files work just fine for me.

One issue you're having is that multi-word keepwords aren't going to
do what you expect. The analysis chains work on _tokens_, and only see
one at a time. Plus (apparently) the input is broken up on whitespace
(the docs aren't entirely clear on this, but it can be inferred from
"one per line").

Even if there were multi-word keepwords, it wouldn't work as you
apparently expect. The problem is that the analysis chain first breaks
the input into tokens. So even if a "single" keepword were "a b", and
your input was "a b", by the time it gets to the keepword filter the
context would be lost. So the filter would see just "a" and say "nope,
it doesn't match 'a b', throw it out". Ditto with "b".

Since keepwords are apparently split on whitespace, though, the
keepword list in the example above is "a" and "b", so both tokens
match and are kept.

Best,
Erick


Re: Copy field a source of copy field

tstusr
Well, for me it's kind of strange, because it's working only with words that have blank spaces. Maybe I'm not explaining this well.

My field is defined as follows:

  <fieldType name="genus_type" class="solr.TextField" positionIncrementGap="0">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping/mapping-ISOLatin1Accent.txt"/>
      <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[0-9]+|(\-)(\s*)" replacement=""/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
      <filter class="solr.KeepWordFilterFactory" words="species.txt" ignoreCase="true"/>
      <filter class="solr.KeepWordFilterFactory" words="genus.txt" ignoreCase="true"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

We have two KWF files, "species" and then "genus". It seems that it only works with genus.

Since I'm not able to use copy fields, what choices do I have?
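
One possible reading of this result, sketched in plain Python rather than Solr: each filter only sees what the previous filter let through, so two stacked keep word filters behave like an intersection of the two lists. The file contents and the token list below are hypothetical.

```python
def keep_words(tokens, allowed):
    """Keep only tokens that are present in the allowed set."""
    return [t for t in tokens if t in allowed]


species_words = {"abutilon aurantiacum", "abutilon bakerianum"}  # hypothetical species.txt
genus_words = {"abutilon"}                                       # hypothetical genus.txt

# Hypothetical shingle output for the input "abutilon bakerianum".
tokens = ["abutilon", "abutilon bakerianum", "bakerianum"]

after_species = keep_words(tokens, species_words)
after_both = keep_words(after_species, genus_words)
print(after_species)  # ['abutilon bakerianum']
print(after_both)     # []

# If the first list also contains the bare genus, the genus token survives both filters.
with_genus = keep_words(keep_words(tokens, species_words | {"abutilon"}), genus_words)
print(with_genus)     # ['abutilon']
```

Under these assumptions, a genus token reaches the index only when it is listed in both files, and a two-word species name never passes a genus-only list.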

Re: Copy field a source of copy field

Erick Erickson
OK, I take it back. Keepwords handle multiple words just fine. So I
have to rewind.

I'm having no trouble at all applying multiple, successive keepwords
filters, even when there are multiple words on a single line in the
keepwords file. Your use of shingles in here is probably going to
confuse things, so I'd probably recommend taking that out until you
work out what's happening with multiple keepwords filters, then add it
back in.

The images you pasted almost look like you're showing the contents of
elevate.xml, but I suspect that's bogus.

But I think this is an XY problem, you're asking about how to chain
copyFields and we got off into talking about chaining keepwords and
the like. You state:

"So, the requirements here, are to be able to find all species in
species files (step one) and then make a facet with species in file
genus, step two."

Then you say:

"And the second one (genus), which contains genus that has to be for
facet purposes, like this"

How are those reconciled? Do you want facets on the genus+species? Or
just on the genus? Or both? So let's just start over.

What's also missing is why you think you need keepwords in the first
place. Is this a free-text field you're trying to extract
genus/species from? Or do you have the genus/species extracted
already?

Give us two docs, a sample search and what you want as outcome.
Because if you just want to facet on genus then do a copyField simply
to a "genus" field that strips out everything but the genus (however
you implement that, tricky given sub-species perhaps).

Ditto if you want to facet on species. Just a species_facet field that
you put whatever you want into. Or just use KeywordTokenizer for
species if you're guaranteed that you want the whole field.

You can then use copyField to copy as you wish.

Best,
Erick



Re: Copy field a source of copy field

tstusr
Well, our documents are PDF files (between 20 and 200 pages each).

We capture words from the whole file using the extract handler, which is why we have these fields:

<copyField source="attr_conten*" dest="genus"/>
<copyField source="attr_conten*" dest="specie"/>

We capture species from all the PDF content (in the attr_content field).

The captured species are used for ranking purposes, so we need the whole name; that's why we use shingles. As an example, we capture from the PDF:

abelmoschus achanioides
abies colimensis
abies concolor

Because that information is important, we provide a facet of those species, grouped by genus (just the first word of the species name). So in the facet we need to have:

abelmoschus (1)
abies (2)

However, we need a sort of subquery, because first we need the complete species, and then we facet those results by genus. For example:

the abies something else (This phrase should not be captured)
the abies concolor something else (This phrase should be captured) -> We end up with just "abies concolor", which is consequently then captured by genus

I realized that every genus is contained in the species names.

So, is there a way to build a facet from just the first word of a field? For example, if the field contains:

abelmoschus achanioides
abies colimensis
abies concolor

Can we just use the first word of each of those?
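
To make the expected outcome concrete, here is a minimal Python sketch of the grouping described above, assuming the species names have already been extracted. The `genus_facet` helper is hypothetical and runs outside Solr; it only shows the counts the facet should produce.

```python
from collections import Counter


def genus_facet(species_names):
    """Count species per genus, taking the first word of each name as the genus."""
    counts = Counter(
        name.split()[0].lower() for name in species_names if name.strip()
    )
    return sorted(counts.items())


captured = [
    "abelmoschus achanioides",
    "abies colimensis",
    "abies concolor",
]
for genus, count in genus_facet(captured):
    print(f"{genus} ({count})")
# abelmoschus (1)
# abies (2)
```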

Re: Copy field a source of copy field

Erick Erickson
OK, you'll need two fields pretty much for certain. The trick is
getting _only_ genus names in the genus field.

The simplest thing to do would be a straight copyField with a single
keep word filter that contains a list of all the genera. That
presupposes that the genera are disjoint sets from all other words.
You search on your species field and facet on the genus field.

But assuming your genera are not disjoint from all other words, hmmmm.
Do you have a way of unambiguously identifying genus/species pairs in
the text you're processing? If you do we can work with that, but
without that you're talking entity recognition of some sort.

BTW, there's no real need to shingle the species field, just search
for "genus species" as a phrase. Unless those two appear next to each
other in order you won't get a hit.

Best,
Erick


Re: Copy field a source of copy field

tstusr
Well, correct me if I'm wrong.

Your suggestion is to use the species field as the source of the genus field. We tried this:

<copyField source="attr_conten*" dest="species"/>
<copyField source="species" dest="genus"/>

where species works as described and genus just uses a KWF, like this:

<fieldType name="genus_type" class="solr.TextField" positionIncrementGap="0">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.KeepWordFilterFactory" words="genus.txt" ignoreCase="true"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

But now the problem is different.

When we test the behavior in the Analysis section of the Solr admin UI, it works as expected.

However, when we use it at indexing time (when we post PDF files to the extractor), the field doesn't even appear. We think it's because the data comes from another copyField.

Did I misunderstand your suggestion?

Re: Copy field a source of copy field

Erick Erickson
Yep, we're not communicating ;)

Use the original source field for the genus, as:

<copyField source="attr_conten*" dest="species"/>
<copyField source="attr_conten*" dest="genus"/>

The difficulty here is that there might be false hits if the genera
names happen to match words in the input that are not part of a
genus/species pair.
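
The false-hit risk can be seen with the two example phrases from earlier in the thread. This is a plain Python sketch with hypothetical helpers and word lists, not a model of the actual Solr chain.

```python
def genus_hits(text, genera):
    """Return every token that matches a genus name, regardless of context."""
    return [t for t in text.lower().split() if t in genera]


def species_hits(text, species):
    """Return every adjacent word pair that matches a full species name."""
    tokens = text.lower().split()
    pairs = [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
    return [p for p in pairs if p in species]


genera = {"abies"}             # hypothetical genus list
species = {"abies concolor"}   # hypothetical species list

for phrase in ("the abies something else", "the abies concolor something else"):
    print(phrase, genus_hits(phrase, genera), species_hits(phrase, species))
```

For the first phrase the genus field gets "abies" even though no species matched, which is the false hit described above. For the second phrase both fields get a value.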


