use copyField to gather and then split

Rodland Fredrik
(sorry if this message ends up being sent twice)

We have a use-case where we'd like to fill a field from multiple sources,
i.e.

<copyField source="title" dest="text" />
<copyField source="body" dest="text" />
… (other source fields are copied into text as well)

and then analyze the resulting text field in a number of ways, each
requiring its own field.

Is it possible to somehow copy the text-field from above to these new
fields - i.e.

<copyField source="text" dest="textanayzemethod2" />
<copyField source="text" dest="textanayzemethod1" />

Is this at all possible, or do we have to duplicate the first set of
copyFields for each textanayzemethodN?

If it is possible, is the order of the statements in schema.xml important here?

Any tips or hints are highly appreciated.


regards,


Fredrik Rodland


--
Fredrik Rødland               Mail:  [hidden email]
                              Cell:  +47 99 21 98 17
Open Ad Exchange              MSN:   [hidden email]
Maisen Pedersens vei 1        AIM:   Fredrik Rodland
NO-1363 Høvik, NORWAY         Web:   http://rodland.no

Re: use copyField to gather and then split

Jan Høydahl / Cominvent
Hi pal :)

Unfortunately, copyField copies the raw source value BEFORE analysis, and you cannot "chain" copyFields: the output of one copyField is never used as the source of another.

The simplest solution would be to duplicate your copyFields:

<copyField source="title" dest="textanayzemethod2" />
<copyField source="body" dest="textanayzemethod2" />

<copyField source="title" dest="textanayzemethod1" />
<copyField source="body" dest="textanayzemethod1" />
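
For illustration, the duplicated copyFields might sit in schema.xml roughly as sketched below. The field type names text_method1 and text_method2 are hypothetical placeholders for your own analysis chains, and the title/body definitions are assumed rather than taken from your schema. Because each destination receives more than one source value, it is declared multiValued.

```xml
<!-- schema.xml sketch; text_method1 and text_method2 are hypothetical fieldType names -->
<fields>
  <!-- Source fields (assumed definitions, adjust to your schema) -->
  <field name="title" type="text" indexed="true" stored="true"/>
  <field name="body"  type="text" indexed="true" stored="true"/>

  <!-- One destination per analysis method; multiValued because several sources are copied in -->
  <field name="textanayzemethod1" type="text_method1" indexed="true" stored="false" multiValued="true"/>
  <field name="textanayzemethod2" type="text_method2" indexed="true" stored="false" multiValued="true"/>
</fields>

<!-- Each destination is fed from the original sources, never from another copyField destination -->
<copyField source="title" dest="textanayzemethod1"/>
<copyField source="body"  dest="textanayzemethod1"/>
<copyField source="title" dest="textanayzemethod2"/>
<copyField source="body"  dest="textanayzemethod2"/>
```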

Another way would be to look into the UpdateRequestProcessorChain and write a "copy" processor that does whatever you need.
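
As a rough sketch of that second option, a custom processor would be registered in solrconfig.xml along these lines. The chain name and the class com.example.GatherAndCopyProcessorFactory are hypothetical; you would have to write that factory yourself. LogUpdateProcessorFactory and RunUpdateProcessorFactory are the stock processors that normally end a chain.

```xml
<!-- solrconfig.xml sketch; the chain name and the com.example class are hypothetical -->
<updateRequestProcessorChain name="gather-and-copy">
  <!-- Custom processor that would copy the gathered text into each textanayzemethodN field -->
  <processor class="com.example.GatherAndCopyProcessorFactory"/>
  <!-- Stock processors: log the update, then actually run it against the index -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

How the chain is selected by the update handler depends on your Solr version, so check the documentation for the release you are running.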

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 29 June 2010, at 12:18, [hidden email] wrote:



year range field, proper data type?

Jonathan Rochkind
In reply to this post by Rodland Fredrik
I will have a Solr field that contains years, e.g. "1990", "2010",
maybe even "1492", "1209" and "907"/"0907".

I will be doing range limits over this field, e.g. [1950 TO 1975] or
what have you. The data represents publication dates of books on a
large library's shelves; there will be around 3 million documents, with
the data concentrated in recent years, but with a long tail stretching
off into the past.

So it seems clear to me that I should use a trie field of some type, to
efficiently accommodate the range queries.

It seems to me that I probably don't need/want an actual date field,
since the data isn't complex enough to demand it; it's just a four-digit
year.

So that pretty much leaves storing it as a trie integer or as a trie
string. Any advice on which is probably better in this case? Or on how
to set up the trie field for this kind of data? Thanks for any advice,

Jonathan

Re: year range field, proper data type?

Erick Erickson
This isn't a very worrisome case. Most of the messages you see on the board
about the dangers of dates arise because dates can be stored with many unique
values if they include milliseconds. Then, when sorting on date, your memory
explodes because all the dates are loaded into memory.

In your case, there are at most 10,000 distinct four-digit years, which isn't
the same magnitude of problem as, say, 10,000,000 documents each with a unique
timestamp.

That said, you might as well go for as much speed as you can get and use a
trie int. That way you won't be tripped up by three-digit years being out of
lexical order (as strings, "907" sorts after "1950").

Best
Erick

On Wed, Jul 7, 2010 at 10:55 AM, Jonathan Rochkind <[hidden email]> wrote:


Re: year range field, proper data type?

Lance Norskog-2
There is no 'trie string'.

If you use a trie type for this problem, sorting will take much less
memory. Sorting strings uses memory both per document and per unique
term. The Trie types do not use any memory per unique term. So, yes, a
Trie Integer is a good choice for this problem.
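
A minimal schema.xml sketch of such a field, assuming a hypothetical field name pub_year. The tint definition with precisionStep="8" follows, as far as I recall, the one shipped in the Solr 1.4 example schema, so treat the value as a starting point rather than a tuned setting.

```xml
<!-- schema.xml sketch; pub_year is a hypothetical field name -->
<types>
  <!-- Trie-encoded integer type; precisionStep controls how many extra terms are indexed for range queries -->
  <fieldType name="tint" class="solr.TrieIntField" precisionStep="8"
             omitNorms="true" positionIncrementGap="0"/>
</types>

<fields>
  <!-- Publication year stored as a number, so ordering is numeric rather than lexical -->
  <field name="pub_year" type="tint" indexed="true" stored="true"/>
</fields>
```

A range limit would then look like pub_year:[1950 TO 1975]. Since the values are parsed as integers, "907" and "0907" should both be indexed as 907.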

On Wed, Jul 7, 2010 at 12:59 PM, Erick Erickson <[hidden email]> wrote:




--
Lance Norskog
[hidden email]