Handling disparate data sources in Solr


Handling disparate data sources in Solr

Alan Burlison
Hi,

I'm considering using Solr to replace an existing bare-metal Lucene
deployment - the current Lucene setup is embedded inside an existing
monolithic webapp, and I want to factor out the search functionality
into a separate webapp so it can be reused more easily.

At present the content of the Lucene index comes from many different
sources (web pages, documents, blog posts etc) and can be in different
formats (plaintext, HTML, PDF etc).  All the various content types are
rendered to plaintext before being inserted into the Lucene index.

The net result is that the data in one field in the index (say
"content") may have come from one of a number of source document types.
I'm having difficulty understanding how I might map this functionality
onto Solr.  I understand how (for example) I could use
HTMLStripStandardTokenizer to insert the contents of an HTML document
into a field called "content", but (assuming I'd written a PDF analyser)
how would I insert the content of a PDF document into the same "content"
field?
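
For reference, the kind of schema.xml snippet I have in mind is something
like this (a sketch only - the field and type names are invented, but
solr.HTMLStripStandardTokenizerFactory is the tokenizer I mean):

  <fieldType name="html_text" class="solr.TextField">
    <analyzer>
      <!-- strip HTML markup, then tokenize whatever plaintext remains -->
      <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="content" type="html_text" indexed="true" stored="true"/>

That works for HTML because the markup is still text, but there's no
equivalent hook I can see for a binary format like PDF.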

I know I could do this by preprocessing the various document types to
plaintext in the various Solr clients before inserting the data into the
index, but that means that each client would need to know how to do the
document transformation.  As well as centralising the index, I also want
to centralise the handling of the different document types.

Another question:

What do "omitNorms" and "positionIncrementGap" mean in the schema.xml
file?  The documentation is vague to say the least, and google wasn't
much more helpful.

Thanks,

--
Alan Burlison
--

Re: Handling disparate data sources in Solr

Otis Gospodnetic-2
Alan,

omitNorms lets you skip using field norms for certain fields when calculating the document matching score.  This can save you some RAM.  See http://issues.apache.org/jira/browse/LUCENE-448 .
For position increment gap, have a look at http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/Token.html#setPositionIncrement(int) , it is described pretty well there.
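
Both of them show up as attributes in schema.xml; a rough sketch (field
and type names made up for illustration):

  <!-- omitNorms: field length and index-time boosts won't affect scoring
       for this field, and you save a byte per document per field -->
  <field name="popularity" type="sint" indexed="true" stored="true" omitNorms="true"/>

  <!-- positionIncrementGap: gap left between multiple values of the same
       field in one document, so phrase queries don't match across values -->
  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
  </fieldType>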

I don't know the answer to your main question, though.

Otis



Re: Handling disparate data sources in Solr

Mike Klaas
In reply to this post by Alan Burlison
On 12/22/06, Alan Burlison <[hidden email]> wrote:

> At present the content of the Lucene index comes from many different
> sources (web pages, documents, blog posts etc) and can be in different
> formats (plaintext, HTML, PDF etc).  All the various content types are
> rendered to plaintext before being inserted into the Lucene index.
>
> The net result is that the data in one field in the index (say
> "content") may have come from one of a number of source document types.
> I'm having difficulty understanding how I might map this functionality
> onto Solr.  I understand how (for example) I could use
> HTMLStripStandardTokenizer to insert the contents of an HTML document
> into a field called "content", but (assuming I'd written a PDF analyser)
> how would I insert the content of a PDF document into the same "content"
> field?

You could do it in Solr.  The difficulty is that arbitrary binary data
is not easily transferred via xml.  So you must specify that the input
is in base64 or some other encoding.  Then you could decode it on the
fly using a custom Analyzer before passing it along.
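
Purely as an illustration (this isn't supported out of the box - the
decoding Analyzer would have to be custom, configured in schema.xml for
the "content" field), the update message might look something like:

  <add>
    <doc>
      <field name="id">doc42</field>
      <!-- base64-encoded PDF bytes; a custom Analyzer configured for this
           field would decode and extract the text before indexing -->
      <field name="content">JVBERi0xLjMKJcfsj6IK...</field>
    </doc>
  </add>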

It might be easier to do this outside of solr, but still in a
centralized manner.  Write another webapp which accepts files.   It
will decode them appropriately and pass them along to the solr
instance in the same container.  Then your clients don't even need to
know how to talk to solr.

-Mike

Re: Handling disparate data sources in Solr

Bertrand Delacretaz
In reply to this post by Alan Burlison
On 12/23/06, Alan Burlison <[hidden email]> wrote:
> ...As well as centralising the index, I also want
> to centralise the handling of the different document types...

My "Subversion and Solr" presentation from the last Cocoon GetTogether
might give you ideas for how to handle this, see the link at
http://wiki.apache.org/solr/SolrResources.

Although it does not handle all binary formats out of the box (you might
need to write some Java glue code to support new formats), Cocoon is
a good tool for transforming various document formats to XML and
filtering the results to generate the appropriate XML for Solr. I
wouldn't add functionality to Solr for doing this; it's best to keep
things loosely-coupled IMHO.

-Bertrand

Re: Handling disparate data sources in Solr

Alan Burlison
In reply to this post by Otis Gospodnetic-2
Otis Gospodnetic wrote:

> omitNorms lets you skip using field norms for certain fields when
> calculating the document matching score.  This can save you some RAM.
> See http://issues.apache.org/jira/browse/LUCENE-448.

Thanks.  What effect does this have on the quality of the returned
matches?  Are there any guidelines as to when you would disable field
norms, and on which types of fields?

> For position
> increment gap, have a look at
> http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/Token.html#setPositionIncrement(int)
> , it is described pretty well there.

Right - so although in most cases it is set to 100, a value of 5
or so would be plenty, is that correct?  In fact, isn't it more-or-less
a boolean switch?

--
Alan Burlison
--

Re: Handling disparate data sources in Solr

Alan Burlison
In reply to this post by Mike Klaas
Mike Klaas wrote:

> You could do it in Solr.  The difficulty is that arbitrary binary data
> is not easily transferred via xml.  So you must specify that the input
> is in base64 or some other encoding.  Then you could decode it on the
> fly using a custom Analyzer before passing it along.

Why won't cdata work?

> It might be easier to do this outside of solr, but still in a
> centralized manner.  Write another webapp which accepts files.   It
> will decode them appropriately and pass them along to the solr
> instance in the same container.  Then your clients don't even need to
> know how to talk to solr.

In that case there's little point in using Solr at all - the main
benefit it gives me is that I don't have to write all the HTTP protocol
bits.  If I have to do that myself I might as well use raw Lucene - and
in fact that's how the existing system works.

--
Alan Burlison
--

Re: Handling disparate data sources in Solr

Alan Burlison
In reply to this post by Bertrand Delacretaz
Bertrand Delacretaz wrote:

> My "Subversion and Solr" presentation from the last Cocoon GetTogether
> might give you ideas for how to handle this, see the link at
> http://wiki.apache.org/solr/SolrResources.

Hmm, I'm beginning to think the only way to do this is to write a
complete custom front-end to Solr - even a custom analyser won't do, as
analyzers only deal with fields, not a full document (e.g. a PDF file).

> Although it does not handle all binary formats out of the box (might
> need to write some java glue code to implement new formats), Cocoon is
> a good tool for transforming various document formats to XML and
> filter the results to generate the appropriate XML for Solr. I
> wouldn't add functionality to Solr for doing this, it's best to keep
> things loosely-coupled IMHO.

Cocoon?  Thanks for the suggestion, but the last thing I want is yet
another "Web Framework".  I'm trying to simplify things, not add 90%
clutter for 10% functionality.

--
Alan Burlison
--

Re: Handling disparate data sources in Solr

Otis Gospodnetic-2
In reply to this post by Alan Burlison
Hi Alan,

----- Original Message ----
From: Alan Burlison <[hidden email]>
To: [hidden email]
Sent: Saturday, December 23, 2006 8:19:21 AM
Subject: Re: Handling disparate data sources in Solr

Otis Gospodnetic wrote:

> omitNorms lets you skip using field norms for certain fields when
> calculating the document matching score.  This can save you some RAM.
> See http://issues.apache.org/jira/browse/LUCENE-448.

Thanks.  What effect does this have on the quality of the returned
matches?  Are there any guidelines as to when you would disable field
norms, and on which types of fields?

OG: fields that are really used as boolean filters and not as full-text searchable fields.  For example, a date field or a numeric field are good candidates, and so is a username field.  In general, untokenized fields whose length should not impact the score.
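
OG: For example, in schema.xml (field names invented, type names as in the
example schema):

  <!-- length/boosts shouldn't affect scoring here, so norms can go -->
  <field name="created"  type="date"   indexed="true" stored="true" omitNorms="true"/>
  <field name="username" type="string" indexed="true" stored="true" omitNorms="true"/>

  <!-- full-text field keeps its norms so length normalization still applies -->
  <field name="body" type="text" indexed="true" stored="true"/>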

> For position
> increment gap, have a look at
> http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/Token.html#setPositionIncrement(int)
> , it is described pretty well there.

Right - so although in most cases it is set to 100, a value of 5
or so would be plenty, is that correct?  In fact, isn't it more-or-less
a boolean switch?

OG: depends on what you want to do, as described in the javadoc...

Otis




Re: Handling disparate data sources in Solr

Chris Hostetter-3
In reply to this post by Alan Burlison

: > omitNorms lets you skip using field norms for certain fields when
: > calculating the document matching score.  This can save you some RAM.

: Thanks.  What effect does this have on the quality of the returned
: matches?  Are there any guidelines as to when you would disable field
: norms, and on which types of fields?

it depends on how you define quality .. if you want length to be a factor
in scoring, you need norms.  if you want index time field or doc boosts to
be a factor, you need norms -- if you have fields where you don't care
about those things, or where you explicitly *don't* want them to be a
factor, you omit them, and your index (and RAM usage) gets smaller as a bonus.

: > http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/Token.html#setPositionIncrement(int)
: > , it is described pretty well there.
:
: Right - so although in most cases it is set to 100, a value of 5
: or so would be plenty, is that correct?  In fact, isn't it more-or-less
: a boolean switch?

it depends on the Analyzers you use and how sloppy you tend to make your
Phrase or SpanNear queries ... a value of 5 means that if you did a sloppy
query for "brown cow"~20 you could conceivably match brown in one field
value and cow in another ... but with a gap of 100 that would not happen.
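
to make that concrete (made-up example, two values in one multiValued field):

  <doc>
    <field name="content">the quick brown fox</field>
    <field name="content">how now, said the cow</field>
  </doc>

with positionIncrementGap="5" the sloppy query "brown cow"~20 could match
across the two values; with a gap of 100 the positions are too far apart
for it to match.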


-Hoss


Re: Handling disparate data sources in Solr

Chris Hostetter-3
In reply to this post by Alan Burlison

: > You could do it in Solr.  The difficulty is that arbitrary binary data
: > is not easily transferred via xml.  So you must specify that the input
: > is in base64 or some other encoding.  Then you could decode it on the
: > fly using a custom Analyzer before passing it along.
:
: Why won't cdata work?

because your binary data might contain the byte sequence 0x5D 0x5D 0x3E --
indicating the end of the CDATA section. CDATA is short for "Character
DATA" -- you can't put arbitrary binary data (or even arbitrary text)
in it and be sure that it will work.

: > It might be easier to do this outside of solr, but still in a
: > centralized manner.  Write another webapp which accepts files.   It
: > will decode them appropriately and pass them along to the solr
: > instance in the same container.  Then your clients don't even need to
: > know how to talk to solr.
:
: In that case there's little point in using Solr at all - the main
: benefit it gives me is that I don't have to write all the HTTP protocol
: bits.  If I have to do that myself I might as well use raw Lucene - and
: in fact that's how the existing system works.

For your purposes, if you've got a system that works and does the Document
conversion for you, then you are probably right: Solr may not be a useful
addition to your architecture.  Solr doesn't really attempt to solve the
problem of parsing different kinds of data streams into a unified Document
model -- it just tries to expose all of the Lucene goodness through an
easy to use, easy to configure, HTTP interface.  Besides the
configuration, Solr's other value adds are in its IndexReader management,
its caching, and its plugin support for mixing and matching request
handlers, output writers, and field types as easily as you can mix and
match Analyzers.

There has been some discussion about adding plugin support for the
"update" side of things as well -- at a very simple level this could allow
for messages to be sent via JSON or CSV instead of just XML -- but
there's no reason a more complex update plugin couldn't read in a binary PDF
file and parse it into its appropriate fields ... but we aren't
quite there yet.  Feel free to bring this up on solr-dev if you'd be
interested in working on it.


-Hoss


Re: Handling disparate data sources in Solr

Alan Burlison
Chris Hostetter wrote:

> : Why won't cdata work?
>
> because your binary data might contain the byte sequence 0x5D 0x5D 0x3E --
> indicating the end of the CDATA section. CDATA is short for "Character
> DATA" -- you can't put arbitrary binary data (or even arbitrary text)
> in it and be sure that it will work.

Ok, so I have to escape ]]> - if it occurs - if I do that, why won't it
work?

> For your purposes, if you've got a system that works and does the Document
> conversion for you, then you are probably right: Solr may not be a useful
> addition to your architecture.  Solr doesn't really attempt to solve the
> problem of parsing different kinds of data streams into a unified Document
> model -- it just tries to expose all of the Lucene goodness through an
> easy to use, easy to configure, HTTP interface.  Besides the
> configuration, Solr's other value adds are in its IndexReader management,
> its caching, and its plugin support for mixing and matching request
> handlers, output writers, and field types as easily as you can mix and
> match Analyzers.

Yes, it's all the crunchy goodness that I'm interested in ;-)

> There has been some discussion about adding plugin support for the
> "update" side of things as well -- at a very simple level this could allow
> for messages to be sent via JSON or CSV instead of just XML -- but
> there's no reason a more complex update plugin couldn't read in a binary PDF
> file and parse it into its appropriate fields ... but we aren't
> quite there yet.  Feel free to bring this up on solr-dev if you'd be
> interested in working on it.

Hmm.  That's a possibility.  It all depends on the time tradeoff between
fixing what we have already to make it reusable versus extending Solr.

--
Alan Burlison
--

Re: Handling disparate data sources in Solr

Walter Underwood, Netflix
In reply to this post by Alan Burlison
On 12/23/06 5:28 AM, "Alan Burlison" <[hidden email]> wrote:

>> You could do it in Solr.  The difficulty is that arbitrary binary data
>> is not easily transferred via xml.  So you must specify that the input
>> is in base64 or some other encoding.  Then you could decode it on the
>> fly using a custom Analyzer before passing it along.
>
> Why won't cdata work?

Some octet (byte) values are illegal in XML. Most of the ASCII control
characters are not allowed. If one of those is in an XML document,
it is a fatal error and parsing must stop in any conforming XML
parser.
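
For example (illustration only), something like this is not well-formed
XML 1.0, CDATA or not, and a conforming parser has to give up on it:

  <add>
    <doc>
      <!-- a raw 0x08 (backspace) byte is outside the XML Char production,
           so the parser must report a fatal error here -->
      <field name="content"><![CDATA[...raw PDF bytes including 0x08...]]></field>
    </doc>
  </add>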

wunder
--
Walter Underwood
Search Guru, Netflix




Re: Handling disparate data sources in Solr

Alan Burlison
In reply to this post by Chris Hostetter-3
Chris Hostetter wrote:

> For your purposes, if you've got a system that works and does the Document
> conversion for you, then you are probably right: Solr may not be a useful
> addition to your architecture.  Solr doesn't really attempt to solve the
> problem of parsing different kinds of data streams into a unified Document
> model -- it just tries to expose all of the Lucene goodness through an
> easy to use, easy to configure, HTTP interface.  Besides the
> configuration, Solr's other value adds are in its IndexReader management,
> its caching, and its plugin support for mixing and matching request
> handlers, output writers, and field types as easily as you can mix and
> match Analyzers.
>
> There has been some discussion about adding plugin support for the
> "update" side of things as well -- at a very simple level this could allow
> for messages to be sent via JSON or CSV instead of just XML -- but
> there's no reason a more complex update plugin couldn't read in a binary PDF
> file and parse it into its appropriate fields ... but we aren't
> quite there yet.  Feel free to bring this up on solr-dev if you'd be
> interested in working on it.

I'm interested in discussing this further.  I've moved the discussion
onto solr-dev, as suggested.

--
Alan Burlison
--

Re: Handling disparate data sources in Solr

Alan Burlison
Original problem statement:

----------
I'm considering using Solr to replace an existing bare-metal Lucene
deployment - the current Lucene setup is embedded inside an existing
monolithic webapp, and I want to factor out the search functionality
into a separate webapp so it can be reused more easily.

At present the content of the Lucene index comes from many different
sources (web pages, documents, blog posts etc) and can be in different
formats (plaintext, HTML, PDF etc).  All the various content types are
rendered to plaintext before being inserted into the Lucene index.

The net result is that the data in one field in the index (say
"content") may have come from one of a number of source document types.
I'm having difficulty understanding how I might map this functionality
onto Solr.  I understand how (for example) I could use
HTMLStripStandardTokenizer to insert the contents of an HTML document
into a field called "content", but (assuming I'd written a PDF analyser)
how would I insert the content of a PDF document into the same "content"
field?

I know I could do this by preprocessing the various document types to
plaintext in the various Solr clients before inserting the data into the
index, but that means that each client would need to know how to do the
document transformation.  As well as centralising the index, I also want
to centralise the handling of the different document types.
----------

My initial suggestion, to get the discussion started, is to extend the
<doc> and <field> elements with the following attributes:

mime-type
MIME type of the document, e.g. application/pdf, text/html and so on.

encoding
Encoding of the document, with base64 being the standard implementation.

href
The URL of a document that can be accessed over HTTP, instead of
embedding it in the indexing request.  The indexer would fetch the
document using the specified URL.

There would then be entries in the configuration file that map each MIME
type to a handler that is capable of dealing with that document type.
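
To make that concrete, an indexing request might look something like this
(purely illustrative syntax - none of these attributes exist today):

  <add>
    <!-- document embedded in the request, base64-encoded -->
    <doc mime-type="application/pdf" encoding="base64">
      <field name="id">report-2006-12</field>
      <field name="content">JVBERi0xLjMK...</field>
    </doc>

    <!-- document fetched by the indexer itself -->
    <doc mime-type="text/html" href="http://intranet.example.com/blog/42.html">
      <field name="id">blog-42</field>
    </doc>
  </add>

and the corresponding (equally hypothetical) configuration entries:

  <documentHandler mime-type="application/pdf" class="org.example.PdfHandler"/>
  <documentHandler mime-type="text/html" class="org.example.HtmlHandler"/>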

Thoughts?

--
Alan Burlison
--

Re: Handling disparate data sources in Solr

Erik Hatcher
The idea of having Solr handle various document types is a good one,
for sure.  I'm not sure what specifics would need to be implemented,
but I at least wanted to reply and say it's a good idea!

Care has to be taken when passing a URL to Solr for it to go fetch,  
though.  There are a lot of complexities in fetching resources via  
HTTP, especially when handing something off to Solr which should be  
behind a firewall and may not be able to see the web as you would  
with your browser.

        Erik




Re: Handling disparate data sources in Solr

Walter Underwood, Netflix
On 1/7/07 7:24 AM, "Erik Hatcher" <[hidden email]> wrote:

> Care has to be taken when passing a URL to Solr for it to go fetch,
> though.  There are a lot of complexities in fetching resources via
> HTTP, especially when handing something off to Solr which should be
> behind a firewall and may not be able to see the web as you would
> with your browser.

Cracking documents and spidering URLs are both big, big problems.
PDF is a horrid mess, as are old versions of MS Office. Proxies,
logins, cookies, all sorts of issues show up with fetching URLs,
along with a fun variety of misbehaving servers.

I remember crashing one server with 25 GET requests before we
implemented session cookies in our spider. That used up all the
DB connections and killed the server.

If you need to do a lot of spidering and parse lots of kinds of
documents, I don't know of an open source solution for that.
Products like Ultraseek and the Googlebox are about your only
choice.

wunder
--
Walter Underwood
Search Guru, Netflix
Former Architect for Ultraseek


Re: Handling disparate data sources in Solr

Alan Burlison
In reply to this post by Erik Hatcher
Erik Hatcher wrote:

> The idea of having Solr handle various document types is a good one, for
> sure.  I'm not sure what specifics would need to be implemented, but I
> at least wanted to reply and say it's a good idea!
>
> Care has to be taken when passing a URL to Solr for it to go fetch,
> though.  There are a lot of complexities in fetching resources via HTTP,
> especially when handing something off to Solr which should be behind a
> firewall and may not be able to see the web as you would with your browser.

In that case the client should encode the content and send it as part of
the index insert/update request - the aim is merely to avoid the bloat
caused by encoding the document (e.g. as base64) when the indexer can
access the source document directly.

--
Alan Burlison
--
Reply | Threaded
Open this post in threaded view
|

Re: Handling disparate data sources in Solr

Alan Burlison
In reply to this post by Walter Underwood, Netflix
Walter Underwood wrote:

> Cracking documents and spidering URLs are both big, big problems.
> PDF is a horrid mess, as are old versions of MS Office. Proxies,
> logins, cookies, all sorts of issues show up with fetching URLs,
> along with a fun variety of misbehaving servers.
>
> I remember crashing one server with 25 GET requests before we
> implemented session cookies in our spider. That used up all the
> DB connections and killed the server.
>
> If you need to do a lot of spidering and parse lots of kinds of
> documents, I don't know of an open source solution for that.
> Products like Ultraseek and the Googlebox are about your only
> choice.

I'm not suggesting that Solr be extended to become a spider, I'm just
suggesting we provide a mechanism for direct access to source documents
if they are accessible.  For example, if the document being indexed was
on the same machine as Solr, the href would usually start "file://", not
"http://".

BTW, this discussion is also occurring on solr-dev, it might be better
to move all of it over there ;-)

--
Alan Burlison
--

detecting duplicates using the field type 'text'

Ben Incani-2
In reply to this post by Alan Burlison
Hi Solr users,

I have the following fields set in my 'schema.xml'.

*** schema.xml ***
 <fields>
  <field name="id" type="text" indexed="true" stored="true" />
  <field name="document_title" type="text" indexed="true" stored="true"
/>
  ...
 </fields>
 <uniqueKey>id</uniqueKey>
 <defaultSearchField>document_title</defaultSearchField>
 <copyField source="document_title" dest="id"/>
*** schema.xml ***

When I add a document with a duplicate title, it gets duplicated (not
sure why)

<add>
<doc>
 <field name="document_title">duplicate</field>
</doc>
<doc>
 <field name="document_title">duplicate</field>
</doc>
</add>

When I add a document with a duplicate title (numeric only), it does not
get duplicated

<add>
<doc>
 <field name="document_title">123</field>
</doc>
<doc>
 <field name="document_title">123</field>
</doc>
</add>

I can ensure duplicates DO NOT get added when using the field type
'string'.
And I can also ensure that they DO get added when using
<add allowDups="true">.

Why is there a disparity in detecting duplicates when using the field
type 'text'?

Is this merely a documentation issue or have I missed something here...

Regards,

Ben

Re: detecting duplicates using the field type 'text'

Chris Hostetter-3
:  <uniqueKey>id</uniqueKey>
:  <defaultSearchField>document_title</defaultSearchField>
:  <copyField source="document_title" dest="id"/>

whoa... that's a pretty out there usecase ... i don't think i've ever seen
someone use their uniqueKey field as the target of a copyField.

off the top of my head, i suspect maybe the copy field is taking place
after the duplicate detection? ... but i'm not sure...

: When I add a document with a duplicate title (numeric only), it does not
: get duplicated

...and now i'm *really* not sure, that doesn't make much sense to me at
all.

: I can ensure duplicates DO NOT get added when using the field type
: 'string'.

hmm... could you perhaps add the value directly to your "id" field
(string) and then copyField it into document_title?  based on what
you've said, that should work -- although i would agree, what you describe
when using your current schema definitely sounds like a bug.
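
i.e. something along these lines (sketch, untested):

 <fields>
  <field name="id" type="string" indexed="true" stored="true" />
  <field name="document_title" type="text" indexed="true" stored="true" />
 </fields>
 <uniqueKey>id</uniqueKey>
 <defaultSearchField>document_title</defaultSearchField>
 <copyField source="id" dest="document_title"/>

that way the uniqueKey comparison happens on the raw string value, and the
analyzed copy only exists for searching.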

it would be great if you could open a Jira issue describing this problem
... it would be even better if after posting the issue you could
make fixing it easier by attaching a test case. :)



-Hoss