Incremental export of a huge collection


Incremental export of a huge collection

vidit.asthana
Hi,

I am building a service that has to continuously read data from a Solr
collection and insert it into another database. The collection will receive
daily updates, and its initial size is very large. After I have indexed the
whole data set (through cursor mark), I want to do only incremental inserts
on a daily basis.

My documents don't have anything like a timestamp that I can use to fetch
"only newly added" documents after a certain point. Is there any internal
field that I can use to create this checkpoint and then later use to
fetch "only incremental updates" from that point onwards?

I initially tried to sort the documents by ID and use the last fetched cursor
mark, but my unique-ID field is a string and there is NO guarantee that a
newly added document's ID will sort after the existing ones.

Solr version is 8.2.0.

Regards,
Vidit

Re: Incremental export of a huge collection

Toke Eskildsen-2
Vidit Asthana <[hidden email]> wrote:
> My documents don't have anything like timestamp which I can use to fetch
> "only newly added" documents after a certain point. Is there any internal
> field which I can use to create this checkpoint and then later use that to
> fetch "only incremental updates" from that point onwards?

You could have a timestamped field that is auto-set to the time of indexing the document:

  <field name="index_time" type="date" default="NOW" />

where date is a solr.DatePointField.


The API documentation warns against doing that in SolrCloud (each replica of a document may compute a slightly different value), so use it with care or use the TimestampUpdateProcessorFactory that it recommends instead:
http://lucene.apache.org/solr/7_2_1/solr-core/org/apache/solr/schema/DatePointField.html
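
A minimal sketch of how such a field could drive the daily export, assuming the field is called index_time as above and the uniqueKey field is called id (both are assumptions about the schema). The fetch_page callable is a hypothetical stand-in for an HTTP request to the collection's /select handler; it is passed in so the paging logic can be read without a running Solr.

```python
from typing import Callable, Dict, Iterator


def build_params(checkpoint: str, cursor_mark: str, rows: int = 1000) -> Dict[str, str]:
    """Query parameters for one page of documents indexed after `checkpoint`."""
    return {
        "q": "*:*",
        # Exclusive lower bound: only documents indexed strictly after the checkpoint.
        "fq": "index_time:{%s TO *]" % checkpoint,
        # cursorMark requires the uniqueKey field (assumed to be `id`) in the sort.
        "sort": "index_time asc, id asc",
        "rows": str(rows),
        "cursorMark": cursor_mark,
    }


def export_since(checkpoint: str,
                 fetch_page: Callable[[Dict[str, str]], dict]) -> Iterator[dict]:
    """Yield every document newer than `checkpoint`, one cursor page at a time."""
    cursor = "*"  # "*" starts a new cursor
    while True:
        response = fetch_page(build_params(checkpoint, cursor))
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:  # unchanged cursor means no more results
            break
        cursor = next_cursor
```

The caller would remember the largest index_time it has seen (for example 2019-09-09T00:00:00Z) and pass it as the checkpoint on the next run.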


- Toke Eskildsen

Re: Incremental export of a huge collection

Mikhail Khludnev-2
In reply to this post by vidit.asthana
Isn't _version_ a timestamp of insertion by default?


--
Sincerely yours
Mikhail Khludnev

Re: Incremental export of a huge collection

Joel Bernstein
This will do what you describe:

https://lucene.apache.org/solr/guide/8_1/stream-source-reference.html#topic
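
A rough sketch of what using topic from a client might look like. It only builds the streaming expression and the request for the /stream handler, and sends nothing. The collection names, the topic id and the base URL are made up for illustration; the parameter names (id, q, fl, initialCheckpoint) are the ones listed on the page above.

```python
from typing import Optional, Tuple
from urllib.parse import urlencode


def topic_expression(checkpoint_collection: str, collection: str, topic_id: str,
                     query: str = "*:*", fields: str = "id",
                     initial_checkpoint: Optional[int] = None) -> str:
    """Build a topic(...) streaming expression as a string."""
    args = [
        checkpoint_collection,       # collection where the topic stores its checkpoints
        collection,                  # collection to read new documents from
        'id="%s"' % topic_id,        # name under which the checkpoints are stored
        'q="%s"' % query,
        'fl="%s"' % fields,
    ]
    if initial_checkpoint is not None:
        # 0 asks for all matching documents on the first run
        args.append("initialCheckpoint=%d" % initial_checkpoint)
    return "topic(" + ", ".join(args) + ")"


def stream_request(base_url: str, collection: str, expr: str) -> Tuple[str, str]:
    """Return (url, form-encoded body) for a POST to the /stream handler."""
    url = "%s/%s/stream" % (base_url.rstrip("/"), collection)
    return url, urlencode({"expr": expr})
```

For example, topic_expression("checkpoints", "mycollection", "export-job", initial_checkpoint=0) would describe a first run that reads everything; later runs with the same id would only return documents added or updated since the stored checkpoint.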

Joel Bernstein
http://joelsolr.blogspot.com/



Re: Incremental export of a huge collection

Paras Lehana
Hey Mikhail,

> Isn't _version_ a timestamp of insertion by default?

I think so. From a similar question on Stack Overflow
<https://stackoverflow.com/questions/45671144/how-to-get-last-document-insert-in-solr>:

> You can sort by _version_ field in descending order. AFAIK, _version_ field
> is a epoch timestamp (of when the document was indexed into Solr) in
> milliseconds multiplied by 2^20.

However, I cannot find any official Solr documentation about this
(I would like to know more about it).
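
If the quoted claim holds (it is an assumption here, not something taken from official documentation), converting between a _version_ value and an approximate indexing time is just a 20-bit shift. A small sketch:

```python
from datetime import datetime, timezone

# Assumption from the quoted answer: _version_ = epoch milliseconds * 2^20.
VERSION_SHIFT = 20


def version_to_datetime(version: int) -> datetime:
    """Approximate indexing time encoded in the high bits of a _version_ value."""
    millis = version >> VERSION_SHIFT  # drop the low 20 bits
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


def datetime_to_min_version(moment: datetime) -> int:
    """Smallest _version_ value that could have been assigned at `moment`."""
    return int(moment.timestamp() * 1000) << VERSION_SHIFT
```

A checkpoint could then be expressed either as a raw _version_ value or as a point in time, for example a filter such as _version_:{CHECKPOINT TO *] with the last seen value in place of CHECKPOINT.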

Also, if the user wants to do Date Math
<https://lucene.apache.org/solr/guide/6_6/working-with-dates.html#WorkingwithDates-DateMath>
with the indexing time, I prefer the solution given by Toke:

 <field name="index_time" type="date" default="NOW" />
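
As an illustration of the Date Math point, a filter for "everything indexed during the last N full days" could be built as below. The field name index_time comes from the snippet above; the helper name is made up.

```python
def last_days_filter(field: str = "index_time", days: int = 1) -> str:
    """Filter query covering the last `days` full days, excluding today."""
    if days < 1:
        raise ValueError("days must be at least 1")
    unit = "DAY" if days == 1 else "DAYS"
    # NOW/DAY rounds down to the start of the current day (UTC unless TZ is set).
    # "[" includes the lower bound, "}" excludes the upper bound.
    return "%s:[NOW/DAY-%d%s TO NOW/DAY}" % (field, days, unit)
```

Passing the result as an fq parameter would select yesterday's documents when days is 1.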


PS: Vidit, never sort or query on _docid_. I remember querying for a set of
documents using _docid_ together with start and rows. The query never
returned a result, and I remember that it crashed the Solr server by pushing
the load to over 10x! I think that's because the docid -> document mapping is
(logically) not indexed in the way we were querying it.




--
Regards,

Paras Lehana