[jira] [Updated] (SOLR-12854) Document steps to improve delta import via DataImportHandler

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/SOLR-12854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amrit Sarkar updated SOLR-12854:
    Issue Type: Improvement  (was: Bug)

> Document steps to improve delta import via DataImportHandler
> -------------------------------------------------------------
>                 Key: SOLR-12854
>                 URL: https://issues.apache.org/jira/browse/SOLR-12854
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public)
>          Components: contrib - DataImportHandler
>    Affects Versions: 7.5
>            Reporter: Amrit Sarkar
>            Priority: Major
> Delta imports in DataImportHandler are sometimes slower than full imports, because a delta import issues many more queries than a full import and therefore takes longer. This is described at: https://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport
> On the mailing list (http://lucene.472066.n3.nabble.com/Number-of-requests-spike-up-when-i-do-the-delta-Import-td4338162.html), a Solr user described a workaround that reportedly works well and improves delta import performance: apply the ${dataimporter.last_index_time} filter in deltaImportQuery rather than in deltaQuery.
> {code}
> I found a hacky way to limit the number of times deltaImportQuery was executed.
> As designed, Solr executes deltaQuery to get a list of ids that need to be indexed. For each of those, it executes deltaImportQuery, which is typically very similar to the full query.
> I constructed a deltaQuery to purposely only return 1 row. E.g.
> deltaQuery = "SELECT id FROM table WHERE rownum=1" // written for Oracle, likely requires a different syntax for other dbs. Also, it occurred to me that you could probably include the date >= '${dataimporter.last_index_time}' filter here so this returns 0 rows if no data has changed
> Since deltaImportQuery now only gets called once, I needed to add the filter logic to deltaImportQuery to only select the changed rows (that logic is normally in deltaQuery). E.g.
> deltaImportQuery = [normal import query] WHERE date >= '${dataimporter.last_index_time}'
> {code}
> A number of other users have adopted this strategy and report improved DIH delta import performance, so documenting it as a TIP would help other users too.
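> For illustration, the workaround might look roughly like this in a DIH data-config.xml. This is only a sketch under assumptions: the table "item", its columns (id, name, last_modified), and the Oracle connection details are hypothetical placeholders, not taken from the mailing list thread. TO_DATE is used on the assumption that last_index_time is written in the default yyyy-MM-dd HH:mm:ss format.
> {code:xml}
> <dataConfig>
>   <!-- Hypothetical Oracle connection; replace driver, url, user and password -->
>   <dataSource type="JdbcDataSource"
>               driver="oracle.jdbc.OracleDriver"
>               url="jdbc:oracle:thin:@//localhost:1521/XE"
>               user="solr" password="secret"/>
>   <document>
>     <!-- deltaQuery returns at most one row, so deltaImportQuery runs at most once.
>          deltaImportQuery carries the last_index_time filter and selects all changed rows. -->
>     <entity name="item" pk="id"
>             query="SELECT id, name, last_modified FROM item"
>             deltaQuery="SELECT id FROM item
>                         WHERE last_modified &gt;= TO_DATE('${dataimporter.last_index_time}', 'YYYY-MM-DD HH24:MI:SS')
>                         AND rownum = 1"
>             deltaImportQuery="SELECT id, name, last_modified FROM item
>                               WHERE last_modified &gt;= TO_DATE('${dataimporter.last_index_time}', 'YYYY-MM-DD HH24:MI:SS')"/>
>   </document>
> </dataConfig>
> {code}
> Because deltaImportQuery here does not reference ${dih.delta.id}, it is not tied to the single id returned by deltaQuery. With the date filter in deltaQuery as well, no import query runs when nothing has changed.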

This message was sent by Atlassian JIRA
