Slow file-import


arres
Hello there,

I am working on an import configuration for my Solr index and have run
into some issues.

As a first step I configured an import handler that imports data from a
database into the Solr index. It works fine, but it is very slow
(7K documents per second), so I wanted to switch to a data import
handler that uses a FileDataSource. (I am running Solr 4.6.1.)

I have to import nearly 150,000,000 lines each night, and each line has
the following characteristics:
- fields are separated by tabs
- 70 fields per line
- one line is nearly 600 characters long
- each line contains multiple data types (date, int, string...)
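
For scale, here is a rough back-of-the-envelope calculation from the numbers above. It assumes the 7K documents per second rate holds for the whole run and that one character is roughly one byte; both are simplifications.

```python
# Figures taken from the post; the derived values are estimates only.
lines_per_night = 150_000_000
docs_per_second = 7_000      # observed rate of the database import
chars_per_line = 600         # "nearly 600 characters" per line

# Time for one nightly run at the observed rate, in hours.
import_hours = lines_per_night / docs_per_second / 3600

# Approximate input volume, assuming about one byte per character.
input_gb = lines_per_night * chars_per_line / 1e9

print(f"import time: {import_hours:.2f} h, input size: {input_gb:.0f} GB")
```

So at the current rate the nightly import takes close to six hours for roughly 90 GB of input.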

At the moment the files are loaded into the database first, and Solr
then imports them from there (database import handler).
To improve import performance I want to import the files directly.


This is the first approach I tested:
---------------
        <entity
            name="files"
            dataSource="null"
            rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="/tmp"
            fileName=".*\.infile"
            onError="abort"
            recursive="false">
            <entity
                name="csv_file"
                processor="LineEntityProcessor"
                url="${files.fileAbsolutePath}"
                dataSource="fds"
                transformer="RegexTransformer">
                <field column="rawLine"
                       regex="^(.*)\t(.*)\t(.*)\t(.*)\t(.*)$"
                       groupNames="field1,,,field4,field5"/>
            </entity>
        </entity>
-----------------
If I import fewer than 10 fields this works fine. But as soon as I
extend the import to 30 fields, the time to import a single line
increases to more than 10 seconds!
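
A likely cause of this slowdown is regex backtracking: each `(.*)` group can also match tab characters, so a backtracking engine (Java's, which Solr uses, and Python's, used here only for illustration) may try a very large number of ways to distribute the tabs across the groups. The sketch below compares the pattern from the config with a stricter one whose groups cannot cross a tab, and with a plain split. `build_pattern` is a hypothetical helper, not part of Solr.

```python
import re

def build_pattern(n_fields):
    """Build a tab-delimited pattern with one group per field.

    Each group is "[^\t]*", which can never run across a tab, so the
    engine has only one way to split a line into fields.
    """
    return re.compile("^" + "\t".join(["([^\t]*)"] * n_fields) + "$")

# Pattern as used in the config above: every group may also swallow tabs.
GREEDY = re.compile(r"^(.*)\t(.*)\t(.*)\t(.*)\t(.*)$")
STRICT = build_pattern(5)

line = "a\tb\tc\td\te"

greedy_fields = GREEDY.match(line).groups()
strict_fields = STRICT.match(line).groups()
split_fields = tuple(line.split("\t"))  # no regex at all

# All three agree on a line with exactly five fields.
print(greedy_fields == strict_fields == split_fields)
```

Note that the strict pattern needs exactly one group per field in the line, whereas with `(.*)` the first group silently absorbs any extra fields. Whether the stricter pattern is fast enough inside RegexTransformer would still have to be measured.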


So I tried another approach and moved the transformation into a script:
----------------
<script><![CDATA[
    // Split the raw tab-separated line and copy the needed columns into the row.
    function parse(row) {
        var rawLine = row.get("rawLine");
        var arr = rawLine.split("\t");
        row.put("field1", arr[0]);
        row.put("field67", arr[67]);
        // row.remove("rawLine");
        return row;
    }
]]></script>
-----------------
But this was only slightly faster than the database import.


Does anyone have an idea how I can improve my import performance?

Thank you very, very much,
Sebastian

Re: Slow file-import

iorixxx
Hi,

Try http://wiki.apache.org/solr/UpdateCSV; it should be faster.
See 'Tab-delimited importing' at the end of the wiki page.
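
As a minimal sketch of what such a request could look like, the snippet below only builds the URL for the CSV update handler with a tab separator; it does not send anything. The base URL, the file path and the field names are assumptions for illustration, `build_csv_update_url` is a hypothetical helper, and `header=false` assumes the files have no header row. Reading a local file via `stream.file` also requires remote streaming to be enabled in solrconfig.xml.

```python
from urllib.parse import urlencode

def build_csv_update_url(base_url, file_path, field_names):
    """Return a URL for Solr's CSV update handler reading a tab-delimited file."""
    params = {
        "stream.file": file_path,          # file is read by the Solr server itself
        "separator": "\t",                 # tab-delimited input (encoded as %09)
        "escape": "\\",                    # backslash escaping, as on the wiki page
        "header": "false",                 # assumption: no header line in the file
        "fieldnames": ",".join(field_names),
        "commit": "true",
    }
    return base_url + "/update/csv?" + urlencode(params)

# Hypothetical host, file and field names.
url = build_csv_update_url(
    "http://localhost:8983/solr",
    "/tmp/data.infile",
    ["field1", "field2", "field3"],
)
print(url)
```

The resulting URL can then be requested with curl or any HTTP client; an empty entry in `fieldnames` should skip that column, which may help if only some of the 70 fields are needed.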

Cheers,
Ahmet

On Monday, May 19, 2014 1:31 PM, Hal Arres <[hidden email]> wrote:


