Performance of Bulk Importing TSV File in Solr 8

Joseph Lorenzini
Hi all,

I have a TSV file containing 1.2 million rows. I want to bulk import this
file into Solr, where each row becomes a Solr document. The TSV has 24
columns. I am using the streaming API like so:

curl -v '
http://localhost:8983/solr/example/update?stream.file=/opt/solr/results.tsv&separator=%09&escape=%5c&stream.contentType=text/csv;charset=utf-8&commit=true
'

The ingestion rate is about 167,000 rows per minute, so the import takes
roughly 7.5 minutes to complete. I have a few questions.

- Is there a way to increase the ingestion rate? I am open to doing
something other than a bulk import of a TSV, up to and including writing a
small program; I am just not sure what that would look like at a high
level.
- If the file is a TSV, I noticed that Solr never closes the HTTP
connection with a 200 OK after all the documents are uploaded; the
connection seems to be held open indefinitely. If, however, I upload the
same file as a CSV, Solr does close the HTTP connection. Is this a bug?

Re: Performance of Bulk Importing TSV File in Solr 8

Mikhail Khludnev
Hello, Joseph.

This rate looks good to me. Still, if the node is idling and has plenty of
free RAM, you can split the file with Unix tools and submit the partitions
for import in parallel; a sketch follows below.
The hanging connection seems like a bug.
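
A minimal sketch of that idea, assuming the collection is named "example"
as in the curl command above and that the first line of results.tsv is a
header row; the chunk size and commitWithin value are only illustrative.
Note that this uploads each chunk in the request body rather than using
stream.file:

# Split the 1.2M-row file into ~300k-row chunks, keeping the header aside.
head -n 1 /opt/solr/results.tsv > /opt/solr/header.tsv
tail -n +2 /opt/solr/results.tsv | split -l 300000 - /opt/solr/chunk_

# Index the chunks in parallel; each request re-attaches the header row.
for f in /opt/solr/chunk_*; do
  cat /opt/solr/header.tsv "$f" | curl -s \
    'http://localhost:8983/solr/example/update?separator=%09&escape=%5c&commitWithin=60000' \
    -H 'Content-Type: text/csv;charset=utf-8' \
    --data-binary @- &
done
wait

# One explicit commit once every chunk has been accepted.
curl 'http://localhost:8983/solr/example/update?commit=true'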

--
Sincerely yours
Mikhail Khludnev

Re: Performance of Bulk Importing TSV File in Solr 8

Paras Lehana
Hi Joseph,

Although your indexing rate is already fast at around 2,800 docs/sec, you
can play with the values of autoCommit, the merge policy and
ramBufferSizeMB; see the sketch below for how to inspect and adjust them.

If you post your existing values for these, we can comment on them.

As Mikhail suggested, importing in batches, with commits in between, can
also increase performance.
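
A hedged sketch of how those settings could be inspected and, for
autoCommit, changed at runtime via the Config API. The collection name
"example" is assumed from the earlier curl command, and the numbers are
purely illustrative, not recommendations:

# Show the current updateHandler settings (autoCommit, autoSoftCommit, ...).
curl 'http://localhost:8983/solr/example/config/updateHandler'

# The full effective config also shows indexConfig (ramBufferSizeMB, merge policy).
curl 'http://localhost:8983/solr/example/config'

# autoCommit can be adjusted through the Config API, e.g. commit every 60
# seconds without opening a new searcher while the bulk load runs:
curl 'http://localhost:8983/solr/example/config' \
  -H 'Content-Type: application/json' \
  -d '{"set-property": {"updateHandler.autoCommit.maxTime": 60000,
                        "updateHandler.autoCommit.openSearcher": false}}'

# ramBufferSizeMB and the merge policy live under <indexConfig> in
# solrconfig.xml and are normally edited there, followed by a core reload.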

--
Regards,

Paras Lehana
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn: 8173
