Terms with different boosts

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Terms with different boosts

Guy Gavriely-2-2
Hi,

I have to index terms with different boosts, meaning that if the word A appears in two documents one document will be ranked higher.

I've tried to index them by putting them in different fields and give the fields different boost but i ran into too many files (caused by too many fields I guess) problem.
Below is the stack trace:

java.io.FileNotFoundException: /tmp/nutch/crawl/indexes/part-00000/_0.nrm (Too many open files)

 at java.io.FileInputStream.open(Native Method)
 at java.io.FileInputStream.<init>(FileInputStream.java:106)
 at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.<init>(RawLocalFileSystem.java:87)
 at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:142)
 at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
 at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
 at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
 at org.apache.nutch.indexer.FsDirectory$DfsIndexInput$Descriptor.<init>(FsDirectory.java:160)
 at org.apache.nutch.indexer.FsDirectory$DfsIndexInput.<init>(FsDirectory.java:169)
 at org.apache.nutch.indexer.FsDirectory.openInput(FsDirectory.java:115)
 at org.apache.lucene.index.SegmentReader.openNorms(SegmentReader.java:522)
 at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:185)
 at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:140)
 at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:126)
 at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:155)
 at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:579)
 at org.apache.lucene.index.IndexReader.open(IndexReader.java:179)
 at org.apache.lucene.index.IndexReader.open(IndexReader.java:142)
 at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.<init>(DeleteDuplicates.java:167)
 at org.apache.nutch.indexer.DeleteDuplicates$InputFormat.getRecordReader(DeleteDuplicates.java:240)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:139)
 at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
Exception in thread "main" java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
 at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
 at com.sightix.nutch.crawl.Crawl.main(Crawl.java:139)

Can you suggest another way to index terms in different boosts?
Thanks,
Guy
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Terms with different boosts

Markus Lux
Hi Guy,

I think that isn't a problem related to fields. I experienced this kind of
error caused by an limitation of the underlying file system. The problem was
that I had too much InputStreams open that had never been closed. Please
check that in your code and tell us if it worked.

Markus.


2008/9/11 Guy Gavriely <[hidden email]>

> Hi,
>
> I have to index terms with different boosts, meaning that if the word A
> appears in two documents one document will be ranked higher.
>
> I've tried to index them by putting them in different fields and give the
> fields different boost but i ran into too many files (caused by too many
> fields I guess) problem.
> Below is the stack trace:
>
> java.io.FileNotFoundException: /tmp/nutch/crawl/indexes/part-00000/_0.nrm
> (Too many open files)
>
>  at java.io.FileInputStream.open(Native Method)
>  at java.io.FileInputStream.<init>(FileInputStream.java:106)
>  at
> org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.<init>(RawLocalFileSystem.java:87)
>  at
> org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:142)
>  at
> org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
>  at
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
>  at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
>  at
> org.apache.nutch.indexer.FsDirectory$DfsIndexInput$Descriptor.<init>(FsDirectory.java:160)
>  at
> org.apache.nutch.indexer.FsDirectory$DfsIndexInput.<init>(FsDirectory.java:169)
>  at org.apache.nutch.indexer.FsDirectory.openInput(FsDirectory.java:115)
>  at org.apache.lucene.index.SegmentReader.openNorms(SegmentReader.java:522)
>  at
> org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:185)
>  at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:140)
>  at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:126)
>  at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:155)
>  at
> org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:579)
>  at org.apache.lucene.index.IndexReader.open(IndexReader.java:179)
>  at org.apache.lucene.index.IndexReader.open(IndexReader.java:142)
>  at
> org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.<init>(DeleteDuplicates.java:167)
>  at
> org.apache.nutch.indexer.DeleteDuplicates$InputFormat.getRecordReader(DeleteDuplicates.java:240)
>  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:139)
>  at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
> Exception in thread "main" java.io.IOException: Job failed!
>  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>  at
> org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>  at com.sightix.nutch.crawl.Crawl.main(Crawl.java:139)
>
> Can you suggest another way to index terms in different boosts?
> Thanks,
> Guy
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>


--
Markus Lux
Heeper Stra├če 241
33607 Bielefeld
mobil +49 162 6960943
http://www.mlux.de
Reply | Threaded
Open this post in threaded view
|

Re: Terms with different boosts

Erick Erickson
In reply to this post by Guy Gavriely-2-2
How many fields are you winding up with for each document? One for
each term?

And what is the higher-level task you're trying to accomplish? What
distinguishes *why* a certain term in a certain document should
boost a particular document? Perhaps if you explained the higher
level task someone would have a suggestion...

Best
Erick

On Thu, Sep 11, 2008 at 10:28 AM, Guy Gavriely <[hidden email]> wrote:

> Hi,
>
> I have to index terms with different boosts, meaning that if the word A
> appears in two documents one document will be ranked higher.
>
> I've tried to index them by putting them in different fields and give the
> fields different boost but i ran into too many files (caused by too many
> fields I guess) problem.
> Below is the stack trace:
>
> java.io.FileNotFoundException: /tmp/nutch/crawl/indexes/part-00000/_0.nrm
> (Too many open files)
>
>  at java.io.FileInputStream.open(Native Method)
>  at java.io.FileInputStream.<init>(FileInputStream.java:106)
>  at
> org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.<init>(RawLocalFileSystem.java:87)
>  at
> org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:142)
>  at
> org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
>  at
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
>  at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
>  at
> org.apache.nutch.indexer.FsDirectory$DfsIndexInput$Descriptor.<init>(FsDirectory.java:160)
>  at
> org.apache.nutch.indexer.FsDirectory$DfsIndexInput.<init>(FsDirectory.java:169)
>  at org.apache.nutch.indexer.FsDirectory.openInput(FsDirectory.java:115)
>  at org.apache.lucene.index.SegmentReader.openNorms(SegmentReader.java:522)
>  at
> org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:185)
>  at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:140)
>  at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:126)
>  at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:155)
>  at
> org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:579)
>  at org.apache.lucene.index.IndexReader.open(IndexReader.java:179)
>  at org.apache.lucene.index.IndexReader.open(IndexReader.java:142)
>  at
> org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.<init>(DeleteDuplicates.java:167)
>  at
> org.apache.nutch.indexer.DeleteDuplicates$InputFormat.getRecordReader(DeleteDuplicates.java:240)
>  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:139)
>  at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
> Exception in thread "main" java.io.IOException: Job failed!
>  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>  at
> org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>  at com.sightix.nutch.crawl.Crawl.main(Crawl.java:139)
>
> Can you suggest another way to index terms in different boosts?
> Thanks,
> Guy
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Terms with different boosts

Otis Gospodnetic-2
In reply to this post by Guy Gavriely-2-2
Guy, ulimit -n is your friend.  As is the compound index format.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----

> From: Guy Gavriely <[hidden email]>
> To: "[hidden email]" <[hidden email]>
> Sent: Thursday, September 11, 2008 10:28:34 AM
> Subject: Terms with different boosts
>
> Hi,
>
> I have to index terms with different boosts, meaning that if the word A appears
> in two documents one document will be ranked higher.
>
> I've tried to index them by putting them in different fields and give the fields
> different boost but i ran into too many files (caused by too many fields I
> guess) problem.
> Below is the stack trace:
>
> java.io.FileNotFoundException: /tmp/nutch/crawl/indexes/part-00000/_0.nrm (Too
> many open files)
>
> at java.io.FileInputStream.open(Native Method)
> at java.io.FileInputStream.(FileInputStream.java:106)
> at
> org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.(RawLocalFileSystem.java:87)
> at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:142)
> at
> org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.(ChecksumFileSystem.java:110)
> at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
> at
> org.apache.nutch.indexer.FsDirectory$DfsIndexInput$Descriptor.(FsDirectory.java:160)
> at
> org.apache.nutch.indexer.FsDirectory$DfsIndexInput.(FsDirectory.java:169)
> at org.apache.nutch.indexer.FsDirectory.openInput(FsDirectory.java:115)
> at org.apache.lucene.index.SegmentReader.openNorms(SegmentReader.java:522)
> at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:185)
> at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:140)
> at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:126)
> at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:155)
> at
> org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:579)
> at org.apache.lucene.index.IndexReader.open(IndexReader.java:179)
> at org.apache.lucene.index.IndexReader.open(IndexReader.java:142)
> at
> org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.(DeleteDuplicates.java:167)
> at
> org.apache.nutch.indexer.DeleteDuplicates$InputFormat.getRecordReader(DeleteDuplicates.java:240)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:139)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
> Exception in thread "main" java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
> at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
> at com.sightix.nutch.crawl.Crawl.main(Crawl.java:139)
>
> Can you suggest another way to index terms in different boosts?
> Thanks,
> Guy
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]