Sequence File Question

Sequence File Question

sseveran
Hey guys,
I have a MapReduce job that sets up a directory for PageRank. It iterates
over all the segments and then outputs a MapFile containing the data. When
I go to open the output directory with another MapReduce job, it fails,
saying that it cannot find the path. The path it is trying to open does not
include the part-00000 directory. My directory (and every other directory,
for that matter) has the same structure: /path/part-00000/<whatever>. I
feel like this is a really stupid error and I have forgotten something that
is easily fixed. Any ideas?

Steve

RE: Sequence File Question

sseveran
Let me refine that question: why do some directories, like the linkdb,
have a current subdirectory, while others, like parse_data, do not? Is
there a convention for this?

Steve

Re: Sequence File Question

Andrzej Białecki-2
Steve Severance wrote:
> Let me refine that question: why do some directories, like the linkdb,
> have a current subdirectory, while others, like parse_data, do not? Is
> there a convention for this?

First, to answer your original question: you should use
MapFileOutputFormat class for reading such output. It handles these
part-xxxx subdirectories automatically.
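
Something like this should work for random lookups against that output -
an untested sketch using the old mapred API, with Text/FloatWritable
standing in for whatever key/value types your job actually emits:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.FloatWritable;
  import org.apache.hadoop.io.MapFile;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapFileOutputFormat;
  import org.apache.hadoop.mapred.lib.HashPartitioner;

  public class PageRankLookup {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      // Point at the job output directory itself, not at .../part-00000.
      Path dir = new Path(args[0]);

      // getReaders() opens one MapFile.Reader per part-xxxx subdirectory.
      MapFile.Reader[] readers =
          MapFileOutputFormat.getReaders(fs, dir, conf);

      // getEntry() picks the partition for the key and looks it up there;
      // this assumes the job used the default HashPartitioner.
      Text key = new Text(args[1]);
      FloatWritable value = new FloatWritable();
      MapFileOutputFormat.getEntry(readers, new HashPartitioner(), key, value);
      System.out.println(key + " -> " + value.get());

      for (MapFile.Reader r : readers) {
        r.close();
      }
    }
  }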

Second, the "current" subdirectory is there in order to properly handle
DB updates - or actually replacements - see e.g. the CrawlDb.install()
method for details. This is not needed for segments, which are created
once and never updated.
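
The pattern is roughly this - a from-memory sketch, not the actual Nutch
code, so check CrawlDb.install() for the real sequence of steps:

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class DbInstall {
    // Swap freshly-written output in as the new "current" db, keeping
    // the previous version around as "old". Names are illustrative.
    public static void install(FileSystem fs, Path newDb, Path db)
        throws Exception {
      Path current = new Path(db, "current");
      Path old = new Path(db, "old");
      if (fs.exists(current)) {
        if (fs.exists(old)) {
          fs.delete(old, true);     // drop the previous backup
        }
        fs.rename(current, old);    // keep the last good version
      }
      fs.rename(newDb, current);    // promote the new data
    }
  }

This way readers can always open <db>/current and see a complete db,
while updates are prepared off to the side in a temporary directory.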

Third, although you didn't ask about it ;) the latest version of Hadoop
contains a handy facility called Counters. If you compute PageRank with
the power method, you need to collect the PR mass from dangling nodes in
order to redistribute it later. You can use Counters for this, and save
on a separate aggregation step.
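
The idea, as an untested sketch (old mapred API; the enum name, the
dangling-node test and the scaling are all just illustrative - counters
only hold longs, so the float mass is stored in micro-units):

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.io.FloatWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  public class DanglingMass {
    public enum PR_COUNTER { DANGLING_MASS_MICROS }

    public static class Reduce extends MapReduceBase
        implements Reducer<Text, FloatWritable, Text, FloatWritable> {
      public void reduce(Text url, Iterator<FloatWritable> contribs,
          OutputCollector<Text, FloatWritable> out, Reporter reporter)
          throws IOException {
        float rank = 0f;
        while (contribs.hasNext()) {
          rank += contribs.next().get();
        }
        boolean dangling = false; // placeholder: true if url has no outlinks
        if (dangling) {
          // Accumulate the dangling mass in a counter instead of writing
          // it out and aggregating in a second job.
          reporter.incrCounter(PR_COUNTER.DANGLING_MASS_MICROS,
              (long) (rank * 1000000L));
        }
        out.collect(url, new FloatWritable(rank));
      }
    }

    // On the client, after JobClient.runJob() returns:
    public static float danglingMass(RunningJob job) throws IOException {
      long micros =
          job.getCounters().getCounter(PR_COUNTER.DANGLING_MASS_MICROS);
      return micros / 1000000f;
    }
  }

The collected mass can then feed into the next iteration without a
separate aggregation job.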


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

RE: Sequence File Question

sseveran
> -----Original Message-----
> From: Andrzej Bialecki [mailto:[hidden email]]
> Sent: Wednesday, March 28, 2007 4:34 PM
> To: [hidden email]
> Subject: Re: Sequence File Question
>
> Second, the "current" subdirectory is there in order to properly handle
> DB updates - or actually replacements - see e.g. the CrawlDb.install()
> method for details. This is not needed for segments, which are created
> once and never updated.

How does the reader know which layout to expect? For instance, I can make a reader for a linkdb just by instantiating it on the directory crawl/linkdb, and it knows to go inside the current directory. But when opening parse_data there is no current. So how does it know what to expect?

Steve

Re: Sequence File Question

Andrzej Białecki-2
Steve Severance wrote:
> How does the reader know which layout to expect? For instance, I can
> make a reader for a linkdb just by instantiating it on the directory
> crawl/linkdb, and it knows to go inside the current directory. But when
> opening parse_data there is no current. So how does it know what to
> expect?

Use The Source, Luke ;) It follows the (arbitrary) naming convention that
we always use a "current" subdirectory when working with LinkDb and
CrawlDb, and a different naming convention when we use SegmentReader.

One comment: CrawlDbReader, LinkDbReader and SegmentReader are Nutch
classes. However, the real data is stored using Hadoop classes,
specifically MapFileOutputFormat. CrawlDbReader knows about the Nutch
naming convention and always appends "current" to the db name. But if you
were to use MapFileOutputFormat.getReaders() directly, this Hadoop class
of course doesn't know about the convention, so you need to provide a
full path that includes "current".


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

RE: Sequence File Question

sseveran
Got it. I am going to document this on the wiki. Thanks.

Steve