File system


File system

oSilvio
Does anybody know how the file structure works, briefly?
It seems that the data is compressed or something; it's not possible to understand what's recorded in the data or index files.
Thanks
Silvio

Re: File system

Dennis Kubes-2
The Nutch databases are in either SequenceFile or MapFile format, both of
which store key/value pairs.  Their keys and values are Writable
implementations, which translate an object into its byte equivalent and
vice versa.
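
As a rough illustration of the Writable idea, here is a minimal sketch that uses only the JDK, not Hadoop itself. The `PageRecord` class and its two fields are hypothetical; only the `write`/`readFields` pairing is meant to mirror the pattern that Writable implementations follow.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record type (an assumption for illustration, not a Nutch class).
final class PageRecord {
    String url = "";
    int fetchStatus;

    // Serialize the fields to a binary stream in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeInt(fetchStatus);
    }

    // Restore the fields by reading them back in the same order.
    void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        fetchStatus = in.readInt();
    }

    // Object -> bytes.
    static byte[] toBytes(PageRecord record) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            record.write(new DataOutputStream(buffer));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Bytes -> object.
    static PageRecord fromBytes(byte[] bytes) {
        PageRecord record = new PageRecord();
        try {
            record.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return record;
    }

    public static void main(String[] args) {
        PageRecord original = new PageRecord();
        original.url = "http://example.com/";
        original.fetchStatus = 200;

        byte[] bytes = toBytes(original);
        PageRecord copy = fromBytes(bytes);
        System.out.println(bytes.length + " bytes, restored: " + copy.url + " " + copy.fetchStatus);
    }
}
```

The bytes produced this way are not human-readable text, which is probably why the files look compressed when opened directly.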

The data and index files together make up a MapFile.  The data file is a
SequenceFile; the index file is what the MapFile uses to seek to a specific key.
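
To show how a data file plus an index can support seeking to a key, here is a small in-memory sketch, again using only the JDK. `MapFileSketch`, its record layout, and the `interval` parameter are assumptions made for illustration; this is not Hadoop's on-disk format, only the same general idea of sorted records with a sparse index of key to offset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a MapFile: a "data" byte array of sorted
// key/value records plus a sparse "index" mapping some keys to offsets.
final class MapFileSketch {
    private final byte[] data;
    private final TreeMap<String, Integer> index = new TreeMap<>();

    // Writes the entries in key order and indexes every interval-th key.
    MapFileSketch(TreeMap<String, String> sorted, int interval) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        try {
            int n = 0;
            for (Map.Entry<String, String> e : sorted.entrySet()) {
                if (n % interval == 0) {
                    index.put(e.getKey(), out.size()); // offset where this record starts
                }
                out.writeUTF(e.getKey());
                out.writeUTF(e.getValue());
                n++;
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        data = buffer.toByteArray();
    }

    // Returns the value stored for key, or null if the key is absent.
    String get(String key) {
        // Find the closest indexed key at or before the requested key.
        Map.Entry<String, Integer> start = index.floorEntry(key);
        if (start == null) {
            return null; // key sorts before the first record
        }
        int offset = start.getValue();
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(data, offset, data.length - offset));
        try {
            // Scan forward from that offset until the key is found or passed.
            while (in.available() > 0) {
                String k = in.readUTF();
                String v = in.readUTF();
                int cmp = k.compareTo(key);
                if (cmp == 0) {
                    return v;
                }
                if (cmp > 0) {
                    return null;
                }
            }
            return null;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        TreeMap<String, String> pages = new TreeMap<>();
        pages.put("http://a.example/", "page A");
        pages.put("http://b.example/", "page B");
        pages.put("http://c.example/", "page C");
        MapFileSketch file = new MapFileSketch(pages, 2);
        System.out.println(file.get("http://b.example/"));
    }
}
```

Because only some keys are indexed, the index stays small, and a lookup costs one index search plus a short forward scan through the data.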

Please see the Hadoop wiki for more information about SequenceFile and
MapFile and the Writable formats.

Dennis


Re: File system

oSilvio
Very useful information, thanks!
But I can find no tool provided by Nutch for extracting the data inside those files (such as HTML pages), nor any description of the process used to store the data. Do you know if it is possible to extract it using Lucene?

 

Re: File system

oSilvio
I've seen it now, thanks for your attention.



Re: File system

Dennis Kubes-2
In reply to this post by oSilvio
If you are talking about the Nutch Content objects that are stored in the
segments while pages are fetched, then you would need to write a MapReduce
job to read in the Content objects and do whatever processing you desire.

Dennis
