Image Search Engine Input

Image Search Engine Input

sseveran
Hey all,
I am working on the basics of an image search engine. I want to ask for
feedback on something.

Should I create a new directory in a segment, parse_image, and then put the
images there? If not, where should I put them - in parse_text? I created a
class ImageWritable just like the Jira task said. This class contains image
metadata as well as two BytesWritable for the original image and the
thumbnail.
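
Roughly, I am thinking of something along these lines (an untested sketch;
the metadata fields are just placeholders until I settle on a real layout):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class ImageWritable implements Writable {
  // Placeholder metadata fields - the real set is still undecided.
  private Text mimeType = new Text();
  private int width;
  private int height;
  // Raw bytes of the original image and of a downscaled thumbnail.
  private BytesWritable original = new BytesWritable();
  private BytesWritable thumbnail = new BytesWritable();

  public void write(DataOutput out) throws IOException {
    mimeType.write(out);
    out.writeInt(width);
    out.writeInt(height);
    original.write(out);
    thumbnail.write(out);
  }

  public void readFields(DataInput in) throws IOException {
    mimeType.readFields(in);
    width = in.readInt();
    height = in.readInt();
    original.readFields(in);
    thumbnail.readFields(in);
  }
}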

One more question: what ramifications does that have for the type of Parse
that I am returning? Do I need to create a ParseImage class to hold it? The
actual parsing infrastructure is something that I am still studying, so any
ideas here would be great. Thanks,

Steve


RE: Image Search Engine Input

sseveran
So now that I have spent a few hours looking into how this works a lot more
deeply, I am in even more of a conundrum. The fetcher passes the contents of
the page to the parsers and assumes that text will be output from them; even
the SWF parser, for instance, returns text. For all binary data - images,
videos, music, etc. - this is problematic. Confounding the problem even
further, in the case of music, text and binary data can come from the same
file, though that is a problem I am not going to tackle.

So there are 3 choices for moving forward with an image search:

1. All image data can be encoded as strings. I really don't like that choice
since the indexer will index huge amounts of junk.
2. The fetcher can be modified to allow another output for binary data. This
I think is the better choice although it will be a lot more work. I am not
sure that this is possible with MapReduce since MapRunnable has only 1
output.
3. Images can be written into another directory for processing. This would
need more work to automate but is probably a non-issue.

I want to do the right thing so that the image search can eventually be in
the trunk. I don't want to have to change the way a lot of things work in
the process. Let me know what you all think.

Steve


Re: Image Search Engine Input

Mathijs Homminga
Hi Steve,

Good point.
We are also working on an image search. For the time being, we store the
parsed content (a downscaled version of the image) by replacing the
original content during parsing. Not an ideal solution, I know!

My first reaction is that your 2nd suggestion is the way to go.

On the other hand, we prefer to have our images outside the segments so
we can access and modify them more easily (fast retrieval at search time,
for presentation, is a must). So we were thinking of some kind of image
DB using Berkeley DB from Sleepycat (now Oracle).
Our indexer doesn't need the actual images themselves; it works on a
fingerprint which is computed at parse time and stored in the document's
metadata as a string.
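
Just to illustrate the idea (this is not our actual fingerprint code, and a
plain hash is only a stand-in for whatever image signature you end up using):

import java.security.MessageDigest;

public class ImageFingerprint {
  // Returns a short hex string computed from the image bytes at parse time,
  // suitable for storing in the document's metadata.
  public static String fingerprint(byte[] imageBytes) throws Exception {
    MessageDigest md = MessageDigest.getInstance("MD5");
    byte[] digest = md.digest(imageBytes);
    StringBuilder sb = new StringBuilder();
    for (byte b : digest) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }
}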

Mathijs


Re: Image Search Engine Input

Andrzej Białecki-2
In reply to this post by sseveran
Steve Severance wrote:
> So now that I have spent a few hours looking into how this works a lot more
> deeply, I am in even more of a conundrum. The fetcher passes the contents of
> the page to the parsers and assumes that text will be output from them; even
> the SWF parser, for instance, returns text. For all binary data - images,
> videos, music, etc. - this is problematic. Confounding the problem even
> further, in the case of music, text and binary data can come from the same
> file, though that is a problem I am not going to tackle.


Well, Nutch was originally intended as a text search engine. Lucene is a
text search library, too - so all it knows is plain text. If you want to
use Nutch/Lucene for searching you will need to bring your data to a plain
text format - at least the parts that you want to search against.

Now, when it comes to metadata or other associated binary data, I'm sure we
can figure out a way to store it outside the Lucene index, in a similar way
to how the original content and parseData are already stored outside the
Lucene indexes.

-------

I've been thinking about an extension to the current "segment" format,
which would allow arbitrary parts to be created (and retrieved) - this
is actually needed to support a real-life application. It's a simple
extension of the current model. Currently segments consist of a fixed
number of pre-defined parts (content, crawl_generate, crawl_fetch,
parse_data, parse_text). But it shouldn't be too difficult to extend
segment tools and NutchBean to handle segments consisting of these basic
parts plus other arbitrary parts.

In your case: you could have an additional segment part that stores
post-processed images in binary format (you already have the original
ones in content/). Another example: we could convert PDF/DOC/PPT files
to HTML, and store this output in the "HTML preview" part.
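
To make this concrete, an extra part could be as simple as another MapFile
directory inside the segment, keyed by URL. A rough sketch only - "parse_image"
is just a name, and ImageWritable is the class Steve described earlier:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class ImagePartExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // e.g. crawl/segments/<segment>/parse_image/part-00000,
    // alongside content/, parse_data/, parse_text/ ...
    String part = args[0] + "/parse_image/part-00000";

    // Writing the part: url -> ImageWritable, appended in key order
    // like the other segment parts.
    MapFile.Writer writer =
        new MapFile.Writer(conf, fs, part, Text.class, ImageWritable.class);
    // writer.append(new Text(url), imageWritable); for each image
    writer.close();

    // Retrieval (say, from NutchBean) is then a keyed lookup:
    MapFile.Reader reader = new MapFile.Reader(fs, part, conf);
    ImageWritable img = new ImageWritable();
    reader.get(new Text("http://example.com/logo.png"), img);
    reader.close();
  }
}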


>
> So there are 3 choices for moving forward with an image search,
>
> 1. All image data can be encoded as strings. I really don't like that choice
> since the indexer will index huge amounts of junk.
> 2. The fetcher can be modified to allow another output for binary data. This
> I think is the better choice although it will be a lot more work. I am not
> sure that this is possible with MapReduce since MapRunnable has only 1
> output.

No, not really - the number of output files is defined in the
implementation of OutputFormat - but it's true that you can only set a
single output location (and then you have to figure out how you want to
put various stuff relative to that single location). There are existing
implementations of OutputFormat-s that create more than 1 file at the
same time - see ParseOutputFormat.
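
To sketch the idea (this is not the actual ParseOutputFormat code, and I'm
leaving out the OutputFormat/RecordWriter plumbing since its exact signature
depends on the Hadoop version): the record writer simply keeps several files
open under the single output location and writes to each of them per record.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Fans one logical output out into two files under the same output
// directory, the way ParseOutputFormat produces several parts at once.
public class TwoFileWriter {
  private final SequenceFile.Writer textOut;
  private final SequenceFile.Writer imageOut;

  public TwoFileWriter(FileSystem fs, Configuration conf, Path outDir, String name)
      throws IOException {
    textOut = SequenceFile.createWriter(fs, conf,
        new Path(new Path(outDir, "parse_text"), name), Text.class, Text.class);
    imageOut = SequenceFile.createWriter(fs, conf,
        new Path(new Path(outDir, "parse_image"), name), Text.class, ImageWritable.class);
  }

  public void write(Text url, Text parsedText, ImageWritable image) throws IOException {
    textOut.append(url, parsedText);
    if (image != null) {
      imageOut.append(url, image);
    }
  }

  public void close() throws IOException {
    textOut.close();
    imageOut.close();
  }
}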


> 3. Images can be written into another directory for processing. This would
> need more work to automate but is probably non-issue.
>
> I want to do the right thing so that the image search can eventually be in
> the trunk. I don't want to have to change the way a lot of things work in
> the process. Let me know what you all think.

I think we should work together on proposed API changes to this
"extensible part" interface, plus probably some changes to the Parse
API. I can create a JIRA issue and provide some initial patches.


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


RE: Image Search Engine Input

sseveran
Hey guys. Thanks for the replies.

> -----Original Message-----
> From: Andrzej Bialecki [mailto:[hidden email]]
> Sent: Tuesday, March 27, 2007 3:52 AM
> To: [hidden email]
> Subject: Re: Image Search Engine Input
>
> Steve Severance wrote:
> > So now that I have spent a few hours looking into how this works a lot
> > more deeply, I am in even more of a conundrum. The fetcher passes the
> > contents of the page to the parsers and assumes that text will be output
> > from them; even the SWF parser, for instance, returns text. For all binary
> > data - images, videos, music, etc. - this is problematic. Confounding the
> > problem even further, in the case of music, text and binary data can come
> > from the same file, though that is a problem I am not going to tackle.
>
>
> Well, Nutch was originally intended as a text search engine. Lucene is a
> text search library, too - so all it knows is plain text. If you want to
> use Nutch/Lucene for searching you will need to bring your data to a plain
> text format - at least the parts that you want to search against.
>
> Now, when it comes to metadata or other associated binary data, I'm sure we
> can figure out a way to store it outside the Lucene index, in a similar way
> to how the original content and parseData are already stored outside the
> Lucene indexes.

I am not really looking to make an image retrieval engine. During indexing, referencing docs will be analyzed and text content will be associated with the image. Currently I want to keep this in a separate index. So despite the fact that images will be returned, the search will be against text data.

>
> -------
>
> I've been thinking about an extension to the current "segment" format,
> which would allow arbitrary parts to be created (and retrieved) - this
> is actually needed to support a real-life application. It's a simple
> extension of the current model. Currently segments consist of a fixed
> number of pre-defined parts (content, crawl_generate, crawl_fetch,
> parse_data, parse_text). But it shouldn't be too difficult to extend
> segment tools and NutchBean to handle segments consisting of these basic
> parts plus other arbitrary parts.
>
> In your case: you could have an additional segment part that stores
> post-processed images in binary format (you already have the original
> ones in content/). Another example: we could convert PDF/DOC/PPT files
> to HTML, and store this output in the "HTML preview" part.
>

Then it would be possible for plugins to talk to additional directories. That would be great.

>
> >
> > So there are 3 choices for moving forward with an image search:
> >
> > 1. All image data can be encoded as strings. I really don't like that
> > choice since the indexer will index huge amounts of junk.
> > 2. The fetcher can be modified to allow another output for binary data.
> > This I think is the better choice although it will be a lot more work.
> > I am not sure that this is possible with MapReduce since MapRunnable has
> > only 1 output.
>
> No, not really - the number of output files is defined in the
> implementation of OutputFormat - but it's true that you can only set a
> single output location (and then you have to figure out how you want to
> put various stuff relative to that single location). There are existing
> implementations of OutputFormat-s that create more than 1 file at the
> same time - see ParseOutputFormat.

Yeah, I got that. I just don't want there to be another implementation that has to be maintained, or to add images directly into the output format. What happens when someone wants to do music or videos? Are we going to add those as well? I don't think we should go down that road, but if I am wrong let me know.

>
>
> > 3. Images can be written into another directory for processing. This
> > would need more work to automate but is probably a non-issue.
> >
> > I want to do the right thing so that the image search can eventually be
> > in the trunk. I don't want to have to change the way a lot of things
> > work in the process. Let me know what you all think.
>
> I think we should work together on proposed API changes to this
> "extensible part" interface, plus probably some changes to the Parse
> API. I can create a JIRA issue and provide some initial patches.
>

I like Mathijs's suggestion about using a DB for holding thumbnails. I just want access to be in constant time since I am probably going to need to grab at least 10 and maybe 50 for each query. That can be kept in the plugin as an option or something like that. Does that have any ramifications for being run on Hadoop?

To sum up, I think we are going to make an extensible interface to allow parse plugins to write to directories other than the ones that currently exist. Please correct me if that is wrong.
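
For the record, what I imagine (purely hypothetical - none of these names
exist, this is just to make the summary concrete) is a hook along these lines
that parse plugins could ask for:

import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical sketch of an "extensible segment part" hook for parse
// plugins; every name here is made up for discussion purposes.
public interface SegmentPartWriter {
  // Appends a record to the named part (e.g. "parse_image") of the
  // segment currently being written.
  void append(String partName, WritableComparable key, Writable value)
      throws IOException;

  void close() throws IOException;
}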

Regards,

Steve

>
> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com


Re: Image Search Engine Input

Doug Cutting
In reply to this post by sseveran
Steve Severance wrote:
> I am not really looking to make an image retrieval engine. During indexing, referencing docs will be analyzed and text content will be associated with the image. Currently I want to keep this in a separate index. So despite the fact that images will be returned, the search will be against text data.

So do you just want to be able to reference the cached images?  In that
case, I think the images should stay in the content directory and be
accessed like cached pages.  The parse should just contain enough
metadata to index so that the images can be located in the cache.  I
don't see a reason to keep this in a separate index, but perhaps a
separate field instead?  Then when displaying hits you can look up
associated images and display them too.  Does that work?
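
For instance (a hypothetical sketch, not actual Nutch indexing filter code,
and the field names are made up), the indexed document could carry the
associated text plus stored-only fields that locate the image in the cache:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class ImageDocExample {
  // Builds a Lucene document whose searchable text comes from referencing
  // pages, with stored fields that let the UI fetch the cached image.
  public static Document makeDoc(String imageUrl, String segmentName, String anchorText) {
    Document doc = new Document();
    doc.add(new Field("anchor", anchorText, Field.Store.NO, Field.Index.TOKENIZED));
    doc.add(new Field("image_url", imageUrl, Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("segment", segmentName, Field.Store.YES, Field.Index.NO));
    return doc;
  }
}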

Steve Severance wrote:
> I like Mathijs's suggestion about using a DB for holding thumbnails. I just want access to be in constant time since I am probably going to need to grab at least 10 and maybe 50 for each query. That can be kept in the plugin as an option or something like that. Does that have any ramifications for being run on Hadoop?

I'm not sure how a database solves scalability issues.  It seems to me
that thumbnails should be handled similarly to summaries: they should
be retrieved in parallel from segment data in a separate pass once the
final set of hits to be displayed has been determined.  Thumbnails could
be placed in a directory per segment by a separate MapReduce pass.  I
don't see this as a parser issue, although perhaps it could be
piggybacked on the MapReduce pass that also processes content.
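
Not a concrete proposal, just a sketch of the scaling step such a per-segment
pass would need (the MapReduce wiring, and where the bytes come from, are
left out):

import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.imageio.ImageIO;

public class Thumbnailer {
  // Scales raw image bytes (as stored in content/) down to at most maxDim
  // pixels on the longest side and returns JPEG bytes for a segment part.
  public static byte[] thumbnail(byte[] imageBytes, int maxDim) throws Exception {
    BufferedImage src = ImageIO.read(new ByteArrayInputStream(imageBytes));
    double scale = Math.min(1.0,
        (double) maxDim / Math.max(src.getWidth(), src.getHeight()));
    int w = Math.max(1, (int) (src.getWidth() * scale));
    int h = Math.max(1, (int) (src.getHeight() * scale));

    BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
    Graphics2D g = dst.createGraphics();
    g.drawImage(src, 0, 0, w, h, null);
    g.dispose();

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    ImageIO.write(dst, "jpg", out);
    return out.toByteArray();
  }
}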

Doug

Re: Image Search Engine Input

Trey Spiva-2
Hello, I am new to Nutch and Hadoop, so sorry if this question is very basic.

If you store the image in the content directory, how is a web page able to reference the image?  From what I understand, when you use Hadoop the files are spread out across DataNode machines.  All communication goes through the NameNode, which redirects to the DataNodes (I may not be correct here, but this is how I understand things work).  So how do you get a path to the image to reference in the web page?

Again, sorry if I am completely off base in my understanding of how things work.