Image Search

Image Search

ocramp
Hi Everybody,

 I've got Nutch to index images, searching on their URL and their alt and
title tags.
 But the problem comes when storing the thumbnails.
 I've indexed 3 million images for a national search engine.
 I am unsure whether to use a file system scheme or a database to store the
thumbnails.
 The thumbnails are created with a script that gets the image URLs from the
Nutch index by doing a search for http (search.jsp?query=http).

 Do you have any tips or ideas on this?

Thank you,
Marco

Re: Image Search

Stefan Groschupf-2
Hi,
Searching for http is a bad idea, since you get many but not all pages.
Just write a Hadoop MapReduce job that processes the fetched content
in your segments; that should be easy.
Storing images in a file system will be very slow as soon as you have
too many.
I personally don't like databases, since compared to Nutch they are as
slow as a snail.
For another project, also related to images, I created my own
ImageWritable that contained the binary data of a compressed image
combined with some metadata.
If you use a MapFile, finding an image based on a key should be very
fast, I think much faster than a database with binary content.

HTH
Stefan
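
The MapFile idea described above can be sketched roughly as follows. This is
only an illustration in Python, not the Hadoop API and not Stefan's
ImageWritable: the record layout, the function names write_records and
lookup, and the INDEX_INTERVAL value are all assumptions made for the sketch.
It shows the general technique: records are stored sorted by key, a sparse
index of keys and byte offsets is kept in memory, and a lookup does a binary
search on the index followed by a short forward scan.

```python
import bisect
import io
import json
import struct

INDEX_INTERVAL = 2  # hypothetical value for this sketch: index every 2nd record


def write_records(out, records):
    """Write (key, meta, image_bytes) tuples sorted by key; return a sparse index."""
    index = []  # (key, byte offset) for every INDEX_INTERVAL-th record
    for i, (key, meta, image) in enumerate(sorted(records, key=lambda r: r[0])):
        if i % INDEX_INTERVAL == 0:
            index.append((key, out.tell()))
        k = key.encode("utf-8")
        m = json.dumps(meta).encode("utf-8")
        # Length-prefixed fields: key, metadata, compressed image bytes.
        out.write(struct.pack(">III", len(k), len(m), len(image)))
        out.write(k + m + image)
    return index


def lookup(data, index, key):
    """Binary-search the sparse index, then scan forward in the data."""
    pos = bisect.bisect_right([k for k, _ in index], key) - 1
    if pos < 0:
        return None  # key sorts before the first record
    data.seek(index[pos][1])
    while True:
        header = data.read(12)
        if len(header) < 12:
            return None  # reached the end without finding the key
        klen, mlen, ilen = struct.unpack(">III", header)
        k = data.read(klen).decode("utf-8")
        meta = data.read(mlen)
        image = data.read(ilen)
        if k == key:
            return json.loads(meta), image
        if k > key:
            return None  # records are sorted, so the key is absent


# Usage with made-up URLs and placeholder bytes instead of real thumbnails.
store = io.BytesIO()
idx = write_records(store, [
    ("http://example.com/b.jpg", {"width": 120}, b"\xff\xd8b"),
    ("http://example.com/a.jpg", {"width": 100}, b"\xff\xd8a"),
    ("http://example.com/c.jpg", {"width": 90}, b"\xff\xd8c"),
])
found = lookup(store, idx, "http://example.com/b.jpg")
```

Because only every INDEX_INTERVAL-th key is held in memory, the index stays
small even for millions of thumbnails, at the cost of a short scan per lookup.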





---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com



Re: Image Search

Thomas Delnoij-3
I am interested in developing such a solution as well.

I am currently storing the thumbnails on the file system under a
system-generated name. My indexing plugin stores the filename in the
index. Thumbnails are later served to the client by a separate Apache
HTTP server. This required some changes but is otherwise pretty
straightforward and performs very well for my current 300,000+
images, around 15 KB each.

If you are developing the more "Nutch-like" solution, I could
contribute to that. For instance, I have some code that generates the
thumbs using ImageJ and yields very good results.

But I would definitely need some guidance in writing the Hadoop
MapReduce job. We could even contribute this back and base a small
tutorial on this work.

What do you think?

Rgrds, Thomas
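
One way a file system scheme with system-generated names might be laid out is
sketched below. The thread does not say how Thomas generates his names, so
everything here is an assumption: the function name thumbnail_path, the
/var/thumbs root, the SHA-1 hash of the image URL, and the two-level
directory split are all hypothetical. The point is only that spreading files
over many subdirectories keeps each directory small.

```python
import hashlib
from pathlib import PurePosixPath


def thumbnail_path(image_url: str, root: str = "/var/thumbs") -> PurePosixPath:
    """Map an image URL to a stable, sharded thumbnail path (hypothetical layout)."""
    digest = hashlib.sha1(image_url.encode("utf-8")).hexdigest()
    # Two levels of 256 directories each, taken from the first hex digits.
    return PurePosixPath(root) / digest[:2] / digest[2:4] / f"{digest}.jpg"


path = thumbnail_path("http://example.com/a.jpg")
```

With 256 x 256 = 65,536 leaf directories, 300,000 thumbnails average fewer
than 5 files per directory, and 3 million average about 46.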


Re: Image Search

Thomas Delnoij-3
BTW: the generation and storing of the thumbnails is done in the
ParseFilter. It is quite easy to retrieve the URLs to Image files from
the Outlinks using regular expressions. Then the generated file name
is added to the metadata to be later retrieved by the IndexingFilter.
No need for any separate scripts.

Rgrds Thomas
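
Picking image URLs out of a page's outlinks with a regular expression could
look something like this. It is a sketch in Python rather than a Nutch
ParseFilter, and the pattern, the extension list, and the function name
image_outlinks are assumptions, not Thomas's actual code.

```python
import re

# Hypothetical pattern: http(s) URLs whose path ends in a common image
# extension, optionally followed by a query string.
IMAGE_URL = re.compile(r"^https?://\S+\.(?:jpe?g|png|gif)(?:\?\S*)?$", re.IGNORECASE)


def image_outlinks(outlink_urls):
    """Return only the outlink URLs that look like image files."""
    return [url for url in outlink_urls if IMAGE_URL.match(url)]


images = image_outlinks([
    "http://example.com/a.JPG",
    "http://example.com/page.html",
    "http://example.com/b.png?w=100",
    "http://example.com/gif-gallery/",
])
```

Matching on the extension alone misses images served from URLs without one,
so a real filter would probably also need to check the content type.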


Re: Image Search

Stefan Groschupf-2
In reply to this post by Thomas Delnoij-3
Having an image search component for Nutch would be nice.
However, I think we need to implement this as a kind of separate tool
outside of the Nutch code itself, since it cannot be fully integrated
into the Nutch code.
(E.g., Nutch defines one URL == one index document.)
Maybe this would be a nice project for a Nutch sandbox.
If you like, you can open an issue to request a Nutch sandbox project
"image search".
If we get enough people to vote for this issue, we may have a chance to
get it created.

Stefan


Re[2]: Image Search

Nuther
Hi, Stefan.

That would be great!
I think many people would vote for this.
Since Nutch is a really powerful search engine, it would be nice to
see several types of search in it.

--
Regards,
 Dima


Re: Re[2]: Image Search

Zaheed Haque
Yes! I am very interested.

Regards



RE: Re[2]: Image Search

Dan Morrill-3
Sounds like everyone, me included, is interested in being able to provide
this service.

If the process requires that we break it off from the Nutch code, what
would be required to make this happen?

r/d


Re: Re[2]: Image Search

Thomas Delnoij-3
Ok, I created a Jira Issue for this:

http://issues.apache.org/jira/browse/NUTCH-296

I did not assign the Issue to any component. Maybe we can have a
"Sandbox" component?

Now, the question is how we can support several people working on this
from a "project management" or code management perspective?

I mean, if we want the Sandbox to flourish, we need some kind of
infrastructure, right?

Rgrds, Thomas Delnoij




RE: Re[2]: Image Search

Dan Morrill-3
Well, I can do the project management side of it and can volunteer some
time, but I have never done this in an open source model before. I can do
documentation and project management support, and I make a decent
cheerleader as well.

Let me know.
r/d


Re: Image Search

Thomas Delnoij-3
In reply to this post by Stefan Groschupf-2
> (E.G. Nutch define one url == one index document.)

Why can't we create a document for every image that is found?

Then it is as if we will have a parse-image plugin just like we have a
parse-html and parse-pdf plugin, with the only difference that it will
be run after all the pages in the segment have been fetched?
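
To make the "one document per image" idea concrete, here is a rough sketch in plain Python. It is not a Nutch plugin and uses no Nutch or Hadoop APIs; `ImageDocExtractor` and `image_documents` are hypothetical names invented for this illustration. It only shows the shape of the data: each <img> tag on a fetched page becomes its own record, keyed by the absolute image URL, with the alt and title text as searchable fields.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageDocExtractor(HTMLParser):
    """Collects one record per <img> tag in a page (hypothetical helper)."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.docs = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <img ... />.
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src")
        if not src:
            return  # nothing to key the document on
        self.docs.append({
            "url": urljoin(self.page_url, src),  # resolve relative image URLs
            "alt": attr.get("alt") or "",
            "title": attr.get("title") or "",
            "page": self.page_url,               # page the image was found on
        })


def image_documents(page_url, html):
    """Return a list of per-image records for one fetched page."""
    parser = ImageDocExtractor(page_url)
    parser.feed(html)
    parser.close()
    return parser.docs
```

In a real setup this step would presumably run over the fetched content of a segment, and each record would then be indexed as a separate document.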

Rgrds, Thomas

Re: Re[2]: Image Search

Thomas Delnoij-3
In reply to this post by Dan Morrill-3
Dan - this sounds really good! Participation in an Open Source project
is new to me as well, but hey, that's why we get to start in the
sandbox :)

I was also thinking about source control. We definitely need a
repository, don't you think?

Rgrds, Thomas

On 6/3/06, Dan Morrill <[hidden email]> wrote:

> Well I can do the project management side of it, and can volunteer some
> time, but have never done this in an open source model before. But I can do
> documentation, project management support, and make a decent cheerleader as
> well.
>
> Let me know.
> r/d

Re[2]: Image Search

Nuther
In reply to this post by Thomas Delnoij-3
Hi, TDLN.

But how will the image data be stored in the Nutch database?
Would it affect the rest of the data in it?
>> (E.g., Nutch defines one URL == one index document.)

> Why can't we create a document for every image that is found?

> Then it is as if we will have a parse-image plugin just like we have a
> parse-html and parse-pdf plugin, with the only difference that it will
> be run after all the pages in the segment have been fetched?

> Rgrds, Thomas




--
Regards,
 Dima                          mailto:[hidden email]


RE: Re[2]: Image Search

Dan Morrill-3
In reply to this post by Thomas Delnoij-3
We can use SourceForge for CVS hosting, but again it would have to be blessed
to make sure we are not stepping on anyone's territory here. Can we get an
official blessing?

r/d

-----Original Message-----
From: TDLN [mailto:[hidden email]]
Sent: Saturday, June 03, 2006 10:25 AM
To: [hidden email]
Subject: Re: Re[2]: Image Search

Dan - this sounds really good! Participation in an Open Source project
is new to me as well, but hey, that's why we get to start in the
sandbox :)

I was also thinking about source control. We definitely need a
repository, don't you think?

Rgrds, Thomas


Re: Re[2]: Image Search

Stefan Groschupf-2

> We can use source forge as the cvs,

In the worst case we can use SourceForge. However, I would prefer to wait
and hear what Doug thinks about having a sandbox repository in the Nutch
svn with limited access.


Re: Re[2]: Image Search

Thomas Delnoij-3
In reply to this post by Nuther
Dima.

I think there are several issues that need to be thought through
thoroughly before we can implement this.

I created a Wiki page to discuss the design:

http://wiki.apache.org/nutch/Image_Search_Design

Writing a MapReduce job is completely new to me, so with my limited
knowledge in this area I cannot answer your question.

Anyway, I think now is the time to read the Hadoop MapReduce code :)

Rgrds, Thomas



On 6/3/06, Dima Mazmanov <[hidden email]> wrote:

> Hi,TDLN.
>
> But how will the image data be stored in the Nutch database?
> Would it affect the rest of the data in it?

Re: Image Search

srampl
In reply to this post by ocramp
I know how to create a text search engine using Nutch, but now I need to create an image search index and an image search engine like Google's. If anybody has an idea about this, please guide me on how to do it. I need it very urgently.

I am waiting for your reply.

Thanks in advance.

Re: Image Search

sumittyagi
In reply to this post by ocramp
Hi Everybody,
I have to parse and index the alt tags using Nutch, but I can't figure out how to do that.
Please help me with this.

thanks
sumit tyagi
ping.sumit@gmail.com

