Index all the files in a directory

Index all the files in a directory

Kostas Vel
Hi, I have a problem that has more to do with Java than with Lucene.
I have a folder of about 524 text files (.txt) that I want to index.
I have written a program that works very well; it does indexing, searching,
etc.
          ...
          IndexWriter writer;
          GreekAnalyzer anal = new GreekAnalyzer();
          try {
            writer = new IndexWriter(indexPath,anal, true);
            for (int i = 1; i <= 524; i++) {

              File indexFile = new File("/database/" + i + ".txt");
              InputStream fis = new FileInputStream(indexFile);
              /*
               * We use a Document with four fields: the file path, the last
               * modified date, the length of the file, and the file's contents.
               */
              Document doc = new Document();

              doc.add(Field.UnIndexed("path", indexFile.getPath()));
              doc.add(Field.UnIndexed("length", Long.toString(indexFile.length())));
              doc.add(Field.UnIndexed("modified", new Date(indexFile.lastModified()).toString()));
              doc.add(Field.Text("text", (Reader) new InputStreamReader(fis)));

              writer.addDocument(doc);
              fis.close();
            }
            ....
My problem is that I want to get rid of the for loop and have the program
read every file in the folder I want to index.
I have other folders that also contain .txt files, but not exactly 524 of
them, so I cannot use the hard-coded for loop.
I hope you understand what I mean, because my English is not very good.

Thank you very much
Kostas

RE: Index all the files in a directory

Nick Vincent-2
Hi Kostas,

I am assuming you need to find all of the text files in the directory.
Assuming you're using JDK1.5 you need to do something like this:

// Backslashes must be escaped inside a Java string literal.
File directory = new File("c:\\my\\directory");
// Anonymous class implementing the FilenameFilter interface.
FilenameFilter findTextFiles = new FilenameFilter() {
        public boolean accept(File dir, String name)
        {
                return name.endsWith(".txt");
        }
};
for( File indexFile : directory.listFiles(findTextFiles) )
{
        // ...do Lucene indexing
}

If you are not using JDK1.5 you can do the same using normal for loops.

Nick
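
A minimal sketch of the same listing written with an indexed for loop, for JDKs that predate the 1.5 for-each syntax. The class name TextFileLister and its method names are made up for illustration; only File, FilenameFilter and listFiles come from the JDK. It also shows that FilenameFilter is an interface that the anonymous class implements, and that accept is called by listFiles rather than by your own code.

```java
import java.io.File;
import java.io.FilenameFilter;

// Hypothetical helper class, named here only for illustration.
class TextFileLister {
    // "new FilenameFilter() { ... }" declares an anonymous class that
    // implements the interface and creates one instance of it.
    static final FilenameFilter TEXT_FILES = new FilenameFilter() {
        // listFiles calls this once per directory entry and passes in the
        // entry's name; you do not call it yourself.
        public boolean accept(File dir, String name) {
            return name.endsWith(".txt");
        }
    };

    // Returns the .txt files directly inside dir, or an empty array when
    // dir does not exist or cannot be listed (listFiles returns null then).
    static File[] listTextFiles(File dir) {
        File[] found = dir.listFiles(TEXT_FILES);
        return found == null ? new File[0] : found;
    }

    // Indexed loop equivalent of "for (File indexFile : files)".
    static String[] textFilePaths(File dir) {
        File[] files = listTextFiles(dir);
        String[] paths = new String[files.length];
        for (int i = 0; i < files.length; i++) {
            File indexFile = files[i];
            // ...do Lucene indexing with indexFile here
            paths[i] = indexFile.getPath();
        }
        return paths;
    }
}
```

The null check is there because listFiles returns null, not an empty array, when the path is not a readable directory.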

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


RE: Index all the files in a directory

Kostas Vel
Dear Nick,
Thank you for your reply, but I still face some problems.
I saw the API for FilenameFilter... but I can't instantiate it because it's
abstract.

The accept method has a parameter called name. Where does it come from? And
how do I call this method? Finally, what does the " : " in the for loop mean?
Sorry, but if you could be more specific, it would help me a lot.

Thank you in advance

Kostas





Re: Index all the files in a directory

Erick Erickson
Maybe something like this will work.... It just recursively descends from
the root passed to parseAll, processing each .txt file it finds.


    // Parameter is just the file path of the parent directory.
    public void parseAll(String sParentDir) throws Exception {
        indexTree(new File(sParentDir));
    }

    private void indexTree(File file) throws Exception {
        if (file.canRead()) {
            if (file.isDirectory()) {
                // list() returns null if the directory cannot be listed.
                String[] files = file.list();
                if (files != null) {
                    for (int idx = 0; idx < files.length; idx++) {
                        indexTree(new File(file, files[idx]));
                    }
                }
            } else {
                // Skip any file whose name does not contain ".txt".
                if (file.getName().toLowerCase().indexOf(".txt") == -1)
                    return;
                try {
                    // index your file here.
                } catch (Exception e) {
                    // handle error here.
                }
            }
        }
    }



Erick

Re: Index all the files in a directory

Erick Erickson
P.S.

if you're not using 1.5......

RE: Index all the files in a directory

Kostas Vel
Dear Erick (and everyone who might read my post), hi, and thank you very much
for your help.
I've tried your program, but I am doing something wrong and it only indexes
the last .txt file in the directory. Could you take a look at my problem? I
am new to Java and to Lucene as well.
Thanks in advance

Kostas


    public void parseAll() throws Exception {
        String sParentDir = "/database";

        indexTree(new File(sParentDir));
    }

    private void indexTree(File file) throws Exception {

        if (file.canRead()) {
            if (file.isDirectory()) {
                String[] files = file.list();
                if (files != null) {
                    for (int idx = 0; idx < files.length; idx++) {
                        indexTree(new File(file, files[idx]));
                    }
                }
            } else {
                if (file.getName().toLowerCase().indexOf(".txt") == -1)
                    return;
                String indexPath = "/portal-index";
                IndexWriter writer;
                GreekAnalyzer ana = new GreekAnalyzer();
                try {
                    writer = new IndexWriter(indexPath, ana, true);

                    File indexFile = new File(""+file);
                    InputStream fis = new FileInputStream(indexFile);

                    Document doc = new Document();

                    doc.add(Field.UnIndexed("path", indexFile.getPath()));
                    doc.add(Field.UnIndexed("length",
                                            Long.toString(indexFile.length())));
                    doc.add(Field.UnIndexed("modified",
                                            new Date(indexFile.lastModified()).toString()));
                    doc.add(Field.Text("text", (Reader) new InputStreamReader(fis)));

                    writer.addDocument(doc);
                    writer.optimize();
                    fis.close();

                    writer.close();

                } catch (IOException ioe) {
                    System.out.println(ioe);
                }
            }
        }
    }



Re: Index all the files in a directory

Erik Hatcher
You are opening a new IndexWriter with create set to true for every file,
which overwrites the index each time, so only the last file ends up indexed.
You should open your IndexWriter once, outside the recursion, and close it
afterwards.

        Erik
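
A rough sketch of that advice, kept free of Lucene so it compiles on its own. DocumentSink, RecordingSink and TreeIndexer are hypothetical names, not Lucene classes: DocumentSink stands in for the IndexWriter, which in the real program would be created once with new IndexWriter(indexPath, ana, true), and its add method stands in for building the Document and calling writer.addDocument(doc). The point is only the shape: the caller opens the writer once, the recursion receives it as a parameter, and it is closed once at the end.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Lucene's IndexWriter; not part of Lucene.
interface DocumentSink {
    void add(File file) throws IOException;
    void close() throws IOException;
}

// Simple implementation that only records which files it was given.
class RecordingSink implements DocumentSink {
    final List<String> added = new ArrayList<String>();
    boolean closed = false;

    public void add(File file) { added.add(file.getName()); }
    public void close() { closed = true; }
}

class TreeIndexer {
    // The sink is passed in, so no writer is created per file.
    static void indexTree(File file, DocumentSink sink) throws IOException {
        if (!file.canRead()) {
            return;
        }
        if (file.isDirectory()) {
            String[] names = file.list();
            if (names == null) {
                return;
            }
            for (int idx = 0; idx < names.length; idx++) {
                indexTree(new File(file, names[idx]), sink);
            }
        } else if (file.getName().toLowerCase().endsWith(".txt")) {
            sink.add(file);
        }
    }

    // Walks the whole tree, then closes the sink exactly once,
    // even if indexing one of the files throws.
    static void indexAll(File root, DocumentSink sink) throws IOException {
        try {
            indexTree(root, sink);
        } finally {
            sink.close();
        }
    }
}
```

In the program from the thread, this would mean moving the IndexWriter construction and the optimize() and close() calls out of indexTree and into parseAll, around the single top-level indexTree call.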


On Apr 30, 2006, at 6:42 AM, Kostas V. wrote:

> Dear Erick (and everyone who might read my post), hi, and thank you very much
> for your help.
> I've tried your program, but I am doing something wrong and it only indexes
> the last .txt file in the directory. Could you take a look at my problem? I
> am new to Java and to Lucene as well.
> Thanks in advance