[jira] [Commented] (TIKA-2672) Upgrade dl4j to 1.0.0-beta


JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537286#comment-16537286 ]

ASF GitHub Bot commented on TIKA-2672:
--------------------------------------

chrismattmann commented on issue #241: Fix for TIKA-2672
URL: https://github.com/apache/tika/pull/241#issuecomment-403553789
 
 
   OK, tested VGG16: it looks awesome and works. FYI, I had to run `rm -rf $HOME/.tika-dl/` first, and folks may also want to run `rm -rf $HOME/.deeplearning4j*`:
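
The cleanup step can also be scripted. The following is a minimal sketch, not part of Tika: the function name `clear_model_caches` and the `include_dl4j` flag are hypothetical, and the demonstration runs against a temporary directory instead of the real home directory. In the second run below, the DL4J cache under `.deeplearning4j` was reused and passed its checksum, so the sketch only removes `.tika-dl` by default.

```python
import shutil
import tempfile
from pathlib import Path


def clear_model_caches(home: Path, include_dl4j: bool = False) -> list:
    """Delete cached model data under `home` and return the paths removed.

    Hypothetical helper: `.tika-dl` is always targeted; anything matching
    `.deeplearning4j*` is targeted only when `include_dl4j` is True.
    """
    targets = [home / ".tika-dl"]
    if include_dl4j:
        targets.extend(sorted(home.glob(".deeplearning4j*")))
    removed = []
    for target in targets:
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(target)
        elif target.exists():
            target.unlink()
            removed.append(target)
    return removed


# Demonstration against a throwaway directory, never the real home.
with tempfile.TemporaryDirectory() as tmp:
    fake_home = Path(tmp)
    (fake_home / ".tika-dl" / "models").mkdir(parents=True)
    (fake_home / ".deeplearning4j" / "models").mkdir(parents=True)
    print([p.name for p in clear_model_caches(fake_home)])
    print((fake_home / ".deeplearning4j").exists())
```

To use it for real, pass `Path.home()` and decide whether the DL4J download cache should go too; removing it forces the model to be fetched again.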
   
   ## VGG16 server outputs:
   
   ```
   nonas:tika2.0.0 mattmann$ tika --config=tika-dl/src/test/resources/org/apache/tika/dl/imagerec/dl4j-vgg16-config.xml
   Jul 09, 2018 10:12:35 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
   WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
   See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
   for optional dependencies.
   
   Jul 09, 2018 10:12:35 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
   WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
   you've excluded the TesseractOCRParser from the default parser.
   Tesseract may dramatically slow down content extraction (TIKA-2359).
   As of Tika 1.15 (and prior versions), Tesseract is automatically called.
   In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
   Jul 09, 2018 10:12:35 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
   WARNING: org.xerial's sqlite-jdbc is not loaded.
   Please provide the jar on your classpath to parse sqlite files.
   See tika-parsers/pom.xml for the correct version.
   INFO  Starting Apache Tika 2.0.0-SNAPSHOT server
   INFO  Using custom config: tika-dl/src/test/resources/org/apache/tika/dl/imagerec/dl4j-vgg16-config.xml
   INFO  Loaded [CpuBackend] backend
   INFO  Number of threads used for NativeOps: 2
   INFO  Number of threads used for BLAS: 2
   INFO  Backend used: [CPU]; OS: [Mac OS X]
   INFO  Cores: [4]; Memory: [3.6GB];
   INFO  Blas vendor: [MKL]
   WARN  java.io.UTFDataFormatException: malformed input around byte 11
   java.lang.RuntimeException: java.io.UTFDataFormatException: malformed input around byte 11
    at org.nd4j.linalg.api.buffer.BaseDataBuffer.read(BaseDataBuffer.java:1509)
    at org.nd4j.linalg.compression.CompressedDataBuffer.readUnknown(CompressedDataBuffer.java:83)
    at org.nd4j.linalg.factory.Nd4j.read(Nd4j.java:2725)
    at org.deeplearning4j.util.ModelSerializer.restoreComputationGraph(ModelSerializer.java:564)
    at org.deeplearning4j.util.ModelSerializer.restoreComputationGraph(ModelSerializer.java:476)
    at org.apache.tika.dl.imagerec.DL4JVGG16Net.initialize(DL4JVGG16Net.java:95)
    at org.apache.tika.parser.recognition.ObjectRecognitionParser.initialize(ObjectRecognitionParser.java:94)
    at org.apache.tika.config.TikaConfig$XmlLoader.loadOne(TikaConfig.java:644)
    at org.apache.tika.config.TikaConfig$XmlLoader.loadOverall(TikaConfig.java:554)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:191)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:172)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:165)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:129)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:124)
    at org.apache.tika.server.TikaServerCli.main(TikaServerCli.java:156)
   Caused by: java.io.UTFDataFormatException: malformed input around byte 11
    at java.io.DataInputStream.readUTF(DataInputStream.java:656)
    at java.io.DataInputStream.readUTF(DataInputStream.java:564)
    at org.nd4j.linalg.api.buffer.BaseDataBuffer.read(BaseDataBuffer.java:1450)
    ... 14 more
   ERROR Can't start
   org.apache.tika.exception.TikaConfigException: java.io.UTFDataFormatException: malformed input around byte 11
    at org.apache.tika.dl.imagerec.DL4JVGG16Net.initialize(DL4JVGG16Net.java:115)
    at org.apache.tika.parser.recognition.ObjectRecognitionParser.initialize(ObjectRecognitionParser.java:94)
    at org.apache.tika.config.TikaConfig$XmlLoader.loadOne(TikaConfig.java:644)
    at org.apache.tika.config.TikaConfig$XmlLoader.loadOverall(TikaConfig.java:554)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:191)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:172)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:165)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:129)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:124)
    at org.apache.tika.server.TikaServerCli.main(TikaServerCli.java:156)
   Caused by: java.lang.RuntimeException: java.io.UTFDataFormatException: malformed input around byte 11
    at org.nd4j.linalg.api.buffer.BaseDataBuffer.read(BaseDataBuffer.java:1509)
    at org.nd4j.linalg.compression.CompressedDataBuffer.readUnknown(CompressedDataBuffer.java:83)
    at org.nd4j.linalg.factory.Nd4j.read(Nd4j.java:2725)
    at org.deeplearning4j.util.ModelSerializer.restoreComputationGraph(ModelSerializer.java:564)
    at org.deeplearning4j.util.ModelSerializer.restoreComputationGraph(ModelSerializer.java:476)
    at org.apache.tika.dl.imagerec.DL4JVGG16Net.initialize(DL4JVGG16Net.java:95)
    ... 9 more
   Caused by: java.io.UTFDataFormatException: malformed input around byte 11
    at java.io.DataInputStream.readUTF(DataInputStream.java:656)
    at java.io.DataInputStream.readUTF(DataInputStream.java:564)
    at org.nd4j.linalg.api.buffer.BaseDataBuffer.read(BaseDataBuffer.java:1450)
    ... 14 more
   nonas:tika2.0.0 mattmann$ cat tika-dl/src/test/resources/org/apache/tika/dl/imagerec/dl4j-vgg16-config.xml
   <?xml version="1.0" encoding="UTF-8"?>
   
   <!--
     ~ Licensed to the Apache Software Foundation (ASF) under one or more
     ~ contributor license agreements.  See the NOTICE file distributed with
     ~ this work for additional information regarding copyright ownership.
     ~ The ASF licenses this file to You under the Apache License, Version 2.0
     ~ (the "License"); you may not use this file except in compliance with
     ~ the License.  You may obtain a copy of the License at
     ~
     ~    http://www.apache.org/licenses/LICENSE-2.0
     ~
     ~ Unless required by applicable law or agreed to in writing, software
     ~ distributed under the License is distributed on an "AS IS" BASIS,
     ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
     ~ See the License for the specific language governing permissions and
     ~ limitations under the License.
     -->
   <properties>
       <parsers>
           <parser class="org.apache.tika.parser.recognition.ObjectRecognitionParser">
               <mime>image/jpeg</mime>
               <params>
                   <param name="topN" type="int">3</param>
                   <param name="minConfidence" type="double">0.015</param>
                   <param name="class" type="string">org.apache.tika.dl.imagerec.DL4JVGG16Net</param>
                   <param name="modelType" type="string">VGG16</param>
                   <param name="serialize" type="bool">true</param>
               </params>
           </parser>
       </parsers>
   </properties>
   nonas:tika2.0.0 mattmann$ ls /Users/mattmann/.tika-dl/
   models
   nonas:tika2.0.0 mattmann$ rm -rf $HOME/.tika-dl/
   nonas:tika2.0.0 mattmann$ tika --config=tika-dl/src/test/resources/org/apache/tika/dl/imagerec/dl4j-vgg16-config.xml
   Jul 09, 2018 10:13:56 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
   WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
   See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
   for optional dependencies.
   
   Jul 09, 2018 10:13:56 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
   WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
   you've excluded the TesseractOCRParser from the default parser.
   Tesseract may dramatically slow down content extraction (TIKA-2359).
   As of Tika 1.15 (and prior versions), Tesseract is automatically called.
   In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
   Jul 09, 2018 10:13:56 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
   WARNING: org.xerial's sqlite-jdbc is not loaded.
   Please provide the jar on your classpath to parse sqlite files.
   See tika-parsers/pom.xml for the correct version.
   INFO  Starting Apache Tika 2.0.0-SNAPSHOT server
   INFO  Using custom config: tika-dl/src/test/resources/org/apache/tika/dl/imagerec/dl4j-vgg16-config.xml
   INFO  Loaded [CpuBackend] backend
   INFO  Number of threads used for NativeOps: 2
   INFO  Number of threads used for BLAS: 2
   INFO  Backend used: [CPU]; OS: [Mac OS X]
   INFO  Cores: [4]; Memory: [3.6GB];
   INFO  Blas vendor: [MKL]
   WARN  Preprocessed Model doesn't exist at /Users/mattmann/.tika-dl/models/dl4j/vgg-16/vgg16.zip
   INFO  Using cached model at /Users/mattmann/.deeplearning4j/models/vgg16/vgg16_dl4j_inference.zip
   INFO  Verifying download...
   INFO  Checksum local is 3501732770, expecting 3501732770
   INFO  Starting ComputationGraph with WorkspaceModes set to [training: NONE; inference: SINGLE], cacheMode set to [NONE]
   INFO  Saving the Loaded model for future use. Saved models are more optimised to consume less resources.
   INFO  Recogniser = org.apache.tika.dl.imagerec.DL4JVGG16Net
   INFO  Recogniser Available = true
   INFO  Setting the server's publish address to be http://localhost:9998/
   INFO  jetty-8.y.z-SNAPSHOT
   INFO  Started SelectChannelConnector@localhost:9998
   INFO  Started Apache Tika server at http://localhost:9998/
   INFO  rmeta (autodetecting type)
   INFO  Time taken 1427ms
   INFO  Add RecognisedObject{label='lion' (eng), id='lion', confidence=0.9999885559082031}
   INFO  Add RecognisedObject{label='chow' (eng), id='chow', confidence=1.1340579476382118E-5}
   INFO  Add RecognisedObject{label='dhole' (eng), id='dhole', confidence=8.046561106311856E-8}
   ```
   
   ## VGG16 client
   
   ```
   nonas:imagerec mattmann$ curl -T lion.jpg http://localhost:9998/rmeta | python -mjson.tool
     % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                    Dload  Upload   Total   Spent    Left  Speed
   100 45971    0  1530  100 44441    916  26617  0:00:01  0:00:01 --:--:-- 26627
   [
       {
           "Content-Type": "image/jpeg",
           "OBJECT": [
               "lion (0.99999)",
               "chow (0.00001)",
               "dhole (0.00000)"
           ],
           "X-Parsed-By": [
               "org.apache.tika.parser.CompositeParser",
               "org.apache.tika.parser.recognition.ObjectRecognitionParser"
           ],
           "X-TIKA:content": "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n<head>\n<meta name=\"org.apache.tika.parser.recognition.object.rec.impl\" content=\"org.apache.tika.dl.imagerec.DL4JVGG16Net\" />\n<meta name=\"X-Parsed-By\" content=\"org.apache.tika.parser.CompositeParser\" />\n<meta name=\"X-Parsed-By\" content=\"org.apache.tika.parser.recognition.ObjectRecognitionParser\" />\n<meta name=\"OBJECT\" content=\"lion (0.99999)\" />\n<meta name=\"OBJECT\" content=\"chow (0.00001)\" />\n<meta name=\"OBJECT\" content=\"dhole (0.00000)\" />\n<meta name=\"Content-Type\" content=\"image/jpeg\" />\n<title></title>\n</head>\n<body><ol id=\"objects\">\t<li id=\"lion\"> lion [eng](confidence = 0.999989)</li>\n\t<li id=\"chow\"> chow [eng](confidence = 0.000011)</li>\n\t<li id=\"dhole\"> dhole [eng](confidence = 0.000000)</li>\n</ol>\n</body></html>",
           "X-TIKA:parse_time_millis": "1495",
           "org.apache.tika.parser.recognition.object.rec.impl": "org.apache.tika.dl.imagerec.DL4JVGG16Net"
       }
   ]
   nonas:imagerec mattmann$
   ```
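
For anyone consuming this response programmatically, here is a rough sketch of pulling the labels and confidences out of the `OBJECT` field. It assumes the entries keep the `label (confidence)` shape shown above and that a single-valued field may arrive as a plain string instead of a list; the function name `parse_objects` and the regular expression are illustrative, not a Tika API.

```python
import json
import re

# Matches entries shaped like "lion (0.99999)" as seen in the rmeta output.
OBJECT_PATTERN = re.compile(r"^(?P<label>.+) \((?P<confidence>[0-9.]+)\)$")


def parse_objects(rmeta_json: str) -> list:
    """Return (label, confidence) tuples from every record's OBJECT field."""
    results = []
    for record in json.loads(rmeta_json):
        entries = record.get("OBJECT", [])
        if isinstance(entries, str):
            # Assumption: a lone value is serialized as a string, not a list.
            entries = [entries]
        for entry in entries:
            match = OBJECT_PATTERN.match(entry)
            if match:
                results.append(
                    (match.group("label"), float(match.group("confidence")))
                )
    return results


# Trimmed copy of the response above, keeping only the fields used here.
sample = json.dumps([
    {
        "Content-Type": "image/jpeg",
        "OBJECT": ["lion (0.99999)", "chow (0.00001)", "dhole (0.00000)"],
    }
])

for label, confidence in parse_objects(sample):
    print(label, confidence)
```

The confidences in `OBJECT` are already rounded to five decimal places, so the full-precision values are only available in the server log.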
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


> Upgrade dl4j to 1.0.0-beta
> --------------------------
>
>                 Key: TIKA-2672
>                 URL: https://issues.apache.org/jira/browse/TIKA-2672
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: TIKA-2672.patch
>
>
> Let's try to upgrade dl4j.  I think I got us most of the way there, but I hit this error when reading the JSON config file.  Can someone with more knowledge of layer specs help ([~thammegowda], perhaps :))?
> {noformat}
> org.deeplearning4j.exception.DL4JInvalidConfigException: Invalid configuration for layer (idx=-1, name=convolution2d_2, type=ConvolutionLayer) for width dimension:  Invalid input configuration for kernel width. Require 0 < kW <= inWidth + 2*padW; got (kW=3, inWidth=1, padW=0)
> Input type = InputTypeConvolutional(h=149,w=1,c=32), kernel = [3, 3], strides = [1, 1], padding = [0, 0], layer size (output channels) = 32, convolution mode = Truncate
> {noformat}
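
The constraint quoted in that exception, `0 < kW <= inWidth + 2*padW`, can be checked by hand. The snippet below is only an illustration of the arithmetic using the values from the message; the function name is made up and this is not DL4J code.

```python
def kernel_width_is_valid(kernel_width: int, input_width: int, pad_width: int) -> bool:
    """Evaluate the check from the error text: 0 < kW <= inWidth + 2*padW."""
    return 0 < kernel_width <= input_width + 2 * pad_width


# Values reported in the exception: kW=3, inWidth=1, padW=0.
print(kernel_width_is_valid(3, 1, 0))    # False: 3 > 1 + 2*0

# Same kernel against a width of 149, the height reported for this layer's input.
print(kernel_width_is_valid(3, 149, 0))  # True: 3 <= 149
```

With the reported input type `h=149,w=1,c=32`, the width of 1 is what fails the check; a 3-wide kernel with no padding needs an input at least 3 wide.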



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)