How to read specific strings from the content of a PDF file with PDFBox

I want to write a program that pulls the topic, author, abstract, and other information out of a paper. Can PDFBox do this? If so, how?

2017-11-30 ZJR1994

You need to add the pdfbox jar to your project. The following code retrieves some of the basic document attributes:

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;

public class ReadPdf {
    public static void main(String[] args) throws IOException {
        // Path to an existing document
        File file = new File("C:/Users/user1/Desktop/test.pdf");

        // PDDocument.load(File) is the PDFBox 2.x call; try-with-resources
        // closes the document even if an exception is thrown
        try (PDDocument document = PDDocument.load(file)) {
            // Get the PDDocumentInformation object (the document's metadata)
            PDDocumentInformation pdd = document.getDocumentInformation();

            // Print the metadata; any field the PDF does not set is null
            System.out.println("Author of the document is: " + pdd.getAuthor());
            System.out.println("Title of the document is: " + pdd.getTitle());
            System.out.println("Subject of the document is: " + pdd.getSubject());

            System.out.println("Creator of the document is: " + pdd.getCreator());
            System.out.println("Creation date of the document is: " + pdd.getCreationDate());
            System.out.println("Modification date of the document is: "
                    + pdd.getModificationDate());
            System.out.println("Keywords of the document are: " + pdd.getKeywords());
        }
    }
}

For more document properties, see here. HTH.


2017-11-30 04:15:05 Akjun

I know that, but most of the data I get from this method is null. It is the same for every field. – ZJR1994

The PDF you loaded may not contain those values. Please check. I tried this code and it retrieves the required values. – Akjun

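This code can extract those metadata fields from a document, but the documents I want to process were not produced formally, so their metadata is mostly null. For now I have extracted some of the regular fields with regular expressions. – ZJR1994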
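For anyone landing here with the same problem, the regular-expression approach mentioned in the last comment might look roughly like the sketch below. It is only an illustration under stated assumptions: the class name `PaperFieldExtractor` and the "Title:", "Authors:", and "Abstract:" labels are hypothetical, and real papers will need patterns adapted to their layout. The input is assumed to be the plain text of the PDF, which in PDFBox 2.x can be obtained with `new PDFTextStripper().getText(document)` (class `org.apache.pdfbox.text.PDFTextStripper`). The sketch itself uses only the JDK so it can be run without the jar.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PaperFieldExtractor {

    // Hypothetical patterns: they assume each field is introduced by a label
    // ("Title:", "Author:"/"Authors:", "Abstract:") at the start of a line.
    private static final Pattern TITLE =
            Pattern.compile("(?im)^[ \\t]*Title[ \\t]*:[ \\t]*(.+)$");
    private static final Pattern AUTHOR =
            Pattern.compile("(?im)^[ \\t]*Authors?[ \\t]*:[ \\t]*(.+)$");
    // The abstract may span several lines, so capture lazily up to the
    // first blank line or the end of the text.
    private static final Pattern ABSTRACT =
            Pattern.compile("(?ims)^[ \\t]*Abstract[ \\t]*:[ \\t]*(.+?)(?=\\R[ \\t]*\\R|\\z)");

    /** Returns the fields that were found; fields with no match are omitted. */
    public static Map<String, String> extract(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        putIfFound(fields, "title", TITLE, text);
        putIfFound(fields, "author", AUTHOR, text);
        putIfFound(fields, "abstract", ABSTRACT, text);
        return fields;
    }

    private static void putIfFound(Map<String, String> fields, String name,
                                   Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        if (m.find()) {
            // Collapse line breaks and repeated spaces left over from the PDF layout
            fields.put(name, m.group(1).replaceAll("\\s+", " ").trim());
        }
    }

    public static void main(String[] args) {
        // Stand-in for text that would come from PDFTextStripper.getText(document)
        String text = "Title: Reading PDFs\n"
                + "Authors: A. Example, B. Sample\n\n"
                + "Abstract: We study X\nand Y.\n\n"
                + "Keywords: pdf";
        System.out.println(extract(text));
    }
}
```

Returning only the fields that matched keeps "not found" distinct from an empty value, which makes it easy to fall back to `PDDocumentInformation` for documents that do carry metadata.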
