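PDFBox: Keeping the PDF structure when extracting text

I am trying to extract text from a PDF that is full of tables. In some cases a column is empty. When I extract the text from the PDF, the empty columns are skipped and replaced by a single whitespace. As a result, my regular expressions cannot tell that there was a column with no information at that position.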

An image for better understanding:

[image]

We can see that the columns are not respected in the extracted text.

Sample of my code that extracts the text from the PDF:

PDFTextStripper reader = new PDFTextStripper();
reader.setSortByPosition(true);
reader.setStartPage(page);
reader.setEndPage(page);
String st = reader.getText(document);
List<String> lines = Arrays.asList(st.split(System.getProperty("line.separator")));

How can I keep the complete structure of the original PDF when extracting its text?

Thank you.

Source

2017-08-23 Leor

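Try a tool like tabula-java, which is built on top of PDFBox. PDFBox itself does not try to identify tables. –

Leor, if you are interested in a PDFTextStripper variant that tries to insert extra spaces where the PDF has gaps, I will copy my answer to a since-deleted question (https://stackoverflow.com/a/28370692/1729265) here. – mkl

@mkl Your solution might help. If the extra spaces that get added are always the same (in terms of character count), it could do the job. – Leor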
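The idea in the last comment, that columns can be recovered as long as the inserted padding is consistent, can be sketched in plain Java. This is only an illustration: the class FixedWidthColumns, the method slice, and the column offsets are hypothetical and would have to be derived from your own layout-preserving output.

```java
import java.util.ArrayList;
import java.util.List;

class FixedWidthColumns {
    /**
     * Cuts a layout-preserving line at the given (ascending) start offsets
     * and trims each cell. A cell that holds only padding becomes "".
     */
    static List<String> slice(String line, int[] starts) {
        List<String> cells = new ArrayList<>();
        for (int i = 0; i < starts.length; i++) {
            // Clamp offsets so short lines do not cause an exception.
            int from = Math.min(starts[i], line.length());
            int to = (i + 1 < starts.length)
                    ? Math.min(starts[i + 1], line.length())
                    : line.length();
            cells.add(line.substring(from, to).trim());
        }
        return cells;
    }

    public static void main(String[] args) {
        int[] starts = {0, 10, 20}; // assumed column offsets, in characters
        // A row whose middle column is empty, padded to three 10-character cells.
        String row = String.format("%-10s%-10s%-10s", "Smith", "", "42");
        System.out.println(slice(row, starts));
    }
}
```

Because the empty cell is returned as an empty string rather than disappearing, code that works on the cells can see that a column was present but blank.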

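(This was originally the answer (dated Feb 6 '15) to another question, which the OP has since deleted together with all its answers. Because of its age, the code in this answer is still based on PDFBox 1.8.x, so some changes may be needed to make it run with PDFBox 2.0.x.)

In comments, the OP showed interest in a solution that extends the PDFBox PDFTextStripper to return lines of text that try to reflect the layout of the PDF file, which may help with the question at hand.

A proof of concept for this could look like the following class: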

import java.io.IOException;
import java.lang.reflect.Field;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

// Based on PDFBox 1.8.x; see the note above regarding 2.0.x.
public class LayoutTextStripper extends PDFTextStripper {
    public LayoutTextStripper() throws IOException {
        super();
    }

    @Override
    protected void startPage(PDPage page) throws IOException {
        super.startPage(page);
        cropBox = page.findCropBox();
        pageLeft = cropBox.getLowerLeftX();
        beginLine();
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
        float recentEnd = 0;
        for (TextPosition textPosition : textPositions) {
            String textHere = textPosition.getCharacter();
            if (textHere.trim().length() == 0)
                continue;

            float start = textPosition.getTextPos().getXPosition();
            boolean spacePresent = endsWithWS | textHere.startsWith(" ");

            // Pad with spaces when a separator is pending or a horizontal gap is detected.
            if (needsWS | spacePresent | Math.abs(start - recentEnd) > 1) {
                int spacesToInsert = insertSpaces(chars, start, needsWS & !spacePresent);
                for (; spacesToInsert > 0; spacesToInsert--) {
                    writeString(" ");
                    chars++;
                }
            }

            writeString(textHere);
            chars += textHere.length();

            needsWS = false;
            endsWithWS = textHere.endsWith(" ");
            try {
                recentEnd = getEndX(textPosition);
            } catch (IllegalArgumentException | IllegalAccessException | NoSuchFieldException | SecurityException e) {
                throw new IOException("Failure retrieving endX of TextPosition", e);
            }
        }
    }

    @Override
    protected void writeLineSeparator() throws IOException {
        super.writeLineSeparator();
        beginLine();
    }

    @Override
    protected void writeWordSeparator() throws IOException {
        // Defer the separator; writeString decides how many spaces to emit.
        needsWS = true;
    }

    void beginLine() {
        endsWithWS = true;
        needsWS = false;
        chars = 0;
    }

    // Number of spaces needed so the next chunk starts at its grid column.
    int insertSpaces(int charsInLineAlready, float chunkStart, boolean spaceRequired) {
        int indexNow = charsInLineAlready;
        int indexToBe = (int) ((chunkStart - pageLeft) / fixedCharWidth);
        int spacesToInsert = indexToBe - indexNow;
        if (spacesToInsert < 1 && spaceRequired)
            spacesToInsert = 1;
        return spacesToInsert;
    }

    // Reads the non-public endX field of TextPosition via reflection.
    float getEndX(TextPosition textPosition)
            throws IllegalArgumentException, IllegalAccessException, NoSuchFieldException, SecurityException {
        Field field = textPosition.getClass().getDeclaredField("endX");
        field.setAccessible(true);
        return field.getFloat(textPosition);
    }

    public float fixedCharWidth = 3;

    boolean endsWithWS = true;
    boolean needsWS = false;
    int chars = 0;

    PDRectangle cropBox = null;
    float pageLeft = 0;
}

It is used like this:

PDDocument document = PDDocument.load(PDF);

LayoutTextStripper stripper = new LayoutTextStripper();
stripper.setSortByPosition(true);
stripper.fixedCharWidth = charWidth; // e.g. 5

String text = stripper.getText(document);

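fixedCharWidth is the assumed character width. Depending on the font and font size used in the PDF in question, a different value may be more appropriate. In my sample documents, values from 3 to 6 were of interest.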
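To make the role of fixedCharWidth more concrete, here is a minimal sketch of the same column arithmetic as in insertSpaces, without any PDFBox dependency. The class GridLayoutDemo, its helper methods, and the x coordinates are made up for illustration. Unlike the stripper above, this sketch does not force a separator space when a chunk's grid column has already been passed.

```java
class GridLayoutDemo {
    /** Grid column for a chunk starting at chunkStart, as computed in insertSpaces. */
    static int columnFor(float chunkStart, float pageLeft, float fixedCharWidth) {
        return (int) ((chunkStart - pageLeft) / fixedCharWidth);
    }

    /** Pads the line with spaces up to the chunk's grid column, then appends the text. */
    static void place(StringBuilder line, String text, float chunkStart,
                      float pageLeft, float fixedCharWidth) {
        int target = columnFor(chunkStart, pageLeft, fixedCharWidth);
        while (line.length() < target) {
            line.append(' ');
        }
        line.append(text);
    }

    public static void main(String[] args) {
        float pageLeft = 0f;  // assumed left edge of the crop box
        float charWidth = 5f; // assumed character width, cf. fixedCharWidth

        // A row with three cells at x = 50, 150 and 250 (hypothetical positions).
        StringBuilder full = new StringBuilder();
        place(full, "A", 50f, pageLeft, charWidth);
        place(full, "B", 150f, pageLeft, charWidth);
        place(full, "C", 250f, pageLeft, charWidth);

        // The same row with an empty middle cell.
        StringBuilder gap = new StringBuilder();
        place(gap, "A", 50f, pageLeft, charWidth);
        place(gap, "C", 250f, pageLeft, charWidth);

        System.out.println("[" + full + "]");
        System.out.println("[" + gap + "]");
    }
}
```

In both rows "C" starts at the same character index, so the empty middle cell stays visible as a run of spaces. A smaller assumed width spreads the text over more character columns, while a larger one risks neighbouring chunks running into each other.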

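In essence, this emulates the analogous solution for iText in this answer. The results differ a bit, though, because iText text extraction forwards text chunks while PDFBox text extraction forwards individual characters.

Please note that this is merely a proof of concept. In particular, it does not take into account

Source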

2017-08-23 14:27:12 mkl

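Your solution works pretty well. I had to adapt it a little to match the PDFBox version I am using, but the first run is promising. The structure is almost the same as in the original PDF. If this holds, I will use this solution. Thank you – Leor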
