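Remove invisible text from a PDF using PDFBox

2017-11-17

Link to PDF

When I try to extract the text from the PDF linked above, I get a mixture of text that is invisible in a viewer and text that is visible. In addition, some of the desired text is missing characters that are not missing in the viewer, such as the "S" in "FALCONS" and many of the "1/2" characters. I believe this is caused by interference from the invisible text, because when I highlight the text in the viewer, the invisible text can be seen overlapping the visible text.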

Is there a way to remove the invisible text? Or is there another solution?

Code:

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class App {

    public static String getPdfText(String pdfPath) throws IOException {
        File file = new File(pdfPath);
        PDDocument document = null;
        PDFTextStripper textStripper = null;
        String text = null;

        try {
            document = PDDocument.load(file);
            textStripper = new PDFTextStripper();
            textStripper.setEndPage(1);
            text = textStripper.getText(document);
        } catch (IOException e) {
            throw new IOException("Could not load file and strip text.", e);
        } finally {
            try {
                if (document != null)
                    document.close();
            } catch (IOException e) {
                System.out.println("Could not close document");
            }
        }

        return text;
    }

    public static void main(String[] args) {
        String filename = "RevTeaser09072016.pdf";
        String text = null;

        try {
            text = getPdfText(filename);
        } catch (IOException e) {
            e.printStackTrace();
            System.exit(1);
        }

        System.out.println(text);
    }
}

Output (the desired text was marked in bold in the original post):

 
145 
143 
159 
144 
160 
141 
157155 156154150 153149 152148 151147 
142 
158 
500 
146 
Selections 
Number of Teams 
Amount Bet 
REVERSE tEaSER caRd 
mark box as shown 
 denotes home team 
PRO FOOTBALL - THURSDAY, NOVEMBER 15, 2012 
1 BILLS ★ NFL PM8:25 2 DOLPHINS7– ½ 6– ½ 
PRO FOOTBALL - SUNDAY, NOVEMBER 18, 2012 
3 REDSKINS ★ PM1:00 4 EAGLES10– ½ 3– ½ 
5 PACKERS PM1:00 6 LIONS ★10– ½ 3– ½ 
7 FALCONS ★ PM1:00 8 CARDINALS17– ½ 3+ ½ 
9 BUCCANEERS PM1:00 10 PANTHERS ★7– ½ 6– ½ 
11 COWBOYS ★ PM1:00 12 BROWNS14– ½ + ½ 
13 RAMS ★ PM1:00 14 JETS10– ½ 3– ½ 
15 PATRIOTS ★ PM4:25 16 COLTS17– ½ 3+ ½ 
17 TEXANS ★ PM1:00 18 JAGUARS23– ½ 9+ ½ 
19 BENGALS PM1:00 20 CHIEFS ★10– ½ 3– ½ 
21 SAINTS PM4:05 22 RAIDERS ★12– ½ 1– ½ 
23 BRONCOS ★ PM4:25 24 CHARGERS14– ½ + ½ 
25 RAVENS NBC PM8:30 26 STEELERS ★7– ½ 6– ½ 
PRO FOOTBALL - MONDAY, NOVEMBER 19, 2012 
27 49ERS ★ ESPN PM8:40 28 BEARS10– ½ 3– ½ 
1,000 
145 
143 
159 
144 
160 
141 
157155 156154150 153149 152148 151147 
142 
158 
500 
146 
Selections 
Number of Teams 
Amount Bet 
REVERSE tEaSER caRd 
mark box as hown 
 denotes home team 
PRO FOOTBALL - THURSDAY, NOVEMBER 15, 2012 
1 BILLS ★ NFL PM8:25 2 DOLPHINS7– ½ 6– ½ 
PRO FOOTBALL - SUNDAY, NOVEMBER 18, 2012 
3 REDSKINS ★ PM1:00 4 EAGLES10– ½ 3– ½ 
5 PACKERS PM1:00 6 LIONS ★10– ½ 3– ½ 
7 FALCONS ★ PM1:00 8 CARDINALS17– ½ 3+ ½ 
9 BUCCANEERS PM1:00 10 PANTHERS ★7– ½ 6– ½ 
11 COWBOYS ★ PM1:00 12 BROWNS14– ½ + ½ 
13 RAMS ★ PM1:00 14 JETS10– ½ 3– ½ 
15 PATRIOTS ★ PM4:25 16 COLTS17– ½ 3+ ½ 
17 TEXANS ★ PM1:00 18 JAGUARS23– ½ 9+ ½ 
19 BENGALS PM1:00 20 CHIEFS ★10– ½ 3– ½ 
21 SAINTS PM4:05 22 RAIDERS ★12– ½ 1– ½ 
23 BRONCOS ★ PM4:25 24 CHARGERS14– ½ + ½ 
25 RAVENS NBC PM8:30 26 STEEL RS ★7– ½ 6– ½ 
PRO FOOTBALL - MONDAY, NOVEMBER 19, 2012 
27 49ERS ★ ESPN PM8:40 28 BEARS10– ½ 3– ½ 
1,000 
145 
143 
159 
14 
160 
41 
15715 156154150 153149 152148 51147 
142 
158 
50 
146 
S lections 
Number of Teams 
Amount Bet 

ark box as sho n 
 denotes home team 
PRO F OTBALL - THURSDAY, NOVEMBER 15, 2012 
1 BILLS ★ NFL PM8:25 2 DOLPHINS7– ½ 6– ½ 
PRO F OTBALL - SUNDAY, NOVEMBER 18, 2012 
3 REDSKINS ★ PM1:0 4 EAGLES10– ½ 3– ½ 
5 PACKERS PM1:0 6 LIONS ★10– ½ 3– ½ 
7 FALCONS ★ PM1:0 8 CARDINALS17– ½ 3+ ½ 
9 BU CANEERS PM1:0 10 PANTHERS ★7– ½ 6– ½ 
11 COWBOYS ★ PM1:0 12 BROWNS14– ½ + ½ 
13 RAMS ★ PM1:0 14 JETS10– ½ 3– ½ 
15 PATRIOTS ★ PM4:25 16 COLTS17– ½ 3+ ½ 
17 TEXANS ★ PM1:0 18 JAGUARS23– ½ 9+ ½ 
19 BENGALS PM1:0 20 CHIEFS ★10– ½ 3– ½ 
21 SAINTS PM4:05 22 RAIDERS ★12– ½ 1– ½ 
23 BRONCOS ★ PM4:25 24 CHARGERS14– ½ + ½ 
25 RAVENS NBC PM8:30 26 STEELERS ★7– ½ 6– ½ 
PRO F OTBALL - MONDAY, NOVEMBER 19, 2012 
27 49ERS ★ ESPN PM8:40 28 BEARS10– ½ 3– ½ 
1,0 
MARK BOX AS SHOWN  
DENOTES HOME TEAM 
PRO FOOTBALL - THURSDAY, SEPTEMBER 8, 2016 
1 PANTHERS nbc - 10½ 8:30p 2 BRONCOS  - 3½ 
PRO FOOTBALL - SUNDAY, SEPTEMBER 11, 2016 
    FALCON  - 9 1:00p 4 BUCCANEERS - 4½ 
5 VIKINGS - 9½ 1:00p 6 TITANS  - 4½ 
7 EAGLES  - 10½ 1:00p 8 BROWNS - 3½ 
9 BENGALS - 9½ 1:00p 10 JETS  - 4½ 
11 SAINTS  - 7½ 1:00p 12 RAIDERS - 6½ 
13 CHIEFS  - 14½ 1:00p 14 CHARGERS + ½ 
15 RAVENS  - 10½ 1:00p 16 BILLS - 3½ 
17 TEXANS  - 14 1:00p 18 BEARS + ½ 
19 PACKERS - 12 1:00p 20 JAGUARS  - 1½ 
21 SEAHAWKS  - 17½ 4:05p 22 DOLPHINS + 3½ 
23 COWBOYS  - 7½ 4:25p 24 GIANTS - 6½ 
25 COLTS  - 10½ 4:25p 26 LIONS - 3½ 
27 CARDINALS  nbc - 14½ 8:30p 28 PATRIOTS + ½ 
PRO FOOTBALL - MONDAY, SEPTEMBER 12, 2016 
29 STEELERS espn - 10½ 7:10p 30 REDSKINS  - 3½ 
31 RAMS espn - 9 10:20p 32 49ERS  - 4½ 

Answer

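In the OP's sample PDF, the invisible text is mostly made invisible by defining clip paths (with the text lying outside their bounds) and by filling paths (which hides the text underneath). Thus, we have to take path-related instructions into account during text extraction in order to ignore that invisible text.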
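As a rough illustration of these two mechanisms, here is a minimal sketch in plain java.awt.geom, without PDFBox. The class and method names are hypothetical, and a glyph is reduced to the two end points of a horizontal baseline, which is a simplification made for this sketch only.

```java
import java.awt.geom.Area;
import java.awt.geom.GeneralPath;
import java.awt.geom.Rectangle2D;

class VisibilityRulesSketch {

    // Clip rule: keep a glyph only if both ends of its baseline lie inside the clip area.
    // A null clip is treated as "no clipping" (an assumption of this sketch).
    static boolean insideClip(Area clip, float x, float y, float width) {
        return clip == null || (clip.contains(x, y) && clip.contains(x + width, y));
    }

    // Fill rule: a glyph drawn earlier counts as hidden if a later filled path
    // covers either end of its baseline.
    static boolean coveredByFill(GeneralPath fill, float x, float y, float width) {
        return fill.contains(x, y) || fill.contains(x + width, y);
    }

    public static void main(String[] args) {
        // Clip region: only content inside this rectangle is painted.
        Area clip = new Area(new Rectangle2D.Float(0, 0, 100, 50));

        // A filled rectangle painted after the text, covering x 5..30 and y 15..30.
        GeneralPath fill = new GeneralPath();
        fill.moveTo(5, 15);
        fill.lineTo(30, 15);
        fill.lineTo(30, 30);
        fill.lineTo(5, 30);
        fill.closePath();

        System.out.println(insideClip(clip, 50, 20, 8));    // true: fully inside the clip
        System.out.println(insideClip(clip, 95, 20, 8));    // false: baseline ends at x = 103
        System.out.println(coveredByFill(fill, 10, 20, 8)); // true: painted over by the fill
        System.out.println(coveredByFill(fill, 50, 20, 8)); // false: untouched by the fill
    }
}
```

Text that fails the clip rule is never painted, while text that matches the fill rule is painted and then covered; in both cases a viewer does not show it.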

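Unfortunately, callbacks for such path-related instructions are not declared in PDFTextStripper or in its parent classes LegacyPDFStreamEngine and PDFStreamEngine.

They are, however, declared in the other major PDFStreamEngine subclass, PDFGraphicsStreamEngine, and they are implemented in a sensible manner in PageDrawer. To make use of this, we can copy, paste, and adapt the PageDrawer implementation into a subclass of PDFTextStripper, for example like this: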

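import java.awt.geom.Area;
import java.awt.geom.GeneralPath;
import java.awt.geom.Point2D;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.contentstream.operator.MissingOperandException;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.OperatorProcessor;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSNumber;
import org.apache.pdfbox.pdmodel.graphics.state.PDGraphicsState;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.Vector;

public class PDFVisibleTextStripper extends PDFTextStripper {
    public PDFVisibleTextStripper() throws IOException {
        // register the path-related operator processors defined as inner classes below
        addOperator(new AppendRectangleToPath());
        addOperator(new ClipEvenOddRule());
        addOperator(new ClipNonZeroRule());
        addOperator(new ClosePath());
        addOperator(new CurveTo());
        addOperator(new CurveToReplicateFinalPoint());
        addOperator(new CurveToReplicateInitialPoint());
        addOperator(new EndPath());
        addOperator(new FillEvenOddAndStrokePath());
        addOperator(new FillEvenOddRule());
        addOperator(new FillNonZeroAndStrokePath());
        addOperator(new FillNonZeroRule());
        addOperator(new LineTo());
        addOperator(new MoveTo());
        addOperator(new StrokePath());
    }

    // accept a character only if both ends of its baseline are inside the current clip path
    @Override
    protected void processTextPosition(TextPosition text) {
        Matrix textMatrix = text.getTextMatrix();
        Vector start = textMatrix.transform(new Vector(0, 0));
        Vector end = new Vector(start.getX() + text.getWidth(), start.getY());

        PDGraphicsState gs = getGraphicsState();
        Area area = gs.getCurrentClippingPath();
        if (area == null || (area.contains(start.getX(), start.getY()) && area.contains(end.getX(), end.getY())))
            super.processTextPosition(text);
    }

    // the path currently under construction
    private GeneralPath linePath = new GeneralPath();

    // drop already collected characters whose baseline start or end is covered by the path being filled
    void deleteCharsInPath() {
        for (List<TextPosition> list : charactersByArticle) {
            List<TextPosition> toRemove = new ArrayList<>();
            for (TextPosition text : list) {
                Matrix textMatrix = text.getTextMatrix();
                Vector start = textMatrix.transform(new Vector(0, 0));
                Vector end = new Vector(start.getX() + text.getWidth(), start.getY());
                if (linePath.contains(start.getX(), start.getY()) || linePath.contains(end.getX(), end.getY())) {
                    toRemove.add(text);
                }
            }
            if (toRemove.size() != 0) {
                // debug output: number of characters hidden by this fill
                System.out.println(toRemove.size());
                list.removeAll(toRemove);
            }
        }
    }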

    public final class AppendRectangleToPath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 4) {
                throw new MissingOperandException(operator, operands);
            }
            if (!checkArrayTypesClass(operands, COSNumber.class)) {
                return;
            }
            COSNumber x = (COSNumber) operands.get(0);
            COSNumber y = (COSNumber) operands.get(1);
            COSNumber w = (COSNumber) operands.get(2);
            COSNumber h = (COSNumber) operands.get(3);

            float x1 = x.floatValue();
            float y1 = y.floatValue();

            // create a pair of coordinates for the transformation
            float x2 = w.floatValue() + x1;
            float y2 = h.floatValue() + y1;

            Point2D p0 = context.transformedPoint(x1, y1);
            Point2D p1 = context.transformedPoint(x2, y1);
            Point2D p2 = context.transformedPoint(x2, y2);
            Point2D p3 = context.transformedPoint(x1, y2);

            // to ensure that the path is created in the right direction, we have to create
            // it by combining single lines instead of creating a simple rectangle
            linePath.moveTo((float) p0.getX(), (float) p0.getY());
            linePath.lineTo((float) p1.getX(), (float) p1.getY());
            linePath.lineTo((float) p2.getX(), (float) p2.getY());
            linePath.lineTo((float) p3.getX(), (float) p3.getY());

            // close the subpath instead of adding the last line so that a possible set line
            // cap style isn't taken into account at the "beginning" of the rectangle
            linePath.closePath();
        }

        @Override
        public String getName() {
            return "re";
        }
    }

    public final class StrokePath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.reset();
        }

        @Override
        public String getName() {
            return "S";
        }
    }

    public final class FillEvenOddRule extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
            deleteCharsInPath();
            linePath.reset();
        }

        @Override
        public String getName() {
            return "f*";
        }
    }

    public class FillNonZeroRule extends OperatorProcessor {
        @Override
        public final void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);
            deleteCharsInPath();
            linePath.reset();
        }

        @Override
        public String getName() {
            return "f";
        }
    }

    public final class FillEvenOddAndStrokePath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
            deleteCharsInPath();
            linePath.reset();
        }

        @Override
        public String getName() {
            return "B*";
        }
    }

    public class FillNonZeroAndStrokePath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);
            deleteCharsInPath();
            linePath.reset();
        }

        @Override
        public String getName() {
            return "B";
        }
    }

    public final class ClipEvenOddRule extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
            getGraphicsState().intersectClippingPath(linePath);
        }

        @Override
        public String getName() {
            return "W*";
        }
    }

    public class ClipNonZeroRule extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);
            getGraphicsState().intersectClippingPath(linePath);
        }

        @Override
        public String getName() {
            return "W";
        }
    }

    public final class MoveTo extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 2) {
                throw new MissingOperandException(operator, operands);
            }
            COSBase base0 = operands.get(0);
            if (!(base0 instanceof COSNumber)) {
                return;
            }
            COSBase base1 = operands.get(1);
            if (!(base1 instanceof COSNumber)) {
                return;
            }
            COSNumber x = (COSNumber) base0;
            COSNumber y = (COSNumber) base1;
            Point2D.Float pos = context.transformedPoint(x.floatValue(), y.floatValue());
            linePath.moveTo(pos.x, pos.y);
        }

        @Override
        public String getName() {
            return "m";
        }
    }

    public class LineTo extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 2) {
                throw new MissingOperandException(operator, operands);
            }
            COSBase base0 = operands.get(0);
            if (!(base0 instanceof COSNumber)) {
                return;
            }
            COSBase base1 = operands.get(1);
            if (!(base1 instanceof COSNumber)) {
                return;
            }
            // append straight line segment from the current point to the point
            COSNumber x = (COSNumber) base0;
            COSNumber y = (COSNumber) base1;

            Point2D.Float pos = context.transformedPoint(x.floatValue(), y.floatValue());

            linePath.lineTo(pos.x, pos.y);
        }

        @Override
        public String getName() {
            return "l";
        }
    }

    public class CurveTo extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 6) {
                throw new MissingOperandException(operator, operands);
            }
            if (!checkArrayTypesClass(operands, COSNumber.class)) {
                return;
            }
            COSNumber x1 = (COSNumber) operands.get(0);
            COSNumber y1 = (COSNumber) operands.get(1);
            COSNumber x2 = (COSNumber) operands.get(2);
            COSNumber y2 = (COSNumber) operands.get(3);
            COSNumber x3 = (COSNumber) operands.get(4);
            COSNumber y3 = (COSNumber) operands.get(5);

            Point2D.Float point1 = context.transformedPoint(x1.floatValue(), y1.floatValue());
            Point2D.Float point2 = context.transformedPoint(x2.floatValue(), y2.floatValue());
            Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());

            linePath.curveTo(point1.x, point1.y, point2.x, point2.y, point3.x, point3.y);
        }

        @Override
        public String getName() {
            return "c";
        }
    }

    public final class CurveToReplicateFinalPoint extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 4) {
                throw new MissingOperandException(operator, operands);
            }
            if (!checkArrayTypesClass(operands, COSNumber.class)) {
                return;
            }
            COSNumber x1 = (COSNumber) operands.get(0);
            COSNumber y1 = (COSNumber) operands.get(1);
            COSNumber x3 = (COSNumber) operands.get(2);
            COSNumber y3 = (COSNumber) operands.get(3);

            Point2D.Float point1 = context.transformedPoint(x1.floatValue(), y1.floatValue());
            Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());

            linePath.curveTo(point1.x, point1.y, point3.x, point3.y, point3.x, point3.y);
        }

        @Override
        public String getName() {
            return "y";
        }
    }

    public class CurveToReplicateInitialPoint extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 4) {
                throw new MissingOperandException(operator, operands);
            }
            if (!checkArrayTypesClass(operands, COSNumber.class)) {
                return;
            }
            COSNumber x2 = (COSNumber) operands.get(0);
            COSNumber y2 = (COSNumber) operands.get(1);
            COSNumber x3 = (COSNumber) operands.get(2);
            COSNumber y3 = (COSNumber) operands.get(3);

            Point2D currentPoint = linePath.getCurrentPoint();

            Point2D.Float point2 = context.transformedPoint(x2.floatValue(), y2.floatValue());
            Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());

            linePath.curveTo((float) currentPoint.getX(), (float) currentPoint.getY(),
                    point2.x, point2.y, point3.x, point3.y);
        }

        @Override
        public String getName() {
            return "v";
        }
    }

    public final class ClosePath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.closePath();
        }

        @Override
        public String getName() {
            return "h";
        }
    }

    public final class EndPath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.reset();
        }

        @Override
        public String getName() {
            return "n";
        }
    }
}

PDFVisibleTextStripper

Beware: do not use the operator classes of the same name that PageDrawer uses; use the inner operator classes of PDFVisibleTextStripper, as the constructor does. Simply follow the link below the code.
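To see what the AppendRectangleToPath processor builds for an "x y w h re" instruction, the following sketch reproduces the same steps with plain java.awt.geom. The class name, the method name, and the 90° rotation standing in for the current transformation matrix are assumptions made for illustration; they are not part of PDFBox.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.GeneralPath;
import java.awt.geom.Point2D;

class RectangleToPathSketch {

    // Transforms the four corners individually and joins them with lines,
    // because a transformed rectangle is not necessarily axis-aligned any more.
    static GeneralPath rectanglePath(AffineTransform ctm, float x, float y, float w, float h) {
        Point2D p0 = ctm.transform(new Point2D.Float(x, y), null);
        Point2D p1 = ctm.transform(new Point2D.Float(x + w, y), null);
        Point2D p2 = ctm.transform(new Point2D.Float(x + w, y + h), null);
        Point2D p3 = ctm.transform(new Point2D.Float(x, y + h), null);

        GeneralPath path = new GeneralPath();
        path.moveTo((float) p0.getX(), (float) p0.getY());
        path.lineTo((float) p1.getX(), (float) p1.getY());
        path.lineTo((float) p2.getX(), (float) p2.getY());
        path.lineTo((float) p3.getX(), (float) p3.getY());
        path.closePath();
        return path;
    }

    public static void main(String[] args) {
        // Hypothetical transformation: rotate by 90 degrees, mapping (x, y) to (-y, x).
        AffineTransform ctm = AffineTransform.getQuadrantRotateInstance(1);
        GeneralPath path = rectanglePath(ctm, 10, 20, 30, 40);

        System.out.println(path.contains(-40, 25)); // true: image of the rectangle's center (25, 40)
        System.out.println(path.contains(25, 40));  // false: the untransformed center
    }
}
```

A fill or clip test against such a path only makes sense if the text positions are expressed in the same coordinate space as the transformed path.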

Using PDFVisibleTextStripper reduces the output to the following, dropping most of the unwanted data:

REVERSE tEaSER caRd 
500 
elections 
er of Teams 
t Bet 
1,000 
MARK BOX AS SHOWN  
DENOTES HOME TEAM 
PRO FOOTBALL - THURSDAY, SEPTEMBER 8, 2016 
1 PANTHERS nbc - 10½ 8:30p 2 BRONCOS  - 3½ 
PRO FOOTBALL - SUNDAY, SEPTEMBER 11, 2016 
3 FALCONS  - 9½ 1:00p 4 BUCCANEERS - 4½ 
5 VIKINGS - 9½ 1:00p 6 TITANS  - 4½ 
7 EAGLES  - 10½ 1:00p 8 BROWNS - 3½ 
9 BENGALS - 9½ 1:00p 10 JETS  - 4½ 
11 SAINTS  - 7½ 1:00p 12 RAIDERS - 6½ 
13 CHIEFS  - 14½ 1:00p 14 CHARGERS + ½ 
15 RAVENS  - 10½ 1:00p 16 BILLS - 3½ 
17 TEXANS  - 14½ 1:00p 18 BEARS + ½ 
19 PACKERS - 12½ 1:00p 20 JAGUARS  - 1½ 
21 SEAHAWKS  - 17½ 4:05p 22 DOLPHINS + 3½ 
23 COWBOYS  - 7½ 4:25p 24 GIANTS - 6½ 
25 COLTS  - 10½ 4:25p 26 LIONS - 3½ 
27 CARDINALS  nbc - 14½ 8:30p 28 PATRIOTS + ½ 
PRO FOOTBALL - MONDAY, SEPTEMBER 12, 2016 
29 STEELERS espn - 10½ 7:10p 30 REDSKINS  - 3½ 
31 RAMS espn - 9½ 10:20p 32 49ERS  - 4½ 

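In the context of a related question, it became clear that the way processTextPosition and deleteCharsInPath calculate the end of a character's baseline implicitly assumes horizontal text without page rotation. If one loosens the criterion for "visibility", though, and considers a character visible whenever the start of its baseline is visible, the calculated Vector end is no longer needed and the code also works properly for rotated pages.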
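The difference between the two criteria can be tried out with plain java.awt.geom. This is only a sketch under assumed values: the class and method names are hypothetical, and the tall clip rectangle stands in for a column of text on a rotated page whose baselines run along the y axis.

```java
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;

class BaselineStartSketch {

    // Stricter criterion: both baseline ends must be inside the clip, with the end
    // computed as if the text always ran along the x axis.
    static boolean visibleBothEnds(Area clip, float startX, float startY, float width) {
        return clip == null
                || (clip.contains(startX, startY) && clip.contains(startX + width, startY));
    }

    // Loosened criterion: the glyph counts as visible if the start of its baseline is.
    static boolean visibleByStart(Area clip, float startX, float startY) {
        return clip == null || clip.contains(startX, startY);
    }

    public static void main(String[] args) {
        // Tall, narrow clip region: x 0..20, y 0..200.
        Area clip = new Area(new Rectangle2D.Float(0, 0, 20, 200));

        // A glyph starting at (10, 50) that is 30 units wide. Its baseline really runs
        // upwards to (10, 80), which is inside the clip, but the horizontal assumption
        // places the end at (40, 50), which is outside.
        System.out.println(visibleBothEnds(clip, 10, 50, 30)); // false: glyph wrongly rejected
        System.out.println(visibleByStart(clip, 10, 50));      // true: glyph kept
    }
}
```

The loosened check trades some precision for robustness: a glyph that starts inside the clip but extends beyond it is kept as well.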

Thank you for taking the time to write such a comprehensive solution! – Jay