How to read the text of an appearance stream?

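I have a PDF in which the text displayed in an annotation (as rendered by Adobe Reader) differs from the text given by its /Contents and /RC entries.

This is related to the issue I dealt with in this question: "Can't change /Contents of annotation".

In this case, instead of changing the appearance to match the contents of the annotation, I want to do the opposite: get the text of the appearance and change the /Contents and /RC values to match it. For example, if the annotation displays "appearance" and /Contents is set to "contents", I want to do something like: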

void setContent(PdfDictionary dict) 
{ 
PdfString str = dict.GetAsString(new PdfName("KeyForAppearanceText")); 
dict.Put(PdfName.CONTENTS,str); 
}

However, I can't find where the appearance text is stored. I get the dictionary referenced by /AP with this code:

private PdfDictionary getAPAnnot(PdfArray annotArray,PdfDictionary annot) 
     { 
      PdfDictionary apDict = annot.GetAsDict(PdfName.AP); 
      if (apDict!=null) 
      { 
       PdfIndirectReference ap = (PdfIndirectReference)apDict.Get(PdfName.N); 
       PdfDictionary apRefDict = (PdfDictionary)pdfController.pdfReader.GetPdfObject(ap.Number); 
       return apRefDict; 
      } 
      else 
      { 
       return null; 
      } 
     }

This dictionary has the following HashMap:

{[/BBox, [-38.7578, -144.058, 62.0222, 1]]} 
{[/Filter, /FlateDecode]} 
{[/Length, 172]}  
{[/Matrix, [1, 0, 0, 1, 0, 0]]} 
{[/Resources, Dictionary]}

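/Resources has indirect references to fonts, but no content. So it appears that the appearance stream does not contain any content data. Other than /Contents and /RC, there doesn't appear to be anywhere in the annotation's data structure that stores content data. Where should I be looking for the appearance content?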

Source

2016-05-03 sigil

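Unfortunately the OP has not provided a sample PDF. Considering his previous questions, though, he is most likely interested in free text annotations. Thus, I use this example PDF as an example here. It has one page with a single typewriter free text annotation.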

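The OP asked:

"Other than /Contents and /RC, there doesn't appear to be anywhere in the annotation's data structure that stores content data. Where should I be looking for the appearance content?"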

The main shortcoming of the OP's code is that he treats the normal appearance as a mere PdfDictionary:

PdfIndirectReference ap = (PdfIndirectReference)apDict.Get(PdfName.N); 
PdfDictionary apRefDict = (PdfDictionary)pdfController.pdfReader.GetPdfObject(ap.Number);

It actually is a PdfStream, i.e. a dictionary with an attached data stream, and this data stream is where the appearance drawing instructions are located.
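To look at those drawing instructions directly, the normal appearance can be fetched as a stream and decoded. The following is only a minimal sketch for inspection purposes: the class and method names are made up for illustration, it assumes /N is a plain stream (not a dictionary of appearance states), and it decodes the bytes as Latin-1, which is good enough for reading the operators but not for interpreting the text strings.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class AppearanceStreamDump
{
    // Hypothetical helper: returns the decoded content of the normal
    // appearance (/N), or null if there is no /AP entry or /N is not a stream.
    public static string GetNormalAppearanceContent(PdfDictionary annot)
    {
        PdfDictionary apDict = annot.GetAsDict(PdfName.AP);
        if (apDict == null)
            return null;

        // GetAsStream resolves the indirect reference and returns null
        // unless the referenced object really is a stream.
        PdfStream normal = apDict.GetAsStream(PdfName.N);
        if (normal == null)
            return null;

        // Same utility the text extraction helper in this answer uses;
        // it applies the stream filters (e.g. /FlateDecode).
        byte[] bytes = ContentByteUtils.GetContentBytesFromContentObject(normal);

        // Assumption: Latin-1 is only used to make the operators readable.
        return Encoding.GetEncoding("ISO-8859-1").GetString(bytes);
    }
}
```

Printing the returned string for a free text annotation should show operator sequences of the kind listed further down, which is useful for checking what the appearance really draws.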

But even with this data stream at hand, things are not as simple as the OP imagines:

PdfString str = dict.GetAsString(new PdfName("KeyForAppearanceText"));

Actually, the text in an appearance stream can be drawn in pieces. In my sample file, for example, the stream data look like this:

0 w 
131.2646 564.8243 180.008 30.984 re 
n 
q 
1 0 0 1 0 0 cm 
131.2646 564.8243 180.008 30.984 re 
W 
n 
0 g 
1 w 
BT 
/Cour 12 Tf 
0 g 
131.265 587.96 Td 
(This) Tj 
35.999 0 Td 
(is) Tj 
21.6 0 Td 
(written) Tj 
57.599 0 Td 
(using) Tj 
43.2 0 Td 
(the) Tj 
-158.398 -16.3 Td 
(typewriter) Tj 
79.199 0 Td 
(tool.) Tj 
ET 
Q

Furthermore, the encoding need not be some standard encoding as it is here; it may instead be defined on the fly for an embedded font.
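To make the "drawn in pieces" point concrete, here is a deliberately naive sketch that scans decoded stream data for literal strings shown with the Tj operator. It is not part of the solution: the class and method names are hypothetical, and it ignores TJ arrays, hex strings, the ' and " operators, nested parentheses, font encodings and text positioning. It merely shows that even in the simple stream above the text arrives as separate fragments.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NaiveTjScan
{
    // Matches "(literal) Tj" only. Escaped characters such as \) are
    // skipped over but NOT unescaped; nested parentheses are not handled.
    private static readonly Regex ShowText =
        new Regex(@"\(((?:\\.|[^\\()])*)\)\s*Tj\b", RegexOptions.Singleline);

    // Returns the raw string operands of all Tj operators, in stream order.
    public static List<string> ExtractLiteralStrings(string content)
    {
        var pieces = new List<string>();
        foreach (Match m in ShowText.Matches(content))
        {
            pieces.Add(m.Groups[1].Value);
        }
        return pieces;
    }
}
```

Applied to the stream data above, this yields seven separate fragments ("This", "is", "written", and so on). Where the spaces and the line break belong can only be derived from the Td offsets and the font metrics, and with a custom font encoding the raw bytes would not even be readable.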

Thus, full-blown text extraction has to be applied.

All of this can be implemented like this:

for (int page = 1; page <= pdfReader.NumberOfPages; page++) 
{ 
    Console.Write("\nPage {0}\n", page); 
    PdfDictionary pageDictionary = pdfReader.GetPageNRelease(page); 
    PdfArray annotsArray = pageDictionary.GetAsArray(PdfName.ANNOTS); 
    if (annotsArray == null || annotsArray.IsEmpty()) 
    { 
     Console.Write(" No annotations.\n"); 
     continue; 
    } 
    foreach (PdfObject pdfObject in annotsArray) 
    { 
     PdfObject direct = PdfReader.GetPdfObject(pdfObject); 
     if (direct.IsDictionary()) 
     { 
      PdfDictionary annotDictionary = (PdfDictionary)direct; 
      Console.Write(" SubType: {0}\n", annotDictionary.GetAsName(PdfName.SUBTYPE)); 
      PdfDictionary appearancesDictionary = annotDictionary.GetAsDict(PdfName.AP); 
      if (appearancesDictionary == null) 
      { 
       Console.Write(" No appearances.\n"); 
       continue; 
      } 
      foreach (PdfName key in appearancesDictionary.Keys) 
      { 
       Console.Write(" Appearance: {0}\n", key); 
       PdfStream value = appearancesDictionary.GetAsStream(key); 
       if (value != null) 
       { 
        String text = ExtractAnnotationText(value); 
        Console.Write(" Text:\n---\n{0}\n---\n", text); 
       } 
      } 
     } 
    } 
}

with this helper method:

public String ExtractAnnotationText(PdfStream xObject) 
{ 
    PdfDictionary resources = xObject.GetAsDict(PdfName.RESOURCES); 
    ITextExtractionStrategy strategy = new LocationTextExtractionStrategy(); 

    PdfContentStreamProcessor processor = new PdfContentStreamProcessor(strategy); 
    processor.ProcessContent(ContentByteUtils.GetContentBytesFromContentObject(xObject), resources); 
    return strategy.GetResultantText(); 
}

For the sample file from above, the output of the code is:

Page 1 
    SubType: /FreeText 
    Appearance: /N 
    Text: 
--- 
This is written using the 
typewriter tool. 
---

Beware: some annotations, in particular the widget annotations of checkboxes and radio buttons, have a deeper structure than the code here expects.
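The deeper structure is that an entry of the /AP dictionary may itself be a dictionary mapping appearance state names (e.g. an "on" and an "off" state) to streams, rather than a stream. A possible way to cover that case is sketched below; the class and method names are hypothetical, and the code has not been run against real form PDFs.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;

public static class AppearanceStates
{
    // Collects every appearance stream of an annotation, labelled
    // "/N" for a plain stream or "/N /StateName" for a state entry.
    public static List<KeyValuePair<string, PdfStream>> Collect(PdfDictionary annot)
    {
        var result = new List<KeyValuePair<string, PdfStream>>();
        PdfDictionary ap = annot.GetAsDict(PdfName.AP);
        if (ap == null)
            return result;

        foreach (PdfName key in ap.Keys)
        {
            // Simple case: the entry is the appearance stream itself.
            PdfStream stream = ap.GetAsStream(key);
            if (stream != null)
            {
                result.Add(new KeyValuePair<string, PdfStream>(key.ToString(), stream));
                continue;
            }

            // Deeper case: the entry is a dictionary of appearance states.
            PdfDictionary states = ap.GetAsDict(key);
            if (states == null)
                continue;
            foreach (PdfName state in states.Keys)
            {
                PdfStream stateStream = states.GetAsStream(state);
                if (stateStream != null)
                    result.Add(new KeyValuePair<string, PdfStream>(key + " " + state, stateStream));
            }
        }
        return result;
    }
}
```

Each collected stream can then be passed to ExtractAnnotationText from above. Which state is currently displayed is given by the annotation's /AS entry.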

Source

2016-05-04 08:24:48 mkl

Yes, I am using this for FreeText annotations, callouts in particular. This solution works well either way. Thanks! – sigil
