Convert a scanned PDF to text in Python

I have a scanned PDF file and I am trying to extract text from it. I tried to use pypdfocr to run OCR on it, but I get this error:

"could not found ghostscript in the usual place"

After searching, I found this solution, "Linking Ghostscript to pypdfocr in Windows Platform", and I downloaded Ghostscript and added it to my environment variables, but I still get the same error.

How can I search for text in my scanned PDF file using Python?

Thanks.

Edit: here is my code sample:

import os 
import sys 
import re 
import json 
import shutil 
import glob 
from pypdfocr import pypdfocr_gs 
from pypdfocr import pypdfocr_tesseract 
from PIL import Image 

path = PATH_TO_MY_SCANNED_PDF 
mainL = [] 
kk = {} 


def new_init(self, kk): 
    self.lang = 'heb' 
    self.binary = "tesseract" 
    self.msgs = { 
      'TS_MISSING': """ 
       Could not execute %s 
       Please make sure you have Tesseract installed correctly 
       """ % self.binary, 
      'TS_VERSION':'Tesseract version is too old', 
      'TS_img_MISSING':'Cannot find specified tiff file', 
      'TS_FAILED': 'Tesseract-OCR execution failed!', 
     } 

pypdfocr_tesseract.PyTesseract.__init__ = new_init 

wow = pypdfocr_gs.PyGs(kk) 
tt = pypdfocr_tesseract.PyTesseract(kk) 


def secFile(filename,oldfilename): 
    wow.make_img_from_pdf(filename) 


    files = glob.glob("X:/e206333106/ocr-114/balagan/" + '*.jpg') 
    for file in files: 
     im = Image.open(file) 
     im.save(file + ".tiff") 

    files = glob.glob("PATH" + '*.tiff') 
    for file in files: 
     tt.make_hocr_from_pnm(file) 
    pdftxt = ""  
    files = glob.glob("PATH" + '*.html') 
    for file in files: 
     with open(file) as myfile: 
      pdftxt = pdftxt + "#" + "".join(line.rstrip() for line in myfile) 
    findNum(pdftxt,oldfilename) 

    folder ="PATH" 

    for the_file in os.listdir(folder): 
     file_path = os.path.join(folder, the_file) 
     try: 
      if os.path.isfile(file_path): 
       os.unlink(file_path) 
     except Exception as e: 
      print(e) 

def pdf2ocr(filename): 
    pdffile = filename 
    os.system('pypdfocr -l heb ' + pdffile) 

def ocr2txt(filename): 
    pdffile = filename 


    output1 = pdffile.replace(".pdf","_ocr.txt") 
    output1 = "PATH" + os.path.basename(output1) 

    input1 = pdffile.replace(".pdf","_ocr.pdf") 

    os.system("pdf2txt -o " + output1 + " " + input1) 

    with open(output1) as myfile: 
     pdftxt="".join(line.rstrip() for line in myfile) 
    findNum(pdftxt,filename) 


def findNum(pdftxt,pdffile): 
    l = re.findall(r'\b\d+\b', pdftxt) 


    output = open('PATH' + os.path.basename(pdffile) + '.txt', 'w') 
    for i in l: 
     output.write(",") 
     output.write(i) 
    output.close()  

def is_ascii(s): 
    return all(ord(c) < 128 for c in s) 

i = 0  
files = glob.glob(path + '\\*.pdf') 
print(path) 
print(files) 
for file in files: 
    if file.endswith(".pdf"): 
     if is_ascii(file): 
      print(file) 
      pdf2ocr(file)  
      ocr2txt(file) 
     else: 
      newname = "PATH" + str(i) + ".pdf" 
      shutil.copyfile(file, newname) 
      print(newname) 
      secFile(newname,file) 
     i = i + 1 

files = glob.glob(path + '\\' + '*_ocr.pdf')   

for file in files: 
    print(file) 
    shutil.copyfile(file, "PATH" + os.path.basename(file)) 
    os.remove(file)
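
The number-extraction step (`findNum` in the script above) can be tried in isolation, without any OCR tooling. The following is a minimal Python 3 sketch; the helper name `extract_numbers` and the sample string are hypothetical and not part of the original script. It applies the same regular expression and joins the matches with commas, which avoids the leading comma that the original loop writes.

```python
import re


def extract_numbers(text):
    """Return every standalone run of digits in text, in order of appearance."""
    # Same pattern as findNum: digits delimited by word boundaries.
    return re.findall(r'\b\d+\b', text)


sample = "Invoice 123, page 4 of 10"  # hypothetical OCR output
print(",".join(extract_numbers(sample)))
```

Digits embedded in a word (for example `abc123`) are not matched, because there is no word boundary between the letters and the digits.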

Source

2017-08-03 Michal

Can you provide a code sample? – Keeper

I will edit it into my question – Michal

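Take a look at this library: https://pypi.python.org/pypi/pypdfocr. Keep in mind, though, that a PDF file can also contain images. You may be able to analyze the page content streams, but some scanners break the scanned page up into images, so you won't get the text out with Ghostscript.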

Source

2017-08-03 09:40:10 ghovat

Same error, but this time I ran **pypdfocr filename.pdf** on the command line. The error: **ERROR: Could not find Ghostscript in the usual place; please specify it using a config file** – Michal

– ghovat

I am using 64-bit Windows – Michal
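
One way to confirm that the environment-variable step actually took effect is to ask Python which Ghostscript executable, if any, is visible on PATH. This is a minimal Python 3 sketch, not part of pypdfocr; the function name `find_ghostscript` is hypothetical. It only checks PATH and does not show where pypdfocr itself searches. On Windows the Ghostscript console executables are named `gswin64c` and `gswin32c`.

```python
import shutil

# Console executables shipped by Ghostscript on Windows (64- and 32-bit),
# followed by the usual name on Linux/macOS.
GS_CANDIDATES = ("gswin64c", "gswin32c", "gs")


def find_ghostscript(candidates=GS_CANDIDATES, which=shutil.which):
    """Return the full path of the first Ghostscript executable on PATH, or None."""
    for name in candidates:
        path = which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    found = find_ghostscript()
    print(found or "Ghostscript not found on PATH")
```

If this prints a path from a fresh command prompt, PATH is set correctly and the problem lies in how the OCR tool locates Ghostscript. If it prints the fallback message, the variable was probably not saved, or the prompt was opened before the change.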

You can use OpenCV with Python. There are many examples of text detection.

Source

2017-08-03 09:50:59

I could not figure out how to use it for PDF files. – Michal

You can print the PDF as an image (PNG or JPEG) and then use OpenCV OCR.
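
The "render to an image, then OCR" idea can be sketched as follows. Note the substitutions: this Python 3 sketch uses Tesseract (through the `pytesseract` package) as the OCR engine rather than OpenCV, and `pdf2image` to render the pages, which relies on Poppler instead of Ghostscript. All of these are assumptions about what is installed, as is the Hebrew language data (`heb`) for Tesseract. The function name `ocr_pdf` and its `render`/`recognize` parameters are hypothetical.

```python
def ocr_pdf(pdf_path, lang="heb", dpi=300, render=None, recognize=None):
    """Render each PDF page to an image and return the OCR text of all pages."""
    if render is None:
        # Assumption: the pdf2image package (which needs Poppler) is installed.
        from pdf2image import convert_from_path
        render = lambda path: convert_from_path(path, dpi=dpi)
    if recognize is None:
        # Assumption: pytesseract and the Tesseract binary, with the
        # requested language data, are installed.
        import pytesseract
        recognize = lambda image: pytesseract.image_to_string(image, lang=lang)
    # One OCR result per page, joined with newlines.
    return "\n".join(recognize(page) for page in render(pdf_path))
```

With those tools installed, `text = ocr_pdf("scanned.pdf")` (a placeholder file name) would return the recognized text, which can then be searched with ordinary string or regex operations. The two optional callables exist only so the rendering and recognition steps can be swapped out or tested separately.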

I tried looking at OpenCV, but when I run 'import numpy' I get "AttributeError: 'module' object has no attribute 'einsum'" – Michal
