A solution for extracting tabular data from PDF files

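I needed to extract tabular data from a large number of pages in many PDF documents. The built-in text export in Adobe's Acrobat Reader was no help: text extracted that way loses the spatial relationships established by the table. Others have asked about this problem many times, and I tried many of the solutions offered, but the results ranged from poor to terrible. So I set out to develop my own solution, and I think it is now ready to share here.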

Source

2017-08-23 Big_Al_Tx

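I first tried looking at the distribution of the text (in terms of its x & y location on the page) to work out where the row and column breaks are. Using the Python module 'pdfminer', I extracted each piece of text along with its BoundingBox parameters, sifted through the pieces, and mapped how many pieces of text fall on the page at a given x or y value. The idea was to examine the distribution of the text (horizontally for row breaks, vertically for column breaks); wherever the density drops to zero (meaning there is a clear gap running across, or up and down, the table), that marks a row or column break.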
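A minimal sketch of the zero-density idea, using only the standard library. It is not the code from this post: the function name `find_gaps`, the `min_gap` parameter, and the sample extents are all made up for illustration. Each text box contributes an interval along one axis, and any stretch that no interval covers is a candidate break.

```python
def find_gaps(intervals, min_gap=1.0):
    """Return (start, end) stretches along one axis that no text interval covers.

    intervals: iterable of (lo, hi) extents of text boxes along that axis.
    min_gap:   hypothetical threshold; narrower gaps are ignored.
    """
    gaps = []
    covered_to = None                      # right-most coordinate covered so far
    for lo, hi in sorted(intervals):
        if covered_to is not None and lo - covered_to >= min_gap:
            gaps.append((covered_to, lo))  # nothing covers this stretch
        covered_to = hi if covered_to is None else max(covered_to, hi)
    return gaps


# Hypothetical (x0, x1) extents of the text boxes found on one page
x_extents = [(10, 48), (12, 40), (60, 95), (62, 90), (110, 150)]

column_gaps = find_gaps(x_extents, min_gap=5)
column_breaks = [(a + b) / 2.0 for a, b in column_gaps]
print(column_breaks)
```

The same function can be fed (y0, y1) extents to look for breaks along the other axis. As described next, a single piece of text spanning several columns closes every gap it crosses, which is why this approach only works some of the time.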

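The idea does work, but only sometimes. It assumes the table has the same number and alignment of cells vertically and horizontally (a simple grid), and that there is a distinct gap between the text of adjacent cells. Also, if some text spans multiple columns (a title above the table, a footer below it, merged cells, etc.), identifying the column breaks becomes harder. Text above or below the table can be identified and ignored, but I couldn't find a good approach for handling merged cells.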

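When it came time to look for the row breaks, there were several other challenges. First, pdfminer automatically tries to group pieces of text that are close to each other, even when they span more than one cell of the table. In those cases the BoundingBox of the text object covers multiple lines, hiding any row breaks it may have crossed. Even if every line of text were extracted separately, the challenge would still be to tell the normal spacing between consecutive lines of text apart from a row break.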

After exploring various workarounds and running a number of tests, I decided to step back and try a different approach.

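The tables holding the data I needed to extract all have borders, so I reasoned that I should be able to find the elements in the PDF file that draw those lines. However, when I looked at the elements I could extract from the source file, the results were surprising.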

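You would think the lines would be represented as "line objects", but you'd be wrong (at least for the files I was looking at). If they're not "lines", then maybe each cell is simply drawn as a rectangle, with the linewidth attribute adjusted to get the desired thickness? No. It turned out that the lines were actually drawn as "rectangle objects" with one very small dimension (a narrow width to make a vertical line, or a short height to make a horizontal line). And where the lines appear to meet at the corners, the rectangles don't actually meet; a very tiny rectangle fills each gap.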
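The test for "a rectangle that is really a line" can be sketched as below. The function name `rect_to_breaks` and the sample boxes are hypothetical; the comparison against a tolerance mirrors what the full script does with `brk_tol` when it sees an LTRect.

```python
def rect_to_breaks(bbox, tol=3):
    """Treat a thin rectangle (x0, y0, x1, y1) as a table border line.

    Returns (x_break, y_break). Each entry is the midpoint along that axis
    when the rectangle is thinner than tol in that direction, otherwise None.
    """
    x0, y0, x1, y1 = bbox
    x_break = (x0 + x1) / 2.0 if (x1 - x0) < tol else None   # narrow: vertical line
    y_break = (y0 + y1) / 2.0 if (y1 - y0) < tol else None   # short: horizontal line
    return x_break, y_break


# Hypothetical rectangles of the kinds described above
samples = [
    ("vertical line",   (100.0, 50.0, 100.5, 300.0)),
    ("horizontal line", (100.0, 299.5, 400.0, 300.0)),
    ("corner filler",   (100.0, 299.5, 100.5, 300.0)),
    ("large rectangle", (100.0, 50.0, 400.0, 300.0)),
]
for name, bbox in samples:
    print("%-16s %s" % (name, rect_to_breaks(bbox)))
```

A corner filler is small in both directions, so it reports both an x and a y break; a large rectangle reports neither.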

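Once I knew what to look for, I then had to contend with multiple rectangles placed right next to each other to make thick lines. In the end I wrote a routine that groups similar values and calculates an average, which is later used as the row or column break.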
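One way to group near-identical coordinates and average them is shown below. This is a simplified variant, not the routine in the script: it sorts the values and chains each one to the previous value when they are closer than a tolerance. The name `group_and_average` and the sample numbers are assumptions.

```python
def group_and_average(values, tol=3):
    """Cluster sorted values whose neighbours differ by less than tol,
    then return the mean of each cluster."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] < tol:
            groups[-1].append(v)       # close to the previous value: same line
        else:
            groups.append([v])         # far enough away: start a new group
    return [sum(g) / float(len(g)) for g in groups]


# Hypothetical x midpoints: three stacked rectangles forming one thick line,
# two forming another, and a single thin line
x_mids = [100.25, 101.0, 101.75, 250.5, 251.0, 400.0]
print(group_and_average(x_mids))
```

Chaining means a long run of closely spaced values ends up in a single group, so the tolerance should stay well below the narrowest real column or row.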

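Now it was a matter of processing the text from the table. I chose to use a SQLite database to store, analyze, and regroup the text from the PDF file. I know there are other "pythonic" options, and some people may find those approaches more familiar and easier to use, but I felt that the amount of data I'd be dealing with would be best handled with an actual database file.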
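A stripped-down sketch of the database step, using an in-memory SQLite database instead of a file. The table name `pdf_text`, the helper `cell_text`, and the sample rows are invented for illustration; the full script uses its own table layout. Each piece of text is stored with its bounding box, and one query collects everything that lies inside a cell's bounds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")     # throwaway database for the example
curs = conn.cursor()
curs.execute("CREATE TABLE pdf_text (idx INTEGER PRIMARY KEY, txt TEXT, "
             "pg INTEGER, x0 INTEGER, y0 INTEGER, x1 INTEGER, y1 INTEGER)")

# Hypothetical text pieces: (idx, text, page, x0, y0, x1, y1)
pieces = [
    (1, "Name",   1, 105, 280, 140, 292),
    (2, "Qty",    1, 255, 280, 275, 292),
    (3, "Widget", 1, 105, 260, 150, 272),
    (4, "12",     1, 255, 260, 268, 272),
]
curs.executemany("INSERT INTO pdf_text VALUES (?, ?, ?, ?, ?, ?, ?)", pieces)
conn.commit()


def cell_text(curs, pg, left, right, bottom, top):
    """Concatenate all text whose box lies inside the given cell bounds.

    Returns None when the cell holds no text.
    """
    curs.execute(
        "SELECT group_concat(txt, ' ') FROM pdf_text "
        "WHERE pg = ? AND x0 >= ? AND x1 <= ? AND y0 >= ? AND y1 <= ?",
        (pg, left, right, bottom, top))
    return curs.fetchone()[0]


print(cell_text(curs, 1, 100, 250, 275, 300))
print(cell_text(curs, 1, 250, 400, 255, 275))
```

SQLite does not guarantee the order in which `group_concat` joins the pieces, so a cell containing several pieces may need them sorted separately.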

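As mentioned earlier, pdfminer groups text that is located close together, and the group may cross cell boundaries. My attempt to split apart text that appears on separate lines within one of these groups was only partially successful. It's one of the areas I intend to develop further (namely, how to bypass the pdfminer LTTextBox routine so I can get the pieces individually).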
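The splitting step can be sketched as follows. This is an approximation under stated assumptions, not a fix for the grouping itself: it assumes the lines in a group are evenly spaced, listed top to bottom, and that y increases upward as in PDF coordinates. The name `split_box_lines` and the sample box are hypothetical.

```python
def split_box_lines(text, bbox):
    """Split multi-line text into (line, bbox) pairs of equal height.

    bbox is (x0, y0, x1, y1) for the whole group; every line gets the full
    width and an equal share of the height, starting from the top.
    """
    x0, y0, x1, y1 = bbox
    lines = text.strip().split("\n")
    height = (y1 - y0) / float(len(lines))
    result = []
    for n, line in enumerate(lines):
        top = y1 - n * height              # first line sits at the top edge
        line = line.strip()
        if line:                           # drop blank lines but keep their slot
            result.append((line, (x0, top - height, x1, top)))
    return result


# Hypothetical group of three lines that pdfminer returned as one box
for line, box in split_box_lines("Alpha\nBeta\nGamma\n", (100, 200, 180, 236)):
    print(line, box)
```

Lines of different heights, or extra spacing at a cell boundary, will shift the estimated boxes, which is one reason this kind of splitting only partly succeeds.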

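pdfminer has another shortcoming when it comes to vertical text. I couldn't find any attribute that identifies when text is vertical, or at what angle (e.g., +90 or -90 degrees) it is displayed. The text grouping routine doesn't seem to know either: when text is rotated +90 degrees (i.e., rotated CCW so the letters read from the bottom up), it concatenates the letters in reverse order, separated by newline characters.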
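If rotated text really does arrive as single characters in reverse order separated by newlines, as described above, it can be pieced back together with a heuristic like this one. Both function names are hypothetical, and the detection rule is only a guess: it will misfire on genuine text that happens to have one character per line.

```python
def looks_like_vertical_text(text):
    """Guess that text is rotated: several lines, each a single character."""
    pieces = [p for p in text.split("\n") if p != ""]
    return len(pieces) > 1 and all(len(p) == 1 for p in pieces)


def unrotate_vertical_text(text):
    """Rebuild a string that was extracted as reversed characters, one per line."""
    pieces = [p for p in text.split("\n") if p != ""]
    return "".join(reversed(pieces))


extracted = "l\na\nt\no\nT"                # hypothetical extraction result
if looks_like_vertical_text(extracted):
    print(unrotate_vertical_text(extracted))
```

Spaces inside the rotated text may be lost or stripped during extraction, so multi-word labels can come out run together.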

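The routine below works fairly well under the circumstances. I know it's still rough, there are several enhancements to make, and it isn't packaged for widespread distribution, but it seems to have "cracked the code" on how to extract tabular data from PDF files (for the most part). Hopefully others can use it for their own purposes, and maybe even improve on it.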

I welcome any ideas, suggestions, or recommendations.

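EDIT: I've posted a revised version that includes additional parameters (cell_vtol_up, etc.) to help tune the algorithm that decides which pieces of text belong to a particular cell of the table.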
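To show what the tolerance parameters do, here is a small stand-alone version of the containment test. The function name `text_in_cell` and the sample boxes are made up; the default values match the script's settings, and each tolerance is simply added to the corresponding cell edge before comparing.

```python
def text_in_cell(text_bbox, cell_bbox, htol_lf=-2, htol_rt=2, vtol_dn=0, vtol_up=8):
    """Return True if the text box fits inside the cell once each cell edge
    has been shifted by its tolerance (negative left/down values widen the cell)."""
    tx0, ty0, tx1, ty1 = text_bbox
    cx0, cy0, cx1, cy1 = cell_bbox
    return (tx0 >= cx0 + htol_lf and tx1 <= cx1 + htol_rt and
            ty0 >= cy0 + vtol_dn and ty1 <= cy1 + vtol_up)


cell = (100, 200, 250, 220)                # hypothetical cell: x0, y0, x1, y1
text = (99, 205, 180, 226)                 # pokes out slightly at the left and top

print(text_in_cell(text, cell))                   # with the default tolerances
print(text_in_cell(text, cell, 0, 0, 0, 0))       # with no tolerance at all
```

Larger tolerances catch text that overhangs a border, at the cost of sometimes pulling in text from the neighbouring cell, so they usually need tuning per document.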

# This was written for use w/Python 2. Use w/Python 3 hasn't been tested & proper execution is not guaranteed. 

import os             # Library of Operating System routines 
import sys             # Library of System routines 
import sqlite3            # Library of SQLite dB routines 
import re             # Library for Regular Expressions 
import csv             # Library to output as Comma Separated Values 
import codecs            # Library of text Codec types 
import cStringIO           # Library of String manipulation routines 

from pdfminer.pdfparser import PDFParser     # Library of PDF text extraction routines 
from pdfminer.pdfdocument import PDFDocument, PDFNoOutlines 
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed 
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter 
from pdfminer.pdfdevice import PDFDevice 
from pdfminer.layout import LAParams, LTTextBox, LTTextLine, LTFigure, LTImage, LTLine, LTRect, LTTextBoxVertical 
from pdfminer.converter import PDFPageAggregator 

#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 

def add_new_value(new_value, list_values=None):
    # Append new_value unless a value within 1 unit of it is already in the list
    # (a None default avoids sharing one mutable list between calls)
    if list_values is None:
        list_values = []

    if all(abs(list_value - new_value) >= 1 for list_value in list_values):
        list_values.append(new_value)

    return list_values

#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 

def condense_list(list_values, grp_tolerance=1):
    # Group values that lie within grp_tolerance of each other & replace each group with its rounded average
    tmp_list = []
    for list_value in list_values:
        # Skip values already represented by an entry in the condensed list
        if any(abs(val - list_value) < grp_tolerance for val in tmp_list):
            continue
        # Average all of the values that lie near this one
        near_values = [val for val in list_values if abs(val - list_value) < grp_tolerance]
        tmp_val = sum(near_values) / float(len(near_values))
        tmp_list.append(int(round(tmp_val)))

    return tmp_list

#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 

class UnicodeWriter: 
    """ 
    A CSV writer which will write rows to CSV file "f", 
    which is encoded in the given encoding. 
    """ 

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds): 
     # Redirect output to a queue 
     self.queue = cStringIO.StringIO() 
     self.writer = csv.writer(self.queue, dialect=dialect, quotechar = '"', quoting=csv.QUOTE_ALL, **kwds) 
     self.stream = f 
     self.encoder = codecs.getincrementalencoder(encoding)() 

    def writerow(self, row): 
     self.writer.writerow([unicode(s).encode("utf-8") for s in row]) 
     # Fetch UTF-8 output from the queue ... 
     data = self.queue.getvalue() 
     data = data.decode("utf-8") 
     # ... and reencode it into the target encoding 
     data = self.encoder.encode(data) 
     # write to the target stream 
     self.stream.write(data) 
     # empty queue 
     self.queue.truncate(0) 

    def writerows(self, rows): 
     for row in rows: 
      self.writerow(row) 

#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 

# In case a connection to the database can't be created, set 'conn' to 'None' 
conn = None 

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
# Define variables for use later on 
#_______________________________________________________________________________________________________________________ 

sqlite_file = "pdf_table_text.sqlite"      # Name of the sqlite database file 
brk_tol = 3             # Tolerance for grouping LTRect values as line break points 
                  # *** This may require tuning to get optimal results *** 

cell_htol_lf = -2           # Horizontal & Vertical tolerances (up/down/left/right) 
cell_htol_rt = 2           # for over-scanning table cell bounding boxes 
cell_vtol_up = 8           # i.e., how far outside cell bounds to look for text to include 
cell_vtol_dn = 0           # *** This may require tuning to get optimal results *** 

replace_newlines = True          # Switch for replacing newline codes (\n) with spaces 
replace_multspaces = True         # Switch for replacing multiple spaces with a single space 

# txt_concat_str = "' '"         # Concatenate cell data with a single space 
txt_concat_str = "char(10)"         # Concatenate cell data with a line feed 

#======================================================================================================================= 
# Default values for sample input & output files (path, filename, pagelist, etc.) 

filepath = ""            # Path of the source PDF file (default = current folder) 
srcfile = ""            # Name of the source PDF file (quit if left blank) 
pagelist = [1, ]           # Pages to extract table data (Make an interactive input?) 
                  # --> THIS MUST BE IN THE FORM OF A LIST OR TUPLE! 

#======================================================================================================================= 
# Impose required conditions & abort execution if they're not met 

# Should check if files are locked: sqlite database, input & output files, etc. 

if srcfile == "" or not pagelist:
    print "Source file not specified and/or page list is blank! Execution aborted!" 
    sys.exit() 

dmp_pdf_data = "pdf_data.csv" 
dmp_tbl_data = "tbl_data.csv" 
destfile = srcfile[:-3]+"csv" 

#======================================================================================================================= 
# First test to see if this file already exists & delete it if it does 

if os.path.isfile(sqlite_file): 
    os.remove(sqlite_file) 

#======================================================================================================================= 
try: 

    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
    # Open or Create the SQLite database file 
    #___________________________________________________________________________________________________________________ 

    print "-" * 120 
    print "Creating SQLite Database & working tables ..." 

    # Connecting to the database file 
    conn = sqlite3.connect(sqlite_file) 
    curs = conn.cursor() 

    qry_create_table = "CREATE TABLE {tn} ({nf} {ft} PRIMARY KEY)" 
    qry_alter_add_column = "ALTER TABLE {0} ADD COLUMN {1}" 

    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
    # Create 1st Table 
    #___________________________________________________________________________________________________________________ 

    tbl_pdf_elements = "tbl_pdf_elements"     # Name of the 1st table to be created 
    new_field = "idx"          # Name of the index column 
    field_type = "INTEGER"         # Column data type 

    # Delete the table if it exists so old data is cleared out 
    curs.execute("DROP TABLE IF EXISTS " + tbl_pdf_elements) 

    # Create output table for PDF text w/1 column (index) & set it as PRIMARY KEY 
    curs.execute(qry_create_table.format(tn=tbl_pdf_elements, nf=new_field, ft=field_type)) 

    # Table fields: index, text_string, pg, x0, y0, x1, y1, orient 
    cols = ("'pdf_text' TEXT", 
      "'pg' INTEGER", 
      "'x0' INTEGER", 
      "'y0' INTEGER", 
      "'x1' INTEGER", 
      "'y1' INTEGER", 
      "'orient' INTEGER") 

    # Add other columns 
    for col in cols: 
     curs.execute(qry_alter_add_column.format(tbl_pdf_elements, col)) 

    # Committing changes to the database file 
    conn.commit() 

    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
    # Create 2nd Table 
    #___________________________________________________________________________________________________________________ 

    tbl_table_data = "tbl_table_data"      # Name of the 2nd table to be created 
    new_field = "idx"          # Name of the index column 
    field_type = "INTEGER"         # Column data type 

    # Delete the table if it exists so old data is cleared out 
    curs.execute("DROP TABLE IF EXISTS " + tbl_table_data) 

    # Create output table for Table Data w/1 column (index) & set it as PRIMARY KEY 
    curs.execute(qry_create_table.format(tn=tbl_table_data, nf=new_field, ft=field_type)) 

    # Table fields: index, text_string, pg, row, column 
    cols = ("'tbl_text' TEXT", 
      "'pg' INTEGER", 
      "'row' INTEGER", 
      "'col' INTEGER") 

    # Add other columns 
    for col in cols: 
     curs.execute(qry_alter_add_column.format(tbl_table_data, col)) 

    # Committing changes to the database file 
    conn.commit() 

    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
    # Start PDF text extraction code here 
    #___________________________________________________________________________________________________________________ 

    print "Opening PDF file & preparing for text extraction:" 
    print " -- " + filepath + srcfile 

    # Open a PDF file. 
    fp = open(filepath + srcfile, "rb") 

    # Create a PDF parser object associated with the file object. 
    parser = PDFParser(fp) 

    # Create a PDF document object that stores the document structure. 

    # Supply the password for initialization (if needed) 
    # document = PDFDocument(parser, password) 
    document = PDFDocument(parser) 

    # Check if the document allows text extraction. If not, abort. 
    if not document.is_extractable: 
     raise PDFTextExtractionNotAllowed 

    # Create a PDF resource manager object that stores shared resources. 
    rsrcmgr = PDFResourceManager() 

    # Create a PDF device object. 
    device = PDFDevice(rsrcmgr) 

    # Create a PDF interpreter object. 
    interpreter = PDFPageInterpreter(rsrcmgr, device) 

    # Set parameters for analysis. 
    laparams = LAParams() 

    # Create a PDF page aggregator object. 
    device = PDFPageAggregator(rsrcmgr, laparams=laparams) 
    interpreter = PDFPageInterpreter(rsrcmgr, device) 

    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
    # Extract text & location data from PDF file (examine & process only pages in the page list) 
    #___________________________________________________________________________________________________________________ 

    # Initialize variables 
    idx1 = 0 
    idx2 = 0 
    lastpg = max(pagelist) 

    print "Starting text extraction ..." 

    qry_insert_pdf_txt = "INSERT INTO " + tbl_pdf_elements + " VALUES(?, ?, ?, ?, ?, ?, ?, ?)" 
    qry_get_pdf_txt = "SELECT group_concat(pdf_text, " + txt_concat_str + \ 
     ") FROM {0} WHERE pg=={1} AND x0>={2} AND x1<={3} AND y0>={4} AND y1<={5} ORDER BY y0 DESC, x0 ASC;" 
    qry_insert_tbl_data = "INSERT INTO " + tbl_table_data + " VALUES(?, ?, ?, ?, ?)" 

    # Process each page contained in the document. 
    for i, page in enumerate(PDFPage.create_pages(document)): 

     interpreter.process_page(page) 

     # Get the LTPage object for the page. 
     lt_objs = device.get_result() 
     pg = device.pageno - 1        # Must subtract 1 to correct 'pageno' 

     # Exit the loop if past last page to parse 
     if pg > lastpg: 
      break 

     #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
     # If it finds a page in the pagelist, process the contents 

     if pg in pagelist: 
      print "- Processing page {0} ...".format(pg) 

      xbreaks = [] 
      ybreaks = [] 

      #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
      # Iterate thru list of pdf layout elements (LT* objects) then capture the text & attributes of each 

      for lt_obj in lt_objs: 

       # Examine LT objects & get parameters for text strings 
       if isinstance(lt_obj, LTTextBox) or isinstance(lt_obj, LTTextLine): 
        # Increment index 
        idx1 += 1 

        # Assign PDF LTText object parameters to variables 
        pdftext = lt_obj.get_text()    # Need to convert escape codes & unicode characters! 
        pdftext = pdftext.strip()    # Remove leading & trailing whitespaces 

        # Save integer bounding box coordinates: round down @ start, round up @ end 
        # (x0, y0, x1, y1) = lt_obj.bbox 
        x0 = int(lt_obj.bbox[0]) 
        y0 = int(lt_obj.bbox[1]) 
        x1 = int(lt_obj.bbox[2] + 1) 
        y1 = int(lt_obj.bbox[3] + 1) 

        orient = 0        # What attribute gets this value? 

        #---- These approaches don't work for identifying vertical text ... -------------------------------- 

        # orient = lt_obj.rotate 
        # orient = lt_obj.char_disp 

        # if lt_obj.get_writing_mode == "tb-rl": 
         # orient = 90 

        # if isinstance(lt_obj, LTTextBoxVertical): # vs LTTextBoxHorizontal 
         # orient = 90 

        # if LAParams(lt_obj).detect_vertical: 
         # orient = 90 

        #--------------------------------------------------------------------------------------------------- 
        # Split text strings at line feeds 

        if "\n" in pdftext: 
         substrs = pdftext.split("\n") 
         lineheight = (y1-y0)/(len(substrs) + 1) 
         # y1 = y0 + lineheight 
         y0 = y1 - lineheight 
         for substr in substrs: 
          substr = substr.strip()   # Remove leading & trailing whitespaces 
          if substr != "": 
           # Insert values into tuple for uploading into dB 
           pdf_txt_export = [(idx1, substr, pg, x0, y0, x1, y1, orient)] 

           # Insert values into dB 
           curs.executemany(qry_insert_pdf_txt, pdf_txt_export) 
           conn.commit() 

          idx1 += 1 
          # y0 = y1 
          # y1 = y0 + lineheight 
          y1 = y0 
          y0 = y1 - lineheight 

        else: 
         # Insert values into tuple for uploading into dB 
         pdf_txt_export = [(idx1, pdftext, pg, x0, y0, x1, y1, orient)] 

         # Insert values into dB 
         curs.executemany(qry_insert_pdf_txt, pdf_txt_export) 
         conn.commit() 

       elif isinstance(lt_obj, LTLine): 
        # LTLine - Lines drawn to define tables 
        pass 

       elif isinstance(lt_obj, LTRect): 
        # LTRect - Borders drawn to define tables 

        # Grab the lt_obj.bbox values 
        x0 = round(lt_obj.bbox[0], 2) 
        y0 = round(lt_obj.bbox[1], 2) 
        x1 = round(lt_obj.bbox[2], 2) 
        y1 = round(lt_obj.bbox[3], 2) 
        xmid = round((x0 + x1)/2, 2) 
        ymid = round((y0 + y1)/2, 2) 

        # rectline = lt_obj.linewidth 

        # If width less than tolerance, assume it's used as a vertical line 
        if (x1 - x0) < brk_tol:     # Vertical Line or Corner 
         xbreaks = add_new_value(xmid, xbreaks) 

        # If height less than tolerance, assume it's used as a horizontal line 
        if (y1 - y0) < brk_tol:     # Horizontal Line or Corner 
         ybreaks = add_new_value(ymid, ybreaks) 

       elif isinstance(lt_obj, LTImage): 
        # An image, so do nothing 
        pass 

       elif isinstance(lt_obj, LTFigure): 
        # LTFigure objects are containers for other LT* objects which shouldn't matter, so do nothing 
        pass 

      col_breaks = condense_list(xbreaks, brk_tol) # Group similar values & eliminate duplicates 
      row_breaks = condense_list(ybreaks, brk_tol) 

      col_breaks.sort() 
      row_breaks.sort() 

      #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
      # Regroup the text into table 'cells' 
      #___________________________________________________________________________________________________________ 

      print " -- Text extraction complete. Grouping data for table ..." 

      row_break_prev = 0 
      col_break_prev = 0 

      table_data = [] 
      table_rows = len(row_breaks) 
      for i, row_break in enumerate(row_breaks): 
       if row_break_prev == 0:        # Skip the rest the first time thru 
        row_break_prev = row_break 
       else: 
        for j, col_break in enumerate(col_breaks): 
         if col_break_prev == 0:      # Skip query the first time thru 
          col_break_prev = col_break 
         else: 
          # Run query to get all text within cell lines (+/- htol & vtol values) 
          curs.execute(qry_get_pdf_txt.format(tbl_pdf_elements, pg, col_break_prev + cell_htol_lf, \ 
           col_break + cell_htol_rt, row_break_prev + cell_vtol_dn, row_break + cell_vtol_up)) 

          rows = curs.fetchall()     # Retrieve all rows 

          for row in rows: 
           if row[0] != None:     # Skip null results 
            idx2 += 1 
            table_text = row[0] 
            if replace_newlines:   # Option - Replace newline codes (\n) with spaces 
             table_text = table_text.replace("\n", " ") 

            if replace_multspaces:   # Option - Replace multiple spaces w/single space 
             table_text = re.sub(" +", " ", table_text) 

            table_data.append([idx2, table_text, pg, table_rows - i, j]) 

         col_break_prev = col_break 

       row_break_prev = row_break 

      curs.executemany(qry_insert_tbl_data, table_data) 
      conn.commit() 

    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
    # Export the regrouped table data: 

    # Determine the number of columns needed for the output file 
    # -- Should the data be extracted all at once or one page at a time? 

    print "Saving exported table data ..." 

    qry_col_count = "SELECT MIN([col]) AS colmin, MAX([col]) AS colmax, MIN([row]) AS rowmin, MAX([row]) AS rowmax, " + \ 
     "COUNT([row]) AS rowttl FROM [{0}] WHERE [pg] = {1} AND [tbl_text]!=' ';" 

    qry_sql_export = "SELECT * FROM [{0}] WHERE [pg] = {1} AND [row] = {2} AND [tbl_text]!=' ' ORDER BY [col];" 

    f = open(filepath + destfile, "wb") 
    writer = UnicodeWriter(f) 

    for pg in pagelist: 
     curs.execute(qry_col_count.format(tbl_table_data, pg)) 
     rows = curs.fetchall() 

     if len(rows) > 1: 
      print "Error retrieving row & column counts! More than one record returned!"
      print " -- ", qry_col_count.format(tbl_table_data, pg) 
      print rows 
      sys.exit() 

     for row in rows: 
      (col_min, col_max, row_min, row_max, row_ttl) = row 

     # Insert a page separator 
     writer.writerow(["Data for Page {0}:".format(pg), ]) 

     if row_ttl == 0: 
      writer.writerow(["Unable to export text from PDF file. No table structure found.", ]) 

     else: 
      k = 0 
      for j in range(row_min, row_max + 1): 
       curs.execute(qry_sql_export.format(tbl_table_data, pg, j)) 

       rows = curs.fetchall() 

       if rows == None:       # No records match the given criteria 
        pass 

       else: 
        i = 1 
        k += 1 
        column_data = [k, ]      # 1st column as an Index 

        for row in rows: 
         (idx, tbl_text, pg_num, row_num, col_num) = row 

         if pg_num != pg:     # Exit the loop if Page # doesn't match 
          break 

         while i < col_num: 
          column_data.append("") 
          i += 1 
          if i >= col_num or i == col_max: break 

         column_data.append(unicode(tbl_text)) 
         i += 1 

        writer.writerow(column_data) 

    f.close() 

    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
    # Dump the SQLite regrouped data (for error checking): 

    print "Dumping SQLite table of regrouped (table) text ..." 

    qry_sql_export = "SELECT * FROM [{0}] WHERE [tbl_text]!=' ' ORDER BY [pg], [row], [col];" 
    curs.execute(qry_sql_export.format(tbl_table_data)) 
    rows = curs.fetchall() 

    # Output data with Unicode intact as CSV 
    with open(dmp_tbl_data, "wb") as f: 
     writer = UnicodeWriter(f) 
     writer.writerow(["idx", "tbl_text", "pg", "row", "col"]) 
     writer.writerows(rows) 

    f.close() 

    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
    # Dump the SQLite temporary PDF text data (for error checking): 

    print "Dumping SQLite table of extracted PDF text ..." 

    qry_sql_export = "SELECT * FROM [{0}] WHERE [pdf_text]!=' ' ORDER BY pg, y0 DESC, x0 ASC;" 
    curs.execute(qry_sql_export.format(tbl_pdf_elements)) 
    rows = curs.fetchall() 

    # Output data with Unicode intact as CSV 
    with open(dmp_pdf_data, "wb") as f: 
     writer = UnicodeWriter(f) 
     writer.writerow(["idx", "pdf_text", "pg", "x0", "y0", "x1", "y1", "orient"])
     writer.writerows(rows) 

    f.close() 

    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
    print "Conversion complete." 
    print "-" * 120 

except sqlite3.Error as e:

    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
    # Rollback the last database transaction if the connection fails 
    #___________________________________________________________________________________________________________________ 

    if conn: 
     conn.rollback() 

    print "Error '{0}':".format(e.args[0]) 
    sys.exit(1) 

finally: 

    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
    # Close the connection to the database file 
    #___________________________________________________________________________________________________________________ 

    if conn: 
     conn.close()

Source

2017-08-23 23:38:12
