Removing inner borders

I have a lot of images that were cropped out of table images. OCR has trouble detecting the text because of "leftovers" of the table borders. I am looking for a way to remove them (I want to pick up only the text). Here are some examples:

first image example

second image example

Thanks!

Source

2017-05-05 sebbz

Why not just crop _x_ pixels (e.g. 5 pixels) off each border?

Because some images have no black border at all, and sometimes the images are very small. If I cropped them as you suggest, the text would be cut off too. – sebbz

This code (based on OpenCV) solves the problem for the two examples. The procedure is as follows:

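- threshold the image
- remove lines from the binary objects by ratio
  - compute ratio = (area of object) / (area of bounding box)
  - if the ratio is too small, the object is considered a combination of lines
  - if the ratio is large, the object is considered a single line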

Python code:

import cv2 
import matplotlib.pyplot as plt
import numpy as np 

# load image 
img = cv2.imread('om9gN.jpg',0) 

# blur and apply otsu threshold 
img = cv2.blur(img, (3,3)) 
_, img = cv2.threshold(img,0,255,cv2.THRESH_BINARY+cv2.THRESH_OTSU) 

# invert image: dark pixels (text, borders) become 1, background becomes 0
img = (img == 0).astype(np.uint8) 


img_new = np.zeros_like(img) 

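# find contours; findContours returns 3 values in OpenCV 3.x and 2 values
# in OpenCV 4.x, and [-2] is the contour list in both cases
contours = cv2.findContours(img, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)[-2]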

for idx, cnt in enumerate(contours): 

    # get area of contour 
    temp = np.zeros_like(img) 
    cv2.drawContours(temp, contours , idx, 1, -1) 
    area_cnt = np.sum(temp) 

    # get number of pixels of bounding box of contour 
    x,y,w,h = cv2.boundingRect(cnt) 
    area_box = w * h 

    # get ratio of cnt-area and box-area 
    ratio = float(area_cnt)/area_box 

    # only draw contour if: 
    # - 1.) ratio is not too big (line fills whole bounding box) 
    # - 2.) ratio is not too small (combination of lines fill very 
    #         small ratio of bounding box) 
    if 0.9 > ratio > 0.2: 
        cv2.drawContours(img_new, contours, idx, 1, -1)

plt.figure() 
plt.subplot(1,2,1) 
plt.imshow(img_new) 
plt.axis("off") 
plt.show()
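
To see why the two thresholds in the code above separate border leftovers from characters, here is a small worked example on synthetic shapes. It is only a sketch in pure Python: the helper names `fill_ratio`, `is_text_like` and `block`, and the three shapes, are made up for illustration. Only the 0.2 and 0.9 thresholds come from the code above.

```python
def fill_ratio(pixels):
    """Return (object area) / (bounding-box area) for a set of (x, y) pixels."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    box_area = (max(xs) - min(xs) + 1) * (max(ys) - min(ys) + 1)
    return len(pixels) / float(box_area)


def is_text_like(ratio, low=0.2, high=0.9):
    """Same test as in the loop above: keep objects with low < ratio < high."""
    return low < ratio < high


def block(x0, y0, w, h):
    """Solid w x h rectangle of pixels with its top-left corner at (x0, y0)."""
    return {(x, y) for x in range(x0, x0 + w) for y in range(y0, y0 + h)}


# Hypothetical shapes, chosen only to illustrate the three cases.
shapes = {
    "single border line (40x2)": block(0, 0, 40, 2),
    "two border lines meeting at a corner": block(0, 0, 40, 2) | block(0, 0, 2, 40),
    "letter-like blob (10x14 with a hole)": block(0, 0, 10, 14) - block(3, 4, 4, 6),
}

for name, pixels in shapes.items():
    ratio = fill_ratio(pixels)
    print("%-38s ratio=%.3f keep=%s" % (name, ratio, is_text_like(ratio)))
```

The single line fills its whole bounding box (ratio 1.0), the corner piece covers only about a tenth of its box, and the letter-like blob lands at roughly 0.83, so only the blob passes the test.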

Source

2017-05-05 15:35:41

It works in most cases. However, depending on the font used, it also removes thin characters such as l (lowercase L).

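That's true. One option to avoid this is to also set a threshold on the aspect ratio of the bounding box (box_width/box_length). If the aspect ratio is too small, the object must be a line rather than a character such as I, l, -, ...

Here I found another way to remove lines from a binary image: http://docs.opencv.org/trunk/d1/dee/tutorial_moprh_lines_detection.html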
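
One possible reading of the aspect-ratio idea in the comment above, as a minimal sketch: objects that fill their bounding box almost completely are dropped only if the box is also extremely elongated. The function name `keep_object`, the `min_aspect` value of 0.1, and measuring the aspect as short side / long side (so that horizontal and vertical lines are treated alike) are all assumptions for illustration. The comment itself gives no threshold.

```python
def keep_object(area, box_w, box_h, low=0.2, high=0.9, min_aspect=0.1):
    """Decide whether a binary object should be kept as text.

    area         -- number of pixels in the object
    box_w, box_h -- width and height of its bounding box
    low, high    -- fill-ratio thresholds taken from the answer's code
    min_aspect   -- assumed threshold on short side / long side
    """
    fill = area / float(box_w * box_h)
    if fill <= low:
        # Sparse object: treated as a combination of border lines.
        return False
    if fill < high:
        # Ordinary character-like object.
        return True
    # Nearly solid object: either a border line or a thin glyph (I, l, -).
    # Only very elongated boxes are treated as lines.
    aspect = min(box_w, box_h) / float(max(box_w, box_h))
    return aspect >= min_aspect


print(keep_object(42, 3, 14))     # solid 3x14 bar, like a lowercase l
print(keep_object(240, 120, 2))   # solid 120x2 bar, like a border leftover
```

In the loop from the answer, this would be called as `keep_object(area_cnt, w, h)` in place of the `0.9 > ratio > 0.2` check. A short border fragment can still have the same shape as a hyphen, so the threshold would need tuning on real images.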
