Stripping HTML content using sed

I am working on a task where sed is the designated tool. The task is to strip the content of any web page file (*.htm or *.html) and insert the desired data into a new file.

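Everything before and including the <body> tag is to be removed.
Everything from and including the </body> tag is to be removed.

Below is an example where the <div> tags, and everything between them, are to be preserved: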

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 
<html xmlns="http://www.w3.org/1999/xhtml"> 
<head> 
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> 
<title>SED Challange</title> 
</head> 
<body style="background-color:black;"><div style="width:100%; height:150px; margin-top:150px; text-align:center"> 
<img src="pic.png" width="50" height="50" alt="Pic alt text" /> 
</div></body></html>

However, I run into trouble when removing <body> and what comes before it:

sed 's/.*body.*>//' ./index.html > ./index.html.nobody

Instead of the desired result, the two separate lines containing <body> and </body> are removed entirely:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 
<html xmlns="http://www.w3.org/1999/xhtml"> 
<head> 
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> 
<title>SED Challange</title> 
</head> 

<img src="pic.png" width="50" height="50" alt="Pic alt text" />

I can't see why that is. I would appreciate any feedback.
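A small reproduction may help to see what happens here. The sketch below is not from the original thread: the three-line sample and the variable names `sample`, `greedy` and `tag_only` are assumptions made for illustration. It relies on two things sed is known to do: it applies the command to one line at a time, and `.*` matches as much as it can.

```shell
#!/bin/bash
# Three-line stand-in for the relevant part of index.html (hypothetical sample).
sample='<body style="background-color:black;"><div style="width:100%">
<img src="pic.png" />
</div></body></html>'

# sed applies the command to each line separately, and .* is greedy:
# on a line containing "body", the match runs from the start of the line
# to the last ">" on that line, so the whole line is replaced by nothing.
greedy=$(printf '%s\n' "$sample" | sed 's/.*body.*>//')

# Restricting the match to the tag itself keeps the rest of the line.
tag_only=$(printf '%s\n' "$sample" | sed 's/<\/*body[^>]*>//g')

printf 'greedy:\n%s\n\ntag only:\n%s\n' "$greedy" "$tag_only"
```

With the greedy pattern only the `<img>` line survives, which matches the output shown above. Lines before `<body>` are untouched because they do not contain "body", so the pattern never matches them.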

Edit: Thanks, this is my complete script:

#!/bin/bash

# Search location as user-provided argument.
target="$1"

# Recursive, case-insensitive search for files with an .htm or .html extension.
# The parentheses make -type f apply to both name tests; -print0 together with
# read -d '' keeps file names that contain spaces intact.
find "$target" -type f \( -iname '*.htm' -o -iname '*.html' \) -print0 |
while IFS= read -r -d '' h
do
    hp=$(realpath "$h") # Absolute path of file (hit path).
    echo "Stripping performed on $hp" # Informing what file(s) found.
    nobody="${hp}_nobody" # File to contain desired data, ending with "_nobody".

    # Remove file contents from the start up to and including the closing head tag,
    # remove the body tags,
    # remove the closing html tag,
    # remove blank lines,
    # write the remaining data to file_nobody.
    sed '1,/<\/head>/d;s/<\/*body[^>]*>//g;s/<\/html>//;/^$/d' "$h" > "$nobody"
done
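One detail worth checking when `-type f` is combined with two `-iname` tests, as in the script above, is operator precedence in find: the implicit "and" binds tighter than "or". The following is a minimal sketch under assumed conditions (a scratch directory from `mktemp`, with the hypothetical names `page.htm` and `archive.html`), not part of the original script.

```shell
#!/bin/bash
# Hypothetical scratch directory: one regular file and one directory
# whose name happens to end in .html.
tmp=$(mktemp -d)
touch "$tmp/page.htm"
mkdir "$tmp/archive.html"

# Ungrouped: parsed as (-type f AND -iname '*.htm') OR (-iname '*.html'),
# so the directory archive.html is reported as well.
ungrouped=$(find "$tmp" -type f -iname '*.htm' -o -iname '*.html' | sort)

# Grouped: -type f applies to both name tests, so only regular files match.
grouped=$(find "$tmp" -type f \( -iname '*.htm' -o -iname '*.html' \) | sort)

printf 'ungrouped:\n%s\n\ngrouped:\n%s\n' "$ungrouped" "$grouped"
rm -r "$tmp"
```

The ungrouped form lists both entries, while the grouped form lists only `page.htm`. Passing a directory to sed would produce an error, so the grouped form seems the safer choice for this task.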

Source

2017-03-14 henrix

So, the result should be "…"? – RomanPerekhrest

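That is correct for this particular content, Roman. As the task is stated, the desired data to collect is whatever lies between the BODY tags. In another source file, there might be no DIV tags. – henrix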

This sed command should work with the given code:

sed '1,/<\/head>/d;s/<\/*body[^>]*>//g;s/<\/html>//' ./index.html > ./index.html.nobody

It removes:

lines from the first line through the line containing the closing </head> tag
the <body> and </body> tags
the closing </html> tag

But note that sed is not meant for parsing HTML files. Use an XML parser instead (e.g. xmllint, XMLStarlet, ...).
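To see the three parts of that command at work, here is a minimal sketch that runs it on a scratch copy of a page shaped like the one in the question. The shortened sample page, the `mktemp` directory and the variable name `result` are assumptions for illustration only.

```shell
#!/bin/bash
# Hypothetical scratch copy of a page shaped like the example in the question.
tmp=$(mktemp -d)
cat > "$tmp/index.html" <<'EOF'
<!DOCTYPE html>
<html>
<head>
<title>SED Challange</title>
</head>
<body style="background-color:black;"><div style="text-align:center">
<img src="pic.png" width="50" height="50" alt="Pic alt text" />
</div></body></html>
EOF

# 1,/<\/head>/d          delete line 1 through the first line containing </head>
# s/<\/*body[^>]*>//g    delete <body ...> and </body>, keeping the rest of the line
# s/<\/html>//           delete the closing </html>
result=$(sed '1,/<\/head>/d;s/<\/*body[^>]*>//g;s/<\/html>//' "$tmp/index.html")

printf '%s\n' "$result"
rm -r "$tmp"
```

On this sample the output is the `<div>` line, the `<img>` line and the closing `</div>`. The sketch assumes that `</head>` and `<body>` sit on different lines; if both were on one line, the `1,/<\/head>/d` range would delete that whole line, including anything following `<body>` on it.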

Source

2017-03-14 12:31:50 SLePort

SLePort, thank you for your input. It didn't work for me, I'm afraid. I'm trying to have the script explicitly remove everything up to and including the BODY tag. Another source file might have no DIV tags. – henrix

@henrix I've edited my answer. Let me know if that is what you want. – SLePort

Thank you, SLePort. Apologies for the late reply, I have been working on completing the script. – henrix
