Search HTML for 2 phrases (ignoring all tags) and strip everything else

Problem description

I have HTML code stored in a string, e.g.:

$html = '
        <html>
        <body>
        <p>Hello <em>進撃の巨人</em>!</p>
        random code
        random code
        <p>Lorem <span>ipsum<span>.</p>
        </body>
        </html>
        ';

Then I have two sentences stored in variables:

$begin = 'Hello 進撃の巨人!';
$end = 'Lorem ipsum.';

I want to search $html for these two sentences and strip everything before and after them. So $html would become:

$html = 'Hello <em>進撃の巨人</em>!</p>
        random code
        random code
        <p>Lorem <span>ipsum<span>.';

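How can I achieve this? Note that the $begin and $end variables contain no HTML tags, but the sentences in $html very likely do contain tags, as shown above.

Maybe a regular expression?

What I have tried so far

The strpos() approach. The problem is that $html contains tags inside the sentences, so the $begin and $end sentences do not match. I could run strip_tags($html) before strpos(), but then I would obviously end up with $html without its tags.
Searching for only part of a variable, such as Hello. That is never safe and would produce many false matches.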
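
The two attempts described above can be reproduced with a short sketch. The shortened $html string here is an assumption made only to keep the illustration small:

```php
<?php
// Shortened sample input (hypothetical, for illustration only).
$html  = '<p>Hello <em>進撃の巨人</em>!</p> random code <p>Lorem <span>ipsum<span>.</p>';
$begin = 'Hello 進撃の巨人!';

// Attempt 1: the <em> tags inside the sentence prevent a direct match.
$direct = strpos($html, $begin);

// Attempt 2: after strip_tags() the sentence is found, but the offset
// refers to the stripped text and the tags are no longer available.
$stripped   = strip_tags($html);
$inStripped = strpos($stripped, $begin);

var_dump($direct, $inStripped, $stripped);
```

The first lookup fails and the second succeeds only on a copy that has lost its markup, which is the gap the answers below try to close.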

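Best solution

Here is a short solution, based on a lazy dot matching regex, that I believe works (it could be improved by building a longer, unrolled regex, but it should be enough unless you have really large chunks of text).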

$html = "<html>\n<body>\n<p><p>H<div>ello</div><script></script> <em>進&nbsp;&nbsp;&nbsp;撃の巨人</em>!</p>\nrandom code\nrandom code\n<p>Lorem <span>ipsum<span>.</p>\n</body>\n </html>";
$begin = 'Hello     進撃の巨人!';
$end = 'Lorem ipsum.';
// Collapse inner whitespace runs to a single space and drop trailing whitespace
$begin = preg_replace_callback('~\s++(?!\z)|(\s++\z)~u', function ($m) { return !empty($m[1]) ? '' : ' '; }, $begin);
$end = preg_replace_callback('~\s++(?!\z)|(\s++\z)~u', function ($m) { return !empty($m[1]) ? '' : ' '; }, $end);
// Split both delimiters into single Unicode graphemes
$begin_arr = preg_split('~(?=\X)~u', $begin, -1, PREG_SPLIT_NO_EMPTY);
$end_arr = preg_split('~(?=\X)~u', $end, -1, PREG_SPLIT_NO_EMPTY);
// After every grapheme except the last, allow optional whitespace, tags and entities
$reg = "(?s)(?:<[^<>]+>)?(?:&#?\w+;)*\s*" .  implode("", array_map(function($x, $k) use ($begin_arr) { return ($k < count($begin_arr) - 1 ? preg_quote($x, "~") . "(?:\s*(?:<[^<>]+>|&#?\w+;))*" : preg_quote($x, "~"));}, $begin_arr, array_keys($begin_arr)))
        . "(.*?)" . 
        implode("", array_map(function($x, $k) use ($end_arr) { return ($k < count($end_arr) - 1 ? preg_quote($x, "~") . "(?:\s*(?:<[^<>]+>|&#?\w+;))*" : preg_quote($x, "~"));}, $end_arr, array_keys($end_arr))); 
echo $reg .PHP_EOL;
preg_match('~' . $reg . '~u', $html, $m);
print_r($m[0]);

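See the IDEONE demo

Algorithm:

Build a dynamic regex pattern by splitting the delimiter strings into single graphemes (since these can be Unicode characters, I suggest using preg_split('~(?<!^)(?=\X)~u', $end)) and imploding them back with an optional tag matching pattern, (?:<[^<>]+>)?, in between.
Then (?s) enables DOTALL mode, in which . matches any character including a newline, and .*? matches 0+ characters from the leading to the trailing delimiter.

Regex details:

'~(?<!^)(?=\X)~u' matches every location other than the start of the string, before each grapheme
(sample final regex) (?s)(?:<[^<>]+>)?(?:&#?\w+;)*\s*H(?:\s*(?:<[^<>]+>|&#?\w+;))*e(?:\s*(?:<[^<>]+>|&#?\w+;))*l(?:\s*(?:<[^<>]+>|&#?\w+;))*l(?:\s*(?:<[^<>]+>|&#?\w+;))*o(?:\s*(?:<[^<>]+>|&#?\w+;))* (?:\s*(?:<[^<>]+>|&#?\w+;))*進(?:\s*(?:<[^<>]+>|&#?\w+;))*撃(?:\s*(?:<[^<>]+>|&#?\w+;))*の(?:\s*(?:<[^<>]+>|&#?\w+;))*巨(?:\s*(?:<[^<>]+>|&#?\w+;))*人(?:\s*(?:<[^<>]+>|&#?\w+;))*!(?:\s*(?:<[^<>]+>|&#?\w+;))* + (.*?) + L(?:\s*(?:<[^<>]+>|&#?\w+;))*o(?:\s*(?:<[^<>]+>|&#?\w+;))*r(?:\s*(?:<[^<>]+>|&#?\w+;))*e(?:\s*(?:<[^<>]+>|&#?\w+;))*m(?:\s*(?:<[^<>]+>|&#?\w+;))* (?:\s*(?:<[^<>]+>|&#?\w+;))*i(?:\s*(?:<[^<>]+>|&#?\w+;))*p(?:\s*(?:<[^<>]+>|&#?\w+;))*s(?:\s*(?:<[^<>]+>|&#?\w+;))*u(?:\s*(?:<[^<>]+>|&#?\w+;))*m(?:\s*(?:<[^<>]+>|&#?\w+;))*. : the leading and trailing delimiters with optional subpatterns for tag matching, and (.*?) in between (capturing might not be necessary).
The ~u modifier is necessary to handle Unicode strings.
Update: to account for 1+ spaces, any whitespace in the begin and end patterns can be replaced with the \s+ subpattern, so that it matches any kind of 1+ whitespace characters in the input string.
Update 2: the auxiliary $begin = preg_replace('~\s+~u', ' ', $begin); and $end = preg_replace('~\s+~u', ' ', $end); are necessary to account for 1+ whitespace in the input strings.
To account for HTML entities, add another subpattern to the optional parts: &#?\w+;, which also matches entities such as &nbsp; and &#123;. It is also prepended with \s* to match optional whitespace and quantified with * (zero or more).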
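
The pattern-building idea described above can be condensed into a small helper. This is only a sketch under stated assumptions: buildTagTolerantPattern() is a hypothetical name, the sample $html is shortened, and the phrase is split into code points with preg_split('//u', ...) rather than into graphemes with \X as in the answer.

```php
<?php
// Hypothetical helper: turns a plain phrase into a regex fragment that
// tolerates whitespace, tags and entities between its characters.
function buildTagTolerantPattern(string $phrase): string
{
    // Collapse whitespace runs and trim, like the normalization step above.
    $phrase = trim(preg_replace('~\s+~u', ' ', $phrase));
    // Split into Unicode code points (assumption: graphemes are not needed).
    $chars = preg_split('//u', $phrase, -1, PREG_SPLIT_NO_EMPTY);
    // Optional filler allowed between two consecutive characters.
    $filler = '(?:\s*(?:<[^<>]+>|&#?\w+;))*';
    $quoted = array_map(function ($c) { return preg_quote($c, '~'); }, $chars);
    return implode($filler, $quoted);
}

$html = "<p>Hello <em>進&nbsp;撃の巨人</em>!</p>\nrandom code\n<p>Lorem <span>ipsum<span>.</p>";
$regex = '~' . buildTagTolerantPattern('Hello   進撃の巨人!')
       . '.*?'
       . buildTagTolerantPattern('Lorem ipsum.') . '~su';

$found  = preg_match($regex, $html, $m) === 1;
$result = $found ? $m[0] : null;
echo $result, PHP_EOL;
```

Using implode() with the filler as the glue places the optional part only between characters, so nothing is appended after the last character of either phrase.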

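Second solution

I really wanted to write a regex solution, but others got there before me with some nice and complex ones. So here is a non-regex solution.

Short explanation: the main problem is keeping the HTML tags. If the HTML tags were stripped, we could easily search the text. So: strip them! We can then easily search the stripped content and work out the substring we want to cut. Then we try to cut this substring from the HTML while keeping the tags.

Advantages:

The search is easy and independent of the HTML; you can also search with a regex if you need to
The requirements are scalable: you can easily add full multibyte support, support for entities, white-space collapsing, and so on
Relatively fast (a direct regex may possibly be faster)
It does not touch the original HTML and can be adapted to other markup languages

A static utility class for this approach: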

class HtmlExtractUtil
{

    const FAKE_MARKUP = '<>';
    const MARKUP_PATTERN = '#<[^>]+>#u';

    static public function extractBetween($html, $startTextToFind, $endTextToFind)
    {
        $strippedHtml = preg_replace(self::MARKUP_PATTERN, '', $html);
        $startPos = strpos($strippedHtml, $startTextToFind);
        $lastPos = strrpos($strippedHtml, $endTextToFind);

        if ($startPos === false || $lastPos === false) {
            return "";
        }

        $endPos = $lastPos + strlen($endTextToFind);
        if ($endPos <= $startPos) {
            return "";
        }

        return self::extractSubstring($html, $startPos, $endPos);
    }

    static public function extractSubstring($html, $startPos, $endPos)
    {
        preg_match_all(self::MARKUP_PATTERN, $html, $matches, PREG_OFFSET_CAPTURE);
        $start = -1;
        $end = -1;
        $previousEnd = 0;
        $stripPos = 0;
        $matchArray = $matches[0];
        $matchArray[] = [self::FAKE_MARKUP, strlen($html)];
        foreach ($matchArray as $match) {
            $diff = $previousEnd - $stripPos;
            $textLength = $match[1] - $previousEnd;
            if ($start == (-1)) {
                if ($startPos >= $stripPos && $startPos < $stripPos + $textLength) {
                    $start = $startPos + $diff;
                }
            }
            if ($end == (-1)) {
                if ($endPos > $stripPos && $endPos <= $stripPos + $textLength) {
                    $end = $endPos + $diff;
                    break;
                }
            }
            $tagLength = strlen($match[0]);
            $previousEnd = $match[1] + $tagLength;
            $stripPos += $textLength;
        }

        if ($start == (-1)) {
            return "";
        } elseif ($end == (-1)) {
            return substr($html, $start);
        } else {
            return substr($html, $start, $end - $start);
        }
    }

}

Usage:

$html = '
<html>
<body>
<p>Any string before</p>
<p>Hello <em>進撃の巨人</em>!</p>
random code
random code
<p>Lorem <span>ipsum<span>.</p>
<p>Any string after</p>
</body>
</html>
';
$startTextToFind = 'Hello 進撃の巨人!';
$endTextToFind = 'Lorem ipsum.';

$extractedText = HtmlExtractUtil::extractBetween($html, $startTextToFind, $endTextToFind);

header("Content-type: text/plain; charset=utf-8");
echo $extractedText . "\n";

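Third solution

Regular expressions have their limitations when it comes to parsing HTML. Like many have done before me, I will refer to this famous answer.

Potential problems when relying on regular expressions

For instance, imagine this tag appears in the HTML before the part that must be extracted: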

<p attr="Hello 進撃の巨人!">This comes before the match</p>

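Many regex solutions will stumble over this and return a string that starts in the middle of this opening p tag.

Or consider a comment inside the HTML section that has to be matched: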

<!-- Next paragraph will display "Lorem ipsum." -->

Or some loose less-than and greater-than signs appear (say, in a comment or an attribute value):

<!-- Next paragraph will display >-> << Lorem ipsum. >> -->
<p data-attr="->->->" class="myclass">

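What will those regexes do with that?

These are just examples... there are countless other situations that pose problems for regex-based solutions.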
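
Two of the pitfalls described above can be reproduced in a few lines. This is a minimal sketch: the sample strings are made up for illustration, and the tag pattern is the simple <[^>]+> form that several of the regex-based answers rely on.

```php
<?php
// Case 1: the phrase also occurs inside an attribute value, so a plain
// search reports the attribute first instead of the visible text.
$html     = '<p attr="Lorem ipsum.">before</p><p>Lorem <span>ipsum</span>.</p>';
$firstHit = strpos($html, 'Lorem ipsum.');

// Case 2: a ">" inside an attribute value ends the "tag" too early,
// so part of the attribute leaks into what is treated as text.
$tagPattern = '#<[^>]+>#u';
$snippet    = '<p data-attr="->->->" class="myclass">Lorem ipsum.</p>';
$text       = preg_replace($tagPattern, '', $snippet);

var_dump($firstHit, $text);
```

In the second case the stripped "text" still contains attribute residue, so any offsets computed from it no longer line up with the visible content.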

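There are more reliable ways to parse HTML.

Load the HTML into a DOM

I will propose here a solution based on the DOMDocument interface, using this algorithm:

Get the text content of the HTML document and identify the two offsets where the two substrings (begin/end) are located.
Then go through the DOM text nodes, keeping track of the offsets these nodes fit in. In the nodes where either of the two bounding offsets is crossed, a predefined delimiter (|) is inserted. That delimiter should not be present in the HTML string, so it is doubled (||, ||||, ...) until that condition is met.
Finally, split the HTML representation by this delimiter and take the middle part as the result.

Here is the code: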

function extractBetween($html, $begin, $end) {
    $dom = new DOMDocument();
    // Load HTML in DOM, making sure it supports UTF-8; double HTML tags are no problem
    $dom->loadHTML('<html><head>
            <meta http-equiv="content-type" content="text/html; charset=utf-8">
        </head></html>' . $html);
    // Get complete text content
    $text = $dom->textContent;
    // Get positions of the beginning/ending text; exit if not found.
    if (($from = strpos($text, $begin)) === false) return false;
    if (($to = strpos($text, $end, $from + strlen($begin))) === false) return false;
    $to += strlen($end);
    // Define a non-occurring delimiter by repeating `|` enough times:
    for ($delim = '|'; strpos($html, $delim) !== false; $delim .= $delim);
    // Use XPath to traverse the DOM
    $xpath = new DOMXPath($dom);
    // Go through the text nodes keeping track of total text length.
    // When exceeding one of the two offsets, inject a delimiter at that position.
    $pos = 0;
    foreach($xpath->evaluate("//text()") as $node) {
        // Add length of node's text content to total length
        $newpos = $pos + strlen($node->nodeValue);
        while ($newpos > $from || ($from === $to && $newpos === $from)) {
            // The beginning/ending text starts/ends somewhere in this text node.
            // Inject the delimiter at that position:
            $node->nodeValue = substr_replace($node->nodeValue, $delim, $from - $pos, 0);
            // If a delimiter was inserted at both beginning and ending texts,
            // then get the HTML and return the part between the delimiters
            if ($from === $to) return explode($delim, $dom->saveHTML())[1];
            // Delimiter was inserted at beginning text. Now search for ending text
            $from = $to;
        }
        $pos = $newpos;
    }
}

You would call it like this:

// Sample input data
$html = '
        <html>
        <body>
        <p>This comes before the match</p>
        <p>Hey! Hello <em>進撃の巨人</em>!</p>
        random code
        random code
        <p>Lorem <span>ipsum<span>. la la la</p>
        <p>This comes after the match</p>
        </body>
        </html>
        ';

$begin = 'Hello 進撃の巨人!';
$end = 'Lorem ipsum.';

// Call
$html = extractBetween($html, $begin, $end);

// Output result
echo $html;

Output:

Hello <em>進撃の巨人</em>!</p>
        random code
        random code
        <p>Lorem <span>ipsum<span>.

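You will probably find this code easier to maintain than the regex alternatives.

See it run on eval.in.

Fourth solution

This is probably far from the best solution, but I love cracking my head on these "riddles", so here is my approach.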

<?php
$subject = ' <html> 
<body> 
<p>He<i>l</i>lo <em>Lydia</em>!</p> 
random code 
random code 
<p>Lorem <span>ipsum</span>.</p> 
</body> 
</html>';

$begin = 'Hello Lydia!';
$end = 'Lorem ipsum.';

$begin_chars = str_split($begin);
$end_chars = str_split($end);

$begin_re = '';
$end_re = '';

foreach ($begin_chars as $c) {
    if ($c == ' ') {
        $begin_re .= '(\s|(<[a-z/]+>))+';
    }
    else {
        $begin_re .= $c . '(<[a-z/]+>)?';
    }
}
foreach ($end_chars as $c) {
    if ($c == ' ') {
        $end_re .= '(\s|(<[a-z/]+>))+';
    }
    else {
        $end_re .= $c . '(<[a-z/]+>)?';
    }
}

$re = '~(.*)((' . $begin_re . ')(.*)(' . $end_re . '))(.*)~ms';

$result = preg_match( $re, $subject , $matches );
$start_tag = preg_match( '~(<[a-z/]+>)$~', $matches[1] , $stmatches );

echo $stmatches[1] . $matches[2];

Output:

<p>He<i>l</i>lo <em>Lydia</em>!</p> 
random code 
random code 
<p>Lorem <span>ipsum</span>.</p>

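This matches this case, but I think it would need more logic to escape regex special characters such as the period.

In general, what this snippet does:

Splits the string into an array, each array value representing a single character. This needs to be done because Hello also needs to match Hel<i>l</i>o.
To do so, for the regex part an additional (<[a-z/]+>)? is inserted after each character, with a special case for the space character.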
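
The missing escaping mentioned above could be handled with preg_quote(). The following is a sketch rather than the answer's own code: phraseToPattern() is a hypothetical helper, and because it relies on str_split() it only handles single-byte characters.

```php
<?php
// Hypothetical variant of the loops above that escapes every character
// with preg_quote(), so that "." and "!" are matched literally.
// Note: str_split() works on bytes, so this sketch assumes ASCII phrases.
function phraseToPattern(string $phrase): string
{
    $re = '';
    foreach (str_split($phrase) as $c) {
        $re .= ($c === ' ')
            ? '(?:\s|<[a-z/]+>)+'                     // a space may be whitespace and/or tags
            : preg_quote($c, '~') . '(?:<[a-z/]+>)?'; // optional tag after each character
    }
    return $re;
}

$subject = '<p>He<i>l</i>lo <em>Lydia</em>!</p> random code <p>Lorem <span>ipsum</span>.</p>';
$re = '~' . phraseToPattern('Hello Lydia!') . '.*?' . phraseToPattern('Lorem ipsum.') . '~s';
preg_match($re, $subject, $m);
echo $m[0], PHP_EOL;

// With escaping in place, the period no longer matches an arbitrary character.
$loose = preg_match('~' . phraseToPattern('ipsum.') . '~', 'ipsumX');
```

Unlike the original snippet, this sketch does not try to recover the tag that opens just before the first sentence.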

Fifth solution

You can try this RegEx:

(.*?)  # Data before sentences (to be removed)
(      # Capture Both sentences and text in between
  H.*?e.*?l.*?l.*?o.*?\s    # Hello[space]
  (<.*?>)*                  # Optional Opening Tag(s)
  進.*?撃.*?の.*?巨.*?人.*?   # 進撃の巨人
  (</.*?>)*                # Optional Closing Tag(s)
  (.*?)                     # Optional Data in between sentences
  (<.*?>)*                  # Optional Opening Tag(s)
  L.*?o.*?r.*?e.*?m.*?\s    # Lorem[space]
  (<.*?>)*                  # Optional Opening Tag(s)
  i.*?p.*?s.*?u.*?m.*?      # ipsum
)
(.*)   # Data after sentences (to be removed)

Replace with the 2nd capture group.

Live Demo on Regex101

The regex can be shortened to:

(.*?)(H.*?e.*?l.*?l.*?o.*?\s(<.*?>)*進.*?撃.*?の.*?巨.*?人.*?(</.*?>)*(.*?)(<.*?>)*L.*?o.*?r.*?e.*?m.*?\s(<.*?>)*i.*?p.*?s.*?u.*?m.*?)(.*)

Sixth solution

Just for fun

<?php
$begin = 'Hello Moto!';
$end = 'Lorem ipsum.';
//https://regex101.com/r/mC8aO6/1
$re = "/[\w\W]/"; 
$str = $begin.$end; 
$subst = "$0.*?"; 

$result = preg_replace($re, $subst, $str);
//Hello Moto! 
//to
//H.*?e.*?l.*?l.*?o.*? .*?M.*?o.*?t.*?o.*?!.*?

//https://regex101.com/r/fS6zG2/1
$re = "/(\!|\.\.)/"; 
$str = $result; 
$subst = "\\\\$1"; // a literal backslash followed by the matched text

$result = preg_replace($re, $subst, $str);

$re = "/.*(<p.*?$result.*?p>).*/s"; 
$str = "        <html>\n        <body>\n        <p>He<i>l</i>lo <em>Moto</em>!\n        random code\n        random code\n        <p>Lorem <span>ipsum<span>.<p>\n        </body>\n        </html>\n        "; 
$subst = "$1"; 

$result = preg_replace($re, $subst, $str);
echo $result."\n";
?>

Input

$begin = 'Hello Moto!';
$end = 'Lorem ipsum.';

    <html>
    <body>
    <p>He<i>l</i>lo <em>Moto</em>!
    random code
    random code
    <p>Lorem <span>ipsum<span>.<p>
    </body>
    </html>

Output

<p>He<i>l</i>lo <em>Moto</em>!
        random code
        random code
        <p>Lorem <span>ipsum<span>.<p>

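Seventh solution

There are several different approaches to content search on HTML source. They all have advantages and disadvantages. If the structure of unknown code is an issue, the safest way is to use an XML parser, but these are complex and therefore rather slow.

Regular expressions are designed for text processing. Although a regex is not the fastest thing because of its overhead, the preg_ functions are a reasonable compromise for keeping the code small and neat without affecting performance too much, provided the patterns do not get too complex.

Analysis of HTML structure can be achieved with recursive regexes. Since they are slow to process and hard to debug, I prefer to code the basic logic in PHP and use the preg_ functions for small, fast tasks.

Here is a solution in OOP, a small class for handling many searches on the same HTML source. It is already an approach to handling extended, similar problems, such as adding the preceding and following content up to the next tag boundary. It does not claim to be a perfect solution, but it is easily extendable.

The logic is: spend some run time on initialization to store the tag positions relative to the plain text, and to store the strings between <...> together with the running sum of their lengths. Then, on every content search, match the needles against the plain content. Find the start/end positions in the HTML source by binary search.

Binary search works like this: a sorted list is required. You store the index of the first element and the index of the last element + 1. Calculate the average by addition and integer division by 2; division and floor are performed quickly by a bit shift to the right. If the value found is too low, set the lesser index variable to the current index, otherwise the greater one. Stop when the index difference is 1. If you are searching for an exact value, break early when the element is found. 0,(14+1) => 7; 7,15 => 11; 7,11 => 9; 7,9 => 8; 8-7 = diff 1. Instead of 15 iterations, only 4 are done. The greater the start value, the more time is saved exponentially.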
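
The binary search described above can be sketched on its own, outside the class. The function name lastIndexNotGreater() and the $probes trace parameter are assumptions added for illustration; the loop itself follows the description (half-open index range, right shift for the midpoint, stop at an index difference of 1).

```php
<?php
// Returns the index of the last element that is <= $target, assuming
// $sorted is ascending and $sorted[0] <= $target.
// $probes (hypothetical, for illustration) records the midpoints visited.
function lastIndexNotGreater(array $sorted, int $target, array &$probes = []): int
{
    $low  = 0;               // index known to hold a value <= $target
    $high = count($sorted);  // index of the last element + 1
    while ($high - $low > 1) {
        $mid = ($low + $high) >> 1;   // integer average via right shift
        $probes[] = $mid;
        if ($sorted[$mid] <= $target) {
            $low = $mid;     // value is low enough: raise the lower index
        } else {
            $high = $mid;    // value is too high: lower the upper index
        }
    }
    return $low;
}

// 15 elements, as in the example above: probes 7, 11, 9, 8 and then stops.
$probes = [];
$index  = lastIndexNotGreater(range(0, 14), 7, $probes);
echo $index, ' after probing ', implode(', ', $probes), PHP_EOL;
```

The class below applies the same loop to the stored tag positions in order to translate an offset in the plain text into an offset in the HTML source.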

The PHP class:

<?php
class HtmlTextSearch
{
  protected 
    $html            = '',
    $heystack        = '',
    $tags            = [],
    $current_tag_idx = null
  ;

  const
    RESULT_NO_MODIFICATION      = 0,
    RESULT_PREPEND_TAG          = 1,
    RESULT_PREPEND_TAG_CONTENT  = 2,
    RESULT_APPEND_TAG           = 4,
    RESULT_APPEND_TAG_CONTENT   = 8,
    MATCH_CASE_INSENSITIVE      =16,
    MATCH_BLANK_AS_WHITESPACE   =32,
    MATCH_BLANK_MULTIPLE        =64
  ;

  public function __construct($html)
  {
    $this->set_html($html);
  }

  public function set_html($html)
  {
    $this->html = $html;
    $regexp = '~<.*?>~su';
    preg_match_all($regexp, $html, $this->tags, PREG_PATTERN_ORDER | PREG_OFFSET_CAPTURE);
    $this->tags = $this->tags[0];
    # we use exact the same algorithm to strip html
    $this->heystack = preg_replace($regexp, '', $html);

    # convert positions to plain content
    $sum_length = 0;
    foreach($this->tags as &$tag)
    { $tag['pos_in_content'] = $tag[1] - $sum_length;
      $tag['sum_length'    ] = $sum_length += strlen($tag[0]);
    }

    # zero length dummy tags to mark start/end position of strings not beginning/ending with a tag
    array_unshift($this->tags , [0 => '', 1 => 0, 'pos_in_content' => 0, 'sum_length' => 0 ]); 
    array_push   ($this->tags , [0 => '', 1 => strlen($html)-1]); 
  }

  public function translate_pos_plain2html($content_position)
  {
    # binary search
    $idx = [true => 0, false => count($this->tags)-1];
    while(1 < $idx[false] - $idx[true])
    { $i = ($idx[true] + $idx[false]) >>1;                               // integer half of both array indexes
      $idx[$this->tags[$i]['pos_in_content'] <= $content_position] = $i; // hold one index less and the other greater
    }

    $this->current_tag_idx = $idx[true];
    return $this->tags[$this->current_tag_idx]['sum_length'] + $content_position;
  }

  public function &find_content($needle_start, $needle_end = '', $result_modifiers = self::RESULT_NO_MODIFICATION)
  {
    $needle_start = preg_quote($needle_start, '~');
    $needle_end   = '' == $needle_end ? '' : preg_quote($needle_end  , '~');
    if((self::MATCH_BLANK_MULTIPLE | self::MATCH_BLANK_AS_WHITESPACE) & $result_modifiers)
    { 
      $replacement  = self::MATCH_BLANK_AS_WHITESPACE & $result_modifiers ? '\s' : ' ';
      if(self::MATCH_BLANK_MULTIPLE & $result_modifiers)
      { $replacement .= '+';
        $multiplier = '+';
      }
      else
        $multiplier = '';
      $repl_pattern = "~ $multiplier~";
      $needle_start = preg_replace($repl_pattern, $replacement, $needle_start);
      $needle_end   = preg_replace($repl_pattern, $replacement, $needle_end);
    }

    $icase = self::MATCH_CASE_INSENSITIVE & $result_modifiers ? 'i' : '';
    $search_pattern = "~{$needle_start}.*?{$needle_end}~su$icase";
    preg_match_all($search_pattern, $this->heystack, $matches, PREG_PATTERN_ORDER | PREG_OFFSET_CAPTURE);

    foreach($matches[0] as &$match)
    { $pre = $post = '';

      $pos_start = $this->translate_pos_plain2html($match[1]);
      if(self::RESULT_PREPEND_TAG_CONTENT & $result_modifiers)
        $pos_start = $this->tags[$this->current_tag_idx][1]
          +( self::RESULT_PREPEND_TAG & $result_modifiers ? 0 : strlen ($this->tags[$this->current_tag_idx][0]) );
      elseif(self::RESULT_PREPEND_TAG     & $result_modifiers)
        $pre = $this->tags[$this->current_tag_idx][0];

      $pos_end   = $this->translate_pos_plain2html($match[1] + strlen($match[0]));
      if(self::RESULT_APPEND_TAG_CONTENT & $result_modifiers)
      { $next_tag = $this->tags[$this->current_tag_idx+1];
        $pos_end = $next_tag[1]
          +( self::RESULT_APPEND_TAG  & $result_modifiers ? strlen ($next_tag[0]) : 0);
      }
      elseif(self::RESULT_APPEND_TAG     & $result_modifiers)
        $post = $this->tags[$this->current_tag_idx+1][0];

      $match = $pre . substr($this->html, $pos_start, $pos_end - $pos_start) . $post;
    };
    return $matches[0];
  }
}

Some test cases:

$html_source = get($_POST['html'], <<< ___
<html>
  <body>
    <p>He said: "Hello <em>進撃の巨人</em>!"</p>
    random code
    random code
    <p>Lorem <span>ipsum</span>. foo bar</p>
  </body>
</html>
___
);


  function get(&$ref, $default=null) { return isset($ref) ? $ref : $default; }

  function attr_checked($name, $method = "post")
  { $req = ['post' => '_POST', 'get' => '_GET'];
    return isset($GLOBALS[$req[$method]][$name]) ? ' checked="checked"' : '';
  }

  $begin = get($_POST['begin'], '"Hello 進撃の巨人!"');
  $end   = get($_POST['end'  ], 'Lorem ipsum.'   );
?>

<form action="" method="post">
  <textarea name="html" cols="80" rows="10"><?php
echo htmlspecialchars($html_source);
?></textarea>

  <br><input type="text"  name="begin" value="<?php echo htmlspecialchars($begin);?>">
  <br><input type="text"  name="end"   value="<?php echo htmlspecialchars($end)  ;?>">

  <br><input type="checkbox" name="tag-pre" id="tag-pre"<?php echo attr_checked('tag-pre');?>>
      <label for="tag-pre">prefix tag</label>
      <br><input type="checkbox" name="txt-pre" id="txt-pre"<?php echo attr_checked('txt-pre');?>>
      <label for="txt-pre">prefix content</label>
  <br><input type="checkbox" name="txt-suf" id="txt-suf"<?php echo attr_checked('txt-suf');?>>
      <label for="txt-suf">suffix content</label>
  <br><input type="checkbox" name="tag-suf" id="tag-suf"<?php echo attr_checked('tag-suf');?>>
      <label for="tag-suf">suffix tag</label>
  <br>
  <br><input type="checkbox" name="wspace" id="wspace"<?php echo attr_checked('wspace');?>>
      <label for="wspace">blank (#32) matches any whitespace character</label>
  <br><input type="checkbox" name="multiple" id="multiple"<?php echo attr_checked('multiple');?>>
      <label for="multiple">one or more blanks match any number of blanks/whitespaces</label>
  <br><input type="checkbox" name="icase"    id="icase"<?php echo attr_checked('icase');?>>
      <label for="icase">case insensitive</label>

  <br><button type="submit">submit</button>
</form>

<?php
  $html = new HtmlTextSearch($html_source);

  $opts=
  [ 'tag-pre' => HtmlTextSearch::RESULT_PREPEND_TAG,
    'txt-pre' => HtmlTextSearch::RESULT_PREPEND_TAG_CONTENT,
    'txt-suf' => HtmlTextSearch::RESULT_APPEND_TAG_CONTENT,
    'tag-suf' => HtmlTextSearch::RESULT_APPEND_TAG,
    'wspace'  => HtmlTextSearch::MATCH_BLANK_AS_WHITESPACE,
    'multiple'=> HtmlTextSearch::MATCH_BLANK_MULTIPLE,
    'icase'   => HtmlTextSearch::MATCH_CASE_INSENSITIVE
  ];
  $options = 0;
  foreach($opts as $k => $v)
    if(isset($_POST[$k]))
      $options |= $v;
  $results = $html->find_content($begin, $end, $options);
  var_dump($results);
?>

Eighth solution

How about this?

$escape=array('\\'=>1,'^'=>1,'?'=>1,'+'=>1,'*'=>1,'{'=>1,'}'=>1,'('=>1,')'=>1,'['=>1,']'=>1,'|'=>1,'.'=>1,'$'=>1,'+'=>1,'/'=>1);
$pattern='/';
// Before every character (first byte of a UTF-8 sequence), allow any number of tags
for($i=0;isset($begin[$i]);$i++){
if(ord($c=$begin[$i])<0x80||ord($c)>0xbf){
    if(isset($escape[$c]))
        $pattern.="([ \t\r\n\v\f]*<\/?[a-zA-Z]+>[ \t\r\n\v\f]*)*\\$c";
    else
        $pattern.="([ \t\r\n\v\f]*<\/?[a-zA-Z]+>[ \t\r\n\v\f]*)*$c";
    }
    else
        $pattern.=$c; // UTF-8 continuation byte: append as-is
}
$pattern.="(.|\n|\r)*";
for($i=0;isset($end[$i]);$i++){
if(ord($c=$end[$i])<0x80||ord($c)>0xbf){
    if(isset($escape[$c]))
        $pattern.="([ \t\r\n\v\f]*<\/?[a-zA-Z]+>[ \t\r\n\v\f]*)*\\$c";
    else
        $pattern.="([ \t\r\n\v\f]*<\/?[a-zA-Z]+>[ \t\r\n\v\f]*)*$c";
    }
    else
        $pattern.=$c;
}
// Turn the quantifier of the very first tag group from * into ?
// (it sits at index 35 for the prefix used above)
$pattern[35]='?';
$pattern.='(<\/?[a-zA-Z]+>)?/';
preg_match($pattern,$html,$a);
$match=$a[0];

Ninth solution

PHP solution:

PHPFiddle Demo

$html = '
        <html>
        <body>
        <p>Hello <em>進撃の巨人</em>!</p>
        random code
        random code
        <p>Lorem <span>ipsum<span>.</p>
        </body>
        </html>
        ';
$begin = 'Hello 進撃の巨人!';
$end = 'Lorem ipsum.';

$matchHtmlTag = '(?:<.*?>)?';
$matchAllNonGreedy = '(?:.|\r?\n)*?';
$matchUnescapedCharNotAtEnd = '([^\\\\](?!$)|\\\\.(?!$))';
$matchBeginWithTags = preg_replace(
    $matchUnescapedCharNotAtEnd, '$0' . $matchHtmlTag, preg_quote($begin));
$matchEndWithTags = preg_replace(
    $matchUnescapedCharNotAtEnd, '$0' . $matchHtmlTag, preg_quote($end));
$pattern = '/' . $matchBeginWithTags . $matchAllNonGreedy . $matchEndWithTags . '/';

preg_match($pattern, $html, $matches);
$html = $matches[0];

The generated regex ($pattern):

Regex101 Demo

H(?:<.*?>)?e(?:<.*?>)?l(?:<.*?>)?l(?:<.*?>)?o(?:<.*?>)? (?:<.*?>)?進(?:<.*?>)?撃(?:<.*?>)?の(?:<.*?>)?巨(?:<.*?>)?人(?:<.*?>)?\!(?:.|\r?\n)*?L(?:<.*?>)?o(?:<.*?>)?r(?:<.*?>)?e(?:<.*?>)?m(?:<.*?>)? (?:<.*?>)?i(?:<.*?>)?p(?:<.*?>)?s(?:<.*?>)?u(?:<.*?>)?m(?:<.*?>)?\.

Tenth solution

Assuming the random code in your example is inside <p></p>, I suggest using DOMDocument and XPath rather than a regex for what you are trying to do.

$html = '
        <html>
        <body>
        <div>nada blahhh <p>test paragraph</p> <em>blahh</em></div>
        <p>test</p>
        <span>this is test</span>
        <p>Hello <em>進撃の巨人</em>!</p>
        <p>random code</p>
        <p>random code</p>
        <p>Lorem <span>ipsum<span>.</p>
        <div>nada blahhh <p>test paragraph</p> <em>blahh</em></div>
        <p>test</p>
        <span>this is test</span>
        </body>
        </html>
        ';
$begin = 'Hello 進撃の巨人!';
$begin = iconv ( 'iso-8859-1','utf-8' , $begin ); // had to use iconv it won't be needed in your case
$end = 'Lorem ipsum.';       
$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXpath($doc);
// example 3: same as above with wildcard
$elements = $xpath->query("*/p");

if (!is_null($elements)) {
    $flag = 'no_output';
  foreach ($elements as $element) {
      if($flag=='prepare_for_output'){$flag='output';}
      if($element->nodeValue==$begin){
      $flag='prepare_for_output';
      }
      if($element->nodeValue==$end){
      $flag='no_output';
      }
      if($flag=='output') {
      echo $element->nodeValue."\n";
      }
  }
}

http://sandbox.onlinephpfunctions.com/code/fa1095d98c6ef5c600f7b06366b4e0c4798a112f

References

Search HTML for 2 phrases (ignoring all tags) and strip everything else
