xpath匹配 · PHP筆記

[PHP: DOMXPath - Manual](https://www.php.net/manual/zh/class.domxpath.php) ``` <?php // 1. 獲取頁面HTML內容 $url = 'https://example.com'; // 目標頁面地址 $html = file_get_contents($url); // 簡單獲取方式（適用于不需要鑒權的頁面） // 或使用cURL獲取（推薦）： /* $ch = curl_init($url); curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0'); $html = curl_exec($ch); curl_close($ch); */ // 2. 創建DOM解析對象 $dom = new DOMDocument(); libxml_use_internal_errors(true); // 禁用錯誤報告（處理不規范的HTML） $dom->loadHTML($html); libxml_clear_errors(); // 3. 創建XPath解析器 $xpath = new DOMXPath($dom); // 4. 定義XPath表達式（示例） $targetXpath = '//h1[@class="title"]'; // 提取class為title的h1標簽 // 5. 執行XPath查詢 $nodes = $xpath->query($targetXpath); // 6. 處理查詢結果 $result = []; if ($nodes->length > 0) { foreach ($nodes as $node) { // 提取文本內容 $result[] = trim($node->nodeValue); // 如果需要提取屬性（示例）： // $class = $node->getAttribute('class'); // $result[] = ['text' => trim($node->nodeValue), 'class' => $class]; } } else { // 沒有匹配內容時的處理 $result = ['error' => '未找到匹配內容']; } // 7. 輸出結果 print_r($result); // 8. 清理內存 unset($dom, $xpath); ?> ``` **獲取HTML的三種方式**： * `file_get_contents()`：簡單快速，適合靜態頁面 * cURL擴展：推薦方式，支持： ``` curl_setopt($ch, CURLOPT_PROXY, 'IP:PORT'); // 設置代理 curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt'); // 保持會話 curl_setopt($ch, CURLOPT_TIMEOUT, 15); // 超時設置 ``` * GuzzleHTTP客戶端（需要安裝）：適合復雜需求 ``` composer require guzzlehttp/guzzle ``` **XPath編寫技巧** ``` //div[@id="main"]/ul/li/a # 多層結構定位 //meta[@property="og:title"]/@content # 獲取屬性值 //a[contains(@class, "btn")] # 模糊匹配class //*[starts-with(text(), "Hello")] # 文本前綴匹配 ``` **常見問題處理** **XPath匹配不到內容** 1. 驗證XPath有效性： ~~~ // 調試輸出整個DOM結構 echo $dom->saveHTML(); ~~~ 2. 處理動態加載內容： ~~~ // 使用瀏覽器自動化工具 composer require facebook/webdriver ~~~ 3. 處理iframe嵌套： ~~~ // 需要單獨處理iframe內容 $iframeNodes = $xpath->query('//iframe'); foreach ($iframeNodes as $iframe) { $iframeUrl = $iframe->getAttribute('src'); // 單獨處理iframe內容 } ~~~ **中文亂碼** * 解決方案： ~~~ // 轉換編碼（處理GBK頁面） $html = mb_convert_encoding($html, 'HTML-ENTITIES', 'GB2312'); ~~~ **需要處理JavaScript渲染** 解決方案：使用無頭瀏覽器方案： ``` composer require chrome-php/chrome use HeadlessChromium\BrowserFactory; $browserFactory = new BrowserFactory(); $browser = $browserFactory->createBrowser(); $page = $browser->createPage(); $page->navigate($url)->waitForNavigation(); $html = $page->evaluate('document.documentElement.outerHTML')->getReturnValue(); ``` ### 高級功能擴展 1. **結果緩存機制**： ~~~ $cacheFile = md5($url).'.cache'; if (file_exists($cacheFile) && time()-filemtime($cacheFile) < 3600) { return json_decode(file_get_contents($cacheFile), true); } // ...執行采集... file_put_contents($cacheFile, json_encode($result)); ~~~ 2. **自動重試機制**： ~~~ $retry = 3; while ($retry > 0) { try { // 執行采集代碼 break; } catch (Exception $e) { $retry--; sleep(5); } } ~~~ 3. **分布式采集**： ~~~ // 使用Redis隊列 $redis = new Redis(); $redis->connect('127.0.0.1', 6379); while ($url = $redis->rpop('url_queue')) { // 執行采集任務 // 存儲結果到數據庫 } ~~~ ### 性能優化建議 1. **啟用緩存**： * 使用Memcached緩存DOM解析結果 ~~~ $memcached = new Memcached(); $memcached->addServer('localhost', 11211); $cacheKey = 'page_'.md5($url); if ($cached = $memcached->get($cacheKey)) { $dom->loadHTML($cached); } else { // 解析并存儲 $memcached->set($cacheKey, $html, 3600); } ~~~ 2. **并行處理**： ~~~ // 使用多進程 $urls = ['url1', 'url2', 'url3']; $pool = new Pool(3); // 3個線程 foreach ($urls as $url) { $pool->submit(new class($url) extends Thread { public function __construct($url) { $this->url = $url; } public function run() { // 執行采集邏輯 } }); } $pool->shutdown(); ~~~ 3. **DNS預解析**： ~~~ // 在采集前解析所有域名 dns_get_record('example.com', DNS_A); ~~~ 通過以上方法，您可以構建一個健壯的網頁內容采集系統，能夠應對各種復雜的網頁結構和采集需求。