[Python Web Scraping] How to Scrape Blog Posts from a Site Built with WordPress

How to tell whether a site runs on WP, and how to fetch its posts through the API

January 15, 2023, 1:00 PM

Technical Articles

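More than 40% of the websites on the internet are built with WordPress. WordPress (WP) is an open-source, user-friendly content management system (CMS) with a rich ecosystem of plugins and themes. It can power blogs, e-commerce stores, corporate portals, forums, and many other kinds of sites. So if you scrape data, there is a good chance you will run into a WP site. This article explains how to fetch the blog posts of a WordPress site through its API.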

Checking Whether a Site Is Built with WordPress

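Why would you want to know whether a site is built with WordPress? Besides scraping, you may also want to know whether a given layout or site structure can be built with WP, so that you can apply it to your own WP site. There are several ways to check; the more convenient ones are listed here: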

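Append /wp-admin to the site URL. /wp-admin is the default WP administrator login path. Taking the blog 卡門人妻 as an example, its URL is https://wifekaman.com/, and https://wifekaman.com/wp-admin shows the WordPress login screen. The admin URL can be disabled or changed, but most individual users and small-to-medium businesses leave it untouched.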

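View the page source and look for telltale strings such as wp-content. In the browser, right-click the page → View Page Source, then search the source for strings specific to WP, such as wp-content.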

Use an existing service. Paste the URL into a ready-made checker such as isitwp.com.
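The page-source check can also be automated. The following is a minimal sketch, not a definitive detector: the function name `looks_like_wordpress` and the marker list are assumptions made for illustration, and a site that hides or rewrites these paths will not be detected.

```python
# Strings that commonly appear in the HTML of a WordPress site.
# This list is an illustrative assumption, not an exhaustive signature set.
WP_MARKERS = ("wp-content", "wp-includes", "wp-json")


def looks_like_wordpress(html: str) -> bool:
    """Return True if the page source contains any common WordPress marker."""
    lowered = html.lower()
    return any(marker in lowered for marker in WP_MARKERS)


# A made-up snippet of page source, used only to demonstrate the check.
sample = '<link rel="stylesheet" href="https://example.com/wp-content/themes/demo/style.css">'
print(looks_like_wordpress(sample))
print(looks_like_wordpress("<html><body>Hello</body></html>"))
```

To use it on a real site, pass in the page source you downloaded, for example the `text` attribute of a `requests.get(...)` response.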

Fetching Blog Posts Through the WordPress API

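WordPress has built-in REST API support, so if you want to scrape the blog posts of a WP site, there is no need to parse the page's HTML structure, which takes twice the effort for half the result. Just use the following API endpoint: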

`https://example.com/wp-json/wp/v2/posts`

Again taking the 卡門人妻 blog as an example, opening https://wifekaman.com/wp-json/wp/v2/posts/ returns the site's posts as JSON data.

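By default each request returns 10 items (adjustable with the per_page parameter, up to a maximum of 100). The x-wp-total and x-wp-totalpages response headers report the site's total number of posts and total number of pages.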
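As a small illustration of reading those two headers, here is a sketch that takes any header mapping and returns the totals as integers. The helper name `read_totals` is hypothetical, and the sample values are the figures this article reports for the example blog at the default page size.

```python
from typing import Mapping, Tuple


def read_totals(headers: Mapping[str, str]) -> Tuple[int, int]:
    """Return (total_posts, total_pages) from WP REST API response headers."""
    # Header names are case-insensitive, so normalise them before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    return int(lowered["x-wp-total"]), int(lowered["x-wp-totalpages"])


# Sample headers mirroring the figures quoted in this article.
example_headers = {"X-WP-Total": "420", "X-WP-TotalPages": "42"}
total_posts, total_pages = read_totals(example_headers)
print(f"{total_posts} posts across {total_pages} pages")
```

With `requests`, the same function can be called as `read_totals(resp.headers)`.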

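In this example the blog has 420 posts in total across 42 pages. The default request returns the first page of 10 posts; add the page parameter to the URL to get the rest, for example: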

https://wifekaman.com/wp-json/wp/v2/posts/?page=2  # posts 11 - 20
https://wifekaman.com/wp-json/wp/v2/posts/?page=42  # posts 411 - 420
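The relationship between the total post count, per_page, and the number of pages is a ceiling division. The sketch below works it out and builds one URL per page; the helper names `page_count` and `page_urls` are made up for this illustration.

```python
import math
from urllib.parse import urlencode


def page_count(total_posts: int, per_page: int = 10) -> int:
    """Number of pages needed to list total_posts at per_page posts per page."""
    return math.ceil(total_posts / per_page)


def page_urls(site: str, total_posts: int, per_page: int = 10) -> list:
    """Build one posts-endpoint URL for every page of results."""
    base = f"{site.rstrip('/')}/wp-json/wp/v2/posts"
    urls = []
    for page in range(1, page_count(total_posts, per_page) + 1):
        query = urlencode({"per_page": per_page, "page": page})
        urls.append(f"{base}?{query}")
    return urls


print(page_count(420))       # 42 pages at the default of 10 per page
print(page_count(420, 100))  # 5 pages at the maximum of 100 per page
print(page_urls("https://wifekaman.com/", 420, per_page=100)[-1])
```

Requesting 100 posts per page instead of 10 cuts the number of requests for this blog from 42 to 5.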

Complete Example Code

import requests

# No trailing slash here, so the URLs built below do not contain "//"
site = "https://wifekaman.com"
per_page = 100  # return 100 posts per request (the maximum the WP API allows)
url = f"{site}/wp-json/wp/v2/posts?per_page={per_page}"

# Attach a user-agent header so the request looks like it comes from a browser
headers = {
    "user-agent":
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}

# The first request returns page 1 and, in its headers, the overall totals
resp = requests.get(url, headers=headers)
total_posts = int(resp.headers["x-wp-total"])
total_pages = int(resp.headers["x-wp-totalpages"])
print(f"{total_posts} posts with {total_pages} pages in total")

posts = []
print("Starting retrieval")
for page in range(1, total_pages + 1):
    print(f"Retrieving page {page}...")
    if page > 1:  # page 1 was already fetched above; request the later pages
        url = f"{site}/wp-json/wp/v2/posts?per_page={per_page}&page={page}"
        resp = requests.get(url, headers=headers)
    # Keep only the fields we care about from each post
    for item in resp.json():
        p = {
            "id": item["id"],
            "link": item["link"],
            "date": item["date"],
            "title": item["title"]["rendered"],
            "content": item["content"]["rendered"]
        }
        posts.append(p)

print(f"Retrieval finished. {len(posts)} posts retrieved")
first_post = posts[0]
last_post = posts[-1]
print(f"The first: {first_post['date']} - {first_post['title']}")
print(f"The last: {last_post['date']} - {last_post['title']}")

Output

420 posts with 5 pages in total
Starting retrieval
Retrieving page 1...
Retrieving page 2...
Retrieving page 3...
Retrieving page 4...
Retrieving page 5...
Retrieval finished. 420 posts retrieved
The first: 2022-02-15T09:07:05 - 【首爾必去景點】仁寺洞＋益善洞｜傳統韓食｜文青咖啡廳｜親子飯店｜Insadong＋Ikseondong
The last: 2008-09-06T04:48:00 - 星砂

Exceptions

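The built-in WP API endpoint can be disabled by the site owner. Taking 閱讀前哨站 as an example, requesting the endpoint returns a response indicating that the API is unavailable instead of a list of posts.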
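A scraper can guard against this case before trying to parse posts. The sketch below is one possible check under stated assumptions: it treats the API as usable only when the status code is 200 and the x-wp-total header is present. The function name `wp_posts_api_available` is hypothetical, and the exact response a site sends when its API is disabled varies, so treat the sample status codes as illustrations only.

```python
from typing import Mapping


def wp_posts_api_available(status_code: int, headers: Mapping[str, str]) -> bool:
    """Return True if a response looks like a normal WP posts listing."""
    # Header names are case-insensitive, so compare them in lowercase.
    header_names = {name.lower() for name in headers}
    return status_code == 200 and "x-wp-total" in header_names


# Illustrative inputs: a normal listing, then two made-up failure cases.
print(wp_posts_api_available(200, {"X-WP-Total": "420", "X-WP-TotalPages": "5"}))
print(wp_posts_api_available(404, {"Content-Type": "text/html"}))
print(wp_posts_api_available(200, {"Content-Type": "text/html"}))
```

With `requests`, call it as `wp_posts_api_available(resp.status_code, resp.headers)` and skip the site when it returns False.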

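The API endpoint above exists only on self-hosted sites, where WP is installed on the owner's own server. If the site is built on WordPress.com, WordPress.com provides a separate REST API. For example, the posts of bestversionofthyself.com can be retrieved through the following endpoint: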

`# format: https://public-api.wordpress.com/rest/v1.1/sites/$site/posts/`
`https://public-api.wordpress.com/rest/v1.1/sites/bestversionofthyself.com/posts/`
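Here is a rough sketch of working with that endpoint. The URL format comes from this article; the response field names used below (`posts`, `ID`, `URL`, `date`, `title`) are my understanding of the WordPress.com v1.1 response and should be checked against a live response before relying on them. The helper names and the sample payload are made up for illustration.

```python
from urllib.parse import quote

API_ROOT = "https://public-api.wordpress.com/rest/v1.1/sites"


def wpcom_posts_url(site_domain: str) -> str:
    """Build the WordPress.com posts endpoint for a site domain."""
    return f"{API_ROOT}/{quote(site_domain, safe='')}/posts/"


def extract_posts(payload: dict) -> list:
    """Pick a few fields from each post; the field names are assumptions."""
    return [
        {
            "id": item.get("ID"),
            "link": item.get("URL"),
            "date": item.get("date"),
            "title": item.get("title"),
        }
        for item in payload.get("posts", [])
    ]


# Made-up payload in the assumed shape, used instead of a live request.
sample_payload = {
    "found": 1,
    "posts": [
        {
            "ID": 1,
            "URL": "https://example.com/hello",
            "date": "2023-01-15T13:00:00+00:00",
            "title": "Hello",
        }
    ],
}

print(wpcom_posts_url("bestversionofthyself.com"))
print(extract_posts(sample_payload))
```

To fetch real data, request `wpcom_posts_url(...)` with `requests.get` and pass `resp.json()` to `extract_posts`.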

