Scraping in PHP

Puppeteer

Install Puppeteer:

npm install puppeteer --global

About libraries, see Mohamed Said on Twitter: “Is there any neat PHP package that drives headless chrome? Similar do Dusk but not designed for only testing.”

Without JavaScript processing

The WordPress way: native PHP without JS execution

The WordPress HTTP API contains functions such as wp_remote_get(), which is essentially a wrapper around cURL.

Note that wp_remote_get() accepts a 'blocking' request argument (see this comment). Setting it to false lets you “trigger-and-forget” a request and continue with PHP execution without waiting for the response.

A nice usage example is wptt-webfont-loader.php.

Various HTTP clients w/o ability to scrape JavaScript

guzzle/guzzle is one HTTP client among many equally excellent ones; it just happens to be one of the most mature and most downloaded.
FriendsOfPHP/Goutte is an HTTP client made for scraping. It has exactly the same API as Symfony Panther but is much faster because it does not execute JavaScript.
cubiclesoft/ultimate-web-scraper
KnpLabs/snappy is a wrapper for wkhtmltopdf/wkhtmltoimage.

With JavaScript execution

Articles:

PHP Web Scraping: What to know before you start with Symfony Panther, Goutte, and more

Web Scraping with PHP
spatie/crawler together with spatie/browsershot is a powerful crawler that can execute JavaScript through Puppeteer, but it does not support the complete Puppeteer API.

Example usage:
How to write decent crawlers with php | thePHP Website
Easily convert webpages to images using PHP - Freek Van der Herten’s blog on PHP, Laravel and JavaScript
http://www.lib4dev.in/info/spatie/crawler/45406338

Symfony Panther. How to use: Scraping javascript websites using PHP Panther Library - Z.Rashwani Blog / zrashwani/arachnid

Laravel Dusk with katsana/dusk-crawler

Other resources:

Symfony Panther is a standalone library that provides the same APIs as Goutte.

Requirements: check with phpinfo() that the proc_open function is not disabled in PHP. If it is, disable any custom php.ini settings (for example those set in ISPConfig) that block it.

This will also install chromium-browser as a snap:

apt install chromium-chromedriver

apt install libnss3 \
    libicu-dev \
    libzip-dev \
    wget \
    gnupg2 \
    libasound2 \
    fonts-liberation \
    libappindicator3-1 \
    xdg-utils \
    lsb-release \
    libxss1

Kill whatever is still listening on port 9515 (ChromeDriver's default port):

kill -9 $(lsof -t -i:9515)

rialto-php/puphpeteer is a Puppeteer bridge for PHP, supporting the entire API. The most significant difference, in my opinion, is that every method call and getter/setter in PuPHPeteer is synchronous.

Usage resources:

For this I needed: apt install libxss1

Wow: here are a few workarounds or tools that could help keep your headless-browser-based scrapers from getting banned:
Puppeteer Extra – Puppeteer Stealth Plugin
Faking Geolocation
https://www.scrapehero.com/how-to-take-screenshots-of-a-web-page-using-puppeteer/

Blocking images: https://www.scrapehero.com/how-to-increase-web-scraping-speed-using-puppeteer/
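
As a minimal sketch of the image-blocking idea, Puppeteer's request interception can abort requests by resource type before they are downloaded. The names shouldBlock and fetchWithoutImages and the exact set of blocked types are assumptions made for illustration; they are not taken from the linked article.

```javascript
// Resource types to skip. This particular set is an assumption for illustration.
const BLOCKED_TYPES = new Set(['image', 'media', 'font']);

// Pure helper: should a request of this resource type be aborted?
function shouldBlock(resourceType) {
  return BLOCKED_TYPES.has(resourceType);
}

// Load a page while aborting blocked resource types, then return its HTML.
async function fetchWithoutImages(url) {
  // Required lazily so the helper above works without Puppeteer installed.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Interception must be enabled before 'request' handlers may abort/continue.
    await page.setRequestInterception(true);
    page.on('request', (request) => {
      if (shouldBlock(request.resourceType())) {
        request.abort();
      } else {
        request.continue();
      }
    });
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    return await page.content();
  } finally {
    await browser.close();
  }
}
```

Calling fetchWithoutImages with a URL of your choice should resolve to the page HTML. How much time this saves depends on how image-heavy the target site is.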

SOCKS5 proxy: https://pocketadmin.tech/en/puppeteer-use-proxy/ and https://docs.browserless.io/docs/using-a-proxy.html. Note, however, that SOCKS does not work with authentication.

https://dev.to/sonyarianto/practical-puppeteer-using-proxy-to-browse-a-page-1m82
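
A hedged sketch of the proxy setup described in these links: Chromium accepts a --proxy-server flag, which Puppeteer passes through via the args launch option. The helper names and the proxy address used below are placeholders, and the sketch assumes a proxy that needs no authentication.

```javascript
// Build Puppeteer launch options that route browser traffic through a proxy.
// proxyUrl is a placeholder such as 'socks5://127.0.0.1:1080'.
function buildLaunchOptions(proxyUrl) {
  return { args: [`--proxy-server=${proxyUrl}`] };
}

// Open a URL through the given proxy and return the page HTML.
async function fetchViaProxy(url, proxyUrl) {
  // Required lazily so buildLaunchOptions works without Puppeteer installed.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch(buildLaunchOptions(proxyUrl));
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    return await page.content();
  } finally {
    await browser.close();
  }
}
```

For example, fetchViaProxy could be called with a target URL and 'socks5://127.0.0.1:1080', assuming a SOCKS5 proxy is actually listening at that address.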

To retrieve a meta tag from the page, try:

await page.$eval('meta[http-equiv=refresh]', a => a.content)

To retrieve the HTML content, use:

const html = await page.content();

and then open the URL.
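
The snippets above can be combined into one flow: read the meta refresh tag, extract its target, and open that URL. The following is a sketch under assumptions. parseMetaRefresh and followMetaRefresh are hypothetical helper names, and the parser only handles the common "delay; url=target" form. A browser may also follow a short refresh on its own before the tag is read.

```javascript
// Parse the content attribute of a meta refresh tag, e.g. "5; url=https://example.com/".
// Returns { delay, url }, with url null when only a delay is given, or null if unparsable.
function parseMetaRefresh(content) {
  const match = /^\s*(\d+)\s*(?:[;,]\s*(?:url\s*=\s*)?(['"]?)(.*?)\2)?\s*$/i.exec(content);
  if (!match) {
    return null;
  }
  return { delay: Number(match[1]), url: match[3] || null };
}

// Open startUrl, follow a meta refresh target if one is present, return the final HTML.
async function followMetaRefresh(startUrl) {
  // Required lazily so parseMetaRefresh works without Puppeteer installed.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(startUrl, { waitUntil: 'domcontentloaded' });
    // $$eval returns an empty array instead of throwing when no tag matches.
    const contents = await page.$$eval(
      'meta[http-equiv=refresh]',
      (elements) => elements.map((element) => element.content)
    );
    const parsed = contents.length > 0 ? parseMetaRefresh(contents[0]) : null;
    if (parsed && parsed.url) {
      // Resolve relative targets against the current page URL.
      const target = new URL(parsed.url, page.url()).href;
      await page.goto(target, { waitUntil: 'domcontentloaded' });
    }
    return await page.content();
  } finally {
    await browser.close();
  }
}
```

Keeping the parsing in a separate pure function makes it possible to check the string handling without launching a browser.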

date 01. Jan 0001 | modified 13. Feb 2025
