Scraping in PHP
Puppeteer
Install Puppeteer:
npm install puppeteer --global
About libraries: Mohamed Said on Twitter: “Is there any neat PHP package that drives headless chrome? Similar do Dusk but not designed for only testing.” / Twitter
Without JavaScript processing
The WordPress way: native PHP without JS execution
The WordPress HTTP API contains functions like wp_remote_get(), which is essentially a wrapper for cURL.
Note that wp_remote_get() accepts a blocking request argument (see this comment). Setting it to false lets you “trigger-and-forget” a request and continue with PHP execution.
A nice usage example is wptt-webfont-loader.php.
Various HTTP clients without the ability to execute JavaScript
- guzzle/guzzle is one HTTP client, but many others are equally excellent - it just happens to be one of the most mature and most downloaded.
- FriendsOfPHP/Goutte is an HTTP client made for scraping. It has exactly the same API as Symfony Panther but is much faster because it does not execute JavaScript.
- cubiclesoft/ultimate-web-scraper
- KnpLabs/snappy is a wrapper for wkhtmltopdf/wkhtmltoimage
With JavaScript execution
Articles:
PHP Web Scraping: What to know before you start with Symfony Panther, Goutte, and more
- spatie/crawler with spatie/browsershot is a powerful crawler that can execute JavaScript through Puppeteer, but it does not support the complete Puppeteer API.
  Example usage:
  - How to write decent crawlers with php | thePHP Website
  - Easily convert webpages to images using PHP - Freek Van der Herten’s blog on PHP, Laravel and JavaScript
  - http://www.lib4dev.in/info/spatie/crawler/45406338
- Symfony Panther. How to use: Scraping javascript websites using PHP Panther Library - Z.Rashwani Blog / zrashwani/arachnid
- Laravel Dusk with katsana/dusk-crawler
Other resources:
Symfony Panther
Symfony Panther is a standalone library that provides the same APIs as Goutte.
Requirements: Check with phpinfo() that the proc_open function is not disabled in PHP. If it is, remove any custom php.ini settings in ISPConfig that disable it.
This will also install chromium-browser as a snap:
apt install chromium-chromedriver
apt install libnss3 libicu-dev libzip-dev wget gnupg2 libasound2 fonts-liberation libappindicator3-1 xdg-utils lsb-release libxss1
Kill the process still listening on port 9515 (the ChromeDriver default):
kill -9 $(lsof -t -i:9515)
rialto-php/puphpeteer is a Puppeteer bridge for PHP, supporting the entire API. The most significant difference, in my opinion, is that every method call and getter/setter in PuPHPeteer is synchronous.
Usage resources:
- Puphpeteer: A Puppeteer bridge for PHP - Laravel News
- Scraping HTML with PHP Node and Puppeteer - DEV
- Keep facebook login session with PHP puphpeteer – Laravel Questions
For this one I also needed: apt install libxss1
Wow: here are a few workarounds or tools which could help keep your headless browser-based scrapers from getting banned:
- Puppeteer Extra with the Puppeteer Stealth Plugin
- Faking geolocation: https://www.scrapehero.com/how-to-take-screenshots-of-a-web-page-using-puppeteer/
- Blocking images: https://www.scrapehero.com/how-to-increase-web-scraping-speed-using-puppeteer/
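Blocking images can be sketched with Puppeteer's request interception. This is a minimal sketch, not the code from the linked article: the helper names (makeRequestHandler, blockHeavyResources) are hypothetical, and blocking 'media' and 'font' in addition to 'image' is an assumption.

```javascript
// Resource types to skip. 'image' follows the note above;
// 'media' and 'font' are an assumption added for illustration.
const BLOCKED_TYPES = new Set(['image', 'media', 'font']);

// Hypothetical helper: returns a handler for Puppeteer's 'request' event
// that aborts blocked resource types and lets everything else through.
function makeRequestHandler(blocked = BLOCKED_TYPES) {
  return (request) => {
    if (blocked.has(request.resourceType())) {
      request.abort();
    } else {
      request.continue();
    }
  };
}

// Hypothetical helper: turns interception on for a page and installs the handler.
async function blockHeavyResources(page, blocked = BLOCKED_TYPES) {
  await page.setRequestInterception(true);
  page.on('request', makeRequestHandler(blocked));
}

// Usage with a real browser (requires `npm install puppeteer`):
// const puppeteer = require('puppeteer');
// const browser = await puppeteer.launch();
// const page = await browser.newPage();
// await blockHeavyResources(page);
// await page.goto('https://example.com');
```

Interception must be enabled before page.goto(), otherwise the first requests go out unfiltered.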
SOCKS5 proxy (note that SOCKS proxies do not work with authentication):
- https://pocketadmin.tech/en/puppeteer-use-proxy/
- https://docs.browserless.io/docs/using-a-proxy.html
- https://dev.to/sonyarianto/practical-puppeteer-using-proxy-to-browse-a-page-1m82
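A proxy is usually passed to Chromium through the --proxy-server launch flag. The following is a rough sketch under that assumption. The helper name proxyLaunchOptions and the 127.0.0.1:1080 address are made up for illustration.

```javascript
// Hypothetical helper: builds Puppeteer launch options for an
// unauthenticated proxy using Chromium's --proxy-server flag.
function proxyLaunchOptions({ protocol = 'socks5', host, port }) {
  if (!host || !port) {
    throw new Error('host and port are required');
  }
  return { args: [`--proxy-server=${protocol}://${host}:${port}`] };
}

// Usage with a real browser (requires `npm install puppeteer`);
// the address below is a placeholder, not a real proxy:
// const puppeteer = require('puppeteer');
// const browser = await puppeteer.launch(
//   proxyLaunchOptions({ host: '127.0.0.1', port: 1080 })
// );
//
// For an HTTP proxy that needs credentials, Puppeteer offers
// page.authenticate({ username, password }).
```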
To retrieve meta tags from the page, try:
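One possible approach, offered as a sketch and not as the snippet these notes originally pointed to, is to collect all meta elements with page.$$eval(). The helper names collectMeta and getMeta are hypothetical, and keying the result by the name or property attribute is an assumption.

```javascript
// Runs inside the browser context when passed to page.$$eval(),
// so it must not reference any variable defined in this file.
function collectMeta(elements) {
  const meta = {};
  for (const el of elements) {
    // Standard tags use name="...", Open Graph tags use property="...".
    const key = el.getAttribute('name') || el.getAttribute('property');
    const value = el.getAttribute('content');
    if (key && value !== null) {
      meta[key] = value;
    }
  }
  return meta;
}

// Hypothetical helper: returns an object such as
// { description: '...', 'og:title': '...' } for an opened page.
async function getMeta(page) {
  return page.$$eval('meta', collectMeta);
}

// Usage with a real browser (requires `npm install puppeteer`):
// const page = await browser.newPage();
// await page.goto('https://example.com');
// const meta = await getMeta(page);
```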
To retrieve the HTML content, use:
const html = await page.content();
and then open url