Scraping in PHP
npm install puppeteer --global
The WordPress way: native PHP without JS execution
The WordPress HTTP functions (such as wp_remote_get()) accept a 'blocking' request argument (see this comment). Setting it to false lets you “trigger-and-forget” a request and continue with PHP execution.
A nice usage example is wptt-webfont-loader.php.
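A minimal sketch of such a trigger-and-forget call. The helper name non_blocking_args() and the example URL are made up for illustration; 'blocking' and 'timeout' are standard WordPress HTTP API arguments. The call is guarded so the snippet also runs outside WordPress, where it simply does nothing.

```php
<?php
// Hypothetical helper: request arguments for a fire-and-forget HTTP call.
function non_blocking_args(float $timeout = 0.01): array {
    return [
        'blocking' => false,    // do not wait for the response
        'timeout'  => $timeout, // give up on the connection almost immediately
    ];
}

// wp_remote_get() only exists inside WordPress, so guard the call.
if (function_exists('wp_remote_get')) {
    // Assumed URL, e.g. an endpoint that warms a cache in the background.
    wp_remote_get('https://example.com/warm-cache', non_blocking_args());
}
```

Because the request is non-blocking, the return value carries no usable response body. This suits background triggers, not fetching content to scrape.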
- guzzle/guzzle is an HTTP client. Many others are equally good; it just happens to be one of the most mature and most downloaded.
- FriendsOfPHP/Goutte is an HTTP client made for scraping. It has exactly the same API as Symfony Panther but is much faster because it does not execute JavaScript.
- KnpLabs/snappy is a wrapper for wkhtmltopdf/wkhtmltoimage.
Symfony Panther
Symfony Panther is a standalone library that provides the same APIs as Goutte.
Requirements: check with phpinfo() that the proc_open function is not disabled in PHP. If it is, disable any custom php.ini settings in ISPConfig.
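As an alternative to reading the phpinfo() output by eye, the check can be scripted. This is a sketch only; the helper name is_function_usable() is made up for illustration. It consults both function_exists() and the disable_functions ini setting, since older PHP versions may still report a disabled function as existing.

```php
<?php
// Hypothetical helper: true if $name exists and is not listed in disable_functions.
function is_function_usable(string $name): bool {
    $disabled = array_map('trim', explode(',', (string) ini_get('disable_functions')));
    return function_exists($name) && !in_array($name, $disabled, true);
}

echo is_function_usable('proc_open')
    ? "proc_open is available\n"
    : "proc_open is disabled\n";
```

Run it through the same PHP SAPI that will run Panther (CLI or FPM), because each can load a different php.ini.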
The following also installs chromium-browser as a snap:
apt install chromium-chromedriver
apt install libnss3
Kill whatever is still listening on port 9515 (ChromeDriver's default port):
kill -9 $(lsof -t -i:9515)
rialto-php/puphpeteer is a Puppeteer bridge for PHP, supporting the entire API. The most significant difference, in my opinion, is that every method call and getter/setter in PuPHPeteer is synchronous.
- Puphpeteer: A Puppeteer bridge for PHP - Laravel News
- Scraping HTML with PHP Node and Puppeteer - DEV
- Keep facebook login session with PHP puphpeteer – Laravel Questions
For it I also needed: apt install libxss1
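A short sketch of what the synchronous style looks like in practice. It assumes nesk/puphpeteer is installed through Composer and that vendor/autoload.php has already been required. The helper name fetch_rendered_html() and the example URL are made up for illustration. The final call is guarded so the snippet does nothing when the library is missing.

```php
<?php
use Nesk\Puphpeteer\Puppeteer;

// Hypothetical helper: loads $url in headless Chrome and returns the rendered HTML.
function fetch_rendered_html(string $url): string {
    $puppeteer = new Puppeteer();
    $browser = $puppeteer->launch();
    try {
        $page = $browser->newPage();
        $page->goto($url);       // returns once navigation is done, no await needed
        return $page->content(); // HTML after JavaScript has run
    } finally {
        $browser->close();       // always shut the browser down
    }
}

// Only runs when the library can be autoloaded.
if (class_exists(Puppeteer::class)) {
    echo strlen(fetch_rendered_html('https://example.com')), " bytes\n";
}
```

Each call returns its result directly, where the Node.js equivalent would need an await on every line.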
Here are a few workarounds and tools that could help keep your headless-browser scrapers from getting banned:
- Puppeteer Extra with the Puppeteer Stealth Plugin
- Faking geolocation
https://www.scrapehero.com/how-to-take-screenshots-of-a-web-page-using-puppeteer/
SOCKS5 proxy:
- https://pocketadmin.tech/en/puppeteer-use-proxy/
- https://docs.browserless.io/docs/using-a-proxy.html
Note that SOCKS proxies do not work with authentication.
To retrieve meta from page, try:
To retrieve the HTML content, use:
const html = await page.content();
and then open url
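Once the page HTML is available as a string (for example the result of page.content()), the meta tags can also be read on the PHP side without a browser. This is a minimal sketch using PHP's bundled DOM extension; the helper name extract_meta() and the sample HTML are made up for illustration.

```php
<?php
// Hypothetical helper: returns [name-or-property => content] for each <meta> tag.
function extract_meta(string $html): array {
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $meta = [];
    foreach ($doc->getElementsByTagName('meta') as $tag) {
        // Standard tags use "name"; Open Graph tags use "property".
        $key = $tag->getAttribute('name') ?: $tag->getAttribute('property');
        if ($key !== '') {
            $meta[$key] = $tag->getAttribute('content');
        }
    }
    return $meta;
}

$html = '<html><head><meta name="description" content="Demo page">'
      . '<meta property="og:title" content="Demo"></head><body></body></html>';
print_r(extract_meta($html));
```

Tags without a name or property attribute (such as the charset declaration) are skipped. If a key appears more than once, the last occurrence wins.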