Scraping in PHP

Puppeteer

Install Puppeteer:

npm install puppeteer --global

About libraries, see Mohamed Said on Twitter: “Is there any neat PHP package that drives headless chrome? Similar do Dusk but not designed for only testing.”

Without JavaScript processing

The WordPress way: native PHP without JS execution

The WordPress HTTP API provides functions such as wp_remote_get(), which is essentially a wrapper around cURL.

Note that wp_remote_get() accepts a 'blocking' request argument (see this comment). Setting it to false lets you “trigger-and-forget” a request and continue with PHP execution.

A nice usage example is wptt-webfont-loader.php.

Various HTTP clients without the ability to execute JavaScript

With JavaScript execution

Articles:

PHP Web Scraping: What to know before you start with Symfony Panther, Goutte, and more

Laravel Dusk with katsana/dusk-crawler


Other resources:

Symfony Panther: a standalone library that provides the same APIs as Goutte.

Requirements: check with phpinfo() that the ‘proc_open’ function is not disabled in PHP. If it is, turn off any custom php.ini settings in ISPConfig that disable it.

Install ChromeDriver (this will also install chromium-browser as a snap):

apt install chromium-chromedriver

apt install libnss3 \
    libicu-dev \
    libzip-dev \
    wget \
    gnupg2 \
    libasound2 \
    fonts-liberation \
    libappindicator3-1 \
    xdg-utils \
    lsb-release \
    libxss1

Kill the process listening on ChromeDriver's port (9515 by default):

kill -9 $(lsof -t -i:9515)

rialto-php/puphpeteer is a Puppeteer bridge for PHP, supporting the entire API. The most significant difference, in my opinion, is that every method call and getter/setter in PuPHPeteer is synchronous.

Usage resources:

For this I needed: apt install libxss1

Wow: here are a few workarounds or tools that could help keep your headless browser-based scrapers from getting banned:

Puppeteer Extra – Puppeteer Stealth Plugin
Faking Geolocation
https://www.scrapehero.com/how-to-take-screenshots-of-a-web-page-using-puppeteer/

Blocking images: https://www.scrapehero.com/how-to-increase-web-scraping-speed-using-puppeteer/
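The image-blocking idea can be sketched with Puppeteer's request interception. This is a minimal sketch, not the linked article's code: the helper names shouldBlock and blockHeavyResources are made up for illustration, and blocking 'media' and 'font' in addition to 'image' is an assumption.

```javascript
// Resource types to skip. 'image' comes from the note above;
// 'media' and 'font' are an assumption added for illustration.
const BLOCKED_TYPES = new Set(['image', 'media', 'font']);

// Hypothetical helper: decide whether a request should be aborted.
function shouldBlock(resourceType) {
  return BLOCKED_TYPES.has(resourceType);
}

// Hypothetical helper: wire the decision into a Puppeteer page.
// Call it before page.goto() so the first navigation is filtered too.
async function blockHeavyResources(page) {
  await page.setRequestInterception(true);
  page.on('request', (request) => {
    if (shouldBlock(request.resourceType())) {
      request.abort();
    } else {
      request.continue();
    }
  });
}
```

Usage would be something like: const page = await browser.newPage(); await blockHeavyResources(page); await page.goto(url);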

SOCKS5 proxy:

https://pocketadmin.tech/en/puppeteer-use-proxy/
https://docs.browserless.io/docs/using-a-proxy.html
https://dev.to/sonyarianto/practical-puppeteer-using-proxy-to-browse-a-page-1m82

BUT SOCKS proxies don’t work with authentication.
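A proxy is usually passed to Chromium through the --proxy-server launch flag. The sketch below assumes a local SOCKS5 proxy at 127.0.0.1:1080 (a placeholder address). The helper names proxyLaunchOptions and openViaProxy are hypothetical.

```javascript
// Hypothetical helper: build Puppeteer launch options for a given proxy URL.
function proxyLaunchOptions(proxyUrl) {
  return {
    headless: true,
    args: [`--proxy-server=${proxyUrl}`],
  };
}

// Hypothetical helper: open a page through the proxy and return its HTML.
// Puppeteer is imported lazily so the pure helper above can be used
// (and tested) without the package installed.
async function openViaProxy(proxyUrl, targetUrl) {
  const { default: puppeteer } = await import('puppeteer');
  const browser = await puppeteer.launch(proxyLaunchOptions(proxyUrl));
  try {
    const page = await browser.newPage();
    await page.goto(targetUrl);
    return await page.content();
  } finally {
    await browser.close();
  }
}

// Example call; the proxy address is a placeholder assumption:
// openViaProxy('socks5://127.0.0.1:1080', 'https://example.com').then(console.log);
```

For HTTP proxies that need credentials, page.authenticate({ username, password }) is the usual route. As far as I know there is no equivalent for SOCKS.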


To retrieve the content of a meta tag from the page, try:

await page.$eval('meta[http-equiv=refresh]', a => a.content)
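The content attribute of a refresh meta tag typically looks like "5; url=https://example.com/next", so it still has to be split into a delay and a target. This is a rough sketch of that step. The names parseMetaRefresh and resolveRefreshUrl are hypothetical, and the regular expression only covers the common forms.

```javascript
// Hypothetical helper: split a meta refresh "content" value into
// delay (seconds) and target URL. Returns null if the value does not
// start with a number; url is null when only a delay is given.
function parseMetaRefresh(content) {
  const pattern = /^\s*(\d+)\s*(?:[;,]\s*(?:url\s*=\s*)?['"]?([^'"]+?)['"]?\s*)?$/i;
  const match = pattern.exec(content);
  if (!match) return null;
  return { delaySeconds: Number(match[1]), url: match[2] || null };
}

// Hypothetical helper: resolve the (possibly relative) target against
// the URL of the page it was found on.
function resolveRefreshUrl(content, baseUrl) {
  const parsed = parseMetaRefresh(content);
  if (!parsed || !parsed.url) return null;
  return new URL(parsed.url, baseUrl).href;
}

// Usage inside a Puppeteer script (page is a Puppeteer Page):
//   const content = await page.$eval('meta[http-equiv=refresh]', (el) => el.content);
//   const target = resolveRefreshUrl(content, page.url());
```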

To retrieve the HTML content, use:

const html = await page.content();

and then open the URL.
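Put together, the two steps (read the current HTML, then open another URL) might look like the sketch below. The function name snapshotThenOpen is hypothetical, and waiting for 'networkidle2' is an assumption about when the next page counts as loaded.

```javascript
// Hypothetical helper: capture the HTML of the current page, navigate to
// nextUrl, and capture the HTML again. `page` is a Puppeteer Page.
async function snapshotThenOpen(page, nextUrl) {
  // Serialized DOM of whatever is loaded right now.
  const before = await page.content();

  // Assumption: treat the page as loaded once the network is mostly idle.
  await page.goto(nextUrl, { waitUntil: 'networkidle2' });

  const after = await page.content();
  return { before, after };
}
```

Because the function only relies on page.content() and page.goto(), it can be tried with any object exposing those two methods before wiring it to a real browser.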

date 01. Jan 0001 | modified 29. Dec 2023