Scraping in PHP

Install Puppeteer:

npm install puppeteer --global

About libraries: Mohamed Said on Twitter: “Is there any neat PHP package that drives headless chrome? Similar do Dusk but not designed for only testing.”

Without JavaScript processing

The WordPress way: native PHP without JS execution

The WordPress HTTP API contains functions like wp_remote_get(), which is essentially a wrapper around cURL.

Note that wp_remote_get() accepts a 'blocking' request argument (see this comment). Setting it to false makes the request “trigger-and-forget”: the request is sent and PHP execution continues without waiting for the response.

A nice usage example is wptt-webfont-loader.php.

Various HTTP clients without the ability to execute JavaScript

With JavaScript execution


PHP Web Scraping: What to know before you start with Symfony Panther, Goutte, and more

Laravel Dusk with katsana/dusk-crawler

Other resources:

Symfony Panther is a standalone library that provides the same APIs as Goutte.

Requirements: check with phpinfo() that the proc_open function is not disabled in PHP (it must not appear in disable_functions). If it is disabled, turn off any custom php.ini settings in ISPConfig that disable it.

Install ChromeDriver (this will also install chromium-browser as a snap):

apt install chromium-chromedriver

apt install libnss3 libxss1

Kill the process still listening on port 9515 (the ChromeDriver default):

kill -9 $(lsof -t -i:9515)

rialto-php/puphpeteer is a Puppeteer bridge for PHP, supporting the entire API. The most significant difference, in my opinion, is that every method call and getter/setter in PuPHPeteer is synchronous.

Usage resources:

For PuPHPeteer I also needed: apt install libxss1

Wow: here are a few workarounds or tools that could help keep your headless browser-based scrapers from getting banned: Puppeteer Extra with the Puppeteer Stealth Plugin, and faking geolocation.

Blocking images:
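One common way to do this in Puppeteer is request interception. The following is a minimal sketch, not a definitive setup: it assumes a `page` object obtained from Puppeteer, and the helper names `handleRequest` and `blockImages` are made up for illustration.

```javascript
// Resource types to drop; 'image' is one of the values returned by
// Puppeteer's request.resourceType().
const BLOCKED_TYPES = new Set(['image']);

// Decide what to do with a single intercepted request.
function handleRequest(request) {
  if (BLOCKED_TYPES.has(request.resourceType())) {
    request.abort();      // never download the image
  } else {
    request.continue();   // let everything else through unchanged
  }
}

// Hypothetical helper: enable interception on a Puppeteer page and
// route every request through handleRequest.
async function blockImages(page) {
  await page.setRequestInterception(true);
  page.on('request', handleRequest);
}

// Usage (assumes `page` comes from `await browser.newPage()`):
//   await blockImages(page);
//   await page.goto('https://example.com');
```

Other resource types such as 'stylesheet' or 'font' could be added to BLOCKED_TYPES in the same way if they are not needed for scraping.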

SOCKS5 proxy: note that SOCKS proxies do not work with authentication.
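A proxy is usually passed to the browser through Chromium's `--proxy-server` launch flag. The sketch below only builds the launch options; the helper name `socksLaunchOptions` and the host and port values are assumptions for illustration, and it rejects credentials because of the authentication limitation noted above.

```javascript
// Hypothetical helper: build Puppeteer launch options for an
// unauthenticated SOCKS5 proxy.
function socksLaunchOptions(host, port) {
  if (host.includes('@')) {
    // user:pass@host style addresses will not work with SOCKS here
    throw new Error('Credentials in a SOCKS proxy address are not supported');
  }
  return { args: [`--proxy-server=socks5://${host}:${port}`] };
}

// Usage (assumes `puppeteer` is already required):
//   const browser = await puppeteer.launch(socksLaunchOptions('127.0.0.1', 1080));
```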

To retrieve the meta refresh value from a page, try:

await page.$eval('meta[http-equiv=refresh]', a => a.content)

To retrieve the HTML content, use:

const html = await page.content();

and then open the URL.
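The content attribute of a meta refresh tag typically looks like `5; url=/next`, so the target URL has to be extracted before it can be opened. A minimal sketch of that step follows; the function name `parseMetaRefresh` is hypothetical, and the parsing is simplified and does not cover every form browsers accept.

```javascript
// Parse the content attribute of <meta http-equiv="refresh">.
// Returns { delay, url }, where url is null when only a delay is given,
// or null when the string is not a recognizable refresh value.
// baseUrl is needed to resolve relative targets.
function parseMetaRefresh(content, baseUrl) {
  const match = /^\s*(\d+)\s*(?:[;,]\s*(?:url\s*=\s*)?(.*))?$/i.exec(content);
  if (!match) return null;

  const delay = Number(match[1]);
  // Trim and strip one pair of matching quotes around the target.
  const target = (match[2] || '').trim().replace(/^(['"])(.*)\1$/, '$2');
  if (!target) return { delay, url: null };

  return { delay, url: new URL(target, baseUrl).href };
}

// Usage with Puppeteer (assumes an existing `page`):
//   const content = await page.$eval('meta[http-equiv=refresh]', a => a.content);
//   const next = parseMetaRefresh(content, page.url());
//   if (next && next.url) await page.goto(next.url);
```

Note that `page.$eval` throws when no matching element exists, so pages without a meta refresh tag need a try/catch around that call.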

date 01. Jan 0001 | modified 28. May 2021