Scraping: Using Proxies

The site https://www.gutegutscheine.ch/ won't load through a proxy, but only BECAUSE it wants a RESIDENTIAL IP and for no other reason… Verified working for me through proxyland.io: curl -I https://www.gutegutscheine.ch/

Also, block images to save bandwidth:
https://github.com/puppeteer/puppeteer/blob/main/examples/block-images.js
https://www.scrapehero.com/how-to-increase-web-scraping-speed-using-puppeteer/
https://github.com/puppeteer/puppeteer/issues/1913
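A minimal sketch of the image-blocking idea, assuming Puppeteer's request interception. The handler name and the extra blocked types ("media", "font") are my own assumptions; only "image" comes from the note. The wiring is left in comments because it needs Puppeteer installed.

```javascript
// Resource types to drop. "image" is what the note is about;
// "media" and "font" are an assumption added for extra savings.
const BLOCKED_TYPES = new Set(['image', 'media', 'font']);

// Handler for Puppeteer's "request" event. It only relies on the
// resourceType(), abort() and continue() methods of the request object.
function blockHeavyResources(request) {
  if (BLOCKED_TYPES.has(request.resourceType())) {
    return request.abort();
  }
  return request.continue();
}

// Wiring (requires `npm install puppeteer`, not executed here):
//   const page = await browser.newPage();
//   await page.setRequestInterception(true);
//   page.on('request', blockHeavyResources);
```

Interception has to be enabled before the handler is attached, otherwise abort() and continue() throw.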

It did not complain about Screaming Frog, so once you are "in the saddle" there is no cause for panic.

It won't work through Puppeteer either.

Trying it here: https://try-puppeteer.appspot.com/

The header that is sent with puppeteer identifies it as headless chrome, which may be the reason it is blocked so easily. Try copying the headers from your non-headless browser.

Solved with the USER AGENT!

await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3419.0 Safari/537.36');

Perfect articles:

Idea: when you detect headless Chrome, return nothing but dummy vouchers that point to THEIR site :)

A serious proxy company: https://intoli.com/signup/

WOW, EVERYTHING SOLVED: puppeteer-extra/packages/puppeteer-extra-plugin-stealth at master · berstend/puppeteer-extra
HOLY FUCK, Recaptcha: puppeteer-extra/packages/puppeteer-extra-plugin-recaptcha at master · berstend/puppeteer-extra

And this is a great idea too :)

There is also: Here is an example of launching puppeteer with random user agent using the modern-random-ua NPM package https://github.com/skratchdot/random-useragent
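A rough sketch of the random user agent idea without any package, just to show the mechanism. The first string is the one from these notes; the other two entries and the function name are illustrative assumptions, not taken from either package.

```javascript
const USER_AGENTS = [
  // From these notes:
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3419.0 Safari/537.36',
  // Illustrative extra entries (assumptions):
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
];

// Pick one entry at random. `random` is injectable so the choice
// can be made deterministic in tests.
function pickUserAgent(pool = USER_AGENTS, random = Math.random) {
  return pool[Math.floor(random() * pool.length)];
}

// Usage with Puppeteer (not executed here):
//   await page.setUserAgent(pickUserAgent());
```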

Also announced: https://datadome.co/bot-detection/will-playwright-replace-puppeteer-for-bad-bot-play-acting/ https://github.com/microsoft/playwright DAMN, ALL 3 BROWSERS! The same team that wrote Puppeteer moved to Microsoft and built this.

https://help.apify.com/en/collections/1669748-overcoming-anti-scraping-protection

https://github.com/digitalhurricane-io/puppeteer-detection-100-percent

And proxy detection?

It probably detects based on headers: X_forwarded_for. This one detected it well: https://ip-check.net/detect-proxy.php and this one not so well: https://www.infobyip.com/detectproxy.php

HTTP_X_FORWARDED_FOR
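HTTP_X_FORWARDED_FOR is how PHP/CGI exposes the X-Forwarded-For request header. A hedged sketch of what header-based detection might look like on the server side: the header list beyond X-Forwarded-For is my assumption about what such checkers commonly look at, and the function name is made up.

```javascript
// Headers that a forwarding proxy may add. Only x-forwarded-for is
// named in the notes; the rest are assumed for illustration.
const PROXY_HEADERS = ['x-forwarded-for', 'via', 'forwarded', 'x-real-ip', 'proxy-connection'];

// Return the proxy-revealing headers present in a request.
// Header names are compared case-insensitively.
function findProxyHeaders(headers) {
  const present = new Set(Object.keys(headers).map((name) => name.toLowerCase()));
  return PROXY_HEADERS.filter((name) => present.has(name));
}
```

An empty result does not prove there is no proxy: an anonymous proxy simply does not send these headers, which would explain why some checkers do better than others.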

https://stackoverflow.com/questions/32459301/how-to-detect-or-prevent-proxy-browsing

Companies that provide protection

HERE IS THE SCRIPT: https://www.blocked.com/index.php https://datadome.co/ https://www.ipqualityscore.com/

Proxy services

Excellent but too expensive:

https://oxylabs.io/pricing/datacenter-proxies https://luminati.io/pricing/

https://medium.com/@colopmike8/top-5-residential-proxy-providers-2a644fddbe09 https://proxyrate.com/ https://www.scraperapi.com/blog/the-10-best-rotating-proxy-services-for-web-scraping/ https://medium.com/@makcorps.activation.api/the-10-best-residential-proxy-providers-2020-9d2a42450b59

But the cheapest are:
1. https://anonymous-proxies.net/pricing (Bucharest): unlimited, 1 proxy is $5, 5 proxies are $25. By far the cheapest (monthly payment); min order is $10 and the site looks great.
2. https://stormproxies.com/clients/signup/kqUM3vnwq?product_id_page-0[]=97-97 because it has unlimited bandwidth! $19.00 a month; 5 proxies are $50.
3. Check it out, it's good: https://rsocks.net/mobile-proxy

The best I've found that don't cost an arm and a leg: https://proxyland.io/ looks "normal". The pay-as-you-go price is $50 for 20GB, which is about 13000 pages at their weight of roughly 1.5MB each (though I'll be able to save a lot there). Since they have 431 shops, that works out to 13000/430 = 30 scrapes of the entire site. That's great, and more than enough! So it costs me $50 a month to scrape it every day (just to check what's new).
https://stackoverflow.com/questions/52777757/how-to-use-proxy-in-puppeteer-and-headless-chrom
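The same back-of-the-envelope calculation as a small sketch. It assumes 1 GB = 1000 MB and one page per shop per full scrape, which is how the note counts; the function and field names are made up.

```javascript
// Estimate how many pages and full-site scrapes a bandwidth package buys.
function scrapeBudget({ gigabytes, pageMegabytes, shops }) {
  const pages = Math.floor((gigabytes * 1000) / pageMegabytes); // 1 GB taken as 1000 MB
  const fullScrapes = Math.floor(pages / shops); // one page per shop per scrape
  return { pages, fullScrapes };
}

// Numbers from the note: 20GB package, ~1.5MB per page, 431 shops.
const budget = scrapeBudget({ gigabytes: 20, pageMegabytes: 1.5, shops: 431 });
console.log(budget); // { pages: 13333, fullScrapes: 30 }
```

Blocking images would shrink pageMegabytes and raise both numbers.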

More expensive:
https://www.yourprivateproxy.com/buy-residential-proxies
https://astroproxy.com/#rates-block - 20GB also came to $50
https://zenscrape.com/residential-proxies/
https://homeip.io/pricing/
https://www.ipburger.com/pricing/residential/
https://www.geosurf.com/products/residential-ips/
https://shifter.io/
https://www.proxyrack.com/unmetered-residential/
https://blazingseollc.com/proxy/faq/ - these don't have residential
https://www.scrapingdog.com/pricing
https://www.proxy-cheap.com/pricing/residential-proxies/
https://proxyaqua.com/residential-rotating-proxies/

So, I have a problem with residential proxies

In theory, I can do the following:


A good explanation:

https://www.ipqualityscore.com/articles/view/13/how-residential-proxies-enable-fraud


I COULD SELL BANDWIDTH!

https://packetstream.io/support/faq


Avoid scrape:

Proxy for Windows:

FreeProxy Internet Suite (http://www.handcraftedsoftware.org/) https://dannyda.com/2020/01/03/list-of-open-source-free-proxy-forward-proxy-reverse-proxy-cache-server-software/

await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3419.0 Safari/537.36');
await page.setUserAgent('Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)');
await page.setUserAgent('Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/W.X.Y.Z Safari/537.36');



https://www.producthunt.com/posts/scrapeowl

https://scrapeowl.com/pricing/ for $29/month I get 250000 API calls (credits), and when I use the Premium Residential Proxy with JavaScript, one request costs 25 credits. So that is 10000 requests.
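The credit math as a tiny sketch, handy for comparing the near-identical API services on the same footing. The function name and the per-1000 figure are my additions; the inputs are the ScrapeOwl numbers from the note.

```javascript
// Convert a credit-based plan into real request counts and cost.
function planValue({ monthlyPrice, credits, creditsPerRequest }) {
  const requests = Math.floor(credits / creditsPerRequest);
  const pricePer1000 = (monthlyPrice * 1000) / requests;
  return { requests, pricePer1000 };
}

// ScrapeOwl numbers from the note: $29, 250000 credits, 25 credits per request.
const scrapeOwl = planValue({ monthlyPrice: 29, credits: 250000, creditsPerRequest: 25 });
console.log(scrapeOwl.requests); // 10000
```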

Almost identical in every respect: https://www.scrapingbee.com/#pricing as well as https://www.scraperapi.com/pricing

Datashake Web Scraper API zenscrape ProxyCrawl


I can build my own proxy network:

How to Set up Your Own Web Proxy on Ubuntu 16.04 VPS PHP Web Proxy Script - A simple and free alternative to Glype

Better?

Woow! Scrapoxy Understand Scrapoxy — Scrapoxy 3.1.1 documentation

tinyproxy(8): light-weight HTTP proxy daemon - Linux man page GitHub - tinyproxy/tinyproxy: tinyproxy - a light-weight HTTP/HTTPS proxy daemon for POSIX operating systems How to setup a simple proxy server with tinyproxy (Debian 10 Buster) - NXNJZ

GitHub - dzt/easy-proxy: Make mass proxies easily. (DigitalOcean)

But you don't have to bother: the cheapest is InstantProxies (How to use private proxies with WordPress Rankie plugin - ValvePress), and it is also mentioned in the Envato comments about Rankie


New alternatives:

Shadowsocks - A secure socks5 proxy
a little older: Dante - A free SOCKS server
or even older: tinyproxy is NOT a SOCKS5 proxy, but rofl0r/microsocks is

Client shadowsocks/shadowsocks-windows: A C# port of shadowsocks


WOOW:

Bible: Self-hosted socks5 or shadowsocks server in a single command Hetzner Online Community

Shadowsocks is NOT SOCKS5 Proxy but its own protocol, better in a way because it can’t be blocked or detected easily: How to Set up Shadowsocks-libev Proxy Server on Ubuntu 16.04/17.10 Shadowsocks - Quick Guide

Easily Boost Ubuntu Network Performance by Enabling TCP BBR

HTTP proxy can only proxy HTTP (TCP) traffic whereas a SOCKS5 proxy can handle any type of traffic using either TCP or UDP. SOCKS5 proxy is more universal and can be used with more applications.

Self-hosted socks5 or shadowsocks server in a single command This is just a basic proxy, very simple to install and use. If you are interested in a more functional and complex solution, you may check out StreisandEffect/streisand

Install Dante (SOCKS5):

export PORT=22333; export PASSWORD=somepass; export USER=user
curl https://selivan.github.io/socks.txt | sudo --preserve-env bash

or even better:

export PORT=22333; export PASSWORD=somepass; export USER=user
curl https://gist.githubusercontent.com/cvladan/8f0de5e61d773664d634cd70f3e07fe6/raw/danted_setup.sh | sudo --preserve-env bash

Test if it’s working:

# curl -x socks5://<your_username>:<your_password>@<your_ip_server>:<your_danted_port> ifconfig.co
curl -x socks5://user:pass@46.4.33.38:22333 ifconfig.co

Hetzner Online Community

Dante configuration

Q: Error after every reboot: interface "eno1" has no usable IP-addresses configured A: https://serverfault.com/a/976031/152357

How To Use Systemctl to Manage Systemd Services and Units

Check the config with: systemctl cat danted
On any problem, check the service log: journalctl -u danted

Enable specific rules

Rules

There are two sets of rules and they work at different levels. Rules prefixed with client (client pass/block) are checked first and are used to see if the client is allowed to connect to the Dante server. We will call them “client-rules”. The other rules, which we call “socks-rules”, are prefixed with socks pass/block and these rules are only checked if the client connection has been allowed by the client-rules.

Both sets of rules are evaluated on a “first match is best match” basis. That means, the first rule matched for a particular client or socks request is the rule that will be used.

Limit by IP

The SOCKS5 & Puppeteer “no-authentication” problem can be solved with Dante’s limit by IP address. Place this inside /etc/danted.conf before any other socks rule:

# allow access from IP without authentication
socks pass {
  from: 46.4.33.38/32 to: 0.0.0.0/0
  log: error
  socksmethod: none
}

We must change global defaults to:

socksmethod: pam.any none

Because rules inherit the global socksmethod by default, we must also change all the remaining socks pass rules (usually there is only one). Put socksmethod: pam.any there so authentication is still required for any other IP.
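Why the IP rule has to come before the other socks rules can be shown with a toy model of "first match is best match". This is only a simulation of the ordering logic, not how Dante is implemented: it matches on the from field alone, and the rule objects and function names are made up.

```javascript
// Convert dotted IPv4 text to an unsigned 32-bit integer.
function ipToInt(ip) {
  return ip.split('.').reduce((acc, octet) => ((acc << 8) | Number(octet)) >>> 0, 0);
}

// True if `ip` falls inside the CIDR block, e.g. "46.4.33.38/32".
function inCidr(ip, cidr) {
  const [base, bitsText] = cidr.split('/');
  const bits = Number(bitsText);
  const mask = bits === 0 ? 0 : (0xffffffff << (32 - bits)) >>> 0;
  return ((ipToInt(ip) & mask) >>> 0) === ((ipToInt(base) & mask) >>> 0);
}

// First matching rule wins; later rules are never consulted.
function firstMatch(rules, clientIp) {
  return rules.find((rule) => inCidr(clientIp, rule.from)) || null;
}

// Mirrors the config above: the trusted IP first, then everyone else.
const socksRules = [
  { from: '46.4.33.38/32', action: 'pass', socksmethod: 'none' },
  { from: '0.0.0.0/0', action: 'pass', socksmethod: 'pam.any' },
];

console.log(firstMatch(socksRules, '46.4.33.38').socksmethod); // none
console.log(firstMatch(socksRules, '8.8.8.8').socksmethod); // pam.any
```

With the two rules swapped, the catch-all 0.0.0.0/0 rule would match first and the trusted IP would be asked for a password too.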

Test:

curl -x socks5://user:pass@159.100.248.236:22333 ifconfig.co

If you want to give everyone access without authentication: just socksmethod: none, and nothing else is needed


Another script for automated dante socks proxy server installation: akmaslov-dev/dante-proxy-server: Dante socks proxy server


Web Scraping: A Brief Overview of Scrapy and Selenium, Part I | by Anastasia Reusova | Towards Data Science Web Scraping: A Less Brief Overview of Scrapy and Selenium, Part II | by Anastasia Reusova | Towards Data Science


interesting: Optimize Shadowsocks


Danted and Puppeteer Using http/s and socks4/5 proxies with puppeteer and chrome with squid and danted – CoLaBug.com

Unfortunately, it is not possible to use puppeteer/chromium with a SOCKS5 proxy. The Chrome browser does not support socks with authentication. At first this seemed like a major drawback to me, but I decided to use the proxy without authentication anyway, so it is possible after all.
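A minimal sketch of pointing Puppeteer at the unauthenticated SOCKS5 proxy, assuming Chromium's --proxy-server flag. The helper name is made up, and the host and port are the ones from the curl test earlier in these notes; the launch call is left in comments because it needs Puppeteer installed.

```javascript
// Build Puppeteer launch options for a SOCKS5 proxy without credentials.
// Chromium gets the proxy via its --proxy-server command-line flag.
function socksLaunchOptions(host, port) {
  return {
    headless: true,
    args: [`--proxy-server=socks5://${host}:${port}`],
  };
}

// Usage (requires `npm install puppeteer`, not executed here):
//   const puppeteer = require('puppeteer');
//   const browser = await puppeteer.launch(socksLaunchOptions('46.4.33.38', 22333));
//   const page = await browser.newPage();
//   await page.goto('https://ifconfig.co'); // should report the proxy's IP
```

This only works if the scraper's IP is allowed by a socksmethod: none rule in danted.conf.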


Cuadrix/puppeteer-page-proxy: Additional module to use with ‘puppeteer’ for setting proxies per page basis.


Alive Proxy Servers for Personal Use

date: 2011-02-09

I need a source for reliable, currently alive proxy servers. I do use them for playing music from Last.fm, or watching TV on clicker.tv.

So, I found a quite reliable source of proxy servers: http://blog.qualityproxylist.com/


AppSumo - tried it: https://appsumo.com/products/quickscraper/ Home - Quick Scraper - Quick Scraper - Proxy API for Web Scraping


Top 7 Web Scraping Tools in 2023 Bardeen | Automate your repetitive tasks with one click


Scraping Robot does JavaScript rendering and handles the proxies. It is run by an actual proxy company, Rayobyte, formerly Blazing SEO. I started looking into all this because Blazing SEO long ago offered captcha solving and OCR, complete with PHP examples: Blazing OCR API. The free plan has all features included, with a quota of 5000 scrapes/month.


Proxifier - The Most Advanced Proxy Client YtFlow/Maple: A lightweight Universal Windows proxy app based on https://github.com/eycorsican/leaf

getlantern/lantern


ScrapingAnt - Web Scraping API | Proxy API offers 10000 API credits/month for free, which includes JavaScript rendering, as well as 100K API credits/month for $20.

Proxy list of completely free proxies that is updated every few minutes: Free Proxies for Web Scraping | ScrapingAnt


Best Free Proxy Scraping Tools | ScrapingAnt: tools that build lists of proxy servers that are currently alive.

iw4p/proxy-scraper: scrape proxies from more than 5 different sources and check which ones are still alive

constverum/ProxyBroker: Proxy [Finder | Checker | Server]. HTTP(S) & SOCKS 🎭 and it has a perfect option to run a local proxy server that distributes incoming requests to a pool of found proxies.

imWildCat/scylla: Intelligent proxy pool for Humans™

date 31. Oct 2020 | modified 10. Jun 2024
filename: Scraping » Proxying