Scraping: Using Proxies
It does not work through a proxy: https://www.gutegutscheine.ch/ but the reason is RESIDENTIAL and nothing else… it works for me, verified, on proxyland.io: curl -I https://www.gutegutscheine.ch/
also, do “Block images” to save bandwidth: https://github.com/puppeteer/puppeteer/blob/main/examples/block-images.js https://www.scrapehero.com/how-to-increase-web-scraping-speed-using-puppeteer/ https://github.com/puppeteer/puppeteer/issues/1913
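A minimal sketch of the image-blocking idea, assuming a Puppeteer `page` object is passed in. The names `BLOCKED_TYPES`, `shouldBlock` and `blockImages` are made up for this note, and the list of blocked types is an assumption; the Puppeteer calls used are `page.setRequestInterception`, `request.resourceType`, `request.abort` and `request.continue`.

```javascript
// Resource types that are not needed when only the HTML matters (assumed list).
const BLOCKED_TYPES = new Set(['image', 'media', 'font']);

// Pure decision helper, so the policy can be checked without a browser.
function shouldBlock(resourceType) {
  return BLOCKED_TYPES.has(resourceType);
}

// `page` is expected to be a Puppeteer Page.
async function blockImages(page) {
  await page.setRequestInterception(true);
  page.on('request', (request) => {
    if (shouldBlock(request.resourceType())) {
      request.abort(); // never downloaded, so it costs no proxy bandwidth
    } else {
      request.continue();
    }
  });
}
```

Call `await blockImages(page)` before `page.goto(...)`, otherwise the first requests go out unfiltered.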
It did not complain about Screaming Frog, so once you are “in the saddle” there is no panic
It does not work through Puppeteer either
Trying it here: https://try-puppeteer.appspot.com/
The header that is sent with puppeteer identifies it as headless chrome, which may be the reason it is blocked so easily. Try copying the headers from your non-headless browser.
Solved with the USER AGENT!
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3419.0 Safari/537.36');
Excellent articles:
- https://intoli.com/blog/making-chrome-headless-undetectable/
- https://news.ycombinator.com/item?id=14936025
- https://antoinevastel.com/bot%20detection/2018/01/17/detect-chrome-headless-v2.html
Great author: https://antoinevastel.com/tags.html#Browser-Fingerprinting-ref
Idea: when you detect headless Chrome, return dummy vouchers that all point to THEIR site :)
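A rough sketch of that decoy idea, seen from the server side. It assumes the simplest possible detection, the `HeadlessChrome` token that headless Chrome puts in its default User-Agent; `isHeadlessChrome`, `vouchersFor` and the voucher shape `{ code, url }` are hypothetical names for illustration only.

```javascript
// Headless Chrome's default User-Agent contains the token "HeadlessChrome".
function isHeadlessChrome(userAgent) {
  return /HeadlessChrome/i.test(userAgent || '');
}

// Regular visitors get the real vouchers; detected scrapers get copies
// whose links all point back to the scraped site itself.
function vouchersFor(userAgent, realVouchers, siteUrl) {
  if (!isHeadlessChrome(userAgent)) return realVouchers;
  return realVouchers.map((voucher) => ({ ...voucher, url: siteUrl }));
}
```

This check is exactly what the `setUserAgent` trick in these notes defeats, so it only catches scrapers that leave the default header in place.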
A serious proxy company: https://intoli.com/signup/
WOW, ALL SOLVED: puppeteer-extra/packages/puppeteer-extra-plugin-stealth at master · berstend/puppeteer-extra
HOLY FUCK, reCAPTCHA too: puppeteer-extra/packages/puppeteer-extra-plugin-recaptcha at master · berstend/puppeteer-extra
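For reference, a minimal sketch of how the stealth plugin is usually wired in, assuming `puppeteer`, `puppeteer-extra` and `puppeteer-extra-plugin-stealth` are installed from npm. The function names `launchStealthBrowser` and `fetchTitle` are invented here; the packages are loaded lazily inside the function only so that the file can be loaded without them.

```javascript
// Needs: npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
async function launchStealthBrowser(launchOptions = {}) {
  const puppeteer = require('puppeteer-extra');
  const StealthPlugin = require('puppeteer-extra-plugin-stealth');
  puppeteer.use(StealthPlugin()); // registers the plugin's evasions
  return puppeteer.launch({ headless: true, ...launchOptions });
}

// Example: open a page through the stealth browser and return its title.
async function fetchTitle(url) {
  const browser = await launchStealthBrowser();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    return await page.title();
  } finally {
    await browser.close();
  }
}
```

Whether this is enough for a given site is something to test case by case; the plugin covers known headless tells, not proxy reputation.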
And this is a great idea too :)
There is also: Here is an example of launching puppeteer with random user agent using the modern-random-ua NPM package https://github.com/skratchdot/random-useragent
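A small sketch of the random user agent idea. Instead of the NPM package (its API is not shown in these notes), it picks from a hard-coded pool; the pool contents beyond the first entry and the names `USER_AGENTS` and `pickUserAgent` are assumptions for illustration.

```javascript
// Illustrative pool; the first entry is the user agent used earlier in these notes.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3419.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
];

// `random` is injectable so the choice can be made deterministic when needed.
function pickUserAgent(pool = USER_AGENTS, random = Math.random) {
  return pool[Math.floor(random() * pool.length)];
}

// With Puppeteer: await page.setUserAgent(pickUserAgent());
```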
Also announced: https://datadome.co/bot-detection/will-playwright-replace-puppeteer-for-bad-bot-play-acting/ https://github.com/microsoft/playwright DAMN, ALL 3 BROWSERS! The same team that wrote Puppeteer moved to Microsoft and built this.
https://help.apify.com/en/collections/1669748-overcoming-anti-scraping-protection
https://github.com/digitalhurricane-io/puppeteer-detection-100-percent
And proxy detection?
It probably detects based on headers: X_forwarded_for. This one detected it well: https://ip-check.net/detect-proxy.php and this one not so well: https://www.infobyip.com/detectproxy.php
HTTP_X_FORWARDED_FOR
https://stackoverflow.com/questions/32459301/how-to-detect-or-prevent-proxy-browsing
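A sketch of what header-based proxy detection might look like, assuming the request headers arrive as a plain object (as in Node.js). `HTTP_X_FORWARDED_FOR` is how PHP exposes the `X-Forwarded-For` header; the rest of the header list and the function names are assumptions, not what any of the linked checkers actually do.

```javascript
// Headers that forwarding proxies commonly add (lower-cased for comparison).
const PROXY_HEADERS = ['x-forwarded-for', 'via', 'forwarded', 'x-real-ip', 'proxy-connection'];

// Returns the proxy-related headers that are present in the request.
function proxyHeadersIn(headers) {
  const present = new Set(Object.keys(headers).map((name) => name.toLowerCase()));
  return PROXY_HEADERS.filter((name) => present.has(name));
}

function looksProxied(headers) {
  return proxyHeadersIn(headers).length > 0;
}
```

A proxy that does not add any of these headers passes this check, so it is only a first filter.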
Companies that provide protection
HERE IS THE SCRIPT: https://www.blocked.com/index.php https://datadome.co/ https://www.ipqualityscore.com/
Proxy services
Excellent but too expensive:
https://oxylabs.io/pricing/datacenter-proxies https://luminati.io/pricing/
https://medium.com/@colopmike8/top-5-residential-proxy-providers-2a644fddbe09 https://proxyrate.com/ https://www.scraperapi.com/blog/the-10-best-rotating-proxy-services-for-web-scraping/ https://medium.com/@makcorps.activation.api/the-10-best-residential-proxy-providers-2020-9d2a42450b59
But the cheapest are:
1. https://anonymous-proxies.net/pricing (Bucharest): unlimited, 1 proxy is $5, 5 proxies are $25. By far the cheapest (monthly payment), minimum order is $10, and the site looks great.
2. https://stormproxies.com/clients/signup/kqUM3vnwq?product_id_page-0[]=97-97 because it has unlimited bandwidth! $19.00 per month; $50 gets 5 proxies.
3. Check this one, it is good: https://rsocks.net/mobile-proxy
The best I found that does not cost an arm and a leg: https://proxyland.io/ looks “normal”. The pay-as-you-go price is $50 for 20GB, which is about 13000 pages at their weight of about 1.5MB (although I will be able to save a lot there). Since they have 431 shops, that works out to 13000/431 ≈ 30 full scrapes of the site. That is great, and more than enough! So it costs me $50 per month to scrape it every day (just to check what is new). https://stackoverflow.com/questions/52777757/how-to-use-proxy-in-puppeteer-and-headless-chrom
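The same back-of-the-envelope calculation as a small function, so the numbers can be replayed for other plans. It assumes 1 GB = 1000 MB and one page per shop, as the note above does; `scrapeBudget` and its parameter names are invented for this sketch.

```javascript
// How many pages a traffic plan buys, and how many full passes over all shops.
function scrapeBudget({ planGb, pageMb, shops }) {
  const pages = Math.floor((planGb * 1000) / pageMb); // 1 GB taken as 1000 MB
  const fullPasses = Math.floor(pages / shops);       // one page per shop assumed
  return { pages, fullPasses };
}

const budget = scrapeBudget({ planGb: 20, pageMb: 1.5, shops: 431 });
console.log(budget); // { pages: 13333, fullPasses: 30 }
```

Blocking images lowers `pageMb`, which is where the extra savings mentioned above would show up.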
More expensive:
- https://www.yourprivateproxy.com/buy-residential-proxies
- https://astroproxy.com/#rates-block (20GB also came out to $50)
- https://zenscrape.com/residential-proxies/
- https://homeip.io/pricing/
- https://www.ipburger.com/pricing/residential/
- https://www.geosurf.com/products/residential-ips/
- https://shifter.io/
- https://www.proxyrack.com/unmetered-residential/
- https://blazingseollc.com/proxy/faq/ (these do not have residential)
- https://www.scrapingdog.com/pricing
- https://www.proxy-cheap.com/pricing/residential-proxies/
- https://proxyaqua.com/residential-rotating-proxies/
So, I have a problem with residential proxies
In theory, I could do the following:
- install one at everyone I know:
  - linux server in buha
  - mom (has no computer)
  - dača (has no computer)
- Android phones: https://play.google.com/store/apps/details?id=com.gorillasoftware.everyproxy
  - https://www.everyproxy.co.uk/
  - https://www.shustechs.com/how-to-use-every-proxy/
  - I think this can work, maybe even with several devices without problems? The problem is that it runs from the local computer? Oh well.
Good explanation:
https://www.ipqualityscore.com/articles/view/13/how-residential-proxies-enable-fraud
I COULD SELL BANDWIDTH!
https://packetstream.io/support/faq
Avoid scrape:
Proxy for Windows:
FreeProxy Internet Suite (http://www.handcraftedsoftware.org/) https://dannyda.com/2020/01/03/list-of-open-source-free-proxy-forward-proxy-reverse-proxy-cache-server-software/
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3419.0 Safari/537.36');
await page.setUserAgent('Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)');
await page.setUserAgent('Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/W.X.Y.Z Safari/537.36');
https://www.producthunt.com/posts/scrapeowl
https://scrapeowl.com/pricing/ for $29/month I get 250000 API calls (credits), and when I use the Premium Residential Proxy with JavaScript, one request costs 25 credits. So that is 10000 requests.
Almost completely identical: https://www.scrapingbee.com/#pricing and also https://www.scraperapi.com/pricing
Datashake Web Scraper API zenscrape ProxyCrawl
I can build my own proxy network:
How to Set up Your Own Web Proxy on Ubuntu 16.04 VPS PHP Web Proxy Script - A simple and free alternative to Glype
Better?
Woow! Scrapoxy Understand Scrapoxy — Scrapoxy 3.1.1 documentation
tinyproxy(8): light-weight HTTP proxy daemon - Linux man page GitHub - tinyproxy/tinyproxy: tinyproxy - a light-weight HTTP/HTTPS proxy daemon for POSIX operating systems How to setup a simple proxy server with tinyproxy (Debian 10 Buster) - NXNJZ
GitHub - dzt/easy-proxy: Make mass proxies easily. (DigitalOcean)
But there is no need to bother: the cheapest is InstantProxies (How to use private proxies with WordPress Rankie plugin - ValvePress), and it is also mentioned in the comments on Envato regarding Rankie
New alternatives:
Shadowsocks - A secure socks5 proxy
a little older: Dante - A free SOCKS server
or even older: tinyproxy is NOT a SOCKS5 proxy, but rofl0r/microsocks is
Client shadowsocks/shadowsocks-windows: A C# port of shadowsocks
WOOW:
Bible: Self-hosted socks5 or shadowsocks server in a single command Hetzner Online Community
Shadowsocks is NOT a SOCKS5 proxy but its own protocol, better in a way because it is harder to block or detect: How to Set up Shadowsocks-libev Proxy Server on Ubuntu 16.04/17.10 Shadowsocks - Quick Guide
Easily Boost Ubuntu Network Performance by Enabling TCP BBR
HTTP proxy can only proxy HTTP (TCP) traffic whereas a SOCKS5 proxy can handle any type of traffic using either TCP or UDP. SOCKS5 proxy is more universal and can be used with more applications.
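To make the difference concrete: a SOCKS5 client does not send an HTTP request to the proxy, it sends a few binary bytes naming the target host and port, and after that the connection carries whatever protocol the client speaks. Below is a sketch of the first two client messages as laid out in RFC 1928 (greeting, then a CONNECT request for a domain name). The function names are invented for this note, and the sketch builds the bytes only; it does not open a socket.

```javascript
// Authentication method codes from RFC 1928.
const METHODS = { none: 0x00, usernamePassword: 0x02 };

// Greeting: VER=5, NMETHODS, then one byte per offered method.
function socks5Greeting(methods = ['none']) {
  return Buffer.from([0x05, methods.length, ...methods.map((m) => METHODS[m])]);
}

// CONNECT request: VER=5, CMD=1 (connect), RSV=0, ATYP=3 (domain name),
// one length byte, the domain, then the port as 2 bytes in network order.
function socks5ConnectRequest(host, port) {
  const name = Buffer.from(host, 'ascii');
  const head = Buffer.from([0x05, 0x01, 0x00, 0x03, name.length]);
  const tail = Buffer.alloc(2);
  tail.writeUInt16BE(port);
  return Buffer.concat([head, name, tail]);
}
```

Offering only `none` in the greeting is the “no authentication” case that comes up later with Chrome.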
Self-hosted socks5 or shadowsocks server in a single command This is just a basic proxy, very simple to install and use. If you are interested in a more functional and complex solution, you may check out StreisandEffect/streisand
Install Dante (SOCKS5):
export PORT=22333; export PASSWORD=somepass; export USER=user
curl https://selivan.github.io/socks.txt | sudo --preserve-env bash
or even better:
export PORT=22333; export PASSWORD=somepass; export USER=user
curl https://gist.githubusercontent.com/cvladan/8f0de5e61d773664d634cd70f3e07fe6/raw/danted_setup.sh | sudo --preserve-env bash
Test if it’s working:
# curl -x socks5://<your_username>:<your_password>@<your_ip_server>:<your_danted_port> ifconfig.co
curl -x socks5://user:pass@46.4.33.38:22333 ifconfig.co
Q: Error after every reboot: interface "eno1" has no usable IP-addresses configured
A: https://serverfault.com/a/976031/152357
How To Use Systemctl to Manage Systemd Services and Units
Check the config with: systemctl cat danted
On any problem, check service log: journalctl -u danted
Enable specific rules
There are two sets of rules and they work at different levels. Rules prefixed with client (client pass/block) are checked first and are used to see if the client is allowed to connect to the Dante server. We will call them “client-rules”. The other rules, which we call “socks-rules”, are prefixed with socks pass/block and these rules are only checked if the client connection has been allowed by the client-rules.
Both sets of rules are evaluated on a “first match is best match” basis. That means, the first rule matched for a particular client or socks request is the rule that will be used.
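A toy model of that two-stage, first-match evaluation, to keep the order straight. It is heavily simplified: rules here match only on the client's IPv4 address (`from`), while real Dante rules also match on destination, method and more. Blocking when nothing matches is assumed to be the default. All names are invented for this sketch.

```javascript
// Dotted IPv4 string -> unsigned 32-bit integer.
function ipToInt(ip) {
  return ip.split('.').reduce((acc, octet) => (acc * 256 + Number(octet)) >>> 0, 0);
}

// True when `ip` falls inside the CIDR block, e.g. "46.4.33.38/32".
function inCidr(ip, cidr) {
  const [base, bitsText] = cidr.split('/');
  const bits = Number(bitsText);
  if (bits === 0) return true; // 0.0.0.0/0 matches everything
  const mask = (0xffffffff << (32 - bits)) >>> 0;
  return ((ipToInt(ip) & mask) >>> 0) === ((ipToInt(base) & mask) >>> 0);
}

// The first matching rule wins; no match means block (assumed default).
function firstMatch(rules, clientIp) {
  const rule = rules.find((r) => inCidr(clientIp, r.from));
  return rule ? rule.action : 'block';
}

// Client-rules are consulted first; socks-rules only if the client got in.
function decide(clientRules, socksRules, clientIp) {
  if (firstMatch(clientRules, clientIp) !== 'pass') return 'block';
  return firstMatch(socksRules, clientIp);
}
```

Because the first match wins, a narrow rule for a single IP has to be listed before the catch-all `0.0.0.0/0` rule, or it is never reached.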
Limit by IP
The SOCKS5 & Puppeteer “no-authentication” problem can be solved with Dante’s limit by IP address.
Place this inside /etc/danted.conf before any other socks rule:
# allow access from IP without authentication
socks pass {
from: 46.4.33.38/32 to: 0.0.0.0/0
log: error
socksmethod: none
}
We must change the global default to:
socksmethod: pam.any none
Because rules default to the global socksmethod field, we must also change all the remaining socks pass rules. There is usually only one pass rule there. Put socksmethod: pam.any there so authentication is still required for any other IP.
Test:
curl -x socks5://user:pass@159.100.248.236:22333 ifconfig.co
If you want to give everyone access without authentication: just socksmethod: none and nothing else is needed
Another script for automated Dante SOCKS proxy server installation: akmaslov-dev/dante-proxy-server: Dante socks proxy server
Web Scraping: A Brief Overview of Scrapy and Selenium, Part I | by Anastasia Reusova | Towards Data Science Web Scraping: A Less Brief Overview of Scrapy and Selenium, Part II | by Anastasia Reusova | Towards Data Science
interesting: Optimize Shadowsocks
Danted and Puppeteer
Using http/s and socks4/5 proxies with puppeteer and chrome with squid and danted - CoLaBug.com
Unfortunately, it is not possible to use puppeteer/chromium with a SOCKS5 proxy that requires authentication, because the Chrome browser does not support SOCKS with authentication. At first this seemed like a big drawback to me, but I decided to use the proxy without authentication anyway, so it is possible after all.
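A sketch of pointing Puppeteer at an unauthenticated SOCKS5 proxy through Chromium's `--proxy-server` flag. The host `proxy.example.com` is a placeholder and the function names are invented for this note; `puppeteer` is loaded lazily and has to be installed from npm for `launchViaProxy` to run.

```javascript
// Build the Chromium flag for a proxy; credentials are deliberately not part of it.
function proxyArg({ scheme = 'socks5', host, port }) {
  return `--proxy-server=${scheme}://${host}:${port}`;
}

// Launch Chromium with all traffic routed through the given proxy.
async function launchViaProxy(proxy) {
  const puppeteer = require('puppeteer'); // needs: npm install puppeteer
  return puppeteer.launch({ args: [proxyArg(proxy)] });
}

// Example (placeholder host, port from these notes):
// const browser = await launchViaProxy({ host: 'proxy.example.com', port: 22333 });
```

For HTTP proxies that ask for credentials, Puppeteer's `page.authenticate({ username, password })` can answer the challenge; that route does not help with SOCKS5, which is why the IP-based rule in Dante is used here.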
Alive Proxy Servers for Personal Use
date: 2011-02-09
I need a source for reliable, currently alive proxy servers. I do use them for playing music from Last.fm, or watching TV on clicker.tv.
So, I found a quite reliable source of proxy servers: http://blog.qualityproxylist.com/
AppSumo - tried it: https://appsumo.com/products/quickscraper/ Home - Quick Scraper - Quick Scraper - Proxy API for Web Scraping
Top 7 Web Scraping Tools in 2023 Bardeen | Automate your repetitive tasks with one click
Scraping Robot does JavaScript rendering and they handle proxies. It is an actual proxy company, Rayobyte, formerly Blazing SEO. I started all of this because Blazing SEO used to offer captcha solving and OCR long ago, complete with PHP examples: Blazing OCR API. By the way, the free plan has all features included, and the quota is 5000 scrapes/month.
Proxifier - The Most Advanced Proxy Client YtFlow/Maple: A lightweight Universal Windows proxy app based on https://github.com/eycorsican/leaf
ScrapingAnt - Web Scraping API | Proxy API offers 10000 API credits/month for free, which includes JavaScript rendering, as well as 100K API credits/month for $20.
Proxy list of completely free proxies that is updated every few minutes: Free Proxies for Web Scraping | ScrapingAnt
Best Free Proxy Scraping Tools | ScrapingAnt are tools that build lists of proxy servers that are currently active.
constverum/ProxyBroker: Proxy [Finder | Checker | Server]. HTTP(S) & SOCKS 🎭 and it has a perfect option to run a local proxy server that distributes incoming requests to a pool of found proxies.
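The “distribute requests over a pool” idea in its simplest form, as a sketch. This is plain round-robin with removal of dead proxies, not ProxyBroker's actual strategy (not checked here); `createProxyPool` and its methods are invented names.

```javascript
// Round-robin over a list of proxy URLs, with a way to drop ones that fail.
function createProxyPool(proxies) {
  if (proxies.length === 0) throw new Error('proxy pool is empty');
  let live = [...proxies];
  let cursor = 0;
  return {
    // Returns the next proxy in rotation.
    next() {
      const proxy = live[cursor % live.length];
      cursor += 1;
      return proxy;
    },
    // Drops a failing proxy, but never removes the last one left.
    remove(proxy) {
      if (live.length > 1) live = live.filter((p) => p !== proxy);
    },
    size() {
      return live.length;
    },
  };
}
```

Each outgoing request asks the pool for `next()`, and a proxy that times out or gets blocked is passed to `remove()`.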