HAipproxy

Chinese Documentation | README

This project crawls proxy IP resources from the Internet. Our goal is to provide an anonymous IP proxy pool with high availability and low latency for distributed spiders.

Features

Quick start

Please go to the release page to download the source code; the master branch is unstable.

Standalone

Server

Client

haipproxy provides both a Python client and a Squid proxy server for your spiders. Clients in any other language are welcome!

Python Client

from client.py_cli import ProxyFetcher
# args are used to connect to redis; if args is None, the redis settings in settings.py will be used
args = dict(host='127.0.0.1', port=6379, password='123456', db=0)
# 'https' is used for common (general-purpose) proxies. If you want to crawl a particular website,
# you'd better write a customized IP validator, following the zhihu validator as an example
fetcher = ProxyFetcher('https', strategy='greedy', redis_args=args)
# get one proxy ip
print(fetcher.get_proxy())
# get available proxy ip list
print(fetcher.get_proxies()) # or print(fetcher.pool)
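
For example, a fetched proxy can be plugged straight into a requests call. The snippet below is a minimal sketch rather than official usage: it assumes get_proxy() returns a proxy string that requests accepts directly (e.g. 'http://1.2.3.4:8080'); check the client source for the exact format.

import requests
from client.py_cli import ProxyFetcher

args = dict(host='127.0.0.1', port=6379, password='123456', db=0)
fetcher = ProxyFetcher('https', strategy='greedy', redis_args=args)

# assumption: get_proxy() returns a proxy url usable by requests, e.g. 'http://1.2.3.4:8080'
proxy = fetcher.get_proxy()
resp = requests.get('https://httpbin.org/ip', proxies={'https': proxy}, timeout=10)
print(resp.text)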

Using squid as proxy server

Dockerize

or

import requests
proxies = {'https': 'http://127.0.0.1:3128'}
resp = requests.get('https://httpbin.org/ip', proxies=proxies)
print(resp.text)
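
If your spiders are written with Scrapy, the same Squid endpoint can be set per request through Scrapy's standard meta['proxy'] field. This is a minimal sketch assuming Squid is listening on 127.0.0.1:3128; the spider name and target URL are illustrative placeholders.

import scrapy

class IpSpider(scrapy.Spider):
    name = 'ip_spider'

    def start_requests(self):
        # route the request through the local squid instance that haipproxy keeps updated
        yield scrapy.Request('https://httpbin.org/ip',
                             meta={'proxy': 'http://127.0.0.1:3128'},
                             callback=self.parse)

    def parse(self, response):
        self.logger.info(response.text)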

WorkFlow

Other important things

Test Result

Here are the test results for crawling https://zhihu.com with haipproxy. The source code can be seen here.

| requests | time             | cost     | strategy | client |
|----------|------------------|----------|----------|--------|
| 0        | 2018/03/03 22:03 | 0        | greedy   | py_cli |
| 10000    | 2018/03/03 11:03 | 1 hour   | greedy   | py_cli |
| 20000    | 2018/03/04 00:08 | 2 hours  | greedy   | py_cli |
| 30000    | 2018/03/04 01:02 | 3 hours  | greedy   | py_cli |
| 40000    | 2018/03/04 02:15 | 4 hours  | greedy   | py_cli |
| 50000    | 2018/03/04 03:03 | 5 hours  | greedy   | py_cli |
| 60000    | 2018/03/04 05:18 | 7 hours  | greedy   | py_cli |
| 70000    | 2018/03/04 07:11 | 9 hours  | greedy   | py_cli |
| 80000    | 2018/03/04 08:43 | 11 hours | greedy   | py_cli |

Reference

Thanks to all the contributors of the following projects.

dungproxy

proxyspider

ProxyPool

proxy_pool

ProxyPool

IPProxyTool

IPProxyPool

proxy_list

proxy_pool