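Scraping SOCKS5 proxies with PHP and checking them concurrently ("multi-threaded")
My friend 皮师傅 has recently been tinkering with a load-testing tool and needed a way to scrape SOCKS5 proxies, so I wrote this.
It went through a few revisions. The first version was single-threaded and too slow, so I switched to a "multi-threaded" one (really concurrent requests via curl_multi), but in real-world testing it still has bugs... I'll leave it at that. Treat this as a record and a bit of filler; if you need it, keep digging from here.
Here it is: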
<?php
$start = microtime(true);
$proxys = curl('https://api.proxyscrape.com/?request=displayproxies&proxytype=socks5&country=all&timeout=1600');
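// Split on any line ending (the API may return \r\n) and drop empty lines
$proxys = array_values(array_filter(array_map('trim', preg_split('/\R/', (string) $proxys))));
// Save the raw list one proxy per line (passing the array directly would glue the entries together)
file_put_contents(__DIR__ . DIRECTORY_SEPARATOR . 'Proxys.txt', implode(PHP_EOL, $proxys) . PHP_EOL);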
$proxys_count = count($proxys);
$mh = curl_multi_init();
foreach ($proxys as $key => $v) {
    $v = trim($v); // Clean up the "ip:port" entry
    $p = explode(':', $v, 2); // Split into IP and port
    // POST fields sent to the checker
    $data['ip_addr'] = $p[0];
    $data['port'] = $p[1] ?? ''; // A malformed line may have no port
    // Checker endpoint
    $url = 'https://onlinechecker.proxyscrape.com/index.php';
    $ch[$key] = curl_init();
    curl_setopt($ch[$key], CURLOPT_URL, $url);
    curl_setopt($ch[$key], CURLOPT_RETURNTRANSFER, true);
    // Note: certificate verification is switched off here
    curl_setopt($ch[$key], CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($ch[$key], CURLOPT_SSL_VERIFYPEER, 0);
    //curl_setopt($ch[$key], CURLOPT_TIMEOUT, 2);
    curl_setopt($ch[$key], CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch[$key], CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
    curl_setopt($ch[$key], CURLOPT_POST, 1);
    curl_setopt($ch[$key], CURLOPT_POSTFIELDS, http_build_query($data));
    curl_multi_add_handle($mh, $ch[$key]);
}
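$running = null;
do {
    curl_multi_exec($mh, $running);
    // Wait for socket activity instead of spinning at 100% CPU
    if ($running > 0) {
        curl_multi_select($mh, 1.0);
    }
} while ($running > 0);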
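$i = 0; $ips = '';
// Collect the results
foreach ($proxys as $key => $v) {
    // Response body returned for this handle
    $res[$key] = curl_multi_getcontent($ch[$key]);
    $result = json_decode((string) $res[$key], true);
    // A failed request or a non-JSON reply decodes to null, so test defensively
    if (is_array($result) && !empty($result['working'])) {
        $i++;
        $ips .= trim($v) . PHP_EOL;
    }
    // Detach the finished handle from the multi handle, then close it
    curl_multi_remove_handle($mh, $ch[$key]);
    curl_close($ch[$key]);
}
curl_multi_close($mh);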
$end = microtime(true) - $start;
file_put_contents(__DIR__ . DIRECTORY_SEPARATOR . 'ips.txt',$ips);
$log = 'Total proxies: ' . $proxys_count . ' | Working: ' . $i . ' | Failed: ' . ($proxys_count - $i) . ' | Elapsed: about ' . round($end) . ' s';
file_put_contents(__DIR__ . '/Log.txt', $log . PHP_EOL, FILE_APPEND);
echo $log;
// Simple GET/POST helper; a non-empty $data array switches the request to POST
function curl($url, $data = []) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Note: certificate verification is switched off here
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($ch, CURLOPT_TIMEOUT, 0); // 0 means the request never times out
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
    if ($data) {
        curl_setopt($ch, CURLOPT_POST, 1);
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($data));
    }
    $response = curl_exec($ch);
    if (curl_errno($ch)) {
        //echo 'Curl error: ' . curl_error($ch);
        // Log the failing URL; $data is empty for GET requests, so it cannot be indexed here
        file_put_contents(__DIR__ . DIRECTORY_SEPARATOR . 'err.txt', $url . ' error: ' . curl_error($ch) . PHP_EOL, FILE_APPEND);
    }
    curl_close($ch);
    return $response;
}
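One possible reason the concurrent version misbehaves is that every proxy is added to a single curl_multi handle at once and no request has a timeout. The sketch below is not the script above but a hedged variation on it: it validates each "ip:port" line first, then checks the proxies in fixed-size batches. The helper names `parse_proxy_line` and `check_in_batches`, the batch size of 50 and the 15-second timeout are all assumptions for illustration, and unlike the original it leaves certificate verification at curl's default.

```php
<?php
// Sketch only: helper names, batch size and timeouts are assumptions, not part of the original script.

/**
 * Parse one "ip:port" line.
 * Returns ['ip_addr' => string, 'port' => int], or null if the line is malformed.
 */
function parse_proxy_line(string $line): ?array
{
    $parts = explode(':', trim($line));
    if (count($parts) !== 2) {
        return null;
    }
    [$ip, $port] = $parts;
    if (filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) === false) {
        return null;
    }
    if (!ctype_digit($port) || (int) $port < 1 || (int) $port > 65535) {
        return null;
    }
    return ['ip_addr' => $ip, 'port' => (int) $port];
}

/**
 * POST each valid proxy to $checkerUrl, at most $batchSize requests at a time.
 * Returns the "ip:port" strings whose JSON reply has a truthy "working" field.
 */
function check_in_batches(array $lines, string $checkerUrl, int $batchSize = 50, int $timeout = 15): array
{
    $valid = [];
    foreach ($lines as $line) {
        $proxy = parse_proxy_line((string) $line);
        if ($proxy !== null) {
            $valid[] = $proxy;
        }
    }

    $working = [];
    foreach (array_chunk($valid, max(1, $batchSize)) as $batch) {
        $mh = curl_multi_init();
        $handles = [];
        foreach ($batch as $i => $proxy) {
            $ch = curl_init($checkerUrl);
            curl_setopt_array($ch, [
                CURLOPT_RETURNTRANSFER => true,
                CURLOPT_CONNECTTIMEOUT => $timeout,
                CURLOPT_TIMEOUT        => $timeout,
                CURLOPT_POST           => true,
                CURLOPT_POSTFIELDS     => http_build_query($proxy),
            ]);
            curl_multi_add_handle($mh, $ch);
            $handles[$i] = $ch;
        }

        // Drive the whole batch to completion, sleeping while the sockets are idle
        $running = 0;
        do {
            $status = curl_multi_exec($mh, $running);
            if ($running > 0) {
                curl_multi_select($mh, 1.0);
            }
        } while ($running > 0 && $status === CURLM_OK);

        foreach ($handles as $i => $ch) {
            $result = json_decode((string) curl_multi_getcontent($ch), true);
            if (is_array($result) && !empty($result['working'])) {
                $working[] = $batch[$i]['ip_addr'] . ':' . $batch[$i]['port'];
            }
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
        }
        curl_multi_close($mh);
    }
    return $working;
}
```

With these helpers, the main script could call something like `check_in_batches($proxys, 'https://onlinechecker.proxyscrape.com/index.php')` and write the returned list to ips.txt. Whether batching actually cures the bugs seen in testing is untested here.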
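The script needs to be run from the command line; you can also schedule it with crontab.
For the load test I hammered my own blog back and forth, and the Baidu Webmaster Platform kept raising alerts about high-frequency crawl failures!!!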
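When the script runs from crontab, a slow run can still be going when the next one starts. Here is a minimal sketch of one way to guard against that with a lock file; the helper name `acquire_run_lock`, the lock-file name and the paths in the crontab comment are all hypothetical.

```php
<?php
// Sketch only: the helper name, lock-file name and paths are hypothetical.

/**
 * Try to take an exclusive, non-blocking lock on $lockFile.
 * Returns the open file handle (keep it for the whole run) or null if the
 * file cannot be opened or another process already holds the lock.
 */
function acquire_run_lock(string $lockFile)
{
    $fp = fopen($lockFile, 'c'); // create if missing, never truncate
    if ($fp === false) {
        return null;
    }
    if (!flock($fp, LOCK_EX | LOCK_NB)) {
        fclose($fp);
        return null;
    }
    return $fp;
}

// Possible use at the top of the script:
//
//   if (PHP_SAPI !== 'cli') {
//       exit("Run this script from the command line.\n");
//   }
//   $lock = acquire_run_lock(__DIR__ . '/check.lock');
//   if ($lock === null) {
//       exit("Another run is still in progress.\n");
//   }
//
// Example crontab entry (assumed paths), every 30 minutes:
//
//   */30 * * * * php /path/to/check_proxies.php >> /path/to/cron.log 2>&1
```

The lock is released automatically when the process exits, so a crashed run does not block the next one.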
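The shortlink feature needs more work: when a comment contains a URL starting with http, it gets wrapped in an a tag directly and is not converted to a short link.
@皮师傅 Like this: https://www.pishifu.org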