Scraping SOCKS5 proxies with PHP and checking (concurrently) whether they work

My friend 皮师傅 has recently been tinkering with a load-testing tool and needed to scrape SOCKS5 proxies, so I wrote this little thing.

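It went through a few revisions. The first version was single-threaded and too slow, so I switched to a "multi-threaded" one (strictly speaking, concurrent requests through curl_multi rather than real threads), but in practice it still has bugs... I'll leave it as it is and post it mostly as a record. If you need it, feel free to keep digging.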

Here it is:

<?php
$start = microtime(true);

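// Fetch the SOCKS5 proxy list (one ip:port per line)
$proxys = curl('https://api.proxyscrape.com/?request=displayproxies&proxytype=socks5&country=all&timeout=1600');
// Split on any line ending (the API may return \r\n) and drop empty lines
$proxys = preg_split('/\R/', trim((string)$proxys), -1, PREG_SPLIT_NO_EMPTY);
// file_put_contents() joins an array without a separator, so add the newlines explicitly
file_put_contents(__DIR__ . DIRECTORY_SEPARATOR . 'Proxys.txt', implode(PHP_EOL, $proxys) . PHP_EOL);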

$proxys_count = count($proxys);

$mh = curl_multi_init();
foreach($proxys as $key => $v){
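    $v = trim($v); // strip whitespace and any stray \r from the line
    $p = explode(':', $v); // split "ip:port"
    // POST fields sent to the checker
    $data['ip_addr'] = $p[0];
    $data['port'] = $p[1] ?? ''; // avoid an undefined-index notice on malformed lines

    // checker endpoint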
    $url = 'https://onlinechecker.proxyscrape.com/index.php';
    $ch[$key] = curl_init();
    curl_setopt($ch[$key], CURLOPT_URL , $url);
    curl_setopt($ch[$key], CURLOPT_RETURNTRANSFER , true);
    curl_setopt($ch[$key], CURLOPT_SSL_VERIFYHOST , 0);
    curl_setopt($ch[$key], CURLOPT_SSL_VERIFYPEER , 0);
    curl_setopt($ch[$key], CURLOPT_TIMEOUT , 30); // per-request cap (30 s is an arbitrary choice) so one hung check cannot stall the whole run
    curl_setopt($ch[$key], CURLOPT_FOLLOWLOCATION , true);
    curl_setopt($ch[$key], CURLOPT_HTTP_VERSION , CURL_HTTP_VERSION_1_1);
    curl_setopt($ch[$key], CURLOPT_POST, 1);
    curl_setopt($ch[$key], CURLOPT_POSTFIELDS, http_build_query($data));
    curl_multi_add_handle($mh, $ch[$key]);
}
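$running = null;
do {
    curl_multi_exec($mh, $running);
    // Wait for socket activity instead of spinning the CPU in a tight loop
    if ($running > 0) {
        curl_multi_select($mh, 1.0);
    }
} while($running > 0);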



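$i = 0; $ips = ''; $ii = 0;

// collect the results
foreach($proxys as $key => $v){
    $ii++;
    // response body of this handle
    $res[$key] = curl_multi_getcontent($ch[$key]);

    // a failed request or a non-JSON body decodes to null, so guard the lookup
    $result = json_decode((string)$res[$key], true);
    if(is_array($result) && !empty($result['working'])){
        $i++;
        $ips .= trim($v) . PHP_EOL;
    }

    // detach the finished handle from the multi handle, then close it
    curl_multi_remove_handle($mh, $ch[$key]);
    curl_close($ch[$key]);
}
curl_multi_close($mh);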
$end = microtime(true) - $start;

file_put_contents(__DIR__ . DIRECTORY_SEPARATOR . 'ips.txt',$ips);
$log = 'Total proxies: ' . $proxys_count . ' | working: ' . $i . ' | failed: ' . ($proxys_count - $i) . ' | elapsed: about ' . round($end) . ' s';

file_put_contents(__DIR__ . '/Log.txt', $log . PHP_EOL, FILE_APPEND);

echo $log;




function curl($url,$data=''){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL , $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER , true);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST , 0);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER , 0);
    curl_setopt($ch, CURLOPT_TIMEOUT , 0);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION , true);
    curl_setopt($ch, CURLOPT_HTTP_VERSION , CURL_HTTP_VERSION_1_1);
    if($data){
        curl_setopt($ch, CURLOPT_POST, 1);
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($data));
    }
    $response = curl_exec($ch);
    if (curl_errno($ch)) {
        //echo 'Curl error: ' . curl_error($ch);
        // $data defaults to '' here, so log the URL instead of indexing $data['port'] / $data['ip_addr']
        file_put_contents(__DIR__ . DIRECTORY_SEPARATOR . 'err.txt', $url . ' error: ' . curl_error($ch) . PHP_EOL, FILE_APPEND);
    }
    curl_close($ch);
    return $response;
}
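
The script above adds every proxy to a single curl_multi handle in one go. One thing that might make it behave better is to check the list in fixed-size batches. This is only a rough sketch, not a tested fix: the function name check_in_batches, the batch size of 50 and the 30-second timeout are made up for illustration, and it assumes the checker still answers with JSON containing a working field, as the script above expects.

```php
<?php
// Sketch: check proxies in batches instead of all at once.
// check_in_batches(), $batchSize = 50 and the 30 s timeout are assumptions.
function check_in_batches(array $proxies, string $url, int $batchSize = 50): array
{
    $working = [];
    foreach (array_chunk($proxies, $batchSize) as $batch) {
        $mh = curl_multi_init();
        $handles = [];
        foreach ($batch as $proxy) {
            $proxy = trim($proxy);
            $parts = explode(':', $proxy, 2);
            if (count($parts) !== 2) {
                continue; // skip lines that are not "ip:port"
            }
            $ch = curl_init($url);
            curl_setopt_array($ch, [
                CURLOPT_RETURNTRANSFER => true,
                CURLOPT_TIMEOUT        => 30,
                CURLOPT_POST           => true,
                CURLOPT_POSTFIELDS     => http_build_query([
                    'ip_addr' => $parts[0],
                    'port'    => $parts[1],
                ]),
            ]);
            curl_multi_add_handle($mh, $ch);
            $handles[$proxy] = $ch;
        }

        // Drive this batch to completion, sleeping on socket activity
        $running = 0;
        do {
            $status = curl_multi_exec($mh, $running);
            if ($running > 0) {
                curl_multi_select($mh, 1.0);
            }
        } while ($running > 0 && $status === CURLM_OK);

        foreach ($handles as $proxy => $ch) {
            $result = json_decode((string) curl_multi_getcontent($ch), true);
            if (is_array($result) && !empty($result['working'])) {
                $working[] = $proxy;
            }
            // Detach the handle; it is freed once nothing references it
            curl_multi_remove_handle($mh, $ch);
        }
        curl_multi_close($mh);
    }
    return $working;
}
```

With this, the main part of the script could call check_in_batches($proxys, 'https://onlinechecker.proxyscrape.com/index.php') and write the returned list to ips.txt. A smaller batch is gentler on the checker but takes longer overall.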

This thing has to be run from the command line; you can also add it to crontab and let it run on a schedule.
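
For what it's worth, a small guard at the top of the script can make the command-line requirement explicit. This is just a sketch: the helper name running_from_cli and the paths in the crontab comment are hypothetical, so adjust them to your own setup.

```php
<?php
// Returns true when the script runs under the command-line SAPI.
// running_from_cli() is a hypothetical helper, not part of the original script.
function running_from_cli(string $sapi = PHP_SAPI): bool
{
    return $sapi === 'cli';
}

if (!running_from_cli()) {
    exit('Run this script from the command line.' . PHP_EOL);
}

// The CLI has no execution time limit by default; setting it anyway is harmless
set_time_limit(0);

// Example crontab entry (hypothetical paths), running every 30 minutes:
// */30 * * * * /usr/bin/php /path/to/proxy_check.php >> /path/to/cron.log 2>&1
```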

Permalink:

http://logs.ee/coding/20191231/12.html
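3 comments
    皮师傅 (January 1)

    For the load test I kept hammering my own blog back and forth, and now Baidu Webmaster Platform is alerting about a high rate of failed crawls!!!

    皮师傅 (January 2)

    The shortlink feature needs more work: a comment that contains an http prefix gets wrapped in an a tag directly, and the link is not converted.

      皮师傅 (January 2)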