PowerShell + Selenium Crawler: Proxies (03)

Reader submission 936 2022-09-18

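The previous article covered basic Selenium operations. To actually crawl a target site with Selenium, some additional disguise is needed, for example configuring the browser to reach the target site through a proxy. This lowers the chance that the site identifies the crawler and adds your own IP address to its backend blacklist, which could leave that IP permanently blocked from the site or restricted from specific content.

To that end, we found some websites that publish free proxies. Accessing the target site through the free proxy IPs they list is comparatively safer, because your own IP address is not exposed to the target site.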
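As a minimal sketch of how a proxy is handed to Chrome: the browser accepts a `--proxy-server` command-line switch, and Selenium passes such switches through `ChromeOptions.AddArgument`. The helper name `New-ChromeProxyArgument` and the address `203.0.113.10:8080` are placeholders of mine, not part of the original script.

```powershell
# Hypothetical helper (not in the original script): builds the Chrome
# command-line switch that routes browser traffic through a proxy.
function New-ChromeProxyArgument {
    param(
        [Parameter(Mandatory)][string]$IPAddress,
        [Parameter(Mandatory)][ValidateRange(1, 65535)][int]$Port,
        [ValidateSet('http', 'https', 'socks4', 'socks5')][string]$Scheme = 'http'
    )
    '--proxy-server={0}://{1}:{2}' -f $Scheme, $IPAddress, $Port
}

# 203.0.113.10 is a documentation-only address (RFC 5737), used as a placeholder.
$proxyArgument = New-ChromeProxyArgument -IPAddress '203.0.113.10' -Port 8080
$proxyArgument   # --proxy-server=http://203.0.113.10:8080

# With the Selenium assemblies loaded, the switch would be passed like this:
# $ChromeOption = New-Object OpenQA.Selenium.Chrome.ChromeOptions
# $ChromeOption.AddArgument($proxyArgument)
```

Building the switch in one place makes it easy to swap in a different proxy from the pool before each browser launch.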

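At this point, let me summarize the logic so far:

1. Decide on the target URL to crawl

2. Use a proxy IP to disguise yourself when accessing the target URL

3. Maintain a proxy IP pool, which still needs further validation and updating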
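Step 3 can be sketched roughly as follows: walk the pool and keep only the proxies that can still reach a test URL. The function name `Select-WorkingProxy`, its parameters, and the `IPAddress`/`Port` property names are my own assumptions for illustration. The tester is passed in as a script block so the filtering logic can be tried offline; the default tester issues a real request with `Invoke-WebRequest -Proxy`.

```powershell
# Hypothetical helper: keeps only the proxies for which $Tester returns $true.
function Select-WorkingProxy {
    param(
        [Parameter(Mandatory)][AllowEmptyCollection()][object[]]$Candidate,
        [string]$TestUrl = 'https://baidu.com',
        # Default tester: a real HTTP request through the proxy, 3 second timeout.
        [scriptblock]$Tester = {
            param($ProxyUri, $Url)
            try {
                $response = Invoke-WebRequest -Uri $Url -Proxy $ProxyUri -TimeoutSec 3 -UseBasicParsing -ErrorAction Stop
                $response.StatusCode -eq 200
            } catch {
                $false
            }
        }
    )
    foreach ($proxy in $Candidate) {
        $proxyUri = 'http://{0}:{1}' -f $proxy.IPAddress, $proxy.Port
        if (& $Tester $proxyUri $TestUrl) { $proxy }
    }
}

# Placeholder pool (documentation-only addresses, not real proxies).
$pool = @(
    [pscustomobject]@{ IPAddress = '203.0.113.10'; Port = 8080 }
    [pscustomobject]@{ IPAddress = '203.0.113.11'; Port = 3128 }
)

# Offline stand-in tester so the example runs without network access:
# it simply pretends that only proxies on port 8080 respond.
$stubTester = { param($ProxyUri, $Url) $ProxyUri -like '*:8080' }
$alive = @(Select-WorkingProxy -Candidate $pool -Tester $stubTester)
$alive.Count   # 1
```

Dropping the `-Tester` argument switches to the real network check, so the same function could be rerun periodically to refresh the pool.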

# Load the Selenium assemblies first if they are not already loaded in the session:
#Import-Module D:\tools\Selenium\WebDriver.Support.dll
#Import-Module D:\tools\Selenium\WebDriver.dll

$proxyurl = 'http://66ip-/'
$testurl  = "https://baidu.com"

$ChromeOption = New-Object OpenQA.Selenium.Chrome.ChromeOptions
$ChromeOption.AddExcludedArgument("enable-automation")                       # Hide the automation infobar message
$ChromeOption.AddArguments("--start-maximized")                              # Open Chrome with a maximized window
$ChromeOption.AddArgument('--disable-blink-features=AutomationControlled')   # Make window.navigator.webdriver report false
#$ChromeOption.AddArgument('--proxy-server=                                  # Set the proxy address used to access the target website

$ChromeDriver = New-Object OpenQA.Selenium.Chrome.ChromeDriver($ChromeOption)
$ChromeDriver.Navigate().GoToUrl($proxyurl)
Start-Sleep 5

#region https://89ip-
<#
$i = 0
$proxyIPs = @()
while ($true) {
    $i++
    if ($i -ne 1) {
        # '下一页' is the link text of the site's "next page" button
        $ChromeDriver.FindElementByLinkText('下一页').Click()
        Start-Sleep 3
    }
    $trs = $ChromeDriver.FindElementsByCssSelector('tbody tr')
    if ($trs.Count -gt 0) {
        $j = 0
        foreach ($tr in $trs) {
            $j++
            $w = $j.ToString() + '/' + $trs.Count.ToString()
            $percent = "{0:0.0%}" -f ($j/$trs.Count)
            Write-Progress -Activity "Process test proxy address" -Status "Please wait, page $i, $w, $percent" -PercentComplete ($j/($trs.Count) * 100)
            $trinfo = $tr.Text -split ' '
            $recordtime = $trinfo[4] + " " + $trinfo[5]
            try {
                # Format string reconstructed; the source shows only the -f operands
                $testproxy = "http://{0}:{1}" -f ($trinfo[0]), ($trinfo[1])
                $testresult = Invoke-WebRequest -Uri $testurl -Proxy $testproxy -TimeoutSec 3 -ErrorAction Stop
                if ($testresult.StatusCode -eq 200) {
                    Write-Host $testproxy
                    $obj = [pscustomobject]@{
                        IP         = $trinfo[0]
                        Port       = $trinfo[1]
                        Region     = $trinfo[2]
                        ISP        = $trinfo[3]
                        RecordTime = $recordtime
                    }
                    #$obj | Export-Csv d:\ProxyServerList-20210828.csv -Encoding UTF8 -Append -Force -NoTypeInformation
                    $proxyIPs += $obj
                }
            }
            catch {
                #$errormsg = $_.Exception.Message
                #Write-Host "$testproxy Test Failed "
            }
        }
    }
    else {
        break
    }
}
$proxyIPs | Export-Csv d:\ProxyServerList-20210829.csv -Encoding UTF8 -Force -NoTypeInformation
#>
#endregion

#region http://66ip-/
$proxylist = @()
# The last 34 <li> elements are the region links
$regionnames = ($ChromeDriver.FindElementsByTagName('li') | Select-Object Text -Last 34).Text
foreach ($regionname in $regionnames) {
    $ChromeDriver.FindElementByLinkText($regionname).Click()
    Start-Sleep 3
    $trcount = ($ChromeDriver.FindElementsByTagName('tr') | Measure-Object).Count
    $filtercount = $trcount - 3    # Skip the first three rows, which are not proxy entries
    # @() keeps .Count reliable even when only one row is returned
    $iplist = @($ChromeDriver.FindElementsByTagName('tr') | Select-Object Text -Last $filtercount)
    $j = 0
    if ($iplist.Count -gt 0) {
        foreach ($ipstring in $iplist.Text) {
            $ipinfo    = $ipstring -split ' '
            $ipaddress = $ipinfo[0]
            $ipport    = $ipinfo[1]
            $ipregion  = $ipinfo[2]
            $iptype    = $ipinfo[3]
            $j++
            $w = $j.ToString() + '/' + $iplist.Count.ToString()
            $percent = "{0:0.0%}" -f ($j/$iplist.Count)
            Write-Progress -Activity "Process test proxy address" -Status "Please wait, current region $ipregion, $w, $percent" -PercentComplete ($j/($iplist.Count) * 100)
            try {
                # Format string reconstructed; the source shows only the -f operands
                $testproxy = "http://{0}:{1}" -f $ipaddress, $ipport
                $testresult = Invoke-WebRequest -Uri $testurl -Proxy $testproxy -TimeoutSec 3 -ErrorAction Stop
                if ($testresult.StatusCode -eq 200) {
                    Write-Host $testproxy
                    $obj = [pscustomobject]@{
                        IPAddress = $ipaddress
                        Port      = $ipport
                        Region    = $ipregion
                    }
                    $proxylist += $obj
                }
            }
            catch {
                # Proxy did not respond in time or returned an error; skip it
            }
        }
    }
}
$proxylist | Select-Object IPAddress, Port, Region -Unique | Out-GridView
#endregion
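Both regions of the script split a table row's text on spaces and trust that the first two fields are an IP address and a port. A slightly more defensive version of that parsing step might look like the sketch below. The function name `ConvertFrom-ProxyRow` and the sample rows are hypothetical, and the "IP, port, region" column order is assumed from the indexes used in the script above.

```powershell
# Hypothetical helper: turns one row of table text into a proxy record,
# or returns nothing when the row does not start with "<ip> <port>".
function ConvertFrom-ProxyRow {
    param([Parameter(Mandatory)][AllowEmptyString()][string]$RowText)

    # Split on any run of whitespace so repeated spaces do not create empty fields.
    $fields = $RowText.Trim() -split '\s+'
    if ($fields.Count -lt 2) { return }

    $parsedAddress = $null
    $parsedPort = 0
    $isAddress = [System.Net.IPAddress]::TryParse($fields[0], [ref]$parsedAddress)
    $isPort = [int]::TryParse($fields[1], [ref]$parsedPort)
    if ((-not $isAddress) -or (-not $isPort)) { return }
    if (($parsedPort -lt 1) -or ($parsedPort -gt 65535)) { return }

    [pscustomobject]@{
        IPAddress = $fields[0]
        Port      = $parsedPort
        Region    = if ($fields.Count -gt 2) { $fields[2] } else { $null }
    }
}

# Made-up sample rows: a header, one valid entry, and one junk line.
$rows = 'ip port location', '203.0.113.10 8080 Beijing', 'not-a-row'
$parsed = @($rows | ForEach-Object { ConvertFrom-ProxyRow -RowText $_ })
$parsed.Count   # 1
```

Filtering rows this way would avoid sending header or footer rows to `Invoke-WebRequest` as if they were proxies, at the cost of silently dropping any row whose layout differs from the assumed one.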
