用不到 200 行的 Python 代码编写一个批量检测 URL 是否可以访问的脚本（ 2 ）

本篇文章是《用不到 200 行的 Python 代码编写一个批量检测 URL 是否可以访问的脚本》的扩展，本篇文章最重要的部分是增加了多线程操作，大大节约了检测 URL 所需要的时间；以及对日志的写入进行了优化。在阅读本篇文章前请先阅读上一篇文章，很多东西将不再赘述。

本篇文章描述的脚本依旧会对每个 URL 检测四次，其中 http 检测两次，https 检测两次（为什么是四次？详情请见上一篇文章）。

首先是包的导入和浏览器所使用的 User-Agent 的定义：

# coding=utf-8

import httplib, urllib, urllib2, ssl, time, os, threading

HEADERS = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3298.4 Safari/537.36'}

因为这个版本增加了多线程，所以多引入了一个名为 threading 的包。

紧接着是 show_status_urllib(url) 和 show_status_urllib2(url) 两个检测方法，这里面捕获了大量的异常；以及一个用于处理所返回的网页状态码的函数 return_code(function_name, response) ：

def show_status_urllib(url):
    function_name = 'urllib'
    try:
        response = urllib.urlopen(url)
    except IOError, e:
        return False, function_name + ' : IOError - ' + e.message
    except ssl.CertificateError, e:
        return False, function_name + ' : CertificateError - ' + e.message
    except httplib.BadStatusLine, e:
        return False, function_name + ' : httplib.BadStatisLine - ' + e.message
    else:
        return return_code(function_name, response)


def show_status_urllib2(url):
    function_name = 'urllib2'
    try:
        # request = urllib2.Request(url)
        request = urllib2.Request(url, '', HEADERS)
        response = urllib2.urlopen(request)
    except urllib2.HTTPError, e:
        return False, function_name + ' : HTTPError code ' + str(e.code)
    except urllib2.URLError, e:
        return False, function_name + ' : URLError - ' + str(e.reason)
    except IOError, e:
        return False, function_name + ' : IOError - ' + e.message
    except ssl.CertificateError, e:
        return False, function_name + ' : CertificateError - ' + e.message
    except httplib.BadStatusLine, e:
        return False, function_name + ' : httplib.BadStatisLine - ' + e.message
    else:
        return return_code(function_name, response)


def return_code(function_name, response):
    code = response.getcode()
    if code == 404:
        return False, function_name + ' : code 404 - Page Not Found !'
    elif code == 403:
        return False, function_name + ' : code 403 - Forbidden !'
    elif code == 200:
        return True, function_name + ' : code 200 - OK !'
    else:
        return False, function_name + ' : code ' + str(code)

这三个方法与之前脚本里的基本一致。

接下来是一个用于格式化日志的函数，以及一个对 URL 进行四次检测的函数：

def return_log(protocol, is_true, url, urllib_log, urllib2_log):
    if is_true:
        return '### ' + protocol + '_YES ### ' + url + ' → （' + urllib_log + ' ）（ ' + urllib2_log + '）'
    else:
        return '### ' + protocol + '_NO ### ' + url + ' → （' + urllib_log + ' ）（ ' + urllib2_log + '）'


def request_url(url):
    url = url.strip().replace(" ", "").replace("\n", "").replace("\r\n", "")

    http_url = 'http://' + url
    is_http_urllib_true, http_urllib_log = show_status_urllib(http_url)
    is_http_urllib2_true, http_urllib2_log = show_status_urllib2(http_url)

    https_url = 'https://' + url
    is_https_urllib_true, https_urllib_log = show_status_urllib(https_url)
    is_https_urllib2_true, https_urllib2_log = show_status_urllib2(https_url)

    return return_log('HTTP', is_http_urllib_true and is_http_urllib2_true, http_url, http_urllib_log, http_urllib2_log)\
           + '\n' + return_log('HTTPS', is_https_urllib_true and is_https_urllib2_true, https_url, https_urllib_log, https_urllib2_log)

从这里开始接下来的部分发生了比较大的变化。

接下来是一个对日志进行分类并将日志保存到文本文件的函数：

LOG_FILE = 0
LOG_HTTP_NO_AND_HTTPS_NO_FILE = 1
LOG_HTTP_NO_BUT_HTTPS_YES_FILE = 2
LOG_HTTP_YES_BUT_HTTPS_NO_FILE = 3


def save_log(log_of_list, line):
    line += '\n'
    log_of_list[LOG_FILE].write(line)
    if '### HTTP_NO ###' in line:
        if '### HTTPS_NO ###' in line:
            log_of_list[LOG_HTTP_NO_AND_HTTPS_NO_FILE].write(line)
        elif '### HTTPS_YES ###' in line:
            log_of_list[LOG_HTTP_NO_BUT_HTTPS_YES_FILE].write(line)
    elif '### HTTP_YES ###' in line and '### HTTPS_NO ###' in line:
        log_of_list[LOG_HTTP_YES_BUT_HTTPS_NO_FILE].write(line)

log_of_list 是一个列表，里面存放了四个打开的 log 文件，这四个 log 文件会在 main() 主函数里自动创建，存放路径是在 logs/ 文件夹下（相对路径），具体如下图所示：

用不到 200 行的 Python 代码编写一个批量检测 URL 是否可以访问的脚本 - 目录结构（1） — 用不到 200 行的 Python 代码编写一个批量检测 URL 是否可以访问的脚本 – 目录结构（1）

用不到 200 行的 Python 代码编写一个批量检测 URL 是否可以访问的脚本 - 目录结构（2） — 用不到 200 行的 Python 代码编写一个批量检测 URL 是否可以访问的脚本 – 目录结构（2）

在上一篇文章中没有详细地说明这四个 log 文件具体是存放什么的，但我相信您从代码和文件名里应该能猜出来，这里做个简单地说明：

log-XXX.txt ：存放所有检测结果；
log-XXX-http-no-and-https-no.txt ：存放 http 和 https 均无法访问的检测结果；
log-XXX-http-no-but-https-yes.txt ：存放 https 可以访问但 http 无法访问的检测结果；
log-XXX-http-yes-but-https-no.txt ：存放 http 可以访问但 https 无法访问的检测结果；
XXX 为日期，精确到秒。

接下来是在每个进程里执行的函数 use_thread(thread_total_number, thread_number, url_of_list, log_of_list) ：

def use_thread(thread_total_number, thread_number, url_of_list, log_of_list):
    i = -1
    for url in url_of_list:
        i += 1
        if i % thread_total_number == thread_number:
            line = request_url(url)
            save_log(log_of_list, line)
            print 'thread_number = ' + str(thread_number) + '\n' + line + '\n'   # 查看当前是哪个进程在运行
    # print 'thread_number EXIT = ' + str(thread_number) + '\n'      # 查看当前是哪个进程已经运行结束

和启动进程的函数 open_thread(thread_total_number, url_of_list, log_of_list) ：

def open_thread(thread_total_number, url_of_list, log_of_list):
    threads_for_list = []
    for thread_number in range(thread_total_number):
        thread = threading.Thread(target=use_thread, args=(thread_total_number, thread_number, url_of_list, log_of_list))
        thread.start()
        threads_for_list.append(thread)

    # 等待所有进程全部退出
    for t in threads_for_list:
        t.join()

最后是主函数，主函数这里也提供了两种检测方法：一种是通过列表来读取出需要检测的 URL ，另一种是通过文本文件（ url.txt ）来读取出需要检测的 URL ：

def main():
    if not os.path.exists('logs'):
        os.mkdir('logs')

    now_time = time.strftime('%Y-%m-%d-%H-%M-%S', time.localtime(time.time()))
    log_of_list = [None, None, None, None]
    log_of_list[LOG_FILE] = open('logs/log-' + now_time + '.txt', 'a')
    log_of_list[LOG_HTTP_NO_AND_HTTPS_NO_FILE] = open('logs/log-' + now_time + '-http-no-and-https-no.txt', 'a')
    log_of_list[LOG_HTTP_NO_BUT_HTTPS_YES_FILE] = open('logs/log-' + now_time + '-http-no-but-https-yes.txt', 'a')
    log_of_list[LOG_HTTP_YES_BUT_HTTPS_NO_FILE] = open('logs/log-' + now_time + '-http-yes-but-https-no.txt', 'a')

    # 1、检测 list 里的链接
    print 'URL Testing For List :\n'
    url_of_list = ['www.baidu.com', 'news.baidu.com/guonei', 'www.qq.com', 'www.taobao.com']

    # 2、开启线程
    thread_total_number = 8  # 默认开启 8 个线程，这里可以根据实际情况修改
    open_thread(thread_total_number, url_of_list, log_of_list)

    # 1、检测 txt file 里的链接
    print '\nURL Testing For TXT File :\n'

    # 2、读取文本文件到列表
    url_of_list = []
    url_file = open("url.txt", 'r')
    for url in url_file.readlines():
        url_of_list.append(url)
    url_file.close()

    # 3、开启线程
    thread_total_number = 8  # 默认开启 8 个线程，这里可以根据实际情况修改
    open_thread(thread_total_number, url_of_list, log_of_list)

    # 4、将文件缓存写入磁盘，并关闭文件的读写
    log_of_list[LOG_FILE].flush()
    log_of_list[LOG_HTTP_NO_AND_HTTPS_NO_FILE].flush()
    log_of_list[LOG_HTTP_NO_BUT_HTTPS_YES_FILE].flush()
    log_of_list[LOG_HTTP_YES_BUT_HTTPS_NO_FILE].flush()

    log_of_list[LOG_FILE].close()
    log_of_list[LOG_HTTP_NO_AND_HTTPS_NO_FILE].close()
    log_of_list[LOG_HTTP_NO_BUT_HTTPS_YES_FILE].close()
    log_of_list[LOG_HTTP_YES_BUT_HTTPS_NO_FILE].close()


if __name__ == "__main__":
    main()

示例

url.txt 文件里有如下内容：

www.baidu.com
news.baidu.com/guonei
www.qq.com
www.taobao.com

用不到 200 行的 Python 代码编写一个批量检测 URL 是否可以访问的脚本（2） – 示例

C:\Users\Ricky\venv\url\Scripts\python.exe C:/Users/Ricky/PycharmProjects/url/url.py
URL Testing For List :

thread_number = 2
### HTTP_NO ### http://www.qq.com → （urllib : code 200 - OK ! ）（ urllib2 : code 208）
### HTTPS_NO ### https://www.qq.com → （urllib : code 200 - OK ! ）（ urllib2 : code 208）

thread_number = 0
### HTTP_YES ### http://www.baidu.com → （urllib : code 200 - OK ! ）（ urllib2 : code 200 - OK !）
### HTTPS_YES ### https://www.baidu.com → （urllib : code 200 - OK ! ）（ urllib2 : code 200 - OK !）

thread_number = 1
### HTTP_YES ### http://news.baidu.com/guonei → （urllib : code 200 - OK ! ）（ urllib2 : code 200 - OK !）
### HTTPS_YES ### https://news.baidu.com/guonei → （urllib : code 200 - OK ! ）（ urllib2 : code 200 - OK !）

thread_number = 3
### HTTP_YES ### http://www.taobao.com → （urllib : code 200 - OK ! ）（ urllib2 : code 200 - OK !）
### HTTPS_YES ### https://www.taobao.com → （urllib : code 200 - OK ! ）（ urllib2 : code 200 - OK !）


URL Testing For TXT File :

thread_number = 2
### HTTP_NO ### http://www.qq.com → （urllib : code 200 - OK ! ）（ urllib2 : code 208）
### HTTPS_NO ### https://www.qq.com → （urllib : code 200 - OK ! ）（ urllib2 : code 208）

thread_number = 0
### HTTP_YES ### http://www.baidu.com → （urllib : code 200 - OK ! ）（ urllib2 : code 200 - OK !）
### HTTPS_YES ### https://www.baidu.com → （urllib : code 200 - OK ! ）（ urllib2 : code 200 - OK !）

thread_number = 1
### HTTP_YES ### http://news.baidu.com/guonei → （urllib : code 200 - OK ! ）（ urllib2 : code 200 - OK !）
### HTTPS_YES ### https://news.baidu.com/guonei → （urllib : code 200 - OK ! ）（ urllib2 : code 200 - OK !）

thread_number = 3
### HTTP_YES ### http://www.taobao.com → （urllib : code 200 - OK ! ）（ urllib2 : code 200 - OK !）
### HTTPS_YES ### https://www.taobao.com → （urllib : code 200 - OK ! ）（ urllib2 : code 200 - OK !）


Process finished with exit code 0

其中 thread_number = X ，X 为当前启动的第几个进程的进程号，可根据自身需要修改代码来决定是否要打印该日志。

源代码下载

请参见文章下方的 Article Attachments 部分，或者点击右侧链接：url.py 和 url.txt（Python 2.7.15 ，亲测可执行）

其他相关文章：

用不到 200 行的 Python 代码编写一个批量检测 URL 是否可以访问的脚本

打赏

CCIE Engineer Community Knowledge Base

用不到 200 行的 Python 代码编写一个批量检测 URL 是否可以访问的脚本（ 2 ）

示例

源代码下载

文章附件

这篇文章对你有帮助吗？

发表评论? 取消回复