A Script for Scraping Email Addresses from a Web Page

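Today I helped a classmate write a shell script that extracts the email addresses from the web page at a given URL. The character-encoding conversion still has a weakness: it only handles the gb2312 case. Ideally the script would read the encoding from the HTML source and convert accordingly.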

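#!/bin/bash
# Usage: ./a.sh URL
# test page: http://zhidao.baidu.com/question/21631338.html

f=".tmp.txt"    # page converted to UTF-8
f2=".tmp2.txt"  # raw download

# Download the page quietly.
wget "$1" -O "$f2" >/dev/null 2>&1
# Only gb2312 input is handled here.
iconv -f gb2312 -t utf-8 "$f2" > "$f" 2>/dev/null

# Strip HTML tags and drop lines that open an HTML comment.
sed -i -e 's/<[^>]*>//g;/<!--/d' "$f"
# Remove &nbsp; entities, "°C", leading whitespace, and empty lines (GNU sed).
sed -i -e 's/&nbsp;//g;s/°C//g;s/^\s*//;/^$/d' "$f"

# Put each run of address characters on its own line, keep the tokens that
# contain @ and both start and end with an alphanumeric, then de-duplicate.
sed -e 's/[^-.+_a-zA-Z0-9@]/\n/g' "$f" | grep '@' | grep '^[a-zA-Z0-9]' | grep '[a-zA-Z0-9]$' | sort -u
rm -f "$f" "$f2"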

Sample run:

kongove@ubuntu:~$ ./a.sh http://zhidao.baidu.com/question/21631338.html
aizi66742112@hainan.net
alading512@sina.com
bingbing43@163.com
caoyingtj@163.com
chenq@chinabyte.com
dnzb@pub2.qz
duanxiaosong@ah163.com
gxlworld@163.com
halfmay2691@sina.com
hbx@wxjt.com.cn
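
As for the encoding problem mentioned at the top, one possible approach is to look for a charset declaration in the downloaded HTML and pass it to iconv. The following is only a rough sketch: the function names detect_charset and to_utf8 are made up for illustration, it assumes GNU grep, and it only looks at the first charset=... declaration in the file (it ignores HTTP headers and byte-order marks). If nothing is found it falls back to gb2312, as the script above does.

```shell
#!/bin/bash
# Hypothetical helper: print the charset declared in an HTML file ($1),
# lowercased, or "gb2312" if no declaration is found.
detect_charset() {
    local cs
    # LC_ALL=C and -a let grep read non-UTF-8 bytes as plain text.
    # -o prints only the matched part, -m 1 stops at the first matching line.
    cs=$(LC_ALL=C grep -a -i -o -m 1 'charset=["'\'']\{0,1\}[a-zA-Z0-9_-]*' "$1" \
         | head -n 1 \
         | sed -e 's/^[^=]*=//' -e "s/[\"']//g" \
         | tr 'A-Z' 'a-z')
    printf '%s\n' "${cs:-gb2312}"
}

# Hypothetical helper: write file $1 to stdout as UTF-8,
# using whatever charset detect_charset reports.
to_utf8() {
    iconv -f "$(detect_charset "$1")" -t utf-8 "$1"
}
```

With these two functions defined, the iconv line in the script could become to_utf8 "$f2" > "$f" 2>/dev/null. A page that declares a charset name iconv does not know would still fail to convert, so this is a starting point rather than a complete fix.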
