A shell script for splitting a file by its text content

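Purpose: a text file needs to be split, according to the timestamps it contains, into separate files named by date. The script reports the number of remaining lines and can resume from where it stopped (if you quit midway with Ctrl+C, the next run may rewrite a small amount of data that was already written). It can be run from any path, for example ../../split.sh or /tmp/split.sh; the progress file split.data always lives in the same directory as the script. The techniques involved are fairly intricate, so judge for yourself how impressive that is. Because awk calls an external command and closes its handles on every line, performance is not particularly high.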
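As an illustration of the "run from any path" requirement, here is a minimal sketch of one way to locate split.data next to the script. The helper names abs_dir and progress_file are hypothetical and are not part of the original script; the sketch assumes the script's directory exists and is accessible.

```shell
#!/bin/bash
# abs_dir PATH: print the absolute directory that contains PATH.
# (abs_dir and progress_file are hypothetical helper names.)
abs_dir() {
    # Run cd inside a subshell so the caller's working directory is unchanged.
    (cd "$(dirname "$1")" && pwd)
}

# progress_file SCRIPT_PATH: print the path of split.data next to the script.
progress_file() {
    echo "$(abs_dir "$1")/split.data"
}

# Typical use inside split.sh:
#   fileData=$(progress_file "$0")
```

Letting cd and pwd do the work handles both absolute and relative invocations without inspecting the first character of $0.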

Sample data:

1418103343455   175.25.244.178  new     -       javascript%3Avoid(0)    allgame_a       s851.sg2.ledu.com
1418103343575   175.25.244.178  new     -       javascript%3Avoid(0)    allgame_a       s851.sg2.ledu.com
1418103343751   175.25.244.178  new     -       javascript%3Avoid(0)    allgame_a       s851.sg2.ledu.com
1418103149566   119.176.238.210 new     -       javascript%3Avoid(0)    timely  s887.sg2.ledu.com
1418103163525   119.176.238.210 new     -       javascript%3Avoid(0)    close_notice+fr+png6    s887.sg2.ledu.com
1418103182940   218.57.139.22   new     -       javascript%3Avoid(0)    pack_age_a      s15.ttgj.ledu.com
1418103186528   218.57.139.22   new     -       javascript%3Avoid(0)    pack_age_a      s15.ttgj.ledu.com
1418103171841   180.104.22.40   new     -       javascript%3Avoid(0)    pack_age_a      s889.sg2.ledu.com
1418103196648   218.57.139.22   new     -       %23     put_cross+fl    s15.ttgj.ledu.com
141810320543a   218.57.139.22   new     -       javascript%3Avoid(0)    pack_age_a      s9.ttgj.ledu.com
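Each valid record starts with a 13-digit millisecond timestamp, and the first 10 digits are the Unix time in seconds. The sketch below shows how one such value maps to a date and how a malformed value like the one in the last sample line is rejected. The function name ts_to_day is hypothetical, and the `-d @SECONDS` form assumes GNU date, which is what the script itself relies on.

```shell
#!/bin/bash
# ts_to_day TIMESTAMP_MS: print the local date (YYYY-MM-DD) for a 13-digit
# millisecond timestamp, or return 1 if the value is malformed.
# (ts_to_day is a hypothetical helper name; "date -d @N" assumes GNU date.)
ts_to_day() {
    local ts=$1
    case $ts in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
    esac
    [ "${#ts}" -eq 13 ] || return 1
    # Strip the last three digits (milliseconds) to get seconds.
    date -d "@${ts%???}" +%Y-%m-%d
}

# Example:
#   ts_to_day 1418103343455    # a date in December 2014; the day depends on the time zone
#   ts_to_day 141810320543a    # prints nothing, returns 1
```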

The script (split.sh):

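#!/bin/bash
file=/data/tracepng_logs/access_tracepng.log
dir=/data/apache/_flume/crossbar/click
totalLines=$(wc -l < "$file")
# Resolve the script path so that split.data always sits next to the script
if [[ "$0" == /* ]]; then
        fileData=$0
else
        fileData=$(pwd)/$0
fi
fileData=$(dirname "$fileData")
fileData=$fileData/split.data
# First run: there is no progress file yet, so start from line 1
[ -s "$fileData" ] || echo 1 > "$fileData"
# Cancelling midway may leave the last line only partially written (this is what happens without close)
lastLine=($(tail -n 2 "$fileData"))
# Neutralize the effect of a trailing \n (should not happen once close is used)
if [ "" = "${lastLine[1]}" ]; then
        unset 'lastLine[1]'
fi
# Keep the larger of the last two values; a truncated write yields a smaller number
if [ ${#lastLine[@]} -eq 1 ] || [ "${lastLine[0]}" -gt "${lastLine[1]}" ]; then
        lastLine=${lastLine[0]}
else
        lastLine=${lastLine[1]}
fi
echo "$lastLine" > "$fileData"
# Lines lastLine..totalLines are still to be processed
remainLines=$((totalLines - lastLine + 1))
tail -n +"$lastLine" "$file" |
awk --re-interval '{
        printf "\r%s remains",("'"$remainLines"'"-NR);
        if($1 ~ /^[0-9]{13}$/){
                # The first 10 digits are the Unix timestamp in seconds (awk strings are 1-indexed)
                time=substr($1,1,10);
                c="date +%Y-%m-%d -d @"time;
                c|getline f;
                # Build the file name only after f has been read for this line
                file="'"$dir"'""/"f".log"
                print $0 >> file
                # Without close, the file may well end up incompletely written
                close(file)
                # Without close, awk may fail with: cmd. line:6: (FILENAME=- FNR=113924) fatal: cannot open pipe "date +%Y-%m-%d -d @1418097392" (Too many open files)
                close(c)
        };
        # NR%1 is always 0, so progress is saved after every line; raise the modulus to save less often
        if(NR%1==0){
                print "'"$lastLine"'"+NR > "'"$fileData"'"
                # Without close, $fileData is appended to on every line and may well be written incompletely.
                close("'"$fileData"'")
        }
}'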
echo
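
The per-line call to `date` is the main cost in the script above. As a rough sketch of a faster variant, assuming GNU awk (gawk) is installed, the date can be computed with gawk's built-in strftime so that no external process is started. The function name split_by_day is hypothetical, and progress tracking is left out to keep the example short.

```shell
#!/bin/bash
# split_by_day INPUT OUTDIR: append each valid record of INPUT to
# OUTDIR/YYYY-MM-DD.log, using gawk's built-in strftime instead of date.
# (split_by_day is a hypothetical helper name; gawk is an assumption.)
split_by_day() {
    local input=$1 outdir=$2
    gawk --re-interval -v dir="$outdir" '
        $1 ~ /^[0-9]{13}$/ {
            secs = substr($1, 1, 10) + 0        # first 10 digits: seconds
            out = dir "/" strftime("%Y-%m-%d", secs) ".log"
            if (out != prev) {
                # Keep only one output file open at a time.
                if (prev != "") close(prev)
                prev = out
            }
            print >> out
        }
    ' "$input"
}

# Example:
#   split_by_day /data/tracepng_logs/access_tracepng.log /data/apache/_flume/crossbar/click
```

Closing a file only when the target day changes keeps the number of open handles at one while avoiding a close per line. Like `date`, strftime uses the local time zone.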

While running:

[root@210.14.138.94 ~/xiepeng]# ./split.sh 
141435 remains
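
The counter shown here is derived from the total line count and the start line restored from split.data. The following is a minimal sketch of that bookkeeping; the function names read_checkpoint and remaining_lines are hypothetical, and starting from line 1 when the progress file is missing or empty is an assumption of this sketch.

```shell
#!/bin/bash
# read_checkpoint FILE: print the line number to resume from.
# Only the last two lines are inspected: an interrupted write can leave the
# final line truncated, in which case the previous value is the larger one.
# (read_checkpoint and remaining_lines are hypothetical helper names.)
read_checkpoint() {
    local file=$1 best=1 n
    if [ -s "$file" ]; then
        for n in $(tail -n 2 "$file"); do
            case $n in
                ''|*[!0-9]*) continue ;;   # skip anything that is not a number
            esac
            [ "$n" -gt "$best" ] && best=$n
        done
    fi
    echo "$best"
}

# remaining_lines TOTAL START: lines START..TOTAL are still to be processed.
remaining_lines() {
    echo $(( $1 - $2 + 1 ))
}

# Example:
#   start=$(read_checkpoint split.data)
#   remaining_lines "$(wc -l < access_tracepng.log)" "$start"
```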

Result:

2001-11-25.log  2008-01-25.log  2011-11-24.log  2014-03-21.log  2014-11-17.log
2002-01-01.log  2008-02-25.log  2011-11-25.log  2014-08-25.log  2014-11-18.log
2002-01-02.log  2009-01-01.log  2012-05-30.log  2014-09-03.log  2014-11-19.log
2002-01-03.log  2009-02-03.log  2012-08-29.log  2014-09-10.log  2014-11-20.log
2002-01-04.log  2009-08-10.log  2013-02-14.log  2014-09-24.log  2014-11-21.log
2003-01-01.log  2009-09-16.log  2013-03-10.log  2014-10-01.log  2014-11-22.log
2003-01-02.log  2010-01-01.log  2013-03-12.log  2014-10-05.log  2014-11-23.log
2004-01-01.log  2010-01-19.log  2013-08-19.log  2014-10-18.log  2014-11-24.log
2004-11-23.log  2010-06-04.log  2013-08-20.log  2014-10-20.log  2014-11-25.log
2006-01-01.log  2010-09-05.log  2013-08-22.log  2014-10-25.log  2014-11-26.log
2006-07-23.log  2010-10-20.log  2013-08-24.log  2014-10-26.log  2014-11-27.log
2006-08-31.log  2011-01-01.log  2013-11-19.log  2014-11-01.log  2014-12-05.log
2006-11-25.log  2011-01-03.log  2013-11-20.log  2014-11-03.log  2014-12-20.log
2007-01-30.log  2011-01-05.log  2013-11-21.log  2014-11-04.log  2014-12-25.log
2007-02-01.log  2011-03-09.log  2013-11-22.log  2014-11-07.log  2015-01-24.log
2007-02-02.log  2011-04-04.log  2013-11-23.log  2014-11-08.log  2022-08-27.log
2007-02-08.log  2011-04-25.log  2013-11-24.log  2014-11-10.log
2007-09-11.log  2011-07-03.log  2013-11-25.log  2014-11-13.log
2008-01-10.log  2011-11-19.log  2014-01-05.log  2014-11-15.log
2008-01-14.log  2011-11-23.log  2014-01-25.log  2014-11-16.log

 

 
