
SCIENTIA SINICA Informationis, Volume 48, Issue 12: 1681-1696(2018) https://doi.org/10.1360/N112018-00138

A personalized mail re-filtering system based on the client

  • Received: May 28, 2018
  • Accepted: Aug 22, 2018
  • Published: Dec 4, 2018

Abstract

Email is an essential communication tool, but large volumes of spam can seriously disrupt users' work and lives and can even cause financial loss. Because interests differ from user to user, so does each user's definition of spam, which makes personalized spam filtering an important problem in the field. When an email is misclassified, the user has to correct the label manually, which is inconvenient. To address these problems, this paper combines rules with statistical methods and presents a client-based personalized email re-filtering system (PRFC) that provides personalized filtering and automatically corrects mis-filtered emails. Most existing spam filters ignore differences in class prior probabilities and the class imbalance problem, and they only filter mail online. The proposed system first processes the emails that have entered the inbox and the junk folder; it then trains two filters that learn from each other, based on the multi-task learning principle, to automatically correct mis-filtered emails in the inbox and in the junk folder. Because user interests and the data distribution of emails change over time, a multi-window learning framework that incorporates importance weights is designed so that the filters adapt dynamically. Experiments on the TREC 2006c and TREC 2007p data sets show that the proposed system filters effectively.
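The decision cascade that Algorithm 1 spells out (rules first, then a subject-based filter, then a body-based filter when the subject prediction is not confident enough) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `Email` class, `rule_check`, `refilter`, `should_replace`, the trusted-sender rule, the label convention (0 = normal mail, 1 = spam), and the default threshold `xi=0.9` are all assumptions made for the example. The window updates, MTFL training, and importance weighting are left out, and each learned filter is stood in for by a plain function.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Set, Tuple

# A stage maps text to (label, confidence).
# Assumption for this sketch: label 0 = normal mail, label 1 = spam.
Stage = Callable[[str], Tuple[int, float]]


@dataclass
class Email:
    sender: str
    subject: str
    body: str


def rule_check(email: Email, trusted_senders: Set[str]) -> Optional[int]:
    """Hypothetical rule stage: trusted senders and replies count as normal mail."""
    if email.sender in trusted_senders or email.subject.lower().startswith("re:"):
        return 0
    return None  # the rules cannot decide


def refilter(
    email: Email,
    trusted_senders: Set[str],
    subject_stage: Stage,
    body_stage: Stage,
    xi: float = 0.9,
) -> int:
    """Rules first, then the subject; fall back to the body when confidence <= xi."""
    ruled = rule_check(email, trusted_senders)
    if ruled is not None:
        return ruled
    label, confidence = subject_stage(email.subject)
    if confidence > xi:
        return label
    label, _ = body_stage(email.body)
    return label


def should_replace(err_l: float, err_s: float, rho: float) -> bool:
    """Swap test from Algorithm 1: replace L with S when Err(L) > Err(S) and Err(L) > rho."""
    return err_l > err_s and err_l > rho
```

In this sketch the body is vectorized and scored only when the cheaper subject-based prediction falls at or below the confidence threshold, which mirrors the order of the checks in Algorithm 1.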


Funded by

National Natural Science Foundation of China (Grant No. 61672281)

National Natural Science Foundation of China (Grant No. 61472186)


References

[1] Messaging Anti-Abuse Working Group. MAAWG email metrics program. First Quarter 2006 Report. 2006. http://www.maawg.org/about/FINAL_1Q2006_Metrics_Report.pdf

[2] Teng W L, Teng W C. A personalized spam filtering approach utilizing two separately trained filters. In: Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. Washington: IEEE Computer Society, 2008. 125-131.

[3] Lin H Z, Wang J L, Wu J P, et al. Effect of cold-rolling cladding on microstructure and properties of composite aluminum alloy foil. J Commun, 2017, 34: 121-132.

[4] Huang G W, Liu Y X, Chen Z. Personalized spam filtering method based on users' feedback. Electron Design Eng, 2014, 22: 53-56.

[5] Guzella T S, Caminhas W M. A review of machine learning approaches to spam filtering. Expert Syst Appl, 2009, 36: 10206-10222.

[6] Liu W Y, Wang T. Ensemble learning and active learning based personal spam email filtering. Comput Eng Sci, 2011, 33: 34-41.

[7] Clark J, Koprinska I, Poon J. Linger-a smart personal assistant for e-mail classification. In: Proceedings of the 13th International Conference on Artificial Neural Networks (ICANN'03), 2003. 274-277.

[8] Sahami M, Dumais S, Heckerman D, et al. A Bayesian approach to filtering junk e-mail. In: Proceedings of AAAI Workshop on Learning for Text Categorization, 1998. 62: 98-105.

[9] Graham P. Better Bayesian filtering. 2003. http://www.paulgraham.com/better.html

[10] Amayri O, Bouguila N. A study of spam filtering using support vector machines. Artif Intell Rev, 2010, 34: 73-108.

[11] Sanghani G, Kotecha K. Personalized spam filtering using incremental training of support vector machine. In: Proceedings of Conference on Computing, Analytics and Security Trends (CAST), Pune, 2016. 323-328.

[12] Yeh C Y, Wu C H, Doong S H. Effective spam classification based on meta-heuristics. In: Proceedings of IEEE International Conference on Systems, Man and Cybernetics, Waikoloa, 2005. 4: 3872-3877.

[13] Toolan F, Carthy J. Feature selection for spam and phishing detection. In: Proceedings of Conference on eCrime Researchers Summit (eCrime), Dallas, 2010. 1-12.

[14] Cheng V, Li C H. Personalized spam filtering with semi-supervised classifier ensemble. In: Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence. Washington: IEEE Computer Society, 2006. 195-201.

[15] Gomes H M, Barddal J P, Enembreck F, et al. A survey on ensemble learning for data stream classification. ACM Comput Surv, 2017, 50: 23.

[16] Wang S, Minku L L, Yao X. A systematic study of online class imbalance learning with concept drift. IEEE Trans Neural Netw Learning Syst, 2018, 29: 4802-4821.

[17] Syed N A, Liu H, Sung K K. Handling concept drifts in incremental learning with support vector machines. In: Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 1999. 317-321.

[18] Wang Y W, Liu Y N, Feng L Z, et al. A novel online spam identification method based on user interest degree. J South China Univ Tech (Nat Sci Ed), 2014, 7: 21-27.

[19] Junejo K N, Karim A. Robust personalizable spam filtering via local and global discrimination modeling. Knowl Inf Syst, 2013, 34: 299-334.

[20] Cohen L, Avrahami-Bakish G, Last M. Real-time data mining of non-stationary data streams from sensor networks. Inf Fusion, 2008, 9: 344-353.

[21] Gama J, Medas P, Castillo G, et al. Learning with drift detection. In: Proceedings of Conference on Brazilian Symposium on Artificial Intelligence. Berlin: Springer, 2004. 286-295.

[22] Harel M, Mannor S, El-Yaniv R, et al. Concept drift detection through resampling. In: Proceedings of the 31st International Conference on Machine Learning, Beijing, 2014. 1009-1017.

[23] Bach S H, Maloof M A. Paired learners for concept drift. In: Proceedings of the 8th IEEE International Conference on Data Mining, Pisa, 2008. 23-32.

[24] Xu Y, Xu R, Yan W, et al. Concept drift learning with alternating learners. In: Proceedings of International Joint Conference on Neural Networks (IJCNN), Anchorage, 2017. 2104-2111.

[25] Wang J, Xu S, Duan B, et al. An ensemble classification algorithm based on information entropy for data streams. 2017. arXiv

[26] Mandelbaum A, Shalev A. Word embeddings and their use in sentence classification tasks. 2016. arXiv

[27] Sugiyama M, Nakajima S, Kashima H, et al. Direct importance estimation with model selection and its application to covariate shift adaptation. In: Proceedings of Conference on Advances in Neural Information Processing Systems, Vancouver, 2008. 1433-1440.

[28] Zhang K, Zheng V, Wang Q, et al. Covariate shift in Hilbert space: a solution via surrogate kernels. In: Proceedings of the 30th International Conference on Machine Learning, Atlanta, 2013. 388-395.

[29] Liu A, Ziebart B. Robust classification under sample selection bias. In: Proceedings of the Conference on Advances in Neural Information Processing Systems, Montreal, 2014. 37-45.

[30] Huang J, Gretton A, Borgwardt K M, et al. Correcting sample selection bias by unlabeled data. In: Proceedings of Conference on Advances in Neural Information Processing Systems, Vancouver, 2007. 601-608.

[31] Kawahara Y, Sugiyama M. Sequential change-point detection based on direct density-ratio estimation. Statistical Analy Data Min, 2012, 5: 114-127.

[32] Kanamori T, Hido S, Sugiyama M. Efficient direct density ratio estimation for non-stationarity adaptation and outlier detection. In: Proceedings of Conference on Advances in Neural Information Processing Systems, Vancouver, 2009. 809-816.

[33] Kivinen J, Smola A J, Williamson R C. Online learning with kernels. IEEE Trans Signal Process, 2004, 52: 2165-2176.

[34] Junejo K N. Distribution shift resilient discrimination information space for SVM classification. In: Proceedings of the 8th IEEE Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Vancouver, 2017. 378-383.

[35] Han Y, He X, Yang M, et al. Chinese spam filter based on relaxed online support vector machine. In: Proceedings of Conference on Asian Language Processing (IALP), Harbin, 2010. 185-188.

[36] Sun G, Li S, Chen T, et al. Active learning method for Chinese spam filtering. Int J Performability Eng, 2017, 17: 511.

  •   

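    Algorithm 1 PRFC implementation

    Require: labeled samples $\{{\rm SW}^{(i)}\}_{i=1}^{N_m}$, unlabeled samples $\{{\rm TW}^{(i)}\}_{i=1}^{N_m}$, the parsed test email email, the start position $T_0$ and current position $T_1$ of LW, the acceptable mis-filtering rate threshold $\rho$ for model L, the confidence threshold $\xi$ for the predicted label, and the initialized filters Filter_inbox and Filter_junkbox;

    if the rules can determine $y=0$ from email`From' or email`Re' then

    return $y$;

    else

    use the Filter_inbox filter;

    Subject-based filtering: vectorize email`Subject' as $x^s$;

    ${\rm TW}^{(N_m+1)} \leftarrow x^s$, ${\rm SW}^{(N_m+1)}\leftarrow {\rm TW}^{(1)}$;

    ${\rm SW}^{(i)}\leftarrow {\rm SW}^{(i+1)}$, ${\rm TW}^{(i)}\leftarrow {\rm TW}^{(i+1)},\ i=1,\ldots,N_m$;

    if class imbalance occurs in ${\rm SW}$ then

    learn the model parameters $w^i,w^j$ with MTFL;

    update L;

    end if

    incrementally train L with ${\rm SW}^{(N_m)}$;

    update the weights $\{\alpha_i\}_{i=1}^{N_m}$ of $\{{\rm SW}^{(i)}\}_{i=1}^{N_m}$;

    if covariate shift is detected then

    recompute $\{\alpha_i\}_{i=1}^{N_m}$;

    end if

    update S with the weighted $\{{\rm SW}^{(i)}\}_{i=1}^{N_m}$;

    if ${\rm Err}(L)> {\rm Err}(S)$ and ${\rm Err}(L)>\rho$ then

    $L\leftarrow S$;

    $T_0 = T_1-N_m$;

    else

    $T_1 = T_1+1$;

    end if

    $L.{\rm predict}(x^s)\rightarrow [y,{\rm confidence}]$;

    if ${\rm confidence}>\xi$ then

    return $y$;

    else

    Body-based filtering: vectorize email`Body' as $x^b$;

    likewise, repeat lines 6$\sim$24 (the window update through the prediction) with $x^b$;

    return $y$;

    end if

    else

    use the Filter_junkbox filter;

    likewise, repeat lines 5$\sim$31 (subject-based filtering through the end of body-based filtering);

    end if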

  • Table 1   Experimental corpuses
    Corpus Normal Spam Total
    TREC 2006c 21766 42854 64620
    TREC 2007p 25220 50199 75419
  • Table 2   Multi-task vs. single task $^{\rm a)}$
    Method       G-mean ($\uparrow$)  $F1$ ($\uparrow$)  ($1-$ROCA)% ($\downarrow$)  Accuracy ($\uparrow$)
    Multi-task   0.9895               0.9921             0.0104                      0.9896
    Single task  0.9703               0.9775             0.0296                      0.9705
  • Table 3   Evaluating different algorithms on TREC 2006c and TREC 2007p $^{\rm a)}$
    Corpus      Method      Accuracy ($\uparrow$)  FPR ($\downarrow$)  ($1-$ROCA)% ($\downarrow$)  lam% ($\downarrow$)
    TREC 2006c  DISvm [34]  0.9594                 0.0107              0.0383                      2.73
    TREC 2006c  ROSVM [35]  0.9935                 0.0036              0.0094                      0.34
    TREC 2006c  MLC [36]    0.9992                 0.0021              0.0004                      0.08
    TREC 2006c  PRFC        0.9984                 0.0013              0.0025                      0.12
    TREC 2007p  DISvm [34]  0.9658                 0.0087              0.0321                      2.10
    TREC 2007p  ROSVM [35]  0.9848                 0.0060              0.0108                      0.86
    TREC 2007p  MLC [36]    0.9855                 0.0056              0.0096                      0.64
    TREC 2007p  PRFC        0.9865                 0.0053              0.0068                      0.45
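
Tables 2 and 3 report G-mean, $F1$, accuracy, and lam% without restating how they are computed. The sketch below uses the standard definitions and is not taken from the paper: it assumes spam is the positive class, so FPR is the fraction of normal mail flagged as spam and FNR is the fraction of spam that gets through, and it takes lam% to be the logistic average misclassification rate used in the TREC spam track. The function names and the clipping constant `eps` are hypothetical. The ($1-$ROCA)% column needs ranked scores rather than counts and is not covered.

```python
import math


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all emails that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def g_mean(tp: int, fp: int, tn: int, fn: int) -> float:
    """Geometric mean of the true-positive rate and the true-negative rate."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return math.sqrt(tpr * tnr)


def f1(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for the positive (spam) class."""
    return 2 * tp / (2 * tp + fp + fn)


def logit(p: float, eps: float = 1e-12) -> float:
    """Log-odds of p, with p clipped away from 0 and 1 to keep the log finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def lam(fpr: float, fnr: float) -> float:
    """Logistic average misclassification: inverse logit of the mean of the two logits."""
    mean_logit = (logit(fpr) + logit(fnr)) / 2.0
    return 1.0 / (1.0 + math.exp(-mean_logit))
```

Because lam averages the two error rates on the log-odds scale, a filter cannot reach a low lam% by driving one error rate toward zero while letting the other grow. The tables report lam% as a percentage, so the value returned by `lam` would be multiplied by 100.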
