
SCIENTIA SINICA Informationis, Volume 48, Issue 12: 1681-1696(2018) https://doi.org/10.1360/N112018-00138

A personalized mail re-filtering system based on the client

  • Received: May 28, 2018
  • Accepted: Aug 22, 2018
  • Published: Dec 4, 2018

Abstract

Email is an essential communication tool, but large volumes of spam can seriously disrupt users' work and life and can even cause property loss. Because users differ in interests and hobbies, their definitions of spam may differ greatly, so personalized spam filtering has become an important problem in the field of spam filtering. When an email is misclassified, the user has to correct it manually, which degrades the user experience. To address these problems and provide both personalized filtering and automatic correction of mis-filtered emails, this paper combines rule-based and statistical methods and presents a personalized client-based email re-filtering system (PRFC) that automatically corrects mis-filtered emails. Most existing spam filters do not consider differences in class prior probability or the class imbalance problem, and they filter mail only online. The proposed system first processes the emails entering the inbox and the junk box, and then designs two mutually learning filters, based on the multi-task learning principle, to automatically correct the mis-filtered emails in the inbox and the junk box. To maintain filter performance as users' interests and the data distribution of emails change over time, a multi-window learning framework that incorporates importance weights is designed to adapt the filter dynamically. Finally, experiments on the TREC 2006c and TREC 2007p data sets verify that the proposed system achieves strong filtering performance.


Funded by

National Natural Science Foundation of China (Grant Nos. 61672281, 61472186)


References

[1] Messaging Anti-Abuse Working Group. MAAWG email metrics program. First Quarter 2006 Report. 2006. http://www.maawg.org/about/FINAL_1Q2006_Metrics_Report.pdf.

[2] Teng W L, Teng W C. A personalized spam filtering approach utilizing two separately trained filters. In: Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. Washington: IEEE Computer Society, 2008. 125--131.

[3] Lin H Z, Wang J L, Wu J P, et al. Effect of cold-rolling cladding on microstructure and properties of composite aluminum alloy foil. J Commun, 2017, 34: 121--132.

[4] Huang G W, Liu Y X, Chen Z. Personalized spam filtering method based on users' feedback. Electron Design Eng, 2014, 22: 53--56.

[5] Guzella T S, Caminhas W M. A review of machine learning approaches to spam filtering. 2009, 36: 10206--10222.

[6] Liu W Y, Wang T. Ensemble learning and active learning based personal spam email filtering. Comput Eng Sci, 2011, 33: 34--41.

[7] Clark J, Koprinska I, Poon J. Linger-a smart personal assistant for e-mail classification. In: Proceedings of the 13th International Conference on Artificial Neural Networks (ICANN'03), 2003. 274--277.

[8] Sahami M, Dumais S, Heckerman D, et al. A Bayesian approach to filtering junk e-mail. In: Proceedings of AAAI Workshop on Learning for Text Categorization, 1998. 62: 98--105.

[9] Graham P. Better Bayesian filtering. 2003. http://www.paulgraham.com/better.html.

[10] Amayri O, Bouguila N. A study of spam filtering using support vector machines. 2010, 34: 73--108.

[11] Sanghani G, Kotecha K. Personalized spam filtering using incremental training of support vector machine. In: Proceedings of Conference on Computing, Analytics and Security Trends (CAST), Pune, 2016. 323--328.

[12] Yeh C Y, Wu C H, Doong S H. Effective spam classification based on meta-heuristics. In: Proceedings of IEEE International Conference on Systems, Man and Cybernetics, Waikoloa, 2005. 4: 3872--3877.

[13] Toolan F, Carthy J. Feature selection for spam and phishing detection. In: Proceedings of Conference on eCrime Researchers Summit (eCrime), Dallas, 2010. 1--12.

[14] Cheng V, Li C H. Personalized spam filtering with semi-supervised classifier ensemble. In: Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence. Washington: IEEE Computer Society, 2006. 195--201.

[15] Gomes H M, Barddal J P, Enembreck F, et al. A survey on ensemble learning for data stream classification. ACM Comput Surv, 2017, 50: 23.

[16] Wang S, Minku L L, Yao X. A systematic study of online class imbalance learning with concept drift. 2018, 29: 4802--4821.

[17] Syed N A, Liu H, Sung K K. Handling concept drifts in incremental learning with support vector machines. In: Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 1999. 317--321.

[18] Wang Y W, Liu Y N, Feng L Z, et al. A novel online spam identification method based on user interest degree. J South China Univ Tech (Nat Sci Ed), 2014, 7: 21--27.

[19] Junejo K N, Karim A. Robust personalizable spam filtering via local and global discrimination modeling. 2013, 34: 299--334.

[20] Cohen L, Avrahami-Bakish G, Last M. Real-time data mining of non-stationary data streams from sensor networks. 2008, 9: 344--353.

[21] Gama J, Medas P, Castillo G, et al. Learning with drift detection. In: Proceedings of Conference on Brazilian Symposium on Artificial Intelligence. Berlin: Springer, 2004. 286--295.

[22] Harel M, Mannor S, El-Yaniv R, et al. Concept drift detection through resampling. In: Proceedings of the 31st International Conference on Machine Learning, Beijing, 2014. 1009--1017.

[23] Bach S H, Maloof M A. Paired learners for concept drift. In: Proceedings of the 8th IEEE International Conference on Data Mining, Pisa, 2008. 23--32.

[24] Xu Y, Xu R, Yan W, et al. Concept drift learning with alternating learners. In: Proceedings of International Joint Conference on Neural Networks (IJCNN), Anchorage, 2017. 2104--2111.

[25] Wang J, Xu S, Duan B, et al. An ensemble classification algorithm based on information entropy for data streams. arXiv, 2017.

[26] Mandelbaum A, Shalev A. Word embeddings and their use in sentence classification tasks. arXiv, 2016.

[27] Sugiyama M, Nakajima S, Kashima H, et al. Direct importance estimation with model selection and its application to covariate shift adaptation. In: Proceedings of Conference on Advances in Neural Information Processing Systems, Vancouver, 2008. 1433--1440.

[28] Zhang K, Zheng V, Wang Q, et al. Covariate shift in Hilbert space: a solution via surrogate kernels. In: Proceedings of the 30th International Conference on Machine Learning, Atlanta, 2013. 388--395.

[29] Liu A, Ziebart B. Robust classification under sample selection bias. In: Proceedings of the Conference on Advances in Neural Information Processing Systems, Montreal, 2014. 37--45.

[30] Huang J, Gretton A, Borgwardt K M, et al. Correcting sample selection bias by unlabeled data. In: Proceedings of Conference on Advances in Neural Information Processing Systems, Vancouver, 2007. 601--608.

[31] Kawahara Y, Sugiyama M. Sequential change-point detection based on direct density-ratio estimation. 2012, 5: 114--127.

[32] Kanamori T, Hido S, Sugiyama M. Efficient direct density ratio estimation for non-stationarity adaptation and outlier detection. In: Proceedings of Conference on Advances in Neural Information Processing Systems, Vancouver, 2009. 809--816.

[33] Kivinen J, Smola A J, Williamson R C. Online learning with kernels. 2004, 52: 2165--2176.

[34] Junejo K N. Distribution shift resilient discrimination information space for SVM classification. In: Proceedings of 8th IEEE Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Vancouver, 2017. 378--383.

[35] Han Y, He X, Yang M, et al. Chinese spam filter based on relaxed online support vector machine. In: Proceedings of Conference on Asian Language Processing (IALP), Harbin, 2010. 185--188.

[36] Sun G, Li S, Chen T, et al. Active learning method for Chinese spam filtering. Int J Performability Eng, 2017, 17: 511.

  • Figure 1

    (Color online) System framework of PRFC

  • Figure 2

    (Color online) An illustration of multi-window framework

  • Figure 3

    (Color online) Parameter sensitivity of PRFC on TREC 2006c. (a) Sensitivity to the multi-task learning parameter; (b) sensitivity to the ensemble size; (c) sensitivity to the sliding window size; (d) sensitivity to the drift detection threshold

  • Figure 4

    Proportion of emails filtered by the different parts of PRFC

  •   

    Algorithm 1 Implementation of PRFC

    Input: labeled samples $\{{\rm SW}^{(i)}\}_{i=1}^{N_m}$; unlabeled samples $\{{\rm TW}^{(i)}\}_{i=1}^{N_m}$; the parsed test email, email; the start position $T_0$ and the current position $T_1$ of LW; the acceptable mis-filtering rate threshold $\rho$ of model L; the confidence threshold $\xi$ for predicted labels; the initialized filters Filter_inbox and Filter_junkbox.

    1: if $y=0$ can be decided by rules from email['From'] or email['Re'] then
    2:   return $y$;
    3: else if email $\in T_i$ then
    4:   use the filter Filter_inbox;
    5:   subject-based filtering: vectorize email['Subject'] as $x^s$;
    6:   ${\rm TW}^{(N_m+1)} \leftarrow x^s$, ${\rm SW}^{(N_m+1)} \leftarrow {\rm TW}^{(1)}$;
    7:   ${\rm SW}^{(i)} \leftarrow {\rm SW}^{(i+1)}$, ${\rm TW}^{(i)} \leftarrow {\rm TW}^{(i+1)}$, $i=1,\ldots,N_m$;
    8:   if class imbalance occurs in ${\rm SW}$ then
    9:     learn the model parameters $w^i, w^j$ with MTFL;
    10:    update L;
    11:  end if
    12:  incrementally train L on ${\rm SW}^{(N_m)}$;
    13:  update the weights $\{\alpha_i\}_{i=1}^{N_m}$ of $\{{\rm SW}^{(i)}\}_{i=1}^{N_m}$;
    14:  if covariate drift is detected then
    15:    recompute $\{\alpha_i\}_{i=1}^{N_m}$;
    16:  end if
    17:  update S with the weighted $\{{\rm SW}^{(i)}\}_{i=1}^{N_m}$;
    18:  if ${\rm Err}(L) > {\rm Err}(S)$ and ${\rm Err}(L) > \rho$ then
    19:    $L \leftarrow S$;
    20:    $T_0 = T_1 - N_m$;
    21:  else
    22:    $T_1 = T_1 + 1$;
    23:  end if
    24:  $L.{\rm predict}(x^s) \rightarrow [y, {\rm confidence}]$;
    25:  if ${\rm confidence} > \xi$ then
    26:    return $y$;
    27:  else
    28:    body-based filtering: vectorize email['Body'] as $x^b$;
    29:    likewise, repeat steps 6$\sim$24;
    30:    return $y$;
    31:  end if
    32: else $\{{\rm email} \in T_j\}$
    33:  use the filter Filter_junkbox;
    34:  likewise, repeat steps 5$\sim$31;
    35: end if

    Output: the predicted label $y$ of email.
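
As a rough illustration of steps 18 to 31 of Algorithm 1, the following Python sketch shows the two decisions that do not depend on a particular learner: replacing the model L by S when L is both worse than S and worse than the threshold $\rho$, and falling back from subject-based to body-based filtering when the confidence does not exceed $\xi$. The names `PairedState`, `update_paired`, and `cascade_predict`, as well as the predictor callables, are hypothetical and not taken from the paper; training of L and S, window weighting, and drift detection are deliberately left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical interface: a predictor maps a feature vector to (label, confidence).
Predictor = Callable[[List[float]], Tuple[int, float]]


@dataclass
class PairedState:
    """Window positions of LW (illustrative container, not from the paper)."""
    t0: int  # start position T_0
    t1: int  # current position T_1


def update_paired(err_l: float, err_s: float, rho: float,
                  state: PairedState, n_m: int) -> bool:
    """Steps 18-23: return True when the caller should set L <- S."""
    if err_l > err_s and err_l > rho:
        state.t0 = state.t1 - n_m  # step 20: T_0 = T_1 - N_m
        return True                # step 19 is performed by the caller
    state.t1 += 1                  # step 22: T_1 = T_1 + 1
    return False


def cascade_predict(predict_subject: Predictor, predict_body: Predictor,
                    x_subject: List[float], x_body: List[float],
                    xi: float) -> Tuple[int, str]:
    """Steps 24-31: keep the subject-based label only if confidence > xi."""
    y, confidence = predict_subject(x_subject)
    if confidence > xi:
        return y, "subject"
    # Otherwise decide on the body features, whatever their confidence.
    y, _ = predict_body(x_body)
    return y, "body"
```

Returning a flag from `update_paired` rather than swapping models inside it keeps the sketch independent of how L and S are represented; the caller copies S into L when the flag is true.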

  • Table 1 Experimental corpora
    Corpus      Normal  Spam   Total
    TREC 2006c  21766   42854  64620
    TREC 2007p  25220   50199  75419
  • Table 2 Multi-task vs. single task $^{\rm a)}$
    Method       G-mean ($\uparrow$)  $F1$ ($\uparrow$)  ($1-$ROCA)% ($\downarrow$)  Accuracy ($\uparrow$)
    Multi-task   0.9895               0.9921             0.0104                      0.9896
    Single task  0.9703               0.9775             0.0296                      0.9705
  • Table 3 Evaluating different algorithms on TREC 2006c and TREC 2007p $^{\rm a)}$
    Corpus                TREC 2006c                                                      TREC 2007p
    Evaluation criteria   Accuracy        FPR             ($1-$ROCA)%     lam%            Accuracy        FPR             ($1-$ROCA)%     lam%
                          ($\uparrow$)    ($\downarrow$)  ($\downarrow$)  ($\downarrow$)  ($\uparrow$)    ($\downarrow$)  ($\downarrow$)  ($\downarrow$)
    DISvm [34]            0.9594          0.0107          0.0383          2.73            0.9658          0.0087          0.0321          2.10
    ROSVM [35]            0.9935          0.0036          0.0094          0.34            0.9848          0.0060          0.0108          0.86
    MLC [36]              0.9992          0.0021          0.0004          0.08            0.9855          0.0056          0.0096          0.64
    PRFC                  0.9984          0.0013          0.0025          0.12            0.9865          0.0053          0.0068          0.45
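
Several of the criteria reported in Tables 2 and 3 can be computed directly from a confusion matrix, as in the minimal sketch below. It assumes that spam is the positive class, that G-mean is the geometric mean of the true positive and true negative rates, and that lam% is the logistic average misclassification percentage used in the TREC spam track; this page does not define these criteria, so all three are assumptions. The function name `filter_metrics` and the toy counts in the usage note are illustrative. ($1-$ROCA)% is omitted because it requires ranked scores rather than counts.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability; requires 0 < p < 1."""
    return math.log(p / (1.0 - p))


def filter_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Compute count-based criteria, taking spam as the positive class (assumption)."""
    tpr = tp / (tp + fn)          # spam correctly filtered
    tnr = tn / (tn + fp)          # normal mail correctly kept
    fpr = fp / (fp + tn)          # normal mail wrongly filtered
    fnr = fn / (fn + tp)          # spam wrongly kept
    precision = tp / (tp + fp)
    f1 = 2.0 * precision * tpr / (precision + tpr)
    # Logistic average of the two error rates; undefined if either rate is 0 or 1.
    lam = 1.0 / (1.0 + math.exp(-(logit(fpr) + logit(fnr)) / 2.0))
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "fpr": fpr,
        "g_mean": math.sqrt(tpr * tnr),
        "f1": f1,
        "lam_percent": 100.0 * lam,
    }
```

For example, `filter_metrics(tp=90, fn=10, fp=5, tn=95)` gives an accuracy of 0.925 and an FPR of 0.05. Because lam% averages the two error rates on the log-odds scale, it is not dominated by the majority class, which matters for corpora such as those in Table 1 where spam outnumbers normal mail.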
