SCIENCE CHINA Information Sciences, Volume 63, Issue 3: 132104(2020) https://doi.org/10.1007/s11432-018-9839-6

Static tainting extraction approach based on information flow graph for personally identifiable information

  • Received Sep 26, 2018
  • Accepted Mar 22, 2019
  • Published Feb 10, 2020


Personally identifiable information (PII) is widely used in applications such as network privacy leak detection, network forensics, and user profiling. Internet service providers (ISPs) and administrators are usually concerned with whether PII has been extracted during network transmission. However, most studies have focused on extraction on the client side and the server side. This study proposes a static tainting extraction approach that automatically extracts PII from large-scale, ISP-level network traffic without requiring manual work or feedback, and without deploying any additional applications on the client side. The information flow graph is drawn via a tainting process that involves two steps: inter-domain routing and intra-domain infection, the latter of which contains a constraint function (CF) to limit “over-tainting”. Compared with the existing semantic-based approach that uses network traffic from the ISP, the proposed approach performs better, with 92.37% precision and 94.04% recall. Furthermore, three methods that reduce the computing time and the memory overhead are presented herein. The number of rounds is reduced to 0.0883%, and the execution time overhead is reduced to 0.0153% of the original approach.


  • Figure 1

    Data flow diagrams of the static tainting extraction approach.

  • Figure 2

    Information flow graph.

  • Table 1   Domain-value data table
    Domain Value
    $\mathrm{SK}_{1}$ $v_{1}$ $v_{2}$
    $\mathrm{SK}_{2}$ $v_{1}$ $v_{3}$ $v_{4}$
    $\mathrm{SK}_{3}$ $v_{3}$ $v_{7}$
    $\mathrm{SK}_{4}$ $v_{7}$ $v_{8}$ $v_{9}$ $v_{10}$ $v_{11}$
    $\mathrm{SK}_{5}$ $v_{10}$ $v_{12}$ $v_{13}$
    $\mathrm{SK}_{6}$ $v_{2}$ $v_{12}$ $v_{14}$
    $\mathrm{SK}_{7}$ $v_{5}$ $v_{9}$ $v_{14}$ $v_{15}$
    $\mathrm{SK}_{8}$ $v_{8}$ $v_{11}$ $v_{13}$
    $\mathrm{SK}_{9}$ $v_{5}$ $v_{6}$ $v_{8}$
    $\vdots$ $\vdots$
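
The link between the domain-value table and the information flow graph can be illustrated with a minimal sketch: two SKs are connected whenever they carry at least one common value. The code below transcribes only the nine rows printed in Table 1; the names `DOMAIN_VALUES` and `shared_value_edges` are hypothetical and do not come from the paper, and the pairwise comparison is an illustrative choice rather than the authors' implementation.

```python
from itertools import combinations

# Rows of Table 1: each service key (SK) maps to the values observed with it.
DOMAIN_VALUES = {
    "SK1": {"v1", "v2"},
    "SK2": {"v1", "v3", "v4"},
    "SK3": {"v3", "v7"},
    "SK4": {"v7", "v8", "v9", "v10", "v11"},
    "SK5": {"v10", "v12", "v13"},
    "SK6": {"v2", "v12", "v14"},
    "SK7": {"v5", "v9", "v14", "v15"},
    "SK8": {"v8", "v11", "v13"},
    "SK9": {"v5", "v6", "v8"},
}


def shared_value_edges(domain_values):
    """Return {(sk_a, sk_b): shared values} for every SK pair that overlaps."""
    edges = {}
    for sk_a, sk_b in combinations(sorted(domain_values), 2):
        shared = domain_values[sk_a] & domain_values[sk_b]
        if shared:  # keep only pairs with at least one common value
            edges[(sk_a, sk_b)] = shared
    return edges


EDGES = shared_value_edges(DOMAIN_VALUES)
for (sk_a, sk_b), shared in EDGES.items():
    print(sk_a, sk_b, sorted(shared))
```

Comparing every pair of SKs is quadratic in the number of SKs, which is acceptable for a toy table but would presumably need an inverted index from value to SKs at ISP scale.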

    Algorithm 1 Static tainting extraction algorithm

    Require: Tainting-value, DataSet, CONDITION_RULES(tainting-value);
    Output: ValueList = $\emptyset$, SKList = $\emptyset$;
    ValueList $\Leftarrow$ tainting-value;
    // Select a tainting-value as input and put it in the ValueList.
    Constraint function $\Leftarrow$ CONDITION_RULES(tainting-value);
    // Choose an appropriate constraint function for each type of tainting-value.
    for Value $\gets$ each value of ValueList
      // Beginning of inter-domain routing.
      TempSKList = $\emptyset$;
      // Search each line of the data set.
      for Line $\gets$ each line of DataSet
        if (Value $\in$ LineValue) AND (LineSK $\notin$ TempSKList) then
          // Put the new SK in the TempSKList if it includes the shared value.
          TempSKList $\Leftarrow$ LineSK;
        end if
      end for
      for SK $\gets$ each value of TempSKList
        // Beginning of intra-domain infection.
        if SK $\notin$ SKList then
          SKList $\Leftarrow$ SK;
          TempValueList = $\emptyset$;
          for Line $\gets$ each line of DataSet
            // Put all values belonging to the SK in the TempValueList if they conform to the rule of the constraint function.
            if (SK = LineSK) AND (CONDITION(LineValue) = constraint function) then
              TempValueList $\Leftarrow$ LineValue;
            end if
          end for
        end if
      end for
    end for
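
The two tainting steps of Algorithm 1 can be sketched in Python as a worklist over (SK, value) records. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `static_taint`, the record format, and the toy constraint function are hypothetical, and the step that feeds accepted values back into the value list is an assumption, since the printed pseudocode does not show it. The data are the nine rows of Table 1.

```python
# Table 1 flattened into (SK, value) records, one per "line" of the data set.
TABLE_1 = {
    "SK1": ["v1", "v2"],
    "SK2": ["v1", "v3", "v4"],
    "SK3": ["v3", "v7"],
    "SK4": ["v7", "v8", "v9", "v10", "v11"],
    "SK5": ["v10", "v12", "v13"],
    "SK6": ["v2", "v12", "v14"],
    "SK7": ["v5", "v9", "v14", "v15"],
    "SK8": ["v8", "v11", "v13"],
    "SK9": ["v5", "v6", "v8"],
}
DATASET = [(sk, value) for sk, values in TABLE_1.items() for value in values]


def static_taint(tainting_value, dataset, constraint=lambda value: True):
    """Spread taint from one seed value; return (value_list, sk_list)."""
    dataset = list(dataset)
    value_list = [tainting_value]
    sk_list = []
    for value in value_list:  # the list grows while it is being traversed
        # Inter-domain routing: collect every SK that carries this value.
        temp_sk_list = []
        for line_sk, line_value in dataset:
            if line_value == value and line_sk not in temp_sk_list:
                temp_sk_list.append(line_sk)
        # Intra-domain infection: taint the values of each newly reached SK.
        for sk in temp_sk_list:
            if sk in sk_list:
                continue
            sk_list.append(sk)
            temp_value_list = [v for s, v in dataset if s == sk and constraint(v)]
            # Assumption: accepted values rejoin the worklist (not in the source).
            for v in temp_value_list:
                if v not in value_list:
                    value_list.append(v)
    return value_list, sk_list


# Without a constraint function the taint floods the whole toy graph.
ALL_VALUES, ALL_SKS = static_taint("v1", DATASET)

# Hypothetical CF accepting three values, standing in for a real type rule.
ALLOWED = {"v1", "v2", "v12"}
CF_VALUES, CF_SKS = static_taint("v1", DATASET, constraint=lambda v: v in ALLOWED)
print(len(ALL_VALUES), len(ALL_SKS), CF_VALUES, CF_SKS)
```

In this toy run the unconstrained call reaches every SK and value, which is the over-tainting that the CF is meant to limit. A realistic CF would test the value type, for example the 15-digit check listed for IMEI in Table 4.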

  • Table 2   Shared-value adjacency matrix
    Domain $\mathrm{SK}_{1}$ $\mathrm{SK}_{2}$ $\mathrm{SK}_{3}$ $\mathrm{SK}_{4}$ $\mathrm{SK}_{5}$ $\mathrm{SK}_{6}$ $\mathrm{SK}_{7}$ $\mathrm{SK}_{8}$ $\mathrm{SK}_{9}$ $\mathrm{SK}_{n}$
    $\mathrm{SK}_{1}$ $v_{1}$ $v_{2}$
    $\mathrm{SK}_{2}$ $v_{1}$ $v_{2}$ $v_{4}$
    $\mathrm{SK}_{3}$ $v_{3}$ $v_{7}$ $v_{6}$
    $\mathrm{SK}_{4}$ $v_{7}$ $v_{10}$ $v_{9}$ $v_{11}$ $v_{8}$
    $\mathrm{SK}_{5}$ $v_{10}$ $v_{12}$ $v_{13}$
    $\mathrm{SK}_{6}$ $v_{2}$ $v_{12}$ $v_{14}$
    $\mathrm{SK}_{7}$ $v_{5}$ $v_{9}$ $v_{14}$ $v_{5}$
    $\mathrm{SK}_{8}$ $v_{11}$ $v_{13}$
    $\mathrm{SK}_{9}$ $v_{4}$ $v_{6}$ $v_{8}$ $v_{5}$
    $\vdots$ $\vdots$
  • Table 3   Feature extraction
    Service Key Value
    mcgi.v.qq.com cmd 51
    mcgi.v.qq.com app_version_name 6.5.3
    mcgi.v.qq.com app_version_build 0
    mcgi.v.qq.com so_name p2p
    mcgi.v.qq.com so_ver V0.0.0.0
    mcgi.v.qq.com app_id 248
    mcgi.v.qq.com sdk_version V4.1.248.1730
    mcgi.v.qq.com imei 868129022933673
    mcgi.v.qq.com imsi 460023918121329
    mcgi.v.qq.com mac ec:df:3a:f3:50:66
    mcgi.v.qq.com numofcpucore 8
    mcgi.v.qq.com cpufreq 1363
    mcgi.v.qq.com null cpua
  • Table 4   Rules of the baseline method
    Category Type Rules (k-s: key semantics; reg: regular expression)
    User identifiers User name/id, nick name k-s: substr. of user name/id, nick, login, or equal to “id” or “name”
    Password k-s: substr. of password, or equal to “pwd”
    Email reg: ^[-_\w\.]{0,64}@{1}([-\w]{1,63}\.)*[-\w]{1,63}
    Device identifiers IMEI reg: value.length = 15 and value.isdigit()
    MAC address reg: ^([0-9a-fA-F]{2})?[-:]([0-9a-fA-F]{2}){5}'
    IDFA reg: ^([0-9a-fA-F]{8}((-[0-9a-fA-F]{4}){3})-[0-9a-fA-F]{12}
    Contact information Phone number reg: ^1[3458]\d{9})
    Location GPS, latitude and longitude reg: ^-?((01?[0-7]?[0-9]?)(([.][0-9]{1,6}?)180(([.][0]{1,6}?)) and k-s: substr. of lng, loc, long, loc, or equal to “x” or “y”
  • Table 5   Comparison between the proposed approach and the baseline method
    Type Baseline (total extracted) TP FP FN P R F1 Taint (total extracted) TP FP FN P R F1
    IMEI value 16559 6519 10040 0 0.3937 1.0000 0.5650 6798 6219 579 300 0.9148 0.9540 0.9340
    IMEI SK 4650 3025 1625 0 0.6505 1.0000 0.7883 3045 3009 36 16 0.9882 0.9947 0.9914
    Mac value 95703 5822 89881 0 0.0608 1.0000 0.1147 4925 4799 126 1023 0.9744 0.8243 0.8931
    Mac SK 5329 1024 4305 0 0.1922 1.0000 0.3224 904 834 70 190 0.9226 0.8145 0.8651
    IDFA value 115892 15432 100460 0 0.1332 1.0000 0.2350 13044 13044 0 2388 1.0000 0.8453 0.9161
    IDFA SK 3708 1876 1832 0 0.5059 1.0000 0.6719 1517 1517 0 359 1.0000 0.8086 0.8942
    Phone value 36680 849 35831 0 0.0231 1.0000 0.0452 541 539 2 310 0.9963 0.6349 0.7755
    Phone SK 1434 411 1023 0 0.2866 1.0000 0.4455 185 183 2 228 0.9892 0.4453 0.6141
    Email value 25208 443 24765 0 0.0176 1.0000 0.0345 191 191 0 252 1.0000 0.4312 0.6025
    Email SK 1850 223 1627 0 0.1205 1.0000 0.2151 141 141 0 82 1.0000 0.6323 0.7747
    Location value 15917 9761 6156 224 0.6132 0.9776 0.7537 12788 9719 3069 266 0.7600 0.9734 0.8536
    Location SK 770 642 128 170 0.8338 0.7906 0.8116 678 610 68 202 0.8997 0.7512 0.8188
    Name value 315206 214580 100626 0 0.6808 1.0000 0.8101 220385 204792 15593 9788 0.9292 0.9544 0.9416
    Name SK 8046 4190 3856 0 0.5208 1.0000 0.6849 4495 3932 563 258 0.8747 0.9384 0.9055
    Password value 904 631 273 137 0.6980 0.8216 0.7548 1104 575 529 193 0.5208 0.7487 0.6143
    Password SK 225 208 17 30 0.9244 0.8739 0.8985 254 223 31 15 0.8780 0.9370 0.9065
    Total 648081 265636 382445 561 0.4099 0.9979 0.5811 270995 250327 20668 15870 0.9237 0.9404 0.9320
  • Table 6   Sample sizes of the value types using CF-first
    Category Types Sample size (terms) Proportion (%)
    User identifiers User name/id, nick name 322754 0.9576
    Password 1118 0.0033
    Email 30944 0.0918
    Device identifiers IMEI 107077 0.3177
    MAC address 145472 0.4316
    IDFA 145788 0.4326
    Contact information Phone number 40060 0.1189
    Location GPS, latitude and longitude 24992 0.0742
    Total 818295 2.4279
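
The P, R, and F1 columns of Table 5 follow from the TP, FP, and FN counts through the standard definitions of precision, recall, and F1. The short check below recomputes the Total row for both methods; the function name `precision_recall_f1` is a hypothetical helper, and only the counts are taken from the table.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions: P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from the "Total" row of Table 5.
TOTAL_BASELINE = precision_recall_f1(tp=265636, fp=382445, fn=561)
TOTAL_TAINT = precision_recall_f1(tp=250327, fp=20668, fn=15870)

for name, scores in (("baseline", TOTAL_BASELINE), ("taint", TOTAL_TAINT)):
    print(name, [round(score, 4) for score in scores])
```

Rounded to four decimals, the taint row reproduces the 92.37% precision and 94.04% recall quoted in the abstract.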

Copyright 2020 Science China Press Co., Ltd. All rights reserved.