
SCIENCE CHINA Information Sciences, Volume 60, Issue 8: 082104(2017) https://doi.org/10.1007/s11432-016-0442-1

Embedding differential privacy in decision tree algorithm with different depths

More info
  • Received: Oct 13, 2016
  • Accepted: Nov 14, 2016
  • Published: Jul 6, 2017

Abstract

Differential privacy (DP) has become one of the most important solutions for privacy protection in recent years. Previous studies have shown that prediction accuracy usually increases as more data mining (DM) logic is considered in the DP implementation. However, while embedding DP into one-step DM computation for the decision tree (DT) model has been investigated, existing research has not studied the scenarios in which DP is embedded into two-step DM computation, three-step DM computation, and so on up to whole-model DM computation. Embedding DP into more than two steps of DM computation is very challenging, since the solution space grows exponentially with the computational complexity. In this work, we propose algorithms based on the Markov chain Monte Carlo (MCMC) method, which can efficiently explore a solution space that is computationally infeasible to search exhaustively, in order to embed DP into the DT generation algorithm. We compare the performance when embedding DP in DT with different depths, i.e., one-step DM computation (previous work), two-step, three-step, and whole-model computation. We find that the deep combination of DP and DT does help to increase the prediction accuracy. However, when the privacy budget is very large (e.g., $\epsilon=10$), the budget may overwhelm the effect of the DT model's complexity, and the increasing trend is no longer obvious. We also find that prediction accuracy decreases as model complexity increases.
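
The two quantities that drive the algorithms below are the per-stage budget split from Algorithm 1 and the exponential-mechanism selection weight used in Algorithms 2 and 3, where $\Delta q'$ denotes the sensitivity of the subtree score function:

$$\epsilon' = \frac{\epsilon}{\left\lceil \frac{D}{d} \right\rceil + 1}, \qquad \Pr[\text{select } ST_i] \propto \exp\left(\frac{\epsilon' \cdot ST_i.score}{2\,\Delta q'}\right).$$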


Acknowledgment

This work was supported in part by National Natural Science Foundation of China (Grant Nos. 61525204, 61572322), Science and Technology Commission of Shanghai Municipality Project (Grant Nos. 14510722600, 16QA1402200), Aeronautical Science Foundation of China (Grant No. 20145557010), and NRF Singapore CREATE Program E2S2.

  • Figure 1

    DP implementation skeleton of the decision tree. (a) Previous work; (b) this work.

  • Figure 2

    (Color online) Embedding DP in DT with different depths. (a) One-step computation (previous work); (b) two-step computation; (c) three-step computation; (d) whole tree computation.

  • Figure 3

    (Color online) Trace of $st.score$ for $mushroom$. (a) $\epsilon = 0.01$ or 0.1; (b) $\epsilon = 1.0$ or 10.

  • Figure 4

    (Color online) Accuracy of algorithms on $vote$ with different embedding depths and different privacy budgets. (a) $\epsilon=0.01$; (b) $\epsilon=0.1$; (c) $\epsilon=1.0$; (d) $\epsilon=10$.

  • Figure 5

    (Color online) Accuracy of algorithms on $mushroom$ with different embedding depths and different privacy budgets. (a) $\epsilon=0.01$; (b) $\epsilon=0.1$; (c) $\epsilon=1.0$; (d) $\epsilon=10$.

  • Figure 6

    (Color online) Accuracy of algorithms on $tic\text{-}tac\text{-}toe$ with different embedding depths and different privacy budgets. (a) $\epsilon=0.01$; (b) $\epsilon=0.1$; (c) $\epsilon=1.0$; (d) $\epsilon=10$.

  • Figure 7

    (Color online) Comparison of $Private$-$RDT$ and our algorithms on two datasets. (a) $vote$; (b) $mushroom$.

  • Figure 8

    (Color online) Running time of algorithms on $mushroom$ with different numbers of instances and different privacy budgets. (a) $\epsilon=0.01$; (b) $\epsilon=0.1$; (c) $\epsilon=1.0$; (d) $\epsilon=10$.

  • Algorithm 1 DP embedding algorithm—DPEA($\mathcal{T}$, $A$, $C$, $D$, $d$, $\epsilon$)

    Require: $\mathcal{T}$—training dataset, $A$—attribute set, $C$—class label, $D$—maximal tree depth, $d$—steps of computation to embed DP, $\epsilon$—privacy budget;
    Output: $dt$—a decision tree;

    $\epsilon' = \frac{\epsilon}{\lceil\frac{D}{d}\rceil+1}$
    $dt$ = root
    $innerNodes$ = $\{$root$\}$, root.$stopFlag$ = False
    while $|innerNodes| > 0$ do
        $innerNode$ = pop one node from $innerNodes$
        $left$ = $D$ − (current depth of $innerNode$)
        if $left < d$ then
            if $left \leq 2$ then
                subtree $st$ = exhaustive_search_Exp($innerNode$, $\mathcal{T}$, $A$, $C$, $D$, $left$, $\epsilon'$)
            else
                subtree $st$ = mcmc_search_Exp($innerNode$, $\mathcal{T}$, $A$, $C$, $D$, $left$, $\epsilon'$)
            end if
        else
            if $d \leq 2$ then
                subtree $st$ = exhaustive_search_Exp($innerNode$, $\mathcal{T}$, $A$, $C$, $D$, $d$, $\epsilon'$)
            else
                subtree $st$ = mcmc_search_Exp($innerNode$, $\mathcal{T}$, $A$, $C$, $D$, $d$, $\epsilon'$)
            end if
        end if
        for each leaf node $ln$ in $st$ do
            if $ln.stopFlag$ = False then
                $innerNodes$ = $innerNodes \cup \{ln\}$
            end if
        end for
        extend $dt$ with $st$ from $innerNode$
    end while
    return $dt$
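
    A minimal Python sketch of this driver loop, to show only the budget split and the layer-by-layer growth. The helpers here are hypothetical, not the paper's implementation: `search_subtree` stands in for both exhaustive_search_Exp and mcmc_search_Exp, and `node.depth`, `st.leaves`, `ln.stop_flag`, and `node.attach` are assumed node/subtree accessors.

    import math

    def dpea(root, D, d, epsilon, search_subtree):
        # Per-stage budget epsilon' = epsilon / (ceil(D/d) + 1), as in Algorithm 1.
        eps_prime = epsilon / (math.ceil(D / d) + 1)
        inner_nodes = [root]
        while inner_nodes:
            node = inner_nodes.pop()
            left = D - node.depth            # depth still available below this node
            step = left if left < d else d   # shrink the final step to fit
            st = search_subtree(node, step, eps_prime)  # DP (step+1)-layer subtree
            # Leaves that did not reach a stopping condition are split later.
            inner_nodes.extend(ln for ln in st.leaves if not ln.stop_flag)
            node.attach(st)                  # extend the tree from this node
        return root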

  • Algorithm 2 Exhaustively search the subtree space—exhaustive_search_Exp($innerNode$, $\mathcal{T}$, $A$, $C$, $D$, $d$, $\epsilon'$)

    Require: $innerNode$—the node to split, $\mathcal{T}$—training dataset, $A$—attribute set, $C$—class label, $D$—maximal tree depth, $d$—steps of computation to embed DP, $\epsilon'$—privacy budget;
    Output: $st$—a subtree;

    $ST[\,]$ = all ($d$+1)-layer subtrees rooted at $innerNode$ whose attributes do not appear in $innerNode$'s ancestors
    for each $ST[i] \in ST$ do
        calculate $ST[i].score$
        $P[i] = \exp\left(\frac{\epsilon' \cdot ST[i].score}{2\Delta q'}\right)$
    end for
    $sum = \sum_{i=0}^{|P|-1} P[i]$
    for each $P[i]$ do
        $P[i] = P[i]/sum$
    end for
    $r = random(0,1)$
    $st = ST[i]$ where $r$ falls into the $i$-th interval of the cumulative distribution over $P$
    for each leaf node $ln$ in $st$ do
        if $ln$ reaches the stopping condition then
            $\forall c \in C$: $N_c$ = NoiseCount($c$ in $ln$, $\epsilon'$)
            $ln.label = \arg\max_c(N_c)$
            $ln.stopFlag$ = True
        else
            $ln.stopFlag$ = False
        end if
    end for
    return $st$
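
    The weighted draw above is the standard exponential mechanism. A self-contained sketch, with `scores` the list of $ST[i].score$ values and `sensitivity` playing the role of $\Delta q'$; the max-shift before exponentiating is a numerical-stability detail we add, not part of the paper:

    import math
    import random

    def exponential_mechanism(scores, epsilon, sensitivity):
        """Return index i with probability proportional to
        exp(epsilon * scores[i] / (2 * sensitivity))."""
        m = max(scores)  # shift by the max so exp() cannot overflow
        weights = [math.exp(epsilon * (s - m) / (2.0 * sensitivity)) for s in scores]
        r = random.uniform(0.0, sum(weights))
        cumulative = 0.0
        for i, w in enumerate(weights):
            cumulative += w
            if r <= cumulative:
                return i
        return len(scores) - 1  # guard against floating-point round-off

    Algorithm 2 then takes $st = ST[i]$ for the returned index $i$.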

  • Algorithm 3 Search the subtree space with MCMC—mcmc_search_Exp($innerNode$, $\mathcal{T}$, $A$, $C$, $D$, $d$, $\epsilon'$)

    Require: $innerNode$—the node to split, $\mathcal{T}$—training dataset, $A$—attribute set, $C$—class label, $D$—maximal tree depth, $d$—steps of computation to embed DP, $\epsilon'$—privacy budget;
    Output: $st$—a subtree;

    $st$ = a random ($d$+1)-layer subtree rooted at $innerNode$ whose attributes do not appear in $innerNode$'s ancestors
    $t = tag = 0$
    while $buffer.variance > threshold$ and $t < stepLimit$ do
        compute $st.score$
        $buffer[tag \bmod buffer.size] = st.score$, $tag = tag + 1$
        $st'$ = randomly replace one inner node in $st$
        compute $st'.score$
        $st = st'$ with probability $\min\left(1, \exp\left(\frac{\epsilon'(st'.score - st.score)}{2\Delta q'}\right)\right)$
        $t = t + 1$
    end while
    for each leaf node $ln$ in $st$ do
        if $ln$ reaches the stopping condition then
            $\forall c \in C$: $N_c$ = NoiseCount($c$ in $ln$, $\epsilon'$)
            $ln.label = \arg\max_c(N_c)$
            $ln.stopFlag$ = True
        else
            $ln.stopFlag$ = False
        end if
    end for
    return $st$
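
    A sketch of the Metropolis-style walk in Algorithm 3, under our reading that a proposal is accepted with probability $\min(1, \exp(\epsilon'(st'.score - st.score)/(2\Delta q')))$, i.e., the ratio of the two exponential-mechanism weights. Here `propose` and `score` are hypothetical callbacks (replace one inner node's attribute at random and return a modified copy, and evaluate a subtree, respectively), and `buffer_size`, `threshold`, and `step_limit` are illustrative defaults:

    import math
    import random
    from collections import deque

    def mcmc_search(st, propose, score, epsilon, sensitivity,
                    buffer_size=50, threshold=1e-3, step_limit=10000):
        buf = deque(maxlen=buffer_size)  # ring buffer of recent scores
        st_score = score(st)
        for _ in range(step_limit):
            buf.append(st_score)
            # Stop once the recent scores have (almost) stopped moving.
            if len(buf) == buffer_size:
                mean = sum(buf) / len(buf)
                if sum((x - mean) ** 2 for x in buf) / len(buf) <= threshold:
                    break
            cand = propose(st)           # must return a new subtree, not mutate st
            cand_score = score(cand)
            delta = epsilon * (cand_score - st_score) / (2.0 * sensitivity)
            # Metropolis acceptance: always move uphill, downhill with prob exp(delta).
            if delta >= 0 or random.random() < math.exp(delta):
                st, st_score = cand, cand_score
        return st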
