SCIENTIA SINICA Informationis, https://doi.org/10.1360/SSI-2020-0029

Analysis of COVID-19 spread characteristics and infection numbers based on the large-scale structured case data

  • Received Feb 17, 2020
  • Accepted Mar 20, 2020
  • Published May 7, 2020


In early 2020, the novel coronavirus disease COVID-19 broke out in Wuhan. The Chinese people took the most comprehensive and rigorous control measures to fight the virus. Local health authorities report infection data in a timely manner, which helps the public follow the development of the epidemic and take protective measures in advance. However, there is currently no literature that analyzes the transmission characteristics of COVID-19 using artificial intelligence and structured data from large-scale patient cases. Detailed patient case data in each region are recorded mainly as free text, and the report format differs between provinces and cities, which makes the data difficult to process. To analyze about half of the anonymous patient case data from provinces other than Hubei, we propose a method based on natural language processing to structure the case data. With the help of a pre-trained model and a small number of labelled samples, this method extracts the key information in the cases accurately and effectively. By mining the structured case data, this article analyzes in detail the gender and age distribution of patients, the main causes of infection, the characteristics of the incubation period, and the trend of the epidemic. With the help of travel big data, we also propose a method that estimates the number of people infected in Wuhan before travel restrictions were imposed on the city. This method helps people understand the real epidemic situation and take protective measures early. It also helps government departments make scientific decisions, dispatch medical staff, and allocate medical resources as early as possible.

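We thank the Health Commission of Anhui Province, the Anhui Provincial Committee of the Communist Youth League, the Health Commission of Guangxi, the Health Commission of Hunan Province, the Health Commission of Jiangsu Province, the Health Commission of Shandong Province, the Health Commission of Shaanxi Province, the Shenzhen Municipal Health Commission, the Tianjin Municipal Health Commission, and others for providing anonymous case data; the health commissions of several prefecture-level cities for their help in verifying the data; and Baidu Travel for providing travel big data. We thank the anonymous reviewers for their patient and careful guidance. We pay special tribute to the medical staff and volunteers fighting on the front line.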



  • Figure 1

    (Color online) The deep learning model used to extract patient information, including gender, age, and dates

  • Figure 2

    (Color online) An example of the structured COVID-19 case details

  • Figure 3

    (Color online) Causes of infection

  • Figure 4

    (Color online) (a) Proportion of infected “travellers to Wuhan”; (b) proportion of infected cases with “close physical contact with an infected person”

  • Figure 5

    (Color online) (a) Gender proportion of patients; (b) comparison of the patient age distribution with China's natural population distribution

  • Figure 6

    (Color online) (a) Data distribution and fitted distribution of the incubation period; (b) data distribution and fitted distribution of the medical consultation delay

  • Figure 7

    (Color online) (a) Data distribution and fitted distribution of the confirmation delay after the medical visit; (b) confirmation delay decreases with later visit dates

  • Figure 8

    (Color online) (a) Data distribution and fitted distribution of the time from infection to confirmation; (b) an overall data estimation based on early samples

  • Figure 9

    (Color online) An overall data estimation based on early samples

  • Figure 10

    (Color online) Infection roadmap for the COVID-19 case occurring at a shopping mall in Baodi District, Tianjin

  • Figure 11

    (Color online) Partial view of the dynamic travel data for Wuhan (provided by Baidu Travel)

  • Table 1   Samples of anonymous COVID-19 cases from different places
    Province/City    Case examples
    Anhui    xxx, male, 45 years old, Xiantao City, Hubei Province $\cdots$. While returning to his hometown by car from Hubei on Jan. 10th, the patient first spent 3 hours on Hanzheng Street in Wuhan, and then arrived in xxx town, Mengcheng, on the 11th. He developed a cough, mainly a dry cough, on the 24th and was transferred to the First People's Hospital of Mengcheng for treatment on the 26th.
    Guangxi    xxx, female, 24 years old, xxx from Guilin, is the wife of patient xxx, who was confirmed on Feb. 1st. On Jan. 27th, she and xxx returned to Guilin via Wuhan. On the 25th, she developed symptoms such as fever and sputum. On Feb. 1st, she was hospitalized in the Third People's Hospital. On Feb. 4th, she tested positive $\cdots$.
    Shenzhen    36 years old male patient, resident of Nanshan, Shenzhen. He drove to Ezhou, Hubei on January 20th and returned to Shenzhen on the 25th. He began to show symptoms on February 1st and was hospitalized on February 3rd. He is now in a stable condition $\cdots$.
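
To make concrete what "structuring" a free-text case report means, the following is a minimal rule-based sketch in Python. It is not the pre-trained-model approach described in the abstract: it swaps in plain regular expressions as a stand-in, and the function name `structure_case`, the field names, and the patterns are all hypothetical choices made for illustration on an English record in the style of the Shenzhen example.

```python
import re

# A shortened record in the style of the Shenzhen example (illustrative input).
SAMPLE = (
    "36 years old male patient. He drove to Ezhou, Hubei on January 20th "
    "and returned to Shenzhen on the 25th. He began to show symptoms on "
    "February 1st and was hospitalized on February 3rd."
)

def structure_case(text):
    """Extract a few key fields from a free-text case report.

    Hypothetical rule-based stand-in: each field is None when its
    pattern does not match.
    """
    age = re.search(r"(\d{1,3})\s*years?\s*old", text)
    # \b keeps "male" from matching inside "female".
    gender = re.search(r"\b(male|female)\b", text)
    onset = re.search(r"symptoms on (\w+ \d{1,2})", text)
    hospitalized = re.search(r"hospitalized on (\w+ \d{1,2})", text)
    return {
        "age": int(age.group(1)) if age else None,
        "gender": gender.group(1) if gender else None,
        "onset": onset.group(1) if onset else None,
        "hospitalized": hospitalized.group(1) if hospitalized else None,
    }

if __name__ == "__main__":
    print(structure_case(SAMPLE))
```

Hand-written patterns of this kind break as soon as the report format changes between provinces, which is the difficulty the abstract gives as motivation for a learned extractor.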

    Algorithm 1 Estimating the number of infected cases in Wuhan from early confirmed case data

    Require: Structured cases of city B; date $T_0$ on which city A starts to export cases; closing date $T_{\rm end}$; remaining population $N_{\rm A}$ of city A; daily net outflow population $N_F$; proportion $\alpha_T$ of the outflow population that travels to city B; growth rate $b$ of infection cases; parameters $\mu$ and $\delta$ of the Gaussian distribution of the time from infection to confirmation; prediction time $T_p$;

    Output: Infection rate $P_{\rm A}$ and number of infections $N_{\rm ga}$ of city A;

    Initialize current date $T \leftarrow T_0$, maximum confirmation delay $\Delta T \leftarrow 20$ days, current number of infections $N_0$, current net outflow population $N_{F0}$;

    while $T \le T_{\rm end}$ do
        Outflow cases $C \leftarrow 0$;
        for $T_i = T$ to $T + \Delta T$ do
            Compute the probability $P_{T_i}$ of confirmation on date $T_i$ from $\mu$ and $\delta$;
            Estimate the cases exported to city B that are confirmed on date $T_i$: $N_{T_i} = P_{T_i} \times N_0 \times \alpha_T \times N_{F0} / N_{\rm A}$;
            Update the confirmed cases on date $T_i$ according to $N_{T_i}$;
            Update outflow cases: $C \mathrel{+}= N_{T_i}$;
        end for
        Update infection cases in city A: $N_0 = (1+b) \times N_0 - C$;
        Update population in city A: $N_{\rm A} \mathrel{-}= N_{F0}$;
        Advance the date: $T \leftarrow T + 1$;
    end while

    Calculate the probability $P$ according to the distribution of the confirmed cases;

    Calculate $N_{\rm gb}$ and $P_{\rm B}$ based on $P$ and the structured confirmed cases of city B;

    Set $P_{\rm A} \leftarrow P_{\rm B}$ and calculate $N_{\rm ga} = N_{\rm A} \times P_{\rm A}$.
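
The forward-simulation loop of Algorithm 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: $\delta$ is treated as the standard deviation of the Gaussian, the Gaussian is discretized into one-day bins, $\alpha_T$ is held constant, dates are day indices counted from $T_0$, and the daily net outflow is passed in as a list. The names `simulate_exports` and `confirm_prob` are hypothetical. The final steps of the algorithm (computing $P$, $N_{\rm gb}$ and $P_{\rm B}$ from the confirmed cases of city B) are not specified in enough detail here to be reproduced and are left out.

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def confirm_prob(delay, mu, sigma):
    """Probability that confirmation falls `delay` days after export.

    Assumption: the continuous Gaussian is discretized into one-day bins.
    """
    return normal_cdf(delay + 0.5, mu, sigma) - normal_cdf(delay - 0.5, mu, sigma)

def simulate_exports(n_days, n_a, daily_outflow, alpha, b, mu, sigma, n0,
                     max_delay=20):
    """Simulate cases exported from city A to city B, day by day.

    Returns (expected confirmed cases in B per day index,
             remaining infections in A, remaining population of A).
    """
    confirmed = [0.0] * (n_days + max_delay + 1)
    infected = n0          # N_0: current infections in city A
    population = n_a       # N_A: current population of city A
    for t in range(n_days):                  # while T <= T_end
        outflow = daily_outflow[t]           # N_F0 for this day
        exported = 0.0                       # C
        for d in range(max_delay + 1):       # T_i = T .. T + Delta T
            n = confirm_prob(d, mu, sigma) * infected * alpha * outflow / population
            confirmed[t + d] += n
            exported += n
        infected = (1.0 + b) * infected - exported
        population -= outflow
    return confirmed, infected, population

if __name__ == "__main__":
    # Purely illustrative inputs, not values from the article.
    curve, left, pop = simulate_exports(
        n_days=3, n_a=1000.0, daily_outflow=[100.0, 100.0, 100.0],
        alpha=1.0, b=0.0, mu=5.0, sigma=2.0, n0=10.0)
    print(sum(curve), left, pop)
```

With growth rate `b = 0`, every infection either stays in city A or appears in the confirmed curve of city B, which gives a simple sanity check on the bookkeeping.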

  • Table 2   Cases floating from Wuhan to destination cities
    City       Floating population rate (%)  Floating cases  Total cases  Cases before Jan. 25th  Cases before Jan. 28th
    Hefei      0.4                           41              104          9                       16
    Fuyang     0.35                          59              105          10                      19
    Qingdao    0.12                          19              43           6                       10
    Jinan      0.15                          17              39           2                       4
    Heze       0.1                           10              13           1                       6
    Tianjin    0.15                          22              79           8                       13
    Shenzhen   1.87                          261             334          26                      61
    Nanning    0.19                          17              32           1                       6
    Guilin     0.13                          22              28           10                      15
    Beihai     0.09                          28              31           7                       13
    Zhumadian  0.66                          74              107          6                       13
    Taizhou    0.54                          76              124          21                      39
    Changde    0.33                          67              93           10                      31
    Huaihua    0.11                          17              38           5                       10
    Total      5.19                          712             1170         122                     256
  • Table 3   Estimated infection rate and number of infections in Wuhan based on data from different cities
    City       Infection rate (%)  Infection cases ($\times 10^4$)  Infection rate (%, unstructured)  Infection cases ($\times 10^4$, unstructured)
    Hefei      0.21                2.23                             0.52                              5.77
    Fuyang     0.34                3.63                             0.60                              6.61
    Qingdao    0.32                3.41                             0.72                              7.85
    Jinan      0.23                2.46                             0.52                              5.77
    Heze       0.18                1.96                             0.26                              3.01
    Tianjin    0.29                3.16                             1.05                              11.42
    Shenzhen   0.28                3.01                             0.36                              4.04
    Nanning    0.18                1.95                             0.34                              3.83
    Guilin     0.34                3.64                             0.43                              4.82
    Beihai     0.62                6.65                             0.69                              7.56
    Zhumadian  0.22                2.43                             0.32                              3.69
    Taizhou    0.28                3.04                             0.46                              5.12
    Changde    0.41                4.36                             0.56                              6.23
    Huaihua    0.31                3.33                             0.69                              7.58
    Average    0.30                3.23                             0.54                              5.95
    Median     0.29                3.16                             0.52                              5.77
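
The summary rows of Table 3 can be reproduced from the per-city values with Python's standard library, as in the sketch below (only the two structured columns are shown). The Average row matches the arithmetic mean rounded to two decimals. The Median row matches the upper median (`statistics.median_high`) of the 14 values; the conventional midpoint median would instead be 0.285 and 3.10 for these two columns. Which definition was used is not stated here, so the choice of `median_high` is an inference from the numbers. The variable names are hypothetical.

```python
from statistics import mean, median_high

# Per-city values from the structured-data columns of Table 3
# (Hefei, Fuyang, Qingdao, Jinan, Heze, Tianjin, Shenzhen, Nanning,
#  Guilin, Beihai, Zhumadian, Taizhou, Changde, Huaihua).
rate_pct = [0.21, 0.34, 0.32, 0.23, 0.18, 0.29, 0.28,
            0.18, 0.34, 0.62, 0.22, 0.28, 0.41, 0.31]
cases_1e4 = [2.23, 3.63, 3.41, 2.46, 1.96, 3.16, 3.01,
             1.95, 3.64, 6.65, 2.43, 3.04, 4.36, 3.33]

summary = {
    "rate_mean": round(mean(rate_pct), 2),
    "rate_median_high": median_high(rate_pct),
    "cases_mean": round(mean(cases_1e4), 2),
    "cases_median_high": median_high(cases_1e4),
}

if __name__ == "__main__":
    print(summary)
```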

Copyright 2020 CHINA SCIENCE PUBLISHING & MEDIA LTD. All rights reserved.
