SCIENTIA SINICA Informationis, Volume 50 , Issue 12 : 1882(2020) https://doi.org/10.1360/SSI-2020-0029

## Analysis of COVID-19 spread characteristics and infection numbers based on large-scale structured case data

• Accepted Mar 20, 2020
• Published May 7, 2020

### Abstract

In early 2020, an outbreak of the novel coronavirus disease, COVID-19, occurred. The Chinese people took the most comprehensive and rigorous control measures to fight COVID-19. Local health departments reported infection data in a timely manner, which helped the public understand the development of the epidemic and take protective measures in advance. However, no literature has yet analyzed the transmission characteristics of COVID-19 based on structured data from large-scale patient cases and artificial intelligence. The detailed case data of patients in various regions are primarily recorded in text form, and the report formats differ across provinces and cities, which makes such data difficult to process. To analyze a large set of anonymous patient case data, we propose a method based on natural language processing technology to structure the case data. With the help of a pretrained model and a small number of labeled samples, the proposed method can extract key information from the cases accurately and effectively. By mining the structured case data, we analyze in detail the gender and age distribution of patients, the main causes of infection, the characteristics of the incubation period, and epidemic trends. Using big data on travel, we develop a method to estimate the number of infected individuals in Wuhan before the restrictions were put into effect. This method helps people understand the real epidemic situation and take early protective measures. It also helps government departments make evidence-based decisions, dispatch medical staff, and allocate medical resources as quickly as possible.

• Figure 1

(Color online) The deep learning model for extracting patient information, including gender, age, and date

• Figure 2

(Color online) An example of the structured COVID-19 case details

• Figure 3

(Color online) Causes of infection

• Figure 4

(Color online) (a) Proportion of infected "travellers to Wuhan"; (b) proportion of infected cases with "close physical contact with an infector"

• Figure 5

(Color online) (a) Gender proportion of patients; (b) comparison of the patient age distribution and China's natural population distribution

• Figure 6

(Color online) (a) Data distribution and fitted distribution of the latent period; (b) data distribution and fitted distribution of the medical consultation delay

• Figure 7

(Color online) (a) Data distribution and fitted distribution of the confirmation delay after the medical visit; (b) the confirmation delay decreases with later visit dates

• Figure 8

(Color online) (a) Data distribution and fitted distribution of the time from infection to confirmation; (b) an overall data estimation based on early samples

• Figure 9

(Color online) An overall data estimation based on early samples

• Figure 10

(Color online) Infection roadmap for the COVID-19 case occurring at a shopping mall in Baodi District, Tianjin

• Figure 11

(Color online) Partial sketch of dynamic Wuhan traveling data (provided by Baidu Travel)

• Table 1   Samples of anonymous COVID-19 cases from different places

| Province/City | Case examples |
| --- | --- |
| Anhui | xxx, male, 45 years old, Xiantao City, Hubei Province $\cdots$. When the patient returned to his hometown by car from Hubei on Jan. 10th, he first spent 3 hours on Wuhan Hanzheng Street, and then returned to xxx town of Mengcheng on the 11th. He began to cough, mainly a dry cough, on the 24th, and was transferred to the First People's Hospital of Mengcheng for treatment on the 26th. |
| Guangxi | xxx, female, 24 years old, xxx from Guilin, is the wife of patient xxx, who was confirmed on Feb. 1st. On Jan. 27th, she and xxx returned to Guilin via Wuhan. On the 25th, she showed symptoms such as fever and sputum. On Feb. 1st, she was hospitalized in the Third People's Hospital. On Feb. 4th, she tested positive $\cdots$. |
| Shenzhen | 36-year-old male patient, resident in Shenzhen Nanshan. He drove to Ezhou, Hubei on January 20th and returned to Shenzhen on the 25th. He began to show symptoms on February 1st and was hospitalized on February 3rd. He is now in a stable condition $\cdots$. |
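To make the idea of structuring such free-text cases concrete, the following is a minimal sketch that pulls gender and age out of shortened English excerpts in the style of Table 1. It uses plain regular expressions, which is a deliberately simpler stand-in and not the pretrained-model approach the paper describes. The names `CaseFields`, `GENDER_RE`, `AGE_RE`, and `extract_fields` are hypothetical and introduced only for illustration.

```python
import re
from typing import Optional, TypedDict


class CaseFields(TypedDict):
    """Hypothetical container for the fields extracted from one case report."""
    gender: Optional[str]
    age: Optional[int]


# "female" is listed first for readability; the word boundaries already
# prevent "male" from matching inside "female".
GENDER_RE = re.compile(r"\b(female|male)\b", re.IGNORECASE)
# Accepts both "45 years old" and "36-year-old".
AGE_RE = re.compile(r"\b(\d{1,3})[\s-]*years?[\s-]*old\b", re.IGNORECASE)


def extract_fields(text: str) -> CaseFields:
    """Return the first gender and age mentions found in the text, if any."""
    gender = GENDER_RE.search(text)
    age = AGE_RE.search(text)
    return {
        "gender": gender.group(1).lower() if gender else None,
        "age": int(age.group(1)) if age else None,
    }


if __name__ == "__main__":
    samples = [
        "xxx, male, 45 years old, Xiantao City, Hubei Province",
        "xxx, female, 24 years old, xxx from Guilin",
        "36 years old male patient, resident in Shenzhen Nanshan",
    ]
    for sample in samples:
        print(extract_fields(sample))
```

Rule-based patterns like these break down quickly when report formats vary across provinces and cities, which is the motivation the abstract gives for a learned extraction model.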
• Algorithm 1   Estimating the number of infected cases in Wuhan based on early confirmed case data

Require: Structured cases of city B; date $T_0$ on which city A starts to export cases; closing date $T_{\rm end}$; remaining population $N_{\rm A}$ of city A; daily net outflow population $N_F$; proportion $\alpha_T$ of the outflow population travelling to city B; growth rate $b$ of infection cases; parameters $\mu$ and $\delta$ of the Gaussian distribution of the time from infection to confirmation; prediction time $T_p$;

Output: Infection rate $P_{\rm A}$ and infection quantity $N_{\rm ga}$ of city A;

Current date $T \leftarrow T_0$; maximum confirmation delay $\Delta T \leftarrow 20$ days; current infection quantity $N_0$; current net outflow population $N_{F0}$;

while $T \le T_{\rm end}$ do

Outflow cases $C \leftarrow 0$;

for $T_i = T$ to $T + \Delta T$ do

Calculate the probability $P_{T_i}$ of confirmation on date $T_i$ according to $\mu$ and $\delta$;

Estimate the cases exported to city B that are confirmed on date $T_i$: $N_{T_i} = P_{T_i} \times N_0 \times \alpha_T \times N_{F0} / N_{\rm A}$;

Update the confirmed cases on date $T_i$ according to $N_{T_i}$;

Update the outflow cases: $C \mathrel{+}= N_{T_i}$;

end for

Update the infection cases in city A: $N_0 = (1 + b) \times N_0 - C$;

Update the population of city A: $N_{\rm A} \mathrel{-}= N_{F0}$;

Advance the current date: $T \leftarrow T + 1$;

end while

Calculate the probability $P$ according to the distribution of the confirmed cases;

Calculate $N_{\rm gb}$ and $P_{\rm B}$ based on $P$ and the structured confirmed cases of city B;

Calculate $P_{\rm A} \leftarrow P_{\rm B}$ and $N_{\rm ga} = N_{\rm A} \times P_{\rm A}$.
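The forward loop of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: dates are integer day indices, the date advances by one day per iteration, the daily net outflow `n_f` is held constant, and $P_{T_i}$ is taken as the Gaussian probability mass of the delay $T_i - T$ over a one-day bin. All function and parameter names are hypothetical. The final calibration steps against the confirmed cases of city B are not covered.

```python
import math
from collections import defaultdict


def gaussian_cdf(x: float, mu: float, delta: float) -> float:
    """Cumulative distribution function of a Gaussian with mean mu and spread delta."""
    return 0.5 * (1.0 + math.erf((x - mu) / (delta * math.sqrt(2.0))))


def confirm_probability(delay_days: int, mu: float, delta: float) -> float:
    """Assumed discretisation: mass of the delay in [delay - 0.5, delay + 0.5)."""
    upper = gaussian_cdf(delay_days + 0.5, mu, delta)
    lower = gaussian_cdf(delay_days - 0.5, mu, delta)
    return upper - lower


def simulate_exported_cases(t0, t_end, n0, n_a, n_f, alpha, b, mu, delta, max_delay=20):
    """Run the while/for loop of Algorithm 1.

    Returns (expected confirmed cases in city B keyed by day,
             remaining infections in city A, remaining population of city A).
    """
    confirmed = defaultdict(float)
    t = t0
    while t <= t_end:
        outflow_cases = 0.0  # C in Algorithm 1
        for t_i in range(t, t + max_delay + 1):
            p = confirm_probability(t_i - t, mu, delta)
            n_ti = p * n0 * alpha * n_f / n_a  # N_{T_i}
            confirmed[t_i] += n_ti
            outflow_cases += n_ti
        n0 = (1.0 + b) * n0 - outflow_cases  # infections grow, exported cases leave
        n_a -= n_f                           # population shrinks by the daily outflow
        t += 1                               # assumption: one-day time step
    return dict(confirmed), n0, n_a


if __name__ == "__main__":
    # Purely illustrative inputs, not values from the paper.
    by_day, n0_left, n_a_left = simulate_exported_cases(
        t0=0, t_end=4, n0=100.0, n_a=1000.0, n_f=10.0,
        alpha=0.5, b=0.1, mu=5.0, delta=2.0,
    )
    print(sorted(by_day.items())[:5], n0_left, n_a_left)
```

The returned per-day curve is what would be compared with the structured confirmed cases of city B in the remaining steps of the algorithm.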

• Table 2   Cases floating from Wuhan to destination cities

| City | Floating population rate (%) | Floating cases | Total cases | Cases before Jan. 25th | Cases before Jan. 28th |
| --- | --- | --- | --- | --- | --- |
| Hefei | 0.4 | 41 | 104 | 9 | 16 |
| Fuyang | 0.35 | 59 | 105 | 10 | 19 |
| Qingdao | 0.12 | 19 | 43 | 6 | 10 |
| Jinan | 0.15 | 17 | 39 | 2 | 4 |
| Heze | 0.1 | 10 | 13 | 1 | 6 |
| Tianjin | 0.15 | 22 | 79 | 8 | 13 |
| Shenzhen | 1.87 | 261 | 334 | 26 | 61 |
| Nanning | 0.19 | 17 | 32 | 1 | 6 |
| Guilin | 0.13 | 22 | 28 | 10 | 15 |
| Beihai | 0.09 | 28 | 31 | 7 | 13 |
| Zhumadian | 0.66 | 74 | 107 | 6 | 13 |
| Taizhou | 0.54 | 76 | 124 | 21 | 39 |
| Changde | 0.33 | 67 | 93 | 10 | 31 |
| Huaihua | 0.11 | 17 | 38 | 5 | 10 |
| Total | 5.19 | 712 | 1170 | 122 | 256 |
• Table 3   Estimated infection rate and infection cases in Wuhan based on data from different cities

| City | Infection rate (%) | Infection cases ($\times 10^4$) | Infection rate (%, unstructured) | Infection cases ($\times 10^4$, unstructured) |
| --- | --- | --- | --- | --- |
| Hefei | 0.21 | 2.23 | 0.52 | 5.77 |
| Fuyang | 0.34 | 3.63 | 0.60 | 6.61 |
| Qingdao | 0.32 | 3.41 | 0.72 | 7.85 |
| Jinan | 0.23 | 2.46 | 0.52 | 5.77 |
| Heze | 0.18 | 1.96 | 0.26 | 3.01 |
| Tianjin | 0.29 | 3.16 | 1.05 | 11.42 |
| Shenzhen | 0.28 | 3.01 | 0.36 | 4.04 |
| Nanning | 0.18 | 1.95 | 0.34 | 3.83 |
| Guilin | 0.34 | 3.64 | 0.43 | 4.82 |
| Beihai | 0.62 | 6.65 | 0.69 | 7.56 |
| Zhumadian | 0.22 | 2.43 | 0.32 | 3.69 |
| Taizhou | 0.28 | 3.04 | 0.46 | 5.12 |
| Changde | 0.41 | 4.36 | 0.56 | 6.23 |
| Huaihua | 0.31 | 3.33 | 0.69 | 7.58 |
| Average | 0.30 | 3.23 | 0.54 | 5.95 |
| Median | 0.29 | 3.16 | 0.52 | 5.77 |
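As a small arithmetic check, the per-city rows of Table 3 can be averaged column by column. The sketch below copies the 14 city rows from the table and recomputes the "Average" row. The names `TABLE_3` and `column_averages` are hypothetical, and the code is only a sanity check on the published figures, not part of the estimation method.

```python
from statistics import mean

# Per-city rows copied from Table 3, in column order:
# (infection rate %, infection cases x10^4,
#  infection rate % unstructured, infection cases x10^4 unstructured)
TABLE_3 = {
    "Hefei": (0.21, 2.23, 0.52, 5.77),
    "Fuyang": (0.34, 3.63, 0.60, 6.61),
    "Qingdao": (0.32, 3.41, 0.72, 7.85),
    "Jinan": (0.23, 2.46, 0.52, 5.77),
    "Heze": (0.18, 1.96, 0.26, 3.01),
    "Tianjin": (0.29, 3.16, 1.05, 11.42),
    "Shenzhen": (0.28, 3.01, 0.36, 4.04),
    "Nanning": (0.18, 1.95, 0.34, 3.83),
    "Guilin": (0.34, 3.64, 0.43, 4.82),
    "Beihai": (0.62, 6.65, 0.69, 7.56),
    "Zhumadian": (0.22, 2.43, 0.32, 3.69),
    "Taizhou": (0.28, 3.04, 0.46, 5.12),
    "Changde": (0.41, 4.36, 0.56, 6.23),
    "Huaihua": (0.31, 3.33, 0.69, 7.58),
}


def column_averages(rows):
    """Average each column over all cities, rounded to two decimals."""
    columns = zip(*rows.values())
    return tuple(round(mean(col), 2) for col in columns)


averages = column_averages(TABLE_3)
print(averages)
```

The structured estimates average roughly half of the unstructured ones, which is the contrast the two pairs of columns are meant to show.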
