
SCIENCE CHINA Information Sciences, Volume 62, Issue 7: 072101 (2019). https://doi.org/10.1007/s11432-017-9295-4

LCCFS: a lightweight distributed file system for cloud computing without journaling and metadata services

  • Received: May 15, 2017
  • Accepted: Nov 23, 2017
  • Published: Apr 4, 2019

Abstract

The major use of a file system integrated with a cloud computing platform is to provide storage for VM (virtual machine) instances. Distributed file systems, especially those implemented on top of object storage, have many potential advantages over traditional local file systems for VM instance storage. In this paper, we investigate the requirements that the cloud computing scenario imposes on a file system and argue that the implementation of a file system for VM instance storage can be reasonably simplified. We demonstrate that, on top of an object store with simple object-granularity transaction support, a lightweight distributed file system that requires neither journaling nor dedicated metadata services can be built for cloud computing. We have implemented such a distributed file system, called LCCFS (lightweight cloud computing file system), on the RADOS (reliable autonomic distributed object storage) object store. Our experimental results show that for the main workloads in cloud computing, LCCFS achieves almost the same or slightly higher performance than CephFS (Ceph file system), another published distributed file system based on RADOS. Compared with CephFS, LCCFS has only one tenth of its LOCs (lines of code). This simplicity makes it easy to implement LCCFS correctly and stably, avoiding the sheer design and implementation complexity behind CephFS and making LCCFS a promising candidate for cloud computing production environments.


Acknowledgment

This work was supported by National Natural Science Foundation of China (Grant No. 61370018).



  • Figure 1

    The integration of LCCFS with OpenStack.

  • Figure 2

    A namespace example.

  • Figure 3

    The usage of LCCFS.

  • Figure 4

    Performance speedups of LCCFS over CephFS under three different record sizes for (a) sequential read, (b) random read, (c) sequential write, and (d) random write.

  • Figure 5

    Performance speedups of LCCFS over CephFS for read and write measured by dd.

  • Figure 6

    The time of directory creation and deletion for LCCFS and CephFS.

  • Figure 7

    Normalized IOPS under random (a) read and (b) write with a record size of 4 KB. Normalized bandwidth under sequential (c) read and (d) write with a record size of 4 MB. Normalized bandwidth under random (e) read and (f) write with a record size of 4 MB.

  • Algorithm 1 Create a file

    procedure create
    Require: inodeno_t pino, const char *name
        Generate a new inode number ino;
        Insert a reclaim item (pino, ino);
        begin transaction
            Create an object Inode.(ino);
            Inode.(ino)->pino = pino;
        end transaction
        begin transaction
            exist = entry_exist(Inode.(pino), name);
            if exist = FALSE then
                Insert an entry name in the object Inode.(pino);
            end if
        end transaction
        if exist = TRUE then
            begin transaction
                Delete the object Inode.(ino);
            end transaction
        end if
    end procedure
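
    To make the flow concrete, here is a minimal Python sketch of Algorithm 1, assuming a plain in-memory dict stands in for the object store and each helper stands in for one per-object transaction; the names (store, reclaim_log, inode_name) and the inode allocator are illustrative only, not part of the LCCFS code.

    import itertools

    # Illustrative in-memory stand-in for the object store: one dict per object,
    # keyed by object name. A real deployment would use per-object transactions
    # of the underlying object store; here each step mutates one object at a time.
    store = {}
    reclaim_log = []                 # stand-in for the Reclaim.* objects
    _next_ino = itertools.count(2)   # ino 1 is reserved for the root directory

    def inode_name(ino):
        return "Inode.%d" % ino

    def entry_exist(pino, name):
        return name in store[inode_name(pino)]["entries"]

    def create(pino, name):
        """Algorithm 1: create a file under directory pino."""
        ino = next(_next_ino)
        # Log (pino, ino) first, so a crash between the two object updates
        # leaves at most an orphan inode object for the reclaim procedure.
        reclaim_log.append((pino, ino))
        # Transaction 1: create the new inode object and record its parent.
        store[inode_name(ino)] = {"pino": pino, "type": "FILE",
                                  "size": 0, "entries": {}}
        # Transaction 2: insert the directory entry into the parent inode object.
        exist = entry_exist(pino, name)
        if not exist:
            store[inode_name(pino)]["entries"][name] = ino
        # Roll back the new inode object if the name was already taken.
        if exist:
            del store[inode_name(ino)]
            return None
        return ino

    # Usage: create the root directory object, then a file under it.
    store[inode_name(1)] = {"pino": 1, "type": "DIRECTORY", "size": 0, "entries": {}}
    print(create(1, "disk.img"))    # -> 2
    print(create(1, "disk.img"))    # -> None (name already exists)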

  • Algorithm 2 Remove a file

    procedure unlink
    Require: inodeno_t pino, inodeno_t ino
        Insert a reclaim item (pino, ino);
        begin transaction
            if entry_exist(Inode.(pino), ino) = TRUE then
                Remove the entry ino from the object Inode.(pino);
            end if
        end transaction
    end procedure
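
    A matching sketch of Algorithm 2, under the same illustrative in-memory layout (the dict literal below is hypothetical): unlink only logs a reclaim item and drops the directory entry, deferring deletion of the inode and data objects to the reclaim procedure, so no journal is required.

    # Two illustrative objects: a directory inode (ino 1) and a file inode (ino 2).
    store = {
        "Inode.1": {"pino": 1, "type": "DIRECTORY", "entries": {"disk.img": 2}},
        "Inode.2": {"pino": 1, "type": "FILE", "size": 0, "entries": {}},
    }
    reclaim_log = []

    def unlink(pino, ino):
        """Algorithm 2: remove a file; its objects are reclaimed lazily."""
        reclaim_log.append((pino, ino))            # log first, as in create
        parent = store["Inode.%d" % pino]
        # Single transaction on the parent: drop the entry that points at ino.
        for name, child in list(parent["entries"].items()):
            if child == ino:
                del parent["entries"][name]

    unlink(1, 2)
    print(store["Inode.1"]["entries"], reclaim_log)   # -> {} [(1, 2)]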

  • Algorithm 3 Rename a file

    procedure rename
    Require: inodeno_t pino, const char *oldname, const char *newname
        begin transaction
            if entry_exist(Inode.(pino), oldname) = TRUE then
                if entry_exist(Inode.(pino), newname) = FALSE then
                    Remove the entry oldname from the object Inode.(pino);
                    Insert an entry newname to the object Inode.(pino);
                end if
            end if
        end transaction
    end procedure
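
    Algorithm 3 touches only the parent inode object, so the whole rename fits in one per-object transaction; the short sketch below, using the same hypothetical dict layout, shows the two guard checks.

    store = {"Inode.1": {"pino": 1, "type": "DIRECTORY",
                         "entries": {"old.img": 2, "taken.img": 3}}}

    def rename(pino, oldname, newname):
        """Algorithm 3: rename within a directory, one transaction on Inode.(pino)."""
        entries = store["Inode.%d" % pino]["entries"]
        if oldname in entries and newname not in entries:
            entries[newname] = entries.pop(oldname)
            return True
        return False

    print(rename(1, "old.img", "new.img"))     # -> True
    print(rename(1, "new.img", "taken.img"))   # -> False: target name already exists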

  • Algorithm 4 Add reclaim item

    procedure add_reclaim_item
    Require: (pino, ino)
    again:
        wp = Reclaim.counter->wp;
        if object_exist(Reclaim.(wp)) = FALSE then
            begin transaction
                if create_object(Reclaim.(wp)) = SUCCESS then
                    Reclaim.(wp)->state = WRITE;
                    Reclaim.(wp)->timestamp = current_time;
                end if
            end transaction
        end if
        begin transaction
            if Reclaim.(wp)->state = WRITE and Reclaim.(wp) is not full then
                Insert (pino, ino) into Reclaim.(wp);
                Reclaim.(wp)->timestamp = current_time;
                return;
            end if
        end transaction
        wp++;
        begin transaction
            if Reclaim.counter->wp < wp then
                Reclaim.counter->wp++;
            end if
        end transaction
        goto again;
    end procedure
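
    A sketch of Algorithm 4 under stated assumptions: reclaim objects are modeled as fixed-capacity Python lists with a state and a timestamp, and the shared write pointer lives in a small counter dict; RECLAIM_CAPACITY and the WRITE/READ constants are illustrative values, not taken from LCCFS.

    import time

    RECLAIM_CAPACITY = 4          # illustrative: items per Reclaim.<wp> object
    WRITE, READ = "WRITE", "READ"

    reclaim = {}                  # "Reclaim.<n>" -> {"state", "timestamp", "items"}
    counter = {"wp": 0, "rp": 0}  # stand-in for the Reclaim.counter object

    def add_reclaim_item(pino, ino):
        """Algorithm 4: append (pino, ino) to the current reclaim object."""
        while True:
            wp = counter["wp"]
            name = "Reclaim.%d" % wp
            if name not in reclaim:
                # Create the reclaim object in the WRITE state.
                reclaim[name] = {"state": WRITE, "timestamp": time.time(), "items": []}
            obj = reclaim[name]
            if obj["state"] == WRITE and len(obj["items"]) < RECLAIM_CAPACITY:
                obj["items"].append((pino, ino))
                obj["timestamp"] = time.time()
                return
            # The current object is full (or already handed to the consumer):
            # advance the shared write pointer, never moving it backward, and retry.
            if counter["wp"] < wp + 1:
                counter["wp"] = wp + 1

    for i in range(6):
        add_reclaim_item(1, 100 + i)
    print({n: o["items"] for n, o in reclaim.items()})
    # -> Reclaim.0 holds the first four items, Reclaim.1 the remaining two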

  • Algorithm 5 Consume reclaim item

    procedure consume_reclaim_item
    again:
        rp = Reclaim.counter->rp;
        begin transaction
            result = (Reclaim.(rp)->state = READ);
            if result = FALSE then
                if (current_time - Reclaim.(rp)->timestamp) > REC_THRESHOLD then
                    Reclaim.(rp)->state = READ;
                    result = TRUE;
                end if
            end if
        end transaction
        if result = FALSE then
            return;
        end if
        for each item m in Reclaim.(rp) do
            process_reclaim_item(m);
            Remove the item m from the object Reclaim.(rp);
        end for
        Delete the object Reclaim.(rp);
        Reclaim.counter->rp++;
        goto again;
    end procedure

    procedure process_reclaim_item
    Require: (pino, ino)
        if object_exist(Inode.(pino)) = TRUE and object_exist(Inode.(ino)) = TRUE and entry_exist(Inode.(pino), ino) = TRUE then
            return;
        end if
        if object_exist(Inode.(ino)) = FALSE then
            return;
        end if
        if file_type(Inode.(ino)) = FILE then
            count = Inode.(ino)->size / OBJECT_SIZE;
            for m = 0 to count do
                Delete the object Data.(ino).(m);
            end for
        end if
        if file_type(Inode.(ino)) = DIRECTORY then
            for each entry cino in Inode.(ino) do
                Insert a reclaim item (ino, cino);
                Remove the entry cino from the object Inode.(ino);
            end for
        end if
        Delete the object Inode.(ino);
    end procedure
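
    The consumer side (Algorithm 5), sketched under the same illustrative in-memory model: a reclaim object whose timestamp is older than a threshold is switched to READ, each (pino, ino) item is checked, and orphaned inode and data objects are deleted; a reclaimed directory pushes its children back onto the reclaim log instead of recursing. REC_THRESHOLD, OBJECT_SIZE, and the starting state below are assumptions made for the sketch.

    import time

    WRITE, READ = "WRITE", "READ"
    REC_THRESHOLD = 0.0            # illustrative; a real system would use a longer age
    OBJECT_SIZE = 4 * 1024 * 1024  # illustrative data-object size

    # Hypothetical state: file ino 2 was unlinked from directory ino 1, so the
    # reclaim log holds (1, 2) but Inode.1 no longer has an entry pointing at 2.
    store = {
        "Inode.1": {"pino": 1, "type": "DIRECTORY", "size": 0, "entries": {}},
        "Inode.2": {"pino": 1, "type": "FILE", "size": OBJECT_SIZE, "entries": {}},
        "Data.2.0": b"...",
    }
    reclaim = {"Reclaim.0": {"state": WRITE, "timestamp": 0.0, "items": [(1, 2)]}}
    counter = {"wp": 1, "rp": 0}

    def process_reclaim_item(pino, ino):
        parent, inode = store.get("Inode.%d" % pino), store.get("Inode.%d" % ino)
        if parent and inode and ino in parent["entries"].values():
            return                      # still linked: nothing to reclaim
        if inode is None:
            return                      # already reclaimed
        if inode["type"] == "FILE":     # drop all data objects of the file
            for m in range(inode["size"] // OBJECT_SIZE + 1):
                store.pop("Data.%d.%d" % (ino, m), None)
        if inode["type"] == "DIRECTORY":
            # Defer children to later reclaim rounds (re-logged directly here
            # for brevity; the algorithm goes through add_reclaim_item).
            wp_obj = reclaim.setdefault("Reclaim.%d" % counter["wp"],
                                        {"state": WRITE, "timestamp": time.time(), "items": []})
            for cino in list(inode["entries"].values()):
                wp_obj["items"].append((ino, cino))
        store.pop("Inode.%d" % ino, None)

    def consume_reclaim_items():
        """Algorithm 5: drain reclaim objects from the read pointer onward."""
        while True:
            name = "Reclaim.%d" % counter["rp"]
            obj = reclaim.get(name)
            if obj is None:
                return
            if obj["state"] != READ:
                if time.time() - obj["timestamp"] <= REC_THRESHOLD:
                    return              # still being filled by writers
                obj["state"] = READ
            for item in list(obj["items"]):
                process_reclaim_item(*item)
                obj["items"].remove(item)
            del reclaim[name]
            counter["rp"] += 1

    consume_reclaim_items()
    print(sorted(store))                # -> ['Inode.1']: inode 2 and its data are gone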

  • Algorithm 6 Offline file system check

    procedure fsck
        for each object o in the object storage do
            if o is a data object named Data.(ino).(index) then
                if object_exist(Inode.(ino)) = FALSE or index * OBJECT_SIZE > Inode.(ino)->size then
                    Delete the object o;    // orphan data object
                end if
            end if
            if o is an inode object named Inode.(ino) then
                pino = o->pino;    // the inode number of the parent inode
                if object_exist(Inode.(pino)) = FALSE or entry_exist(Inode.(pino), ino) = FALSE then
                    Delete the object o;    // orphan inode object
                    continue;
                end if
                if file_type(Inode.(ino)) = DIRECTORY then
                    for each entry m in Inode.(ino) do
                        if object_exist(Inode.(m)) = FALSE then
                            Remove the entry m from the object Inode.(ino);    // orphan entry
                        end if
                    end for
                end if
            end if
        end for
    end procedure
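
    Finally, a sketch of Algorithm 6's offline check over the same illustrative object layout; the inconsistent objects in the dict below are constructed by hand to exercise each orphan case (a data object without an inode, an inode without a parent entry, and a directory entry pointing at a missing inode). The root-directory guard is an assumption of the sketch, not part of the published algorithm.

    OBJECT_SIZE = 4 * 1024 * 1024   # illustrative data-object size

    # Hand-made inconsistent image: Data.9.0 has no inode, Inode.5 has no entry
    # in its parent, and directory Inode.1 lists a child (7) whose inode is gone.
    store = {
        "Inode.1": {"pino": 1, "type": "DIRECTORY", "size": 0,
                    "entries": {"a.img": 2, "ghost": 7}},
        "Inode.2": {"pino": 1, "type": "FILE", "size": 0, "entries": {}},
        "Inode.5": {"pino": 1, "type": "FILE", "size": 0, "entries": {}},
        "Data.2.0": b"...",
        "Data.9.0": b"...",
    }

    def fsck():
        """Algorithm 6: delete orphan data objects, orphan inodes, orphan entries."""
        for name in sorted(store):                      # scan every object once
            parts = name.split(".")
            if parts[0] == "Data":
                ino, index = int(parts[1]), int(parts[2])
                inode = store.get("Inode.%d" % ino)
                if inode is None or index * OBJECT_SIZE > inode["size"]:
                    del store[name]                     # orphan data object
            elif parts[0] == "Inode":
                ino = int(parts[1])
                inode = store[name]
                parent = store.get("Inode.%d" % inode["pino"])
                if parent is None or ino not in parent["entries"].values():
                    if ino != inode["pino"]:            # keep the self-parented root (sketch assumption)
                        del store[name]                 # orphan inode object
                        continue
                if inode["type"] == "DIRECTORY":
                    for ename, cino in list(inode["entries"].items()):
                        if "Inode.%d" % cino not in store:
                            del inode["entries"][ename] # orphan entry

    fsck()
    print(sorted(store))   # -> ['Data.2.0', 'Inode.1', 'Inode.2']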
