In a thought-provoking—and characteristically amusing—talk at the Vault conference, Dave Chinner looked at the history of XFS, its current status, and where the filesystem may be heading. In keeping with the title of the talk (shared by this article), he sees parallels in what drove the original development of XFS and what will be driving new filesystems. Chinner's vision of the future for today's filesystems, and not just of XFS, may be a bit surprising or controversial—possibly both.
In the early 1990s, "before I knew who SGI was", Chinner said, storage was exceeding capacities that could be addressed with 32 bits. The existing filesystem for SGI's Unix, IRIX, was the Extent File System (EFS) that only supported 32-bit addresses. At that time, 64-bit CPUs with large-scale multiprocessing were coming on, but systems with hundreds of disks were not performing well.
XFS came about to fix those problems. The "x" originally just meant "undefined". This new undefined filesystem, xFS, had to support a number of features: fast crash recovery, large filesystems with both sparse and contiguous large files, and filesystems and directories that could hold huge numbers of files. The filesystem would need to support terabyte and petabyte capacities in the foreseeable future at that time, he said.
Chinner showed a graph of the lines of code changed in XFS over time that went all the way back to the beginning and up through commits from the previous week. Because the full commit history for XFS is available, graphs like that can be made, he said. Another way to use the history is to track bugs back to their introduction. The oldest bug anyone has found in recent times was 19 years old, he said.
The first production release of XFS was in December 1994 in conjunction with the release of IRIX 5.3. In mid-1996, XFS became the default filesystem in IRIX 6.2. The on-disk format had already gone through four versions. Even those early versions had the "feature mask" that allows XFS filesystems to turn on or off individual filesystem features. Those can be set for a particular filesystem at mkfs time; the mask will be consulted for various types of operations. It is a feature of XFS that makes it particularly flexible.
The "IRIX years" of the late 1990s saw various feature additions. Hardware RAID array support was added in 1997, which resulted in XFS gaining the ability to align its allocations to match the geometry of the underlying storage. By 1999, the original directory structure was showing its age, so version 2 directories were added, increasing directory scalability to tens of millions of files. Chinner has personally tried 350 million files in a directory, but he doesn't recommend it; running ls on such a directory takes twelve hours, he said with a laugh.
Shortly thereafter, feature development in XFS on IRIX slowed down. SGI had shifted its focus to large NUMA machines and distributed IRIX; development moved from XFS to the new CXFS cluster filesystem. But SGI also had a new product line that was based on Linux, which didn't have a filesystem with the features of XFS. So a team was formed in Melbourne, Australia to port XFS to Linux.
In 2000, XFS was released under the GPL. The first stable release of XFS for Linux came in 2001. And, in 2002, XFS was merged into the mainline for Linux 2.5.36.
One of the unfortunate side effects of XFS and other filesystems being merged into Linux was the proliferation of what Chinner called "Bonnie++ speed racing". People were using the filesystem benchmarking tool with little knowledge, which resulted in them "twiddling knobs" in the filesystem that they did not understand. The problem with that is that Google still finds those posts today, so those looking for better XFS performance are finding these old posts and following the instructions therein, which leads to the never-ending "noatime,nodiratime,logbufs=8 meme" for XFS mount options.
In the early 2000s, the Linux version of XFS started to diverge from the IRIX version with the group quotas feature. The version 2 log format was added in 2002, which helped make metadata performance much better. Also in 2002 came inode cluster delete and configurable sector sizes for XFS. In 2004, the 2.4 and 2.6 development trees were unified and the feature mask was expanded. XFS had run out of bits in the feature mask, he said, so the last bit was used to indicate that there is an additional feature mask to be consulted.
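That arrangement is easier to see in miniature. The following sketch is purely illustrative; the structure, names, and bit values are invented for this example and are not XFS's actual on-disk definitions.

```c
/* Illustrative sketch of a feature mask with an "escape" bit that says a
 * second mask must also be consulted. Names and bit assignments here are
 * invented; they are not XFS's real on-disk definitions. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define FEAT_ALIGN     (1u << 0)   /* hypothetical feature bit */
#define FEAT_QUOTA     (1u << 1)   /* hypothetical feature bit */
#define FEAT_MOREBITS  (1u << 15)  /* last bit: a second mask exists */

#define FEAT2_EXTRA    (1u << 0)   /* hypothetical bit in the second mask */

struct superblock {
	uint16_t features;    /* primary mask, set at mkfs time */
	uint16_t features2;   /* only valid if FEAT_MOREBITS is set */
};

static bool has_feature(const struct superblock *sb, uint16_t bit)
{
	return (sb->features & bit) != 0;
}

static bool has_feature2(const struct superblock *sb, uint16_t bit)
{
	/* The second mask exists only if the escape bit says so. */
	return (sb->features & FEAT_MOREBITS) && (sb->features2 & bit);
}

int main(void)
{
	struct superblock sb = {
		.features  = FEAT_ALIGN | FEAT_MOREBITS,
		.features2 = FEAT2_EXTRA,
	};

	printf("alignment: %d, extra: %d\n",
	       has_feature(&sb, FEAT_ALIGN), has_feature2(&sb, FEAT2_EXTRA));
	return 0;
}
```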
A "major achievement" was unlocked in 2004, when SUSE shipped full XFS support in SUSE Linux Enterprise Server (SLES) 9. It was a validation of all the work that SGI had done on the filesystem, he said. Someone from the audience spoke up to say that SLES 8 had shipped with XFS support in 2001, which seemed to surprise Chinner a bit.
The mid-2000s saw lots of misinformation spread about XFS (and other filesystems) during the "filesystem wars", Chinner said. His slides [PDF] contain several quotes from that time about "magical" features, including large capacitors in SGI power supplies that caused XFS not to lose data on power failure, zeroing of data after an unlink operation so that undelete was not possible, and the zeroing of all open files when there is an unclean shutdown. None of those were true, but they have become part of the XFS lore.
More features came about in 2005 and 2006. Extended attributes were added into the inode, for example, which resulted in an order-of-magnitude performance improvement for Samba atop XFS. That time frame was also the crossover point where Linux XFS started outperforming IRIX XFS. On Linux 2.6.16, XFS achieved 10GB/second throughput on a 24-processor Altix machine, he said.
A few years further on, the "O_PONIES wars" began. Chinner pointed to a Launchpad bug about an XFS file data corruption problem under certain conditions. The workaround was to use fdatasync() after renaming the file, but that would require changes in user space. The bug was closed as a "Won't fix", but none of the XFS developers were ever even consulted about it. It turns out that it actually was a bug in XFS that was fixed a year later.
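For readers who did not live through those wars, the pattern at issue is replacing a file by writing a new copy and renaming it over the old one. Below is a minimal sketch of one durable variant of that dance; it is a generic illustration of the technique, with error handling abbreviated, and is not the exact sequence discussed in that bug report.

```c
/* Sketch: durably replace "config" by writing a temporary file, flushing
 * it, renaming it into place, and then flushing the directory. Generic
 * illustration only; file names are made up and error handling is terse. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *data = "new contents\n";

	int fd = open("config.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0 || write(fd, data, strlen(data)) < 0) {
		perror("write config.tmp");
		return 1;
	}

	/* Get the new data onto stable storage before the rename makes it
	 * visible under the final name. */
	if (fdatasync(fd) < 0) {
		perror("fdatasync");
		return 1;
	}
	close(fd);

	if (rename("config.tmp", "config") < 0) {
		perror("rename");
		return 1;
	}

	/* Flush the directory so the rename itself survives a crash. */
	int dirfd = open(".", O_RDONLY | O_DIRECTORY);
	if (dirfd >= 0) {
		fsync(dirfd);
		close(dirfd);
	}
	return 0;
}
```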
"Sad days" arrived in 2009, when the SGI XFS engineering team was disbanded. The company had been losing money since 1999, he said. The community stepped in to maintain the XFS tree while SGI reorganized. After that, SGI intermittently maintained XFS until the end of 2013; he didn't mention that most of the work during that time was being done by developers (like himself) working at other companies. Chinner said he went away to the racetrack for a weekend and came back to find out that he had been nominated to take over as maintainer.
Since the shutdown of the XFS team, development on the filesystem has largely shifted to the wider community, he said. Development did not slow down as a result, however; if anything, it accelerated. But all of that work is not just about adding code; a lot of code has been removed along the way as well.
He rhetorically asked if XFS is "still that big, bloated SGI thing"? He put up a graph showing the number of lines of code (LOC) in XFS for each kernel release. The size of the XFS codebase started dropping around 2.6.14, bottomed out in 3.6, rose again until 3.15 or so, and has leveled off since. The level (just under 70,000 LOC) is lower than where the graph started with 2.6.12 (roughly 75,000 LOC). The line for Btrfs crossed that of XFS around 3.5 and it is at 80,000+ and still climbing; it hasn't leveled off, Chinner said.
He listed the top developers for XFS, with Christoph Hellwig at the top, followed closely by Chinner himself. He called out early XFS developers Adam Sweeney and Doug Doucette, noting that they had done "tons of work" in a fairly small number of years. He quoted Isaac Newton ("If I have seen further than others, it is by standing upon the shoulders of giants.") and said that XFS came about "not because of me", but from the work of all of those others on the list.
Recent and ongoing developments in XFS include sparse inode chunk allocation to support GlusterFS and Ceph, unification of the quota API, and reverse mapping for the internal B-trees (more reasons for doing so cropped up at the recently completed LSFMM Summit, he said). In addition, reflink support for per-file snapshots has been added, defragmentation improvements have been made, and support for the direct access block layer (DAX) has been added. There is "lots going on at the moment", he said.
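As an aside for those unfamiliar with reflinks: from user space, a reflink copy is requested through the kernel's clone ioctl. The sketch below uses FICLONE from <linux/fs.h>, an interface that reached mainline after this talk, so treat it as an illustration of how the feature is used rather than a statement about what XFS shipped at the time.

```c
/* Sketch: ask the kernel to reflink (share the extents of) src into dest.
 * FICLONE comes from <linux/fs.h>; it landed after this talk, so this is
 * an illustration of the interface, not of 2015-era XFS. */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <src> <dest>\n", argv[0]);
		return 1;
	}

	int src = open(argv[1], O_RDONLY);
	int dest = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (src < 0 || dest < 0) {
		perror("open");
		return 1;
	}

	/* On success, dest shares src's data blocks; a later write to either
	 * file copies only the modified extents (copy-on-write). */
	if (ioctl(dest, FICLONE, src) < 0) {
		perror("FICLONE");	/* EOPNOTSUPP if the filesystem lacks reflink */
		return 1;
	}

	close(src);
	close(dest);
	return 0;
}
```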
In among working on all of that, Chinner has been trying to plot out the next five years or so of XFS development. He did a similar exercise back in 2008 and posted some documents to XFS.org. Over the years since then, all of the features in those documents have been ticked off the list, though the reverse-mapping B-trees, which are 95% complete, are "the last bit". So, now is the time to plan for the future.
There are known storage technologies that are pulling filesystem developers in opposite directions, he said. Shingled magnetic recording (SMR) devices and persistent memory are going to change things radically. The key for developers of existing filesystems is to figure out "good enough" solutions to allow those filesystems to work on SMR and persistent memory devices. Eventually, someone will come up with a filesystem that specifically targets SMR devices or persistent memory that "blows everything else away".
Over the next five or more years, XFS needs to have better integration with the block devices it sits on top of. Information needs to pass back and forth between XFS and the block device, he said. That will allow better support of thin provisioning. It will also help with offloading block clone, copy, and compress operations. There are more uses for that integration as well, including better snapshot awareness and control at the filesystem layer.
Improving reliability is another area where XFS will need more work. Reconnecting orphaned filesystem objects will become possible with the B-tree reverse mapping abilities. That will also allow repairing the filesystem while it is online, rather than having to unmount it to do repairs. Proactive corruption detection is also on the list. When corruption is found, the affected part of the filesystem could be isolated for repair. These features would make XFS "effectively self-healing", Chinner said.
There is a "fairly large amount of work" on the list, he said. While he is targeting five years, he can't do it alone in that time period. If it is just him, it will probably take six or seven years, he said with a grin.
But we also need to be thinking a little further ahead. Looking at the progression of capacities and access times for "spinning rust" shows 8GB, 7ms drives in the mid-1990s and 8TB, 15ms drives in the mid-2010s. That suggests that the mid-2030s will have 8PB (petabyte, 1000 terabytes) drives with 30ms access times.
The progression in solid-state drives (SSDs) shows slow, unreliable, and "damn expensive" 30GB drives in 2005. Those drives were roughly $10/GB, but today's rack-mounted (3U) 512TB SSDs are less than $1/GB and can achieve 7GB/second performance. That suggests to him that by 2025 we will have 3U SSDs with 8EB (exabyte, 1000 petabytes) capacity at $0.1/GB.
SSDs have lots of compelling features (e.g. density, power consumption, performance) and are cost-competitive. But persistent memory will be even denser and faster, while still being competitive in terms of cost. It will be at least five years before we see pervasive persistent-memory deployment, however. Chinner did caution that memristors are a wild card that could be "a game changer" for storage.
Those projections indicate that focusing on spinning rust would be largely misplaced. XFS (and other filesystems) have to follow where the hardware is taking them. SSD storage will push capacity, scalability, and performance much faster than SMR technologies will. Though "most of the cat pictures" on the internet will likely be stored on SMR drives before too long, he said with a chuckle.
8EB is about as large as XFS can go. In fact, the 64-bit address space will be exhausted for filesystems and CPUs in the next 10-15 years. Many people thought that the 128-bit addresses used by ZFS were "crazy", but it doesn't necessarily look that way now.
By 2025-2030, XFS will be running up against capacity and addressability limits. It will also be hitting architectural performance limits in addition to complexity limits from continuing to support spinning rust disks. The architecture of XFS will be unsuited to the emerging storage technologies at that time.
All of those projections place a hard limit on the development life of XFS. It doesn't make sense to do a large-scale rework of the filesystem to support SMR devices, for example. SMR will be supported, but the project is "not going to turn everything upside down for it". There is a limited amount of life left in the filesystem at this point and SMR technologies will be outrun by other storage types.
From Btrfs, GlusterFS, Ceph, and others, we know that it takes 5-10 years for a new filesystem to mature. That implies we need to start developing any new filesystems soon. For XFS, there is the definite possibility that this 5-7 year development plan that he is working on will be the last. In 20 years, XFS will be a legacy filesystem. That's likely true for any current filesystem; he may not be right, he said, but if he is, we may well be seeing the last major development cycle for all of the existing filesystems in Linux.
[I would like to thank the Linux Foundation for travel support to Boston for Vault.]
XFS: There and back ... and there again?
Posted Apr 1, 2015 18:54 UTC (Wed) by jdub (subscriber, #27) [Link]
XFS: There and back ... and there again?
Posted Apr 2, 2015 9:10 UTC (Thu) by lkundrak (subscriber, #43452) [Link]
How did you notice?
XFS: There and back ... and there again?
Posted Apr 2, 2015 9:17 UTC (Thu) by jdub (subscriber, #27) [Link]
XFS: There and back ... and there again?
Posted Apr 1, 2015 20:30 UTC (Wed) by rodgerd (guest, #58896) [Link]
Perhaps the (non-XFS) kernel devs should have spent rather less time with their O_PONIES mockery and a little more time actually studying the problems...
XFS: There and back ... and there again?
Posted Apr 1, 2015 20:59 UTC (Wed) by zanak (guest, #101571) [Link]
XFS: There and back ... and there again?
Posted Apr 2, 2015 5:52 UTC (Thu) by reubenhwk (guest, #75803) [Link]
XFS: There and back ... and there again?
Posted Apr 2, 2015 16:50 UTC (Thu) by deater (subscriber, #11746) [Link]
I wouldn't go that far. I definitely have some XFS images lying around that were created on old IRIX systems that mounted just fine under Linux 2.4 but will not mount with newer kernels due to support for older features being dropped. So some sort of versioning difference has happened even if it's not reflected in the filesystem name.
XFS: There and back ... and there again?
Posted Apr 2, 2015 21:24 UTC (Thu) by dgc (subscriber, #6611) [Link]
-Dave.
XFS: There and back ... and there again?
Posted Apr 2, 2015 7:16 UTC (Thu) by epa (subscriber, #39769) [Link]
XFS: There and back ... and there again?
Posted Apr 2, 2015 9:09 UTC (Thu) by eru (subscriber, #2753) [Link]
One could say that in the Gartner Group's "hype cycle", XFS is now at the "plateau of productivity". One piece of evidence for this is that RHEL7 at last suggests it as the default filesystem (or at least CentOS7 did, when I installed it recently, and I guess it is following RHEL there).
XFS: There and back ... and there again?
Posted Apr 1, 2015 21:16 UTC (Wed) by ejr (subscriber, #51652) [Link]
Memristors schmemristors
Posted Apr 2, 2015 3:33 UTC (Thu) by ncm (subscriber, #165) [Link]
But, hey, what's up with spintronics lately?
Memristors schmemristors
Posted Apr 3, 2015 0:06 UTC (Fri) by rahvin (subscriber, #16953) [Link]
The challenge has always been (just like with most things) whether the now-proven item can actually be constructed at a reasonable cost and in a way that is conducive to mass production. In other words, the actual engineering. Even if a researcher can build a single memristor in a lab for a few million bucks, it doesn't mean anyone is going to be able to build them in massive quantities at sizes that will make their use practical.
It's entirely possible that memristors could be impossible to construct in volume, or smaller than a fridge, which would make them irrelevant. HP seems to think they are not only constructible but economical, but only time will tell. If they are successful, though, memristors could change the entire computer industry.
XFS: There and back ... and there again?
Posted Apr 2, 2015 13:21 UTC (Thu) by walters (subscriber, #7396) [Link]
> reflink support for per-file snapshots has been added

Really? I only see reflink bits in mainline for BTRFS and OCFS2.
XFS: There and back ... and there again?
Posted Apr 2, 2015 21:17 UTC (Thu) by dgc (subscriber, #6611) [Link]
-Dave.
XFS: Why
Posted Apr 2, 2015 14:39 UTC (Thu) by Felix.Braun (subscriber, #3032) [Link]
So when would you choose XFS over ext4?
XFS: Why
Posted Apr 2, 2015 16:07 UTC (Thu) by bfields (subscriber, #19510) [Link]
A lot of the really important differences are hard to summarize on a spec sheet: how they perform on a wide variety of workloads, how reliable they are, how easy it is for developers to support them. So, not being an expert on either filesystem, but working with people who are--I almost always rely on their judgement, which means going with whatever they chose for the distro's default. Boring answer, I apologize....
XFS: Why
Posted Apr 2, 2015 17:22 UTC (Thu) by dlang (✭ supporter ✭, #313) [Link]
XFS is going to start showing its advantages when you get to more advanced/larger disk systems (large RAID arrays).
ext* is a filesystem designed for small systems that's growing towards supporting large systems. XFS is a filesystem designed for large systems that's been getting slimmed down and optimized so that it also works well on small systems.
One example of this is that there have been bugs reported on ext* that the developers could not duplicate because none of them have access to similar hardware (where similar hardware is a fairly modest SCSI RAID array).
I've been using XFS for quite a few years on systems (since well before ext4), so for me the question is the opposite: what's so compelling about ext4 to cause me to move away from XFS? :-)
I made the shift during the ext3 days when ext3 was crippling for anything that did fsyncs for data reliability. That particular bug has been fixed in ext4, but that's not enough reason to switch.
XFS: Why
Posted Apr 3, 2015 0:23 UTC (Fri) by rahvin (subscriber, #16953) [Link]
XFS: Why
Posted Apr 3, 2015 0:34 UTC (Fri) by dlang (✭ supporter ✭, #313) [Link]
If you need cross-OS compatibility, you need some FAT variation.
XFS: Why
Posted Apr 5, 2015 2:05 UTC (Sun) by marcH (subscriber, #57642) [Link]
or UDF.
XFS: Why
Posted Apr 5, 2015 21:19 UTC (Sun) by flussence (subscriber, #85566) [Link]
The hardest part for me is remembering which of four different mkudffs switches sets the string that shows up in /dev/disk/by-label/...
XFS: Why
Posted Apr 6, 2015 16:57 UTC (Mon) by jem (subscriber, #24231) [Link]
UDF is nice, but has its own problems. Depending on the type of removable media, either the whole disk should be used for the file system, or the disk should be partitioned. This blog post from 2010 explains a hack making the disk appear both partitioned and unpartitioned at the same time: http://sipa.ulyssis.org/2010/02/filesystems-for-portable-...
But there are more dark clouds on the horizon; see this page: http://askubuntu.com/questions/27936/can-and-should-udf-b...
XFS: Why
Posted Apr 6, 2015 17:07 UTC (Mon) by marcH (subscriber, #57642) [Link]
I think the main reasons these issues haven't been solved yet are:
- lack of interest in true cross-platform compatibility
- lack of interest in removable media. Everything more conveniently stored in the NSA-monitored cloud.
So we'll probably never have a very good solution here.
XFS: Why
Posted Apr 7, 2015 7:29 UTC (Tue) by jem (subscriber, #24231) [Link]
Microsoft has even managed to make the exFAT file system "mandatory" on SDXC cards. There is no technical reason an SDXC card can't be formatted using FAT32. In fact, a FAT32 volume can easily span 2 TiB, which is the upper limit of SDXC cards. Of course, although such a card is perfectly valid, there's no guarantee that devices will support e.g. a FAT-formatted 64 GB SD card, because they are not "supposed" to.
XFS: Why
Posted Apr 7, 2015 7:38 UTC (Tue) by marcH (subscriber, #57642) [Link]
UDF has a major difference: it's not going to be possible to suddenly make all these DVDs and other disks unreadable, or to encumber the format with patents.
XFS: Why
Posted Apr 7, 2015 12:39 UTC (Tue) by jem (subscriber, #24231) [Link]
If there are problems with Windows interoperability, I doubt asking for help from Microsoft is going to help. I bet the answer you'll get is: "use exFAT".
Don't get me wrong, I would love to use UDF as a universal file system that is interoperable between different operating systems. Unfortunately, to me this doesn't seem realistic at the moment. I'd be happy if someone could prove me wrong.
XFS: Why
Posted Apr 7, 2015 18:08 UTC (Tue) by marcH (subscriber, #57642) [Link]
SDXC right now looks a little bit like a repeat of *data* Minidisc in the 90s. Minidiscs were technically good, cheap and had a really good window of opportunity for data before flash memory became cheap and ubiquitous. For data they never got off the ground because of artificial crippling and became completely irrelevant once flash was there.
With the rise of The Cloud, NAS appliances, and all the other networked things, SDXC will very slowly die like every other removable medium, leaving the minority of users who really need removable media forever in this painful cross-compatibility situation.
> Unfortunately, to me this doesn't seem realistic at the moment. I'd be happy if someone could prove me wrong.
Different analysis and reasons - same pessimistic feelings and conclusion :-)
XFS: Why
Posted Apr 10, 2015 8:13 UTC (Fri) by Wol (guest, #4433) [Link]
And then you take a device (like a camera) somewhere where there is no network capability and you're stuffed. This blind faith in the ever-present external support network gets boring. That's how you get disasters - a failure somewhere else knocks your system out and you end up running around like a headless chicken because you have NO BACKUP.
fwiw, my current camera has 48 *G*B of SD-card capacity, and that is enough to store 600 photos - yes, six *hundred*. wtf am I supposed to do if I go on the holiday of a lifetime, with minimal access to civilisation (yes I'm that sort of a person), and go mad with the camera?
Cheers,
Wol
XFS: Why
Posted Apr 10, 2015 12:15 UTC (Fri) by pizza (subscriber, #46) [Link]
Yeah, 48MB per (losslessly compressed 14-bit RAW) image really adds up, doesn't it?
XFS: Why
Posted Apr 12, 2015 13:47 UTC (Sun) by Wol (guest, #4433) [Link]
Cheers,
Wol
XFS: Why
Posted Apr 10, 2015 12:26 UTC (Fri) by dlang (✭ supporter ✭, #313) [Link]
it takes a long time to download this data from the SD card when directly read from a computer, and a long time to send this much data over a wired Gig-E network. I sure wouldn't want to depend on having a cell or wifi network available that could handle me (and others around me) shooting at these sorts of volumes.
XFS: Why
Posted Apr 10, 2015 22:38 UTC (Fri) by marcH (subscriber, #57642) [Link]
In that respect, you should be able to ignore the removable and cross-compatibility aspects of the SDXC card in your camera?
Windows users better do this anyway for other reasons: https://en.wikipedia.org/wiki/Windows_Photo_Viewer#Issues
XFS: Why
Posted Apr 10, 2015 5:24 UTC (Fri) by yuhong (guest, #57183) [Link]
XFS: Why
Posted Apr 4, 2015 10:53 UTC (Sat) by deepfire (guest, #26138) [Link]
XFS: Why
Posted Apr 4, 2015 12:32 UTC (Sat) by cesarb (subscriber, #6266) [Link]
XFS: Why
Posted Apr 10, 2015 17:35 UTC (Fri) by nerdshark (guest, #101876) [Link]
https://www.paragon-software.com/home/extfs-windows/
http://www.ext2fsd.com/
XFS: Why
Posted Apr 3, 2015 10:47 UTC (Fri) by sdalley (subscriber, #18550) [Link]
Creating large numbers of tiny files on ext4 ran me into an ENOSPC when the disk blocks were still less than half used, simply because the (allocated-at-mkfs-time) inodes had run out.
XFS handles all that automatically; you don't have to guess ahead of time that you're going to need vast numbers of inodes.
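A quick way to tell whether an ENOSPC really is inode exhaustion rather than a full disk is to compare free inodes with free blocks. A small sketch using the standard statvfs() call (the path argument is whatever mount point you are checking):

```c
/* Sketch: report free blocks versus free inodes for a mount point, to
 * distinguish "disk full" from "out of inodes" ENOSPC. */
#include <stdio.h>
#include <sys/statvfs.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : ".";
	struct statvfs st;

	if (statvfs(path, &st) != 0) {
		perror("statvfs");
		return 1;
	}

	printf("blocks: %llu free of %llu\n",
	       (unsigned long long)st.f_bfree, (unsigned long long)st.f_blocks);
	printf("inodes: %llu free of %llu\n",
	       (unsigned long long)st.f_ffree, (unsigned long long)st.f_files);

	if (st.f_ffree == 0 && st.f_bfree > 0)
		printf("ENOSPC here is likely inode exhaustion\n");
	return 0;
}
```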
XFS: There and back ... and there again?
Posted Apr 7, 2015 16:17 UTC (Tue) by Paf (subscriber, #91811) [Link]
Thank you, Jake, for a fantastic article (and Dave for a great talk).
XFS reflink status?
Posted Apr 18, 2015 5:31 UTC (Sat) by gmatht (subscriber, #58961) [Link]
XFS reflink status?
Posted May 4, 2015 22:52 UTC (Mon) by dgc (subscriber, #6611) [Link]
-dave.
XFS: There and back ... and there again?
Posted Apr 21, 2015 13:16 UTC (Tue) by nye (guest, #51576) [Link]
Does anyone know the real reason why files that were not being written to at the time were zeroed on unclean shutdown, and/or what circumstances would trigger it?
I guess it's water under the bridge now, but it would be nice to know nevertheless.
XFS: There and back ... and there again?
Posted May 4, 2015 22:55 UTC (Mon) by dgc (subscriber, #6611) [Link]
XFS: There and back ... and there again?
Posted May 5, 2015 12:53 UTC (Tue) by nye (guest, #51576) [Link]
Basically, once upon a time I found a fair number of files wiped out after a power failure, and the only reason I noticed straight away was because one of them was /etc/passwd. It's *possible* that I'd [un]installed something earlier in the day that might have had a need to add/remove/alter a user account, but there was definitely nothing like that going on at the time.
What I learned when attempting to understand this at the time was that XFS was specifically intended to work with battery-backed RAID arrays, and not as a general-purpose filesystem, so I chalked it up to a bad choice for a desktop machine and moved on. Nevertheless, this is one of the two biggest data loss events I've ever experienced (the other being a botched reiserfs resize) so it's stuck with me even though it was long ago now.
XFS: There and back ... and there again?
Posted May 5, 2015 20:41 UTC (Tue) by dlang (✭ supporter ✭, #313) [Link]
The bottom line is that if you want to be sure that data is on disk, you need to do a fsync.
Yes, if the system is idle, it will try to proactively push data out so that it can more easily throw the pages away if needed, but if you rely on that for critical stuff, you are going to run into problems someday. It doesn't matter what filesystem you use.
XFS: There and back ... and there again?
Posted May 6, 2015 11:03 UTC (Wed) by nye (guest, #51576) [Link]
Copyright © 2015, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds