Storing many small files in HDFS is inefficient for processing and also bloats NameNode metadata, since each file, directory, and block consumes memory on the NameNode. This FAQ answers common questions about consolidating many small files into a larger SequenceFile and how to access the result.
In which use cases would the SequenceFile option be better than the HAR file option, and what are some reasons for not using SequenceFiles?
Use a HAR if you still need to access the individual files inside the merged archive by name after it is created (non-programmatically, e.g. via the shell). Otherwise, SequenceFiles are recommended, as they are far more performant than HARs.
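As an illustration of the HAR case, a sketch of creating an archive and then reading one member file by name; the paths below are placeholders, and the archive step runs a MapReduce job on the cluster:

```shell
# Archive everything under /data/small-xml into xml.har (placeholder paths).
hadoop archive -archiveName xml.har -p /data/small-xml /data/archives

# Individual files remain addressable by their original names
# through the har:// scheme:
hadoop fs -cat har:///data/archives/xml.har/file1.xml
```

This by-name access is exactly what a SequenceFile does not give you from the shell, which is why HARs win for that use case.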
What would be the best way to convert the small XML files into SequenceFiles? Is there a utility or a command that would do the conversion?
You can use FileCrush, a third-party tool (https://github.com/edwardcapriolo/filecrush); CDH ships no built-in tool for this.
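A sketch of a FileCrush invocation follows; the jar version, input/output paths, and timestamp argument are assumptions, so check the project README for the exact usage of your build:

```shell
# Run the FileCrush MapReduce job to merge the small text/XML files
# under /data/small-xml into SequenceFiles under /data/crushed.
# The final argument is a timestamp used by the tool; all paths are
# placeholders for this example.
hadoop jar filecrush-2.2.2-SNAPSHOT.jar com.m6d.filecrush.crush.Crush \
  --input-format=text \
  --output-format=sequence \
  /data/small-xml /data/crushed 20240101000000
```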
Can we use all tools (Hive, Pig, Java MapReduce, Impala, etc.) on the converted SequenceFiles?
Yes, they can still be used, but any existing scripts will need small changes to fit the new format: a script must now load each XML document as a value from the SequenceFile (with the original filename as the key) instead of reading each small file independently. Some transformation of the processing logic is therefore required.
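To make the change concrete, a minimal sketch of the new record shape a script sees: each record is a (filename, XML) pair rather than a standalone file. The loop below simulates such records locally as tab-separated key/value lines (the filenames and XML snippets are made up for illustration):

```shell
# Simulate (filename, xml) records as key<TAB>value lines and process
# them the way a converted script would: the filename arrives as the
# key and the whole XML document arrives as the value.
printf 'a.xml\t<order id="1"/>\nb.xml\t<order id="2"/>\n' |
while IFS=$(printf '\t') read -r fname xml; do
  echo "processing $fname: ${#xml} bytes of XML"
done
```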
If I have an external Hive table pointing to the XML folder, will the same table work after changing the location and storage format to the SequenceFile?
No, because the structure of the data has changed to key-value pairs of two strings: the key holds the original filename and the value holds the entire XML document as a string. In Hive you would read such a file through a new table with a single string column (Hive reads only the values of a SequenceFile; the keys are ignored). The XML string must then be parsed with XML UDFs and either used directly or inserted into another table.
Will the keys hold the original small individual XML file names, or the larger merged file's name?
The keys in the crushed file are the original filenames.
How do I retrieve the file names stored in a SequenceFile? Is there a command-line utility, or do I have to write a MapReduce program?
The “hadoop fs -text” utility prints a SequenceFile in text form, one key and value per line, which lets you view the keys without writing any code.
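For example, since “hadoop fs -text” emits tab-separated key/value lines, the keys (filenames) can be listed by cutting the first field. The cluster command is shown as a comment (the path is a placeholder), and the pipeline is simulated locally on sample decoded output:

```shell
# On a cluster you would run something like:
#   hadoop fs -text /data/crushed/part-00000 | cut -f1
# Simulated locally on sample decoded key<TAB>value output:
printf 'file1.xml\t<a/>\nfile2.xml\t<b/>\n' | cut -f1
```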