markneustadt.com

Splitting a large text file into manageable pieces

Have you ever gotten a gigantic text file that you needed to split into smaller chunks?  Maybe it's a log file too big to even open, or source data that would be better processed as a series of smaller files rather than one big one.  How can you split that file into a series of smaller files?

There are some Windows PowerShell scripts you can run.  I tried one and it worked just fine, except that it was really, really slow.  On top of that, the script itself had bugs.  You can find the post I used for the source on Stack Overflow.  After my modifications to correct the bugs, I ended up with this source code.

# Stream the file line by line so the whole thing never has to fit in memory.
$reader = New-Object System.IO.StreamReader("MY_VERY_LARGE_FILE.txt")
$count = 1
$upperBound = 2500KB    # target size per chunk (PowerShell expands 2500KB to 2,560,000 bytes)
$rootName = "MY_VERY_LARGE_FILE_"
$ext = "txt"
$fileName = "{0}{1}.{2}" -f ($rootName, $count, $ext)

while (($line = $reader.ReadLine()) -ne $null)
{
    # Append the line to the current chunk file.
    Add-Content -Path $fileName -Value $line

    # Once the current chunk reaches the size limit, start a new one.
    if ((Get-ChildItem -Path $fileName).Length -ge $upperBound)
    {
        ++$count
        $fileName = "{0}{1}.{2}" -f ($rootName, $count, $ext)
    }
}

$reader.Close()

What this does is split the file into chunks of roughly 2,500 KB each.  You can configure the size of the chunks through the $upperBound variable in the code.

Like I said though, this is crazy slow.

A better option is the split command, a standard Unix utility.  If you're on Windows and have the Git command line or Git Bash, you already have access to it.  GENIUS!

By default, the split command will split files and give the pieces not-very-useful names like xaa and xab.  Here's the command I used to split a file into pieces of 100,000 lines apiece while making sure that the resulting files have decent names.


split MY_VERY_LARGE_FILE.txt -l 100000 -d MY_VERY_LARGE_FILE_ --additional-suffix=".txt"


Let’s look at the pieces.

  • split – the name of the command
  • MY_VERY_LARGE_FILE.txt – the name of the huge file we want to split
  • -l 100000 – split into files that have 100,000 lines apiece
  • -d – each output file gets a numeric suffix, counting up from 00 (as opposed to the default alphabetic suffixes)
  • MY_VERY_LARGE_FILE_ – the prefix for the resulting files.  Notice the underscore?  When combined with the -d parameter, we'll end up with files like MY_VERY_LARGE_FILE_00 and MY_VERY_LARGE_FILE_01
  • --additional-suffix=".txt" – appends the .txt extension to each output file

Using that command, we can split MY_VERY_LARGE_FILE.txt into MY_VERY_LARGE_FILE_00.txt, MY_VERY_LARGE_FILE_01.txt, and so on.  Each output file will have 100,000 lines in it, except the last one, which gets whatever is left over.
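If you want to convince yourself it worked, here's a small sketch of the same command scaled down to a 250-line sample file and 100-line pieces (the sample file name and 100-line size are just for the demo; swap in your real file and -l 100000):

```shell
# Build a small sample file: 250 numbered lines.
seq 1 250 > sample.txt

# Same flags as above, scaled down to 100-line pieces.
split -l 100 -d --additional-suffix=".txt" sample.txt sample_

# Every piece except the last holds exactly 100 lines;
# the last one gets whatever is left over (50 here).
wc -l sample_*.txt

# Shell globs expand in sorted order, so concatenating the
# pieces reproduces the original file byte-for-byte.
cat sample_*.txt > recombined.txt
cmp sample.txt recombined.txt && echo "files match"
```

The byte-for-byte comparison at the end is a nice sanity check before you delete the original.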


The best part about the split command is that it takes literally seconds to complete.  The PowerShell version of the same job would have taken literally HOURS.
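Incidentally, if you actually wanted size-based chunks like the PowerShell script produces, split can do that too via its -b option.  A quick sketch (the sample file name is made up for the demo; note that -b cuts at byte boundaries, so unlike -l it can split a line in half across two pieces):

```shell
# Generate a sample file of 100,000 numbered lines (~575 KB).
seq 1 100000 > sample.txt

# Split by size instead of line count: 100 KB per piece.
# Unlike -l, -b may cut a line in half at a chunk boundary.
split -b 100K -d --additional-suffix=".txt" sample.txt sample_

# The pieces still concatenate back to the original.
cat sample_*.txt | cmp - sample.txt && echo "files match"
```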


Happy splitting!
