WordPress import files can often be ungainly and hard to work with because of the limits that apply to the WordPress import tool, such as PHP’s upload_max_filesize, post_max_size and memory_limit settings, or a limit built into the web server itself.
Sometimes it’s easier to work with smaller files, whether for testing or for importing small batches. I recently encountered that need, and my previous solution involved copying and pasting chunks of the XML WXR file from one place to another. This became impractical when I faced a nearly 200MB file that would cause my text editor to choke.
To address this, I developed the following XSLT stylesheet and shell script to break a WXR file into pages of posts in separate files. They require an XSLT 2.0 processor, because the stylesheet uses the xsl:result-document element. I used Saxonica’s Saxon, whose JAR provides a handy command-line interface to the Saxon libraries.
The XSL Stylesheet
The goal of this stylesheet is to break apart the WXR file’s item elements (each of which represents a single WordPress post) into multiple files, while still preserving the WordPress metadata in each output file.
[sourcecode language="xml" wraplines="false"]
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:param name="size" />
<xsl:param name="page" />
<xsl:param name="output" />
<xsl:template match="/rss">
<xsl:result-document method="xml" href="{$output}_{$page * $size}-{($page + 1) * $size - 1}.xml">
<rss version="2.0" xmlns:excerpt="http://wordpress.org/export/1.1/excerpt/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:wp="http://wordpress.org/export/1.1/">
<channel>
<xsl:for-each select="channel/*[local-name() != 'item']">
<xsl:copy-of select="." />
</xsl:for-each>
<xsl:for-each select="channel/item">
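<!-- position() is 1-based, so page N holds items N*size+1 through (N+1)*size, matching the 0-based range in the filename -->
<xsl:if test="position() &gt; $page * $size and position() &lt;= ($page + 1) * $size">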
<xsl:copy-of select="." />
</xsl:if>
</xsl:for-each>
</channel>
</rss>
</xsl:result-document>
</xsl:template>
</xsl:stylesheet>
[/sourcecode]
This stylesheet takes three parameters: page, the page number; size, the page size in number of “item” elements; and output, the filename prefix for the resulting output files. It will emit a file with the name $output_$start-$end.xml. You may note that this stylesheet can only handle one page of posts at a time due to the lack of for or while loops in the XSLT language (at least without language-paradigm-breaking hackery). This also enables the output to be controlled fully from the calling program, which for this purpose will just be the shell.
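The naming rule is easy to reproduce in the shell, which helps when predicting which file a given page will land in. The sketch below mirrors the arithmetic in the stylesheet’s href attribute; the function name page_filename is made up for illustration and is not part of the scripts in this post.

```shell
#!/bin/sh
# Hypothetical helper: print the filename the stylesheet would write for a page.
# Arguments: $1 = output prefix, $2 = page number, $3 = page size.
page_filename() {
    echo "${1}_$(( $2 * $3 ))-$(( ($2 + 1) * $3 - 1 )).xml"
}

page_filename file 0 2000       # prints file_0-1999.xml
page_filename Articles 2 1500   # prints Articles_3000-4499.xml
```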
Using the XSLT Stylesheet
The basic functionality of this stylesheet allows me to create a new WXR import file with a range of posts contained in the original. In this example, I’m copying the first 2,000 posts from the import file. After it completes, the posts will be saved into file_0-1999.xml.
[sourcecode language="bash" wraplines="false"]
$ java -Xmx512m -jar ~/saxonhe9-2-0-5j/saxon9he.jar -xsl:split.xsl articles.xml page=0 size=2000 output=file
[/sourcecode]
I keep my Saxon JAR file in ~/saxonhe9-2-0-5j/saxon9he.jar, but you’ll likely have it somewhere else.
The -Xmx512m parameter tells the Java VM to set the maximum heap size to 512 MB. You may need to adjust this parameter according to the size of your input file.
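Since both the JAR location and the heap size vary from machine to machine, one option is to keep them in variables. This is only a sketch: SAXON_JAR, SAXON_HEAP, DRY_RUN and the extract_page function are names invented here, not anything Saxon or Java reads on its own. It builds the same kind of command shown above, with the output prefix passed explicitly.

```shell
#!/bin/sh
# Hypothetical settings for this sketch; override them in the environment.
SAXON_JAR="${SAXON_JAR:-$HOME/saxonhe9-2-0-5j/saxon9he.jar}"
SAXON_HEAP="${SAXON_HEAP:-512m}"

# extract_page INPUT PREFIX PAGE SIZE
# Runs the stylesheet for one page, or only prints the command if DRY_RUN is set.
extract_page() {
    # Rebuild the argument list as the full java command line.
    set -- java "-Xmx$SAXON_HEAP" -jar "$SAXON_JAR" -xsl:split.xsl "$1" \
        "page=$3" "size=$4" "output=$2"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Setting DRY_RUN=1 before calling extract_page articles.xml file 0 2000 prints the command instead of running it, which is a cheap way to check the path and heap size first.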
Doin’ it all!
Now that we have the basic tool for pulling a single page out of our source XML file, we can use a little bit of shell scripting to get all of the posts into separate files.
[sourcecode language="bash" highlight="29" wraplines="false"]
#!/bin/bash
# filename: required
file=$1
# output file prefix: required
outfile=$2
if [ -z "$file" ] || [ ! -f "$file" ] || [ -z "$outfile" ]; then
    echo "Usage: $0 filename outfile [pagesize] [start] [limit]"
    exit 1
fi
# page size: defaults to 2000
pagesize=${3:-2000}
# start post: defaults to 0 (first post)
start=${4:-0}
# limit: defaults to # of posts in input file (lines containing <item>)
limit=${5:-$(grep -c '<item>' "$file")}
# integer division: a trailing partial page is not included in this count
echo "Splitting $file into $(( (limit - start) / pagesize )) pages of size $pagesize between posts $start and $limit"
i=$start
while [ "$i" -le "$limit" ]; do
    echo "Generating page $((i/pagesize)): posts $i through $((i+pagesize)).."
    # adjust the path to the Saxon JAR to match your system
    java -Xmx2000m -jar ~/saxonhe9-2-0-5j/saxon9he.jar -xsl:split.xsl "$file" page=$((i/pagesize)) size=$pagesize output="$outfile"
    i=$((i+pagesize))
done
[/sourcecode]
Save the above as split.sh, and the XSLT file as split.xsl in the same directory. Also, make sure the path to your Saxon JAR file is correct in the java command inside the loop. Pulling this all together, we can take a large WXR input file and slice and dice it as we see fit:
[sourcecode language="bash" gutter="false"]
[Meerkat ~/Oomph/]$ sh split.sh Articles.xml Articles 1500
Splitting Articles.xml into 16 pages of size 1500 between posts 0 and 24065
Generating page 0: posts 0 through 1500..
Generating page 1: posts 1500 through 3000..
Generating page 2: posts 3000 through 4500..
Generating page 3: posts 4500 through 6000..
Generating page 4: posts 6000 through 7500..
…
[/sourcecode]
The articles are now split into files of up to 1500 each, stored as
Articles_0-1499.xml
Articles_1500-2999.xml
Articles_3000-4499.xml
… And so forth.
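Before importing, it can be reassuring to confirm that no posts went missing in the split. The following is a minimal sketch, not part of the original scripts: count_items and check_split are hypothetical helper names, and the count assumes every item start tag sits on its own line, the same assumption the grep in split.sh makes. The pattern also tolerates an item tag that carries attributes.

```shell
#!/bin/sh
# Count lines containing an opening <item> tag (with or without attributes).
# grep exits non-zero when the count is 0, so "|| true" keeps that harmless.
count_items() {
    grep -c '<item[ >]' "$1" || true
}

# check_split SOURCE PAGE_FILE...
# Succeeds only if the page files together hold as many items as the source.
check_split() {
    source_file=$1
    shift
    expected=$(count_items "$source_file")
    total=0
    for page_file in "$@"; do
        total=$(( total + $(count_items "$page_file") ))
    done
    echo "source: $expected items, pages: $total items"
    [ "$expected" -eq "$total" ]
}
```

For the example above, this would be run as check_split Articles.xml Articles_*.xml, and a non-zero exit status means the totals differ.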
Now you can import each of these files individually without choking your WordPress importer! I hope that some of you will find this useful. Keep in mind that the XSL stylesheet above could easily be adapted to work with other large XML data files, too. It is just a matter of changing the selectors to match the elements that you wish to break apart.