What XSLT is needed to extract and transform this specific XHTML?


I'm trying to extract a subset of some HTML from a larger file and then perform a few transformations of the result. I've made some progress but I'm missing a piece or two to make this work as desired.

Here is a greatly simplified version of the XHTML I wish to transform:

<html>
<head>
<!-- lots of stuff I don't care about -->
</head>
<body>
<div>
  <!-- lots of stuff I don't care about -->
  <div>
     <!-- lots of stuff I don't care about -->
     <div id="key_div">
         <div id="ignore_this">
           <!-- lots of stuff I don't care about -->
         </div>
         <p>More junk I don't want</p>
         <p>Even more junk I don't want</p>
         <h2><span class="someClass" id="someID">Header</span></h2>
         <p>Stuff I want to keep</p>
         <!-- A lot of stuff I want to keep -->
         <p>More stuff I want to keep</p>
         <ul>
           <li><a href="/some/old/path">Fun Place</a></li>
           <li><a href="/some/old/other">Better Place</a></li>
         </ul>
     </div>
     <!-- lots of stuff I don't care about -->
  </div>
  <!-- lots of stuff I don't care about -->
</div>
</body>
</html>

I want to extract everything from the <h2> tag through the rest of the content inside the <div> with the id of "key_div". But I also want to transform the <h2> into a simpler <h1> and I need to modify the hrefs in the list. The final result should look like this:

<html>
<head>
<!-- My own header stuff -->
</head>
<body>
 <h1>Header</h1>
 <p>Stuff I want to keep</p>
 <!-- A lot of stuff I want to keep -->
 <p>More stuff I want to keep</p>
 <ul>
   <li><a href="/new/path">Fun Place</a></li>
   <li><a href="/new/other">Better Place</a></li>
 </ul>
</body>
</html>

I was able to do most of the basic extraction without any of the desired transformations by using the following XSL:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:x="http://www.w3.org/1999/xhtml"
 exclude-result-prefixes="x">
 <xsl:output indent="yes" encoding="utf-8"/>

 <xsl:template match="/">
  <html>
   <head>
     <title>My Title</title>
   </head>
   <body>
    <xsl:apply-templates/>
   </body>
  </html>
 </xsl:template>

 <xsl:template match="div[@id='key_div']/*">
  <xsl:copy-of select="."/>
 </xsl:template>

 <xsl:template match="div[@id='ignore_this']"/>

 <xsl:template match="text()"/>
</xsl:stylesheet>

This results in:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>My Title</title>
</head>
<body>
<p>More junk I don't want</p>
<p>Even more junk I don't want</p>
<h2><span class="someClass" id="someID">Header</span></h2>
<p>Stuff I want to keep</p>
<p>More stuff I want to keep</p>
<ul>
           <li><a href="/some/old/path">Fun Place</a></li>
           <li><a href="/some/old/other">Better Place</a></li>
         </ul>
</body>
</html>

I don't know how to remove the stuff before the <h2>.

I also don't know how to transform <h2><span class="someClass" id="someID">Header</span></h2> into <h1>Header</h1>, or how to rewrite the hrefs. My attempts to combine a transform with the extraction usually end up producing no content at all.

There will be a few other transformations I need to perform, but for now I'll focus on this example to get started. I mention this so that answers don't rule out adding further transformations later.


Solution

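  • Try something like the following. It applies templates to the h2 inside key_div, emits it as an h1, then processes the h2's following sibling elements through an identity template, with an override that rewrites each href: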

    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    
    <xsl:template match="/html">
        <html>
            <head>
                <!-- your own header stuff -->
            </head>
            <body>
                <xsl:apply-templates select="//div[@id='key_div']/h2"/>
            </body>
        </html>
    </xsl:template>
    
    <xsl:template match="h2">
        <h1>
            <xsl:value-of select="." />
        </h1>
        <xsl:apply-templates select="following-sibling::*"/>
    </xsl:template>
    
    <xsl:template match="@href">
        <xsl:attribute name="href">
            <xsl:text>/new/</xsl:text>
            <xsl:value-of select="substring-after(., '/old/')" />
        </xsl:attribute>
    </xsl:template>
    
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
    
    </xsl:stylesheet>
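
A possible variant of the stylesheet above, offered as a sketch rather than a definitive fix. It differs in three ways, each resting on an assumption about the real input. First, it walks following-sibling::node() instead of following-sibling::*, so comments and text nodes after the h2 are copied too. Second, it handles the starting h2 in a separate mode (the mode name "start" is made up for this example), so that any later h2 siblings are copied as-is instead of restarting the walk. Third, it only rewrites hrefs that actually contain /old/, on the assumption that other links should be left untouched. Like the original, it assumes the input elements are in no namespace.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<!-- Build the new page; start from the first h2 inside key_div.
     The mode name "start" is illustrative only. -->
<xsl:template match="/html">
    <html>
        <head>
            <!-- your own header stuff -->
        </head>
        <body>
            <xsl:apply-templates select="//div[@id='key_div']/h2[1]" mode="start"/>
        </body>
    </html>
</xsl:template>

<!-- Starting h2: emit its text as an h1, then continue in the default
     mode with every later sibling node (elements, text, comments). -->
<xsl:template match="h2" mode="start">
    <h1>
        <xsl:value-of select="."/>
    </h1>
    <xsl:apply-templates select="following-sibling::node()"/>
</xsl:template>

<!-- Rewrite only hrefs that contain /old/ (assumption: all other
     hrefs should pass through unchanged via the identity template). -->
<xsl:template match="@href[contains(., '/old/')]">
    <xsl:attribute name="href">
        <xsl:text>/new/</xsl:text>
        <xsl:value-of select="substring-after(., '/old/')"/>
    </xsl:attribute>
</xsl:template>

<!-- Identity template: copy everything else unchanged. -->
<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>

</xsl:stylesheet>
```

Further transformations can be added as extra templates that override the identity template for specific elements or attributes. If the real document declares the XHTML namespace (xmlns="http://www.w3.org/1999/xhtml"), the unprefixed names in these patterns would not match; they would need a bound prefix such as the x: one declared in the question's stylesheet.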