HTML::Element endtag generates end tags for <br> and <img>


I'm using the following Perl code to traverse and format some HTML:

#!/usr/bin/env perl 
use v5.38;
use HTML::TreeBuilder;
my $indent = 3;
my $content = do {local $/; <DATA>};
my $tree = HTML::TreeBuilder->new();
$tree->parse_content($content);
visit($tree);

sub visit($x) {
    my $depth = $x->depth;
    my $in = ' ' x ($indent * $depth);
    foreach my $e ($x->content_list) {
        # element
        if (ref ($e)) {
            say $in . $e->starttag;
            visit($e);
            say $in . $e->endtag;
        }
        # text
        else {
            say $in . $e;
        }
    }
}
__DATA__
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
</head>
<body>
    <font size=3><strong>
    5/5/61 Bob & Jerry - Arroyo Lounge, Stanford University, Palo Alto, CA
    </strong></font>
    <br>
    <img src="poster.png" alt="poster/ad" title="poster/ad">
    <i>(Robert Hunter and Jerry Garcia; source: McNally, Jackson research)</i>
    <br><br>

    <font size=3><strong>
    5/26/61 Bob & Jerry - Barbara Meier's 16th birthday party, Menlo Park, CA
    </strong></font>
    <br>
    Follow The Drinking Gourd, John Henry, Santy Anno*, Poor Paddy Works On The Railway
    <br>
    <i>(*included on
        <a href="https://www.garciafamilyprovisions.com/product/JY148COMBO/before-the-dead-4cd-set?cp=640_62123_100764" target="_blank">Before The Dead
        </a>;
        <a href="https://gdsets.com/63posters/1961_05_26.jpg" target="_blank">birthday doodle for Barbara by Jerry
        </a>;
        <a href="https://gdsets.com/63posters/1961_05_26a.jpg" target="_blank">the master tape
        </a>
    )
    </i>
    <br><br>

My problem is that each <br> is output as:

<br />
</br>

Both <br /> and </br> cause new lines to be rendered. I was surprised that endtag generated anything at all in the case of tag br (and img).

I avoided using HTML::Element's traverse method because the doc discourages its use:

[I]f you want to recursively visit every node in the tree, it's almost always simpler to write a subroutine that does just that, than it is to bundle up the pre- and/or post-order code in callbacks for the traverse method.

There are no examples given, so the above is what I cooked up.

Am I using starttag and endtag correctly? Should I detect when I'm displaying a tag that doesn't take an end tag and avoid calling endtag? What's the right/best/simplest way to traverse an HTML tree and prettify it?

Update:

As suggested by Stephen Ullrich, I tried to use as_HTML() for formatting:

#!/usr/bin/env perl 
use v5.38;
use HTML::TreeBuilder;
say "\%HTML::Element::optionalEndTag= ",
    join ', ', keys %HTML::Element::optionalEndTag;
my $content = do {local $/; <DATA>};
my $tree = HTML::TreeBuilder->new();
$tree->parse_content($content);
# don't encode any entities; indent with three spaces; 
say $tree->as_HTML('', '   ');
__DATA__
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
</head>
<body>
    <font size=3><strong>
    5/5/61 Bob & Jerry - Arroyo Lounge, Stanford University, Palo Alto, CA
    </strong></font>
    <br>
    <img src="poster.png" alt="poster/ad" title="poster/ad">
    <i>(Robert Hunter and Jerry Garcia; source: McNally, Jackson research)</i>
    <br><br>

    <font size=3><strong>
    5/26/61 Bob & Jerry - Barbara Meier's 16th birthday party, Menlo Park, CA
    </strong></font>
    <br>
    Follow The Drinking Gourd, John Henry, Santy Anno*, Poor Paddy Works On The Railway
    <br>
    <i>(*included on
        <a href="https://www.garciafamilyprovisions.com/product/JY148COMBO/before-the-dead-4cd-set?cp=640_62123_100764" target="_blank">Before The Dead
        </a>;
        <a href="https://gdsets.com/63posters/1961_05_26.jpg" target="_blank">birthday doodle for Barbara by Jerry
        </a>;
        <a href="https://gdsets.com/63posters/1961_05_26a.jpg" target="_blank">the master tape
        </a>
    )
    </i>
    <br><br>

Output:

%HTML::Element::optionalEndTag= dt, dd, li, p
<!DOCTYPE html>
<html lang="en">
   <head>
      <meta charset="utf-8" />
   </head>
   <body><font size="3"><strong> 5/5/61 Bob & Jerry - Arroyo Lounge, Stanford University, Palo Alto, CA </strong></font><br /><img alt="poster/ad" src="poster.png" title="poster/ad" /> <i>(Robert Hunter and Jerry Garcia; source: McNally, Jackson research)</i><br />
      <br /><font size="3"><strong> 5/26/61 Bob & Jerry - Barbara Meier's 16th birthday party, Menlo Park, CA </strong></font><br /> Follow The Drinking Gourd, John Henry, Santy Anno*, Poor Paddy Works On The Railway <br /><i>(*included on <a href="https://www.garciafamilyprovisions.com/product/JY148COMBO/before-the-dead-4cd-set?cp=640_62123_100764" target="_blank">Before The Dead </a>; <a href="https://gdsets.com/63posters/1961_05_26.jpg" target="_blank">birthday doodle for Barbara by Jerry </a>; <a href="https://gdsets.com/63posters/1961_05_26a.jpg" target="_blank">the master tape </a> ) </i><br />
      <br />
   </body>
</html>

Unfortunately, this isn't "pretty" enough. I don't understand why the indenting leaves off after the first couple of levels. However, I do note that it doesn't generate </br> or </img>, despite the fact that neither of these tags is mentioned in %HTML::Element::optionalEndTag!

Update 2:

br and img are, however, listed in %HTML::Tagset::emptyElement, which as_HTML checks.


Solution

  • <br> and <img> (among others) are empty elements: they never enclose content, so a separate end tag is pointless. Nevertheless, HTML::Element::endtag always returns the string </tag>, whether or not tag is an empty element.

    (Note that starttag is smart enough to write <tag attr=... /> for empty tags like <img ... /> and <br />.)

    Therefore the caller must explicitly test whether an end tag is appropriate. Fortunately there's a hash, %HTML::Tagset::emptyElement, that maps each empty element's tag name to 1 (true).

    The following code will print the HTML supplied in the OP in a simple, indented format with each tag on a separate line.

    #!/usr/bin/env perl 
    use v5.38;
    use HTML::TreeBuilder;
    my $indent = 3;
    my $content = do {local $/; <DATA>};
    my $tree = HTML::TreeBuilder->new();
    $tree->parse_content($content);
    visit($tree);
    
    sub visit($x) {
        use HTML::Tagset;
        my $depth = $x->depth;
        my $in = ' ' x ($indent * $depth);
        for my $e ($x->content_list) {
            if (ref ($e)) {     # element
                say $in . $e->starttag;
                if (! $HTML::Tagset::emptyElement{$e->tag}) {
                    visit($e);
                    say $in . $e->endtag;
                }
            }
            else {              # text
    
                # for extra prettiness
                use Text::Wrap;
                $Text::Wrap::columns = 132;
                say wrap($in, $in, $e);
            }
        }
    }
    __DATA__
    <!DOCTYPE html>
    <html lang="en">
    <head>
    <meta charset="utf-8">
    </head>
    <body>
        <font size=3><strong>
        5/5/61 Bob & Jerry - Arroyo Lounge, Stanford University, Palo Alto, CA
        </strong></font>
        <br>
        <img src="poster.png" alt="poster/ad" title="poster/ad">
        <i>(Robert Hunter and Jerry Garcia; source: McNally, Jackson research)</i>
        <br><br>
    
        <font size=3><strong>
        5/26/61 Bob & Jerry - Barbara Meier's 16th birthday party, Menlo Park, CA
        </strong></font>
        <br>
        Follow The Drinking Gourd, John Henry, Santy Anno*, Poor Paddy Works On The Railway
        <br>
        <i>(*included on
            <a href="https://www.garciafamilyprovisions.com/product/JY148COMBO/before-the-dead-4cd-set?cp=640_62123_100764" target="_blank">Before The Dead
            </a>;
            <a href="https://gdsets.com/63posters/1961_05_26.jpg" target="_blank">birthday doodle for Barbara by Jerry
            </a>;
            <a href="https://gdsets.com/63posters/1961_05_26a.jpg" target="_blank">the master tape
            </a>
        )
        </i>
        <br><br>
    

    Output:

    <head>
       <meta charset="utf-8" />
    </head>
    <body>
       <font size="3">
          <strong>
          5/5/61 Bob & Jerry - Arroyo Lounge, Stanford University, Palo Alto,
         CA
          </strong>
       </font>
       <br />
       <img alt="poster/ad" src="poster.png" title="poster/ad" />
    
       <i>
          (Robert Hunter and Jerry Garcia; source: McNally, Jackson research)
       </i>
       <br />
       <br />
       <font size="3">
          <strong>
          5/26/61 Bob & Jerry - Barbara Meier's 16th birthday party, Menlo
         Park, CA
          </strong>
       </font>
       <br />
        Follow The Drinking Gourd, John Henry, Santy Anno*, Poor Paddy Works On The
       Railway
       <br />
       <i>
          (*included on
          <a href="https://www.garciafamilyprovisions.com/product/JY148COMBO/before-the-dead-4cd-set?cp=640_62123_100764" target="_blank">
         Before The Dead
          </a>
          ;
          <a href="https://gdsets.com/63posters/1961_05_26.jpg" target="_blank">
         birthday doodle for Barbara by Jerry
          </a>
          ;
          <a href="https://gdsets.com/63posters/1961_05_26a.jpg" target="_blank">
         the master tape
          </a>
           )
       </i>
       <br />
       <br />
    </body>
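
The same test can be wrapped in a small helper, with the lines collected rather than printed, which makes the behaviour easier to check. This is only a sketch: wants_endtag and pretty_lines are names made up for illustration (they are not part of HTML::Element), the one-line input is invented, and skipping whitespace-only text nodes is an arbitrary choice.

```perl
#!/usr/bin/env perl
use v5.36;
use HTML::TreeBuilder;
use HTML::Tagset;

# Hypothetical helper: true when the element should get a closing tag,
# i.e. when its tag name is not listed in %HTML::Tagset::emptyElement.
sub wants_endtag ($e) {
    return !$HTML::Tagset::emptyElement{ $e->tag };
}

# Hypothetical helper: return one indented line per tag or text node
# found below $node, instead of printing them directly.
sub pretty_lines ($node, $indent = 3) {
    my $in = ' ' x ( $indent * $node->depth );
    my @lines;
    for my $child ( $node->content_list ) {
        if ( ref $child ) {    # element
            push @lines, $in . $child->starttag;
            if ( wants_endtag($child) ) {
                push @lines, pretty_lines( $child, $indent );
                push @lines, $in . $child->endtag;
            }
        }
        elsif ( $child =~ /\S/ ) {    # text node; skip whitespace-only ones
            my $text = $child =~ s/^\s+|\s+$//gr;    # trim both ends
            push @lines, $in . $text;
        }
    }
    return @lines;
}

# Invented sample input, just enough to exercise <br> and <img>.
my $tree = HTML::TreeBuilder->new;
$tree->parse_content('<p>one<br>two <img src="x.png" alt="x"></p>');

my @lines = pretty_lines($tree);
say for @lines;

$tree->delete;    # free the tree explicitly
```

Because pretty_lines returns a list, the caller can print it, join it into a single string, or grep it for stray </br> and </img> lines.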