Python: Remove everything between two given HTML tags


The programming language of choice is Python. This is my HTML text:

<a href="https://www.example.com">Original message</a><br>
<ul id="list">
    <li class="blockbody" id="post_1">
        <div class="header">
            <div class="datetime">
                24 januari 2020, 11:34
            </div><span class="name">Jane Doe</span>
        </div>
        <div class="content">
            <blockquote class="restore">
                <div class="bbcode_container">
                    <i class="fa fa-envelope"></i> Citation:
                    <div class="bbcode_quote printable">
                        <hr>
                        <div>
                            Citation: John Doe <a href="showthread.php#post684209" rel="nofollow"><img alt="" class="inlineimg" src="image/style/Aesthetica/button/view.gif"></a>
                        </div>
                        <div class="message">
                            Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
                        </div>
                        <hr>
                    </div>
                </div><br>
                <div class="bbcode_container">
                    <i class="fa fa-envelope"></i> Citation:
                    <div class="bbcode_quote printable">
                        <hr>
                        Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum?<br>
                        <br>
                        Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam.<br>
                        <br>
                        quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum..
                        <hr>
                    </div>
                </div>... <a href="https://example.com" target="_blank">https://example.html</a><br>
                <br>
                velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum?
            </blockquote>
        </div>
    </li>
</ul>

I want to remove everything between the FIRST <div class="bbcode_quote printable"> and the LAST <hr> tag. As you can see, both tags occur multiple times, which is why I emphasize FIRST and LAST. I'm familiar with Python, but string manipulation is not my area of expertise. I hope that's clear.


Solution

  • Using regex, keeping the first <div class="bbcode_quote printable"> tag and the last <hr> tag themselves and removing only what lies between them:

    html = '''
    <a href="https://www.example.com">Original message</a><br>
    <ul id="list">
        <li class="blockbody" id="post_1">
            <div class="header">
                <div class="datetime">
                    24 januari 2020, 11:34
                </div><span class="name">Jane Doe</span>
            </div>
            <div class="content">
                <blockquote class="restore">
                    <div class="bbcode_container">
                        <i class="fa fa-envelope"></i> Citation:
                        <div class="bbcode_quote printable">
                            <hr>
                            <div>
                                Citation: John Doe <a href="showthread.php#post684209" rel="nofollow"><img alt="" class="inlineimg" src="image/style/Aesthetica/button/view.gif"></a>
                            </div>
                            <div class="message">
                                Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
                            </div>
                            <hr>
                        </div>
                    </div><br>
                    <div class="bbcode_container">
                        <i class="fa fa-envelope"></i> Citation:
                        <div class="bbcode_quote printable">
                            <hr>
                            Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum?<br>
                            <br>
                            Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam.<br>
                            <br>
                            quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum..
                            <hr>
                        </div>
                    </div>... <a href="https://example.com" target="_blank">https://example.html</a><br>
                    <br>
                    velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum?
                </blockquote>
            </div>
        </li>
    </ul>
    '''
    

    Then:

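    import re

    # Index just past the FIRST opening tag (re.search returns the leftmost match).
    first = re.search(r'<div class="bbcode_quote printable">', html).end()
    # Search the reversed text for the reversed tag: the leftmost match there is the
    # LAST <hr> in html, and .end() is the distance from the end of html to its start.
    last = re.search(re.escape('<hr>'[::-1]), html[::-1]).end()

    # Keep everything up to and including the first tag, then resume at the last <hr>.
    new_html = html[:first] + html[len(html) - last:]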
    

    Result:

    print(new_html)
    
    <a href="https://www.example.com">Original message</a><br>
    <ul id="list">
        <li class="blockbody" id="post_1">
            <div class="header">
                <div class="datetime">
                    24 januari 2020, 11:34
                </div><span class="name">Jane Doe</span>
            </div>
            <div class="content">
                <blockquote class="restore">
                    <div class="bbcode_container">
                        <i class="fa fa-envelope"></i> Citation:
                        <div class="bbcode_quote printable"><hr>
                        </div>
                    </div>... <a href="https://example.com" target="_blank">https://example.html</a><br>
                    <br>
                    velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum?
                </blockquote>
            </div>
        </li>
    </ul>
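
  • A possible variation, as a minimal sketch rather than part of the original answer: plain str.find and str.rfind locate the same two positions without reversing the string. The function name strip_between and its parameters are hypothetical, and the sketch assumes both markers appear in the text exactly as written (same attribute order, same quoting, no extra whitespace inside the tags). The short sample string stands in for the full HTML above.

```python
def strip_between(text, start_marker, end_marker):
    """Remove everything between the first start_marker and the last end_marker.

    Both markers themselves are kept. Raises ValueError if either marker is
    missing, or if the last end_marker starts before the first start_marker ends.
    (Hypothetical helper for illustration only.)
    """
    start = text.find(start_marker)  # leftmost occurrence
    if start == -1:
        raise ValueError(f"start marker not found: {start_marker!r}")
    start += len(start_marker)  # move past the opening tag so it is preserved

    end = text.rfind(end_marker)  # start index of the rightmost occurrence
    if end == -1:
        raise ValueError(f"end marker not found: {end_marker!r}")
    if end < start:
        raise ValueError("last end marker comes before the first start marker")

    return text[:start] + text[end:]


# Shortened stand-in for the HTML in the question.
sample = (
    '<div class="bbcode_quote printable"><hr>first<hr></div>'
    '<div class="bbcode_quote printable"><hr>second<hr></div>tail'
)
result = strip_between(sample, '<div class="bbcode_quote printable">', '<hr>')
print(result)
```

    Like the regex version, this is pure string slicing and knows nothing about HTML structure, so it only leaves the markup balanced when the removed span happens to contain matching open and close tags, as it does in the example from the question.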