Inaccurate Y Values When Extracting Text Coordinates


I'm using PyQt5's QWebEngineView to display a PDF. I'm currently using PDF.js to extract the text coordinates of a selected region in a PDF document (based on this question), with the following code:

from PyQt5.QtWidgets import QApplication, QMainWindow, QAction, QTextBrowser
from PyQt5.QtWebEngineWidgets import QWebEngineView
from PyQt5.QtCore import QUrl, QTimer, QEventLoop
import os


class MyWebWidgetPdf(QWebEngineView):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)


    def contextMenuEvent(self, event):
        position_action = QAction("Position", self)
        position_action.triggered.connect(self.get_current_position)

        self.menu = self.page().createStandardContextMenu()
        self.menu.addAction(position_action)
        self.menu.popup(event.globalPos())

    def get_current_position(self):
        js_script = """
                var pageIndex = PDFViewerApplication.pdfViewer.currentPageNumber - 1; 
                var page = PDFViewerApplication.pdfViewer.getPageView(pageIndex);
                var pageRect = page.canvas.getClientRects()[0];
                var selection = window.getSelection();
                var selectionRects = selection.getRangeAt(0).getClientRects();
                var selectionRectsArray = Array.from(selectionRects);

                var selectedText = selection.toString();
                var viewport = page.viewport;
                var selected = selectionRectsArray.map(function (r) {
                    return viewport.convertToPdfPoint(r.left - pageRect.x, r.top - pageRect.y).concat(
                        viewport.convertToPdfPoint(r.right - pageRect.x, r.bottom - pageRect.y)
                    ); 
                });
                
                var result = {
                    page: pageIndex + 1, 
                    coords: selected, 
                    selectedText: selectedText
                };

                result;
            """

        result = self.execJavaScript(js_script)
        print(result)


    def execJavaScript(self, script):
        """This function executes a javascript script and returns the result.
            :param script: The script to execute.
            :return: The result of the script."""
        result = None  # initialize the result

        def callback(data):
            """This function is called when the script is executed.
                :param data: The result of the script."""
            nonlocal result  # use the result variable of the parent function
            result = data  # set the result
            loop.quit()  # quit the event loop

        loop = QEventLoop()  # create an event loop
        QTimer.singleShot(0, lambda: self.page().runJavaScript(script, callback))  # execute the script
        loop.exec()  # start the event loop
        return result  # return the result


class PDFViewer(QMainWindow):
    def __init__(self):
        super().__init__()

        # The PDF.js viewer is the only central widget; the QTextBrowser that was
        # set first was replaced immediately and never shown.
        self.pdf_viewer = MyWebWidgetPdf()
        self.setCentralWidget(self.pdf_viewer)
        path = "/Users/user/Desktop/3._SprengV.pdf"
        PDF = f'file:{os.path.abspath(path)}'
        self.PDFJS = 'file:////Users/user/PycharmProjects/legalref/pdfjs-3/web/viewer.html'
        self.pdf_viewer.load(QUrl.fromUserInput(f'{self.PDFJS}?file={PDF}'))


if __name__ == "__main__":
    import sys

    app = QApplication(sys.argv)
    dialog = PDFViewer()
    dialog.show()
    sys.exit(app.exec_())

I'm encountering an issue where the X values are correct, but the Y values are consistently off by a large amount. I checked the X and Y values against PyMuPDF.


Solution

  • Native PDF content uses a chart-like coordinate system in which Y increases upwards. The "origin" is commonly the lower-left corner of the page, and lengths are usually measured in point-size units, nominally 1/72".

    All of this can be changed by /UserUnit and by transformations of the current matrix, which can turn "up" into "right to left" (right reading) or even "upside down".

    By default, the /MediaBox of a letter page is 72 x 11 = 792 units high, but it can be any number of units and still be 11" high when printed or displayed.

    When rendering, however, the "screen" origin is presumed to be the top-left corner, with Y increasing downwards. Many PDF applications work in those downward units and then translate into the opposite direction when writing the file.

    This causes issues when using multiple applications in the same workspace. It is easily resolved, though: subtract the current height from the /MediaBox height, then adjust for any current user scale and/or position.

    For example, suppose we place an image 4.5" (324 units) high with its bottom edge halfway up the page at 5.5" (396 units), so that there is a 1" gap above it. Its top edge then needs to be at 792 - 396 - 324 = 72 units down from the top of the page.
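
    The subtraction above can be sketched in Python. This is a minimal illustration, not a definitive fix: it assumes an unrotated page whose /MediaBox starts at (0, 0) with no /UserUnit scaling, and that the tool you compare against (PyMuPDF, as far as I know) measures Y downwards from the top-left corner. The names pdf_y_to_top_down and flip_rect are hypothetical helpers, and the sample rectangle is made up for illustration.

```python
def pdf_y_to_top_down(y_up, page_height):
    """Convert a Y value measured upwards from the bottom edge
    into one measured downwards from the top edge."""
    return page_height - y_up


def flip_rect(rect, page_height):
    """Convert [x0, y0, x1, y1] in bottom-left-origin PDF space into
    [left, top, right, bottom] in top-left-origin space."""
    x0, y0, x1, y1 = rect
    top = page_height - max(y0, y1)      # the higher PDF edge is nearest the top
    bottom = page_height - min(y0, y1)   # the lower PDF edge is furthest from the top
    return [min(x0, x1), top, max(x0, x1), bottom]


LETTER_HEIGHT = 72 * 11  # 792 units, the default letter /MediaBox height

# Worked example from the text: image 4.5" high, bottom edge 5.5" up the page.
image_bottom = 5.5 * 72  # 396 units up
image_height = 4.5 * 72  # 324 units
image_top_down = pdf_y_to_top_down(image_bottom + image_height, LETTER_HEIGHT)
print(image_top_down)  # 72.0

# Hypothetical selection rectangle in the shape produced by the question's
# script: two PDF points concatenated as [x0, y0, x1, y1].
print(flip_rect([72, 700, 200, 688], LETTER_HEIGHT))  # [72, 92, 200, 104]
```

    In the question's case, each entry of coords would be passed through flip_rect with the real page height (not the letter default used here) before comparing against top-down values.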