python web-scraping beautifulsoup

Parsing a discussion forum only gets me the first user comment, not the other users' replies


Could someone please give me a hand? I can't figure this one out. I have a file with a list of URLs that looks like this:

https://community.appian.com/discussions/f/administration/14/integrate-token-device-with-appian
https://community.appian.com/discussions/f/administration/27/how-do-we-configure-enable-appian-tempo
https://community.appian.com/discussions/f/administration/31/how-to-download-get-pdf-of-the-documentation
https://community.appian.com/discussions/f/administration/39/we-need-to-establish-a-single-signon-with-an-outside-web-site-for-which-a-certif
https://community.appian.com/discussions/f/administration/43/is-there-a-way-to-import-an-application-exported-from-appian-6-6-1and-import-in
https://community.appian.com/discussions/f/administration/47/we-are-having-issues-with-oracle-db-integration-for-appian-6-6-1-we-are-install

The problem is that when I try to scrape them, I only get the first user comment.

Do you have any idea how I can make this work?

I tried using BS4 and failed miserably: I only get the first user comment and none of the other users' replies.

Here's what I'm using:

import json
from scrapegraphai.graphs import SmartScraperGraph

def main():
    graph_config = {
        "llm": {
            "model": "ollama/llama3",
            "temperature": 0,
            "base_url": "http://localhost:11434",
            "format": "json",  # Ollama needs the format to be specified explicitly
        },
        "embeddings": {
            "model": "ollama/nomic-embed-text",
        }
    }

    source_urls = []
    with open('cleaned-urls.txt', 'r') as f:
        sources = [line.strip() for line in f if line.strip()]
        source_urls.extend(sources)

    for source_url in source_urls:
        try:
            prompt = "find the best way to extract data, eliminate unnecessary fields and organise to only show the entire conversation and code snippets. make sure to include all text from the conversation and the users answers. always the first text is the question and what follows is from other user replies"
            smart_scraper_graph = SmartScraperGraph(prompt=prompt, source=source_url, config=graph_config)
            result = smart_scraper_graph.run()
            output = json.dumps(result, indent=2)
            print(output)
        except Exception as e:
            print(f"An error occurred: {e}")

if __name__ == "__main__":
    main()

Solution

  • You can only get the first user post (the question, basically) because the replies are hydrated into the page via an XHR call to an API endpoint. The page can be fully scraped with requests and BeautifulSoup, but that route is very convoluted; to keep complexity to a minimum, I suggest using Selenium in this instance.

    Here is an example of how you can do it with Selenium. Bear in mind that this code won't work as-is on a headless machine (or on Google Colab, for that matter); you would need some extra packages to make it work there. It is meant to be run on a standard machine with Chrome installed, in Jupyter or as a standalone Python file.
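
For a headless machine, one possible adjustment is to pass a few extra Chrome flags. This is only a minimal sketch: it assumes a recent Chrome build that understands `--headless=new`, and that Chrome and the system libraries it needs are already installed. The helper name `headless_chrome_flags` is hypothetical and not part of Selenium.

```python
def headless_chrome_flags(width=1280, height=1080):
    """Return Chrome command-line flags commonly used on machines without a display.

    Assumption: a recent Chrome that supports the "new" headless mode.
    """
    return [
        "--headless=new",           # run Chrome without opening a window
        "--no-sandbox",             # often needed in containers or when running as root
        "--disable-dev-shm-usage",  # avoid crashes when /dev/shm is small
        f"--window-size={width},{height}",
    ]


# Usage with a selenium.webdriver.chrome.options.Options object, such as the
# chrome_options used in this answer:
#     for flag in headless_chrome_flags():
#         chrome_options.add_argument(flag)
print(headless_chrome_flags())
```

The flags are kept in a plain list so they can be inspected or adjusted before being handed to Selenium.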

    As Stack Overflow is not a code-writing service, I only print the output to the terminal. You can drill further into the selected elements to get the time, author, and clean content, parse the JSON, put it into a dataframe, save it to disk, and so on.

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    import json
    
    chrome_options = Options()
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-notifications")
    chrome_options.add_argument("--window-size=1280,1080")
    
    urls = [
        'https://community.appian.com/discussions/f/administration/14/integrate-token-device-with-appian',
        'https://community.appian.com/discussions/f/administration/27/how-do-we-configure-enable-appian-tempo',
        'https://community.appian.com/discussions/f/administration/31/how-to-download-get-pdf-of-the-documentation',
        'https://community.appian.com/discussions/f/administration/39/we-need-to-establish-a-single-signon-with-an-outside-web-site-for-which-a-certif',
        'https://community.appian.com/discussions/f/administration/43/is-there-a-way-to-import-an-application-exported-from-appian-6-6-1and-import-in',
        'https://community.appian.com/discussions/f/administration/47/we-are-having-issues-with-oracle-db-integration-for-appian-6-6-1-we-are-install'
    ]
    with webdriver.Chrome(options=chrome_options) as driver:
        wait = WebDriverWait(driver, 15)
        for url in urls:
            driver.get(url)
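            # The question and any accepted/suggested answer are embedded as JSON-LD in a <script> tag.
            ld_script = wait.until(EC.presence_of_element_located((By.XPATH, '//script[@type="application/ld+json"]')))
            main_q = json.loads(ld_script.get_attribute('innerHTML'))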
            print(main_q)
            print('____________________________________________________________')
            print('____________________________________________________________')
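            # Replies are loaded by JavaScript, so wait until they are present in the DOM.
            # Nested replies match this XPath too, so their text can show up twice:
            # once inside the parent reply and once on its own.
            replies = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//li[@class="threaded content-item  "]')))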
            for reply in replies:
                print(reply.text)
                print('____________________________________________________________')
            print('+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++')
    

    Result in terminal:

    {'@context': 'https://schema.org', '@type': 'QAPage', 'mainEntity': {'@type': 'Question', 'name': 'Integrate token device with Appian', 'text': 'how can I add login token device? \n OriginalPostID-22130 \n OriginalPostID-22130', 'answerCount': 1, 'upvoteCount': 0, 'dateCreated': '2012-01-21T23:04:00.6170000Z', 'author': {'@type': 'Person', 'name': 'alex.he38'}, 'suggestedAnswer': [{'@type': 'Answer', 'text': 'Yes', 'dateCreated': '2022-02-27T06:57:12.6030000Z', 'upvoteCount': 0, 'url': 'https://community.appian.com/discussions/f/administration/14/integrate-token-device-with-appian/91615', 'author': {'@type': 'Person', 'name': 'kalicharans0001'}}]}}
    ____________________________________________________________
    ____________________________________________________________
    phillip.russell
    Appian Employee
    over 12 years ago
    Can you provide a little more information about this request? Were you looking at implementing multi-factor authentication, or were you speaking about something different?
    Vote Up
    0
    Vote Down
    Sign in to reply
    ____________________________________________________________
    alex.he38 over 12 years ago
    Hi Phillip,
    Thanks for your prompt response, I’m bidding for a financial group for 3 banks and one of this bank already using Token device ref# C100 to login users. www.ftsafe.com/.../epass.html
    I am organizing a pilot to demo Appian and I would to like demo how Appian could integrate with Token devices as well. Is there any possibility or adapter?
    Vote Up
    0
    Vote Down
    Sign in to reply
    Stefan Helzle
    Certified Lead Developer
    over 2 years ago in reply to alex.he38
    Appian supports single sign on via SAML. If your identity provider supports these tokens, you should be good to go.
    ____________________________________________________________
    Stefan Helzle
    Certified Lead Developer
    over 2 years ago in reply to alex.he38
    Appian supports single sign on via SAML. If your identity provider supports these tokens, you should be good to go.
    ____________________________________________________________
    kalicharans0001 over 2 years ago
    Yes
    ____________________________________________________________
    +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    {'@context': 'https://schema.org', '@type': 'QAPage', 'mainEntity': {'@type': 'Question', 'name': 'How do we configure/Enable Appian Tempo', 'text': 'How do we configure/Enable Appian Tempo ? \n OriginalPostID-23560 \n OriginalPostID-23560', 'answerCount': 1, 'upvoteCount': 0, 'dateCreated': '2012-02-07T17:28:48.6600000Z', 'author': {'@type': 'Person', 'name': 'santoshks'}, 'acceptedAnswer': {'@type': 'Answer', 'text': 'You need to configure a primary datasource. To do this, please review sections 6.1, 6.2 and 6.3 of the following: forum.appian.com/.../Appian_6.6_Windows_Installation_Guide_for_JBoss', 'dateCreated': '2012-02-07T17:32:08.5000000Z', 'upvoteCount': 0, 'url': 'https://community.appian.com/discussions/f/administration/27/how-do-we-configure-enable-appian-tempo/54', 'author': {'@type': 'Person', 'name': 'phillip.russell'}}}}
    ____________________________________________________________
    ____________________________________________________________
    +1
    phillip.russell
    Appian Employee
    over 12 years ago
    You need to configure a primary datasource. To do this, please review sections 6.1, 6.2 and 6.3 of the following: forum.appian.com/.../Appian_6.6_Windows_Installation_Guide_for_JBoss
    Vote Up
    0
    Vote Down
    Sign in to reply
    ____________________________________________________________
    +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    {'@context': 'https://schema.org', '@type': 'QAPage', 'mainEntity': {'@type': 'Question', 'name': 'How to download/get PDF of the documentation', 'text': 'Hi, I am a newbie to appian. I just downloaded the software for installation. Upon looking for documentation, I only see wiki pages and no zip archive or pdf document that I can download. Am I missing something or is the documentation only available via wiki pages ? Regards Zacharia \n OriginalPostID-23695 \n OriginalPostID-23695', 'answerCount': 1, 'upvoteCount': 0, 'dateCreated': '2012-02-09T00:56:55.0170000Z', 'author': {'@type': 'Person', 'name': 'zacm'}, 'acceptedAnswer': {'@type': 'Answer', 'text': 'In order to keep it updated and available at any time the documentation is only provided via the Documentation section online. This allows you to get the most recent information at any time anywhere.', 'dateCreated': '2012-02-09T01:00:32.7370000Z', 'upvoteCount': -1, 'url': 'https://community.appian.com/discussions/f/administration/31/how-to-download-get-pdf-of-the-documentation/71', 'author': {'@type': 'Person', 'name': 'Eduardo Fuentes'}}}}
    ____________________________________________________________
    ____________________________________________________________
    +1
    Eduardo Fuentes
    Appian Employee
    over 12 years ago
    In order to keep it updated and available at any time the documentation is only provided via the Documentation section online. This allows you to get the most recent information at any time anywhere.
    Vote Up
    -1
    Vote Down
    Sign in to reply
    ____________________________________________________________
    +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    {'@context': 'https://schema.org', '@type': 'QAPage', 'mainEntity': {'@type': 'Question', 'name': 'We need to establish a single signon with an outside web site for which a certif', 'text': 'We need to establish a single signon with an outside web site for which a certificate needs to be installed on the server. I'm assuming this is done through JBoss. Has anyone done this and what procedure did you follow? OriginalPostID-23833 OriginalPostID-23833', 'answerCount': 0, 'upvoteCount': 0, 'dateCreated': '2012-02-11T00:18:30.8670000Z', 'author': {'@type': 'Person', 'name': 'craigt'}}}
    ____________________________________________________________
    ____________________________________________________________
    Jacob Rank
    Appian Employee
    over 12 years ago
    Craig, do you mean that you need to setup SSL for your site? That's enabled via a web server. For an example see forum.appian.com/.../Configuring_Apache_Web_Server_with_JBoss
    
    On the other hand if you need to use the cert for client authentication or as part of the system truststore it would be a completely different configuration involving the JBoss JVM.
    Vote Up
    0
    Vote Down
    Sign in to reply
    ____________________________________________________________
    craigt over 12 years ago
    It's not for our site. It's to do a single signon to another website to display some data.
    Vote Up
    0
    Vote Down
    Sign in to reply
    ____________________________________________________________
    Jacob Rank
    Appian Employee
    over 12 years ago
    I wasn't sure if perhaps the SSO solution required your site to be using certain SSL cert. Which SSO provider are you working with?
    ____________________________________________________________
    phillip.russell
    Appian Employee
    over 12 years ago
    Why the need to deploy a certificate? Is it an Intermediate CA that's not being sent in the chain? Do you have OpenSSL installed on the Appian server, and if so, what's the result of running "openssl s_client -connect" against the SSO URL?
    ____________________________________________________________
    craigt over 12 years ago
    It's a certificate the site author provided and indicated it needed to be sent when prompted. We don't have OpenSSL currently installed.
    ____________________________________________________________
    +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    {'@context': 'https://schema.org', '@type': 'QAPage', 'mainEntity': {'@type': 'Question', 'name': 'Is there a way to import an application exported from Appian 6.6.1and import  in', 'text': 'Is there a way to import an application exported from Appian 6.6.1and import into Appian 6.6.0? OriginalPostID-24338 OriginalPostID-24338', 'answerCount': 0, 'upvoteCount': 0, 'dateCreated': '2012-02-16T16:49:28.3970000Z', 'author': {'@type': 'Person', 'name': 'Bill'}}}
    ____________________________________________________________
    ____________________________________________________________
    Eduardo Fuentes
    Appian Employee
    over 12 years ago
    As a best practice you shouldn't be doing this, even though, some basic items can be still compatible and importable by just changing the Appian-Version in the META-INF/MANIFEST.MF to 6.6.0.0 , this approach is not supported and Appian cannot guarantee this import completely fine, which makes sense if we take in consideration that new versions have new features that may not be present/compatible with previous versions.
    Vote Up
    0
    Vote Down
    Sign in to reply
    ____________________________________________________________
    Bill over 12 years ago
    I tried your solution and worked with some errors. 36 of 48 imported successfully. I have the import log. How do I post the log?
    Vote Up
    0
    Vote Down
    Sign in to reply
    ____________________________________________________________
    Eduardo Fuentes
    Appian Employee
    over 12 years ago
    Go this post: forum.appian.com/.../3557 and use the "Add Attachments" button; follow these steps: forum.appian.com/.../Uploading_a_Forum_Attachment
    
    Let me know the name of the folder where you uploaded it so I can take a look.
    ____________________________________________________________
    Myles Weber
    Appian Employee
    over 12 years ago
    Basically, this puts the system in an unsupported status. Nobody should be doing this that cares about their production system.
    ____________________________________________________________
    Bill over 12 years ago
    I have gone to the discussion, hit add attachment, selected Default Community, Appian KC, Discussion Topic Attachements, Eduardo Fuentes, selected Upload Document, browsed to the file, gave it a description, selected create, It indicated that it worked but I can not find the file to select it an add as an attachment. The file is named import_failure_log.zip. I don't know what I'm doing wrong but it is not being loaded.
    ____________________________________________________________
    Eduardo Fuentes
    Appian Employee
    over 12 years ago
    The import log confirms what Myles and I said; this is not supported because some features are not compatible with previous versions, if you see your import log, you have two problems; the first one is the target envrionment doesn't have a primary data soruce configured, and second, data stores from 6.6.1 are internally different from 6.6.1 therefore they are not importable into an old version of Appian. Please use an installation of 6.6.1 instead.
    ____________________________________________________________
    Bill over 12 years ago
    Thanks, That is what I thought. However, I did set up a primary data source. I wonder why it doesn't see it?
    ____________________________________________________________
    Eduardo Fuentes
    Appian Employee
    over 12 years ago
    Although this is definitely not going to solve your problem of using the unsupported approach of importing a 6.6.1 one app in 6.6 you need to make sure you have configured the primary data source correctly to take advantage of the new features that require the data source.
    
    Take a look at the beginning of the application server log, if the configuration is right you will see something like this:
    
    Validating and initializing the primary data source: java:/AppianPrimaryDS
    [java:/AppianPrimaryDS] Checking schema and migrating if necessary...
    [java:/AppianPrimaryDS] Schema check/migration completed successfully.
    ...
    ____________________________________________________________
    Eduardo Fuentes
    Appian Employee
    over 12 years ago
    If you see any errors related to your primary data source during JBoss startup (search for your JNDI name in the most recent entries in your log) share them with me so we can see what the issue is.
    ____________________________________________________________
    +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    {'@context': 'https://schema.org', '@type': 'QAPage', 'mainEntity': {'@type': 'Question', 'name': 'We are having issues with Oracle DB integration for Appian 6.6.1. We are install', 'text': 'We are having issues with Oracle DB integration for Appian 6.6.1. We are installing 6.6.1 and reusing the existing Primary DataSource configured for our Appian 6.5.1 installation.But the JBoss App server throwing error saying "Invalid Schema" for the Primary DataSource. As I understand from the documentation, Appian automatically updates the Schema for the newer version if we use an existing schema(for older version). Can someone please provide insight into what might be causing this issue? OriginalPostID-24783 OriginalPostID-24783', 'answerCount': 0, 'upvoteCount': 0, 'dateCreated': '2012-02-21T14:25:44.6330000Z', 'author': {'@type': 'Person', 'name': 'prosenjitd'}}}
    ____________________________________________________________
    ____________________________________________________________
    Eduardo Fuentes
    Appian Employee
    over 12 years ago
    Is this error thrown on 6.6.1 or when you try to run the old 6.5.1? Once Appian 6.6.1 updates the schema you can't use that one with the old install. If this error is happening on the upgraded environment, check the complete application-server.log, there should be more information on why the schema cannot be updated (e.g. permissions/privileges)
    Vote Up
    0
    Vote Down
    Sign in to reply
    ____________________________________________________________
    +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
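
As one way to drill down into the JSON-LD dictionaries printed above, the sketch below flattens a single `QAPage` dict into a list of rows. It assumes the shape seen in this output: `mainEntity` holds the question, `acceptedAnswer` is a single dict, and `suggestedAnswer` is a list. The function name `flatten_qa` and the row keys are hypothetical.

```python
def flatten_qa(ld):
    """Turn one QAPage JSON-LD dict into a list of flat row dicts."""
    question = ld.get("mainEntity", {})

    def row(kind, item):
        # Missing keys become None (or an empty string for the text).
        return {
            "kind": kind,
            "author": item.get("author", {}).get("name"),
            "date": item.get("dateCreated"),
            "text": item.get("text", "").strip(),
            "url": item.get("url"),
        }

    rows = [row("question", question)]
    for key in ("acceptedAnswer", "suggestedAnswer"):
        answers = question.get(key, [])
        if isinstance(answers, dict):  # acceptedAnswer is a single dict in the output above
            answers = [answers]
        rows.extend(row(key, answer) for answer in answers)
    return rows


# Abridged from the first result printed above.
sample = {
    "@type": "QAPage",
    "mainEntity": {
        "@type": "Question",
        "name": "Integrate token device with Appian",
        "text": "how can I add login token device? \n OriginalPostID-22130 \n OriginalPostID-22130",
        "dateCreated": "2012-01-21T23:04:00.6170000Z",
        "author": {"@type": "Person", "name": "alex.he38"},
        "suggestedAnswer": [{
            "@type": "Answer",
            "text": "Yes",
            "dateCreated": "2022-02-27T06:57:12.6030000Z",
            "url": "https://community.appian.com/discussions/f/administration/14/integrate-token-device-with-appian/91615",
            "author": {"@type": "Person", "name": "kalicharans0001"},
        }],
    },
}

rows = flatten_qa(sample)
for r in rows:
    print(r["kind"], r["author"], r["date"])
```

A list of dicts like `rows` can be passed straight to `pandas.DataFrame(rows)`. Note that in the first thread above the JSON-LD lists only one answer while the page shows several replies, so this complements the reply elements rather than replacing them.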
    

    Selenium documentation can be found here.
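
The printed reply text above still contains widget labels ("Vote Up", "Sign in to reply"), and nested replies appear both inside their parent and on their own. Below is a rough sketch of a cleanup step that works on the `reply.text` strings only. The helper names `remove_nested` and `clean_reply` are hypothetical, and the substring check assumes that no reply's full text happens to occur verbatim inside an unrelated reply.

```python
import re

# Voting widget as it appears in the output: label, vote count, label.
VOTE_WIDGET = re.compile(r"^Vote Up\n-?\d+\nVote Down\n?", re.MULTILINE)
SIGN_IN = re.compile(r"^Sign in to reply\n?", re.MULTILINE)


def remove_nested(texts):
    """Cut each nested reply's text out of the parent text that contains it."""
    result = []
    for i, text in enumerate(texts):
        for j, other in enumerate(texts):
            if i != j and other and other != text and other in text:
                text = text.replace(other, "")
        result.append(text.strip())
    return result


def clean_reply(text):
    """Drop the voting widget and the sign-in prompt from one reply."""
    text = VOTE_WIDGET.sub("", text)
    text = SIGN_IN.sub("", text)
    return text.strip()


# Shortened versions of the replies printed for the first URL.
texts = [
    "phillip.russell\nAppian Employee\nover 12 years ago\nCan you provide more information?\nVote Up\n0\nVote Down\nSign in to reply",
    "alex.he38 over 12 years ago\nHi Phillip\nVote Up\n0\nVote Down\nSign in to reply\nStefan Helzle\nCertified Lead Developer\nover 2 years ago in reply to alex.he38\nAppian supports single sign on via SAML.",
    "Stefan Helzle\nCertified Lead Developer\nover 2 years ago in reply to alex.he38\nAppian supports single sign on via SAML.",
    "kalicharans0001 over 2 years ago\nYes",
]

posts = [clean_reply(t) for t in remove_nested(texts)]
for post in posts:
    print(post)
    print("-" * 20)
```

In the Selenium loop, `texts` would be `[reply.text for reply in replies]`. Selecting the author, time, and body through more specific element locators would be more robust, but that depends on the page markup, which is not shown here.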