
how to add a login to a bs4 parser-script


Question

tarifa

Dear experts,

first of all, I hope you are all well.

I want to scrape a website that requires logging in with a password first. How can I start scraping it with Python using the beautifulsoup4 library?

Below is what I do at the moment:


import requests
from bs4 import BeautifulSoup as BS

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})  # this page needs the 'User-Agent' header

url = 'https://wordpress.org//{}/'

for page in range(1, 3):
    print('\n--- PAGE:', page, '---\n')

    # read page with list of posts
    r = session.get(url.format(page))

 

 

 

But what should I do to log in to the WordPress support forums?

Note that my parser job requires a login.

I found some options and had a closer look at them; I have added them here.

 

The first of several methods:

 

 


from urllib.request import urlopen  # Python 3 replacement for urllib2.urlopen
from bs4 import BeautifulSoup

url = urlopen("http://www.python.org")
content = url.read()
soup = BeautifulSoup(content, "html.parser")  # name the parser explicitly

How should the code be changed to accommodate login? Assume that the website I want to scrape is a forum that requires login. An example is http://forum.arduino.cc/index.php

 

 

Or should I use mechanize:

 

import http.cookiejar  # Python 3 name of the old cookielib module

import mechanize
from bs4 import BeautifulSoup

cj = http.cookiejar.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)
br.open("https://id.arduino.cc/auth/login/")

br.select_form(nr=0)  # select the first form on the page
br.form['username'] = 'username'
br.form['password'] = 'password.'
br.submit()

print(br.response().read())

 

 

 

Besides this, we can also go this way:

 

# Login to website using just Python 3 Standard Library
import urllib.parse
import urllib.request
import http.cookiejar


def scraper_login():
    ####### change variables here, like URL, action URL, user, pass
    # your base URL here, will be used for headers and such, with and without https://
    base_url = 'www.example.com'
    https_base_url = 'https://' + base_url

    # here goes URL that's found inside form action='.....'
    #   adjust as needed, can be all kinds of weird stuff
    authentication_url = https_base_url + '/login'

    # username and password for login
    username = 'yourusername'
    password = 'SoMePassw0rd!'

    # we will use this string to confirm a login at end
    check_string = 'Logout'

    ####### rest of the script is logic
    # but you will need to tweak couple things maybe regarding "token" logic
    #   (can be _token or token or _token_ or secret ... etc)

    # big thing! you need a referer for most pages! and correct headers are the key
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "User-agent": "Mozilla/5.0 Chrome/81.0.4044.92",  # Chrome 80+ as per web search
        "Host": base_url,
        "Origin": https_base_url,
        "Referer": https_base_url,
    }

    # initiate the cookie jar (using : http.cookiejar and urllib.request)
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
    urllib.request.install_opener(opener)

    # ... and so forth (the rest of the script is not reproduced here)


scraper_login()



See more here: https://stackoverflow.com/questions/23102833/how-to-scrape-a-website-which-requires-login-using-python-and-beautifulsoup
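
The script above stops after the opener is installed. What follows is a minimal sketch of how the remaining steps might look with only the standard library; it is not the rest of the linked answer. The field names `username` and `password`, the helper names `make_opener`, `build_login_request`, `send` and `login_succeeded`, and the `extra_fields` parameter for hidden token inputs are all assumptions for illustration and have to be adapted to the real login form.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_opener():
    """Return an opener that stores cookies, plus the jar it writes to."""
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar)
    )
    return opener, cookie_jar


def build_login_request(authentication_url, username, password,
                        extra_fields=None, headers=None):
    """Build (but do not send) the POST request for a login form.

    'username' and 'password' are assumed field names; check the
    name="..." attributes of the real form and adjust them.
    """
    payload = {"username": username, "password": password}
    payload.update(extra_fields or {})  # e.g. a hidden token field
    data = urllib.parse.urlencode(payload).encode("utf-8")
    return urllib.request.Request(
        authentication_url, data=data, headers=headers or {}, method="POST"
    )


def send(opener, request):
    """Send the request through the cookie-aware opener (network I/O)."""
    with opener.open(request) as response:
        return response.read().decode("utf-8", errors="replace")


def login_succeeded(page_text, check_string="Logout"):
    """Heuristic from the script above: a logged-in page shows 'Logout'."""
    return check_string in page_text
```

A possible call sequence would be `opener, jar = make_opener()`, then `send(opener, build_login_request(...))`, then `login_succeeded(...)` on the returned text. Later requests sent through the same opener reuse the cookies collected in the jar.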
 

 

 

But there is an even simpler way:

a method that gets us there without Selenium, mechanize, or other third-party tools, although it is semi-automated. Basically, when we log in to a site the normal way, we identify ourselves uniquely with our credentials. That identity is stored in cookies and headers and is reused for every subsequent interaction for a limited period of time.

What we need to do is send the same cookies and headers with our HTTP requests, and we'll be in.

To replicate that, follow these steps:

1. In the browser, open the developer tools.
2. Go to the site and log in.
3. After the login, go to the Network tab and refresh the page.
4. At this point, we should see a list of requests, the top one being the actual site. That request is our focus, because it contains the data with the identity we can use for Python and BeautifulSoup to scrape the site. Right-click that top request, hover over Copy, and then choose Copy as cURL ...
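
The steps above end with a copied cURL command. One way to reuse it in Python, sketched here with the standard library only, is to paste the value of its Cookie header (and the User-Agent) into a request. The helper names `parse_cookie_header`, `build_authenticated_request` and `fetch` are assumptions for illustration, and the cookies stop working once the browser session expires.

```python
import urllib.request


def parse_cookie_header(cookie_header):
    """Split 'a=1; b=2' (the value after 'Cookie:') into a dict."""
    cookies = {}
    for part in cookie_header.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep and name:  # skip empty or malformed fragments
            cookies[name] = value
    return cookies


def build_authenticated_request(url, cookie_header, user_agent="Mozilla/5.0"):
    """Build a GET request that carries the browser's cookies."""
    cookies = parse_cookie_header(cookie_header)
    header_value = "; ".join(f"{k}={v}" for k, v in cookies.items())
    return urllib.request.Request(
        url,
        headers={"Cookie": header_value, "User-Agent": user_agent},
    )


def fetch(request):
    """Send the request and return the decoded page (network I/O)."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")
```

The HTML returned by `fetch` can then be handed to `BeautifulSoup(html, 'html.parser')`. With requests, the same dict could instead be loaded into a session with `session.cookies.update(parse_cookie_header(...))`.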

 

 

 

What do you suggest here?

I look forward to hearing from you.

Edited by tarifa

2 answers to this question

tarifa

I want to log in to the WordPress support forums: https://login.wordpress.org/?locale=en_US

cf: <a class="ab-item" href="https://login.wordpress.org/?locale=en_US">Log In</a>

Page source of https://login.wordpress.org/?locale=en_US:
 

 

<body class="wp-core-ui login js route-root">
<script type="text/javascript">document.body.className = document.body.className.replace('no-js','js');</script>
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-P24PF4B" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
    
<div id="login">
    <h1><a href="https://wordpress.org/" title="WordPress.org" tabindex="-1">WordPress.org Login</a></h1>
<p class="intro">Log in to your WordPress.org account to contribute to WordPress, get help in the support forum, or rate and review themes and plugins.</p>


        <form name="loginform" id="loginform" action="https://login.wordpress.org/wp-login.php" method="post" data-submit-ready="true">
            
            <p class="login-username">
                <label for="user_login">Username or Email Address</label>
                <input type="text" name="log" id="user_login" class="input" value="" size="20">
            </p>
            <p class="login-password">
                <label for="user_pass">Password</label>
                <input type="password" name="pwd" id="user_pass" class="input" value="" size="20">
            </p>
            
            <p class="login-remember"><label><input name="rememberme" type="checkbox" id="rememberme" value="forever"> Remember Me</label></p>
            <p class="login-submit">
                <input type="submit" name="wp-submit" id="wp-submit" class="button button-primary" value="Log In">
                <input type="hidden" name="redirect_to" value="https://wordpress.org/support/plugin/wp-job-manager/">
            </p>
            
        <input type="hidden" name="_reCaptcha_v3_token" value="03AGdBq25itmMwr7dEGxc4MkXQ5bm55D9x2OHMwxe7r5Vn8L7Mjwi4l4WC3MdBJ86HKzKf3x33be1BsN3ZlnCWEXJaPXLhbIxQk2SUhpidOwIqU0eNK-dWYqFvNfFdherkBIJvvem8j7P6gdO7Z-A11vd8JUrcgPi16N2ZQXo2fCIP8gDxxlm-Uc81-wq9e2a_ovTPFz3V85-vQL0mDrLc_pdWUvNOW2HAmgbIz01TzGxanypi9ouSxdexqttMipcXO1_VxZpdsaRgOfGUHs7v79xctNQn396J9eeL7sktFQzq-2rLofxqGoR6b1NGJh9uO_By6dnfsuNAPE99PaMaL9T8H_8PvhdBxpUlJBg8wITG7_cKNhHB1zqZFFVVSsdXwLmN8Xiz-CBWA9BgL1Nk0QeXeTtTA0i14d903JYEoha3ZDTpIZKLBZR2mTYofxK76eETgTLUqO2L"></form>
<p id="nav">
    <a href="https://login.wordpress.org/lostpassword" title="Password Lost and Found">Lost password?</a> &nbsp; • &nbsp;
    <a href="https://login.wordpress.org/register" title="Create an account">Create an account</a>
</p>

<script type="text/javascript">
setTimeout( function() {
    try {
        d = document.getElementById( 'user_login' );
        d.focus();
        d.select();
    } catch( e ){}
}, 200 );
</script>


    </div>

    <div class="language-switcher">
        <form id="language-switcher" action="" method="GET">
                            <input type="hidden" name="redirect_to" value="https://wordpress.org/support/plugin/wp-job-manager/">
                        <label for="language-switcher-locales">
                <span aria-hidden="true" class="dashicons dashicons-translation"></span>
                <span class="screen-reader-text">Select the language:</span>
            </label>
            <select id="language-switcher-locales" name="locale">
                <option value="fa_AF">(فارسی (افغانستان</option><option value="gax">Afaan Oromoo</option><option value="af">Afrikaans</option><option value="so_SO">Afsoomaali</option><option value="arg">Aragonés</option><option value="frp">Arpitan</option><option value="ast">Asturianu</option><option value="ibo">Asụsụ Igbo</option><option value="az_TR">Azərbaycan Türkcəsi</option><option value="az">Azərbaycan dili</option><option value="id_ID">Bahasa Indonesia</option><option value="ms_MY">Bahasa Melayu</option><option value="jv_ID">Basa Jawa</option><option value="su_ID">Basa Sunda</option><option value="bs_BA">Bosanski</option><option value="bre">Brezhoneg</option><option value="ca">Català</option><option value="bal">Català (Balear)</option><option value="ceb">Cebuano</option><option value="sna">ChiShona</option><option value="pcd">Ch’ti</option><option value="co">Corsu</option><option value="me_ME">Crnogorski jezik</option><option value="cy">Cymraeg</option><option value="da_DK">Dansk</option><option value="de_DE">Deutsch</option><option value="de_CH">Deutsch (Schweiz)</option><option value="de_CH_informal">Deutsch (Schweiz, Du)</option><option value="de_DE_formal">Deutsch (Sie)</option><option value="de_AT">Deutsch (Österreich)</option><option value="dsb">Dolnoserbšćina</option><option value="et">Eesti</option><option value="en_US" selected="selected">English</option><option value="en_AU">English (Australia)</option><option value="en_CA">English (Canada)</option><option value="en_NZ">English (New Zealand)</option><option value="art_xpirate">English (Pirate)</option><option value="en_ZA">English (South Africa)</option><option value="en_GB">English (UK)</option><option value="es_ES">Español</option><option value="es_AR">Español de Argentina</option><option value="es_CL">Español de Chile</option><option value="es_CO">Español de Colombia</option><option value="es_CR">Español de Costa Rica</option><option value="es_GT">Español de Guatemala</option><option 
value="es_HN">Español de Honduras</option><option value="es_MX">Español de México</option><option value="es_PE">Español de Perú</option><option value="es_PR">Español de Puerto Rico</option><option value="es_DO">Español de República Dominicana</option><option value="es_UY">Español de Uruguay</option><option value="es_VE">Español de Venezuela</option><option value="eo">Esperanto</option><option value="eu">Euskara</option><option value="ewe">Eʋegbe</option><option value="fr_FR">Français</option><option value="fr_BE">Français de Belgique</option><option value="fr_CA">Français du Canada</option><option value="fur">Friulian</option><option value="fy">Frysk</option><option value="fo">Føroyskt</option><option value="ga">Gaelige</option><option value="gl_ES">Galego</option><option value="gd">Gàidhlig</option><option value="hau">Harshen Hausa</option><option value="hsb">Hornjoserbšćina</option><option value="hr">Hrvatski</option><option value="ido">Ido</option><option value="kin">Ikinyarwanda</option><option value="it_IT">Italiano</option><option value="kal">Kalaallisut</option><option value="cor">Kernewek</option><option value="sw">Kiswahili</option><option value="mfe">Kreol Morisien</option><option value="hat">Kreyol ayisyen</option><option value="kmr">Kurdî</option><option value="lv">Latviešu valoda</option><option value="lt_LT">Lietuvių kalba</option><option value="li">Limburgs</option><option value="lmo">Lombardo</option><option value="lb_LU">Lëtzebuergesch</option><option value="lij">Lìgure</option><option value="hu_HU">Magyar</option><option value="mg_MG">Malagasy</option><option value="mlt">Malti</option><option value="nl_NL">Nederlands</option><option value="nl_BE">Nederlands (België)</option><option value="nl_NL_formal">Nederlands (Formeel)</option><option value="lin">Ngala</option><option value="pcm">Nigerian Pidgin</option><option value="nb_NO">Norsk bokmål</option><option value="nn_NO">Norsk nynorsk</option><option value="oci">Occitan</option><option 
value="lug">Oluganda</option><option value="uz_UZ">O‘zbekcha</option><option value="pap_AW">Papiamento</option><option value="pap_CW">Papiamentu</option><option value="pl_PL">Polski</option><option value="pt_PT">Português</option><option value="pt_PT_ao90">Português (AO90)</option><option value="pt_AO">Português de Angola</option><option value="pt_BR">Português do Brasil</option><option value="fuc">Pulaar</option><option value="sq_XK">Për Kosovën Shqip</option><option value="kaa">Qaraqalpaq tili</option><option value="tah">Reo Tahiti</option><option value="ro_RO">Română</option><option value="roh">Rumantsch</option><option value="rhg">Ruáinga</option><option value="srd">Sardu</option><option value="sq">Shqip</option><option value="ssw">SiSwati</option><option value="scn">Sicilianu</option><option value="sk_SK">Slovenčina</option><option value="sl_SI">Slovenščina</option><option value="fi">Suomi</option><option value="sv_SE">Svenska</option><option value="syr">Syriac</option><option value="tl">Tagalog</option><option value="kab">Taqbaylit</option><option value="mri">Te Reo Māori</option><option value="vi">Tiếng Việt</option><option value="twd">Twents</option><option value="tuk">Türkmençe</option><option value="tr_TR">Türkçe</option><option value="wol">Wolof</option><option value="yor">Yorùbá</option><option value="xho">isiXhosa</option><option value="zul">isiZulu</option><option value="is_IS">Íslenska</option><option value="cs_CZ">Čeština</option><option value="szl">Ślōnskŏ gŏdka</option><option value="el">Ελληνικά</option><option value="bel">Беларуская мова</option><option value="bg_BG">Български</option><option value="os">Ирон</option><option value="kir">Кыргызча</option><option value="mk_MK">Македонски јазик</option><option value="mn">Монгол</option><option value="ru_RU">Русский</option><option value="sah">Сахалыы</option><option value="sr_RS">Српски језик</option><option value="tt_RU">Татар теле</option><option value="tg">Тоҷикӣ</option><option 
value="uk">Українська</option><option value="kk">Қазақ тілі</option><option value="hy">Հայերեն</option><option value="he_IL">עִבְרִית</option><option value="ug_CN">ئۇيغۇرچە</option><option value="ur">اردو</option><option value="arq">الدارجة الجزايرية</option><option value="ar">العربية</option><option value="ary">العربية المغربية</option><option value="bcc">بلوچی مکرانی</option><option value="skr">سرائیکی</option><option value="snd">سنڌي</option><option value="fa_IR">فارسی</option><option value="ckb">كوردی‎</option><option value="haz">هزاره گی</option><option value="ps">پښتو</option><option value="azb">گؤنئی آذربایجان</option><option value="dv">ދިވެހި</option><option value="nqo">ߒߞߏ</option><option value="ne_NP">नेपाली</option><option value="brx">बोडो‎</option><option value="sa_IN">भारतम्</option><option value="bho">भोजपुरी</option><option value="mr">मराठी</option><option value="mai">मैथिली</option><option value="hi_IN">हिन्दी</option><option value="as">অসমীয়া</option><option value="bn_BD">বাংলা</option><option value="bn_IN">বাংলা (ভারত)</option><option value="pa_IN">ਪੰਜਾਬੀ</option><option value="gu">ગુજરાતી</option><option value="ory">ଓଡ଼ିଆ</option><option value="ta_IN">தமிழ்</option><option value="ta_LK">தமிழ்</option><option value="te">తెలుగు</option><option value="kn">ಕನ್ನಡ</option><option value="ml_IN">മലയാളം</option><option value="si_LK">සිංහල</option><option value="th">ไทย</option><option value="lo">ພາສາລາວ</option><option value="bo">བོད་ཡིག</option><option value="dzo">རྫོང་ཁ</option><option value="my_MM">ဗမာစာ</option><option value="ka_GE">ქართული</option><option value="tir">ትግርኛ</option><option value="am">አማርኛ</option><option value="km">ភាសាខ្មែរ</option><option value="tzm">ⵜⴰⵎⴰⵣⵉⵖⵜ</option><option value="zh_SG">中文</option><option value="ja">日本語</option><option value="zh_CN">简体中文</option><option value="zh_TW">繁體中文</option><option value="zh_HK">香港中文版    </option><option value="ko_KR">한국어</option><option value="art_xemoji">??? 
(Emoji)</option>            </select>
        </form>
    </div>
    <script>
        var switcherForm  = document.getElementById( 'language-switcher' );
        var localesSelect = document.getElementById( 'language-switcher-locales' );
        localesSelect.addEventListener( 'change', function() {
            switcherForm.submit()
        } );
    </script>

<div><div class="grecaptcha-badge" data-style="bottomright" style="width: 256px; height: 60px; display: block; transition: right 0.3s ease 0s; position: fixed; bottom: 14px; right: -186px; box-shadow: gray 0px 0px 5px; border-radius: 2px; overflow: hidden;"><div class="grecaptcha-logo"><iframe src="https://www.google.com/recaptcha/api2/anchor?ar=1&amp;k=6LckXrgUAAAAANrzcMN7iy_WxvmMcseaaRW-YFts&amp;co=aHR0cHM6Ly9sb2dpbi53b3JkcHJlc3Mub3JnOjQ0Mw..&amp;hl=de&amp;v=oqtdXEs9TE9ZUAIhXNz5JBt_&amp;size=invisible&amp;cb=hj49xha7yz3z" width="256" height="60" role="presentation" name="a-1ctg1295s9ze" frameborder="0" scrolling="no" sandbox="allow-forms allow-popups allow-same-origin allow-scripts allow-top-navigation allow-modals allow-popups-to-escape-sandbox"></iframe></div><div class="grecaptcha-error"></div><textarea id="g-recaptcha-response-100000" name="g-recaptcha-response" class="g-recaptcha-response" style="width: 250px; height: 40px; border: 1px solid rgb(193, 193, 193); margin: 10px 25px; padding: 0px; resize: none; display: none;"></textarea></div><iframe style="display: none;"></iframe></div></body>

 

Well, I will apply these findings to the code you gave.
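
The page source above shows that the login form posts to `https://login.wordpress.org/wp-login.php` with the fields `log`, `pwd`, `rememberme`, `wp-submit`, `redirect_to` and `_reCaptcha_v3_token`. A minimal sketch of collecting such fields programmatically follows. It uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra packages; the class name `FormFieldCollector`, the helper `build_login_payload` and the shortened sample HTML are assumptions for illustration. Note that the reCAPTCHA token is filled in by JavaScript in the browser, so a plain HTTP client will probably not have a valid token and the login may be rejected.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect the action URL and <input> name/value pairs of one form."""

    def __init__(self, form_id):
        super().__init__()
        self.form_id = form_id
        self.action = None
        self.fields = {}
        self._inside = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("id") == self.form_id:
            self._inside = True
            self.action = attrs.get("action")
        elif tag == "input" and self._inside and attrs.get("name"):
            # Unchecked checkboxes are not submitted by browsers; skip them.
            if attrs.get("type") == "checkbox" and "checked" not in attrs:
                return
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._inside = False


def build_login_payload(html, username, password, form_id="loginform"):
    """Return (action_url, payload) with credentials filled into the form."""
    collector = FormFieldCollector(form_id)
    collector.feed(html)
    payload = dict(collector.fields)
    payload["log"] = username  # field names taken from the page source
    payload["pwd"] = password
    return collector.action, payload


# Shortened copy of the form from the page source, for demonstration only.
SAMPLE_HTML = """
<form name="loginform" id="loginform"
      action="https://login.wordpress.org/wp-login.php" method="post">
  <input type="text" name="log" id="user_login" value="">
  <input type="password" name="pwd" id="user_pass" value="">
  <input name="rememberme" type="checkbox" id="rememberme" value="forever">
  <input type="submit" name="wp-submit" id="wp-submit" value="Log In">
  <input type="hidden" name="redirect_to"
         value="https://wordpress.org/support/plugin/wp-job-manager/">
</form>
<form id="language-switcher" action="" method="GET">
  <input type="hidden" name="redirect_to" value="ignored">
</form>
"""

action, payload = build_login_payload(SAMPLE_HTML, "your_username", "your_password")
print(action)
print(payload)
```

The resulting payload could then be sent with `session.post(action, data=payload)` on a `requests.Session`, so that any cookies set by the login are kept for later requests.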

 

 

 


This would also be an option:

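import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}

# Form fields of the python-forum.io login form; other sites use other names.
payload = {
    "username": "your_username",
    "password": "xxxxxxx",
    "remember": "yes",
    "submit": "Login",
    "action": "do_login",
}

with requests.Session() as s:
    # Send the credentials in the request body (data=), not in the URL (params=).
    s.post('https://python-forum.io/member.php?action=login', headers=headers, data=payload)
    # If the login succeeded, the session cookies are saved for future requests.
    # NOTE: those cookies belong to python-forum.io; the request below goes to a
    # different site, so it is not authenticated by the login above.
    response = s.get('https://login.wordpress.org/?locale=en_US')
    soup = BeautifulSoup(response.content, 'lxml')  # needs the lxml package
    welcome = soup.find('span', class_="welcome")
    print(welcome.text if welcome else 'no "welcome" element found')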


# Login to a self-hosted WordPress site; "ip" is a placeholder for the host.
wp_login = 'http://ip/wordpress/wp-login.php'
wp_admin = 'http://ip/wordpress/wp-admin/'
username = 'admin'
password = 'admin'

with requests.Session() as s:
    # WordPress expects this test cookie when 'testcookie' is posted
    headers1 = {'Cookie': 'wordpress_test_cookie=WP Cookie check'}
    datas = {
        'log': username, 'pwd': password, 'wp-submit': 'Log In',
        'redirect_to': wp_admin, 'testcookie': '1'
    }
    s.post(wp_login, headers=headers1, data=datas)
    resp = s.get(wp_admin)  # the session sends the login cookies automatically
    print(resp.text)
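
Printing `resp.text` does not by itself show whether the login worked. A minimal sketch of a check follows, assuming a default WordPress setup in which a successful login sets a cookie whose name starts with `wordpress_logged_in_`. The helper name `looks_logged_in` and the sample cookie names are hypothetical, and a site with customized cookie names would need a different check.

```python
def looks_logged_in(cookie_names, prefix="wordpress_logged_in_"):
    """Return True if any cookie name starts with the WordPress login prefix."""
    return any(name.startswith(prefix) for name in cookie_names)


# Cookie names as they might appear after a failed and a successful login.
# The hash suffix is a made-up placeholder.
failed = ["wordpress_test_cookie"]
succeeded = ["wordpress_test_cookie", "wordpress_logged_in_0123abcd"]

print(looks_logged_in(failed))     # False
print(looks_logged_in(succeeded))  # True
```

With the session from the snippet above, the cookie names can be collected with `[c.name for c in s.cookies]` right after the `s.post(...)` call.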

What do you say?

Jim K

Huh?
