Link Scraping with requests, bs4. Getting Warning: unresponsive script
The issue might be either long loading times of a site, or a cycle in your website links' graph - i.e. page1 (Main Page) links to page2 (Terms of Service), which in turn links back to page1. You could time how long it takes to get a response from each website. Regarding your last question: I'm pretty sure requests doesn't parse your response's content (except for the .json() method). What you might be experiencing is a link to a large resource, like <a href="http://www.example.com/very_big_file.exe">Free Cookies!</a>, which your script would visit. requests has mechanics to counter such cases; see the documentation for reference. Moreover, the same technique lets you check the Content-Type header to make sure you're only downloading pages you're interested in.
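
A minimal sketch of both ideas, assuming the requests library and a hypothetical URL (not the original poster's code): stream the response so large files are not downloaded up front, time the request, and only read the body when the Content-Type looks like an HTML page.

import requests

url = "http://www.example.com/some_page"  # hypothetical URL

# stream=True defers downloading the body until it is explicitly read
r = requests.get(url, stream=True, timeout=10)
print("time to response headers:", r.elapsed.total_seconds(), "seconds")

content_type = r.headers.get("Content-Type", "")
if "text/html" in content_type:
    html = r.text   # safe to read: it is a page, not a big binary
else:
    r.close()       # skip resources like very_big_file.exe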

Categories : Python

Browser Simulation and Scraping with windmill or selenium, how many http requests?
Selenium uses the browser, but the number of HTTP requests is not one. There will be multiple HTTP requests to the server for the JS, CSS and images (if any) referenced in the HTML document. If you want to scrape the page with a single HTTP request, you need a scraper that only fetches what is present in the HTML source. If you are using Python, check out BeautifulSoup.
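
For comparison, a minimal single-request sketch with requests and BeautifulSoup (using a hypothetical URL, not the asker's site): one GET fetches only the HTML source, and no JS, CSS or images are requested.

import requests
from bs4 import BeautifulSoup

response = requests.get("http://www.example.com")   # hypothetical URL
soup = BeautifulSoup(response.text, "html.parser")

# work with whatever is present in the static markup
for link in soup.find_all("a"):
    print(link.get("href"))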

Categories : Python

How to insert a HTML element in a tree of lxml.html
Use the .insert/.append methods.

import lxml.html

def add_css_code(webpageString, linkString):
    root = lxml.html.fromstring(webpageString)
    link = lxml.html.fromstring(linkString).find('.//link')
    head = root.find('.//head')
    title = head.find('title')
    if title is None:
        where = 0
    else:
        where = head.index(title) + 1
    head.insert(where, link)
    return lxml.html.tostring(root)

webpageString1 = "<html><head><title>test</title></head><body>webpage content</body></html>"
webpageString2 = "<html><head></head><body>webpage content</body></html>"
linkString = "<link rel='stylesheet' type='text/css'>"

print(add_css_code(webpageString1, linkString))
print(add_css_code(webpageString2, linkString))

Categories : Python

Cx_freeze with lxml.html TypeError
The error you are getting indicates that module.parent.path is returning NoneType. You probably need to make sure that lxml is in your PYTHONPATH.
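
A quick way to confirm where (and whether) the interpreter used by Apache/cx_Freeze can see lxml, offered as a hedged diagnostic rather than a fix:

import sys
print(sys.executable)   # which Python is actually running

import lxml
print(lxml.__file__)    # where lxml was imported from; an ImportError means it is not on the path

If the import fails, add the directory containing lxml to the PYTHONPATH environment variable for that interpreter.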

Categories : Python

Parsing HTML with lxml (python)
import lxml.html

# lxml can download pages directly
root = lxml.html.parse('http://test.xyz').getroot()

# use a CSS selector for class="main",
# or use root.xpath('//table[@class="main"]')
tables = root.cssselect('table.main')

# extract HTML content from all tables
# use lxml.html.tostring(t, method="text", encoding=unicode)
# to get text content without tags
" ".join([lxml.html.tostring(t) for t in tables])

# removing only specific empty tags, here <b></b> and <i></i>
for empty in root.xpath('//*[self::b or self::i][not(node())]'):
    empty.getparent().remove(empty)

# removing all empty tags (tags that do not have children nodes)
for empty in root.xpath('//*[not(node())]'):
    empty.getparent().remove(empty)

# root does not contain those empty tags anymore

Categories : Python

lxml: Converting XML to HTML through XSLT and get HtmlElements
>>> import lxml.etree
>>> import lxml.html
>>>
>>> xmlstring = '''
... <?xml version='1.0' encoding='ASCII'?>
... <root><a class="here">link1</a><a class="there">link2</a></root>
... '''
>>> root = lxml.etree.fromstring(xmlstring)
>>> root.cssselect('a.here')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'lxml.etree._Element' object has no attribute 'cssselect'

Round-trip through lxml.etree.tostring(root) -> lxml.html.fromstring(...):

>>> root = lxml.html.fromstring(lxml.etree.tostring(root))
>>> root.cssselect('a.here')
[<Element a at 0x2989308>]

Get XML output:

>>> print lxml.etree.tostring(root, xml_declaration=True)

Categories : Python

How to remove insignificant whitespace in lxml.html?
OK. You would like to detect runs of whitespace and strip away the excess. You can do it with a regexp:

from re import sub
sub(r"\s+", ' ', yourstring)

It replaces every run of adjacent whitespace characters with a single space. '<p> Hello World </p>' was my result with this. I suppose it's close enough to your expectations, and a lone space is always better for readability than none. With a slightly longer regular expression, you should manage to strip the whitespace adjacent to HTML tags as well.
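
For example, a hedged sketch of that longer expression (an illustration, not the answerer's exact regex): it first collapses runs of whitespace, then drops whitespace that directly touches a tag.

from re import sub

cleaned = sub(r"\s+", " ", yourstring)              # collapse runs of whitespace
cleaned = sub(r"\s*(<[^>]+>)\s*", r"\1", cleaned)   # strip whitespace adjacent to tags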

Categories : Python

upper case html tags encoded in lxml
The HTML parser converts all tag names to lower case. This is why xpath('//TR') returns an empty list. I'm not able to reproduce the second problem, where the upper case tags get printed incorrectly. Can you modify the code below to demonstrate the problem?

import lxml.etree as ET

content = '''
<TR>
  <TR vAlign="top" align="left">
    <!--<TD><B onmouseover="tips.Display('Metadata_WEB', event)" onmouseout="tips.Hide('Metadata_WEB')">Meta Data:</B></TD>-->
    <TD></TD>
  </TR>
</TR>'''

root = ET.HTML(content)
print(root.xpath('//TR'))         # []
print(root.xpath('//tr/node()'))  # [<!--<TD><B onmouseover="tips.Display('Metadata_WEB', event)" onmouseout="tips.Hide('Metadata_WEB')">Meta Data:</B></TD>-->, ...]

Categories : Python

How can I preserve <br> as newlines with lxml.html text_content() or equivalent?
Prepending a newline character to the tail of each <br/> element should give the result you're expecting:

>>> import lxml.html as html
>>> fragment = '<div>This is a text node.<br/>This is another text node.<br/><br/><span>And a child element.</span><span>Another child,<br> with two text nodes</span></div>'
>>> doc = html.document_fromstring(fragment)
>>> for br in doc.xpath("*//br"):
...     br.tail = "\n" + br.tail if br.tail else "\n"
...
>>> doc.text_content()
'This is a text node.\nThis is another text node.\n\nAnd a child element.Another child,\n with two text nodes'

Categories : Python

HTML Scraping with Javascript
Your question is kinda vague. I think there are two ways to get this done: 1. apply a RegExp to match patterns, or 2. load the HTML into a DOM simulator and walk the tree to find the data (I assume you are using Node.js).

Categories : Javascript

Scraping Js file with Simple HTML DOM
SimpleHTMLDom, as the name indicates, is for parsing HTML, not JavaScript. If you want to parse JavaScript, then why not simply use file_get_contents()?

$js = file_get_contents('http://www.example.com/file.js');
// parse it!

There's also UglifyJS (originally based on parse-js). You may want to check that out as well.

Categories : PHP

Scraping html and displaying it on another site
You have all the code you need right there. Just replace your "parse" function with a jQuery wrapper and select the section of the page you want. Keep in mind, this will only work if you're using the same stylesheet on both pages. If not, you'll have to pull in a copy of the styles as well.

function fetchPage(url) {
    $.ajax({
        type: "GET",
        url: url,
        error: function(request, status) {
            alert('Error fetching ' + url);
        },
        success: function(data) {
            $(data.responseText).find('#yourHeader').prependTo(document.body);
        }
    });
}

Categories : Javascript

Website Login And Scraping HTML
You are adding cookies to the request from the first response:

request.CookieContainer.Add(httpResponse.Cookies);

The cookies in that response are probably null. To work around this, read the cookie values from the response header and add them to the next request like this:

string response_header_cookies = response.Headers.Get("Set-Cookie");
req.Headers.Add("Cookie", response_header_cookies);

In most cases this is the more effective way. Hope this helps! Source: MSDN

Categories : C#

Scraping HTML content, preg_match not working
A good way to do that is to use the DOM combined with XPath, as Prix wrote. If you want to check that the link you are looking for is a child element of an item from an unordered list with the class "pagination", and that the item is the one right after the "active page" item, the query will be a little complicated:

$doc = new DOMDocument();
@$doc->loadHTMLFile($url);
$xpath = new DOMXPath($doc);
$xquery = '//ul[@class="pagination"]'                     // ul with the "pagination" class
        . '/li[descendant::span[@class="page active"]]'   // li that contains a span with the "page active" class
        . '/following-sibling::*[1]'                      // next sibling (the next li)
        . '/a/@href';                                     // href attribute of its a tag
$links = $xpath->query($xquery);

Categories : PHP

Scraping HTML with JSoup, getting HTTP error, status 456
If you include the user agent it will work. From the documentation:

Document doc = Jsoup.connect("http://example.com")
        .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
        .get();

Categories : Java

Django Tastypie with Apache LXML and DefusedXML: Error: "Usage of the xml aspects requires lxml and defusedxml"
OK, so I found the answer to my not so well worded question. When I ran easy_install, it never told me that lxml had actually failed to install because I was missing a compiler. I have no idea why it worked from the Django development server and not through Apache, but I found and installed a binary distribution of lxml and it all started working the way it should.

Categories : Python

Scraping oldschool HTML - Dont think I can use XPath/Dom, and rusty at regexps
This pattern works:

preg_match('~<h2>Description</h2>\s*<p>\K(?>[^<]++|<++(?!/p>))+~', $html, $rawdesc);
print_r($rawdesc);

Yours works too if you add a ? after the +.

Categories : HTML

Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing?
The Problem: DOM Requires <tbody/> Tags

Firebug, Chrome's Developer Tools, XPath functions in JavaScript and others work on the DOM, not the basic HTML source code. The DOM for HTML requires that all table rows not contained in a table header or footer (<thead/>, <tfoot/>) are included in table body tags <tbody/>. Thus, browsers add this tag if it's missing while parsing (X)HTML. For example, Microsoft's DOM documentation says: "The tbody element is exposed for all tables, even if the table does not explicitly define a tbody element." There is an in-depth explanation in another answer on Stack Overflow. On the other hand, the HTML source does not necessarily require that tag to be used: "The TBODY start tag is always required except when the table contains only one table body and no table head or foot sections."
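
As a small hedged illustration (assuming lxml and a made-up table id), one way to keep a query working in both worlds is to use a descendant step instead of an explicit tbody step, so it matches whether or not the parser inserted <tbody>:

import lxml.html

doc = lxml.html.fromstring("""
<table id="prices">
  <tr><td>first row</td></tr>
  <tr><td>second row</td></tr>
</table>
""")

# //table[@id="prices"]//tr matches tr elements directly under the table
# as well as tr elements nested inside an inserted tbody.
rows = doc.xpath('//table[@id="prices"]//tr')
print([row.text_content().strip() for row in rows])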

Categories : HTML

Web scraping/ Screen scraping
There are other solutions but Yodlee has been the longest running provider for aggregation of financial data through both direct data access feeds from top banks as well as screenscrape solutions to cover smaller FI's.

Categories : HTML

HTTP 501 error using Python Requests
I looked through the site and it appears that you are using the GET HTTP method to retrieve the data when what you actually need is a POST. Typically an HTTP 501 is sent back to the client when the web server does not understand the HTTP verb sent by the client within the request. Try changing the code:

r = requests.get('https://venta.renfe.com/vol/inicioCompra.do', data=payload, cookies=cookies, headers=headers)

to something like:

r = requests.post('https://venta.renfe.com/vol/inicioCompra.do', data=payload, cookies=cookies, headers=headers)

Note: I have not used Requests, so you may want to double-check the function call parameters; for a quick reference see the Requests documentation. Hope this helps.

Categories : Python

Sending 2 requests using 1 html page
Try to remove parent. and change document.getElementByID to document.getElementById:

document.getElementById("frame2").src = "http://www.MyWebsite.com";

Edit: Also make sure you add quotes to the onload attribute:

<iframe id="frame1" src="CustomPageCreated.html" width="300" height="500" frameborder="1" scrolling="auto" onload="changeone()"></iframe>

Categories : HTML

JSON Proxy: Perform cross-domain web service requests from a single-page HTML app as ubiquitously as possible?
In case anyone searches for this, I ended up offering my app with the following workarounds:

- A small node.js proxy: approx 10 lines of code gives me a standalone proxy to get around CORS.
- A Chrome extension: extensions can make CORS requests.
- A Flash object (not tried): Flash objects can make CORS requests, or so I read.
- A Firefox extension: same as the Chrome situation.
- Instructions on how to have the remote services enable CORS.

Categories : Json

How to limit download rate of HTTP requests in requests python library?
There are several approaches to rate limiting; one of them is token bucket, for which you can find a recipe here and another one here. Usually you would want to do throttling or rate limiting on socket.send() and socket.recv(). You could play with socket-throttle and see if it does what you need. This is not to be confused with x-ratelimit rate limiting response headers, which are related to a number of requests rather than a download / transfer rate.
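
A minimal token-bucket sketch (an assumption-laden illustration of the idea, not the linked recipes), throttling how fast the body of a streamed requests response is consumed:

import time
import requests

class TokenBucket:
    # Allows roughly `rate` bytes per second, with bursts up to `capacity`.
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def consume(self, amount):
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= amount:
                self.tokens -= amount
                return
            time.sleep((amount - self.tokens) / self.rate)

bucket = TokenBucket(rate=50000, capacity=50000)   # ~50 KB/s, hypothetical limit
r = requests.get("http://www.example.com/big_file", stream=True)   # hypothetical URL
for chunk in r.iter_content(chunk_size=8192):
    bucket.consume(len(chunk))
    # process the chunk here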

Categories : Python

python-requests returning unicode Exception message (or how to set requests locale)
You can try os.strerror, but it would probably return nothing or the same non-English string. This hard-coded English mapping was scraped from here: http://support.microsoft.com/kb/819124

ENGLISH_WINDOWS_SOCKET_MESSAGES = {
    10004: "Interrupted function call.",
    10013: "Permission denied.",
    10014: "Bad address.",
    10022: "Invalid argument.",
    10024: "Too many open files.",
    10035: "Resource temporarily unavailable.",
    10036: "Operation now in progress.",
    10037: "Operation already in progress.",
    10038: "Socket operation on nonsocket.",
    10039: "Destination address required.",
    10040: "Message too long.",
    10041: "Protocol wrong type for socket.",
    10042: "Bad protocol option.",
    10043: "Protocol not supported.",
    10044: "Socket type not supported.",
    # ...
}

Categories : Python

Apache Camel - aggregator to space out requests, but not queuing requests
I believe you would be interested in seeing the Throttler pattern, which is documented here http://camel.apache.org/throttler.html Hope this helps :) EDIT - If you are wanting to eliminate excess requests, you can also investigate setting a TTL (Time to live) header within JMS and add a concurrent consumer of 1 to your route, which means any excess messages will also be discarded.

Categories : Apache

.htaccess rewrite to add .php to no extension requests, BUT to not accept .php requests by users
I am not a .htaccess guru yet, but I hope this helps:

RewriteCond %{THE_REQUEST} \.php
RewriteRule \.php$ - [R=404]

RewriteCond %{REQUEST_URI} !(\.php)
RewriteCond %{REQUEST_URI} !(^/error/.*$)
RewriteRule ^(.*)$ $1.php [L]

Categories : Apache

Why doesn't requests.get() return? What is the default timeout that requests.get() uses?
What is the default timeout that get uses? I believe the timeout is None, which means it'll wait (hang) until the connection is closed. What happens when you pass in a timeout value?

r = requests.get(
    'http://www.justdial.com',
    proxies={'http': '222.255.169.74:8080'},
    timeout=5
)

Categories : Python

Consecutive requests with python Requests.Session() not working
In the latest version of requests, the Session object has cookie persistence; see the requests Session objects docs. So you don't need to add the cookie artificially. Just:

import requests

s = requests.Session()
login_data = dict(userName='user', password='pwd')
ra = s.post('http://example/checklogin.php', data=login_data)
print ra.content
print ra.headers

ans = dict(answer='5')
r = s.post('http://example/level1.php', data=ans)
print r.content

Just print the cookies to check whether you were logged in:

for cookie in s.cookies:
    print (cookie.name, cookie.value)

And is the example site yours? If not, maybe the site rejects bots/crawlers. You can also change your requests' User-Agent so it looks like you are using a browser, for example:

import requests

s = requests.Session()
s.headers.update({'User-Agent': 'Mozilla/5.0'})

Categories : Python

How to redirect all HTTPS requests to HTTP requests?
In the Global.asax file try something like:

protected void Application_BeginRequest()
{
    if (Context.Request.IsSecureConnection)
        Response.Redirect(Context.Request.Url.ToString().Replace("https:", "http:"));
}

Categories : Asp Net Mvc

lxml converts "<" to &lt;. Why?
You are inserting text into an XML element. Text will always be escaped to be XML-safe. If you wanted to add a new tag, create a new Element; the ElementTree.SubElement() factory is easiest:

from lxml import etree

etree.SubElement(td, 'span').text = 'one tag'

If you wanted to wrap the contents of the td, simply move all the elements over (plus the .text attribute):

def wrap(parent, tagname, **kw):
    sub = etree.SubElement(parent, tagname, **kw)
    parent.text, sub.text = None, parent.text
    for index, child in enumerate(parent.iterchildren()):
        if child is not sub:
            sub.insert(index, child)
    return parent

wrap(td, 'span')

Demo:

>>> etree.tostring(doc.findall('.//td')[2])
'<td> <a href="http://korv.com/apa.tar.gz">3.4</a> </td>'

Categories : Python

LXML: remove the x
In a perfect world you would use an XML parsing or HTML scraping library to parse your HTML, to make sure you have the exact tags that you need, in context. It is almost certainly simpler, quicker and good enough in this case to simply use a regular expression to match what you need.

>>> import re
>>> samp = """<sitemap xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
... <loc>http://www.some_page.com/sitemap-page-2010-11.xml</loc>
... <lastmod>2011-12-22T15:46:17+00:00</lastmod>
... </sitemap>"""
>>> re.findall(r'<loc>(.*)</loc>', samp)
['http://www.some_page.com/sitemap-page-2010-11.xml']

Categories : Python

Install lxml on mac 10.8.3
I figured it out. Install the xcode command line tools from here http://osxdaily.com/2012/07/06/install-gcc-without-xcode-in-mac-os-x/. Then open terminal and type sudo pip install lxml. Hope this helps.

Categories : Python

How to re-install lxml?
I am using BeautifulSoup 4.3.2 and OS X 10.6.8. I also had a problem with an improperly installed lxml. Here are some things that I found out:

First of all, check this related question: Removed MacPorts, now Python is broken.

Now, in order to check which builders for BeautifulSoup 4 are installed, try:

>>> import bs4
>>> bs4.builder.builder_registry.builders

If you don't see your favorite builder, then it is not installed, and you will see an error as above ("Couldn't find a tree builder..."). Also, just because you can import lxml doesn't mean that everything is perfect. Try:

>>> import lxml
>>> import lxml.etree

To understand what's going on, go to the bs4 installation, open the egg (tar -xvzf) and look at the bs4.builder modules. Inside it you should see which builders are actually available.

Categories : Python

Parsing with lxml xpath
I find CSS selectors more adaptive to page changes than XPath expressions:

import urllib
from lxml import html

url = 'https://www.theice.com/productguide/ProductSpec.shtml?specId=251'
response = urllib.urlopen(url).read()
h = html.document_fromstring(response)

for tr in h.cssselect('#tradingHours tbody tr'):
    td = tr.cssselect('td')
    print td[0].text_content(), td[1].text_content()

Categories : Python

How do I skip validating the URI in lxml?
If you are sure that those specific errors are not significant to your use case, you could just catch them as an exception:

try:
    # process your tree here
    SomeFn()
except lxml.etree.XMLSyntaxError, e:
    print "Ignoring", e
    pass

Categories : Python

XML Declaration standalone="yes" lxml
You can pass the standalone keyword argument to tostring():

etree.tostring(tree,
               pretty_print=True,
               xml_declaration=True,
               encoding='UTF-8',
               standalone="yes")

Categories : Python

lxml-alternative for a BS-code
To parse, make an lxml.etree.HTMLParser and use lxml.etree.fromstring:

import lxml.etree

parser = lxml.etree.HTMLParser()
html = lxml.etree.fromstring(text, parser)

You can now use XPath to select the things that you want:

for elem in html.xpath("//span[@class='finereader']"):

Then, since lxml doesn't let you add text nodes, and instead deals with the text and tail content of nodes, we have to do some magic to replace the nodes with strings (inside that loop):

    text = (elem.text or "") + (elem.tail or "")
    if elem.getprevious() is not None:  # If there's a previous node
        previous = elem.getprevious()
        previous.tail = (previous.tail or "") + text  # append to its tail
    else:
        parent = elem.getparent()  # Otherwise use the parent
        parent.text = (parent.text or "") + text

Categories : Python

Why is lxml not finding this class?
It should work, provided root = tree.getroot().

import lxml.html
import urllib

response = urllib.urlopen('http://www.codecademy.com/username')
tree = lxml.html.parse(response)
# tree.write('/tmp/test.html')
root = tree.getroot()
print(root.find_class('stat-count'))

yields

[<Element span at 0xa3146bc>, <Element span at 0xa3146ec>]

Categories : Python

How to update XML file with lxml
Here you go: get the root of the tree, append your new element, then save the tree as a string to a file:

from lxml import etree

tree = etree.parse('books.xml')

new_entry = etree.fromstring('''<book category="web" cover="paperback">
<title lang="en">Learning XML 2</title>
<author>Erik Ray</author>
<year>2006</year>
<price>49.95</price>
</book>''')

root = tree.getroot()
root.append(new_entry)

f = open('books-mod.xml', 'w')
f.write(etree.tostring(root, pretty_print=True))
f.close()

Categories : Python

Formatting the output as XML with lxml
That's strange, because that is exactly the way it should work. Could you try this:

root = etree.XML( YOUR XML STRING )
print etree.tostring(root, pretty_print=True)

<Variable Name="one" RefID="two">
  <Component Type="three">
    <Value>four</Value>
  </Component>
</Variable>

This should generate a formatted string, which you can process yourself.

Categories : Python


