|
|
|
@@ -19,14 +19,15 @@ parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML. |
|
|
|
.. class:: HTMLParser(strict=True) |
|
|
|
|
|
|
|
Create a parser instance. If *strict* is ``True`` (the default), invalid |
|
|
|
html results in :exc:`~html.parser.HTMLParseError` exceptions [#]_. If |
|
|
|
HTML results in :exc:`~html.parser.HTMLParseError` exceptions [#]_. If |
|
|
|
*strict* is ``False``, the parser uses heuristics to make a best guess at |
|
|
|
the intention of any invalid html it encounters, similar to the way most |
|
|
|
browsers do. |
|
|
|
the intention of any invalid HTML it encounters, similar to the way most |
|
|
|
browsers do. Using ``strict=False`` is advised. |
|
|
|
|
|
|
|
An :class:`HTMLParser` instance is fed HTML data and calls handler functions when tags |
|
|
|
begin and end. The :class:`HTMLParser` class is meant to be overridden by the |
|
|
|
user to provide a desired behavior. |
|
|
|
An :class:`.HTMLParser` instance is fed HTML data and calls handler methods |
|
|
|
when start tags, end tags, text, comments, and other markup elements are |
|
|
|
encountered. The user should subclass :class:`.HTMLParser` and override its |
|
|
|
methods to implement the desired behavior. |
|
|
|
|
|
|
|
This parser does not check that end tags match start tags or call the end-tag |
|
|
|
handler for elements which are closed implicitly by closing an outer element. |
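
Because the parser itself does not pair end tags with start tags, a subclass that needs this check has to keep track of open elements on its own. The following is a minimal sketch of one way to do so, not part of the module: the class name ``TagBalanceChecker`` and its ``open_tags`` and ``mismatches`` attributes are hypothetical, and the *strict* argument is left at its default.

```python
from html.parser import HTMLParser


class TagBalanceChecker(HTMLParser):
    """Hypothetical helper that records end tags that do not match."""

    def __init__(self):
        super().__init__()
        self.open_tags = []    # stack of start tags not yet closed
        self.mismatches = []   # (tag, position) of unexpected end tags

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Accept the end tag only if it closes the innermost open element.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.mismatches.append((tag, self.getpos()))


checker = TagBalanceChecker()
checker.feed('<p><b>bold</p></b>')
checker.close()
```

After this run, ``checker.mismatches`` holds one entry for the misplaced ``</p>``. Void elements such as ``<br>`` never receive an end tag, so a real checker would need to skip them.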
|
|
|
@@ -39,25 +40,61 @@ An exception is defined as well: |
|
|
|
.. exception:: HTMLParseError |
|
|
|
|
|
|
|
Exception raised by the :class:`HTMLParser` class when it encounters an error |
|
|
|
while parsing. This exception provides three attributes: :attr:`msg` is a brief |
|
|
|
message explaining the error, :attr:`lineno` is the number of the line on which |
|
|
|
the broken construct was detected, and :attr:`offset` is the number of |
|
|
|
characters into the line at which the construct starts. |
|
|
|
while parsing and *strict* is ``True``. This exception provides three |
|
|
|
attributes: :attr:`msg` is a brief message explaining the error, |
|
|
|
:attr:`lineno` is the number of the line on which the broken construct was |
|
|
|
detected, and :attr:`offset` is the number of characters into the line at |
|
|
|
which the construct starts. |
|
|
|
|
|
|
|
:class:`HTMLParser` instances have the following methods: |
|
|
|
|
|
|
|
Example HTML Parser Application |
|
|
|
------------------------------- |
|
|
|
|
|
|
|
.. method:: HTMLParser.reset() |
|
|
|
As a basic example, below is a simple HTML parser that uses the |
|
|
|
:class:`HTMLParser` class to print out start tags, end tags, and data |
|
|
|
as they are encountered:: |
|
|
|
|
|
|
|
Reset the instance. Loses all unprocessed data. This is called implicitly at |
|
|
|
instantiation time. |
|
|
|
from html.parser import HTMLParser |
|
|
|
|
|
|
|
class MyHTMLParser(HTMLParser): |
|
|
|
def handle_starttag(self, tag, attrs): |
|
|
|
print("Encountered a start tag:", tag) |
|
|
|
def handle_endtag(self, tag): |
|
|
|
print("Encountered an end tag :", tag) |
|
|
|
def handle_data(self, data): |
|
|
|
print("Encountered some data :", data) |
|
|
|
|
|
|
|
parser = MyHTMLParser(strict=False) |
|
|
|
parser.feed('<html><head><title>Test</title></head>' |
|
|
|
'<body><h1>Parse me!</h1></body></html>') |
|
|
|
|
|
|
|
The output will then be:: |
|
|
|
|
|
|
|
Encountered a start tag: html |
|
|
|
Encountered a start tag: head |
|
|
|
Encountered a start tag: title |
|
|
|
Encountered some data : Test |
|
|
|
Encountered an end tag : title |
|
|
|
Encountered an end tag : head |
|
|
|
Encountered a start tag: body |
|
|
|
Encountered a start tag: h1 |
|
|
|
Encountered some data : Parse me! |
|
|
|
Encountered an end tag : h1 |
|
|
|
Encountered an end tag : body |
|
|
|
Encountered an end tag : html |
|
|
|
|
|
|
|
|
|
|
|
:class:`.HTMLParser` Methods |
|
|
|
---------------------------- |
|
|
|
|
|
|
|
:class:`HTMLParser` instances have the following methods: |
|
|
|
|
|
|
|
|
|
|
|
.. method:: HTMLParser.feed(data) |
|
|
|
|
|
|
|
Feed some text to the parser. It is processed insofar as it consists of |
|
|
|
complete elements; incomplete data is buffered until more data is fed or |
|
|
|
:meth:`close` is called. |
|
|
|
:meth:`close` is called. *data* must be :class:`str`. |
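
The buffering behavior can be observed by feeding a start tag in two pieces. This is a small sketch under the assumption that the default constructor arguments are used; the ``Collector`` class and its ``tags`` and ``text`` attributes are illustrative names only.

```python
from html.parser import HTMLParser


class Collector(HTMLParser):
    """Hypothetical subclass that stores start tags and text."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


collector = Collector()
collector.feed('<sp')                       # incomplete start tag: buffered
tags_after_partial = list(collector.tags)   # nothing reported yet
collector.feed('an>buffered ')              # completes the start tag
collector.feed('text</span>')
collector.close()                           # flush any remaining data
```

The text may reach ``handle_data`` in more than one piece, so the sketch joins the pieces (``''.join(collector.text)``) rather than relying on a single call.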
|
|
|
|
|
|
|
|
|
|
|
.. method:: HTMLParser.close() |
|
|
|
@@ -68,6 +105,12 @@ An exception is defined as well: |
|
|
|
the :class:`HTMLParser` base class method :meth:`close`. |
|
|
|
|
|
|
|
|
|
|
|
.. method:: HTMLParser.reset() |
|
|
|
|
|
|
|
Reset the instance. Loses all unprocessed data. This is called implicitly at |
|
|
|
instantiation time. |
|
|
|
|
|
|
|
|
|
|
|
.. method:: HTMLParser.getpos() |
|
|
|
|
|
|
|
Return current line number and offset. |
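
As an illustration, ``getpos()`` can be called from inside a handler to record where each construct starts. The sketch below assumes the default constructor arguments; ``PositionRecorder`` and ``positions`` are hypothetical names.

```python
from html.parser import HTMLParser


class PositionRecorder(HTMLParser):
    """Hypothetical subclass that notes where each start tag begins."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns a (lineno, offset) tuple for the current construct.
        self.positions.append((tag, self.getpos()))


recorder = PositionRecorder()
recorder.feed('<html>\n  <body>')
recorder.close()
```

Line numbers are counted from 1 and offsets from 0, so the ``body`` tag on the second line, indented by two spaces, is recorded at ``(2, 2)``.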
|
|
|
@@ -81,23 +124,35 @@ An exception is defined as well: |
|
|
|
attributes can be preserved, etc.). |
|
|
|
|
|
|
|
|
|
|
|
The following methods are called when data or markup elements are encountered |
|
|
|
and they are meant to be overridden in a subclass. The base class |
|
|
|
implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`): |
|
|
|
|
|
|
|
|
|
|
|
.. method:: HTMLParser.handle_starttag(tag, attrs) |
|
|
|
|
|
|
|
This method is called to handle the start of a tag. It is intended to be |
|
|
|
overridden by a derived class; the base class implementation does nothing. |
|
|
|
This method is called to handle the start of a tag (e.g. ``<div id="main">``). |
|
|
|
|
|
|
|
The *tag* argument is the name of the tag converted to lower case. The *attrs* |
|
|
|
argument is a list of ``(name, value)`` pairs containing the attributes found |
|
|
|
inside the tag's ``<>`` brackets. The *name* will be translated to lower case, |
|
|
|
and quotes in the *value* have been removed, and character and entity references |
|
|
|
have been replaced. For instance, for the tag ``<A |
|
|
|
HREF="http://www.cwi.nl/">``, this method would be called as |
|
|
|
``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``. |
|
|
|
have been replaced. |
|
|
|
|
|
|
|
For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method |
|
|
|
would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``. |
|
|
|
|
|
|
|
All entity references from :mod:`html.entities` are replaced in the attribute |
|
|
|
values. |
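
Since *attrs* is a list of pairs, converting it to a dictionary is often convenient when only a few attributes matter. The sketch below is one possible approach, not a recommended pattern from the module; ``LinkCollector`` and ``links`` are hypothetical names, and the query string in the URL is added only to show entity replacement.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical subclass that gathers the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':                  # tag names arrive in lower case
            attributes = dict(attrs)    # later duplicates overwrite earlier ones
            href = attributes.get('href')
            if href is not None:
                self.links.append(href)


collector = LinkCollector()
collector.feed('<A HREF="http://www.cwi.nl/?a=1&amp;b=2">CWI</A>')
collector.close()
```

The collected value contains a plain ``&`` because the entity reference in the attribute value has already been replaced by the time the handler runs.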
|
|
|
|
|
|
|
|
|
|
|
.. method:: HTMLParser.handle_endtag(tag) |
|
|
|
|
|
|
|
This method is called to handle the end tag of an element (e.g. ``</div>``). |
|
|
|
|
|
|
|
The *tag* argument is the name of the tag converted to lower case. |
|
|
|
|
|
|
|
|
|
|
|
.. method:: HTMLParser.handle_startendtag(tag, attrs) |
|
|
|
|
|
|
|
Similar to :meth:`handle_starttag`, but called when the parser encounters an |
|
|
|
@@ -106,57 +161,46 @@ An exception is defined as well: |
|
|
|
implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`. |
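
A short sketch of this default behavior, assuming :meth:`handle_startendtag` is not overridden: an XHTML-style empty tag produces a start event immediately followed by an end event. ``EventLogger`` and ``events`` are hypothetical names.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical subclass that logs start and end tag events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


logger = EventLogger()
logger.feed('<p>line<br/>break</p>')
logger.close()
```

A subclass that needs to tell ``<br/>`` apart from ``<br></br>`` can override :meth:`handle_startendtag` instead.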
|
|
|
|
|
|
|
|
|
|
|
.. method:: HTMLParser.handle_endtag(tag) |
|
|
|
|
|
|
|
This method is called to handle the end tag of an element. It is intended to be |
|
|
|
overridden by a derived class; the base class implementation does nothing. The |
|
|
|
*tag* argument is the name of the tag converted to lower case. |
|
|
|
|
|
|
|
|
|
|
|
.. method:: HTMLParser.handle_data(data) |
|
|
|
|
|
|
|
This method is called to process arbitrary data (e.g. the content of |
|
|
|
``<script>...</script>`` and ``<style>...</style>``). It is intended to be |
|
|
|
overridden by a derived class; the base class implementation does nothing. |
|
|
|
This method is called to process arbitrary data (e.g. text nodes and the |
|
|
|
content of ``<script>...</script>`` and ``<style>...</style>``). |
|
|
|
|
|
|
|
|
|
|
|
.. method:: HTMLParser.handle_charref(name) |
|
|
|
.. method:: HTMLParser.handle_entityref(name) |
|
|
|
|
|
|
|
This method is called to process a character reference of the form ``&#ref;``. |
|
|
|
It is intended to be overridden by a derived class; the base class |
|
|
|
implementation does nothing. |
|
|
|
This method is called to process a named character reference of the form |
|
|
|
``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference |
|
|
|
(e.g. ``'gt'``). |
|
|
|
|
|
|
|
|
|
|
|
.. method:: HTMLParser.handle_entityref(name) |
|
|
|
.. method:: HTMLParser.handle_charref(name) |
|
|
|
|
|
|
|
This method is called to process a general entity reference of the form |
|
|
|
``&name;`` where *name* is a general entity reference. It is intended to be |
|
|
|
overridden by a derived class; the base class implementation does nothing. |
|
|
|
This method is called to process decimal and hexadecimal numeric character |
|
|
|
references of the form ``&#NNN;`` and ``&#xNNN;``. For example, the decimal |
|
|
|
equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``; |
|
|
|
in this case the method will receive ``'62'`` or ``'x3E'``. |
|
|
|
|
|
|
|
|
|
|
|
.. method:: HTMLParser.handle_comment(data) |
|
|
|
|
|
|
|
This method is called when a comment is encountered. The *comment* argument is |
|
|
|
a string containing the text between the ``--`` and ``--`` delimiters, but not |
|
|
|
the delimiters themselves. For example, the comment ``<!--text-->`` will cause |
|
|
|
this method to be called with the argument ``'text'``. It is intended to be |
|
|
|
overridden by a derived class; the base class implementation does nothing. |
|
|
|
This method is called when a comment is encountered (e.g. ``<!--comment-->``). |
|
|
|
|
|
|
|
For example, the comment ``<!-- comment -->`` will cause this method to be |
|
|
|
called with the argument ``' comment '``. |
|
|
|
|
|
|
|
.. method:: HTMLParser.handle_decl(decl) |
|
|
|
The content of Internet Explorer conditional comments (condcoms) will also be |
|
|
|
sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``, |
|
|
|
this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``. |
|
|
|
|
|
|
|
Method called when an SGML ``doctype`` declaration is read by the parser. |
|
|
|
The *decl* parameter will be the entire contents of the declaration inside |
|
|
|
the ``<!...>`` markup. It is intended to be overridden by a derived class; |
|
|
|
the base class implementation does nothing. |
|
|
|
|
|
|
|
.. method:: HTMLParser.handle_decl(decl) |
|
|
|
|
|
|
|
.. method:: HTMLParser.unknown_decl(data) |
|
|
|
This method is called to handle an HTML doctype declaration (e.g. |
|
|
|
``<!DOCTYPE html>``). |
|
|
|
|
|
|
|
Method called when an unrecognized SGML declaration is read by the parser. |
|
|
|
The *data* parameter will be the entire contents of the declaration inside |
|
|
|
the ``<!...>`` markup. It is sometimes useful to be overridden by a |
|
|
|
derived class; the base class implementation raises an :exc:`HTMLParseError`. |
|
|
|
The *decl* parameter will be the entire contents of the declaration inside |
|
|
|
the ``<!...>`` markup (e.g. ``'DOCTYPE html'``). |
|
|
|
|
|
|
|
|
|
|
|
.. method:: HTMLParser.handle_pi(data) |
|
|
|
@@ -174,29 +218,123 @@ An exception is defined as well: |
|
|
|
cause the ``'?'`` to be included in *data*. |
|
|
|
|
|
|
|
|
|
|
|
.. _htmlparser-example: |
|
|
|
.. method:: HTMLParser.unknown_decl(data) |
|
|
|
|
|
|
|
Example HTML Parser Application |
|
|
|
------------------------------- |
|
|
|
This method is called when an unrecognized declaration is read by the parser. |
|
|
|
|
|
|
|
The *data* parameter will be the entire contents of the declaration inside |
|
|
|
the ``<![...]>`` markup. It is sometimes useful to override this method in a |
|
|
|
derived class. The base class implementation raises an :exc:`HTMLParseError` |
|
|
|
when *strict* is ``True``. |
|
|
|
|
|
|
|
As a basic example, below is a simple HTML parser that uses the |
|
|
|
:class:`HTMLParser` class to print out start tags, end tags, and data |
|
|
|
as they are encountered:: |
|
|
|
|
|
|
|
.. _htmlparser-examples: |
|
|
|
|
|
|
|
Examples |
|
|
|
-------- |
|
|
|
|
|
|
|
The following class implements a parser that will be used to illustrate more |
|
|
|
examples:: |
|
|
|
|
|
|
|
from html.parser import HTMLParser |
|
|
|
from html.entities import name2codepoint |
|
|
|
|
|
|
|
class MyHTMLParser(HTMLParser): |
|
|
|
def handle_starttag(self, tag, attrs): |
|
|
|
print("Encountered a start tag:", tag) |
|
|
|
print("Start tag:", tag) |
|
|
|
for attr in attrs: |
|
|
|
print(" attr:", attr) |
|
|
|
def handle_endtag(self, tag): |
|
|
|
print("Encountered an end tag:", tag) |
|
|
|
print("End tag :", tag) |
|
|
|
def handle_data(self, data): |
|
|
|
print("Encountered some data:", data) |
|
|
|
|
|
|
|
parser = MyHTMLParser() |
|
|
|
parser.feed('<html><head><title>Test</title></head>' |
|
|
|
'<body><h1>Parse me!</h1></body></html>') |
|
|
|
|
|
|
|
print("Data :", data) |
|
|
|
def handle_comment(self, data): |
|
|
|
print("Comment :", data) |
|
|
|
def handle_entityref(self, name): |
|
|
|
c = chr(name2codepoint[name]) |
|
|
|
print("Named ent:", c) |
|
|
|
def handle_charref(self, name): |
|
|
|
if name.startswith('x'): |
|
|
|
c = chr(int(name[1:], 16)) |
|
|
|
else: |
|
|
|
c = chr(int(name)) |
|
|
|
print("Num ent :", c) |
|
|
|
def handle_decl(self, data): |
|
|
|
print("Decl :", data) |
|
|
|
|
|
|
|
parser = MyHTMLParser(strict=False) |
|
|
|
|
|
|
|
Parsing a doctype:: |
|
|
|
|
|
|
|
>>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" ' |
|
|
|
... '"http://www.w3.org/TR/html4/strict.dtd">') |
|
|
|
Decl : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd" |
|
|
|
|
|
|
|
Parsing an element with a few attributes and a title:: |
|
|
|
|
|
|
|
>>> parser.feed('<img src="python-logo.png" alt="The Python logo">') |
|
|
|
Start tag: img |
|
|
|
attr: ('src', 'python-logo.png') |
|
|
|
attr: ('alt', 'The Python logo') |
|
|
|
>>> |
|
|
|
>>> parser.feed('<h1>Python</h1>') |
|
|
|
Start tag: h1 |
|
|
|
Data : Python |
|
|
|
End tag : h1 |
|
|
|
|
|
|
|
The content of ``script`` and ``style`` elements is returned as is, without |
|
|
|
further parsing:: |
|
|
|
|
|
|
|
>>> parser.feed('<style type="text/css">#python { color: green }</style>') |
|
|
|
Start tag: style |
|
|
|
attr: ('type', 'text/css') |
|
|
|
Data : #python { color: green } |
|
|
|
End tag : style |
|
|
|
>>> |
|
|
|
>>> parser.feed('<script type="text/javascript">' |
|
|
|
... 'alert("<strong>hello!</strong>");</script>') |
|
|
|
Start tag: script |
|
|
|
attr: ('type', 'text/javascript') |
|
|
|
Data : alert("<strong>hello!</strong>"); |
|
|
|
End tag : script |
|
|
|
|
|
|
|
Parsing comments:: |
|
|
|
|
|
|
|
>>> parser.feed('<!-- a comment -->' |
|
|
|
... '<!--[if IE 9]>IE-specific content<![endif]-->') |
|
|
|
Comment : a comment |
|
|
|
Comment : [if IE 9]>IE-specific content<![endif] |
|
|
|
|
|
|
|
Parsing named and numeric character references and converting them to the |
|
|
|
correct char (note: these 3 references are all equivalent to ``'>'``):: |
|
|
|
|
|
|
|
>>> parser.feed('&gt;&#62;&#x3E;') |
|
|
|
Named ent: > |
|
|
|
Num ent : > |
|
|
|
Num ent : > |
|
|
|
|
|
|
|
Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but |
|
|
|
:meth:`~HTMLParser.handle_data` might be called more than once:: |
|
|
|
|
|
|
|
>>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']: |
|
|
|
... parser.feed(chunk) |
|
|
|
... |
|
|
|
Start tag: span |
|
|
|
Data : buff |
|
|
|
Data : ered |
|
|
|
Data : text |
|
|
|
End tag : span |
|
|
|
|
|
|
|
Parsing invalid HTML (e.g. unquoted attributes) also works:: |
|
|
|
|
|
|
|
>>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>') |
|
|
|
Start tag: p |
|
|
|
Start tag: a |
|
|
|
attr: ('class', 'link') |
|
|
|
attr: ('href', '#main') |
|
|
|
Data : tag soup |
|
|
|
End tag : p |
|
|
|
End tag : a |
|
|
|
|
|
|
|
.. rubric:: Footnotes |
|
|
|
|
|
|
|
|