root/livinglogic.python.xist/src/ll/xist/parse.py @ 4475:96ffb42c6d99

Revision 4475:96ffb42c6d99, 47.3 KB (checked in by Walter Doerwald <walter@…>, 8 years ago)

Markup None as a constant.

# -*- coding: utf-8 -*-

## Copyright 1999-2011 by LivingLogic AG, Bayreuth/Germany
## Copyright 1999-2011 by Walter Dörwald
##
## All Rights Reserved
##
## See ll/__init__.py for the license


"""
This module contains everything you need to create XIST objects by parsing
files, strings, URLs etc.

Parsing XML is done with a pipelined approach. The first step in the pipeline
is a source object that provides the input for the rest of the pipeline.
The next step is the XML parser. It turns the input source into an iterator over
parsing events (an "event stream"). Further steps in the pipeline might resolve
namespace prefixes (:class:`NS`) and instantiate XIST classes
(:class:`Node`). The final step in the pipeline is either building an
XML tree via :func:`tree` or an iterative parsing step (similar to ElementTree's
:func:`iterparse` function) via :func:`itertree`.

Parsing a simple HTML string might look like this::

    >>> from ll.xist import xsc, parse
    >>> from ll.xist.ns import html
    >>> source = "<a href='http://www.python.org/'>Python</a>"
    >>> doc = parse.tree(
    ...     parse.String(source),
    ...     parse.Expat(),
    ...     parse.NS(html),
    ...     parse.Node(pool=xsc.Pool(html))
    ... )
    >>> doc.bytes()
    '<a href="http://www.python.org/">Python</a>'

A source object is an iterable object that produces the input byte string for
the parser (possibly in multiple chunks), together with information about the
URL of the input::

    >>> from ll.xist import parse
    >>> list(parse.String("<a href='http://www.python.org/'>Python</a>"))
    [('url', URL('STRING')),
     ('bytes', "<a href='http://www.python.org/'>Python</a>")]

All subsequent objects in the pipeline are callable objects: each one takes the
input iterator as an argument and returns an iterator over events. The
following code shows an example of an event stream::

    >>> from ll.xist import parse
    >>> source = "<a href='http://www.python.org/'>Python</a>"
    >>> list(parse.events(parse.String(source), parse.Expat()))
    [('url', URL('STRING')),
     ('position', (0, 0)),
     ('enterstarttag', u'a'),
     ('enterattr', u'href'),
     ('text', u'http://www.python.org/'),
     ('leaveattr', u'href'),
     ('leavestarttag', u'a'),
     ('position', (0, 39)),
     ('text', u'Python'),
     ('endtag', u'a')]
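
The calling convention used by the pipeline (an iterable source followed by
callables that each turn one event iterator into another) can be sketched
without XIST itself. The names ``string_source``, ``upper_stage`` and ``chain``
are hypothetical and exist only for this illustration; they are not part of
this module:

```python
from functools import reduce


def string_source(data, url="STRING"):
    # Hypothetical stand-in for a source object: an iterable of (type, data) events
    yield ("url", url)
    yield ("bytes", data)


def upper_stage(events):
    # Hypothetical pipeline stage: takes an event iterator, returns an event iterator
    for (evtype, data) in events:
        if evtype == "bytes":
            data = data.upper()
        yield (evtype, data)


def chain(source, *stages):
    # Feed the output of each stage into the next one
    return reduce(lambda stream, stage: stage(stream), stages, source)


result = list(chain(string_source(b"<a/>"), upper_stage))
```

Because every stage is a generator, nothing is read from the source until the
final consumer starts iterating.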

An event is a tuple consisting of the event type and the event data. Different
stages in the pipeline produce different event types. The following event types
can be produced by source objects:

    ``"url"``
        The event data is the URL of the source. Usually such an event is produced
        only once at the start of the event stream. For sources that have no
        natural URL (like strings or streams) the URL can be specified when
        creating the source object.

    ``"bytes"``
        This event is produced by source objects (and :class:`Transcoder` objects).
        The event data is a byte string.

    ``"unicode"``
        The event data is a unicode string. This event is produced by
        :class:`Decoder` objects. Note that the only predefined pipeline objects
        that can handle ``"unicode"`` events are :class:`Encoder` objects.
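
How a stage turns ``"bytes"`` events into ``"unicode"`` events can be
sketched with an incremental decoder from the standard library. This is an
illustration only: ``decode_stage`` is a hypothetical name, and the sketch
assumes a fixed UTF-8 encoding, whereas :class:`Decoder` uses the ``"xml"``
codec, which can detect the encoding from the data:

```python
import codecs


def decode_stage(events, encoding="utf-8"):
    # Hypothetical stage; assumes a known encoding instead of detecting it
    decoder = codecs.getincrementaldecoder(encoding)()
    for (evtype, data) in events:
        if evtype == "bytes":
            text = decoder.decode(data, False)
            if text:
                yield ("unicode", text)
        else:
            yield (evtype, data)
    # Flush whatever the decoder is still holding back
    text = decoder.decode(b"", True)
    if text:
        yield ("unicode", text)


# The two bytes of the encoded "ö" are deliberately split across two chunks
raw = "<a>ö</a>".encode("utf-8")
chunks = [("url", "STRING"), ("bytes", raw[:4]), ("bytes", raw[4:])]
decoded = list(decode_stage(iter(chunks)))
```

The incremental decoder keeps the incomplete byte sequence until the next
chunk arrives, so chunk boundaries in the input never corrupt characters.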

The following types of events are produced by parsers (in addition to the
``"url"`` event from above):

    ``"position"``
        The event data is a tuple containing the line and column number in the
        source (both starting with 0). All the following events should use this
        position information until the next position event.

    ``"xmldecl"``
        The XML declaration. The event data is a dictionary containing the keys
        ``"version"``, ``"encoding"`` and ``"standalone"``. Parsers may omit this
        event.

    ``"begindoctype"``
        The beginning of the doctype. The event data is a dictionary containing
        the keys ``"name"``, ``"publicid"`` and ``"systemid"``. Parsers may omit
        this event.

    ``"enddoctype"``
        The end of the doctype. The event data is :const:`None`. (If there is no
        internal subset, the ``"enddoctype"`` event immediately follows the
        ``"begindoctype"`` event.) Parsers may omit this event.

    ``"comment"``
        A comment. The event data is the content of the comment.

    ``"text"``
        Text data. The event data is the text content. Parsers should try to avoid
        outputting multiple text events in sequence.

    ``"cdata"``
        A CDATA section. The event data is the content of the CDATA section.
        Parsers may report CDATA sections as ``"text"`` events instead of
        ``"cdata"`` events.

    ``"enterstarttag"``
        The beginning of an element start tag. The event data is the element name.

    ``"leavestarttag"``
        The end of an element start tag. The event data is the element name.
        The parser will output events for the attributes between the
        ``"enterstarttag"`` and the ``"leavestarttag"`` event.

    ``"enterattr"``
        The beginning of an attribute. The event data is the attribute name.

    ``"leaveattr"``
        The end of an attribute. The event data is the attribute name.
        The parser will output events for the attribute value between the
        ``"enterattr"`` and the ``"leaveattr"`` event. (In almost all cases
        this is one text event.)

    ``"endtag"``
        An element end tag. The event data is the element name.

    ``"procinst"``
        A processing instruction. The event data is a tuple consisting of the
        processing instruction target and the data.

    ``"entity"``
        An entity reference. The event data is the entity name.
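
The ordering of the element and attribute events can be reproduced with the
standard library's :mod:`xml.parsers.expat` alone. The following is a
simplified sketch, not the :class:`Expat` class from this module:
``expat_events`` is a hypothetical helper that collects the events in a list
and leaves out ``"url"`` and ``"position"`` events as well as incremental
feeding:

```python
from xml.parsers import expat


def expat_events(data):
    # Hypothetical helper: parse ``data`` in one go and collect the events
    events = []
    parser = expat.ParserCreate()
    parser.buffer_text = True  # merge adjacent character data
    parser.ordered_attributes = True  # attributes arrive as [name, value, ...]

    def start(name, attrs):
        events.append(("enterstarttag", name))
        for i in range(0, len(attrs), 2):
            events.append(("enterattr", attrs[i]))
            events.append(("text", attrs[i+1]))
            events.append(("leaveattr", attrs[i]))
        events.append(("leavestarttag", name))

    parser.StartElementHandler = start
    parser.EndElementHandler = lambda name: events.append(("endtag", name))
    parser.CharacterDataHandler = lambda text: events.append(("text", text))
    parser.Parse(data, True)
    return events


events = expat_events(b"<a href='http://www.python.org/'>Python</a>")
```

Attribute values are reported as ``"text"`` events between ``"enterattr"`` and
``"leaveattr"``, so a consumer has to track whether it is currently inside an
attribute.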

The following events are produced for elements and attributes in namespace mode
(instead of those without the ``ns`` suffix). They are produced by :class:`NS`
objects or by :class:`Expat` objects when :var:`ns` is true (i.e. the expat
parser does the namespace resolution):

    ``"enterstarttagns"``
        The beginning of an element start tag in namespace mode.
        The event data is an (element name, namespace name) tuple.

    ``"leavestarttagns"``
        The end of an element start tag in namespace mode. The event data is an
        (element name, namespace name) tuple.

    ``"enterattrns"``
        The beginning of an attribute in namespace mode. The event data is an
        (attribute name, namespace name) tuple.

    ``"leaveattrns"``
        The end of an attribute in namespace mode. The event data is an
        (attribute name, namespace name) tuple.

    ``"endtagns"``
        An element end tag in namespace mode. The event data is an
        (element name, namespace name) tuple.
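
When expat does the namespace resolution itself, it reports each name as the
namespace name and the local name joined by a separator character. The
following standard-library sketch shows how such a name can be split into the
(name, namespace name) tuple described above. ``SEP`` and ``split_name`` are
hypothetical names chosen for this illustration:

```python
from xml.parsers import expat

# Separator between namespace name and local name (an arbitrary control character)
SEP = chr(1)


def split_name(name):
    # Turn "namespace<SEP>local" into (local, namespace); no namespace gives None
    if SEP in name:
        (xmlns, _, local) = name.partition(SEP)
        return (local, xmlns)
    return (name, None)


names = []
parser = expat.ParserCreate(None, SEP)
parser.StartElementHandler = lambda name, attrs: names.append(split_name(name))
parser.Parse(b"<a xmlns='http://www.w3.org/1999/xhtml'/>", True)
```

A control character is a convenient separator because it cannot occur in a
well-formed XML name or namespace name.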

Once XIST nodes have been instantiated (by :class:`Node` objects) the
following events are used:

    ``"xmldeclnode"``
        The XML declaration. The event data is an instance of
        :class:`ll.xist.xml.XML`.

    ``"doctypenode"``
        The doctype. The event data is an instance of :class:`ll.xist.xsc.DocType`.

    ``"commentnode"``
        A comment. The event data is an instance of :class:`ll.xist.xsc.Comment`.

    ``"textnode"``
        Text data. The event data is an instance of :class:`ll.xist.xsc.Text`.

    ``"startelementnode"``
        The beginning of an element. The event data is an instance of
        :class:`ll.xist.xsc.Element` (or rather one of its subclasses). The
        attributes of the element object are set, but the element has no content.

    ``"endelementnode"``
        The end of an element. The event data is an instance of
        :class:`ll.xist.xsc.Element`.

    ``"procinstnode"``
        A processing instruction. The event data is an instance of
        :class:`ll.xist.xsc.ProcInst`.

    ``"entitynode"``
        An entity reference. The event data is an instance of
        :class:`ll.xist.xsc.Entity`.

For consuming event streams there are three functions:

    :func:`events`
        This generator simply outputs the events.

    :func:`tree`
        This function builds an XML tree from the events and returns it.

    :func:`itertree`
        This generator builds a tree like :func:`tree`, but returns events
        during certain steps in the parsing process.
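
The idea behind building a tree from an event stream can be sketched with a
stack of open elements. This is a deliberately reduced illustration and not
how :func:`tree` is implemented: ``build_tree`` is a hypothetical function, it
uses plain dictionaries instead of XIST nodes, and it assumes that the input
contains no attribute events:

```python
def build_tree(events):
    # Hypothetical consumer: nest elements according to start/end tag events
    root = {"name": None, "content": []}
    stack = [root]
    for (evtype, data) in events:
        if evtype == "enterstarttag":
            node = {"name": data, "content": []}
            stack[-1]["content"].append(node)
            stack.append(node)
        elif evtype == "text":
            stack[-1]["content"].append(data)
        elif evtype == "endtag":
            stack.pop()
    return root["content"]


# Events for the document <a><b>x</b>y</a>
sample = [
    ("enterstarttag", "a"), ("leavestarttag", "a"),
    ("enterstarttag", "b"), ("leavestarttag", "b"),
    ("text", "x"), ("endtag", "b"),
    ("text", "y"), ("endtag", "a"),
]
nodes = build_tree(sample)
```

An iterative variant would yield each node at the point where this sketch pops
it from the stack, instead of returning everything at the end.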
"""


import sys, os, os.path, warnings, cStringIO, codecs, contextlib, types

from xml.parsers import expat

from ll import url as url_, misc, xml_codec
from ll.xist import xsc, xfind
try:
    from ll.xist import sgmlop
except ImportError:
    pass
from ll.xist.ns import xml, html


__docformat__ = "reStructuredText"


html_xmlns = "http://www.w3.org/1999/xhtml"


###
### exceptions
###

class UnknownEventError(TypeError):
    """
    This exception is raised when a pipeline object doesn't know how to handle
    an event.
    """
    def __init__(self, pipe, event):
        self.pipe = pipe
        self.event = event

    def __str__(self):
        return "{0.pipe!r} can't handle event type {0.event[0]!r}".format(self)


###
### Sources: Classes that create an event stream
###

class String(object):
    """
    Provides parser input from a string.
    """
    def __init__(self, data, url=None):
        """
        Create a :class:`String` object. :var:`data` must be a byte or
        unicode string. :var:`url` specifies the URL for the source (defaulting
        to ``"STRING"``).
        """
        self.url = url_.URL(url if url is not None else "STRING")
        self.data = data

    def __iter__(self):
        """
        Produces an event stream of one ``"url"`` event and one ``"bytes"`` or
        ``"unicode"`` event for the data.
        """
        yield (u"url", self.url)
        if isinstance(self.data, str):
            yield (u"bytes", self.data)
        elif isinstance(self.data, unicode):
            yield (u"unicode", self.data)
        else:
            raise TypeError("data must be str or unicode")


class Iter(object):
    """
    Provides parser input from an iterator over strings.
    """

    def __init__(self, iterable, url=None):
        """
        Create an :class:`Iter` object. :var:`iterable` must be an iterable object
        producing byte or unicode strings. :var:`url` specifies the URL for the
        source (defaulting to ``"ITER"``).
        """
        self.url = url_.URL(url if url is not None else "ITER")
        self.iterable = iterable

    def __iter__(self):
        """
        Produces an event stream of one ``"url"`` event followed by the
        ``"bytes"``/``"unicode"`` events for the data from the iterable.
        """
        yield (u"url", self.url)
        for data in self.iterable:
            if isinstance(data, str):
                yield (u"bytes", data)
            elif isinstance(data, unicode):
                yield (u"unicode", data)
            else:
                raise TypeError("data must be str or unicode")


class Stream(object):
    """
    Provides parser input from a stream (i.e. an object that provides a
    :meth:`read` method).
    """

    def __init__(self, stream, url=None, bufsize=8192):
        """
        Create a :class:`Stream` object. :var:`stream` must have a :meth:`read`
        method (with a ``size`` argument). :var:`url` specifies the URL for the
        source (defaulting to ``"STREAM"``). :var:`bufsize` specifies the
        chunksize for reads from the stream.
        """
        self.url = url_.URL(url if url is not None else "STREAM")
        self.stream = stream
        self.bufsize = bufsize

    def __iter__(self):
        """
        Produces an event stream of one ``"url"`` event followed by the
        ``"bytes"``/``"unicode"`` events for the data from the stream.
        """
        yield (u"url", self.url)
        while True:
            data = self.stream.read(self.bufsize)
            if data:
                if isinstance(data, str):
                    yield (u"bytes", data)
                elif isinstance(data, unicode):
                    yield (u"unicode", data)
                else:
                    raise TypeError("data must be str or unicode")
            else:
                break


class File(object):
    """
    Provides parser input from a file.
    """

    def __init__(self, filename, bufsize=8192):
        """
        Create a :class:`File` object. :var:`filename` is the name of the file
        and may start with ``~`` or ``~user`` for the home directory of the
        current or the specified user. :var:`bufsize` specifies the chunksize
        for reads from the file.
        """
        self.url = url_.File(filename)
        self._filename = os.path.expanduser(filename)
        self.bufsize = bufsize

    def __iter__(self):
        """
        Produces an event stream of one ``"url"`` event followed by the
        ``"bytes"`` events for the data from the file.
        """
        yield (u"url", self.url)
        with open(self._filename, "rb") as stream:
            while True:
                data = stream.read(self.bufsize)
                if data:
                    yield (u"bytes", data)
                else:
                    break


class URL(object):
    """
    Provides parser input from a URL.
    """

    def __init__(self, name, bufsize=8192, *args, **kwargs):
        """
        Create a :class:`URL` object. :var:`name` is the URL. :var:`bufsize`
        specifies the chunksize for reads from the URL. :var:`args` and
        :var:`kwargs` will be passed on to the :meth:`open` method of the URL
        object.

        The URL for the input will be the final URL for the resource (i.e. the
        URL after any redirects have been followed).
        """
        self.url = url_.URL(name)
        self.bufsize = bufsize
        self.args = args
        self.kwargs = kwargs

    def __iter__(self):
        """
        Produces an event stream of one ``"url"`` event followed by the
        ``"bytes"`` events for the data from the URL.
        """
        with contextlib.closing(self.url.open("rb", *self.args, **self.kwargs)) as stream:
            # Report the final URL inside the ``with`` block, so that the stream
            # gets closed even if the consumer abandons the iterator early
            yield (u"url", stream.finalurl())
            while True:
                data = stream.read(self.bufsize)
                if data:
                    yield (u"bytes", data)
                else:
                    break


class ETree(object):
    """
    Produces a (namespaced) event stream from an object that supports the
    ElementTree__ API.

    __ http://effbot.org/zone/element-index.htm
    """

    def __init__(self, data, url=None, defaultxmlns=None):
        """
        Create an :class:`ETree` object. Arguments have the following meaning:

        :var:`data`
            An object that supports the ElementTree API.

        :var:`url`
            The URL of the source. Defaults to ``"ETREE"``.

        :var:`defaultxmlns`
            The namespace name (or a namespace module containing a namespace name)
            that will be used for all elements that don't have a namespace.
        """
        self.url = url_.URL(url if url is not None else "ETREE")
        self.data = data
        self.defaultxmlns = xsc.nsname(defaultxmlns)

    def _asxist(self, node):
        name = type(node).__name__
        if "Element" in name:
            elementname = node.tag
            if elementname.startswith("{"):
                (elementxmlns, sep, elementname) = elementname[1:].partition("}")
            else:
                elementxmlns = self.defaultxmlns
            yield (u"enterstarttagns", (elementname, elementxmlns))
            for (attrname, attrvalue) in node.items():
                if attrname.startswith("{"):
                    (attrxmlns, sep, attrname) = attrname[1:].partition("}")
                else:
                    attrxmlns = None
                yield (u"enterattrns", (attrname, attrxmlns))
                yield (u"text", attrvalue)
                yield (u"leaveattrns", (attrname, attrxmlns))
            yield (u"leavestarttagns", (elementname, elementxmlns))
            if node.text:
                yield (u"text", node.text)
            for child in node:
                for event in self._asxist(child):
                    yield event
                if hasattr(child, "tail") and child.tail:
                    yield (u"text", child.tail)
            yield (u"endtagns", (elementname, elementxmlns))
        elif "ProcessingInstruction" in name:
            yield (u"procinst", (node.target, node.text))
        elif "Comment" in name:
            yield (u"comment", node.text)

    def __iter__(self):
        """
        Produces an event stream of namespaced parsing events for the ElementTree
        object passed as :var:`data` to the constructor.
        """
        yield (u"url", self.url)
        for event in self._asxist(self.data):
            yield event


###
### Transformers: Classes that transform the event stream.
###

class Decoder(object):
    """
    Decode the byte strings produced by the previous object in the pipeline to
    unicode strings.

    This input object can be a source object or any other pipeline object that
    produces byte strings.
    """

    def __init__(self, encoding=None):
        """
        Create a :class:`Decoder` object. :var:`encoding` is the encoding of the
        input. If :var:`encoding` is :const:`None` it will be automatically
        detected from the XML data.
        """
        self.encoding = encoding

    def __call__(self, input):
        decoder = codecs.getincrementaldecoder("xml")(encoding=self.encoding)
        for (evtype, data) in input:
            if evtype == u"bytes":
                data = decoder.decode(data, False)
                if data:
                    yield (u"unicode", data)
            elif evtype == u"unicode":
                if data:
                    yield (u"unicode", data)
            elif evtype == u"url":
                yield (u"url", data)
            else:
                raise UnknownEventError(self, (evtype, data))
        data = decoder.decode("", True)
        if data:
            yield (u"unicode", data)

    def __repr__(self):
        return "<{0.__class__.__module__}.{0.__class__.__name__} object encoding={0.encoding!r} at {1:#x}>".format(self, id(self))


class Encoder(object):
    """
    Encode the unicode strings produced by the previous object in the pipeline to
    byte strings.

    This input object must be a pipeline object that produces unicode output
    (e.g. a :class:`Decoder` object).
    """

    def __init__(self, encoding=None):
        """
        Create an :class:`Encoder` object. :var:`encoding` will be the encoding of
        the output. If :var:`encoding` is :const:`None` it will be automatically
        detected from the XML declaration in the data.
        """
        self.encoding = encoding

    def __call__(self, input):
        encoder = codecs.getincrementalencoder("xml")(encoding=self.encoding)
        for (evtype, data) in input:
            if evtype == u"unicode":
                data = encoder.encode(data, False)
                if data:
                    yield (u"bytes", data)
            elif evtype == u"bytes":
                if data:
                    yield (u"bytes", data)
            elif evtype == u"url":
                yield (u"url", data)
            else:
                raise UnknownEventError(self, (evtype, data))
        data = encoder.encode(u"", True)
        if data:
            yield (u"bytes", data)

    def __repr__(self):
        return "<{0.__class__.__module__}.{0.__class__.__name__} object encoding={0.encoding!r} at {1:#x}>".format(self, id(self))


class Transcoder(object):
    """
    Transcode the byte strings of the input object into another encoding.

    This input object can be a source object or any other pipeline object that
    produces byte strings.
    """

    def __init__(self, fromencoding=None, toencoding=None):
        """
        Create a :class:`Transcoder` object. :var:`fromencoding` is the encoding
        of the input. :var:`toencoding` is the encoding of the output. If either
        of them is :const:`None` the encoding will be detected from the data.
        """
        self.fromencoding = fromencoding
        self.toencoding = toencoding

    def __call__(self, input):
        decoder = codecs.getincrementaldecoder("xml")(encoding=self.fromencoding)
        encoder = codecs.getincrementalencoder("xml")(encoding=self.toencoding)
        for (evtype, data) in input:
            if evtype == u"bytes":
                data = encoder.encode(decoder.decode(data, False), False)
                if data:
                    yield (u"bytes", data)
            elif evtype == u"url":
                yield (u"url", data)
            else:
                raise UnknownEventError(self, (evtype, data))
        data = encoder.encode(decoder.decode("", True), True)
        if data:
            yield (u"bytes", data)

    def __repr__(self):
        return "<{0.__class__.__module__}.{0.__class__.__name__} object fromencoding={0.fromencoding!r} toencoding={0.toencoding!r} at {1:#x}>".format(self, id(self))


###
### Parsers
###

class Parser(object):
    """
    Basic parser interface.
    """
    evxmldecl = u"xmldecl"
    evbegindoctype = u"begindoctype"
    evenddoctype = u"enddoctype"
    evcomment = u"comment"
    evtext = u"text"
    evcdata = u"cdata"
    eventerstarttag = u"enterstarttag"
    eventerstarttagns = u"enterstarttagns"
    eventerattr = u"enterattr"
    eventerattrns = u"enterattrns"
    evleaveattr = u"leaveattr"
    evleaveattrns = u"leaveattrns"
    evleavestarttag = u"leavestarttag"
    evleavestarttagns = u"leavestarttagns"
    evendtag = u"endtag"
    evendtagns = u"endtagns"
    evprocinst = u"procinst"
    eventity = u"entity"
    evposition = u"position"
    evurl = u"url"


class Expat(Parser):
    """
    A parser using Python's built-in :mod:`expat` parser.
    """

    def __init__(self, encoding=None, xmldecl=False, doctype=False, loc=True, cdata=False, ns=False):
        """
        Create an :class:`Expat` parser. Arguments have the following meaning:

        :var:`encoding` : string or :const:`None`
            Forces the parser to use the specified encoding. The default
            :const:`None` results in the encoding being detected from the XML itself.

        :var:`xmldecl` : bool
            Should the parser produce events for the XML declaration?

        :var:`doctype` : bool
            Should the parser produce events for the document type?

        :var:`loc` : bool
            Should the parser produce ``"position"`` events?

        :var:`cdata` : bool
            Should the parser output CDATA sections as ``"cdata"`` events? (If
            :var:`cdata` is false output ``"text"`` events instead.)

        :var:`ns` : bool
            If :var:`ns` is true, the parser does its own namespace processing,
            i.e. it will emit ``"enterstarttagns"``, ``"leavestarttagns"``,
            ``"endtagns"``, ``"enterattrns"`` and ``"leaveattrns"`` events instead
            of ``"enterstarttag"``, ``"leavestarttag"``, ``"endtag"``,
            ``"enterattr"`` and ``"leaveattr"`` events.
        """
        self.encoding = encoding
        self.xmldecl = xmldecl
        self.doctype = doctype
        self.loc = loc
        self.cdata = cdata
        self.ns = ns

    def __repr__(self):
        v = []
        if self.encoding is not None:
            v.append(" encoding={!r}".format(self.encoding))
        if self.xmldecl is not None:
            v.append(" xmldecl={!r}".format(self.xmldecl))
        if self.doctype is not None:
            v.append(" doctype={!r}".format(self.doctype))
        if self.loc is not None:
            v.append(" loc={!r}".format(self.loc))
        if self.cdata is not None:
            v.append(" cdata={!r}".format(self.cdata))
        if self.ns is not None:
            v.append(" ns={!r}".format(self.ns))
        return "<{0.__class__.__module__}.{0.__class__.__name__} object{1} at {2:#x}>".format(self, "".join(v), id(self))

    def __call__(self, input):
        """
        Return an iterator over the events produced by :var:`input`.
        """
        self._parser = expat.ParserCreate(self.encoding, "\x01" if self.ns else None)
        self._parser.buffer_text = True
        self._parser.ordered_attributes = True
        self._parser.UseForeignDTD(True)
        self._parser.CharacterDataHandler = self._handle_text
        self._parser.StartElementHandler = self._handle_startelement
        self._parser.EndElementHandler = self._handle_endelement
        self._parser.ProcessingInstructionHandler = self._handle_procinst
        self._parser.CommentHandler = self._handle_comment
        self._parser.DefaultHandler = self._handle_default

        if self.cdata:
            self._parser.StartCdataSectionHandler = self._handle_startcdata
            self._parser.EndCdataSectionHandler = self._handle_endcdata

        if self.xmldecl:
            self._parser.XmlDeclHandler = self._handle_xmldecl

        # Always required, as we want to recognize whether a comment or PI is in the internal DTD subset
        self._parser.StartDoctypeDeclHandler = self._handle_begindoctype
        self._parser.EndDoctypeDeclHandler = self._handle_enddoctype

        self._indoctype = False
        self._incdata = False
        self._currentloc = None # Remember the last reported position

        # Buffers the events generated during one call to ``Parse``
        self._buffer = []

        try:
            for (evtype, data) in input:
                if evtype == u"bytes":
                    try:
                        self._parser.Parse(data, False)
                    except Exception as exc:
                        # In case of an exception we want to output the events we have gathered so far, before reraising the exception
                        for event in self._flush(True):
                            yield event
                        raise exc
                    else:
                        for event in self._flush(False):
                            yield event
                elif evtype == u"url":
                    yield (self.evurl, data)
                else:
                    raise UnknownEventError(self, (evtype, data))
            try:
                self._parser.Parse(b"", True)
            except Exception as exc:
                for event in self._flush(True):
                    yield event
                raise exc
            else:
                for event in self._flush(True):
                    yield event
        finally:
            del self._buffer
            del self._currentloc
            del self._incdata
            del self._indoctype
            del self._parser

    def _event(self, evtype, evdata):
        loc = None
        if self.loc:
            loc = (self._parser.CurrentLineNumber-1, self._parser.CurrentColumnNumber)
            if loc == self._currentloc:
                loc = None
        if self._buffer and evtype == self._buffer[-1][0] == self.evtext:
            self._buffer[-1] = (evtype, self._buffer[-1][1] + evdata)
        else:
            if loc:
                self._buffer.append((self.evposition, loc))
                # Only update the remembered position when a new one has been
                # reported, otherwise every other event would repeat the position
                self._currentloc = loc
            self._buffer.append((evtype, evdata))

    def _flush(self, force):
        # Flush ``self._buffer`` as far as possible
        if force or not self._buffer or self._buffer[-1][0] != self.evtext:
            for event in self._buffer:
                yield event
            del self._buffer[:]
        else:
            # hold back the last text event, because there might be more
            for event in self._buffer[:-1]:
                yield event
            del self._buffer[:-1]

    def _getname(self, name):
        if self.ns:
            if "\x01" in name:
                return tuple(name.split("\x01")[::-1])
            return (name, None)
        return name

    def _handle_startcdata(self):
        self._incdata = True

    def _handle_endcdata(self):
        self._incdata = False

    def _handle_xmldecl(self, version, encoding, standalone):
        standalone = (bool(standalone) if standalone != -1 else None)
        self._event(self.evxmldecl, {u"version": version, u"encoding": encoding, u"standalone": standalone})

    def _handle_begindoctype(self, doctypename, systemid, publicid, has_internal_subset):
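        # Remember that we are inside the doctype, so that comments and
        # processing instructions from the internal subset can be skipped
        self._indoctype = True
        if self.doctype:
            self._event(self.evbegindoctype, {u"name": doctypename, u"publicid": publicid, u"systemid": systemid})

    def _handle_enddoctype(self):
        self._indoctype = False
        if self.doctype:
            self._event(self.evenddoctype, None)

    def _handle_default(self, data):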
807        if data.startswith(u"&") and data.endswith(u";"):
808            self._event(self.eventity, data[1:-1])
809
810    def _handle_comment(self, data):
811        if not self._indoctype:
812            self._event(self.evcomment, data)
813
814    def _handle_text(self, data):
815        self._event(self.evcdata if self._incdata else self.evtext, data)
816
817    def _handle_startelement(self, name, attrs):
818        name = self._getname(name)
819        self._event(self.eventerstarttagns if self.ns else self.eventerstarttag, name)
820        for i in xrange(0, len(attrs), 2):
821            key = self._getname(attrs[i])
822            self._event(self.eventerattrns if self.ns else self.eventerattr, key)
823            self._event(self.evtext, attrs[i+1])
824            self._event(self.evleaveattrns if self.ns else self.evleaveattr, key)
825        self._event(self.evleavestarttagns if self.ns else self.evleavestarttag, name)
826
827    def _handle_endelement(self, name):
828        name = self._getname(name)
829        self._event(self.evendtagns if self.ns else self.evendtag, name)
830
831    def _handle_procinst(self, target, data):
832        if not self._indoctype:
833            self._event(self.evprocinst, (target, data))
834
835
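The `_getname` method above splits names on `"\x01"`, which suggests that expat is asked to join the namespace name and the local name with that character. The following standalone sketch shows this behaviour with the standard library's `xml.parsers.expat`. It is written for a current Python 3 interpreter (unlike the Python 2 module above), and `SEP`, `split_name` and `element_names` are illustrative names that are not part of the module.

```python
from xml.parsers import expat

# Assumption: the separator is "\x01", as the check in ``_getname`` suggests
SEP = "\x01"


def split_name(name):
    # expat reports "namespace<SEP>localname"; reverse it to (localname, namespace)
    if SEP in name:
        return tuple(name.split(SEP)[::-1])
    return (name, None)


def element_names(xml_bytes):
    # Collect the (localname, namespace) tuple of every start tag
    names = []

    def start(name, attrs):
        names.append(split_name(name))

    parser = expat.ParserCreate(namespace_separator=SEP)
    parser.StartElementHandler = start
    parser.Parse(xml_bytes, True)
    return names


print(element_names(b'<x:a xmlns:x="urn:x"><b/></x:a>'))
```

Elements without a namespace contain no separator, which is why the second branch returns `None` as the namespace name.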
class SGMLOP(Parser):
    """
    A parser based on :mod:`sgmlop`.
    """

    def __init__(self, encoding=None, cdata=False):
        """
        Create a :class:`SGMLOP` parser. Arguments have the following meaning:

        :var:`encoding` : string or :const:`None`
            Forces the parser to use the specified encoding. The default
            :const:`None` results in the encoding being detected from the XML itself.

        :var:`cdata` : bool
            Should the parser output CDATA sections as ``"cdata"`` events? (If
            :var:`cdata` is false output ``"text"`` events instead.)
        """
        self.encoding = encoding
        self.cdata = cdata

    def __repr__(self):
        return "<{0.__class__.__module__}.{0.__class__.__name__} object encoding={0.encoding!r} at {1:#x}>".format(self, id(self))

    def __call__(self, input):
        """
        Return an iterator over the events produced by :var:`input`.
        """
        self._decoder = codecs.getincrementaldecoder("xml")(encoding=self.encoding)
        self._parser = sgmlop.XMLParser()
        self._parser.register(self)
        self._buffer = []
        self._hadtext = False

        try:
            for (evtype, data) in input:
                if evtype == u"bytes":
                    try:
                        self._parser.feed(self._decoder.decode(data, False))
                    except Exception, exc:
                        # In case of an exception we want to output the events we have gathered so far, before reraising the exception
                        for event in self._flush(True):
                            yield event
                        self._parser.close()
                        raise exc
                    else:
                        for event in self._flush(False):
                            yield event
                elif evtype == u"url":
                    yield (self.evurl, data)
                else:
                    raise UnknownEventError(self, (evtype, data))
            self._parser.close()
            for event in self._flush(True):
                yield event
        finally:
            del self._hadtext
            del self._buffer
            self._parser.register(None)
            del self._parser
            del self._decoder

    def _event(self, evtype, evdata):
        if self._buffer and evtype == self._buffer[-1][0] == self.evtext:
            self._buffer[-1] = (evtype, self._buffer[-1][1] + evdata)
        else:
            self._buffer.append((evtype, evdata))

    def _flush(self, force):
        # Flush ``self._buffer`` as far as possible
        if force or not self._buffer or self._buffer[-1][0] != self.evtext:
            for event in self._buffer:
                yield event
            del self._buffer[:]
        else:
            # hold back the last text event, because there might be more
            for event in self._buffer[:-1]:
                yield event
            del self._buffer[:-1]

    def handle_comment(self, data):
        self._event(self.evcomment, data)

    def handle_data(self, data):
        self._event(self.evtext, data)

    def handle_cdata(self, data):
        self._event(self.evcdata if self.cdata else self.evtext, data)

    def handle_proc(self, target, data):
        if target.lower() != u"xml":
            self._event(self.evprocinst, (target, data))

    def handle_entityref(self, name):
        self._event(self.eventity, name)

    def handle_enterstarttag(self, name):
        self._event(self.eventerstarttag, name)

    def handle_leavestarttag(self, name):
        self._event(self.evleavestarttag, name)

    def handle_enterattr(self, name):
        self._event(self.eventerattr, name)

    def handle_leaveattr(self, name):
        self._event(self.evleaveattr, name)

    def handle_endtag(self, name):
        self._event(self.evendtag, name)


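The `_event` and `_flush` methods above merge adjacent text events and hold back a trailing text event, since the next chunk of input might continue it. A minimal standalone sketch of that buffering idea follows. `EventBuffer` and its method names are hypothetical, event types are plain strings here, and the sketch returns lists instead of yielding events.

```python
class EventBuffer(object):
    """Collect events, merging adjacent text events (illustrative helper)."""

    def __init__(self):
        self._buffer = []

    def add(self, evtype, evdata):
        # Merge with the previous event if both are text events
        if self._buffer and evtype == self._buffer[-1][0] == "text":
            self._buffer[-1] = (evtype, self._buffer[-1][1] + evdata)
        else:
            self._buffer.append((evtype, evdata))

    def flush(self, force):
        # Hold back a trailing text event unless ``force`` is true,
        # because more text might follow in the next chunk
        if force or not self._buffer or self._buffer[-1][0] != "text":
            keep = 0
        else:
            keep = 1
        split = len(self._buffer) - keep
        events = self._buffer[:split]
        del self._buffer[:split]
        return events


buf = EventBuffer()
buf.add("enterstarttag", "a")
buf.add("text", "Py")
print(buf.flush(False))  # the text event stays in the buffer
buf.add("text", "thon")
buf.add("endtag", "a")
print(buf.flush(False))  # the merged text event is released now
```

Forcing the flush is only appropriate when no more input can arrive, i.e. at the end of the input or before an exception is reraised.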
class NS(object):
    """
    An :class:`NS` object is used in a parsing pipeline to add support for XML
    namespaces. It replaces the ``"enterstarttag"``, ``"leavestarttag"``,
    ``"endtag"``, ``"enterattr"`` and ``"leaveattr"`` events with the appropriate
    namespace version of the events (i.e. ``"enterstarttagns"`` etc.) where the
    event data is a ``(name, namespace)`` tuple.

    The output of an :class:`NS` object in the stream looks like this::

        >>> from ll.xist import parse
        >>> from ll.xist.ns import html
        >>> list(parse.events(
        ...     parse.String("<a href='http://www.python.org/'>Python</a>"),
        ...     parse.Expat(),
        ...     parse.NS(html)
        ... ))
        [('url', URL('STRING')),
         ('position', (0, 0)),
         ('enterstarttagns', (u'a', 'http://www.w3.org/1999/xhtml')),
         ('enterattrns', (u'href', None)),
         ('text', u'http://www.python.org/'),
         ('leaveattrns', (u'href', None)),
         ('leavestarttagns', (u'a', 'http://www.w3.org/1999/xhtml')),
         ('position', (0, 39)),
         ('text', u'Python'),
         ('endtagns', (u'a', 'http://www.w3.org/1999/xhtml'))]
    """

    def __init__(self, prefixes=None, **kwargs):
        """
        Create an :class:`NS` object. :var:`prefixes` (if not :const:`None`) can
        be a namespace name (or module), which will be used for the empty prefix,
        or a dictionary that maps prefixes to namespace names (or modules).
        :var:`kwargs` maps prefixes to namespace names too. If a prefix is in both
        :var:`prefixes` and :var:`kwargs`, :var:`kwargs` wins.
        """
        # the currently active prefix mapping (will be replaced once xmlns attributes are encountered)
        newprefixes = {}

        def make(prefix, xmlns):
            if prefix is not None and not isinstance(prefix, basestring):
                raise TypeError("prefix must be None or string, not {!r}".format(prefix))
            xmlns = xsc.nsname(xmlns)
            if not isinstance(xmlns, basestring):
                raise TypeError("xmlns must be string, not {!r}".format(xmlns))
            newprefixes[prefix] = xmlns

        if prefixes is not None:
            if isinstance(prefixes, dict):
                for (prefix, xmlns) in prefixes.iteritems():
                    make(prefix, xmlns)
            else:
                make(None, prefixes)

        for (prefix, xmlns) in kwargs.iteritems():
            make(prefix, xmlns)
        self._newprefixes = self._attrs = self._attr = None
        # A stack entry is an ``((elementname, namespacename), prefixdict)`` tuple
        self._prefixstack = [(None, newprefixes)]

    def __call__(self, input):
        for (evtype, data) in input:
            try:
                handler = getattr(self, evtype)
            except AttributeError:
                raise UnknownEventError(self, (evtype, data))
            for event in handler(data):
                yield event

    def url(self, data):
        yield (u"url", data)

    def xmldecl(self, data):
        data = (u"xmldecl", data)
        if self._attr is not None:
            self._attr.append(data)
        else:
            yield data

    def begindoctype(self, data):
        data = (u"begindoctype", data)
        if self._attr is not None:
            self._attr.append(data)
        else:
            yield data

    def enddoctype(self, data):
        data = (u"enddoctype", data)
        if self._attr is not None:
            self._attr.append(data)
        else:
            yield data

    def comment(self, data):
        data = (u"comment", data)
        if self._attr is not None:
            self._attr.append(data)
        else:
            yield data

    def text(self, data):
        data = (u"text", data)
        if self._attr is not None:
            self._attr.append(data)
        else:
            yield data

    def cdata(self, data):
        data = (u"cdata", data)
        if self._attr is not None:
            self._attr.append(data)
        else:
            yield data

    def procinst(self, data):
        data = (u"procinst", data)
        if self._attr is not None:
            self._attr.append(data)
        else:
            yield data

    def entity(self, data):
        data = (u"entity", data)
        if self._attr is not None:
            self._attr.append(data)
        else:
            yield data

    def position(self, data):
        data = (u"position", data)
        if self._attr is not None:
            self._attr.append(data)
        else:
            yield data

    def enterstarttag(self, data):
        self._newprefixes = {}
        self._attrs = {}
        self._attr = None
        # Never executed: the ``yield`` only turns this method into a generator that produces no events
        if 0:
            yield False

    def enterattr(self, data):
        if data == u"xmlns" or data.startswith(u"xmlns:"):
            # ``xmlns`` declares the default namespace (prefix ``None``), ``xmlns:foo`` declares the prefix ``foo``
            prefix = data[6:] or None
            self._newprefixes[prefix] = self._attr = []
        else:
            self._attrs[data] = self._attr = []
        # Never executed: makes this method a generator (see ``enterstarttag``)
        if 0:
            yield False

    def leaveattr(self, data):
        self._attr = None
        # Never executed: makes this method a generator (see ``enterstarttag``)
        if 0:
            yield False

    def leavestarttag(self, data):
        oldprefixes = self._prefixstack[-1][1]

        if self._newprefixes:
            prefixes = oldprefixes.copy()
            newprefixes = {key: u"".join(d for (t, d) in value if t == u"text") for (key, value) in self._newprefixes.iteritems()}
            prefixes.update(newprefixes)
        else:
            prefixes = oldprefixes

        (prefix, sep, name) = data.rpartition(u":")
        prefix = prefix or None

        try:
            data = (name, prefixes[prefix])
        except KeyError:
            raise xsc.IllegalPrefixError(prefix)

        self._prefixstack.append((data, prefixes))

        yield (u"enterstarttagns", data)
        for (attrname, attrvalue) in self._attrs.iteritems():
            if u":" in attrname:
                (attrprefix, attrname) = attrname.split(u":", 1)
                if attrprefix == u"xml":
                    xmlns = xsc.xml_xmlns
                else:
                    try:
                        xmlns = prefixes[attrprefix]
                    except KeyError:
                        raise xsc.IllegalPrefixError(attrprefix)
            else:
                xmlns = None
            yield (u"enterattrns", (attrname, xmlns))
            for event in attrvalue:
                yield event
            yield (u"leaveattrns", (attrname, xmlns))
        yield (u"leavestarttagns", data)
        self._newprefixes = self._attrs = self._attr = None

    def endtag(self, data):
        (data, prefixes) = self._prefixstack.pop()
        yield (u"endtagns", data)


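The core of :class:`NS` is a stack of prefix mappings: each start tag copies the current mapping, applies its own ``xmlns`` declarations and pushes the result, and each end tag pops it again. The following standalone sketch illustrates only that idea. `PrefixResolver` is a hypothetical class, attributes are passed as a plain dictionary of strings, and a `ValueError` stands in for `xsc.IllegalPrefixError`.

```python
class PrefixResolver(object):
    """Resolve qualified element names via a stack of prefix mappings."""

    def __init__(self, default=None):
        # The bottom stack entry is the initial prefix mapping
        self._stack = [{} if default is None else {None: default}]

    def start(self, qname, attrs):
        prefixes = self._stack[-1]
        new = {}
        for (key, value) in attrs.items():
            if key == "xmlns":
                new[None] = value  # default namespace
            elif key.startswith("xmlns:"):
                new[key[6:]] = value
        if new:
            # Copy, so that the mapping of the parent element stays untouched
            prefixes = prefixes.copy()
            prefixes.update(new)
        (prefix, sep, name) = qname.rpartition(":")
        try:
            result = (name, prefixes[prefix or None])
        except KeyError:
            raise ValueError("undeclared prefix {!r}".format(prefix))
        self._stack.append(prefixes)
        return result

    def end(self):
        self._stack.pop()


resolver = PrefixResolver("http://www.w3.org/1999/xhtml")
print(resolver.start("a", {}))
print(resolver.start("s:g", {"xmlns:s": "urn:s"}))
resolver.end()
resolver.end()
```

Because a new dictionary is only created when an element declares prefixes, elements without ``xmlns`` attributes share the mapping of their parent.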
class Node(object):
    """
    A :class:`Node` object is used in a parsing pipeline to instantiate XIST
    nodes. It consumes a namespaced event stream::

        >>> from ll.xist import xsc, parse
        >>> from ll.xist.ns import html
        >>> list(parse.events(
        ...     parse.String("<a href='http://www.python.org/'>Python</a>"),
        ...     parse.Expat(),
        ...     parse.NS(html),
        ...     parse.Node(pool=xsc.Pool(html))
        ... ))
        [(u'startelementnode',
          <ll.xist.ns.html.a element object (no children/1 attr) (from STRING:0:0) at 0x1026e6a10>),
         (u'textnode',
          <ll.xist.xsc.Text content=u'Python' (from STRING:0:39) at 0x102566b48>),
         (u'endelementnode',
          <ll.xist.ns.html.a element object (no children/1 attr) (from STRING:0:0) at 0x1026e6a10>)]

    The event data of all events are XIST nodes. The element node from the
    ``"startelementnode"`` event already has all attributes set. There will be
    no events for attributes.
    """
    def __init__(self, pool=None, base=None, loc=True):
        """
        Create a :class:`Node` object. Arguments have the following meaning:

        :var:`pool` : :class:`ll.xist.xsc.Pool` object or :const:`None`
            The pool used for instantiating the nodes. If :var:`pool` is
            :const:`None`, the pool of the current thread is used.

        :var:`base` : URL, string or :const:`None`
            The base URL. If :var:`base` is :const:`None`, the URL from the
            ``"url"`` event of the input is used instead.

        :var:`loc` : bool
            If :var:`loc` is true, location information is attached to the
            generated nodes.
        """
        self.pool = (pool if pool is not None else xsc.threadlocalpool.pool)
        if base is not None:
            base = url_.URL(base)
        self._base = base
        self._url = url_.URL()
        self.loc = loc
        self._position = (None, None)
        self._stack = []
        self._inattr = False
        self._indoctype = False

    @property
    def base(self):
        if self._base is None:
            return self._url
        else:
            return self._base

    def __call__(self, input):
        for (evtype, data) in input:
            try:
                handler = getattr(self, evtype)
            except AttributeError:
                raise UnknownEventError(self, (evtype, data))
            event = handler(data)
            if event:
                yield event

    def url(self, data):
        self._url = data

    def xmldecl(self, data):
        node = xml.XML(version=data[u"version"], encoding=data[u"encoding"], standalone=data[u"standalone"])
        if self.loc:
            node.startloc = xsc.Location(self._url, *self._position)
        return (u"xmldeclnode", node)

    def begindoctype(self, data):
        if data[u"publicid"]:
            fmt = u'{0[name]} PUBLIC "{0[publicid]}" "{0[systemid]}"'
        elif data[u"systemid"]:
            fmt = u'{0[name]} SYSTEM "{0[systemid]}"'
        else:
            fmt = u'{0[name]}'
        node = xsc.DocType(fmt.format(data))
        if self.loc:
            node.startloc = xsc.Location(self._url, *self._position)
        self.doctype = node
        self._indoctype = True

    def enddoctype(self, data):
        result = (u"doctypenode", self.doctype)
        del self.doctype
        self._indoctype = False
        return result

    def entity(self, data):
        node = self.pool.entity_xml(data)
        if self.loc:
            node.startloc = xsc.Location(self._url, *self._position)
        node.parsed(self, u"entity")
        if self._inattr:
            self._stack[-1].append(node)
        elif not self._indoctype:
            return (u"entitynode", node)

    def comment(self, data):
        node = xsc.Comment(data)
        if self.loc:
            node.startloc = xsc.Location(self._url, *self._position)
        node.parsed(self, u"comment")
        if self._inattr:
            self._stack[-1].append(node)
        elif not self._indoctype:
            return (u"commentnode", node)

    def cdata(self, data):
        node = xsc.Text(data)
        if self.loc:
            node.startloc = xsc.Location(self._url, *self._position)
        node.parsed(self, u"cdata")
        if self._inattr:
            self._stack[-1].append(node)
        elif not self._indoctype:
            return (u"textnode", node)

    def text(self, data):
        node = xsc.Text(data)
        if self.loc:
            node.startloc = xsc.Location(self._url, *self._position)
        node.parsed(self, u"text")
        if self._inattr:
            self._stack[-1].append(node)
        elif not self._indoctype:
            return (u"textnode", node)

    def enterstarttagns(self, data):
        node = self.pool.element_xml(*data)
        if self.loc:
            node.startloc = xsc.Location(self._url, *self._position)
        self._stack.append(node)
        node.parsed(self, u"starttagns")

    def enterattrns(self, data):
        if data[1] is not None:
            node = self.pool.attrclass_xml(*data)
        else:
            node = self._stack[-1].attrs.allowedattr_xml(data[0])
        if self.loc:
            node.startloc = xsc.Location(self._url, *self._position)
        self._stack[-1].attrs[node] = ()
        node = self._stack[-1].attrs[node]
        self._stack.append(node)
        self._inattr = True
        node.parsed(self, u"enterattrns")

    def leaveattrns(self, data):
        node = self._stack.pop()
        self._inattr = False
        node.parsed(self, u"leaveattrns")

    def leavestarttagns(self, data):
        node = self._stack[-1]
        node.parsed(self, u"leavestarttagns")
        return (u"startelementnode", node)

    def endtagns(self, data):
        node = self._stack.pop()
        if self.loc:
            node.endloc = xsc.Location(self._url, *self._position)
        node.parsed(self, u"endtagns")
        return (u"endelementnode", node)

    def procinst(self, data):
        node = self.pool.procinst_xml(*data)
        if self.loc:
            node.startloc = xsc.Location(self._url, *self._position)
        node.parsed(self, u"procinst")
        if self._inattr:
            self._stack[-1].append(node)
        elif not self._indoctype:
            return (u"procinstnode", node)

    def position(self, data):
        self._position = data


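Both :class:`NS` and :class:`Node` dispatch each incoming event to a method named after the event type, and treat a missing method as an unknown event. A standalone sketch of that dispatch pattern follows. `Upper` and `UnknownEvent` are hypothetical names, and the stage handles only two event types for illustration.

```python
class UnknownEvent(Exception):
    """Stand-in for the module's ``UnknownEventError``."""


class Upper(object):
    """Pipeline stage that uppercases text events (illustrative only)."""

    def __call__(self, input):
        for (evtype, data) in input:
            # Look up the handler method by the name of the event type
            try:
                handler = getattr(self, evtype)
            except AttributeError:
                raise UnknownEvent((evtype, data))
            event = handler(data)
            # A handler that returns ``None`` swallows the event
            if event:
                yield event

    def url(self, data):
        return None

    def text(self, data):
        return ("text", data.upper())


print(list(Upper()([("url", "STRING"), ("text", "Python")])))
```

Supporting a new event type in such a stage only requires adding a method with the matching name.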
class Tidy(object):
    """
    A :class:`Tidy` object parses (potentially ill-formed) HTML from a source
    into an (unnamespaced) event stream by using libxml2__'s HTML parser::

        >>> from ll.xist import parse
        >>> list(parse.events(parse.URL("http://www.yahoo.com/"), parse.Tidy()))
        [('url', URL('http://de.yahoo.com/?p=us')),
         ('position', (3, None)),
         ('enterstarttag', u'html'),
         ('enterattr', u'lang'),
         ('text', u'de-DE'),
         ('leaveattr', u'lang'),
         ('enterattr', u'class'),
         ('text', u'y-fp-bg y-fp-pg-grad  bkt708'),
         ('leaveattr', u'class'),
         ('leavestarttag', u'html')
        ...

    __ http://xmlsoft.org/
    """

    def __init__(self, encoding=None, skipbad=False, loc=True):
        """
        Create a new :class:`Tidy` object. Parameters have the following meaning:

        :var:`encoding` : string or :const:`None`
            The encoding of the input. If :var:`encoding` is :const:`None` it will
            be automatically detected by the HTML parser.

        :var:`skipbad` : bool
            If :var:`skipbad` is true, unknown elements (i.e. those not in the
            :mod:`ll.xist.ns.html` namespace) will be skipped (i.e. instead of
            the element its content will be output). Unknown attributes will be
            skipped completely.

        :var:`loc` : bool
            If :var:`loc` is true, ``"position"`` events will be generated,
            otherwise they will be skipped.
        """
        self.encoding = encoding
        self.skipbad = skipbad
        self.loc = loc

    def __repr__(self):
        return "<{0.__class__.__module__}.{0.__class__.__name__} object encoding={0.encoding!r} loc={0.loc!r} at {1:#x}>".format(self, id(self))

    def _handle_pos(self, node):
        if self.loc:
            lineno = node.lineNo()
            if lineno != self._lastlineno:
                result = (u"position", (lineno, None))
                self._lastlineno = lineno
                return result

    @staticmethod
    def decode(s):
        try:
            return s.decode("utf-8")
        except UnicodeDecodeError:
            return s.decode("iso-8859-1")

    def _asxist(self, node):
        decode = self.decode
        if node.type == "document_html":
            child = node.children
            while child is not None:
                for event in self._asxist(child):
                    yield event
                child = child.next
        elif node.type == "element":
            pos = self._handle_pos(node)
            if pos is not None:
                yield pos
            elementname = decode(node.name).lower()
            if self.skipbad:
                el = getattr(html, elementname, None)
                elok = el is not None
            else:
                elok = True
            if elok:
                yield (u"enterstarttag", elementname)
                attr = node.properties
                while attr is not None:
                    attrname = decode(attr.name).lower()
                    if not self.skipbad or el.Attrs.isallowed_xml(attrname):
                        content = decode(attr.content) if attr.content is not None else u""
                        yield (u"enterattr", attrname)
                        yield (u"text", content)
                        yield (u"leaveattr", attrname)
                    attr = attr.next
                yield (u"leavestarttag", elementname)
            child = node.children
            while child is not None:
                for event in self._asxist(child):
                    yield event
                child = child.next
            if elok:
                yield (u"endtag", elementname)
        elif node.type == "text":
            pos = self._handle_pos(node)
            if pos is not None:
                yield pos
            yield (u"text", decode(node.content))
        elif node.type == "cdata":
            pos = self._handle_pos(node)
            if pos is not None:
                yield pos
            yield (u"cdata", decode(node.content))
        elif node.type == "comment":
            pos = self._handle_pos(node)
            if pos is not None:
                yield pos
            yield (u"comment", decode(node.content))
        # ignore all other types

    def __call__(self, input):
        import libxml2 # This requires libxml2 (see http://www.xmlsoft.org/)

        url = None
        collectdata = []
        for (evtype, data) in input:
            if evtype == u"url":
                if url is None:
                    url = data
                else:
                    raise ValueError("got multiple url events")
            elif evtype == u"bytes":
                collectdata.append(data)
            else:
                raise UnknownEventError(self, (evtype, data))
        data = "".join(collectdata)
        if url is not None:
            yield (u"url", url)
        if data:
            self._lastlineno = None
            # Switch on line numbers before entering the ``try`` block, so that ``olddefault`` is always defined in the ``finally`` clause
            olddefault = libxml2.lineNumbersDefault(1)
            try:
                doc = libxml2.htmlReadMemory(data, len(data), str(url), self.encoding, 0x160)
                try:
                    for event in self._asxist(doc):
                        yield event
                finally:
                    doc.freeDoc()
            finally:
                libxml2.lineNumbersDefault(olddefault)


###
### Consumers: Functions that consume an event stream
###

def events(*pipeline):
    """
    Return an iterator over the events produced by the pipeline objects in
    :var:`pipeline`.
    """
    source = pipeline[0]

    # Promote the first pipeline object to a source object (if unambiguous, else use it as it is)
    if isinstance(source, basestring):
        source = String(source)
    elif isinstance(source, url_.URL):
        source = URL(source)

    # Execute the pipeline, promoting pipeline objects in the process
    output = iter(source)
    for pipe in pipeline[1:]:
        if isinstance(pipe, xsc.Pool):
            pipe = Node(pool=pipe)
        output = pipe(output)
    return output


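The pipeline executed by :func:`events` is a chain of callables, each taking an iterator of ``(event type, data)`` tuples and returning a new one. The standalone sketch below shows the same chaining with plain generator functions. `source`, `words`, `upper` and `run` are hypothetical stand-ins and do not use the module's classes.

```python
def source(text):
    # A source is an iterable producing "url" and "bytes" events
    yield ("url", "STRING")
    yield ("bytes", text)


def words(input):
    # Toy "parser": turns each whitespace separated word into a text event
    for (evtype, data) in input:
        if evtype == "bytes":
            for word in data.split():
                yield ("text", word)
        else:
            yield (evtype, data)


def upper(input):
    # Toy transformer: uppercases text events, passes everything else through
    for (evtype, data) in input:
        if evtype == "text":
            data = data.upper()
        yield (evtype, data)


def run(*pipeline):
    # Feed the output of each pipeline object into the next one
    output = iter(pipeline[0])
    for pipe in pipeline[1:]:
        output = pipe(output)
    return output


print(list(run(source("hello world"), words, upper)))
```

Since every stage is a generator, no event is produced until the consumer at the end of the chain asks for it.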
def tree(*pipeline, **kwargs):
    """
    Return a tree of XIST nodes from the event stream :var:`pipeline`.

    :var:`pipeline` must output only events that contain XIST nodes, i.e. the
    event types ``"xmldeclnode"``, ``"doctypenode"``, ``"commentnode"``,
    ``"textnode"``, ``"startelementnode"``, ``"endelementnode"``,
    ``"procinstnode"`` and ``"entitynode"``.

    :var:`kwargs` supports one keyword argument: :var:`validate`.
    If :var:`validate` is true (the default), the tree is validated, i.e. it is
    checked whether the structure of the tree is valid (according to the
    :var:`model` attribute of each element node), whether all required
    attributes are specified and whether all attributes have allowed values.

    The node returned from :func:`tree` will always be a :class:`Frag` object.

    Example::

        >>> from ll.xist import xsc, parse
        >>> from ll.xist.ns import xml, html, chars
        >>> doc = parse.tree(
        ...     parse.URL("http://www.python.org/"),
        ...     parse.Expat(ns=True),
        ...     parse.Node(pool=xsc.Pool(xml, html, chars))
        ... )
        >>> doc[0]
        <ll.xist.ns.html.html element object (5 children/2 attrs) (from http://www.python.org/:3:0) at 0x1028eb3d0>
    """
    stack = [xsc.Frag()]
    validate = kwargs.get("validate", True)
    for (evtype, node) in events(*pipeline):
        if evtype == u"startelementnode":
            stack[-1].append(node)
            stack.append(node)
        elif evtype == u"endelementnode":
            if validate:
                node.checkvalid()
            stack.pop()
        else:
            stack[-1].append(node)
    return stack[0]


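:func:`tree` builds its result with a stack whose top entry is the element that is currently open. A standalone sketch of that technique with plain dictionaries instead of XIST nodes follows. `build` and the event names ``"start"`` and ``"end"`` are hypothetical simplifications of the node events listed in the docstring above.

```python
def build(events):
    # The root plays the role of the ``Frag`` object
    root = {"name": None, "children": []}
    stack = [root]
    for (evtype, data) in events:
        if evtype == "start":
            node = {"name": data, "children": []}
            stack[-1]["children"].append(node)  # attach to the open element
            stack.append(node)                  # the new element is now open
        elif evtype == "end":
            stack.pop()                         # close the current element
        else:
            stack[-1]["children"].append(data)  # leaf content
    return root


doc = build([
    ("start", "a"),
    ("text", "Python"),
    ("start", "b"),
    ("end", "b"),
    ("end", "a"),
])
print(doc["children"][0]["name"])
```

In the module itself the ``"endelementnode"`` branch is also the point where validation happens, because only then is the content of the element complete.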
def itertree(*pipeline, **kwargs):
    """
    Parse the event stream :var:`pipeline` iteratively.

    :func:`itertree` still builds a tree, but it returns an iterator of
    ``(event type, path)`` tuples that track changes to the tree as it is built.
    ``path`` is a list containing the path from the root ``Frag`` object to the
    node being worked on.

    Which events and paths are produced depends on the keyword arguments
    :var:`events` and :var:`filter`. :var:`events` specifies which events you
    want to see (possible event types are ``"xmldeclnode"``, ``"doctypenode"``,
    ``"commentnode"``, ``"textnode"``, ``"startelementnode"``,
    ``"endelementnode"``, ``"procinstnode"`` and ``"entitynode"``). The default
    is to only produce ``"endelementnode"`` events. (Note that for
    ``"startelementnode"`` events, the attributes of the element have been set,
    but the element is still empty). :var:`filter` specifies an XIST walk filter
    (see the :mod:`ll.xist.xfind` module for more info on walk filters) to filter
    which paths are output. The default is to output all paths.

    A third keyword argument, :var:`validate`, works as in :func:`tree`: if it
    is true (the default), each element is validated when its
    ``"endelementnode"`` event is encountered.

    Example::

        >>> from ll.xist import xsc, parse
        >>> from ll.xist.ns import xml, html, chars
        >>> for (evtype, path) in parse.itertree(
        ...     parse.URL("http://www.python.org/"),
        ...     parse.Expat(ns=True),
        ...     parse.Node(pool=xsc.Pool(xml, html, chars)),
        ...     filter=html.a/html.img
        ... ):
        ...     print path[-1].attrs.src, "-->", path[-2].attrs.href
        http://www.python.org/images/python-logo.gif --> http://www.python.org/
        http://www.python.org/images/trans.gif --> http://www.python.org/#left%2Dhand%2Dnavigation
        http://www.python.org/images/trans.gif --> http://www.python.org/#content%2Dbody
        http://www.python.org/images/donate.png --> http://www.python.org/psf/donations/
        http://www.python.org/images/worldmap.jpg --> http://wiki.python.org/moin/Languages
        http://www.python.org/images/success/tribon.jpg --> http://www.python.org/about/success/tribon/
    """
    events_ = kwargs.get("events", ("endelementnode",))
    validate = kwargs.get("validate", True)
    filter = xfind.makewalkfilter(kwargs.get("filter", None))

    path = [xsc.Frag()]
    for (evtype, node) in events(*pipeline):
        if evtype == u"startelementnode":
            path[-1].append(node)
            path.append(node)
            if evtype in events_ and filter.matchpath(path): # FIXME: This requires that the ``WalkFilter`` is in fact a ``Selector``
                yield (evtype, path)
        elif evtype == u"endelementnode":
            if validate:
                node.checkvalid()
            if evtype in events_ and filter.matchpath(path): # FIXME: This requires that the ``WalkFilter`` is in fact a ``Selector``
                yield (evtype, path)
            path.pop()
        else:
            path[-1].append(node)
            path.append(node)
            if evtype in events_ and filter.matchpath(path): # FIXME: This requires that the ``WalkFilter`` is in fact a ``Selector``
                yield (evtype, path)
            path.pop()
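
:func:`itertree` combines tree building with reporting: it keeps the path from the root to the current node and yields that path for the requested event types. A standalone sketch with plain dictionaries follows. `iterpaths`, its `wanted` parameter and the event names ``"start"`` and ``"end"`` are hypothetical, there is no walk filter, and (unlike the function above, which yields the very same list object each time) the sketch yields a copy of the path.

```python
def iterpaths(events, wanted=("end",)):
    # The root plays the role of the ``Frag`` object
    root = {"name": None, "children": []}
    path = [root]
    for (evtype, data) in events:
        if evtype == "start":
            node = {"name": data, "children": []}
            path[-1]["children"].append(node)
            path.append(node)
            if evtype in wanted:
                yield (evtype, list(path))  # element is still empty here
        elif evtype == "end":
            if evtype in wanted:
                yield (evtype, list(path))  # element is complete here
            path.pop()
        else:
            path[-1]["children"].append(data)
            if evtype in wanted:
                yield (evtype, list(path) + [data])


stream = [
    ("start", "a"),
    ("text", "x"),
    ("start", "b"),
    ("end", "b"),
    ("end", "a"),
]
for (evtype, path) in iterpaths(stream):
    print(evtype, "/".join(node["name"] for node in path[1:]))
```

Reporting on the end event by default means that the consumer always sees an element together with its complete content.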