Appendix: Parsing HTML and XML

Note: to be improved. For now, this appendix reproduces the page Memo_AnalyseXML_HTML.html

Downloadable as .DOCX: Memo_AnalyseXML_HTML.docx


Memo: Parsing XML and HTML data in OpenRefine

Author: Mathieu Saby

License: CC-BY-v4

 

History

V1 (24/10/2018): creation; translation still to be finished.

 

This memo describes OpenRefine version 3.1.

 

 

Table of contents

1. Parsing XML and HTML data

1.1. jsoup selectors (translation in progress)

Combinators

Pseudo selectors

Structural pseudo selectors

2. Using Jython

 


 

 

1. Parsing XML and HTML data

The select() function selects elements from the output of parseHtml(), using the syntax of the jsoup parser (https://jsoup.org). jsoup selectors are case insensitive and use the same syntax as CSS3 and jQuery selectors.

Full documentation: https://jsoup.org/apidocs/org/jsoup/select/Selector.html

Demo interface (for testing a jsoup expression): https://try.jsoup.org/

 

XML support is imperfect.

 

1.1. jsoup selectors (translation in progress)

A selector is a chain of simple selectors, separated by combinators. Selectors are case insensitive (including against elements, attributes, and attribute values).

The universal selector (*) is implicit when no element selector is supplied (i.e. *.header and .header are equivalent).

| Pattern | Matches | Example |
| --- | --- | --- |
| `*` | any element | `*` |
| `tag` | elements with the given tag name | `div` |
| `*\|E` | elements of type E in any namespace | `*\|name` finds `<fb:name>` elements |
| `ns\|E` | elements of type E in the namespace ns | `fb\|name` finds `<fb:name>` elements |
| `#id` | elements with an ID attribute of "id" | `div#wrap`, `#logo` |
| `.class` | elements with a class name of "class" | `div.left`, `.result` |
| `[attr]` | elements with an attribute named "attr" | `a[href]`, `[title]` |
| `[^attrPrefix]` | elements with an attribute name starting with "attrPrefix"; use it to find elements with HTML5 datasets | `[^data-]`, `div[^data-]` |
| `[attr=val]` | elements with an attribute named "attr" whose value equals "val" | `img[width=500]`, `a[rel=nofollow]` |
| `[attr="val"]` | elements with an attribute named "attr" whose value equals "val" (quoted form) | `span[hello="Cleveland"][goodbye="Columbus"]`, `a[rel="nofollow"]` |
| `[attr^=valPrefix]` | elements with an attribute named "attr" whose value starts with "valPrefix" | `a[href^=http:]` |
| `[attr$=valSuffix]` | elements with an attribute named "attr" whose value ends with "valSuffix" | `img[src$=.png]` |
| `[attr*=valContaining]` | elements with an attribute named "attr" whose value contains "valContaining" | `a[href*=/search/]` |
| `[attr~=regex]` | elements with an attribute named "attr" whose value matches the regular expression | `img[src~=(?i)\\.(png\|jpe?g)]` |

These patterns may be combined in any order, for example `div.header[title]`.
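As a rough illustration of what the attribute-value operators in the table test, the following Python sketch applies the same comparisons to a plain string. It is not jsoup code: the function name `attr_value_matches` is hypothetical, the lower-casing reflects the case insensitivity stated above, and treating `~=` as an unanchored regular-expression search is an assumption about jsoup's behaviour.

```python
import re

def attr_value_matches(value, operator, operand):
    """Mimic the attribute-value operators on a plain string (illustration only)."""
    if operator == "~=":
        # Regex test: case sensitivity is decided by the pattern itself, e.g. (?i)
        return re.search(operand, value) is not None
    # The other operators are compared case-insensitively
    v, o = value.lower(), operand.lower()
    if operator == "=":
        return v == o
    if operator == "^=":
        return v.startswith(o)
    if operator == "$=":
        return v.endswith(o)
    if operator == "*=":
        return o in v
    raise ValueError("unknown operator: " + operator)

print(attr_value_matches("http://example.org/search/?q=x", "^=", "http:"))
print(attr_value_matches("photo.PNG", "$=", ".png"))
print(attr_value_matches("photo.JPEG", "~=", r"(?i)\.(png|jpe?g)"))
```

All three calls print True. In a Python raw string a single backslash is enough to escape the dot in the regular expression.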

 

Combinators

| Pattern | Matches | Example |
| --- | --- | --- |
| `E F` | an F element that is a descendant of an E element | `div a`, `.logo h1` |
| `E > F` | an F element that is a direct child of an E element | `ol > li` |
| `E + F` | an F element immediately preceded by a sibling E element | `li + li`, `div.head + div` |
| `E ~ F` | an F element preceded by a sibling E element | `h1 ~ p` |
| `E, F, G` | an E element, an F element, or a G element | `a[href], div, h3` |
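The difference between the descendant and child combinators can be tried out with the xml.etree library mentioned in the Jython section. The sketch below is only an analogy: ElementTree uses its own XPath-like path syntax, not jsoup selectors, and the sample markup is invented for the illustration.

```python
from xml.etree import ElementTree

# Invented sample document
doc = ("<body><div><p><a>x</a></p><a>y</a></div>"
       "<ol><li>1</li><li>2<ul><li>n</li></ul></li></ol></body>")
root = ElementTree.fromstring(doc)

# Analogue of "div a" (descendant): every <a> anywhere below a <div>
descendants = [a.text for a in root.findall(".//div//a")]

# Analogue of "ol > li" (child): only the <li> elements directly under an <ol>
children = [li.text for li in root.findall(".//ol/li")]

# Analogue of "ol li" (descendant): also reaches the nested <li>
all_items = [li.text for li in root.findall(".//ol//li")]

print(descendants, children, all_items)
```

The nested list item "n" appears only in the descendant query, which is the behaviour the `E F` and `E > F` rows describe.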

Pseudo selectors

| Pattern | Matches | Example |
| --- | --- | --- |
| `:lt(n)` | elements whose sibling index is less than n | `td:lt(3)` finds the first 3 cells of each row |
| `:gt(n)` | elements whose sibling index is greater than n | `td:gt(1)` finds cells after skipping the first two |
| `:eq(n)` | elements whose sibling index is equal to n | `td:eq(0)` finds the first cell of each row |
| `:has(selector)` | elements that contain at least one element matching the selector | `div:has(p)` finds divs that contain p elements |
| `:not(selector)` | elements that do not match the selector. See also Elements.not(String) | `div:not(.logo)` finds all divs that do not have the "logo" class; `div:not(:has(div))` finds divs that do not contain divs |
| `:contains(text)` | elements that contain the specified text. The search is case insensitive. The text may appear in the found element, or any of its descendants | `p:contains(jsoup)` finds p elements containing the text "jsoup" |
| `:matches(regex)` | elements whose text matches the specified regular expression. The text may appear in the found element, or any of its descendants | `td:matches(\\d+)` finds table cells containing digits; `div:matches((?i)login)` finds divs containing the text, case insensitively |
| `:containsOwn(text)` | elements that directly contain the specified text. The search is case insensitive. The text must appear in the found element, not any of its descendants | `p:containsOwn(jsoup)` finds p elements with own text "jsoup" |
| `:matchesOwn(regex)` | elements whose own text matches the specified regular expression. The text must appear in the found element, not any of its descendants | `td:matchesOwn(\\d+)` finds table cells directly containing digits; `div:matchesOwn((?i)login)` finds divs containing the text, case insensitively |
| `:containsData(data)` | elements that contain the specified data. The contents of script and style elements, and comment nodes (etc.) are considered data nodes, not text nodes. The search is case insensitive. The data may appear in the found element, or any of its descendants | `script:containsData(jsoup)` finds script elements containing the data "jsoup" |
| `:matchText` | treats text nodes as elements, and so allows you to match against and select text nodes. Note that using this selector will modify the DOM, so you may want to clone your document before using it | `p:matchText:firstChild` with input `<p>One<br />Two</p>` will return one PseudoTextElement with text "One" |

These pseudo selectors may be combined in any order and with other selectors, for example `.light:contains(name):eq(0)`.
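The distinction between `:contains` (text of the element or of any descendant) and `:containsOwn` (text held directly by the element) can be illustrated with a small Python sketch based on xml.etree. This is not how jsoup is implemented; the helper names `own_text` and `full_text` are hypothetical and the sample paragraph is invented.

```python
from xml.etree import ElementTree

def own_text(element):
    """Text held directly by the element, excluding text inside child elements."""
    parts = [element.text or ""]
    # Text that follows a child element (its "tail") still belongs to the parent
    parts.extend(child.tail or "" for child in element)
    return "".join(parts)

def full_text(element):
    """Text of the element and of all its descendants."""
    return "".join(element.itertext())

p = ElementTree.fromstring("<p>Parsed with <b>jsoup</b> today</p>")

# Like :contains(jsoup): looks at the full text, case-insensitively
print("jsoup" in full_text(p).lower())   # True
# Like :containsOwn(jsoup): looks only at the element's own text
print("jsoup" in own_text(p).lower())    # False
```

Here the word "jsoup" sits inside the `<b>` child, so the paragraph would match `p:contains(jsoup)` but not `p:containsOwn(jsoup)`.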

Structural pseudo selectors

| Pattern | Matches | Example |
| --- | --- | --- |
| `:root` | the element that is the root of the document. In HTML, this is the html element | `:root` |
| `:nth-child(an+b)` | elements that have an+b-1 siblings before them in the document tree, for any positive integer or zero value of n, and that have a parent element. For values of a and b greater than zero, this effectively divides the element's children into groups of a elements (the last group taking the remainder) and selects the bth element of each group. For example, this allows selectors to address every other row in a table, and could be used to alternate the color of paragraph text in a cycle of four. The a and b values must be integers (positive, negative, or zero). The index of the first child of an element is 1. In addition, `:nth-child()` can take `odd` and `even` as arguments: `odd` has the same meaning as 2n+1, and `even` the same meaning as 2n | `tr:nth-child(2n+1)` finds every odd row of a table; `:nth-child(10n-1)` finds the 9th, 19th, 29th, etc. element; `li:nth-child(5)` finds the 5th li |
| `:nth-last-child(an+b)` | elements that have an+b-1 siblings after them in the document tree. Otherwise like `:nth-child()` | `tr:nth-last-child(-n+2)` finds the last two rows of a table |
| `:nth-of-type(an+b)` | elements that have an+b-1 siblings with the same expanded element name before them in the document tree, for any zero or positive integer value of n, and that have a parent element | `img:nth-of-type(2n+1)` |
| `:nth-last-of-type(an+b)` | elements that have an+b-1 siblings with the same expanded element name after them in the document tree, for any zero or positive integer value of n, and that have a parent element | `img:nth-last-of-type(2n+1)` |
| `:first-child` | elements that are the first child of some other element | `div > p:first-child` |
| `:last-child` | elements that are the last child of some other element | `ol > li:last-child` |
| `:first-of-type` | elements that are the first sibling of their type in the list of children of their parent element | `dl dt:first-of-type` |
| `:last-of-type` | elements that are the last sibling of their type in the list of children of their parent element | `tr > td:last-of-type` |
| `:only-child` | elements that have a parent element and whose parent element has no other element children | |
| `:only-of-type` | elements that have a parent element and whose parent element has no other element children with the same expanded element name | |
| `:empty` | elements that have no children at all | |
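The an+b arithmetic can be checked by hand with a short Python sketch. An element with an+b-1 siblings before it sits at 1-based position p = an+b, so the sketch lists the positions between 1 and the number of siblings that can be written in that form for some n >= 0. The function name `nth_positions` is hypothetical and the code only reproduces the arithmetic; it does not call jsoup.

```python
def nth_positions(a, b, sibling_count):
    """Return the 1-based positions p (1..sibling_count) with p = a*n + b for some n >= 0."""
    positions = []
    for p in range(1, sibling_count + 1):
        if a == 0:
            matches = (p == b)
        else:
            # p matches when (p - b) is a non-negative multiple of a
            matches = (p - b) % a == 0 and (p - b) // a >= 0
        if matches:
            positions.append(p)
    return positions

print(nth_positions(2, 1, 6))     # 2n+1  -> [1, 3, 5]
print(nth_positions(10, -1, 30))  # 10n-1 -> [9, 19, 29]
print(nth_positions(0, 5, 6))     # 5     -> [5]
print(nth_positions(-1, 2, 6))    # -n+2  -> [1, 2]
```

For `:nth-last-child()` the same positions are counted from the end of the sibling list, which is why `-n+2` selects the last two rows in the example above.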

 

2. Using Jython

It is also possible to write code in Jython using the xml.etree library.

Example: retrieve the title and author of books written by British authors

from xml.etree import ElementTree
# Parse the cell value (encoded as UTF-8) into an element tree
element = ElementTree.fromstring(value.encode("utf-8"))
# Select the parent (<book>) of every <author> whose country attribute is "GB"
listeResultats = element.findall(".//author[@country='GB']/..")
for resultat in listeResultats:
    # Return "title / author" for the first matching book
    return (resultat.find('title').text + " / " + resultat.find('author').text)

 

Result:

| value | from xml.etree import ElementT ... |
| --- | --- |
| `<firstlist><book><title>Le Seigneur des anneaux</title><author country="GB">Tolkien</author></book><book><title>La peste</title><author country="FR">Camus</author></book></firstlist>` | Le Seigneur des anneaux / Tolkien |
| `<secondlist><book><title>Hamlet</title><author country="GB">Shakespeare</author></book><book><title>Portnoy</title><author country="US">Roth</author></book></secondlist>` | Hamlet / Shakespeare |
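The expression above returns only the first matching book of each cell. As a minimal sketch of one possible variant, the following plain-Python version collects every match and can be run outside OpenRefine to test the path expression. The function name `british_books`, the sample data and the " ; " separator are assumptions made for the illustration; inside OpenRefine the function body would be adapted to read `value` and end with a `return`.

```python
from xml.etree import ElementTree

def british_books(xml_text):
    """Return "title / author" for every <book> whose <author> has country="GB"."""
    root = ElementTree.fromstring(xml_text.encode("utf-8"))
    # Select the parent (<book>) of each matching <author>
    books = root.findall(".//author[@country='GB']/..")
    return [book.find("title").text + " / " + book.find("author").text
            for book in books]

# Invented sample with two British authors
sample = ('<list><book><title>Hamlet</title><author country="GB">Shakespeare</author></book>'
          '<book><title>Emma</title><author country="GB">Austen</author></book>'
          '<book><title>La peste</title><author country="FR">Camus</author></book></list>')

# Join the matches into one string so that the result fits in a single cell
print(" ; ".join(british_books(sample)))
```

With this sample the script prints "Hamlet / Shakespeare ; Emma / Austen".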