.. meta:: 
   :description: data engineering  
   :keywords:  reproducible, maintainable, modular data science code

The project's data orientation
==============================

What is it for?
---------------

The question is a fair one: what is the point of moving to data science code?
Why go to the trouble of producing maintainable, modular code
so that an interpretation of the data can be reproduced?

.. admonition:: Questions raised by a project's data orientation

    - What is a data orientation?
    - What is the benefit of producing "data science code"?
    - Would it not be better to go back to `database -> html renderer` code?


The code of a web app
---------------------

To start from an example, here is an excerpt from the prototype's code:

.. code-block:: python

        source_doc = etree.parse(
            os.path.join(APPPATH, "static", "xml", house, acte_id + '.xml'))
        # remove namespace :
        query = "descendant-or-self::*[namespace-uri()!='']"
        for element in source_doc.xpath(query):
            #replace element name with its local name
            element.tag = etree.QName(element).localname
        etree.cleanup_namespaces(source_doc)

        xslt_doc = etree.parse(os.path.join(APPPATH, "static", "xsl", "actes_princiers.xsl"))
        xslt_transformer = etree.XSLT(xslt_doc)
        output_doc = xslt_transformer(source_doc)
        return render_template("acte.html", house=house, prince=prince,
            infos=q_acte, place=place[0], doc=doc[0][0], arch=inst[0],
            diplo=diplo_t[0].replace("_", " "), state=state[0],
            output_doc=output_doc, name_prince=prince_name[0],
            transcribers=transcribers)

This code is:

- hard to understand,
- hard to maintain for anyone other than
  the person who wrote it,
- tightly coupled to the layout of a database
  and of an XML document, so there is no unified view
  of the data sources,
- and so on.
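One way to loosen that coupling can be sketched as follows: the namespace clean-up is moved into a small function that depends only on its input, so it can be read and tested on its own. This is a sketch under assumptions that are not in the prototype. It uses the standard library's ``xml.etree.ElementTree`` rather than ``lxml``, it only renames element tags (not attributes), and the function name ``strip_namespaces``, the sample document and the ``http://example.org/ns`` namespace are all hypothetical.

.. code-block:: python

    import xml.etree.ElementTree as ET


    def strip_namespaces(root):
        """Replace every namespaced tag with its local name, in place."""
        for element in root.iter():
            # ElementTree stores namespaced tags as "{uri}localname".
            if isinstance(element.tag, str) and element.tag.startswith("{"):
                element.tag = element.tag.split("}", 1)[1]
        return root


    # Hypothetical sample document, not taken from the project.
    SAMPLE = '<acte xmlns="http://example.org/ns"><lieu>Paris</lieu></acte>'

    root = strip_namespaces(ET.fromstring(SAMPLE))
    local_names = [element.tag for element in root.iter()]
    print(local_names)

Because the function no longer knows about file paths, templates or the database, it could later be wrapped as one step of a pipeline.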


A data science pipeline?
------------------------

Or rather, a data science framework:

.. image:: img/KedroRunTimeline.png

- first, a catalogue of source data is loaded,
- then the processing runs in clearly separated steps, called a pipeline.
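The two steps above can be sketched in plain Python. This only illustrates the idea and is not Kedro's API: the ``catalog`` dictionary, the ``clean`` and ``count_words`` functions and the sample text are hypothetical.

.. code-block:: python

    # Hypothetical catalogue: every data source is declared in one place, by name.
    catalog = {
        "raw_text": "  charter   issued in Paris ",
    }


    def clean(raw_text):
        """Step 1: normalise whitespace."""
        return " ".join(raw_text.split())


    def count_words(clean_text):
        """Step 2: derive a value from the cleaned text."""
        return len(clean_text.split())


    # Each step names the catalogue entry it reads and the one it writes.
    steps = [
        (clean, "raw_text", "clean_text"),
        (count_words, "clean_text", "word_count"),
    ]

    for func, input_name, output_name in steps:
        catalog[output_name] = func(catalog[input_name])

    print(catalog["word_count"])

The functions never open a file or query a database themselves; they only receive and return values, which is what keeps each step easy to replace.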

.. glossary::

    pipeline

        A pipeline is an ordered process.
        Several actions run one after another or
        in parallel; these actions depend on one another
        and are each encapsulated in a node.
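How an execution order can follow from dependencies rather than from declaration order can be sketched with the standard library's ``graphlib`` module (Python 3.9 or later). This is an illustration only, not a description of how Kedro is implemented; the node names and the ``dependencies`` mapping are hypothetical.

.. code-block:: python

    from graphlib import TopologicalSorter

    # Hypothetical nodes: each key maps to the set of nodes it depends on.
    # They are deliberately declared out of order.
    dependencies = {
        "render_html": {"transform_xml"},
        "transform_xml": {"load_xml"},
        "load_xml": set(),
    }

    # static_order() yields every node after all of its dependencies.
    execution_order = list(TopologicalSorter(dependencies).static_order())
    print(execution_order)
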

`Here is the definition of a pipeline according to Kedro <https://docs.kedro.org/en/stable/get_started/kedro_concepts.html#pipeline>`_:

    A pipeline organises the dependencies and execution order of a collection of nodes and connects inputs and outputs while keeping your code modular. The pipeline determines the node execution order by resolving dependencies and does not necessarily run the nodes in the order in which they are passed in.

The Kedro documentation then assembles two nodes (defined in the node example further down) into a pipeline::

    from kedro.pipeline import pipeline

    # Assemble nodes into a pipeline
    greeting_pipeline = pipeline([return_greeting_node, join_statements_node])

    
.. glossary::

    node

        A node encapsulates (wraps) an action.
        That action is a Python function (a processing step).
        
`Here is the definition of a node according to Kedro <https://docs.kedro.org/en/stable/get_started/kedro_concepts.html#node>`_:

    In Kedro, a node is a wrapper for a pure Python function that names the inputs and outputs of that function. Nodes are the building block of a pipeline, and the output of one node can be the input of another.

Here are two simple nodes as an example:

.. code-block:: python

    from kedro.pipeline import node

    # First node
    def return_greeting():
        return "Hello"


    return_greeting_node = node(func=return_greeting, inputs=None, outputs="my_salutation")

    # Second node
    def join_statements(greeting):
        return f"{greeting} Kedro!"


    join_statements_node = node(
        join_statements, inputs="my_salutation", outputs="my_message"