Annotating Packages¶
One big strength of the Python language is its large and active development community. This means that we have a lot of interesting and useful Python packages that are not part of the core Python distribution. These third-party packages often come with their own new APIs, and LineaPy is not guaranteed to handle all of them properly when removing extraneous code. For such "unrecognized" APIs, we can add instructions for LineaPy, which we call "package annotation". Specifically, package annotation is done by creating or updating a YAML file in lineapy/annotations/external/ (check the link for existing annotations).
Warning
Note that YAML files are sensitive to indentation. To ensure that you use correct indentation, we recommend you consult/adapt existing annotation files.
Annotating APIs that modify external states¶
A common group of APIs that LineaPy needs annotation for are those that modify external states such as the local file system and external databases.
For instance, without relevant package annotation, running
import pandas as pd
import lineapy
df = pd.read_csv("https://raw.githubusercontent.com/LineaLabs/lineapy/main/examples/tutorials/data/iris.csv")
df["variety_color"] = df["variety"].map({"Setosa": 0, "Versicolor": 1, "Virginica": 2})
df.to_csv("df_augmented.csv", index=False)
artifact = lineapy.save(lineapy.file_system, "df_augmented")
print(artifact.get_code())
would display nothing. This is because lineapy does not recognize
to_csv()
as an API that modifies external states (local file system, in this case) and hence skips it
during the relevant code cleanup.
In contrast, we would get the correct code cleanup with the following package annotation (currently in the source code):
- module: pandas.core.generic
  annotations:
    - criteria:
        class_instance: NDFrame
        class_method_names:
          - to_csv
      side_effects:
        - mutated_value:
            external_state: file_system
where
- 
modulerefers to the third-party package/module that the annotation is associated with.
- 
criteriaspecifies what each annotation is for. In this example, it instructs that the annotation is forto_csv()method ofNDFrameclass.
- 
side_effectsspecifies what each annotation is about. In this example, it instructs that the annotation is about mutation of an external state, specifically the local file system.
In sum, these instructions tell lineapy to recognize to_csv() method of NDFrame class (defined under pandas.core.generic module)
as an API that mutates the local file system, and to treat it as such in relevant downstream tasks such as code cleanup.
Note
NDFrame is a parent class of DataFrame, where the latter inherits to_csv method from the former.
Hence, the annotation above is done at a more fundamental level, which is recommended to avoid redundant annotations
among related classes (i.e., other data classes that inherit from NDFrame). In practice, identifying the right
"root" level to annotate at often involves exploring the target package's codebase for some time.
Info
Hyphen at the beginning of each keyword indicates that the indented block is an item of a list. For instance, in the example above,
- criteria means that the indented block following it, i.e.,
criteria:
  class_instance: NDFrame
  class_method_names:
    - to_csv
side_effects:
  - mutated_value:
      external_state: file_system
is an item of annotations list. Hence, we can add another annotation for to_json like so:
- module: pandas.core.generic
  annotations:
    - criteria:
        class_instance: NDFrame
        class_method_names:
          - to_csv
      side_effects:
        - mutated_value:
            external_state: file_system
    - criteria:
        class_instance: NDFrame
        class_method_names:
          - to_json
      side_effects:
        - mutated_value:
            external_state: file_system
In fact, since the two items share much in common (the only difference is in class_method_names),
we can condense the above annotation to the following:
- module: pandas.core.generic
  annotations:
    - criteria:
        class_instance: NDFrame
        class_method_names:
          - to_csv
          - to_json
      side_effects:
        - mutated_value:
            external_state: file_system
where we simply extend class_method_names list instead.
Annotating APIs that modify the caller itself¶
Another group of APIs that need annotation are those that modify their caller.
For instance, without relevant package annotation, running
from sklearn.linear_model import LinearRegression
import pandas as pd
import lineapy
df = pd.read_csv("https://raw.githubusercontent.com/LineaLabs/lineapy/main/examples/tutorials/data/iris.csv")
mod = LinearRegression()
mod.fit(X=df[["petal.width"]], y=df["sepal.width"])
artifact = lineapy.save(mod, "fitted_model")
print(artifact.get_code())
would display
from sklearn.linear_model import LinearRegression
mod = LinearRegression()
where the critical step of model fitting is missing. This is because lineapy does not recognize
fit()
as an API that modifies the caller (mod of LinearRegression type, in this case) and hence skips it
during the relevant code cleanup.
In contrast, we would get the correct code cleanup with the following package annotation (currently in the source code):
- module: sklearn.base
  annotations:
    - criteria:
        class_instance: BaseEstimator
        class_method_name: fit
      side_effects:
        - mutated_value:
            self_ref: SELF_REF
        - views:
            - self_ref: SELF_REF
            - result: RESULT
where
- 
BaseEstimatoris a parent class ofLinearModel, which is in turn a parent class ofLinearRegression.
- 
viewsspecifies entities whose modifications should be linked to one another. In this example,resultandself_refare set to be views of each other, which means that modification of thefit()call's output (e.g.,fitted_modinfitted_mod = mod.fit(...)) will modify thefit()caller itself (e.g.,modinfitted_mod = mod.fit(...)), and vice versa. Differently put,viewsestablishes links between objects that need to be tracked together because a change in one will change the other(s).
In sum, instructions above tell lineapy to recognize fit() method of BaseEstimator class (defined under sklearn.base module)
as an API that mutates the function caller itself, and to treat it as such in relevant downstream tasks such as code cleanup.