Refactoring Code ================ Data science development is characterized by nimble, iterative experimentation and exploratory data analysis. Data scientists explore many possibilities before reaching the final result. The rapid exploration process often leads to long, messy code, the majority of which has no impact on the final result. Code refactoring is hence an essential step for moving data science work into production. Yet, it often becomes a major bottleneck as it involves manual inspection of long, messy code that sometimes does not make sense even to its own author. With the complete development history stored in artifacts, LineaPy enables automatic code clean-up, facilitating code refactoring and hence transition to production. For instance, say we are a botanical data scientist modeling the relationship between different parts of the iris flower, and we end up with the following development code: .. code:: python import os import lineapy import pandas as pd import matplotlib.pyplot as plt from sklearn.linear_model import LinearRegression # Load data df = pd.read_csv("https://raw.githubusercontent.com/LineaLabs/lineapy/main/examples/tutorials/data/iris.csv") # View data df # Plot petal vs. sepal width df.plot.scatter("petal.width", "sepal.width") plt.show() # Calculate correlation coefficient df[["petal.width", "sepal.width"]].corr(method="pearson") # Initiate the model lm_1 = LinearRegression() # Fit the model lm_1.fit( X=df[["petal.width"]], y=df["sepal.width"], ) # Check result print(f"Coefficient for petal.width: {lm_1.coef_[0]}") # Check species and their counts df["variety"].value_counts() # Create dummy variables encoding species df["d_versicolor"] = df["variety"].apply(lambda x: 1 if x == "Versicolor" else 0) df["d_virginica"] = df["variety"].apply(lambda x: 1 if x == "Virginica" else 0) # View data df # Initiate the model lm_2 = LinearRegression() # Fit the model lm_2.fit( X=df[["petal.width", "d_versicolor", "d_virginica"]], y=df["sepal.width"], ) # Check result print(f"Coefficient for petal.width: {lm_2.coef_[0]}") print(f"Coefficient for d_versicolor: {lm_2.coef_[1]}") print(f"Coefficient for d_virginica: {lm_2.coef_[2]}") # Map each species to a color color_map = {"Setosa": "green", "Versicolor": "blue", "Virginica": "red"} df["variety_color"] = df["variety"].map(color_map) # Plot petal vs. sepal width by species df.plot.scatter("petal.width", "sepal.width", c="variety_color") plt.show() # Create interaction terms for species df["i_versicolor"] = df["petal.width"] * df["d_versicolor"] df["i_virginica"] = df["petal.width"] * df["d_virginica"] # Initiate the model lm_3 = LinearRegression() # Fit the model lm_3.fit( X=df[["petal.width", "d_versicolor", "d_virginica", "i_versicolor", "i_virginica"]], y=df["sepal.width"], ) # Check result print(f"Coefficient for petal.width: {lm_3.coef_[0]}") print(f"Coefficient for d_versicolor: {lm_3.coef_[1]}") print(f"Coefficient for d_virginica: {lm_3.coef_[2]}") print(f"Coefficient for i_versicolor: {lm_3.coef_[3]}") print(f"Coefficient for i_virginica: {lm_3.coef_[4]}") As shown, the code interweaves various plots, models, and print statements, reflecting dynamic nature of the development stage of data science work. Say we are interested in productionizing the second model (``lm_2``). Normally, this would involve manually sifting through the entire code to identify relevant parts only. We can cut through such manual labor with the help of LineaPy. First, we store the model as a LineaPy artifact: .. code:: python # Save desired model as an artifact lineapy.save(lm_2, "linear_model_v2") Then, we simply ask for its cleaned-up code, like so: .. code:: python # Retrieve the model artifact artifact = lineapy.get("linear_model_v2") # Get cleaned-up code print(artifact.get_code()) And we get: .. code:: none import pandas as pd from sklearn.linear_model import LinearRegression df = pd.read_csv( "https://raw.githubusercontent.com/LineaLabs/lineapy/main/examples/tutorials/data/iris.csv" ) df["d_versicolor"] = df["variety"].apply(lambda x: 1 if x == "Versicolor" else 0) df["d_virginica"] = df["variety"].apply(lambda x: 1 if x == "Virginica" else 0) lm_2 = LinearRegression() lm_2.fit( X=df[["petal.width", "d_versicolor", "d_virginica"]], y=df["sepal.width"], ) which is more concise and manageable than what we initially had --- a long, messy collection of various operations. Note that the cleaned-up code above is a subset of the original development code. That is, LineaPy "condensed" the original code by removing extraneous operations that do not affect the artifact we care about, i.e., ``lm_2``. In practice, development scripts/notebooks by data scientists are much longer and more complicated than this simple example. Hence, LineaPy's automatic code clean-up can save considerable time for data scientists to move their work into production. .. note:: This does not mean that we lost other parts of the development code. We can still access the artifact's full session code (including comments) with ``artifact.get_session_code()``. This should come in handy when trying to remember or understand the original development context of a given artifact. .. note:: In fact, ``lineapy.save()`` itself returns the artifact object, so we could have simply executed ``artifact = lineapy.save(lm_2, "linear_model_v2")`` above. .. note:: If you want hands-on practice, check out `this `_ tutorial notebook.