Make a test set of data for GitHub Actions to work on
This data needs to be committed to DVC
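A rough sketch of how the test subset could be cut, assuming the data is a CSV with a header row (the notes don't say what format it is). The function name `sample_rows` and the file paths in the usage note are made up for illustration.

```python
import csv
import random


def sample_rows(src_path, dst_path, n_rows, seed=0):
    """Copy the header plus up to n_rows randomly chosen rows from src to dst."""
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        rows = list(reader)

    # Fixed seed so the same subset comes out on every run.
    rng = random.Random(seed)
    picked = rows if len(rows) <= n_rows else rng.sample(rows, n_rows)

    with open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(picked)
    return len(picked)
```

Something like `sample_rows("data/full.csv", "data/sample.csv", 500)` (hypothetical paths), then `dvc add data/sample.csv`, commit the resulting `data/sample.csv.dvc` file to Git, and `dvc push` to get the subset into remote storage.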
What do we want to do with the GitHub Action?
Test that DVC/DagsHub works?
Seems like I need a subset of data to fit
Could also reduce what the runner does. Maybe all it needs to do is show that the app works and/or that a connection can be established with the large file storage of choice
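A minimal sketch of what a connection-only check might look like, assuming DVC is installed on the runner and the remote credentials are already configured there. The helper name `remote_is_reachable` is hypothetical; `dvc status --cloud` is the real DVC command that compares the local cache against remote storage, and it should exit non-zero if the remote can't be reached.

```python
import subprocess
import sys


def remote_is_reachable(run=subprocess.run):
    """Return True if `dvc status --cloud` exits with code 0.

    Assumption: a failed connection or rejected credentials makes the
    command exit non-zero. The `run` parameter exists only so the check
    can be exercised without a real DVC install.
    """
    result = run(["dvc", "status", "--cloud"], capture_output=True, text=True)
    if result.returncode != 0:
        # Surface DVC's own error message in the workflow log.
        print(result.stderr, file=sys.stderr)
    return result.returncode == 0
```

The workflow step would call `remote_is_reachable()` and exit non-zero when it returns False, which keeps the runner from pulling or processing any real data.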
What benefits does DVC actually provide?
It was intended to let large data files and datasets live alongside code and avoid the hassles of Git LFS
I use a combination of DVC and MLflow to manage my projects. DVC creates the pipelines, and at the modeling stage I register the model in MLflow.
DVC is very good for creating models and structuring a data pipeline, skipping stages that didn’t change. It is also a good tool for creating and reproducing experiments in a team setting. But since you have data coming in daily in a streaming-like scenario, I wouldn’t recommend it, since its main features rely on files and model training and not on retraining and model deployment.
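To make the "skipping stages that didn't change" idea concrete, here is a toy sketch of hash-based skipping. This is not DVC's implementation (DVC tracks dependency hashes in `dvc.lock`); the function names and the `stages.lock.json` file are invented for illustration.

```python
import hashlib
import json
from pathlib import Path


def file_hash(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def run_if_changed(name, deps, action, lock_path="stages.lock.json"):
    """Run action() only when a dependency's hash differs from the last run."""
    lock_file = Path(lock_path)
    lock = json.loads(lock_file.read_text()) if lock_file.exists() else {}
    current = {str(dep): file_hash(dep) for dep in deps}

    if lock.get(name) == current:
        return False  # dependencies unchanged, skip the stage

    action()
    # Record the hashes so the next call can detect that nothing changed.
    lock[name] = current
    lock_file.write_text(json.dumps(lock, indent=2))
    return True
```

The first call for a stage runs it and records the hashes; a second call with untouched inputs returns False without running anything. With daily streaming data the inputs change on every run, so this kind of skipping would rarely kick in, which lines up with the caveat above.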
TODO Roll API key for DagsHub
TODO refactor workflow to simplify what the runner is doing