
Conversation

@mihem mihem commented Feb 15, 2025

@b-rodrigues My initial plan was to improve my existing code to remove duplicates so that your post-processing function doesn't even have to do anything.
However, I then realized that this was 1) pretty complex and 2) quite inefficient, because I call get_remote to get the remote dependencies for each remote dependency. For our example with mlr3extralearners, this means a GitHub API call for each of the 6 remote dependencies.
That's why I decided to remove my previously added code.

So with the main branch this takes ~30 s on my machine:

library(rix)

my_token <- gitcreds::gitcreds_get()$password
Sys.setenv(GITHUB_PAT = my_token)

system.time(
  rix(
    date = "2023-12-30",
    r_pkgs = c(
      "tidyverse", "mlr3", "qs", "matrixStats", "prospectr", "Cubist",
      "checkmate", "mlr3misc", "paradox"
    ),
    system_pkgs = NULL,
    git_pkgs = list(
      package_name = "mlr3extralearners",
      repo_url = "https://github.com/mlr-org/mlr3extralearners/",
      commit = "6e2af9ef9ecd420d2be44e9aa2488772bb9f7080"
    ),
    ide = "none",
    project_path = ".",
    overwrite = TRUE,
    print = TRUE
  )
)
  user  system elapsed 
  2.182   0.913  30.446 

With the new branch the same code takes ~23 s:

   user  system elapsed 
  1.623   0.777  23.639 

There are also no duplicates.

This also works fine with this code (my initial try to remove duplicates): #385

However, this leads to a double entry of seurat-data: #388

So to be explicit:

rix(
  r_ver = "4.4.1",
  r_pkgs = NULL,
  system_pkgs = NULL,
  git_pkgs = list(
    package_name = "seurat-wrappers",
    repo_url = "https://github.com/satijalab/seurat-wrappers",
    commit = "5b63c23b211002611b0ccc33da009237d0dbafb6"
  ),
  ide = "none",
  project_path = ".",
  overwrite = TRUE,
  print = TRUE
)

has seurat-data twice.

let
 pkgs = import (fetchTarball "https://github.com/rstats-on-nix/nixpkgs/archive/2024-10-01.tar.gz") {};
  
    CIPR-Package = (pkgs.rPackages.buildRPackage {
      name = "CIPR-Package";
      src = pkgs.fetchgit {
        url = "https://github.com/atakanekiz/CIPR-Package";
        rev = "8729f6bda9f9a178f77844cab9467904fb9ecdae";
        sha256 = "sha256-oeVNSIig3+1dG6sjAxdvRVzDOfIcPYSp8psaQepDoG8=";
      };
      propagatedBuildInputs = builtins.attrValues {
        inherit (pkgs.rPackages) 
          ggpubr
          gtools
          tibble
          dplyr
          rlang
          magrittr;
      };
    });


    conos = (pkgs.rPackages.buildRPackage {
      name = "conos";
      src = pkgs.fetchgit {
        url = "https://github.com/hms-dbmi/conos";
        rev = "c3c1a941548e3bca1f4731f18b2abe16d126cdf3";
        sha256 = "sha256-7H/O+zDHBlQzU+hWYESin4yHitLqIVOrQOcvrTiJz/c=";
      };
      propagatedBuildInputs = builtins.attrValues {
        inherit (pkgs.rPackages) 
          Matrix
          igraph
          abind
          cowplot
          ComplexHeatmap
          dendextend
          dplyr
          ggplot2
          ggrepel
          gridExtra
          irlba
          leidenAlg
          magrittr
          Matrix_utils
          N2R
          R6
          reshape2
          rlang
          Rtsne
          sccore
          Rcpp
          RcppArmadillo
          RcppEigen
          RcppProgress;
      };
    });


    harmony = (pkgs.rPackages.buildRPackage {
      name = "harmony";
      src = pkgs.fetchgit {
        url = "https://github.com/immunogenomics/harmony";
        rev = "ea48a6c3c67f6c4c33841b3f545412d38ddcb363";
        sha256 = "sha256-BhGY+aG5vmNNw1LRei9S010cRbyoTcwwNnPRlAElnxw=";
      };
      propagatedBuildInputs = builtins.attrValues {
        inherit (pkgs.rPackages) 
          R
          Rcpp
          dplyr
          cowplot
          tidyr
          ggplot2
          irlba
          Matrix
          tibble
          rlang
          SingleCellExperiment
          RcppArmadillo
          RcppProgress;
      };
    });


    liger = (pkgs.rPackages.buildRPackage {
      name = "liger";
      src = pkgs.fetchgit {
        url = "https://github.com/welch-lab/liger";
        rev = "8ee61cafd28cce6aaf62fd68acdc4dabb857b59f";
        sha256 = "sha256-4O9SLFZ0nad0j8sR2dKtEBcpbLWkeFAGYMtwhZuzpmU=";
      };
      propagatedBuildInputs = builtins.attrValues {
        inherit (pkgs.rPackages) 
          cowplot
          Matrix
          patchwork
          Rcpp
          plyr
          FNN
          dplyr
          ggrepel
          irlba
          ica
          Rtsne
          ggplot2
          riverplot
          foreach
          doParallel
          mclust
          psych
          RANN
          uwot
          rlang
          hdf5r
          scattermore
          RcppArmadillo
          RcppEigen
          RcppProgress;
      };
    });


    monocle3 = (pkgs.rPackages.buildRPackage {
      name = "monocle3";
      src = pkgs.fetchgit {
        url = "https://github.com/cole-trapnell-lab/monocle3";
        rev = "87f6e88760566065179ae5039d524da7d032cc15";
        sha256 = "sha256-QzGXPDXIMOE5s+cqJQhSMgFx1QXdVoWNXeXfcu/Eblc=";
      };
      propagatedBuildInputs = builtins.attrValues {
        inherit (pkgs.rPackages) 
          Biobase
          SingleCellExperiment
          assertthat
          batchelor
          BiocGenerics
          BiocParallel
          DelayedArray
          DelayedMatrixStats
          digest
          dplyr
          future
          ggplot2
          ggrastr
          ggrepel
          grr
          HDF5Array
          igraph
          irlba
          leidenbase
          limma
          lme4
          lmtest
          MASS
          Matrix
          Matrix_utils
          openssl
          pbapply
          pbmcapply
          pheatmap
          plotly
          plyr
          proxy
          pscl
          purrr
          RANN
          RColorBrewer
          Rcpp
          reshape2
          rsample
          RhpcBLASctl
          RcppAnnoy
          RcppHNSW
          Rtsne
          S4Vectors
          shiny
          slam
          spdep
          speedglm
          stringr
          SummarizedExperiment
          uwot
          terra
          tibble
          tictoc
          tidyr
          viridis;
      };
    });


    Nebulosa = (pkgs.rPackages.buildRPackage {
      name = "Nebulosa";
      src = pkgs.fetchgit {
        url = "https://github.com/powellgenomicslab/Nebulosa";
        rev = "b834a75c5dc28c88d44a640cad5e1cb88b94e138";
        sha256 = "sha256-3XyK6dcMhi15bpmE0FT/L5ssI4xTK6l8WZdVaQwhrwM=";
      };
      propagatedBuildInputs = builtins.attrValues {
        inherit (pkgs.rPackages) 
          ggplot2
          patchwork
          SingleCellExperiment
          SummarizedExperiment
          SeuratObject
          ks
          Matrix;
      };
    });


    presto = (pkgs.rPackages.buildRPackage {
      name = "presto";
      src = pkgs.fetchgit {
        url = "https://github.com/immunogenomics/presto";
        rev = "052085db9c88aa70a28d11cc58ebc807999bf0ad";
        sha256 = "sha256-aEJQThmKdFprMSZRxle+gbvd0mS+077XpuKSTJ6qbrA=";
      };
      propagatedBuildInputs = builtins.attrValues {
        inherit (pkgs.rPackages) 
          Rcpp
          data_table
          dplyr
          tidyr
          reshape2
          Matrix
          rlang
          DESeq2
          RcppArmadillo;
      };
    });


    schex = (pkgs.rPackages.buildRPackage {
      name = "schex";
      src = pkgs.fetchgit {
        url = "https://github.com/SaskiaFreytag/schex";
        rev = "031320d";
        sha256 = "sha256-fd46ZxXEEk2XsEhQExrvpx+l5GPwGSrzBLTrJQIUVvg=";
      };
      propagatedBuildInputs = builtins.attrValues {
        inherit (pkgs.rPackages) 
          SingleCellExperiment
          Seurat
          ggplot2
          hexbin
          scater
          cluster
          dplyr;
      };
    });


    seurat-data = (pkgs.rPackages.buildRPackage {
      name = "seurat-data";
      src = pkgs.fetchgit {
        url = "https://github.com/satijalab/seurat-data";
        rev = "d6a8ce61ccb21a3b204f194d07009772c822791d";
        sha256 = "sha256-DMVbzWJdgNj1sPRLwFToa6O4I0bTc/ECgtg3Cg/o+jc=";
      };
      propagatedBuildInputs = builtins.attrValues {
        inherit (pkgs.rPackages) 
          cli
          crayon
          rappdirs
          Matrix
          Seurat
          SeuratObject;
      };
    });


    seurat-data = (pkgs.rPackages.buildRPackage {
      name = "seurat-data";
      src = pkgs.fetchgit {
        url = "https://github.com/satijalab/seurat-data";
        rev = "b198ed523b1fd773c09356d30d9dfcebd62f22cf";
        sha256 = "sha256-9fGp5dRzfPO7/vh8yuVq9YbDQLRkYEBL+Z6u2J2BFnw=";
      };
      propagatedBuildInputs = builtins.attrValues {
        inherit (pkgs.rPackages) 
          cli
          crayon
          rappdirs
          Matrix
          Seurat
          SeuratObject;
      };
    });

    seurat-disk = (pkgs.rPackages.buildRPackage {
      name = "seurat-disk";
      src = pkgs.fetchgit {
        url = "https://github.com/mojaveazure/seurat-disk";
        rev = "9b89970eac2a3bd770e744f63c7763419486b14c";
        sha256 = "sha256-NqhABs5KsqWiMckG/l3fXzra+DnE7K2HpdtVtyxxr2I=";
      };
      propagatedBuildInputs = builtins.attrValues {
        inherit (pkgs.rPackages) 
          cli
          crayon
          hdf5r
          Matrix
          R6
          rlang
          Seurat
          SeuratObject
          stringi
          withr;
      } ++ [ seurat-data ];
    });


    velocyto.R = (pkgs.rPackages.buildRPackage {
      name = "velocyto.R";
      src = pkgs.fetchgit {
        url = "https://github.com/velocyto-team/velocyto.R";
        rev = "83e6ed92c2d9c9640122dcebf8ebbb5788165a21";
        sha256 = "sha256-5iVHDHsN9xvD54nbva6SC0vVR4R3xXBi2ctM88eE7+I=";
      };
      propagatedBuildInputs = builtins.attrValues {
        inherit (pkgs.rPackages) 
          Matrix
          Rcpp
          MASS
          mgcv
          pcaMethods
          abind
          cluster
          igraph
          hdf5r
          RcppArmadillo;
      };
    });

    seurat-wrappers = (pkgs.rPackages.buildRPackage {
      name = "seurat-wrappers";
      src = pkgs.fetchgit {
        url = "https://github.com/satijalab/seurat-wrappers";
        rev = "5b63c23b211002611b0ccc33da009237d0dbafb6";
        sha256 = "sha256-M7KJrP70z81U9j7wva/HWipgmEwXKv4Us8IuP1SRPUU=";
      };
      propagatedBuildInputs = builtins.attrValues {
        inherit (pkgs.rPackages) 
          BiocManager
          cowplot
          ggplot2
          igraph
          Matrix
          remotes
          rsvd
          Seurat
          rlang
          R_utils
          data_table
          reticulate;
      } ++ [ liger conos harmony presto seurat-data velocyto.R schex monocle3 seurat-disk Nebulosa CIPR-Package ];
    });
    
  system_packages = builtins.attrValues {
    inherit (pkgs) 
      glibcLocales
      nix
      R;
  };
  
in

pkgs.mkShell {
  LOCALE_ARCHIVE = if pkgs.system == "x86_64-linux" then "${pkgs.glibcLocales}/lib/locale/locale-archive" else "";
  LANG = "en_US.UTF-8";
   LC_ALL = "en_US.UTF-8";
   LC_TIME = "en_US.UTF-8";
   LC_MONETARY = "en_US.UTF-8";
   LC_PAPER = "en_US.UTF-8";
   LC_MEASUREMENT = "en_US.UTF-8";

  buildInputs = [ seurat-wrappers   system_packages   ];
  
}

Do you know why? The only thing I noticed is that the ref of the two seurat-data entries is different. So maybe this should only check for the

And a second thing: if packages are removed, there are lots of blank lines (now the mlr3extralearners example again).
default.nix.log

mihem commented Feb 15, 2025

The fact that the dups test fails is only because of differences in blank lines. It would be nice if we could create a clean version without additional blank lines so that we don't have to worry about that.

@b-rodrigues

If there are cases where duplicates remain, I prefer to keep the code as is, to be honest.

mihem commented Feb 15, 2025

Haha, I hear you. This issue is quite challenging and can be annoying.
In general I think it would be better if we could remove code, make it more efficient, and not use two half-working approaches.
But sure, it's working now. I will try to see if I can fix those problems. If I don't manage, we just keep it as it is, agreed.

@b-rodrigues

Consecutive empty lines are now removed with 4b31d10.
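For readers curious what such a helper might look like, here is a minimal sketch (an assumption for illustration, not the actual rix implementation) that collapses runs of consecutive empty lines in a character vector of file lines:

```r
# Hypothetical sketch, not the actual rix code: drop an empty line
# whenever the line right before it was also empty.
remove_empty_lines <- function(lines) {
  empty <- !nzchar(trimws(lines))
  # a line is dropped if it is empty AND the previous line was empty too
  lines[!(empty & c(FALSE, head(empty, -1)))]
}

remove_empty_lines(c("a", "", "", "", "b", "", "c"))
# -> "a" ""  "b" ""  "c"
```

The vectorised shift `c(FALSE, head(empty, -1))` is what makes such a function short and fast, but also a little cryptic at first sight.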

mihem commented Feb 15, 2025

Nice.
Wow, the function remove_empty_lines is nice: so short, fast and powerful, but not easy to grasp at first sight (at least for my mind).

By the way, I am experimenting with a solution to keep track of the already fetched packages using environments (a Claude Sonnet 3.5 suggestion), but I'm not there yet.
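The idea with environments relies on their reference semantics: unlike regular R values, an environment is not copied when passed around, so it can act as a shared, mutable record of what was already fetched. A small sketch (all names, such as seen_cache, are made up for illustration):

```r
# Illustration only: an environment used as a mutable set of seen packages.
seen_cache <- new.env(parent = emptyenv())

mark_seen <- function(pkg, cache = seen_cache) {
  assign(pkg, TRUE, envir = cache)
}

already_seen <- function(pkg, cache = seen_cache) {
  exists(pkg, envir = cache, inherits = FALSE)
}

mark_seen("set6")
already_seen("set6")    # TRUE
already_seen("param6")  # FALSE
```

Because every function sees the same environment object, a package marked in one place is visible everywhere else, without writing to the global environment's variables directly.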

@b-rodrigues

> Wow, the function remove_empty_lines is nice: so short, fast and powerful, but not easy to grasp at first sight (at least for my mind).

We can thank ChatGPT for it 😂

mihem commented Feb 15, 2025

Yeah, we can also thank GitHub Copilot (Claude Sonnet 3.5) for this idea with environments.
I think that's exactly what I tried to achieve for a few weeks :)
I went through the example of mlr3extralearners with skip_post_processing = TRUE and there were no duplicates, and magically I could even build the nix-shell (so no version mismatch as before), although this was probably more luck?

Also, the seurat-wrappers example worked fine (no duplicates).

So if you have time, have a look. I haven't used this environment approach before; it's a little awkward, but also quite cool, because I didn't know how else to carry the already fetched packages through the functions.

But I don't think this is ready to merge; it needs some more testing.

@mihem mihem force-pushed the remove_duplicate_third branch from f667e38 to 252d6be Compare February 15, 2025 21:48
mihem commented Feb 15, 2025

Hm, but it changed the commit hash, and now it's not the closest anymore; I have to investigate further.

@b-rodrigues btw the ubuntu checks are still not showing when they fail. I now have another commit hash, so there are 2 tests failing in devtools_test (ubuntu-latest), but it says it ran successfully. I think it would be helpful if you could fix that.

mihem commented Feb 15, 2025

just further thinking out loud:

So mlr3extralearners with the given commit hash 6e2af9ef9ecd420d2be44e9aa2488772bb9f7080 has the commit date 2023-10-11. rix then fetches the commit hash a901255c26614a0ece317dc849621420f9393d42 for set6 (commit date 2023-09-01). The closest commit of param6 to 2023-10-11 is 0fa35771276fc05efe007a71bda466ced1e4c5eb, which is very old (2022-08-28). The problem is that set6 is also a dependency of param6, so it's fetched again (note: this happens even though we now have an environment that keeps track of the already fetched packages, because param6 could have further dependencies that have not been fetched yet, so in general it makes sense to check for that), and now the closest commit to the date 2022-08-28 is e65ffeea48d30d687482f6706d0cb43b16ba3919 of 2022-08-27.

Not sure there is a right decision here. We are probably again at the point where the user has to decide, right @b-rodrigues? But this is the reason why the tests fail and why, on the other hand, building the shell now works.

I think it would make more sense to skip fetching the package again (so I think we need another environment that keeps track of that), because this is faster, and then we should end up with all dates being close to the date of the given commit, although this might also require manual intervention. @b-rodrigues agree?

@b-rodrigues

I haven't had the chance to look at this in depth, but I generally don't like when functions mess with the global environment. Isn't it possible to keep track in a variable within the scope of the function?

mihem commented Feb 15, 2025

Yeah, I know that this is not super clean, but we can remove those variables from the global environment again.

I don't think it's that easy, though. That's because of this recursive call of fetchgits within fetchgit:

  } else { # if there are remote dependencies, start over

    remote_packages_expressions <- fetchgits(remotes)

So then we are in fetchgit with e.g. param6 as a dependency of mlr3extralearners, and all information about previously fetched packages is gone.

@b-rodrigues

Wouldn't adding a variable seen_pkgs to fetchgit with a default set to NULL do the trick? I don't want rix() to be writing to the global env.

mihem commented Feb 15, 2025

I don't think so, because when fetchgits is called it's a new environment then, and seen_pkgs is gone.
I'm not an expert concerning environments; are there other environments we could use?
The environment approach is exactly the solution I was looking for, and when I'm debugging these rix calls it's pretty cool, because it now really removes all duplicate packages and all tests have worked so far.
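To illustrate why a shared environment survives the recursion where a plain default argument seems to fail: as long as the same environment object is handed down through the recursive calls, writes made in deeper calls are visible everywhere, so each package is fetched only once. A toy sketch with made-up names (fetch_all, deps), not the actual rix internals:

```r
# Toy recursion over a dependency graph; `seen` is one shared environment,
# so a package recorded in a deeper call is skipped later on.
fetch_all <- function(pkgs, deps, seen = new.env(parent = emptyenv())) {
  for (pkg in pkgs) {
    if (exists(pkg, envir = seen, inherits = FALSE)) next
    assign(pkg, TRUE, envir = seen)
    fetch_all(deps[[pkg]], deps, seen)  # recurse into remote dependencies
  }
  ls(seen)
}

deps <- list(
  mlr3extralearners = c("param6", "set6"),
  param6            = "set6",
  set6              = character(0)
)

fetch_all("mlr3extralearners", deps)
# set6 appears only once, although both mlr3extralearners and param6 depend on it
```

The key detail is that `seen` is created once in the outermost call and then passed down explicitly, so the recursive calls never start over with a fresh, empty record.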

@b-rodrigues

Why not cache the list of seen packages in /tmp/?

@mihem mihem force-pushed the remove_duplicate_third branch from 080867a to fb1f1aa Compare February 16, 2025 12:07
mihem commented Feb 16, 2025

The git commits were quite a mess; now it's all rebased.

Okay, I'll try to find a solution that avoids the global environment. I tried several solutions, but none of them worked so far. I used S3 objects, but it's also tricky to really keep values across functions that way; it would at least mean that I have to carry this object through all the functions (from fetchgits to resolve_package_commit, which are quite a few). I also tried your solution, which seemed to work fine, but then I realized that it creates a new temporary file for each loop. Right now I'm experimenting with environments, though not global environments, because so far I have found those easiest to handle.
Quite a lot of work, but I'm also learning a lot :)

mihem commented Feb 16, 2025

Hehe, okay, in the end I used your solution and saved it to a temporary file (I avoid creating a new file each time by using a fixed file name, so not that complicated).
At least in my interactive tests it worked fine, with no duplicates in the two test rix calls (seurat-wrappers and mlr3extralearners).
But I need to do further testing.

The dedup test fails because of some formatting stuff and because dedup actually picks the "wrong" commit (meaning the one of date 2022-08-27, as explained above).
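A sketch of the fixed-file-name idea: because file.path(tempdir(), ...) resolves to the same path for the whole R session, repeated calls read and extend one cache instead of creating a fresh file per loop. Function and file names here are illustrative, not the actual rix internals:

```r
# Illustrative only: session-wide cache of fetched package names in one RDS file.
cache_file <- file.path(tempdir(), "fetched_pkgs.rds")

read_seen <- function() {
  if (file.exists(cache_file)) readRDS(cache_file) else character(0)
}

add_seen <- function(pkg) {
  saveRDS(union(read_seen(), pkg), cache_file)
}

add_seen("seurat-data")
add_seen("seurat-data")          # second call is a no-op thanks to union()
"seurat-data" %in% read_seen()   # TRUE
```

Using `union()` also means a repeated fetch of the same package never produces a duplicate entry in the cache.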

@mihem mihem changed the title Remove code to remove duplicates because of new post processing function Keeping track of fetched packages and commits to avoid duplicates and make fetching remotes faster Feb 16, 2025
mihem commented Feb 26, 2025

I've run devtools::check_win_devel() but got an error that pdflatex was missing; nonetheless you should have got a mail. I then installed tinytex (and tinytex within tinytex :P) and ran this again, but then got the following error:

Upload failed (at start/before it took off) [win-builder.r-project.org]: Failed FTP upload: 550

b-rodrigues commented Feb 27, 2025

I am pretty sure that this has nothing to do with this PR (removing duplicates by using caching), so I would prefer if we merge this PR first and then work on that one in a new PR. This one is already big enough, I think ;).

I'm not sure yet if it's not linked, but in any case I'll have to fix it first then.

@b-rodrigues

> I am pretty sure that this has nothing to do with this PR (removing duplicates by using caching), so I would prefer if we merge this PR first and then work on that one in a new PR. This one is already big enough, I think ;).
>
> I'm not sure yet if it's not linked, but in any case I'll have to fix it first then.

Ok, so I've added an action to run R CMD check on a runner without Nix, and indeed, there is something wrong with the response from the server. I thiiiink I know what's going on:

https://github.com/ropensci/rix/actions/runs/13562423321/job/37908068713

mihem commented Feb 27, 2025

Ah okay, and what's going on?
And are you sure that this is the problem I also encountered? Installing R and loading rix worked, but then this failed, and I could reproduce it in my local environment with options(rix.sri_hash = "api_server").
So I assumed that this is some problem in the nix_hash_online function. Probably a problem that occurred when you first developed this entire remote dependencies code but never became evident, because it was never explicitly tested?

@b-rodrigues

I realized testthat::expect_snapshot_file is skipped on CRAN by default, so these tests are not running through the rhub action either. So yeah, this might actually have been broken for a long time.

@b-rodrigues

I want to deal with this issue first: #430

I think that the bug was introduced with the changes from the last PR

@b-rodrigues

Could you resolve conflicts?

@b-rodrigues b-rodrigues mentioned this pull request Feb 28, 2025
mihem commented Feb 28, 2025

I will try to do this during work directly on GitHub; if that does not work, I will do it this evening.

mihem commented Feb 28, 2025

Oh man, this really sucks online. I resolved all conflicts on GitHub, but then "committing" did not work. I then tried to copy each file manually and commit manually, which is also not great. I'll have to do this this evening in VSCode.

@mihem mihem force-pushed the remove_duplicate_third branch from 580f422 to 870c699 Compare February 28, 2025 19:37
@b-rodrigues

ready to merge ?

mihem commented Feb 28, 2025

Give me 5 more minutes to check manually.

It was again quite confusing, because I didn't realize that resolving the conflicts removed the ..., so ignore_remotes_cache was not passed, and this produced some weird error.

mihem commented Feb 28, 2025

Yeah I think everything is good! Long journey! 👍

mihem commented Feb 28, 2025

Or let me first write a little about this caching in the docs.

@b-rodrigues

Yes, that'd be great, I'll wait then.

mihem commented Feb 28, 2025

Have a look and see if you are okay with the explanation.
I found it difficult to explain the issue, especially the "correct" dates.

Otherwise, it's finally ready to merge.

However, if you disable this feature, then `{rix}` will fetch package `C` twice: once for the commit of the 1st of December 2024 as before
(the closest date that is earlier than the commit date of package `A`), and once for a date
before the 1st of November 2024 (the closest date that is earlier than the commit date of package `B`).
You should be aware of this because, if you are unlucky, the default option `ignore_remotes_cache = FALSE` might result in a solution that does not work.
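The "closest date that is earlier" rule described above can be sketched as a small date selection (an illustration of the rule, not the rix code):

```r
# Pick the latest candidate commit date that is not after the target date.
closest_earlier <- function(candidate_dates, target_date) {
  candidates <- as.Date(candidate_dates)
  eligible <- candidates[candidates <= as.Date(target_date)]
  max(eligible)
}

closest_earlier(c("2024-10-01", "2024-12-01", "2025-01-15"), "2024-12-20")
# -> 2024-12-01
```

With two different target dates (one per depending package), this rule can pick two different commits for the same dependency, which is exactly how the duplicates arise when the cache is disabled.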
Contributor


you mean it might not work because the commit that will be picked might be for a version of C incompatible with either A or B?

Contributor Author


Yes. We observed this with the mlr3 example.
I mean, this can always happen with GitHub packages. I just wanted the user to be aware that there are multiple possible choices of which commit rix should pick for package C in such a situation, and that some of them might work and others not. The one rix chooses is the one closest to package A, but not necessarily the one that works.
Maybe you can find better words.


@mihem mihem Mar 1, 2025


Rewrote it a little, but also happy to take suggestions

@b-rodrigues b-rodrigues merged commit 6668547 into ropensci:main Mar 1, 2025
17 of 18 checks passed
@b-rodrigues

Many thanks! This is a great addition. I was a bit worried at first since it introduced quite a lot of code, but it's for the better.

mihem commented Mar 1, 2025

You're welcome.
I also think this is an important addition for handling remote dependencies; it hopefully solves the duplicate problem for good. We could remove my previous two loops for removing duplicates as well as the post-processing function.
I hope that now, with fetching the closest commit and caching packages, fetching remote dependencies from GitHub is more or less stable.

I saw that you edited the links for the test workflows. However, you also need to change the branch, not only the username.
