Package: RcppColMetric
Version: 0.1.0
Author: Xiurui Zhu
Modified: 2025-03-08 18:17:18
Compiled: 2025-03-08 18:17:53
The goal of RcppColMetric is to efficiently compute metrics between
various vectors and a common vector. This task is common in data science,
for example when computing performance metrics between each feature and a
common response. Rcpp is used to iterate over the vectors efficiently
through compiled code. You may extend the package by providing custom
metrics that fit into the framework.
You can install the released version of RcppColMetric from CRAN with:
install.packages("RcppColMetric")
Alternatively, you can install the development version of RcppColMetric
from GitHub with:
remotes::install_github("zhuxr11/RcppColMetric")
library(RcppColMetric)
We use cats from MASS to illustrate the use of the package.
library(MASS)
data(cats)
print(head(cats))
#> Sex Bwt Hwt
#> 1 F 2.0 7.0
#> 2 F 2.0 7.4
#> 3 F 2.0 9.5
#> 4 F 2.1 7.2
#> 5 F 2.1 7.3
#> 6 F 2.1 7.6
In binary classification modelling, it is common practice to compute the
ROC-AUC of each feature (usually a column) against a common target.
RcppColMetric provides a faster version than its commonly used
counterparts, e.g. caTools::colAUC().
library(caTools)
(col_auc_bench <- microbenchmark::microbenchmark(
  col_auc_r = caTools::colAUC(cats[, 2L:3L], cats[, 1L]),
  col_auc_cpp = col_auc(cats[, 2L:3L], cats[, 1L]),
  times = 100L,
  check = "identical"
))
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> col_auc_r 514.600 544.2005 623.3241 613.2515 666.1505 985.601 100
#> col_auc_cpp 200.401 222.2010 244.9679 236.0010 266.8010 391.501 100
As the medians show, RcppColMetric is about 2.599 times as fast as
caTools::colAUC() in this benchmark.
If there are multiple sets of features and responses, you may use the
vectorized version col_auc_vec(), which uses compiled code to speed up
iterations and returns a list.
col_auc_vec(list(cats[, 2L:3L]), list(cats[, 1L]))
#> [[1]]
#> Bwt Hwt
#> F vs. M 0.8338451 0.759048
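For readers who want to see what a per-column ROC-AUC involves, here is a minimal sketch in plain C++ using the Mann-Whitney U statistic with midranks for ties. The function name auc_one_column is hypothetical and the code is not taken from RcppColMetric; it assumes a response coded as 0 (negative) and 1 (positive) and returns the plain AUC without any folding around 0.5.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of RcppColMetric): ROC-AUC of one feature
// against a binary response coded as 0 (negative) / 1 (positive).
double auc_one_column(const std::vector<double>& x, const std::vector<int>& y) {
  assert(x.size() == y.size());
  const std::size_t n = x.size();

  // Order the observations by feature value
  std::vector<std::size_t> idx(n);
  std::iota(idx.begin(), idx.end(), std::size_t{0});
  std::sort(idx.begin(), idx.end(),
            [&](std::size_t a, std::size_t b) { return x[a] < x[b]; });

  // Walk over groups of tied values, give each group its midrank,
  // and accumulate the rank sum of the positive class
  double rank_sum_pos = 0.0;
  std::size_t n_pos = 0;
  std::size_t i = 0;
  while (i < n) {
    std::size_t j = i;
    while (j + 1 < n && x[idx[j + 1]] == x[idx[i]]) ++j;
    const double midrank = (i + j) / 2.0 + 1.0;  // mean of 1-based ranks i+1..j+1
    for (std::size_t k = i; k <= j; ++k) {
      if (y[idx[k]] == 1) {
        rank_sum_pos += midrank;
        ++n_pos;
      }
    }
    i = j + 1;
  }

  const std::size_t n_neg = n - n_pos;
  assert(n_pos > 0 && n_neg > 0);  // both classes must be present

  // Mann-Whitney U of the positive class, scaled to [0, 1]
  const double u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0;
  return u / (static_cast<double>(n_pos) * static_cast<double>(n_neg));
}
```

Applying such a helper to every column in turn gives one AUC per feature, which is the shape of result shown above.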
In classification modelling, it is also common practice to assess the
mutual information between features and a response when the features are
discrete. RcppColMetric provides a faster version than its commonly used
counterparts, e.g. infotheo::mutinformation().
library(infotheo)
library(magrittr)  # provides the %>% pipe used below
(col_mut_info_bench <- microbenchmark::microbenchmark(
  col_mut_info_r = {
    sapply(round(cats[, 2L:3L]), infotheo::mutinformation, cats[, 1L]) %>%
      matrix(., nrow = 1L, dimnames = list(NULL, names(.)))
  },
  col_mut_info_cpp = col_mut_info(round(cats[, 2L:3L]), cats[, 1L]),
  times = 100L,
  check = "identical"
))
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> col_mut_info_r 1587.200 1685.351 1884.131 1766.001 1964.9510 4803.800 100
#> col_mut_info_cpp 618.401 641.551 691.325 659.651 721.6015 943.101 100
As the medians show, RcppColMetric is about 2.677 times as fast as the
infotheo-based version in this benchmark.
If there are multiple sets of features and responses, you may use the
vectorized version col_mut_info_vec(), which uses compiled code to speed
up iterations and returns a list.
col_mut_info_vec(list(round(cats[, 2L:3L])), list(cats[, 1L]))
#> [[1]]
#> Bwt Hwt
#> [1,] 0.1346783 0.1620514
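As an illustration of the quantity being computed, the sketch below estimates mutual information between two discrete vectors from empirical frequencies, I(X;Y) = sum over (x, y) of p(x,y) * log(p(x,y) / (p(x) * p(y))). The function name mutual_info is hypothetical, the natural logarithm (nats) is an assumption about units, and the code is plain C++ that is not taken from RcppColMetric or infotheo.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical helper (not part of RcppColMetric): empirical mutual
// information, in nats, between two discrete vectors of equal length.
double mutual_info(const std::vector<int>& x, const std::vector<int>& y) {
  assert(x.size() == y.size() && !x.empty());
  const double n = static_cast<double>(x.size());

  // Count marginal and joint occurrences
  std::map<int, double> cx, cy;
  std::map<std::pair<int, int>, double> cxy;
  for (std::size_t i = 0; i < x.size(); ++i) {
    cx[x[i]] += 1.0;
    cy[y[i]] += 1.0;
    cxy[std::make_pair(x[i], y[i])] += 1.0;
  }

  // Sum p(x,y) * log(p(x,y) / (p(x) p(y))) over observed pairs only;
  // written with counts: p(x,y) / (p(x) p(y)) = c_xy * n / (c_x * c_y)
  double mi = 0.0;
  for (const auto& kv : cxy) {
    const double c = kv.second;
    const double denom = cx[kv.first.first] * cy[kv.first.second];
    mi += (c / n) * std::log(c * n / denom);
  }
  return mi;
}
```

Two identical balanced binary vectors give log(2) under this estimator, and two independent ones give 0.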
You may implement your own metric by inheriting from the
RcppColMetric::Metric class, with the feature SEXP type (input, numeric
here), the response SEXP type (input, factor as integer here) and the
output SEXP type (output, numeric here) as template arguments. For
example, to compute the range of each feature, define a RangeMetric class.
#include <RcppColMetric.h>
#include <Rcpp.h>
using namespace Rcpp;
// x: numeric (REALSXP), y: factor -> integer (INTSXP), output: numeric (REALSXP)
class RangeMetric: public RcppColMetric::Metric<REALSXP, INTSXP, REALSXP>
{
public:
  // Constructor
  RangeMetric(const RObject& x, const IntegerVector& y, const Nullable<List>& args = R_NilValue) {
    // This parameter is inherited from `Metric`, determining output dimension (number of rows)
    // For RangeMetric, the output dimension is 2 (min & max)
    output_dim = 2;
  }
  virtual Nullable<CharacterVector> row_names(const RObject& x, const IntegerVector& y, const Nullable<List>& args = R_NilValue) const override {
    // Determine the row names
    // If not used, it may return R_NilValue
    CharacterVector out = {"min", "max"};
    return out;
  }
  virtual NumericVector calc_col(const NumericVector& x, const IntegerVector& y, const R_xlen_t& i, const Nullable<List>& args = R_NilValue) const override {
    // Derive output value for each feature and the common response
    // For RangeMetric, the output is min & max
    NumericVector out = {min(x), max(x)};
    return out;
  }
};
Then, define the main function calling RcppColMetric::col_metric(), with
the corresponding feature SEXP (input, numeric here), response SEXP
(input, factor as integer here) and output SEXP (output, numeric here)
types.
// [[Rcpp::export]]
NumericMatrix col_range(const RObject& x, const IntegerVector& y, const Nullable<List>& args = R_NilValue) {
  RangeMetric range_metric(x, y, args);
  NumericMatrix out = RcppColMetric::col_metric<REALSXP, INTSXP, REALSXP>(x, y, range_metric, args);
  return out;
}
Test this function with cats:
col_range(cats[, 2L:3L], cats[, 1L])
#> Bwt Hwt
#> min 2.0 6.3
#> max 3.9 20.5
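To make the division of labour concrete, the following plain C++ sketch mimics the pattern: a metric object reports how many rows it produces and computes one column at a time, while a driver loops over columns and fills a matrix. The names SketchMetric, SketchRange and sketch_col_metric are hypothetical stand-ins; this is a conceptual illustration under those assumptions, not the actual implementation of RcppColMetric::Metric or RcppColMetric::col_metric().

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Column = std::vector<double>;

// Hypothetical stand-in for a metric interface (NOT RcppColMetric::Metric)
struct SketchMetric {
  std::size_t output_dim = 1;  // number of values produced per column
  virtual ~SketchMetric() = default;
  virtual std::vector<double> calc_col(const Column& x,
                                       const std::vector<int>& y) const = 0;
};

// Range metric: two output rows (min and max); the response is unused
struct SketchRange : SketchMetric {
  SketchRange() { output_dim = 2; }
  std::vector<double> calc_col(const Column& x,
                               const std::vector<int>& /*y*/) const override {
    assert(!x.empty());
    const auto mm = std::minmax_element(x.begin(), x.end());
    return {*mm.first, *mm.second};
  }
};

// Hypothetical driver: apply the metric to every column.
// out[r][c] holds output row r for column c.
std::vector<std::vector<double>> sketch_col_metric(
    const std::vector<Column>& cols, const std::vector<int>& y,
    const SketchMetric& metric) {
  std::vector<std::vector<double>> out(
      metric.output_dim, std::vector<double>(cols.size()));
  for (std::size_t c = 0; c < cols.size(); ++c) {
    const std::vector<double> v = metric.calc_col(cols[c], y);
    assert(v.size() == metric.output_dim);
    for (std::size_t r = 0; r < v.size(); ++r) out[r][c] = v[r];
  }
  return out;
}
```

Keeping the per-column computation in the metric and the iteration in the driver is what lets a new metric reuse the same loop.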
To define a vectorized version of the function, first define a wrapper
function that generates a RangeMetric object (taking only x, y and args),
then pass it on to the workhorse RcppColMetric::col_metric_vec().
RangeMetric gen_range_metric(const RObject& x, const IntegerVector& y, const Nullable<List>& args = R_NilValue) {
  RangeMetric out(x, y, args);
  return out;
}
// [[Rcpp::export]]
List col_range_vec(const List& x, const List& y, const Nullable<List>& args = R_NilValue) {
  List out = RcppColMetric::col_metric_vec<REALSXP, INTSXP, REALSXP>(x, y, &gen_range_metric, args);
  return out;
}
Test the vectorized function with cats:
col_range_vec(list(cats[, 2L:3L]), list(cats[, 1L]))
#> [[1]]
#> Bwt Hwt
#> min 2.0 6.3
#> max 3.9 20.5