Title: | Training Set Determination For Genomic Selection |
---|---|
Description: | We propose an optimality criterion to determine the required training set, r-score, which is derived directly from Pearson's correlation between the genomic estimated breeding values and phenotypic values of the test set <doi:10.1007/s00122-019-03387-0>. This package provides two main functions to determine a good training set and its size. |
Authors: | Jen-Hsiang Ou [aut, cre] |
Maintainer: | Jen-Hsiang Ou <[email protected]> |
License: | GPL (>= 3) |
Version: | 2.4.2 |
Built: | 2025-03-10 03:03:18 UTC |
Source: | https://github.com/oumarkme/tsdfgs |
This function calculates the CD-score <doi:10.1186/1297-9686-28-4-359> for a given training set and test set.
cd_score(X, X0)
X |
A numeric matrix. The training set genotypic information can be given as a genotype matrix (coded as -1, 0, 1) or a principal component matrix (row: sample; column: marker). |
X0 |
A numeric matrix. The test set genotypic information can be given as a genotype matrix (coded as -1, 0, 1) or a principal component matrix (row: sample; column: marker). |
A floating-point number, CD score.
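Both arguments are expected to be matrices with samples in rows, so the training and test sets are most naturally built by row indexing. The sketch below uses a simulated matrix (`pcs` is a hypothetical stand-in, not the packaged data) to show that the trailing comma matters: without it, R indexes the matrix as a plain vector.

```r
# Simulated stand-in for a principal component matrix
# (100 samples x 5 PCs); the values are arbitrary.
set.seed(42)
pcs <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)

# Row indexing keeps the matrix structure; drop = FALSE also
# protects the single-row case.
X  <- pcs[1:50, , drop = FALSE]    # training set
X0 <- pcs[51:100, , drop = FALSE]  # test set

# Without the comma, elements 51..100 of the first column are
# returned as a vector, which is not a sample-by-marker matrix.
not_a_matrix <- pcs[51:100]

dim(X)
dim(X0)
is.matrix(not_a_matrix)
```

Matrices built this way can then be passed as `X` and `X0`.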
Jen-Hsiang Ou
data(geno)
## Not run:
cd_score(geno[1:50, ], geno[51:100, ])
A function for fitting a logistic growth curve model.
FGCM(geno, nt = NULL, n_iter = NULL, multi.threads = FALSE)
geno |
Genotype information stored as a data frame. Columns represent variants (SNPs or PCs). |
nt |
A numeric vector of training set sizes used to estimate the logistic growth curve parameters. |
n_iter |
Number of simulations for each training set size. A suitable number is chosen automatically by default. |
multi.threads |
Default: FALSE. Multi-threading is only available on macOS or Linux systems. |
Estimates of the parameters.
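To make the idea of a logistic growth curve over training set sizes concrete, here is a minimal sketch using base R's `nls()` with the self-starting `SSlogis()` model. The data are simulated, and the three-parameter form (`Asym`, `xmid`, `scal`) is an assumption chosen for illustration; the parameterization and fitting procedure used inside FGCM are not stated here.

```r
# Simulated (training size, score) pairs following a logistic
# curve plus small noise. All values are hypothetical.
set.seed(123)
nt <- seq(10, 300, by = 10)
true_curve <- 0.8 / (1 + exp((60 - nt) / 25))
sim <- data.frame(nt = nt, r = true_curve + rnorm(length(nt), sd = 0.01))

# Fit Asym / (1 + exp((xmid - nt) / scal)) by nonlinear least
# squares; SSlogis() supplies its own starting values.
fit <- nls(r ~ SSlogis(nt, Asym, xmid, scal), data = sim)
est <- coef(fit)
est
```

In this parameterization `Asym` is the plateau of the curve, `xmid` the training size at which half of the plateau is reached, and `scal` controls how quickly the curve rises.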
data(geno)
## Not run:
FGCM(geno)
A PCA matrix of rice genotype information. The data were published by Zhao et al. (2011) <doi:10.1038/ncomms1467>.
geno
A numeric matrix (PCA) with 404 rows (samples) and 404 columns (PCs).
http://www.ricediversity.org/data/
data(geno)
Calculate untargeted r-scores, optionally in parallel.
nt2r(geno, nt, n_iter = 30, multi.threads = FALSE)
geno |
A numeric data frame of genotypes; columns represent sites (genotypes coded as 1, 0, -1). |
nt |
Numeric. The training set size. |
n_iter |
Number of iterations (default = 30). |
multi.threads |
Default: FALSE. Multi-threading is only available on macOS or Linux systems. |
A vector of r-scores, one per iteration.
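The return value, one score per iteration, suggests a repeated random subsampling loop. The sketch below shows that general pattern only. It rests on several assumptions: the rows not drawn for training act as the test set, `toy_score()` is a hypothetical placeholder rather than the r-score, and `resample_scores()` is an illustrative helper that is not part of the package.

```r
# Simulated genotype-like matrix (120 samples x 4 columns).
set.seed(7)
pcs <- matrix(rnorm(120 * 4), nrow = 120)

# Placeholder criterion, NOT the r-score: negative mean squared
# distance of the test rows from the training centroid.
toy_score <- function(X, X0) {
  -mean(rowSums(sweep(X0, 2, colMeans(X))^2))
}

# Draw n_iter random training sets of size nt and score each one.
resample_scores <- function(geno, nt, n_iter = 30, score = toy_score) {
  vapply(seq_len(n_iter), function(i) {
    train <- sample(nrow(geno), nt)
    score(geno[train, , drop = FALSE], geno[-train, , drop = FALSE])
  }, numeric(1))
}

scores <- resample_scores(pcs, nt = 50)
length(scores)
```

Summarising such a vector, for example by its mean, gives one point per training set size.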
data(geno)
## Not run:
nt2r(geno, 50)
This function determines an optimal training set.
optTrain(geno, cand, n.train, subpop = NULL, test = NULL, method = "rScore", min.iter = NULL, console = TRUE)
geno |
A numeric matrix of principal components (rows: individuals; columns: PCs). |
cand |
An integer vector giving the rows of the geno matrix that are candidates for the training set. |
n.train |
The target training set size. It can be determined with the help of the SSDFGS function provided in this package. |
subpop |
A character vector of sub-population group names. The algorithm ignores population structure if this remains NULL. |
test |
An integer vector giving the rows of the geno matrix that form the test set. The algorithm uses an untargeted method if this remains NULL. |
method |
One of rScore, PEV, or CD. rScore is used by default. |
min.iter |
The minimum number of iterations, which applies to all methods. Always check whether the algorithm has converged. If this remains NULL, a minimum is set based on the candidate and test set sizes. |
console |
Default: TRUE. Set it to FALSE to suppress printing of the iteration count. |
The function returns three items: OPTtrain (a vector of the chosen optimal training set), TOPscore (the highest score before each iteration), and ITERscore (the criterion score at each iteration).
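Because convergence should be checked by the user, one simple approach is to test whether the best score seen so far has stopped improving. The sketch below works on a mock vector standing in for ITERscore. The helper `has_plateaued()`, the window of 50 iterations, and the tolerance are all hypothetical choices, not part of the package.

```r
# Mock per-iteration scores that rise and then level off; in real
# use this would be the ITERscore element of the optTrain() result.
iter_score <- 0.6 * (1 - exp(-(1:200) / 30))

# TRUE when the running best improved by less than `tol` over the
# last `window` iterations.
has_plateaued <- function(scores, window = 50, tol = 0.01) {
  n <- length(scores)
  if (n <= window) return(FALSE)
  best <- cummax(scores)
  (best[n] - best[n - window]) < tol
}

has_plateaued(iter_score)        # all 200 iterations
has_plateaued(iter_score[1:60])  # stopped while still improving
```

If the check fails, rerunning with a larger `min.iter` is one way to give the search more iterations.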
Jen-Hsiang Ou
data(geno)
## Not run:
optTrain(geno, cand = 1:404, n.train = 100)
This function calculates the prediction error variance (PEV) score <doi:10.1186/s12711-015-0116-6> for a given training set and test set.
pev_score(X, X0)
X |
A numeric matrix. The training set genotypic information can be given as a genotype matrix (coded as -1, 0, 1) or a principal component matrix (row: sample; column: marker). |
X0 |
A numeric matrix. The test set genotypic information can be given as a genotype matrix (coded as -1, 0, 1) or a principal component matrix (row: sample; column: marker). |
A floating-point number, PEV score.
Jen-Hsiang Ou
data(geno)
## Not run:
pev_score(geno[1:50, ], geno[51:100, ])
This function calculates the r-score <doi:10.1007/s00122-019-03387-0> for a given training set and test set.
r_score(X, X0)
X |
A numeric matrix. The training set genotypic information can be given as a genotype matrix (coded as -1, 0, 1) or a principal component matrix (row: sample; column: marker). |
X0 |
A numeric matrix. The test set genotypic information can be given as a genotype matrix (coded as -1, 0, 1) or a principal component matrix (row: sample; column: marker). |
A floating-point number, r-score.
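The r-score is described as being derived from Pearson's correlation between genomic estimated breeding values (GEBVs) and test set phenotypes. As an illustration of that underlying quantity only, the sketch below simulates genotypes and phenotypes, predicts the test set with ridge regression, and computes the correlation. Everything in it is an assumption made for the example: the simulated data, the ridge model, and the penalty `lambda`. Since `r_score(X, X0)` takes no phenotypes, it does not perform this computation.

```r
# Simulated markers, marker effects, and phenotypes (hypothetical).
set.seed(1)
n <- 100; p <- 20
X_all <- matrix(rnorm(n * p), nrow = n)
beta  <- rnorm(p)
y     <- as.vector(X_all %*% beta + rnorm(n, sd = 1))

train <- 1:60
test  <- 61:100
X  <- X_all[train, , drop = FALSE]
X0 <- X_all[test, , drop = FALSE]

# Ridge regression estimate of marker effects from the training set.
lambda   <- 1
beta_hat <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y[train]))

# GEBVs for the test set and their Pearson correlation with the
# observed test phenotypes.
gebv <- as.vector(X0 %*% beta_hat)
r <- cor(gebv, y[test])
r
```

A criterion that depends only on `X` and `X0` is useful because it can rank candidate training sets before any phenotypes are collected.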
Jen-Hsiang Ou
data(geno)
## Not run:
r_score(geno[1:50, ], geno[51:100, ])
This function generates an operating curve for sample size determination.
SSDFGS(geno, nt = NULL, n_iter = NULL, multi.threads = FALSE)
geno |
A numeric data frame carrying genotype information (columns: PCs; rows: samples). |
nt |
A numeric vector of training set sizes for the r-score simulation. |
n_iter |
Number of iterations for estimating parameters. |
multi.threads |
Default: FALSE. If TRUE, the function uses 75% of the available threads when the computer has more than 4 threads. Multi-thread computing is only available on macOS and Linux. |
An operating curve and its information.
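One way to read a sample size off an operating curve is to invert it at a target value. The sketch below assumes a three-parameter logistic curve, Asym / (1 + exp((xmid - n) / scal)), with hypothetical parameter values and a hypothetical target of 95% of the plateau. The helper `size_for_target()` is illustrative and not part of the package.

```r
# Invert y = Asym / (1 + exp((xmid - n) / scal)) for n.
size_for_target <- function(target, Asym, xmid, scal) {
  stopifnot(target > 0, target < Asym)
  xmid - scal * log(Asym / target - 1)
}

# Hypothetical curve parameters.
Asym <- 0.8
xmid <- 60
scal <- 25

# Smallest whole training size reaching 95% of the plateau.
n_needed <- ceiling(size_for_target(0.95 * Asym, Asym, xmid, scal))
n_needed
```

With these values the target is reached at xmid + scal * log(19), about 133.6, so the rounded-up training size is 134.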
Jen-Hsiang Ou & Po-Ya Wu
data(geno)
## Not run:
SSDFGS(geno)
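The multi.threads description for SSDFGS (75% of threads when more than 4 are available) can be pictured with the base `parallel` package. This is only a sketch of that rule. The helper `workers_for()`, the rounding down, and the single-worker fallback are assumptions, and the package's internal logic may differ.

```r
# Number of workers under the documented 75% rule; falls back to a
# single worker when the count is unknown or not above 4.
workers_for <- function(n_threads) {
  if (is.na(n_threads) || n_threads <= 4) return(1L)
  as.integer(floor(0.75 * n_threads))
}

# detectCores() can return NA on some platforms, hence the guard.
n_threads <- parallel::detectCores()
workers   <- workers_for(n_threads)
workers
```

For instance, a machine reporting 8 threads would use 6 workers under this reading.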
Sub-population information for the samples. The data were published by Zhao et al. (2011) <doi:10.1038/ncomms1467>.
subpop
A character vector.
http://www.ricediversity.org/data/
data(subpop)