Is this good or not? To answer this question we need to compare those scores to some data set. In
fCite, the ORCID data are used for this purpose.
As mentioned in the FAQ, the problem of identifying the author is not trivial, and it is actually very hard to automate if we want to be very accurate (e.g., people change surnames, sometimes use initials only, or use special characters not present in the English alphabet). The data from the ORCID database make the problem easier to solve, but some care is still needed (we have already seen that people have begun to game the ORCID system by adding publications that do not belong to them, e.g., John Smith's records "enriched" by Jane Smith's records abbreviated to the initial only). Nevertheless, parsing out the author from the publication author list is a doable task because ORCID provides the name and surname of the user. In the simplest scenario, having "name+surname" allows us to calculate the so-called Levenshtein distance between two strings (the minimum number of single-character edits, i.e., insertions, deletions or substitutions, required to change one word into the other). This allows us to identify the potential position of a given author on the authorship list (the algorithm, of course, is not perfect, and when there is some ambiguity, e.g., two authors with perfect matches are identified, for instance two John Smiths in the list, the record is not taken into consideration).
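To make the idea concrete, here is a minimal Python sketch of the Levenshtein step. It is an illustration only, not the fCite code: the function names `levenshtein` and `find_author_position` are ours, and the rule "take the closest author, discard the record on a tie" is our simplified reading of the description above.

```python
from typing import List, Optional


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def find_author_position(orcid_name: str, authors: List[str]) -> Optional[int]:
    """Return the 0-based index of the closest author, or None if ambiguous.

    Assumption (ours): a tie for the smallest distance, e.g., two
    "John Smith" entries, means the record is discarded.
    """
    if not authors:
        return None
    target = orcid_name.lower()
    distances = [levenshtein(target, author.lower()) for author in authors]
    best = min(distances)
    if distances.count(best) > 1:
        return None
    return distances.index(best)
```

For example, `find_author_position("John Smith", ["Jane Doe", "John Smith"])` returns `1`, whereas a list containing two "John Smith" entries returns `None`.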
If none of the authors match the ORCID names and surnames, then a slightly more complicated procedure is used:
 a set of possible names and surnames is generated from ORCID's 'name'+'surname', i.e., [name+' '+surname, surname+' '+name, name[0]+' '+surname, surname[0]+' '+name, name[0]+' X '+surname, surname[0]+' X'+name] (remark: all names and surnames are treated as case insensitive because people frequently mix/overuse capital letters)
 for the given set, a Jaro–Winkler distance is calculated for all authors and then normalized, and the author with the highest score (which must be above 0.65) is chosen
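The fallback procedure can be sketched in Python as follows. This is a hedged illustration, not the fCite implementation: the function names are ours, the Jaro–Winkler score is the textbook version (a similarity in which higher means closer, with a prefix weight of 0.1), the 'X' placeholder is written with a space on both sides, and the normalization step mentioned above is left out because its details are not given here.

```python
from typing import List, Optional


def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    match_dist = max(0, max(len1, len2) // 2 - 1)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i in range(len1):
        lo = max(0, i - match_dist)
        hi = min(i + match_dist + 1, len2)
        for j in range(lo, hi):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    score = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1.0 - score)


def name_variants(name: str, surname: str) -> List[str]:
    """Lower-cased variants following the patterns listed above.

    Assumes both strings are non-empty.
    """
    name, surname = name.lower(), surname.lower()
    return [
        f"{name} {surname}",
        f"{surname} {name}",
        f"{name[0]} {surname}",
        f"{surname[0]} {name}",
        f"{name[0]} x {surname}",
        f"{surname[0]} x {name}",
    ]


def best_author_match(name: str, surname: str, authors: List[str],
                      threshold: float = 0.65) -> Optional[int]:
    """Index of the best-scoring author, or None if no score exceeds threshold."""
    variants = name_variants(name, surname)
    scores = [max(jaro_winkler(v, author.lower()) for v in variants)
              for author in authors]
    if not scores or max(scores) <= threshold:
        return None
    return scores.index(max(scores))
```

With this sketch, `best_author_match("John", "Smith", ["Doe J", "Smith J", "Kowalski A"])` picks index `1`, and an author list with no plausible candidate yields `None`.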
As a result, the average portfolios used to calculate the fractional versions of the metrics are usually shorter than the original portfolios (items for which the author position could not be identified with high probability are left out). Consequently, the percentiles presented in fCite for fractional metrics can be considered overly optimistic. For instance, the 95^{th} percentile could actually be the 90^{th} percentile. However, better data do not currently exist, and those estimates will be more accurate than comparing such scores by eye. Moreover, the percentile thresholds are updated yearly when new ORCID data appear.
Below, you can find detailed data for given percentiles depending on the score and the type:
Note: the score must be > 0 (this is an important remark because ~1/4 of the records have RCR, citations, etc. equal to 0, which means that they have not shown, or have not yet had time to show, any importance), and the ORCID portfolio must contain at least one item assigned to the author with a >0.65 Jaro–Winkler distance.
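Returning to John Smith, 100 citations and an RCR of 5 gives him:
+---------------+------+----------+
|               | RCR  | Citation |
+---------------+------+----------+
| Score         | 5    | 100      |
| Research only | 50.2 | 63.8     |
| All           | 48.5 | 62.7     |
+---------------+------+----------+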
Now, let us divide the publications in the ORCID portfolio into four categories:
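+---+-------------+--------------+---------------+--------------+
|   | Single      | First        | Middle        | Last         |
+---+-------------+--------------+---------------+--------------+
|   | 128455±1207 | 1427848±3213 | 3964752±11425 | 1487607±6908 |
| % | 1.833±0.016 | 20.373±0.044 | 56.569±0.059  | 21.225±0.059 |
+---+-------------+--------------+---------------+--------------+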
Authorship patterns for 572,910 ORCID users having at least one publication above the cutoff (7,008,012 unique publications in total).
The mean and standard deviation were calculated by bootstrapping the data 1000 times, and a Jaro–Winkler distance >0.65 was used as the cutoff.
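The uncertainties in the table are bootstrap estimates. The sketch below shows one way such a mean and standard deviation could be obtained in Python. It assumes that individual publications, each labelled with an author position, are resampled with replacement; the text above does not state which unit was resampled, and the function name `bootstrap_share` and the toy data are hypothetical.

```python
import random
import statistics
from typing import List, Tuple


def bootstrap_share(labels: List[str], category: str,
                    n_resamples: int = 1000, seed: int = 0) -> Tuple[float, float]:
    """Bootstrap mean and standard deviation of a category's share (in %).

    Each resample draws len(labels) items with replacement and records the
    percentage of items that fall into `category`.
    """
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    n = len(labels)
    shares = []
    for _ in range(n_resamples):
        sample = rng.choices(labels, k=n)
        shares.append(100.0 * sample.count(category) / n)
    return statistics.mean(shares), statistics.stdev(shares)


# Toy data only: 100 publications with made-up author positions.
toy = ["single"] * 2 + ["first"] * 20 + ["middle"] * 57 + ["last"] * 21
mean_share, sd_share = bootstrap_share(toy, "middle")
print(f"middle author share: {mean_share:.1f} ± {sd_share:.1f} %")
```

With millions of publications instead of 100 toy items, the standard deviation of the share shrinks accordingly, which is why the uncertainties in the table are so small.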
Conclusions:
 In most research publications, the researcher is a middle author (57% of cases)
 Roughly one in five publications is a first author contribution, and another one in five is a last author contribution
 Single author research publications are extremely rare (<2%)
The above table was calculated cumulatively for all publications, but to analyse individual researchers, we should bin the data per portfolio. To do so, we analysed 2471 ORCID users having 30 papers (otherwise small portfolios having just a few items would introduce considerable noise into the model).
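+---+-----------+-------------+-------------+-------------+
|   | Single    | First       | Middle      | Last        |
+---+-----------+-------------+-------------+-------------+
| % | 1.54±5.03 | 21.07±15.49 | 58.00±20.70 | 19.40±18.57 |
+---+-----------+-------------+-------------+-------------+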
Conclusions:
 The means are very similar in both tables
 There is substantial variance in what can be considered normal (on the other hand, this type of information can be used to statistically judge whether a given portfolio is enriched or depleted in given types of publications, e.g., a portfolio having 40% of first author papers is quite unexpected)
Now let us check how this changes across the number of items in the portfolio. It is expected that smaller portfolios (most likely younger researchers at the beginning of their careers, aka PhD students, postdocs) will have more first author publications than people with dozens of publications (principal investigators). The raw data for the plots below are available here and here.
All articles

Research only articles

Conclusions:
 As hypothesized above, we clearly see that the more publications there are in the portfolio, the fewer first author publications there are in the portfolio (and the opposite is true for the last author publications). This simply reflects that at some point in time, a successful scientist becomes the head of the lab/group leader/principal investigator, and then she/he ends up in the last position on the author list as a corresponding author. *
 The percentage of single and middle author publications is fairly stable across the lifetime of the researcher (~2% and ~60%).
 The standard deviation for first, last and middle author publications can be as high as ~20%, which means that there is considerable variance, but still, as mentioned above, a portfolio having 40% of first author papers is quite unexpected.
 "The tipping point" at which the proportions of first and last author publications are similar occurs at ~27–31 papers. If we consider that last author papers are a sign of creating a new laboratory or beginning to become an independent researcher, on average ~30 item portfolios already have six first and six last author publications.
 Single author papers are very rare (additionally, it is more likely that single author papers will be nonresearch articles, for instance, an editorial, comment, or review, rather than a research paper).
 On average, there are more middle author papers in the research only fraction, which again is expected because research papers have more authors on average; (un)surprisingly, it seems that writing nonresearch items requires fewer authors.
* This trend is expected and actually confirms two things: a) most of the life science (PubMed) publications use some kind of FLAE model; b) the Jaro–Winkler distance threshold we used to surmise the authorship position is reasonable. If conditions a) and b) were not both met, then we would not be able to observe such a swap between the first and last author positions vs the portfolio size.
Number of research vs. nonresearch publications in PUBMED since 1995
Research vs. nonresearch items in the period 1995–2018 (based on 17,787,016 PMIDs) (the raw format)
Conclusions:
 The number of publications in PUBMED grows every year (it has doubled in the last 25 years)
 Most of the publications are "research" items, and they constitute ~80% of portfolios
Number of authors vs. research_nonresearch items
It is also very interesting to investigate the number of authors of individual papers. This may differ across fields of science (e.g., in mathematics, publications usually tend to have fewer authors than in medicine); nevertheless, it is crucial to analyse such patterns. We can try to answer a number of questions. For instance, how many authors does the average paper have? What fraction of papers have one, two, or three authors? Is there any relationship between the number of authors and the type of publication (research vs. nonresearch)?
Average number of authors (whole PUBMED, 17 M items, raw format)
Conclusions:
 The average number of authors increased over time from 3.5 in 1995 to 5.9 in 2018
 The average number of authors for research papers is consistently larger than for nonresearch papers (by approx. two authors)
 The average nonresearch paper in the 1990s had 2 authors, while now it has 4 authors per paper
 The average research paper in the 1990s had 4 authors, while now it has over 6 authors per paper
Number of authors vs research_nonresearch items (whole PUBMED, 17 M items, 1995–2018, raw format)
Conclusions:
 Most of the research papers have fewer than 10 authors (usually 3–5 authors), with a long tail of papers authored by >15 people
 Over half (51.4%) of single author papers are nonresearch items
 The fewer authors a paper has, the more likely it is to be a nonresearch paper
The last observation is quite unexpected; thus, it is worth checking this relationship more closely. From one of the previous plots, we learned that the average fraction of nonresearch papers is ~20% (18.9% across all years, to be exact). Let us normalize the data based on the number of authors.
The fraction of nonresearch papers vs. number of authors (the raw format)
Conclusions:
 Most nonresearch papers (editorials, reviews, commentaries, etc.) are written by 1–4 persons
 A single author paper is approximately three times more likely to be a nonresearch work than the average (51.4% vs 18.9%)
 A ten-author paper is more than four times less likely to be a nonresearch item than average (4.4% vs 18.9%)
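The two ratios quoted above follow directly from the percentages. A small Python check (the constant and function names are illustrative only):

```python
BASELINE_NONRESEARCH_PCT = 18.9  # nonresearch share across all papers, from the text


def relative_to_baseline(nonresearch_pct: float,
                         baseline_pct: float = BASELINE_NONRESEARCH_PCT) -> float:
    """How many times more (>1) or less (<1) likely than average a paper
    with the given nonresearch share is to be a nonresearch item."""
    return nonresearch_pct / baseline_pct


single_author = relative_to_baseline(51.4)  # single author papers
ten_authors = relative_to_baseline(4.4)     # ten-author papers

print(f"single author: {single_author:.1f}x the average")
print(f"ten authors:   {1 / ten_authors:.1f}x below the average")
```

The first ratio is about 2.7 (hence "approximately three times"), and the second corresponds to a factor of about 4.3 below the average.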
Let us now check the trends over time:
Number of authors vs research_nonresearch items over time (raw format)
Conclusions:
 Over time, the single/fewer author papers decline relative to multiple author papers
 In 1995, the mode was a single author publication; in 2004, it was a three author publication; in 2007, it was a four author publication
 While the papers with more than 15 authors were almost unheard of in 1995, they have become increasingly popular, and in 2018 they represented 2.7% of all items (for >10 author papers, the statistics are 0.3% in 1995 vs. 9.8% in 2018)
Note that the fraction of single author publications in PUBMED overall differs from that for ORCID portfolios. This can be explained by the fact that for some people or in some fields (e.g., mathematics), it is common to publish alone; this changes the pattern when statistics are compared between the portfolios and the global value, but the trend is the same: over time, a single item has more and more authors, and fewer and fewer publications are single authored.
Fractional metrics (FLAE, FLAE2, FLAE3, EC)
The weights for the first, the middle and the last author up to ten authors for the FLAE, FLAE2, FLAE3, EC models.
Conclusions:
 The FLAE model assigns the greatest importance to the first author and then to the last author
 The EC model penalises the last and first authors
 The FLAE3 model has weightings between those of the FLAE and FLAE2 models
Fractional metrics vs total metrics (RCR or Citations) with respect to portfolio size
Conclusions:
 All models are highly correlated and produce similar results
 There is an almost linear correlation between portfolio size and the scores
 On average, every 10 papers add an RCR of ~2.5
 On average, every 10 papers receive ~35–40 citations
The correlations of fractional models
The lower triangular portion of the matrices (green) corresponds to the ORCID portfolios with 2–50 items (394,189 portfolios),
and the upper triangular portion of the matrices corresponds to all ORCID portfolios with at least a single item (600,755 portfolios)
Conclusions:
 All models are correlated
 The fractional models are extremely positively correlated with each other (>0.9)
 The fractional models are moderately positively correlated with global metrics (RCR, citations) at ~0.6–0.8
QUESTION:
Thus, since the models are so well correlated, why should we bother considering more than one in the first place? Can we not just convert one model into another?
ANSWER:
No, you cannot, especially if you are in the business of looking for outliers. When you analyse one particular portfolio, you can use the averages only as a baseline (no matter how good the correlations are, unless the correlation is 1.0). What you are actually looking for is what is odd (e.g., an ultra-high FLAE_{RCR} in comparison to the portfolio size, the difference between FLAE_{RCR} and EC_{RCR} to highlight the importance of first author papers, etc.).
Look at the spreads:
Conclusions:
 The spread among portfolios increases as portfolios become larger
 Regardless of size, the spread is always significant
 The averages are placed closer to the upper bounds, as the lower bounds are limited by zero
Now, let us examine the ratio of averages
RCR

Citations

Conclusions:
 The ratio of any fractional model and the total score is larger for small portfolios (more first author papers)
 For portfolios with up to 10 items it is approximately 20–25%
 At approximately 40–60 items, all fractional models start to produce virtually identical results
 The ratio for RCR scores is usually lower than for citations
Next, let us take a quick look at the spreads:
RCR

Citations

Conclusions:
 There is a massive spread among portfolios
 As expected, the more items there are, the smaller the spread
 Regardless of size, the spread is always at least 20%
 On average, the ratio should be approximately 18–20%, which is consistent with the average number of authors on the average paper
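The last point rests on a simple piece of arithmetic: if credit were split equally, each author of an n-author paper would receive 1/n of the score. A minimal Python sketch (the function name `equal_share` is ours, for illustration only, and the equal split is a simplifying assumption rather than any of the weighted models):

```python
def equal_share(n_authors: float) -> float:
    """Percentage of a paper's score one author gets under an equal split."""
    if n_authors <= 0:
        raise ValueError("n_authors must be positive")
    return 100.0 / n_authors


for n in (5.0, 5.5):
    print(f"{n} authors -> {equal_share(n):.1f}% per author")
```

Under this assumption, an average of roughly 5 to 5.5 authors per paper corresponds to a per-author share of about 18–20%.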