Abstract

The Australian Age Assurance Technology Trial (‘the AAATT’) was a A$6.5 million project funded by the Australian government. The authors assert that deployment of age estimation and other methods is viable, based on their own results. This technical report presents a preliminary re-analysis of the available data to evaluate these claims, focusing upon age estimation. It is demonstrated that the AAATT did not find a single commercially viable system that meets real-world performance requirements. These failures include inadequate real-world recognition performance (e.g. recommending a system that would treat over 40% of 10-year-olds as being over 16), a failure to work for First Nations people of color, an absence of evidence in respect of disabled people, and a failure to conduct an effective analysis of privacy risks. The AAATT also appears not to have been conducted in line with the Australian National Statement on Ethical Conduct in Human Research, despite claiming to have properly considered it. Based on these findings, the AAATT report should not be relied upon by policy makers.

A. Introduction

The Australian Age Assurance Technology Trial (‘the AAATT’) was conducted over the course of 2024 and 2025. The purpose of the AAATT sits within a legislative context: Australia then proposed and has now enacted a world-first blanket ban on under-16s accessing social media services, which is due to come into force on 10 December 2025. The legislation requires the taking of (objectively) reasonable steps by any social media provider, ranging from TikTok to bulletin boards, unless they geo-block Australia.1 The Australian eSafety Commissioner has since sought to extend this to other areas of online activity, such as user accounts on search engines and online pornography.2

Based on the results of the AAATT, the Australian Government has claimed that it is feasible for these measures to be applied. The eSafety Commissioner has even issued guidance relying on the AAATT results, saying that these are ‘reasonable steps’.3 It is alleged that most people can undergo ‘facial estimation’, with adults being subject to an exercise said not to involve any intrusion upon their privacy. The AAATT report, approved by those remaining after some members resigned in protest (e.g. John Pane from Electronic Frontiers Australia4), makes claims such as ‘age assurance can be done in Australia – our analysis of age assurance systems in the context of Australia demonstrates how they can be private, robust and effective’.5 Despite being 1200 pages long, it is not a conventional peer reviewed scientific document, with only one ‘peer reviewer’ having been appointed by the trial team itself. No anonymous review appears to have been conducted in line with the normal scientific process.6 Nevertheless, some data was released alongside the report, including a selection of trial data, as well as individual reports analysing individual systems. The AAATT involved a mix of synthetic experiments and real-world trials.

Unfortunately, their methodology is not fully clear from reading the final report and associated documentation. The AAATT report is presented more as a scientific communication exercise than a traditional scientific report, despite purporting to have a complete and rigorous methodology.7 The substantial gaps in the reported methodology mean that we cannot be sure whether the data was collected appropriately and free of serious errors.8 Making the (potentially untrue) assumption that the data was accurately recorded and that there were no material sampling errors, we can re-analyse the AAATT trial data. A closer inspection reveals that most systems were not fully evaluated even within the trial protocol, and that real-world performance is not as claimed. The truth is that existing systems have not been demonstrated to be sufficiently reliable for the general population, let alone for minority groups who are particularly vulnerable to AI-based discrimination. Similarly, the evidence in respect of usability suggests many users had a poor experience. There is also insufficient evidence that privacy standards were met by these systems.

What is most worrying is that individuals and companies could be subjected to multi-million-dollar fines for failing to implement what the eSafety Commissioner asserts is objectively ‘reasonable’.9 It also appears that unreasonable steps could include those which are too aggressive in shutting out over-16s, making this situation particularly troubling.10 This would not be the first time that pseudo-computer-science and official power have been misused to ruin people’s lives: the United Kingdom has had the ‘Post Office Scandal’ (now under an official inquiry), and Australia had Robodebt (which necessitated a Royal Commission).11 Unfortunately, these lessons do not appear to have been learned. Most of the AAATT research team do not appear to have relevant qualifications to conduct evaluations of machine learning research, the claimed results are not supported by the evidence, and there even appear to be several breaches of the Australian National Statement on Ethical Conduct in Human Research.

Given that the AAATT report was only recently released to the public, this Technical Report is necessarily narrow in scope.12 It follows that only a subset of potential scientific flaws were explored. Nevertheless, the number and the nature of the errors already identified in this Technical Report mean that the entire AAATT report should be considered suspect unless and until it has been shown otherwise. To put it another way, it should be retracted in its present form, and possibly permanently. The remainder of this report first addresses the Age Estimation Problem in Section B, before moving to consider the non-compliance with Australian Research Ethics Standards in Section C, followed by a conclusion in Section D.

B. Age Estimation

(i) What would a commercially deployable system complying with the Online Safety Act require?

The AAATT was funded and conducted to enable policy decisions to be made, albeit the trial’s authors took the position that the final policy outcomes were not a matter for them. Rather, ‘the report aims to inform these stakeholders about the current state of age assurance technologies, supporting evidence-based discussions and decision-making in this rapidly evolving field’.13 It does not claim to be ‘a final or exhaustive assessment of the long-term performance or policy implications of age assurance systems and technology solutions.’14 Nevertheless, the AAATT report claims that ‘the core aim was to understand if age assurance can be done without compromising Australian citizens privacy and security, as well as to inform consideration of best practice and potential regulatory approaches’.15

The asserted findings are of importance, because they relay the criteria that the AAATT’s authors seem to accept as a necessary conceptual minimum. Their headline findings include that ‘Age assurance can be done in Australia … our analysis … demonstrates how they can be private, robust and effective’ and that their ‘evaluation did not reveal any substantial technological limitations’.16 They also said that they ‘found no substantial difference in the outcomes for First Nations and Torres Strait Islander Peoples and other multi-cultural communities using the age assurance systems.’17

In respect of Age Estimation itself, it is asserted by the AAATT’s authors that ‘age estimation can be done in Australia and is being deployed effectively’.18 They say that ‘age estimation offers a frictionless, privacy-conscious way to implement age-based access controls’.19 They also say that ‘most solutions demonstrated low-friction user experiences, fast estimation times (typically under 20 seconds) and high accuracy outside threshold “buffer zones” (e.g. 13+, 16+, 18+).’20 In terms of a specific finding, ‘some systems achieved mean absolute errors (MAE) of approximately one year in controlled conditions and provided reliable threshold classification’.21 It was also said that ‘while systems generally performed well across diverse user groups, some showed reduced accuracy for older adults, non-Caucasian users and female-presenting individuals near policy thresholds’.22 The authors also claim that ‘while age estimation is highly effective for real-time, contextual age checks, it is not currently suitable for generating verifiable digital credentials’.23 Their overall claim is that Age Estimation is a ‘mature, secure and adaptable tool for enforcing age-based access in a wide range of digital and physical contexts’ and serves as a ‘key component of modern, privacy-respecting age assurance infrastructure’.24

In their report, the authors presented a list of ‘key attributes’. In their words, these included:

  • ‘Accuracy’ – with the criteria being ‘correct classification around thresholds (13+, 16+, 18+)’.25

  • ‘Bias minimisation’ – ‘Performance parity across demographics (age, sex, ethnicity, skin tone).’26

  • ‘Privacy and Security’ – ‘Data minimisation, no retention of biometric inputs, secure design.’27

  • ‘Usability’ – ‘Effectiveness across age groups, clarity of user journey.’28

  • ‘Technology Readiness Level (TRL)’ – ‘Maturity of the solution for real-world deployment.’29

Taking the wider claims of the AAATT authors and also considering the Australian legislation, it would be fair to say that the minimal conceptual requirements for a system include:

  1. Classification performance for whether someone is below or above age 16 in respect of the general population. The system should reliably reject individuals who are under 16, especially those below it by a significant margin (e.g. 10-year-olds and perhaps 13-year-olds should not be able to defeat it), whilst not presenting difficulties for people above the age of 16, especially those substantially above that threshold (e.g. 18+). Whilst other ‘age gates’ were evaluated, age 16 is the one which relates to the Australian policy context concerning the under-16 social media ban, and that is the one which should be tested.30 A system that provides a false sense of security is unlikely to be reasonable to deploy in any event, both from a harm perspective and because it imposes a disproportionate burden on a platform due to the lack of benefit.

  2. Classification performance does not substantially discriminate based on protected characteristics. These characteristics include ‘sex’, ‘race’ (including being from a First Nations or Indigenous background) and ‘disability’. These are requirements of Australian anti-discrimination law, as well as of international law. It is unclear how a system that substantially discriminated on this basis, to the extent of subjecting an end user to detriment, could meet the standard of reasonableness.

  3. Available and deployable in a commercial context. These systems need to be reliable and to be available on the open market. It would not be reasonable for a platform to adopt an unreliable prototype. This would impose unacceptable risks, including to the reliability and security of a service. Rather, there is a need for the highest standard of software engineering to be implemented, including with respect to testing.

  4. Demonstrably Private and Secure. It is important that these systems demonstrably meet privacy standards. Whilst Australia has relatively weak privacy legislation relative to other Westernised countries, it must be remembered that most social media companies are also subjected to the EU General Data Protection Regulation. Furthermore, Australia has recently enacted a new statutory tort that addresses serious invasions of privacy, whilst it is likely that Australia will enact further legislation that moves Australia closer to European standards over time.31

  5. Easily usable by the wider Australian population. This was recognised within the AAATT documentation and encompasses diverse user groups, including people with a disability. This is why references to concerns such as ‘frictionless’ usage were made in the AAATT report. Reinforcing existing digital divides is unlikely to be reasonable, nor lawful under existing anti-discrimination laws, both in Australia and beyond.

To be successful in upholding its claim that age estimation can be deployed, the AAATT is required to identify at least one commercially available system that meets all of those criteria. It is not sufficient to find a patchwork of systems that each meet some of them: the test is the system as a whole. For example, one system might be privacy sensitive, but have such poor real-world recognition performance that it cannot be effectively used to provide any meaningful protection of children (thus providing a false sense of security and thus being legally unreasonable to deploy on that basis).32 This is important, because the final report presents selected averages of existing systems’ performances, rather than addressing the question of whether any one system meets all the criteria.33

(ii) The evaluation conducted by the AAATT of Age Estimation Systems

The AAATT conducted what it called a Technology Readiness Level (TRL) assessment on systems entered into the trial for potential evaluation. Only those which met the standard of TRL7 or above were considered by the AAATT at all. The investigated systems thus fell into three categories:

  • TRL 9 = ‘Actual system proven through successful operation in an operating environment, ready for full commercial deployment’.

  • TRL 8 = ‘Actual system/ process completed and qualified through test and demonstration (pre-commercial demonstration)’

  • TRL 7 = ‘System/process prototype demonstration in an operational environment (integrated pilot system level)’.34

These TRLs were said to be calculated using a ‘tool’ from the ‘New South Wales (NSW) Government’s Invest NSW initiative’.35 This tool essentially provides a somewhat more detailed descriptor of the categories used in the report.36 Some points are of relevance – TRL 8 is said to include systems where the ‘product performance delta to plan needs to be highlighted and plans to close the gap will need to be developed’, whilst it is only at TRL 9 where the system is in its ‘final form and operated under the full range of operating conditions.’37

The AAATT considered 5 Age Estimation systems at TRL 9, as well as one further system which was said to be between TRL8 and TRL9. As these are systems that may meet the standard of commercial readiness (it would presumably not be reasonable to require a provider to implement an immature system, given the security, privacy and business risk in doing so), the analysis that follows focusses upon whether any of these systems are deployable.

Even these systems said to be of the ‘highest technology readiness level’ were not fully investigated. One of these systems has no individual report available, whilst a further three were not subjected to a mystery shopping study to analyse usability. The investigations conducted and their limitations are provided below in Table 1.

Table 1. Only two systems out of the six have a published full evaluation, with one system having no publicly available evaluation report, and three having had no mystery shopping study conducted. The mystery shopping study for Yoti had a very small number of participants.

| Criteria | Yoti | Unissey | Privately | VerifyMy | PrivateID | Luciditi |
| --- | --- | --- | --- | --- | --- | --- |
| Individual Test Report | ✓ | ✓ | ✓ | ✓ | | ✓ |
| Mystery Shopping Study Conducted | ✓¹ | | | | | |

¹ NB: only 25 participants.

(iii) 16+ Age Gate Classification Performance on the General Population

The analysis reported by the AAATT Authors and other relevant materials

The authors of the study have published a dataset of individual datapoints in respect of recognition performance. The age estimation data (n = 28514) is divided into three groups: school (n = 4728), mystery shopper (n = 746) and batch (n = 23040). The batch testing is somewhat synthetic, with the images being of unclear provenance: it also makes up the majority of the available data points. When presenting the analysis in their final report, the authors seemingly provided an average of all the data points of those providers they selected, thus biasing the results towards the (unrealistic) batch dataset.38 For people aged 17 and over, the imbalance is extreme: there are only 833 real-world test records in the entire dataset, compared to 12411 batch records (including those of providers not selected), whilst 7467 records in total were used. For people aged 17 and above, at best 11% of the records used were from real-world tests and at worst this could be as low as 6%. This means that the asserted results for adults cannot be relied upon at all.
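Those best-case and worst-case shares follow directly from the counts just given; a short sketch (the best/worst framing is our own reading of how the 833 real-world records relate to the 7467 used records):

```python
# Record counts for ages 17+ taken from the published AAATT dataset,
# as summarised in the text above.
real_world_17plus = 833     # school + mystery-shopper records, all providers
batch_17plus = 12411        # batch records, including non-selected providers
used_total = 7467           # records used for the selected providers

# Best case: every real-world 17+ record was among the records used.
best_share = real_world_17plus / used_total
# Worst case: real-world records as a share of all available 17+ records.
worst_share = real_world_17plus / (real_world_17plus + batch_17plus)

print(f"{best_share:.1%}")   # roughly 11%
print(f"{worst_share:.1%}")  # roughly 6%
```

Either way, the overwhelming majority of 17+ records behind the reported averages are synthetic batch records.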

The problem with using the batch records is that they are likely to overestimate performance, being tests not conducted in the real world. The source of these records is unclear. If they were taken from a publicly available dataset (as they seem to have been39), then this would be a ‘rookie error’, because the Age Estimation providers could have used this data for training their own systems (creating a situation somewhat akin to setting an examination where the answers were given in advance).40 The fact that the batch data (which is supposed to test confounding factors), even when adjusted to align its ages with the school data, still has a Mean Absolute Error41 markedly lower than the real-world data also suggests one or more errors were made. This can be seen in Figure 1.

Figure 1. This shows the Mean Absolute Error in age values for the ‘batch’ dataset adjusted to remove over-18s, compared to the real-world School Dataset. The results are not what would be expected if the batch dataset had been correctly prepared to be suitably challenging: rather, it should perform worse than the real-world usage results.
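The comparison in Figure 1 reduces to computing the mean absolute error separately per dataset; a minimal sketch of that computation, assuming per-record true and estimated ages (the field names and values here are illustrative, not the published schema or trial data):

```python
def mean_absolute_error(records):
    """Mean absolute error between true and estimated ages."""
    errors = [abs(r["estimated_age"] - r["true_age"]) for r in records]
    return sum(errors) / len(errors)

# Illustrative records only -- the field names ("dataset", "true_age",
# "estimated_age") are assumptions about the published CSV, not its schema.
records = [
    {"dataset": "batch",  "true_age": 14, "estimated_age": 15.1},
    {"dataset": "batch",  "true_age": 12, "estimated_age": 12.8},
    {"dataset": "school", "true_age": 14, "estimated_age": 17.5},
    {"dataset": "school", "true_age": 12, "estimated_age": 15.0},
]

# Restrict the batch data to under-18s so its age mix matches the school data.
batch_u18 = [r for r in records if r["dataset"] == "batch" and r["true_age"] < 18]
school = [r for r in records if r["dataset"] == "school"]

print(mean_absolute_error(batch_u18))  # batch MAE
print(mean_absolute_error(school))     # school MAE
```

If the batch images were suitably challenging, the first number should not come out markedly lower than the second, which is the anomaly Figure 1 illustrates.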

Separately to the official report, the AAATT authors also provided reports on individual providers, of which five were published on the official website. These can be used as a starting basis for a valid analysis.

What the AAATT actually found

In this case the individual reports provide adequate evidence to suggest that most of the systems do not operate validly in the real world. The key statistics, in respect of an age gate of 16+, are as follows:

  • Yoti. Based on the school testing, allows (at least42) 30% of 13-year-olds and 56.4% of 15-year-olds to access age-gated content.43

  • Luciditi. Based on the school testing, allows for (at least) 45% of 10-year-olds and 70% of 15-year-olds to access age-gated content.44

  • VerifyMy. Based on the school testing, allows for (at least) 30% of 13-year-olds and 68% of 15-year-olds to access age-gated content (there were only four samples below the age of 13, so the performance could be worse).45

  • Unissey. Based on the school testing, allows for (at least) 34% of 12-year-olds and 84% of 15-year-olds to access age-gated content.46

  • Privately. Based on the school testing, allows (at least) 43% of 10-year-olds and 87% of 15-year-olds to access age-gated content.47 Despite this, the Age Assurance Trial Website says in its ‘summary of results’ that it is a ‘strong example of a privacy-preserving, operationally ready technology that eliminates the need for identity documents or server infrastructure’.48

Based on those reported performance figures alone, Luciditi and Privately should be rejected as offering a viable system. A system that would respectively allow 45% and 43% of (real) 10-year-olds through a 16+ age gate is neither effective nor reasonable to deploy in any sense. This performance is little better than chance and is likely a best-case scenario, as a more rigorous testing approach could make this even worse.49 The position in respect of VerifyMy is unclear, due to a lack of samples being collected. Yoti was able to reject all 10-year-olds tested upon, but only by setting the age gate to 18, and thus having a high rate of false positives amongst people aged between 18 and 21 (in the case of a 16+ age gate, 7.14% of 10-year-olds were let through). It follows that only Yoti was shown to be a potentially effective system on this criterion.
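The per-provider percentages above are, in effect, per-age false-acceptance rates at the 16+ gate; a hedged sketch of how such a rate can be recomputed from per-record estimates (field names and values are illustrative, not the published schema or trial data):

```python
def pass_rate_at_gate(records, true_age, gate=16):
    """Share of people of a given true age whose estimated age clears the gate."""
    cohort = [r for r in records if r["true_age"] == true_age]
    passed = [r for r in cohort if r["estimated_age"] >= gate]
    return len(passed) / len(cohort)

# Illustrative records only; not trial data.
records = [
    {"true_age": 10, "estimated_age": 17.2},  # wrongly admitted
    {"true_age": 10, "estimated_age": 12.1},  # correctly blocked
    {"true_age": 15, "estimated_age": 18.4},  # wrongly admitted
    {"true_age": 15, "estimated_age": 14.9},  # correctly blocked
]

print(pass_rate_at_gate(records, true_age=10))  # fraction of 10-year-olds let through
```

A small per-age cohort (as with VerifyMy's four under-13 samples) makes this statistic highly unstable, which is why the report's sparse sampling matters.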

(iv) Indigenous People of Color

The Analysis of the AAATT

The AAATT’s approach was to report only a single test based purely on skin tone, conducted on a batch dataset. On this basis, they computed the ‘mean absolute error’ and compared it across three different bands of skin tone, measured using the Fitzpatrick Skin Tone scale.

There are at least three flaws with this approach which require a more focussed analysis:

  1. Skin color is not the only potential confounding factor. The AAATT authors assume that skin color is the only potential confounding factor in respect of First Nations people, to the exclusion of other factors such as facial geometry. This is inconsistent with existing evidence, which suggests indigenous people have highly heterogenous facial geometry, even within a single tribe.50 The more heterogenous the facial geometry, the less likely a system is to succeed (unless it happened to have seen very similar examples in non-indigenous training data).51 At the least, this factor should have been investigated.
  2. Subgroup disadvantage. The approach overlooks the risk of subgroups within the indigenous community being placed at a disadvantage. If a substantial subgroup were to be disadvantaged, then it would still be unreasonable to deploy such a system. Given it is unclear how this dataset was collected, there was also no control for potential ‘box-tickers’ (a concern of many within the First Nations community52), which presents a further risk to be mitigated, especially where the risk of unfair discrimination by these systems arises from the degree of genetic indigenous heritage (i.e. ancestry prior to European settlement), not the cultural aspect. The AIATSIS Code of Ethics for Aboriginal and Torres Strait Islander Research, which the AAATT treated as mandatory, requires that ‘generalisation or extrapolation of findings that masks diversity can do harm and such risks should be considered in the analysis of data’.53 As ethnic diversity is well known to be a confounding factor for AI systems, it is important that this be fully evaluated.54
  3. Lack of real-world analysis. In this case, only the batch photos were considered in this analysis, yet these are likely to overestimate performance. Real world performance can be assumed to be worse.

Fortunately, further data was collected (but seemingly not analysed) by the AAATT Trial, with each entry in the dataset labelled as to whether the subject was from a First Nations background. The AAATT Trial simply did not conduct a more detailed investigation. Yet even using the existing data, it is possible to control for two of these factors together. This Technical Report therefore presents some of the further analysis that the AAATT should have already done itself.

Analysis of Batch Data for Indigenous People of Color

Skin tone data was only reported in respect of the Batch Data, which was placed into three bands, as summarised by Figure 2. If there were to be poor performance on Fitzpatrick scores V and VI, then a system would still be unacceptable.55 Accordingly, the subset of the datapoints being First Nations people with Fitzpatrick scale values of V and VI was extracted for analysis. After filtering out the cases where no definite age was recorded or outputted, we were left with 220 datapoints. Although the providers were purportedly anonymised in the dataset, we were able to identify Privately as being ‘provider_G’, and Yoti as being ‘provider_N’. We therefore present separate analyses for these systems as well as the overall analysis across all recorded data.

Figure 2: Distribution of Fitzpatrick Scores for the Batch data coded as being from First Nations people.

It should be borne in mind that, as the ‘batch’ dataset was used, these results still represent an overestimate of classification performance. Nevertheless, these findings are sufficient to show a wholly inadequate classification performance in respect of Indigenous People of Color. The confusion matrices in Figure 3 provide the raw results. The most striking case is Privately, which would wrongly allow 84% of the tested First Nations people under 16 through the age gate, whilst Yoti allowed 48% through. The overall average across all available data allowed 59.5% of under-16s through the age gate.

Figure 3. Confusion Matrices for Privately, Yoti and across all systems where data was collected (the ‘overall case’). The green coloured entries are those where the system was correct, whereas the red ones provide the number of errors.
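The subset extraction and the confusion matrices of Figure 3 can be sketched as follows; the field names ("first_nations", "fitzpatrick") are assumptions about the dataset's coding, and the records are invented for illustration:

```python
def confusion_matrix_16(records, gate=16):
    """2x2 counts for a 16+ gate: actual under/over 16 vs classified under/over 16."""
    counts = {"u16_blocked": 0, "u16_passed": 0, "o16_blocked": 0, "o16_passed": 0}
    for r in records:
        actually_under = r["true_age"] < gate
        passed = r["estimated_age"] >= gate
        if actually_under:
            counts["u16_passed" if passed else "u16_blocked"] += 1
        else:
            counts["o16_passed" if passed else "o16_blocked"] += 1
    return counts

# Illustrative records only; the values are invented, not trial data.
records = [
    {"first_nations": True,  "fitzpatrick": "V",   "true_age": 13, "estimated_age": 19.0},
    {"first_nations": True,  "fitzpatrick": "VI",  "true_age": 12, "estimated_age": 13.5},
    {"first_nations": False, "fitzpatrick": "III", "true_age": 13, "estimated_age": 12.0},
    {"first_nations": True,  "fitzpatrick": "V",   "true_age": 17, "estimated_age": 18.0},
]

# Restrict to First Nations subjects with Fitzpatrick skin tone V or VI,
# mirroring the subset extraction described in the text.
subset = [r for r in records if r["first_nations"] and r["fitzpatrick"] in ("V", "VI")]
cm = confusion_matrix_16(subset)
print(cm)
```

The "u16_passed" cell is the critical one: it counts real under-16s admitted by the gate, the figure that reaches 84% for Privately in the actual data.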

The underlying issue is that existing systems tend to considerably overestimate the age of First Nations People of Color. This can be seen in Figure 4. The histograms similarly illustrate that these systems are unworkable for First Nations people, with a considerable bias towards overestimating age (in a dataset where most individuals are under 16 and the mean age is just over 13 years). Whilst Privately’s performance is particularly poor, Yoti still has some serious overestimates (such as a 13-year-old whom it believed to be over 24) and would have allowed multiple 12-year-olds through the age gate. This is despite the ‘batch’ data presenting a less challenging scenario than the real-world case.

Figure 4. Histogram of Age Estimation Errors for Privately, Yoti and across all systems where data was collected (the ‘overall case’). The red bars indicate an underestimate of age, whilst a blue bar indicates an overestimate.

(v) Real World Indigenous Results

This analysis involves effectively swapping out one potential confounding factor for another. The AAATT did not provide any Fitzpatrick scale values for the mystery shoppers and school children who participated in real-world testing. Most participants would have come from built-up areas, thus presenting a skewed subpopulation of First Nations people, and potentially even including ‘box-tickers’. Yet an analysis of the real-world scores can still be informative, as long as it is considered with due caution. It simply presents a different source of potential overestimation of performance, whose magnitude is unknown.

As Privately does not have specific age estimates reported in the real-world data, only Yoti is considered as an individual system in this case, alongside the overall average results. The histograms are worth considering. Whilst Yoti appears to perform reasonably well, the sample size is small and the failure to limit the data to First Nations People of Color may have confounded the analysis, meaning no positive conclusions should be drawn; it should also be noted that Yoti still erroneously treated 34.6% of under-16s as being over 16. Further real-world investigations of the Yoti system should therefore be conducted. The overall histogram shows that some systems routinely make gross overestimates of the age of First Nations people, in extremis including a 15-year-old that one system marked as being 62.83 years old, and another 15-year-old that a different provider labelled as being 57 years old. Age overestimations of 10 or more years were not uncommon.

Another relevant point of comparison is the performance across different groups. Overall, the data for First Nations people generally was considerably worse than for Sub-Saharan Africans and people from Oceania, as can be seen from the means and distributions in Figures 5 and 6 (although it is possible that the Oceania category also includes some First Nations people and those of other non-white ethnic origins). The mean error is considerably higher for the First Nations grouping (at 4.12) compared to 2.45 for the African group and 2.56 for the Oceania group: see Figure 6. The result for the First Nations group is also considerably higher than the Mean Absolute Error reported for people with dark skin (i.e. Fitzpatrick Scores V or VI) in the trial report, which arrived at 3.01 (and the Mean Absolute Error is inherently at least as high as the Mean Error to begin with).56

Figure 5. Histograms of age estimation error based on the real-world data collected in the trial for Yoti and the ‘Overall’ case in respect of First Nations. The red bars indicate an underestimate of age, whilst a blue bar indicates an overestimate.

Figure 6: Histograms of age estimation errors in the ‘Overall’ case for data coded as Sub-Saharan African and Oceania or Antarctica in the dataset. The red bars indicate an underestimate of age, whilst a blue bar indicates an overestimate.
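The group means quoted above are signed mean errors (estimated minus true age), computed per grouping; a minimal sketch, with invented values and an assumed 'group' field:

```python
def mean_signed_error(records):
    """Mean of (estimated - true) age; positive values indicate overestimation."""
    errors = [r["estimated_age"] - r["true_age"] for r in records]
    return sum(errors) / len(errors)

# Illustrative records; the group labels mirror the dataset's regional coding
# but the values are invented for demonstration, not taken from the trial data.
records = [
    {"group": "first_nations", "true_age": 13, "estimated_age": 18.0},
    {"group": "first_nations", "true_age": 15, "estimated_age": 18.5},
    {"group": "oceania",       "true_age": 14, "estimated_age": 16.0},
    {"group": "oceania",       "true_age": 15, "estimated_age": 15.5},
]

by_group = {}
for r in records:
    by_group.setdefault(r["group"], []).append(r)

for group, rows in sorted(by_group.items()):
    print(group, mean_signed_error(rows))
```

Because positive and negative errors partially cancel, a signed mean error of 4.12 implies a Mean Absolute Error at least that large, which is the comparison drawn against the report's 3.01 figure.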

(vi) Disabled People

Despite disabled people making up over 20% of Australians, the AAATT did not investigate this issue.57 Instead, the AAATT conflated accessibility and the WCAG web accessibility guidelines with full disability inclusion.58 The WCAG guidelines are useful for designing websites so that they can be navigated by disabled people, and provide some assistance for mobile experiences, e.g. by ensuring that buttons are of an appropriate size and contrast, that a screen reader can process the content, and so forth. The W3C, which publishes WCAG, is clear that it is ‘not sufficient by itself to ensure accessibility in mobile applications’.59

A facial age estimation system is likely to have particularly poor performance for people with certain disabilities. One particularly cruel example is the nearly 1% of Australians who live with a facial disfigurement, who already encounter serious problems with AI: now imagine being subjected repeatedly to intrusive verification requirements each time one wishes to access social media, and being reminded not just of one’s disability, but perhaps of a horrific accident that led to it.60 There are other disabilities which are not facial disfigurements but nevertheless affect appearance: these include conditions such as various forms of dwarfism, cerebral palsy (which can impact facial movements), Down syndrome and even having a vision impairment (as this impacts eye movement). These are all potential confounding factors which are likely to reduce the confidence and accuracy of any AI system which was not designed and proven to work with these groups. The failure to test for these concerns is a serious error.

There is another, more subtle, error. Some disabilities can impact device usage: a person with a mobility impairment might not be able to hold a camera still, thereby reducing the quality of the video (the authors did not test for this issue, even in synthetic analysis: instead they were concerned with other techniques of evading the system).61

It follows that there is no evidence that existing systems do not discriminate on the basis of disability. Unless and until such evidence is provided in respect of real-world classification performance, with each relevant disability subgroup considered separately, it would not be reasonable to require a platform to use such a service.

(vii) Privacy Evaluation

Unfortunately, Privacy and Security were not evaluated to a sufficient standard, the evaluation being based on what the Trial termed a ‘static review’.62 The main method was the use of ‘Practice Statements’, which were said to be ‘structured self-declarations’ provided by technology companies.63 They also reviewed the technology companies’ own privacy policies, which were evaluated against a checklist.64 They further conducted vendor interviews, which were said to be a basis for ‘substantiat[ing] claims made in written submissions’.65

The authors admit in their methodology that ‘cryptographic penetration testing and full security validation of vendor systems were out of scope’.66 Nor do these systems appear to be open source, preventing other experts from conducting a thorough evaluation, including of any changes deployed in the future. Nevertheless, the authors felt able to make striking claims, such as that ‘Yoti excelled in frictionless, privacy-focused AV’ and was ‘one of the most privacy-forward platforms in the trial’.67 The evaluation of the trial asserted that there was ‘no data retained post-verification’, without taking any steps to validate this.68 A similar approach of exaggerated findings was taken for all the other systems.

Given that the AAATT does not provide sufficient evidence of the privacy and security standards being met for any system, it follows that it is not a reasonable step for a social media service to integrate these services, unless or until such evidence is provided. A technology company is not required to disregard the privacy rights and legitimate expectations of end users, even if that is on the alleged basis of protecting children (who themselves have privacy needs).

(viii) Usability Evaluation

There are significant problems with the analysis and investigation conducted by the trial authors. As a headline claim, the AAATT authors assert that age estimation would be ‘typically 40 secs or under to complete estimation’.69 But this claim is misleading in two respects:

  • Increased classification performance takes longer. The claim avoids addressing this trade-off: the systems with the weakest classification performance took the shortest time (which is why Privately is quicker than Yoti). A system that is actually likely to be deployed will take longer.

  • Usability is needed for all end users, not just some of them. The claim also overlooks the problem of relying on the median when the goal is widespread usability across the population. It is not good enough for a system to be adequate for only 50% of the population: it needs to work well for over 90% of people.

Figure 7 provides an illustration of this point. For instance, Yoti had a median estimation time of 55 seconds, but at least 25% of people would wait at least 75 seconds and 10% of people would wait nearly 2 minutes, with some extreme cases waiting considerably longer than that.

Figure 7 – Histograms of the time taken to return an age estimation result for Yoti, Privately and the ‘overall case’.
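The gap between a median and the tail of a waiting-time distribution can be illustrated with a short sketch. Everything below is synthetic: the log-normal parameters are assumptions chosen to produce a right-skewed distribution broadly of the shape described above, and the figures are not the trial data.

```python
import random
import statistics

# Synthetic, right-skewed completion times in seconds (assumed log-normal
# parameters, for illustration only -- this is NOT the AAATT trial data).
random.seed(0)
times = sorted(random.lognormvariate(4.0, 0.5) for _ in range(10_000))

def percentile(sorted_xs, p):
    """Nearest-rank percentile of a pre-sorted sample."""
    return sorted_xs[round(p / 100 * (len(sorted_xs) - 1))]

median = statistics.median(times)
p90 = percentile(times, 90)
p99 = percentile(times, 99)

# In a right-skewed distribution, tail waits far exceed the median, so a
# median-only headline figure understates the worst typical experiences.
print(f"median: {median:.0f}s, p90: {p90:.0f}s, p99: {p99:.0f}s")
```

On these synthetic numbers the median sits near e⁴ ≈ 55 seconds, while the 90th and 99th percentiles come out at roughly double and triple that: a median-only headline conceals exactly the users for whom the system is least usable.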

The wider user experience trial also has many problems. Whilst the data covers a range of platforms, unfortunately it is not possible to separate them out, nor is it possible to know which platform was used (unlike with the other published dataset). Most systems were not evaluated at all by mystery shoppers. Nevertheless, one can presume that a substantial proportion of cases cover age estimation, especially given the comments and criticisms. There was no demographic breakdown, meaning that we do not know the experience of vulnerable members of the public, or whether each system was consistently tested (e.g. there was no adjustment based on the overall experience of the person in question with other technologies, as some people are more likely to score systems positively than others). Nor was there any apparent adjustment for the fact that the most dissatisfied users would not continue with the study.

Even with this deeply flawed investigation, we can conclude that the usability standards of existing systems are poor. Many users in the mystery shopper trial did not have a satisfactory experience, based on their self-reports. This is consistent with the statistics: among adults, excluding those who indicated neutral, 15.7% were dissatisfied and 17.4% were very dissatisfied, whilst nearly 30% of adults rated the activity as either hard or very hard.70 Over 50% of the missions in question either did not work at all, or involved at least some significant issues for the participant.71 A system which does not provide a satisfactory experience for a significant proportion of the population is not reasonable to require the public to use.
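The choice of denominator matters when percentages exclude neutral responses, as the figures above do. A small sketch makes this concrete (the rating counts below are invented for illustration and are not the mystery shopper data): the same raw counts yield a higher dissatisfaction share once neutrals are dropped from the base.

```python
from collections import Counter

# Invented rating counts, for illustration only -- not the AAATT data.
counts = Counter({"very dissatisfied": 120, "dissatisfied": 108,
                  "neutral": 200, "satisfied": 180, "very satisfied": 82})

total = sum(counts.values())
non_neutral = total - counts["neutral"]
dissatisfied = counts["dissatisfied"] + counts["very dissatisfied"]

# Excluding neutrals shrinks the denominator, so identical raw counts
# produce a larger reported dissatisfaction percentage.
share_excl = dissatisfied / non_neutral   # base: non-neutral responses
share_incl = dissatisfied / total         # base: all responses
print(f"excluding neutral: {share_excl:.1%}; including neutral: {share_incl:.1%}")
# -> excluding neutral: 46.5%; including neutral: 33.0%
```

Whichever base is used, the point in the text stands: the reader needs to know which denominator a headline percentage rests on before comparing it across systems.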

The range of causes of dissatisfaction included:

  • Technical issues (e.g. inability to access the camera on a laptop, followed by failure on a second attempt on the person’s phone; frequent crashing of the software).
  • Lack of clarity around privacy. For example, it was not clear to some users how the information would be stored and secured.
  • Appearing to only work on certain devices (e.g. only on phones, rather than laptops).
  • Not having the required (identity) documentation to make the system operate.

Despite the AAATT’s highly skewed approach in favour of providers, even those systems said to be at the highest level of technology readiness did not provide a uniformly satisfactory user experience.72 It follows that there is no evidence that these systems are presently reasonable to deploy from a usability perspective.

(ix) Summary of Age Estimation Findings

There is no system which has been demonstrated to meet the ‘reasonable step’ standard in the legislation. Of those systems said to be at the highest level of commercial readiness, most do not work at all as an age gate for 16-year-olds. The very young were routinely treated as being over 16 years old. The systems have not been demonstrated to work for most Australians, whether from a classification performance perspective (e.g. very poor performance for First Nations people of color) or a usability perspective (a large number of people finding them unusable). Even the most promising system – Yoti – requires considerable further investigation to see if it can be deployed in practice.

None of this is to say that a future system cannot be made to work with further research and improvements. For example, if Yoti could be run on device, it might be a more viable solution; as devices become more powerful, this could well become possible. This is perhaps especially so if a multi-modal system can be developed, which uses more than facial analysis alone to improve performance. But such a system can only be acceptable if the data does not leave a user’s device, rather than being based on trusting technology companies with the data in question. Accordingly, a future system that would constitute a reasonable step is no more than a theoretical possibility: one that is not yet commercially available and might never be.

C. Compliance with Australian Research Ethics Standards

The failings above would themselves amount to breaches of the Australian norms around research ethics.73 Yet the main issue is deeper and relates to an apparent lack of compliance with the expected research governance standards in Australia.

In Australia, the expectation is that, aside from ‘low risk’ research, a Human Research Ethics Committee (HREC) is convened to review and approve the research.74 The AAATT trial authors appear not to have classified the work as ‘low risk’, which is unsurprising given its nature. The study authors claimed that they had given due weight to the ‘National Statement on Ethical Conduct in Human Research 2023’ (‘the National Statement’): this claim extended to the advertisements provided to participants.75 Yet at least the following provisions of the 2023 National Statement were seemingly not complied with.

5.1.11 If a research project is assessed as having more than low risk, it must be reviewed by an HREC [Human Research Ethics Committee]. The HREC review should include consideration of any proposed approaches to minimising or mitigating any risks associated with the research. …

5.1.14 Where institutions establish non-HREC pathways for ethics review of lower risk research, that review must: (a) be carried out by people who are familiar with the National Statement … (b) be informed by guidance provided in other sections of the National Statement; (c) include clear criteria for referring review to an HREC where risk that is greater than low risk is identified during non-HREC review.

It appears that the research was assessed as being more than ‘low risk’: there was no mention of any low-risk assessment, or of any such process being followed. Rather, the project formed what it called an ‘Ethics Committee’, which appears to be a purported ‘HREC’. The final report claimed that:

‘The Project Ethics Committee was established to ensure that the Trial upheld the highest standards of ethical practice across its design, delivery and reporting phases. Its core function was to act as the primary governance body for ethical scrutiny and guidance, with a particular focus on safeguarding children, respecting privacy, managing impartiality and ensuring transparency and fairness throughout the Trial.’76

This was followed by a relatively extensive set of claims about addressing the types of issues that an HREC would ordinarily address, including the establishment of an Ethics Committee that purported to mirror the structure of an HREC.77

Minimum membership of an HREC

5.1.30 The minimum membership of an HREC is eight and must include the following categories:

  1. a chairperson with suitable experience, including previous membership of an HREC, whose other responsibilities will not impair the HREC’s capacity to carry out its obligations under the National Statement;

  2. a qualified lawyer, who may or may not be currently practicing and, where possible, is not engaged to advise the institution on research-related or any other matters;

  3. two people with current research experience that is relevant to research proposals to be considered at the meetings they attend.

5.1.31 No individual may represent more than one of the categories listed in 5.1.30 at any individual meeting, but may fill a different category at a separate meeting, so long as all minimum membership categories are represented at each meeting.

The provision in question seems not to have been complied with. The Committee did have eight people, but these included an ‘Apprentice’ who served as its secretary.78 The full membership was listed in the Ethics Handbook as being:79

For some reason, Mr Hammond (an individual whose degree was in ‘equine studies’) was not listed as a member of the Committee in the final report.80 It is unclear who is said to be a person with ‘current research experience’, aside from Dr Ali. Mr Billinge is a relatively recent (2020) graduate of a Classics and Classical Languages, Literature and Linguistics degree from Durham University and, prior to the trial, worked as a Policy Manager for OFCOM (a UK regulator). It is unclear how Mr Billinge met the criteria to chair an HREC: he has no PhD or academic research publications, and seemingly no experience of being a member of an HREC. Under the National Statement, the Trial Legal Counsel also should not have been on the ethics committee, unless it was not possible to find an independent lawyer.

5.1.40 Members should be appointed to an HREC using open and transparent processes. Institutions should consider reviewing appointments to the HREC at least every three years.

5.1.41 Members should be appointed as individuals for their knowledge, qualities and experience, and not as representatives of any organisation, department or group. Individuals that represent the institution (i.e. ex officio) may attend HREC meetings as observers, but are not to be appointed as members or be involved in the deliberations or decision-making of the HREC.

These provisions do not appear to have been complied with. There appears to have been no open appointment process: how the people came to be appointed is wholly unclear. KJR was the Australian party conducting the research: Dr Ali is even listed as an ‘eminent specialist scientist’ responsible for ‘the development of the approach to evaluation’.81 Similarly, it is not appropriate for any members to be ‘representatives’ of KJR (let alone three of them), when the committee is supposed to be independent of it. It is remarkable for an individual responsible for conducting the research to serve as an HREC member for the same research. Nor is it clear what ‘knowledge, qualities and experience’ these members held that served as the basis for their appointment.

If the norms of ethical research governance had been met, then it is possible that the serious scientific errors identified above would not have been made by the AAATT authors.

D. Conclusion

This Technical Report does not present an exhaustive investigation. Yet the findings set out above are sufficient to place the entire AAATT into jeopardy. No one can be sure that the other aspects of its investigation were done correctly or impartially. The scale and nature of the errors already identified would lead to the retraction of an ordinary scientific paper, and one would expect the AAATT report to be similarly retracted.

The investigations of this Technical Report demonstrate that none of the age estimation systems evaluated can be considered a ‘reasonable step’ that a service provider could be expected to employ. As the justification for other age assurance approaches rests only on the alleged performance of age estimation, those approaches cannot be justified either.

This is not to say that children cannot be protected. Measures such as parental controls on devices at least have a proven track record, as do parental supervision and limiting access to smartphones. The problem is that speculative and unevidenced alternatives are being promoted instead. This can only undermine rather than advance child safety. It can only be hoped that the Australian government puts in place proper safeguards to prevent a repeat of the AAATT and the pseudo-computer-science it proffers to the Australian people.


  1. Online Safety Act 2021 (Cth) Pt 4A; Explanatory Memorandum, Online Safety Amendment (Social Media Minimum Age) Bill 2024 (Cth) p 3.↩︎

  2. This includes the Internet Search Engine Services Online Safety Code (Class 1C and Class 2 Material) at p5, which requires reasonable steps, but only in relation to account-holders on internet search engines, as well as the Consolidated Industry Codes of Practice for the Online Industry (Class 1C and 2 Material) at pp 14- 5, which requires reasonable age assurance steps in relation to pornography.↩︎

  3. eSafety Commissioner, September 2025, Social Media Minimum Age Regulatory Guidance (see e.g. pp 4, 12-13). As we will see, these claims are misrepresentations of the underlying evidence.↩︎

  4. https://ia.acs.org.au/article/2025/-big--red-f---age-assurance-advisor-hits-out-at-final-report.html↩︎

  5. AAATT Report, Part A, p14.↩︎

  6. For example, the IEEE requires at least two independent reviewers appointed by the editor of the journal in question and who must not be identified to the authors of the research (see the IEEE Operations Manual at 8.2.2 https://pspb.ieee.org/images/files/PSPB/opsmanual.pdf).↩︎

  7. For instance, they claim to have ensured a ‘rigorous, transparent and replicable approach’ (AAATT Report, Part A, p39).↩︎

  8. By way of examples, it is unclear how the trial determined if participants were from a first nations background, or how individual systems in the trial were allocated to participants (with many systems not being subject to a full evaluation at all).↩︎

  9. Online Safety Act 2021 (Cth) s 63D; Regulatory Powers (Standard Provisions) Act 2014 (Cth) s 92.↩︎

  10. Whilst the test says they must take ‘reasonable steps’, this would necessarily be a holistic assessment and consider the approach of the social media provider as a whole. The Commissioner’s guidance purports to identify unreasonable steps (eSafety Commissioner, September 2025, Social Media Minimum Age Regulatory Guidance, p 28).↩︎

  11. Volume 1 of the Post Office Horizon IT Inquiry’s final report at [1.8]-[1.9] (available at https://www.postofficehorizoninquiry.org.uk/sites/default/files/2025-07/Post%20Office%20Horizon%20IT%20Inquiry%20Final%20Report%20Volume%201_0.pdf); The Royal Commission into the Robodebt Scheme, Final Report, pp 458-60; 488.↩︎

  12. The report was publicly released on 1st September 2025 (https://www.infrastructure.gov.au/department/media/publications/age-assurance-technology-trial-final-report ).↩︎

  13. AAATT Report, Part A, p10 (emphasis added).↩︎

  14. AAATT Report, Part A, p10.↩︎

  15. AAATT Report, Part A, p12 (emphasis added). ↩︎

  16. AAATT Report, Part A, p14.↩︎

  17. AAATT Report, Part A, p17.↩︎

  18. AAATT Report, Part A, p70. (emphasis added)↩︎

  19. AAATT Report, Part A, p73 (emphasis added)↩︎

  20. AAATT Report, Part A, p75↩︎

  21. AAATT Report, Part A, p75.↩︎

  22. AAATT Report, Part A, p76.↩︎

  23. AAATT Report, Part A, p77.↩︎

  24. AAATT Report, Part A, p77. ↩︎

  25. AAATT Report, Part A, p24. It should be noted that ‘accuracy’ is an unfortunate term to use for evaluating any classification system, as it is possible for a system to have a high reported accuracy, but in practice operate no better than chance.↩︎

  26. AAATT Report, Part A, p24.↩︎

  27. AAATT Report, Part A, p24.↩︎

  28. AAATT Report, Part A, p24.↩︎

  29. AAATT Report, Part A, p24.↩︎

  30. Online Safety Act 2021 (Cth) s 5 (definition of age-restricted user).↩︎

  31. Privacy Act 1988 (Cth) sch 2. ↩︎

  32. This appears to be the case with Privately, based on the evaluation that follows.↩︎

  33. For example, the Age Gate at 16+ (AAATT report, Part D, p52) bundles 6 unspecified providers together, whilst the Age Gate at 13+ (AAATT report, Part D, p50) bundles 7 unspecified providers together. Another diagram (AAATT report, Part D, p80) presents a Mean Absolute Error analysis, where it is unclear which providers were bundled together. This is not an appropriate way of presenting a scientific analysis, and requires closer consideration of the data.↩︎

  34. AAATT report, Part A, p49.↩︎

  35. AAATT report, Part A, p48; AAATT Evaluation Proposal p39.↩︎

  36. https://www.dst.defence.gov.au/sites/default/files/basic_pages/documents/TRL%20Explanations_1.pdf↩︎

  37. https://www.dst.defence.gov.au/sites/default/files/basic_pages/documents/TRL%20Explanations_1.pdf ↩︎

  38. AAATT Report, Part D, p 52. This can be readily concluded from the number of datapoints reported in the sample row, which considerably exceeds the total number of those available across the school and mystery shopper data (with n=12623 datapoints used).↩︎

  39. No clear data collection methodology is provided. By way of an example, the Yoti Test Report (p 6) asserts that ‘To facilitate automated testing, a dataset comprising over 1,100 selfie portraits was assembled. These portraits represent individuals aged between 14 and 23 years. Additional images were sourced from the school-based component of the trial to expand the testing dataset.’ It is wholly unclear as to where the initial selfie portrait dataset was sourced, or how, but it appears this was taken from publicly available sources.↩︎

  40. Japkowicz, Nathalie, and Zois Boukouvalas. Machine Learning Evaluation: Towards Reliable and Responsible AI. Cambridge University Press, 2024, p 52.↩︎

  41. The Mean Absolute Error (MAE) is a measure of the distance in years from a person’s real biological age, as estimated by the system in question. A MAE of 0 would therefore be the best possible result. The higher the MAE, the worse the system.↩︎

  42. Whilst the sample sizes are small, one imagines a more rigorous test may end up producing even worse results.↩︎

  43. AAATT Yoti Test Report 1, p16.↩︎

  44. AAATT Luciditi Test Report 1 (https://ageassurance.com.au/wp-content/uploads/2025/08/IndividualTestReport-Arissian_AE.pdf), p10.↩︎

  45. AAATT VerifyMy Test Report 1, p12.↩︎

  46. AAATT Unissey Test Report, p14.↩︎

  47. AAATT Privately test report p10.↩︎

  48. https://ageassurance.com.au/v/prv/↩︎

  49. The attempts to confound the system appear to have been conducted as part of the ‘batch’ testing, not the real-world testing.↩︎

  50. For example, consider de Souza, Bento Sousa, et al. “Occlusal and facial features in Amazon indigenous: An insight into the role of genetics and environment in the etiology dental malocclusion.” Archives of Oral Biology 60.9 (2015): 1177-1186.↩︎

  51. Unlike with Maori, there is not a longstanding tradition of facial tattoos within Australia’s indigenous community. This said, there will still be some Maori who now live in Australia with such tattoos (given that over 170,000 Maori live in Australia - https://www.abs.gov.au/census/find-census-data/quickstats/2021/1201_AUS ), and this would be an additional confounding factor were it to arise.↩︎

  52. See e.g. Suzanne Ingram, ‘The creative story behind ‘box tickets’, Australian, July 10 2021; Dechlan Brennan, ‘”Box tickers,” bureaucracy and sexual abuse must be targeted to close the gap – Price’, National Indigenous Times, February 11 2025; Julie Nimmo, ‘Community Leaders warn many who claim to be Indigenous could be ‘fakes’’, SBS News, 18 October 2022.↩︎

  53. AIATSIS Code of Ethics for Aboriginal and Torres Strait Islander Research at [1.2(c)]; AAATT Evaluation Proposal p81; AAATT Report p51.↩︎

  54. For example, ‘racial categories with unclear assumptions and little justification can lead to varying datasets that poorly represent groups obfuscated or unrepresented by the given racial categories and models that perform poorly on these groups’ (Jennifer Mickel. 2024. Racial/Ethnic Categories in AI and Algorithmic Fairness: Why They Matter and What They Represent. In The 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’24), June 03–06, 2024, Rio de Janeiro, Brazil. ACM, New York, NY, USA). This is another way of putting the error made by the AAATT trial.↩︎

  55. As noted in Ashling Courtney et al. “Burden of disease and unmet needs in the diagnosis and management of atopic dermatitis in diverse skin types in Australia.” Journal of Clinical Medicine 12, no. 11 (2023): 3812, ‘The skin of Australian First Nations People varies greatly between regions due to variations in parental lineage and parents of mixed races are common. Darker skin pigmentation, Fitzpatrick type 5 and 6, is more commonly seen in certain regions including the Northern Territory, North Queensland, and the Pilbara region of Western Australia’.↩︎

  56. AAATT Report, Part D, p 72.↩︎

  57. https://www.abs.gov.au/media-centre/media-releases/55-million-australians-have-disability↩︎

  58. This is apparent from the test reports, which use ‘Designed to meet WCAG 2.2 AA standards’ as the criterion. The other materials simply make vague references to accessibility.↩︎

  59. https://www.w3.org/TR/wcag2mobile-22/ (at 1.2). ↩︎

  60. https://faceequalityinternational.org/FEI_2024_survey_results.pdf↩︎

  61. According to the AAATT report, there was some testing with ‘blurred, low-resolution or altered photos’, but the concern seems to be false positives (rather than false negatives for those over the age of 16) – see Report Part A, p32. This is a very different type of test to what was needed.↩︎

  62. AAATT Report, Part B, p33.↩︎

  63. AAATT Report, Part B, pp 16-7.↩︎

  64. AAATT Report, Part B, pp 20-21.↩︎

  65. AAATT Report, Part B, p24.↩︎

  66. AAATT Report, Part B, p33.↩︎

  67. https://ageassurance.com.au/v/yot/↩︎

  68. AAATT Report, Yoti Test Report 1, p4.↩︎

  69. AAATT Report, Part A, p78.↩︎

  70. A focus upon adults is important, because the consequences are different. An adult who has been locked out has been wrongly locked out. By contrast, a child is not supposed to have access to begin with, so their user satisfaction is not of concern.↩︎

  71. 36% said they did not complete it all, with a further 16% saying they only completed the Mission in question ‘with some issues’. Only 48% of trials worked without problems.↩︎

  72. For example, Privately, which was relatively extensively tested, was not satisfactory to nearly 14% of those who tried it, with 8% saying the task was either Hard or Very Hard (Privately Test Report, pp 18-19), and nearly 25% encountering at least some issues in completing the task.↩︎

  73. As an example, consider the principles of ‘Research Merit and integrity’ in the National Statement on Ethical Conduct in Human Research 2023 at [1.1] - [1.3]. There is little purpose in rehearsing all the errors and how they are inconsistent with the statement.↩︎

  74. National Statement at [5.1.11].↩︎

  75. For example, the ‘Human Test Subjects Protocol’ says that ‘The protocols set out in this document are informed by the National Statement on Ethical Conduct in Human Research 2023’ (see page 5), whilst the School Information Pack says ‘for more information on the trial’s ethical approach, you can read the Ethics Handbook’ (page 7) and that handbook in turn says ‘Australian Government’s National Statement on Ethical Conduct in Human Research (the “National Statement”) has been a key source for this Handbook, and team members are encouraged to refer to the National Statement in addition to this document.’ (see page 8, with similar assertions appearing on pages 9 and 13).↩︎

  76. AAATT Report, Part B, p 48.↩︎

  77. AAATT Report, Part B, pp 48-55.↩︎

  78. This refers to ‘Abby Solway’, who is listed as an ‘Apprentice’ in the AAATT Report (Part B, p 49,83).↩︎

  79. Ethics Handbook, p22. ↩︎

  80. AAATT Report, Part B, p49.↩︎

  81. AAATT Project Plan, p13. Dr Ali’s available publication record does not appear to be consistent with a claim of being an eminent scientist, with few publications or citations: https://scholar.google.com/citations?user=SkW8zfIAAAAJ&hl=en.↩︎