We need to define some kind of "flashlight benchmark" that takes the most important flashlight characteristics into account. For example, that factor could be derived from equations that look something like this:

Flashlight lumen benchmark = Light_output_max [lm] * Run_time_till_90%_of_max [sec] / Flashlight_mass_with_batteries [g]

Flashlight candela benchmark = Luminous_intensity_max [cd] * Run_time_till_90%_of_max [sec] / ( Flashlight_mass_with_batteries [g] * Head_diameter^2 [mm^2] )
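
To make the two equations concrete, here is a minimal Python sketch of them. The function names, argument names and example numbers are my own placeholders, not measurements of any real light; units follow the equations above (lm, cd, sec, g, mm).

```python
def lumen_benchmark(output_max_lm, run_time_s, mass_g):
    """Max light output * run-time till 90% of max / mass with batteries."""
    if mass_g <= 0:
        raise ValueError("mass must be positive")
    return output_max_lm * run_time_s / mass_g


def candela_benchmark(intensity_max_cd, run_time_s, mass_g, head_diameter_mm):
    """Max intensity * run-time till 90% of max / (mass * head diameter squared)."""
    if mass_g <= 0 or head_diameter_mm <= 0:
        raise ValueError("mass and head diameter must be positive")
    return intensity_max_cd * run_time_s / (mass_g * head_diameter_mm ** 2)


# Hypothetical light: 1000 lm, 10000 cd, 600 sec till 90%, 100 g, 25 mm head.
print(lumen_benchmark(1000, 600, 100))          # 6000.0
print(candela_benchmark(10000, 600, 100, 25))   # 96.0
```

The two scores have different units, so they are only meant to be compared lumen-to-lumen and candela-to-candela between lights, not with each other.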

If you ask yourself what the ideal flashlight is, you quickly arrive at the answer: the ideal flashlight is extremely bright or extremely throwy for an extremely long time, while its mass is close to zero and its dimensions/volume are very small.

The equations above are basically a mathematical form of this "ideal flashlight" statement, so a flashlight with a higher calculated benchmark should be closer to the "ideal flashlight".

Note that I defined run-time as the time until the output drops to 90% of the initial maximum or lower (for example, because of a thermal kick-down). This prevents small lumen monsters from getting scores that are too high: despite their high initial lumen output, their run-time will be relatively short, so their total score would be no better than that of a light with lower max. output but a longer run-time till <90%.
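
As a rough illustration of the 90% rule, here is a small Python sketch that reads the run-time off a logged output curve and then scores two lights. Everything in it is an assumption for the sake of the example: the function name, the made-up sample logs and masses, and the choice to take the peak of the log as the max. output and to report the first logged sample at or below 90% (no interpolation between samples).

```python
def run_time_till_90(samples):
    """Return the time [sec] of the first sample at or below 90% of peak output.

    samples: list of (time_s, output_lm) pairs, sorted by time.
    If the output never drops that far, the last logged time is returned,
    which is then only a lower bound on the real run-time.
    """
    if not samples:
        raise ValueError("need at least one sample")
    # Index of the first sample with the highest output (the peak).
    peak_index = max(range(len(samples)), key=lambda i: samples[i][1])
    threshold = 0.9 * samples[peak_index][1]
    for time_s, output_lm in samples[peak_index:]:
        if output_lm <= threshold:
            return time_s
    return samples[-1][0]


# Made-up logs: a small "lumen monster" with thermal kick-down, and a steady light.
monster_log = [(0, 3000), (30, 3000), (60, 1200), (3600, 1200)]
steady_log = [(0, 1000), (1800, 980), (3600, 950), (5400, 890)]

monster_time = run_time_till_90(monster_log)   # 60 sec
steady_time = run_time_till_90(steady_log)     # 5400 sec

# Lumen benchmark = max output [lm] * run-time till 90% [sec] / mass [g]
monster_score = 3000 * monster_time / 80       # hypothetical 80 g light
steady_score = 1000 * steady_time / 120        # hypothetical 120 g light
print(monster_score, steady_score)             # 2250.0 45000.0
```

With these invented numbers the steady 1000 lm light scores far higher than the 3000 lm light that steps down after a minute, which is the effect the 90% rule is meant to produce.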