Automated Theorem Finding


Originally, these are schemes of axioms: each formula represents infinite family of axioms obtained by substitution of formulas for A, B, C. Here, these formulas are individual axioms.
Common rules of inference are modus ponens and substitution.
Modus ponens or detachment. If F and (F → G) are theorems of propositional calculus, where F and G are propositional formulas then G is the theorem of propositional calculus. For instance, (A → A) and ((A → A) → (B → (A → A))) are theorems of propositional calculus. By modus ponens, the formula (B → (A → A)) is also theorem of propositional calculus. The formulas F and (F → G) are called premise minor and premise major respectively.
Substitution. If F is a theorem of propositional calculus, A_{1},..., A_{n} are variables and F_{1}, ..., F_{n} are formulas of propositional calculus then formula F(F_{1}/A_{1}, ..., F_{n}/A_{n}) obtained by simultaneous substitution of F_{1} for A_{1}, ..., F_{n} for A_{n} in F is also the theorem of propositional calculus. For instance, (A → A) is a theorem of propositional calculus; by substitution of (A → B) for A, the formula ((A → B) → (A → B)) is also theorem of propositional calculus.
Substitution can be applied on any theorem and used for derivation of infinitely many theorems. All theorems derived by substitution are less interesting than original one, however, they can be used in some application of modus ponens. Because of that, these two rules are integrated in one, more suitable for automated theorem finding.
Combined rule. If F and (G → H) are theorems of propositional calculus, where F, G and H are formulas of propositional calculus and there is a substitution u, such that u is unification of F and G, i.e. u(F) = u(G), then u(H) is also theorem of the propositional calculus. Without loss of generality it is required that u(H) is in canonic form, i.e. the variables occur in alphabetical order. For instance, the formula ((B → (A → B)) is not in canonic form and the formula ((A → (B → A)) is in canonic form.
The theorems of propositional calculus are derived sequentially.
The algorithm is arguably the simplest possible algorithm for automated theorem finding in propositional calculus. It is complete: all theorems of the propositional calculus (in canonic form) will occur at some place in infinite sequence. It can be said even more: all theorems will occur infinitely many times in that sequence: if F is a theorem, then for every theorem G, formula (G → F) is also theorem and F can be derived from these two. However, duplicates can be simply omitted from the sequence of theorems.
The algorithm is implemented in newLISP dialect of Lisp. Propositional formulas are naturally represented as Lisp symbolic expressions and some frequently needed operations like unification, are supported.
The program is executed until first 100 000 theorems are derived. Application of the combined rule is attempted 1 127 490 times. Application of combined rule failed in 50% of attempts because premise major wasn't implication. In 34% of attempts unification of antecedent in premise major and premise minor wasn't possible. In 16% of attempts unification succeeded and conclusion is developed. In 7% of attempts, conclusion was already derived earlier during executions of the program. In 9% of all attempts, conclusion was previously unknown.
Following table contains the list of all theorems of the size (defined as number of operator and variable occurrences) five or less derived by program, with some related data.
i  T_{i}  d(T_{i})  T_{i}  proof(T_{i})  proof(T_{i})  c(T_{i})  
1  86  (A → A)  410  3  33  5  3.21 
2  1  (A → (B → A))  1679  5  5  1  1.38 
3  3  ((A and B) → A)  1679  5  5  1  1.38 
4  4  ((A and B) → B)  1679  5  5  1  1.38 
5  6  (A → (A or B))  1678  5  5  1  1.38 
6  7  (A → (B or A))  1678  5  5  1  1.38 
7  10  ((not (not A)) → A)  1074  5  5  1  1.38 
8  1190  (not (A and (not A)))  1  5  37  5  2.06 
9  1210  (not ((not A) and A))  1  5  37  5  2.06 
10  2222  (A → (A and A))  2  5  67  9  2.32 
11  2230  (A or (B → B))  1  5  43  7  2.12 
12  2231  ((A → A) or B)  1  5  43  7  2.12 
13  2234  (A → (B → B))  3  5  43  7  2.12 
The data in columns of the table are:
i  ordinal number of derived theorem,  
T_{i}  theorem,  
d(T_{i})  number of times the theorem is derived during execution of the program,  
T_{i}  size of the theorem, defined as number of operators and variables,  
proof(T_{i})  size of the proof, defined as sum of the sizes of the theorems in proof, including derived theorem,  
proof(T_{i})  number of steps in proof of the theorem (1 for axioms) and  
c(T_{i})  coefficient defined with T_{i}^{c(Ti)} = proof(T_{i}). 
Coefficient c(T_{i}) is attempt for quantification of the intuition that interesting theorems are "deep" theorems, i.e. short theorems with long proofs.
For instance, the shortest theorem of propositional calculus, (A → A) is derived as 86. theorem; the proof is printed out in following form:
tm2. ((A → (B → C)) → ((A → B) → (A → C))), axiom
tm1. (A → (B → A)), axiom tm12. ((A → B) → (A → A)), M:tm2, m:tm1 tm86. (A → A), M:tm12, m:tm11 
Premise major and premise minor used in proving each steps are denoted with M and m respectively. Notation tm is used for internal representation of theorem in program; for this particular algorithm, indexes are same. Original output of the program contains respective unifications as well.
Some disadvantages of the Program 1. are obvious. Some of derived theorems are instances of the already derived theorems. For example, theorem 89., ((A → B) → (A → B)) is an instance of the theorem 86. The algorithm doesn't prefer short theorems, although it appears that shorter theorems are of greater interest, from perspective of human reader and, perhaps, from computational complexity perspective.
Fig 1. Size of the theorems derived by Program 1.
Described algorithm can be improved by rejection of instances and addition of preference for a shorter theorems, i.e. standard "best first" strategy.
The program is developed in newLISP and executed for nearly two weeks on few years old personal computer. Application of the combined rule is attempted 10 000 000 001 times. It failed in 51% of attempts because premise major wasn't implication. In 23% of attempts unification of antecendent in premise major and premise minor wasn't possible. In 26% of attempts unification succeeded and conclusion is developed. In 13% of attempts, conclusion was too long to be added to the theory in any reasonable time; it is unknown how many duplicates were among these. Nearly 13% attempts resulted in duplication of already derived theorems. Finally, 0.0065% attempts were successeful and 0.001% attempts resulted with 100 001 shortest theorems.
Following table contains the list of all theorems of the size (defined as number of operator and variable occurences) less or equal to five, derived by program, with some related data.
i  T_{i}  d(T_{i})  T_{i}  proof(T_{i})  proof(T_{i})  c(T_{i})  
1  13  (A → A)  431576  3  33  5  3.207534 
2  38384  ((not A) or A)  173362  4  468  57  4.651162 
3  38385  (A or (not A))  173362  4  416  51  4.516202 
4  1  (A → (B → A))  154928  5  5  1  1.37973 
5  3  ((A and B) → A)  150188  5  5  1  1.37973 
6  4  ((A and B) → B)  150188  5  5  1  1.37973 
7  6  (A → (A or B))  150630  5  5  1  1.37973 
8  7  (A → (B or A))  150630  5  5  1  1.37973 
9  10  ((not (not A)) → A)  149718  5  5  1  1.37973 
10  14  (A or (B → B))  210089  5  43  7  2.121747 
11  15  ((A → A) or B)  210089  5  43  7  2.121747 
12  16  (A → (B → B))  212851  5  43  7  2.121747 
13  50  (A → (A and A))  158696  5  67  9  2.318542 
14  80  ((A or A) → A)  158513  5  93  13  2.475692 
15  91  (not (not (A → A)))  204846  5  101  15  2.51689 
16  708  (not (A and (not A)))  156264  5  37  5  2.058924 
17  763  (not ((not A) and A))  156284  5  37  5  2.058924 
18  11142  (A → (not (not A)))  125517  5  214  27  2.92471 
For instance, T_{38384}, ((not A) or A) has the following proof, as printed out by program:
tm2. ((A → (B → C)) → ((A → B) → (A → C))), axiom tm1. (A → (B → A)), axiom tm11. (A → (B → (C → B))), M:tm1, m:tm1 tm61. ((A → B) → (A → (C → B))), M:tm2, m:tm11 tm10. ((not (not A)) → A), axiom tm56. (A → ((not (not B)) → B)), M:tm1, m:tm10 tm171. ((A → (not (not B))) → (A → B)), M:tm2, m:tm56 tm3769. ((A → (not (not B))) → (C → (A → B))), M:tm61, m:tm171 tm12. ((A → B) → (A → A)), M:tm2, m:tm1 tm63. (A → A), M:tm12, m:tm1 tm79. (((A → B) → A) → ((A → B) → B)), M:tm2, m:tm63 tm80. (A → (B → B)), M:tm1, m:tm63 tm20224. (((A → A) → B) → B), M:tm79, m:tm80 tm20448. (A → (((B → B) → C) → C)), M:tm1, m:tm20224 tm21078. ((A → ((B → B) → C)) → (A → C)), M:tm2, m:tm20448 tm9. ((A → B) → ((A → (not B)) → (not A))), axiom tm46593. (((not A) → A) → (not (not A))), M:tm21078, m:tm9 tm47470. (A → (((not B) → B) → B)), M:tm3769, m:tm46593 tm48685. ((A → ((not B) → B)) → (A → B)), M:tm2, m:tm47470 tm7. (A → (B or A)), axiom tm36. (A → (B → (C or B))), M:tm1, m:tm7 tm157. ((A → B) → (A → (C or B))), M:tm2, m:tm36 tm2963. ((not (not A)) → (B or A)), M:tm157, m:tm10 tm3583. (A → ((not (not B)) → (C or B))), M:tm1, m:tm2963 tm13238. ((A → (not (not B))) → (A → (C or B))), M:tm2, m:tm3583 tm58849. (((not (A or B)) → (not (not B))) → (A or B)), M:tm48685, m:tm13238 tm6. (A → (A or B)), axiom tm46. ((A → (not (A or B))) → (not A)), M:tm9, m:tm6 tm780. (A → ((B → (not (B or C))) → (not B))), M:tm1, m:tm46 tm23987. ((A → (B → (not (B or C)))) → (A → (not B))), M:tm2, m:tm780 tm203206. ((not (A or B)) → (not A)), M:tm23987, m:tm1 tm203408. ((not A) or A), M:tm58849, m:tm203206 
Notation tm stands for internal representation of theorems in program. For instance, T_{38384} is tm203408.
Fig 2. Size of the theorems derived by Program 2.
The graph shows that derived theorems are much shorter than those derived with Program 1. It is clearly visible which theorems are relatively short, yet they are derived relatively lately. Theorems T_{38384} and T_{38385}, for instance, are among these. Some short theorems are used in futher derivation of short theorems. It is seen on graph as dent in graph; for instance, between T_{38383} and T_{54482}. The width, depth and area of the dent suggest the importance of the theorem a posteriori. It appears that there is a correlation with size of proof as measure of the importance of theorem a priori.
Fig. 3. Size of the proofs of theorems derived by Program 2.
A few weaknesses of Program 2. are obvious. Although derived theorems are not instances of a previously derived theorems, human can recognize that many of these theorems are weak. For instance, theorems containing double negation; theorems of the form (F or G), where at least one of the formulas F and G is a theorem; theorems with subformulas of the form (F or F), (F and F), where F is any formula; thorems that could be obtained by application of commutativity and associativity of operations or and and and so forth.
The program should be extended to reject more trivial variations of already derived theorems, as humans do. The results obtained so far appear to be interesting enough to justify further experiments.