In Bayes.pm, there's a list of common words to skip during tokenization, a.k.a. the "stop-list". Use something like Regexp::Trie to generate a single optimized regexp that catches them all, e.g.:

perl -MRegexp::Trie -nle 'BEGIN { $r = Regexp::Trie->new } $r->add($_); END { print $r->regexp }'
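
As a rough sketch of how such a regexp might be used at tokenization time (this is not the Bayes.pm code; build_stop_re and is_stop_word are hypothetical names, and only a sample of the list is shown), the pattern can be built once and anchored so that it matches whole tokens only. Regexp::Trie is a CPAN module, so the sketch falls back to a plain alternation when it isn't installed:

```perl
use strict;
use warnings;

# Sample of the stop-list; the full list is longer.
my @stop_words = qw(able all already and any are because both can come);

# Build one pattern for the whole list (hypothetical helper).
sub build_stop_re {
    my @words = @_;
    my $inner;
    if (eval { require Regexp::Trie; 1 }) {
        # Regexp::Trie merges common prefixes into a single compact pattern.
        my $trie = Regexp::Trie->new;
        $trie->add($_) for @words;
        $inner = $trie->regexp;
    }
    else {
        # Fallback: plain alternation of the escaped words.
        my $alt = join '|', map { quotemeta } @words;
        $inner = qr/(?:$alt)/;
    }
    # Anchor at both ends so only whole tokens match, not substrings.
    return qr/\A$inner\z/;
}

my $stop_re = build_stop_re(@stop_words);

# True if the token is on the stop-list (hypothetical helper).
sub is_stop_word {
    my ($token) = @_;
    return $token =~ $stop_re ? 1 : 0;
}

for my $token (qw(and android because viagra)) {
    printf "%-8s %s\n", $token, is_stop_word($token) ? 'skip' : 'keep';
}
```

The anchors matter: without them, a token such as "android" would match because it contains "and".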

Original list:

able
all
already
and
any
are
because
both
can
come
each
email
even
few
first
for
from
give
has
have
http
information
into
it's
just
know
like
long
look
made
mail
mailing
mailto
make
many
more
most
much
need
not
now
number
off
one
only
out
own
people
place
right
same
see
such
that
the
this
through
time
using
web
where
why
with
without
work
world
year
years
you
you're
your


If it's any use, a listing of the top 500 most common English words is available at: http://www.world-english.org/english500.htm

(words with length < 3 removed)

able
about
above
act
add
after
again
against
age
ago
air
all
also
always
among
and
animal
answer
any
appear
are
area
ask
back
base
beauty
bed
been
before
began
begin
behind
best
better
between
big
bird
black
blue
boat
body
book
both
box
boy
bring
brought
build
busy
but
call
came
can
car
care
carry
cause
center
certain
change
check
children
city
class
clear
close
cold
color
come
common
complete
contain
correct
could
country
course
cover
cross
cry
cut
dark
day
decide
deep
develop
did
differ
direct
does
dog
don't
done
door
down
draw
drive
dry
during
each
early
earth
ease
east
eat
end
enough
even
ever
every
example
eye
face
fact
fall
family
far
farm
fast
father
feel
feet
few
field
figure
fill
final
find
fine
fire
first
fish
five
fly
follow
food
foot
for
force
form
found
four
free
friend
from
front
full
game
gave
get
girl
give
gold
good
got
govern
great
green
ground
group
grow
had
half
hand
happen
hard
has
have
head
hear
heard
heat
help
her
here
high
him
his
hold
home
horse
hot
hour
house
how
hundred
idea
inch
interest
island
just
keep
kind
king
knew
know
land
language
large
last
late
laugh
lay
lead
learn
leave
left
less
let
letter
life
light
like
line
list
listen
little
live
long
look
lot
love
low
machine
made
main
make
man
many
map
mark
may
mean
measure
men
might
mile
mind
minute
miss
money
moon
more
morning
most
mother
mountain
move
much
music
must
name
near
need
never
new
next
night
north
note
nothing
notice
noun
now
number
numeral
object
off
often
old
once
one
only
open
order
other
our
out
over
own
page
paper
part
pass
pattern
people
perhaps
person
picture
piece
place
plain
plan
plane
plant
play
point
port
pose
possible
pound
power
press
problem
produce
product
pull
put
question
quick
rain
ran
reach
read
ready
real
record
red
remember
rest
right
river
road
rock
room
round
rule
run
said
same
saw
say
school
science
sea
second
see
seem
self
sentence
serve
set
several
shape
she
ship
short
should
show
side
simple
since
sing
sit
six
size
sleep
slow
small
snow
some
song
soon
sound
south
special
spell
stand
star
start
state
stay
step
still
stood
stop
story
street
strong
study
such
sun
sure
surface
table
tail
take
talk
teach
tell
ten
test
than
that
the
their
them
then
there
these
they
thing
think
this
those
though
thought
thousand
three
through
time
together
told
too
took
top
toward
town
travel
tree
true
try
turn
two
under
unit
until
use
usual
very
voice
vowel
wait
walk
want
war
warm
was
watch
water
way
week
weight
well
went
were
west
what
wheel
when
where
which
while
white
who
whole
why
will
wind
with
wonder
wood
word
work
world
would
write
year
yes
yet
you
young
your

History

(this part by jm, who selected the original set)

The original selection of words in the stop-list was based on words that, after learning a "typical" spam/nonspam corpus, had a very large number of hits and a Bayes score of around 0.4 to 0.6. In other words, they were both (a) very common and (b) likely to be ignored by the Bayes code anyway, since they were always going to fall within the $MIN_PROB_STRENGTH range. They weren't chosen just as "common" words.
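
To illustrate the point about $MIN_PROB_STRENGTH, here is a minimal sketch of that kind of cutoff. The threshold value and the is_significant helper are assumptions for illustration, not copied from Bayes.pm; check the source for the real constant and comparison.

```perl
use strict;
use warnings;

# Assumed value, for illustration only; see Bayes.pm for the real constant.
use constant MIN_PROB_STRENGTH => 0.346;

# A token only counts when its probability is far enough from the
# neutral value of 0.5 (hypothetical helper).
sub is_significant {
    my ($prob) = @_;
    return abs($prob - 0.5) >= MIN_PROB_STRENGTH ? 1 : 0;
}

printf "%.2f -> %s\n", $_, is_significant($_) ? 'used' : 'ignored'
    for 0.40, 0.50, 0.60, 0.95;
```

Under that assumption, any token scoring between 0.4 and 0.6 is dropped whether or not it is on the stop-list, so the list mainly saves the work of tokenizing, storing and looking up those words.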

However, it's arguable that the stop-list assumes that everyone speaks English. In some non-English-language countries, a nonspam corpus may contain no English terms while the spam corpus is mostly English, in which case those stop-words would actually make good spam signs.

Hence, I don't think it's a good idea to extend the stop-list with additional "common" English words.
