
This is a FAQ for common questions that arise when debugging the operation of a running Flume cluster.

How can I tell if data is arriving at the collector?

Data in HDFS doesn't "arrive" until the file is closed or certain size thresholds are met. To see progress, look at the status web page on the collector node on port 35862 (http://<collector>:35862). If data is arriving, the ...Collector.GunzipDecorator.UnbatchingDecorator.AckChecksumChecker.InsistentAppend.appendSuccesses metric should be incrementing.
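For example, you could poll the node report page from a shell (a rough sketch; this assumes the metric name appears verbatim in the page's HTML, and collector.example.com is a placeholder for your collector host):

    watch -n 5 'curl -s http://collector.example.com:35862/ | grep appendSuccess'

If the count grows between samples, events are reaching the collector.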

I am getting a lot of duplicated event data. Why is this happening and what can I do to make this go away?

tail/multiTail have been reported to restart file reads from the beginning of the file if the file's modification rate exceeds a certain threshold. This is a fundamental problem with a non-native implementation of tail. A workaround is to use the OS's tail mechanism in an exec source (exec("tail -n +0 -F filename")), as sketched below. Alternatively, many users have modified their applications to push to a Flume agent that exposes an open RPC port, such as syslogTcp, thriftSource, or avroSource.
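As a sketch of the exec workaround, the data flow configured at the master might look like this (hypothetical node name, log path, and collector host; agentE2ESink is just one choice of sink):

    agent1 : exec("tail -n +0 -F /var/log/app/app.log") | agentE2ESink("collector-host", 35853);

Because the OS-level tail tracks its own read position, a rapidly modified file is not re-read from the beginning.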

In E2E mode, agents will attempt to retransmit data if no acks are received after flume.agent.logdir.retransmit milliseconds have expired (this is a flume-site.xml property). Acks do not return until after the collector's roll time, flume.collector.roll.millis, expires (this can be set in the flume-site.xml file or as an argument to a collector). Make sure that the retry time on the agents is at least 2x that of the roll time on the collector; otherwise agents retransmit events whose acks are still in flight, producing duplicates. A sample configuration fragment follows.
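For example, a flume-site.xml fragment that honors the 2x rule might look like the following (the values are illustrative, assuming a 30-second collector roll):

    <property>
      <name>flume.collector.roll.millis</name>
      <value>30000</value>
    </property>
    <property>
      <name>flume.agent.logdir.retransmit</name>
      <value>60000</value>
    </property>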

If an agent in E2E mode goes down, it will attempt to recover on restart and resend any data for which it did not receive acknowledgements. This may result in some duplicates.

I have encountered a "Could not increment version counter" error message.

This is a ZooKeeper issue that seems related to virtual machines or machines that change IP address while running. It should only occur in a development environment; the workaround is to restart the master.
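For example, with the CDH service packaging the restart might look like this (a sketch; the service name and paths depend on how Flume was installed):

    sudo /etc/init.d/flume-master restart

For a tarball install, stop the running master process and start it again with bin/flume master.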

I have encountered an IllegalArgumentException related to checkArgument and EventImpl.

Here's an example stack trace:

2011-07-11 01:12:34,773 ERROR com.cloudera.flume.core.connector.DirectDriver: Driving src/sink failed! LazyOpenSource | LazyOpenDecorator because null
java.lang.IllegalArgumentException
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:75)
    at com.cloudera.flume.core.EventImpl.<init>(EventImpl.java:97)
    at com.cloudera.flume.core.EventImpl.<init>(EventImpl.java:87)
    at com.cloudera.flume.core.EventImpl.<init>(EventImpl.java:71)
    at com.cloudera.flume.handlers.syslog.SyslogWireExtractor.buildEvent(SyslogWireExtractor.java:120)
    at com.cloudera.flume.handlers.syslog.SyslogWireExtractor.extract(SyslogWireExtractor.java:192)
    at com.cloudera.flume.handlers.syslog.SyslogWireExtractor.extractEvent(SyslogWireExtractor.java:89)
    at com.cloudera.flume.handlers.syslog.SyslogUdpSource.next(SyslogUdpSource.java:88)
    at com.cloudera.flume.handlers.debug.LazyOpenSource.next(LazyOpenSource.java:57)
    at com.cloudera.flume.core.connector.DirectDriver$PumperThread.run(DirectDriver.java:89)

This indicates an attempt to create an event body larger than the maximum allowed body size (default 32KB). You can increase the maximum event size by setting flume.event.max.size.bytes in your flume-site.xml file to a larger value. We are addressing this with issue FLUME-712.
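For example, to raise the limit to 64KB, you could add the following to flume-site.xml (the value is illustrative):

    <property>
      <name>flume.event.max.size.bytes</name>
      <value>65536</value>
    </property>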
