James is working on a second generation inventory script.
Currently on: Restructuring writer and correcting table assumptions
Versions:
- gather.rb:0.91
- writer.rb:1.02
- logcopy:0.04
- inventoryv2.rb:0.05
Its plan is to be simpler and less ambitious than its predecessor, but still respect the sql table structure ("as much as possible").
There are 4 parts to this process:
- inventoryv2.rb: execs scripts on nodes using orbit framework
- gatherer.rb: collects information using only operating-system facilities (dmesg, lsusb, lspci, ifconfig, /sys, lshw).
- writer.rb: checks the mysql repository against the current state. If they differ, it updates the records.
- logcopy.rb: checks the log file for errors; if any are present, copies the logs to a specified location
The sql structure is a bit of a mess; the major tables of interest are:
- motherboards - List of things that can be connected to, has its own id used to tie other tables to it
- devices - List of devices "connected" to motherboards
- device_kinds - type identifier for connected devices (an attribute of a device).
- locations - Converts x,y,testbed_id coordinates to a single integer
- nodes - maps motherboard to locations, also has an id for said mapping
- inventories - records the start and stop time of the inventory pass.
- testbeds - gives a testbed id for the specific domain, thus disambiguating node1-1
A lot of the tables are full of unused columns. I guess we'll just ignore them for now. The crux of an update should be the following:
- examine our IP (and hostname) to determine our current location
- We gather information about the motherboard:
- Gatherer:
- Disk Size (dmesg)
- Memory Size (dmesg)
- Cpu number (dmesg)
- motherboard serial number (lshw)
- Gather information about attached devices:
- 2 wired Ethernet addresses (ifconfig, /sys)
- 2 wireless Ethernet addresses (ifconfig, /sys)
- any usb devices (lsusb, /sys)
- export to xml
- Writer:
- import xml output from gatherer
- collect identifiers from mysql based on gathered information (domain ⇒ testbed_id; x,y,testbed_id ⇒ loc_id; mb_ser ⇒ mb_id; loc_id ⇒ node_id)
- update motherboard information if different, and stamp with current inventory number
- add kinds if they don't exist already
- update devices if different and stamp with inventory number
- update node mb_id if the (loc, mb) pair doesn't match
- profit.
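The identifier chain in the writer steps above can be sketched as follows. This is only an illustration: the `db.get_first_value` helper and the column names (`node_domain`, `mb_serial`, `location_id`) are assumptions, not the real schema.

```ruby
# Hypothetical sketch of the lookup chain:
# domain => testbed_id; x,y,testbed_id => loc_id; mb_ser => mb_id; loc_id => node_id
def identify(db, domain, x, y, mb_ser)
  testbed_id = db.get_first_value("SELECT id FROM testbeds WHERE node_domain = '#{domain}'")
  loc_id     = db.get_first_value("SELECT id FROM locations WHERE x = #{x} AND y = #{y} AND testbed_id = #{testbed_id}")
  mb_id      = db.get_first_value("SELECT id FROM motherboards WHERE mb_serial = '#{mb_ser}'")
  node_id    = db.get_first_value("SELECT id FROM nodes WHERE location_id = #{loc_id}")
  { :testbed_id => testbed_id, :loc_id => loc_id, :mb_id => mb_id, :node_id => node_id }
end
```

Each lookup feeds the next, which is why a failed first lookup cascades down the whole chain.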
Required Tools / Libraries
- lsusb (usbutils.deb)
- lspci (native)
- dmesg (native)
- ifconfig (native)
- libxml-simple-ruby.deb
- libmysql-ruby.deb
- lshw (lshw.deb)
- logger (ruby standard)
- ftools (ruby standard)
Gatherer: The disk size and memory size are a quick scan of dmesg. The disk size matches, but the memory size is a little off. It probably has to do with the way dmesg reports memory vs. the way /sys reports it. It would be nice to find the /sys entry for consistency.
In /sys/devices/pci0000:00 are subdirectories corresponding to the specific Ethernet hardware. In each directory that corresponds to an Ethernet device there is a symbolic link named after the operating-system name of the device. This lets us match the pci address (the name of the subdirectory of /sys/devices/pci0000:00) to the mac address (from ifconfig). lspci can tell us the associated pci address and a hardware identifier string.
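The /sys matching above can be sketched like this. `pci_addr_from_link` is the pure part; `iface_to_pci` walks /sys/class/net and only works on a live Linux node, so treat it as an assumption-laden sketch rather than the gatherer's actual code.

```ruby
# The device link target ends in the pci address, e.g. .../0000:01:03.0,
# so the last path component is the pci bus address.
def pci_addr_from_link(target)
  target.split('/').last
end

# Walk /sys/class/net and map each OS interface name (eth0, ath0, ...)
# to its pci bus address via the 'device' symlink. Linux only.
def iface_to_pci
  map = {}
  Dir.glob('/sys/class/net/*') do |path|
    dev = File.join(path, 'device')
    next unless File.symlink?(dev)
    map[File.basename(path)] = pci_addr_from_link(File.readlink(dev))
  end
  map
end
```

The mac for each interface name then comes from ifconfig, completing the pci-address-to-mac match.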
lsusb, on the other hand, offers a direct correlation to the device_kinds table: the ordered pair of numbers xxxx:yyyy corresponds directly to the table's vendor and device ids, and the Bus xxx Device yyy number fits into the address column of the devices table.
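Pulling those fields out of an lsusb line is a one-regex job. A minimal sketch (the hash keys mirror the xml attributes used later, not necessarily the script's internals):

```ruby
# Matches lines like:
#   Bus 001 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
LSUSB_RE = /^Bus (\d{3}) Device (\d{3}): ID ([0-9a-f]{4}):([0-9a-f]{4})\s*(.*)$/

def parse_lsusb_line(line)
  m = LSUSB_RE.match(line)
  return nil unless m
  { 'bus_add' => "#{m[1]}:#{m[2]}",  # fits the devices table address column
    'vendor'  => m[3],               # vendor id (hex string)
    'device'  => m[4],               # device id (hex string)
    'str'     => m[5] }              # human-readable description
end
```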
9/29/09
I may have discovered the cause of the device / vendor discrepancy. Joe seems to be looking at /sys/class/net/devicename/device… perhaps this points to a different device id. I'll have to check it out.
That being said, I have a working Gatherer prototype:
ssugrim@external2:~/scripts$ ruby gatherer.rb
ssugrim@external2:~/scripts$ more /tmp/external2.xml
<external2>
  <ip_adds>
    <10.50.0.12 iface='eth1' host='external2.orbit-lab.org'/>
    <127.0.0.1 iface='' host=''/>
  </ip_adds>
  <motherboard mem_size='1048512' disk_size='156301488' cpu_num='4'/>
  <Devices>
    <pci>
      <eth0 device='1229' bus_add='01:03.0' mac='00:e0:81:26:70:16' str='Ethernet controller: Intel Corporation 82546EB Gigabit Ethernet Controller (Copper) (rev 01)' vendor='8086'/>
      <eth1 device='1010' bus_add='04:01.0' mac='00:e0:81:26:76:9c' str='Ethernet controller: Intel Corporation 82546EB Gigabit Ethernet Controller (Copper) (rev 01)' vendor='8086'/>
      <eth2 device='1010' bus_add='04:01.1' mac='00:e0:81:26:76:9d' str='Ethernet controller: Intel Corporation 82546EB Gigabit Ethernet Controller (Copper) (rev 01)' vendor='8086'/>
    </pci>
    <usb>
      <0 device='0001' bus_add='001:001' str='Linux Foundation 1.1 root hub' vendor='1d6b'/>
    </usb>
  </Devices>
</external2>
ssugrim@external2:~/scripts$
10/2/09
Minus error checking for failed commands, the gatherer is complete. I'm now moving onto writer. I'm going to keep them in the same script for now, so I don't have to deal with reimporting the data and extracting it from xml, at some point that'll be a todo, so that way we can call just the gatherer if we want to.
For now, I need to determine which node I am based on the resolved host name. The scheme is nodex-y.testbedname; I can extract the x and y coordinates from the node part, and then the testbed name will have to be a lookup. (This should probably be in gatherer as parameters.)
Once I have that I can look up my unique mysql id from the mysql database. This id will then allow me to correlate devices with the ones I have.
Following the instructions on http://support.tigertech.net/mysql-duplicate
I copied the mysql database from inventory1 to inventory2.
One Caveat is noted on http://forums.digitalpoint.com/showthread.php?t=259486
At the top of the database file you are trying to dump you will see: CREATE DATABASE `gunit_pimpjojo` DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci; Just remove this line from the dump (in Notepad or wherever you have the dump) and re-paste the file. You just need to remove that line and you will be good to go.
10/5/09
I've revamped the internal data types to accommodate the way xml-simple exports and re-imports. Any multi-instance results (eth, usb, ip) return an array of hashes. This creates clean xml. I also unfolded the coords hash into single instance variables; they all get wrapped up into a single attribute.
The new xml format looks like so.
<opt x="1" y="1" disk_size="156301488" domain="sb7" cpu_num="1" mem_size="491456">
  <pci device="0013" name="ath0" bus_add="00:09.0" mac="00:60:b3:ac:2b:92" str="Ethernet controller: Atheros Communications, Inc. AR5212/AR5213 Multiprotocol MAC/baseband processor (rev 01)" vendor="168c" />
  <pci device="0013" name="ath1" bus_add="00:0a.0" mac="00:60:b3:ac:2b:66" str="Ethernet controller: Atheros Communications, Inc. AR5212/AR5213 Multiprotocol MAC/baseband processor (rev 01)" vendor="168c" />
  <pci device="4320" name="eth0" bus_add="00:0b.0" mac="00:0f:ea:4a:8b:56" str="Ethernet controller: Marvell Technology Group Ltd. 88E8001 Gigabit Ethernet Controller (rev 13)" vendor="11ab" />
  <pci device="4320" name="eth1" bus_add="00:0c.0" mac="00:0f:ea:4a:8b:57" str="Ethernet controller: Marvell Technology Group Ltd. 88E8001 Gigabit Ethernet Controller (rev 13)" vendor="11ab" />
  <usb device="6001" name="usb" bus_add="001:009" str="Future Technology Devices International, Ltd FT232 USB-Serial (UART) IC" vendor="0403" />
  <ip ip="10.17.1.1" host="node1-1.sb7.orbit-lab.org" iface="eth1" />
  <ip ip="127.0.0.1" host="" iface="" />
</opt>
I've also gone to the original two script model. Gatherer is "feature complete".
Working on the writer, I've created an internal data type called Xmldata; it's got exactly the same fields as Info, but populates them from the generated xml file.
Working on the mysql part, I have to examine the code that lives in
ssugrim@internal1:/opt/gridservices2-2.2.0/lib/ogs_inventory$
NOTE: mysql query strings should be crafted prior to the actual execution of the query, since they don't always form the way you think they do. Also the %&string& formulation of strings is very helpful in getting the quotes correct.
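For example, building the string first with a %&...& literal means the single quotes around values need no escaping, and the finished sql can be logged before it runs. The table and column names here are illustrative, not the real schema:

```ruby
# Craft the query string up front; %&...& interpolates like a double-quoted
# string but lets us use ' and " freely inside.
def device_query(mac, mb_id)
  %&SELECT * FROM devices WHERE address = '#{mac}' AND motherboard_id = #{mb_id}&
end

sql = device_query("00:e0:81:26:70:16", 3)
# sql can now be inspected/logged before being handed to the mysql client
```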
10/08/09
The writer script is now equipped with two classes, Xmldata and Identify. Both can only be instantiated by the create command, making them singletons (create will only run new if an instance does not already exist). Identify instantiates an Xmldata class, and then uses the x and y coordinates and the domain to determine a location id (the global unique identifier that ties all the tables together). I also get the max id from the inventories table, assuming that the highest number is the latest.
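The create-only instantiation described above can be sketched like this (the class body is illustrative; only the create/new arrangement reflects the note):

```ruby
class Identify
  class << self
    # create only runs new if an instance does not already exist
    def create(*args)
      @instance ||= new(*args)
    end
    # hide new so callers must go through create
    private :new
  end

  def initialize(xmlfile = nil)
    @xmlfile = xmlfile
  end
end
```

A second call to create returns the first instance unchanged, and calling new directly raises NoMethodError.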
10/12/09
Quick edit to gatherer to convert the device and vendor tags to decimal instead of hex. The reason they didn't match before is that the sql database stores them as decimal (I guess because you can't store hex in mysql).
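The conversion itself is one call, since the lsusb/lspci ids are hex strings:

```ruby
# Convert a hex id string (as reported by lsusb/lspci) to the decimal
# integer form the sql tables store.
def id_to_dec(hex_str)
  hex_str.to_i(16)
end
```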
10/18/09
Writer is "feature complete". The main (non-data) class is Check_sql. Besides new, its main methods are check and update. They respectively compare the xmldata against sql and update the database if the data doesn't match. I'd like to be more "independent" of the form of the xmldata, but that would involve a lot more dummy variables and searching of hashes.
Big TODO is mostly rescuing errors. First on the list is connect retries. Class interface descriptions to follow soon.
10/20/09
Modified both gatherer and writer to take parameters. The parameters are as follows:
Writer:
- --server = server hostname (default: internal1.orbit-lab.org)
- --user = username for mysql
- --pass = password
- --db = database name (default: inventory2)
- --input = input file name (default: foo.xml)
Gatherer:
- --output = name of output file (default: stdout)
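A minimal sketch of the writer's parameter handling, using stdlib optparse (which option library the real script uses is not stated, so this is an assumption; the defaults are the ones listed above):

```ruby
require 'optparse'

def parse_writer_opts(argv)
  # defaults per the parameter list
  opts = { 'server' => 'internal1.orbit-lab.org', 'db' => 'inventory2',
           'input' => 'foo.xml', 'user' => nil, 'pass' => nil }
  OptionParser.new do |o|
    o.on('--server SERVER') { |v| opts['server'] = v }
    o.on('--user USER')     { |v| opts['user']   = v }
    o.on('--pass PASS')     { |v| opts['pass']   = v }
    o.on('--db DB')         { |v| opts['db']     = v }
    o.on('--input FILE')    { |v| opts['input']  = v }
  end.parse!(argv)
  opts
end
```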
Also now writer only checks vendor and device id. If no match is found it will add it with the description string.
10/26/09
Modifying gatherer to use lshw to collect the uuid (motherboard serial number); also changing the internal data types to more closely match the table contents, e.g. devices and motherboards.
Originally I thought to just use lshw to gather most of the information, but this doesn't really gain us anything, since I would have to resort to the other tools (lsusb and lspci) to find the relevant entries in the lshw output (and it would require a huge rewrite of the gatherer). Lshw can output in a few different ways. I'm currently using the direct line-by-line approach to search for the uuid. I did however experiment with the -xml output. When imported with XmlSimple.xml_in(), we get a massive hash/array structure that contains all the data elements as either a value in a hash or an item in an array. To find what we're looking for, we need a recursive tool to extract the relevant data structures. An example was written, titled lshw_recursion_example.rb; the main recursive tool keeps calling the each method of every sub-element (hashes and arrays both have an each method, but they behave differently, so a check for class is needed first).
One snag that was "hacked" around: if we find the keyword in an array and push the containing data type, all we get is an array with that keyword. I resolved this by passing the previous data structure as a parameter: if the keyword is found in a hash I store the hash; if it's found in an array, I store the previous hash. I opted to hunt for a list of words instead of a single one. It's more efficient than iterating across the entire structure multiple times, once per word: we don't iterate through the words for every element, just the ones that are the termination of a branch. This saves a lot of computation and prevents a few type errors. It's assumed that the word list is significantly smaller than the size of the hash. Some example code:
found = Hash.new

def hunt(dat, words, found, prev = nil)
  # check the type
  if dat.kind_of?(Array)
    dat.each do |v|
      # iterate over the current level and check for an instance of the words
      words.each { |w| found.store(w, prev) if /#{w}/.match(v) } if v.kind_of?(String)
      # recursively call the function on the children of this data structure;
      # note the parent is passed as a parameter, as the array branch needs to
      # store the container
      hunt(v, words, found, dat)
    end
  elsif dat.kind_of?(Hash)
    dat.each do |k, v|
      # same deal as the array, except we have a key,value pair and can store
      # the current data structure. We still pass the parent as a parameter
      # since we don't know what type the child is.
      words.each { |w| found.store(w, dat) if /#{w}/.match(v) } if v.kind_of?(String)
      hunt(v, words, found, dat)
    end
  end
end
11/4/09
I'll need to revisit the use of recursion for lshw. I have some working ideas on how to do it. Ivan suggests multi-tier iteration, where I hunt for keywords following some kind of "path of keywords". Using "hunt" multiple times with a sequence of keywords (examining keys as well as values), we should be able to iteratively extract smaller and smaller data structures that contain more relevant information.
More immediately, there are changes that need to be made to writer to reflect the table structure in mysql. They are:
- Need to get motherboard id from the table, matching against serial number
- Update node to correlate motherboard to location (when they move)
- motherboard updates should only modify disk and memory (the motherboard id should not change)
- If a motherboard is not found then we insert it.
- should get node id from the sql table, matching against location
11/17/09
Modifications on writer have been completed (preliminary checks worked).
- reverted Db.hupdate to only update. The calling functions should decide whether to insert or update.
- Mb_id now checks against serial instead of location in the Identify class
- update_mb now checks for mb_id. If the ID is present it will update the record; otherwise it will insert WITHOUT specifying an ID, since SQL should autoincrement the ids
- Nodes are uniquely identified by a triple of (node_id, location_id, motherboard_id). It's assumed that the (node_id, location_id) portion is constant. Thus the only change/update we should check for and perform is to ensure that the motherboard id matches for a given (node_id, location_id) pair. The update_node function only modifies motherboard_id's.
Things that need to be done:
- move all the "checks" into the get methods (deprecate the get methods). check() should simply call the sub-check methods and retain a hash of matches for specific data structures (table entries).
- update can then iterate over that hash and call the respective update function for each given table.
- to that end, the update_device method needs to be split in two to reflect the data structures
- the data-structure design paradigm should be one data structure for each table that needs to be checked / updated. It's pretty close to this, but needs a little work.
11/23/09
The previous list was completed. There were two bugs that needed tweaking:
- Update node did not update the inventory number when it updated a node's info.
- Had to add a hack to prevent unknown dev_ids from getting double-entered in update_adds when the id is unknown. If the node has multiple instances of a piece of unknown hardware (like 2 new Ethernet cards), the current routine will double-add them.
- this hack should be revisited for efficiency; currently it double-checks for a kind (in case one was added after the adds_array was populated). This is very wasteful, as a missing kind should be a rare event. I should probably switch to a different function or something if I've entered the rare "never seen it before" scenario.
11/24/09
Fixed a few bugs from previous edits:
- I added a kind check/update that precedes the device update, so that kinds are always populated before devices
- each update now calls check (except for devices, as they're last). Update mb also repopulates the mb_id.
- the mb_id information was moved from Identify to check_sql since it's dynamic and properly belongs there. Identify no longer has a mb_id method.
5/19/2010
Adding Error handling and logging to the writer script:
- the require 'logger' + code to actually log Info and errors is now in place
- For the Db.connect method I've added a begin/rescue/end block. It sleeps for 60 + rand(60) seconds, then tries to connect again, but I'll have to fine-tune its behavior to only reconnect when it can't reach the server, instead of reconnecting every time. It also logs how long it's going to wait until it tries again.
- Jack is stripping things for gatherer but we'll eventually merge that code.
- there are 5 places where I need to try and catch exceptions; most of the other exceptions I want to go unhandled so the script terminates. I'll need to put in a bunch of debug logging to make sure my data is what I think it is.
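The Db.connect retry described above can be sketched like this. The `wait_base` parameter is purely a test hook (the real code uses 60 + rand(60)), and the block stands in for the actual mysql connect call:

```ruby
require 'logger'

# Retry a connect block on failure, sleeping wait_base + rand(wait_base + 1)
# seconds between attempts, and logging how long we'll wait. Re-raises once
# max_tries is exhausted so the failure propagates to the main block.
def connect_with_retry(log, max_tries = 3, wait_base = 60)
  tries = 0
  begin
    tries += 1
    yield
  rescue => e
    wait = wait_base + rand(wait_base + 1)
    log.error("Db.connect - #{e.message}, retrying in #{wait}s")
    sleep(wait)
    retry if tries < max_tries
    raise
  end
end
```

This still reconnects on every kind of failure; narrowing the rescue to connection errors only is the fine-tuning mentioned above.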
5/22/2010
I've put a bunch of logging and error-handling code into writer version 0.99. It should now log appropriately. To generate a bunch of meaningful results I've taken these steps:
- Writer now logs to /tmp/writer.log
- Wrote a new script called logcopy.rb
- It checks the log for errors
- if errors are found, I mount the tmp directory on repo1 using
repository1:/export/orbit/image/tmp /mnt as my mount
- In the tmp directory is a new directory called logs
- the /tmp/writer.log file is copied to this logs directory and stamped with the name of the node it came from
- I've created a new image called inventoryV2.ndz which has all the updated scripts (writer and logcopy)
- I've modified the inventoryV2.rb script to call logcopy as the final step. It's now named inventoryV2-1.rb
6/24/2010
I forgot to log a bunch of changes:
- found a bug caused by reducing the wait time for the experiment to complete. Since the connect retry was rand(60)+(60), my log copies would copy over incomplete files, since it waited only 30 seconds to copy. I've redone the numbers: I now wait only rand(20) to attempt to reconnect, but try 3 times, and I wait 90 seconds before trying to copy the log file. This should give me enough time to capture the retries.
- There was a cascade of query failures due to the fact that I was searching for the testbed id with the short domain name instead of the FQDN in the testbeds table. This value was given to me by the gatherer; I've since modified the gatherer to use the FQDN. All the other queries depended on this number, so the error propagated down.
- This error demonstrated a specific flaw in how I handled empty query results: I indicate failed queries by returning nil instead of an array. The only reason I caught the error was that I tried to flatten the nil return. I've updated this to raise an exception if a query returns nil for any of the members of the Identify class. Not being able to identify the node should be a fatal error; this exception is unhandled, so it propagates up to the main block, where it gets logged and terminates the script.
- Also added a little bit of logging to gatherer, but not much. I should really fix its error handling.
- Made Check_sql.check_in a public method and had it called last in the MAIN.
- Noticed that sometimes I get an xml create-object error. I'll have to figure out why that's happening; it's probably due to gatherer not completing properly, but now I should be able to find the nodes where it happens.
- Trying to stick to the logging/exception-raising convention of setting the error text to: "class.method - error"
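The fail-fast convention for nil query results can be sketched as below. The method and message names are illustrative, but the error text follows the "class.method - error" convention noted above; the raise is left unhandled so it reaches the main block's logging:

```ruby
# Guard a query result: nil means the query matched nothing, which makes the
# node unidentifiable -- a fatal error, so raise rather than return nil.
def require_rows(rows, context)
  raise "#{context} - query returned nil, cannot identify node" if rows.nil?
  rows
end
```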
6/25/2010
We're going live! I'll actively update inventory52.
Few minor tweaks:
- Edited logcopy to wait only 30 seconds before unmounting
- set default to be inventory52, since this is now going to be our main table.
Since I now check in last and call checkin externally, I should include the number of changed lines in my checkin; that would be helpful for diagnostics.
6/29/2010
After some thought, I realized that writer should call logcopy as its last action. This ensures that logcopy copies a complete file; it avoids a timing problem where the inventory script would have to guess a reasonable time for writer to complete. Logcopy is assured a properly closed file, since writer controls when logcopy is called. I could have put this in an at_exit block, but I just left it in the MAIN ensure block. I used the system call:
system("/usr/bin/ruby /root/logcopy.rb")
Note the explicit paths: the OMF exec subshells don't understand relative paths. I could have used exec, but it replaces the current process with the one to be exec'd. While this could have worked, it would have prematurely terminated writer without closing out all the objects. That probably shouldn't matter, but it's not neat. While checking on the execution, I noted that at the point where writer invokes logcopy, the 3 main competing strategies are system, exec, and %x[], where the last one is very similar to the backticks command. I guess there are also IO.popen and Open3. System is good enough for these purposes, since I only care about its execution, not its output.
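The trade-off between the three can be shown with harmless stand-in commands (the real call is system("/usr/bin/ruby /root/logcopy.rb")):

```ruby
# system: runs in a subshell, returns true/false; the caller keeps running,
# which is why writer uses it to invoke logcopy.
ok  = system("true")

# %x[]: backtick-style, captures the child's stdout as a string.
out = %x[echo logcopy done]

# exec("...") would replace the current process and never return, so writer's
# objects would not be closed out -- hence system over exec here.
```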
I've created a new inventory image and inventory script to reflect this change: james_inv_2.ndz and inventory2-2.rb are a testing pair. They'll replace the last known good pair: james_inv.ndz and inventory2-1.rb
TODO: Have logcopy do a sanity check for files and replace them